Removal of Duplicate URL using Multiple Sequence of Alignment Method

Publication Date : 22/01/2017

DOI : 10.21884/IJMTER.2017.4018.78XVN

Author(s) :

Shraddha Sarode, B. S. Chordia

Volume/Issue :
Volume 4
Issue 1
(01 - 2017)

Abstract :

The World Wide Web is the largest repository of information today, and it has become a primary means of generating and locating information. A problem faced by web crawlers is that the web contains a large number of URLs that correspond to pages with duplicate or near-duplicate content. Such URLs are called DUST (different URLs with similar text). DUST is an important problem for search engines, because crawling these redundant URLs wastes time and resources such as bandwidth and disk storage, and leads to low-quality ranking and a poor user experience. To cope with this problem, a number of methods remove duplicate documents without fetching their content. To accomplish this task, many methods use normalization rules that transform all duplicate URLs into the same canonical form. An important challenge is to derive rules that are both more general and more precise. This paper uses a framework called DUSTER, which derives high-quality rules by applying multiple sequence alignment. Using multiple sequence alignment to remove duplicate URLs helps derive effective rules.
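As a rough illustration of the idea of aligning duplicate URLs to find what they share, the following Python sketch tokenizes two URLs and aligns them pairwise. It is not the DUSTER framework: it uses the standard library's `difflib.SequenceMatcher` as a simple stand-in for multiple sequence alignment, and the function names `tokenize` and `derive_pattern` and the example URLs are hypothetical. The output is only a candidate matching pattern, not a full normalization rule.

```python
import re
from difflib import SequenceMatcher


def tokenize(url):
    # Split a URL into tokens, keeping each delimiter as its own token.
    return [t for t in re.split(r"([/:.?=&_-])", url) if t]


def derive_pattern(url_a, url_b):
    # Assumption: url_a and url_b are already known to point to duplicate content.
    # Align the two token sequences pairwise (a stand-in for true multiple
    # sequence alignment) and generalize every region where they differ.
    a, b = tokenize(url_a), tokenize(url_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    parts = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            # Tokens shared by both URLs are kept as literal text.
            parts.append("".join(re.escape(t) for t in a[i1:i2]))
        else:
            # Differing tokens (replace, insert, delete) become a wildcard.
            parts.append(".*")
    return "".join(parts)


# Hypothetical pair of URLs assumed to serve the same page.
pattern = derive_pattern(
    "http://example.com/story?id=42",
    "http://example.com/story_42",
)
print(pattern)
```

A pattern derived this way matches both input URLs. A real system would align many URLs per duplicate cluster and validate each candidate rule against held-out data before using it, since a wildcard this broad can also match URLs that are not duplicates.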

January 16, 2017