
For FullText PDF, please login, if you are a member of IEICE,
or go to Pay Per View on menu list, if you are a nonmember of IEICE.

Improvements of HITS Algorithms for Spam Links
Yasuhito ASANO Yu TEZUKA Takao NISHIZEKI
Publication
IEICE TRANSACTIONS on Information and Systems
Vol.E91D
No.2
pp.200208 Publication Date: 2008/02/01 Online ISSN: 17451361
DOI: 10.1093/ietisy/e91d.2.200 Print ISSN: 09168532 Type of Manuscript: Special Section PAPER (Special Section on Foundations of Computer Science) Category: Scoring Algorithms Keyword: scoring algorithm, Web pages, HITS, BHITS, PageRank, search engine, Web graph, spam page, spam links,
Full Text: PDF(215.3KB)>>
Summary:
The HITS algorithm proposed by Kleinberg is one of the representative methods of scoring Web pages by using hyperlinks. In the days when the algorithm was proposed, most of the pages given high score by the algorithm were really related to a given topic, and hence the algorithm could be used to find related pages. However, the algorithm and the variants including Bharat's improved HITS, abbreviated to BHITS, proposed by Bharat and Henzinger cannot be used to find related pages any more on today's Web, due to an increase of spam links. In this paper, we first propose three methods to find "linkfarms," that is, sets of spam links forming a densely connected subgraph of a Web graph. We then present an algorithm, called a trustscore algorithm, to give high scores to pages which are not spam pages with a high probability. Combining the three methods and the trustscore algorithm with BHITS, we obtain several variants of the HITS algorithm. We ascertain by experiments that one of them, named TaN+BHITS using the trustscore algorithm and the method of finding linkfarms by employing name servers, is most suitable for finding related pages on today's Web. Our algorithms take time and memory no more than those required by the original HITS algorithm, and can be executed on a PC with a small amount of main memory.

