Detecting Partial and Near Duplication in the Blogosphere
Yeo-Chan YOON Myung-Gil JANG Hyun-Ki KIM So-Young PARK
IEICE TRANSACTIONS on Information and Systems
Publication Date: 2012/02/01
Online ISSN: 1745-1361
Print ISSN: 0916-8532
Type of Manuscript: LETTER
Category: Data Engineering, Web Information Systems
Keywords: duplicate detection, sentence fingerprint, information retrieval, blogs
In this paper, we propose a duplicate document detection model that recognizes both partial duplicates and near duplicates. The model detects partial duplicates as well as exact duplicates by splitting a large document into many small sentence fingerprints. It also detects near duplicates, which result from trivial revisions, by filtering out common words and reordering the word sequence.
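
The idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the common-word list, the punctuation-based sentence splitter, the alphabetical sort used as the reordering step, the MD5 hash, and the `containment` score are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import hashlib
import re

# Hypothetical common-word list; the abstract does not say which words are filtered.
COMMON_WORDS = {
    "a", "an", "the", "of", "in", "on", "to", "and", "or",
    "is", "are", "was", "were", "it", "this", "that",
}


def split_sentences(text):
    """Split text into sentences on '.', '!' and '?' (a deliberately naive splitter)."""
    parts = re.split(r"[.!?]+", text)
    return [p.strip() for p in parts if p.strip()]


def sentence_fingerprint(sentence):
    """Return a hash of the sentence after filtering common words and sorting the rest.

    Sorting is one possible way to "reorder the word sequence" so that
    sentences differing only in word order map to the same fingerprint.
    Returns None when no content words remain.
    """
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    content = sorted(w for w in words if w not in COMMON_WORDS)
    if not content:
        return None
    return hashlib.md5(" ".join(content).encode("utf-8")).hexdigest()


def document_fingerprints(text):
    """Return the set of sentence fingerprints for a whole document."""
    fingerprints = set()
    for sentence in split_sentences(text):
        fp = sentence_fingerprint(sentence)
        if fp is not None:
            fingerprints.add(fp)
    return fingerprints


def containment(candidate, source):
    """Fraction of the candidate's sentence fingerprints that also occur in the source."""
    cand = document_fingerprints(candidate)
    if not cand:
        return 0.0
    return len(cand & document_fingerprints(source)) / len(cand)


original = (
    "The quick brown fox jumps over the lazy dog. "
    "Blogs are copied often. Search engines dislike duplicates."
)
revised = (
    "Over the lazy dog the quick brown fox jumps. "
    "A completely new sentence here."
)
score = containment(revised, original)
```

Because fingerprints are computed per sentence, a post that copies only one sentence of a longer document still shares a fingerprint with it, which is how partial duplication shows up in this sketch. Filtering and sorting make the first sentence of `revised` hash to the same value as the first sentence of `original` even though the word order differs.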