Detecting Partial and Near Duplication in the Blogosphere

Yeo-Chan YOON, Myung-Gil JANG, Hyun-Ki KIM, So-Young PARK

Publication
IEICE TRANSACTIONS on Information and Systems   Vol.E95-D   No.2   pp.681-685
Publication Date: 2012/02/01
Online ISSN: 1745-1361
DOI: 10.1587/transinf.E95.D.681
Print ISSN: 0916-8532
Type of Manuscript: LETTER
Category: Data Engineering, Web Information Systems
Keyword: duplicate detection, sentence fingerprint, information retrieval, blogs

Summary: 
In this paper, we propose a duplicate document detection model that recognizes both partial duplicates and near duplicates. The model detects partial duplicates as well as exact duplicates by splitting a large document into many small sentence fingerprints. It also detects near duplicates, which result from trivial revisions, by filtering out common words and reordering the word sequence.
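
The idea in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word list `COMMON_WORDS`, the choice of alphabetical sorting as the "reordering" step, the SHA-1 hash, and the `containment` score are all assumptions made for illustration, since the summary does not specify them.

```python
import hashlib
import re

# Hypothetical common-word list; the summary does not say which words are filtered.
COMMON_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "on"}


def sentence_fingerprints(text):
    """Return a set holding one fingerprint per non-empty sentence of text."""
    fingerprints = set()
    # Naive sentence split on ., ! and ? (an assumption; real blog text needs more care).
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z0-9]+", sentence.lower())
        # Drop common words, then sort so that word order no longer matters.
        # Sorting is one possible reading of "reordering the word sequence".
        content = sorted(w for w in words if w not in COMMON_WORDS)
        if content:
            key = " ".join(content)
            fingerprints.add(hashlib.sha1(key.encode("utf-8")).hexdigest())
    return fingerprints


def containment(source, candidate):
    """Fraction of the source's sentence fingerprints that also occur in candidate."""
    src = sentence_fingerprints(source)
    if not src:
        return 0.0
    return len(src & sentence_fingerprints(candidate)) / len(src)
```

Under these assumptions, a post that copies every sentence of another post, even with shuffled words, changed function words, or extra sentences of its own, yields a containment of 1.0 for the copied post, while the reverse direction yields a lower score. That asymmetry is what allows partial duplication to be flagged in this sketch.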