Detecting Partial and Near Duplication in the Blogosphere
Yeo-Chan YOON, Myung-Gil JANG, Hyun-Ki KIM, So-Young PARK
Publication
IEICE TRANSACTIONS on Information and Systems
Vol.E95-D
No.2
pp.681-685
Publication Date: 2012/02/01
Online ISSN: 1745-1361
DOI: 10.1587/transinf.E95.D.681
Print ISSN: 0916-8532
Type of Manuscript: LETTER
Category: Data Engineering, Web Information Systems
Keyword: duplicate detection, sentence fingerprint, information retrieval, blogs
Summary:
In this paper, we propose a duplicate document detection model that recognizes both partial duplicates and near duplicates. The model detects partial duplicates as well as exact duplicates by splitting a large document into many small sentence fingerprints. It also detects near duplicates, which result from trivial revisions, by filtering out common words and reordering the word sequence.
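
The idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word list, the naive sentence splitter, the use of alphabetical sorting as the "reordering" step, the SHA-256 hash, and the function names `sentence_fingerprints` and `containment` are all assumptions made for illustration only.

```python
import hashlib
import re

# Hypothetical stop-word list; the common words filtered in the paper are not given here.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "or",
              "is", "are", "was", "were", "it", "this", "that"}


def sentence_fingerprints(text):
    """Return a set of fingerprints, one per non-empty sentence."""
    fingerprints = set()
    # Naive split on ., ! and ? (an assumption, not the paper's segmenter).
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z0-9]+", sentence.lower())
        # Filter common words so trivial edits to them do not change the fingerprint.
        content = [w for w in words if w not in STOP_WORDS]
        if not content:
            continue
        # Sorting the remaining words makes the fingerprint insensitive to word order.
        canonical = " ".join(sorted(content))
        fingerprints.add(hashlib.sha256(canonical.encode("utf-8")).hexdigest())
    return fingerprints


def containment(candidate, source):
    """Fraction of the candidate's sentence fingerprints that also occur in the source."""
    cand = sentence_fingerprints(candidate)
    if not cand:
        return 0.0
    return len(cand & sentence_fingerprints(source)) / len(cand)


original = ("The quick brown fox jumps over the lazy dog. "
            "Blogs are often copied. Search engines dislike duplicates.")
copied = ("Over the lazy dog, the quick brown fox jumps! "
          "I wrote something new here.")

print(containment(copied, original))
```

Because fingerprints are computed per sentence rather than per document, a post that copies only part of another post still shares fingerprints with it, which is what allows a partial duplicate to be flagged. How such a score would be thresholded is not stated in the summary and is left open here.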