Topic Extraction for Documents Based on Compressibility Vector

Nuo ZHANG  Toshinori WATANABE  

Publication
IEICE TRANSACTIONS on Information and Systems   Vol.E95-D   No.10   pp.2438-2446
Publication Date: 2012/10/01
Online ISSN: 1745-1361
DOI: 10.1587/transinf.E95.D.2438
Print ISSN: 0916-8532
Type of Manuscript: PAPER
Category: Artificial Intelligence, Data Mining
Keyword: topic extraction, document analysis, PRDC, relation analysis, clustering, data compression

Summary: 
Nowadays, a great many e-documents are accessed on the Internet. It would be helpful if those documents could be analyzed automatically and their significant contents extracted. Similarity analysis and topic extraction are widely used as document relation analysis techniques. Most proposed methods require preprocessing steps such as stemming and stop-word removal. Such methods rely on natural language processing (NLP) technology and hence depend on the features of the language and the dataset. In this study, we propose novel document relation analysis and topic extraction methods based on text compression. The proposed approaches do not require NLP and can evaluate documents automatically. We test our proposal on model documents, URCS, and the Reuters-21578 dataset for relation analysis and topic extraction. Simulations show the effectiveness of the proposed methods.
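
The summary does not spell out how a compressibility vector is built, so the following is only a rough sketch of the general idea and not the authors' implementation. It assumes that each component of the vector is the compression ratio of a document when the compressor is primed with one of several base texts. It also swaps in DEFLATE with a preset dictionary (Python's zlib) where PRDC may well use a different compressor. The names compressed_size and compressibility_vector and the sample texts are hypothetical.

```python
import zlib
from typing import List


def compressed_size(data: bytes, zdict: bytes = b"") -> int:
    """Bytes needed to DEFLATE `data`, optionally primed with a preset dictionary."""
    if zdict:
        comp = zlib.compressobj(level=9, zdict=zdict)
    else:
        comp = zlib.compressobj(level=9)
    return len(comp.compress(data) + comp.flush())


def compressibility_vector(doc: str, base_texts: List[str]) -> List[float]:
    """One compression ratio per base text (assumed definition, not from the paper).

    A smaller ratio means the document shares more substrings with that base text.
    """
    data = doc.encode("utf-8")
    if not data:
        raise ValueError("document must not be empty")
    return [
        compressed_size(data, base.encode("utf-8")) / len(data)
        for base in base_texts
    ]


# Hypothetical base texts standing in for two topics.
finance = "the stock market fell sharply as investors sold shares " * 5
sports = "the football team won the championship match last night " * 5

doc = "investors sold shares as the stock market fell sharply"
vector = compressibility_vector(doc, [finance, sports])
print(vector)  # the first component is expected to be the smaller one
```

No stemming, stop-word list, or tokenizer is involved: the sketch works on raw bytes, which is the property the summary highlights. Documents with similar vectors could then be grouped by any standard clustering method.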