Using Topic Keyword Clusters for Automatic Document Clustering

Hsi-Cheng CHANG  Chiun-Chieh HSU  

IEICE TRANSACTIONS on Information and Systems   Vol.E88-D   No.8   pp.1852-1860
Publication Date: 2005/08/01
Online ISSN: 
DOI: 10.1093/ietisy/e88-d.8.1852
Print ISSN: 0916-8532
Type of Manuscript: Special Section PAPER (Special Section on Document Image Understanding and Digital Documents)
Category: Document Clustering
document clustering,  topic keyword clustering,  weighted undirected graph,  information retrieval,  

Full Text: PDF(2.3MB)>>
Buy this Article

Data clustering is a technique for grouping similar data items together for convenient understanding. Conventional data clustering methods, including agglomerative hierarchical clustering and partitional clustering algorithms, frequently perform unsatisfactorily for large text collections, since the computation complexities of the conventional data clustering methods increase very quickly with the number of data items. Poor clustering results degrade intelligent applications such as event tracking and information extraction. This paper presents an unsupervised document clustering method which identifies topic keyword clusters of the text corpus. The proposed method adopts a multi-stage process. First, an aggressive data cleaning approach is employed to reduce the noise in the free text and further identify the topic keywords in the documents. All extracted keywords are then grouped into topic keyword clusters using the k-nearest neighbor approach and the keyword clustering technique. Finally, all documents in the corpus are clustered based on the topic keyword clusters. The proposed method is assessed against conventional data clustering methods on a web news corpus. The experimental results show that the proposed method is an efficient and effective clustering approach.