Using Topic Keyword Clusters for Automatic Document Clustering
Hsi-Cheng CHANG, Chiun-Chieh HSU
IEICE TRANSACTIONS on Information and Systems
Publication Date: 2005/08/01
Print ISSN: 0916-8532
Type of Manuscript: Special Section PAPER (Special Section on Document Image Understanding and Digital Documents)
Category: Document Clustering
Keywords: document clustering, topic keyword clustering, weighted undirected graph, information retrieval
Data clustering is a technique for grouping similar data items together so that they are easier to understand. Conventional data clustering methods, including agglomerative hierarchical clustering and partitional clustering algorithms, often perform poorly on large text collections because their computational complexity grows rapidly with the number of data items. Poor clustering results in turn degrade intelligent applications such as event tracking and information extraction. This paper presents an unsupervised document clustering method that identifies topic keyword clusters in the text corpus. The proposed method is a multi-stage process. First, an aggressive data cleaning approach reduces the noise in the free text and identifies the topic keywords in the documents. All extracted keywords are then grouped into topic keyword clusters using the k-nearest neighbor approach and a keyword clustering technique. Finally, all documents in the corpus are clustered based on the topic keyword clusters. The proposed method is evaluated against conventional data clustering methods on a web news corpus. The experimental results show that the proposed method is both efficient and effective.
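The three stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the cleaning rules, the edge weights, or the keyword clustering technique. The sketch assumes a toy stopword list, a document-frequency cutoff (`min_df`) standing in for the aggressive cleaning, co-occurrence counts as the weights of the undirected graph, and connected components of the k-nearest-neighbor graph as the keyword clusters. All function and parameter names are hypothetical.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

# Toy stopword list; a real system would use a far larger one (assumption).
STOPWORDS = {"a", "an", "and", "as", "in", "of", "the", "to"}


def extract_keywords(docs, min_df=2):
    """Stage 1: drop stopwords and terms seen in fewer than min_df documents."""
    token_sets = [
        {w for w in re.findall(r"[a-z]+", doc.lower()) if w not in STOPWORDS}
        for doc in docs
    ]
    df = Counter(w for tokens in token_sets for w in tokens)
    return [{w for w in tokens if df[w] >= min_df} for tokens in token_sets]


def cluster_keywords(doc_keywords, k=2):
    """Stage 2: build a k-nearest-neighbor graph over co-occurrence weights,
    then take its connected components as keyword clusters (assumption)."""
    weight = defaultdict(Counter)  # weighted undirected co-occurrence graph
    for keywords in doc_keywords:
        for a, b in combinations(sorted(keywords), 2):
            weight[a][b] += 1
            weight[b][a] += 1

    vocabulary = sorted(set().union(*doc_keywords))
    neighbors = defaultdict(set)
    for word in vocabulary:
        # Keep the k heaviest edges; ties are broken alphabetically.
        ranked = sorted(weight[word], key=lambda other: (-weight[word][other], other))
        for other in ranked[:k]:
            neighbors[word].add(other)
            neighbors[other].add(word)

    clusters, seen = [], set()
    for word in vocabulary:
        if word in seen:
            continue
        component, stack = set(), [word]
        while stack:  # depth-first search over the kNN graph
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbors[current] - component)
        seen |= component
        clusters.append(component)
    return clusters


def assign_documents(doc_keywords, clusters):
    """Stage 3: label each document with the cluster sharing the most keywords;
    documents that share no keyword with any cluster get -1."""
    labels = []
    for keywords in doc_keywords:
        overlaps = [len(keywords & cluster) for cluster in clusters]
        labels.append(overlaps.index(max(overlaps)) if any(overlaps) else -1)
    return labels


# Made-up toy corpus, not the web news corpus used in the paper.
docs = [
    "stock market falls as bank shares drop",
    "bank shares lift stock market",
    "market rally: stock and bank gains",
    "football team wins league match",
    "league match ends as football team draws",
    "team coach praises football league",
]
doc_keywords = extract_keywords(docs)
clusters = cluster_keywords(doc_keywords)
labels = assign_documents(doc_keywords, clusters)
print(labels)  # prints [0, 0, 0, 1, 1, 1]
```

In this sketch the expensive step runs over keywords rather than documents, which is one plausible reading of why clustering keywords first could scale better than clustering documents directly: the filtered vocabulary is typically much smaller than the set of all pairwise document comparisons.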