
A Support Vector and K-Means Based Hybrid Intelligent Data Clustering Algorithm
Liang SUN Shinichi YOSHIDA Yanchun LIANG
Publication
IEICE TRANSACTIONS on Information and Systems
Vol.E94-D
No.11
pp.2234-2243
Publication Date: 2011/11/01
Online ISSN: 1745-1361
DOI: 10.1587/transinf.E94.D.2234
Print ISSN: 0916-8532
Type of Manuscript: PAPER
Category: Artificial Intelligence, Data Mining
Keyword: data clustering, kernel methods, support vector clustering, K-Means clustering
Summary:
Support vector clustering (SVC), a recently developed unsupervised learning algorithm, has been successfully applied to many real-life data clustering problems. However, its effectiveness and advantages deteriorate when it is applied to complex real-world problems, e.g., those with a large proportion of noise data points and with connecting clusters. This paper proposes a support vector and K-Means based hybrid algorithm to improve the performance of SVC. A new SVC training method is developed based on an analysis of a Gaussian kernel radius function. An empirical study is conducted to guide the selection of the standard deviation of the Gaussian kernel. The proposed algorithm first identifies and removes the outliers, which increase problem complexity, by training a global SVC. The refined data set is then clustered by a kernel-based K-Means algorithm. Finally, several local SVCs are trained for the clusters, and each removed data point is labeled according to its distance to the local SVCs. Because it exploits the advantages of both SVC and K-Means, the proposed algorithm is capable of clustering compact and arbitrarily organized data sets and is more robust to outliers and connecting clusters. Experiments are conducted on 2D data sets generated by mixture models and on benchmark data sets taken from the UCI machine learning repository. The cluster error rate is lower than 3.0% for all the selected data sets. The results demonstrate that the proposed algorithm compares favorably with existing SVC algorithms.
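
The summary does not specify how the kernel-based K-Means step is implemented, so the following is only a minimal sketch of a generic kernel K-Means with a Gaussian kernel, not the authors' method. The function names, the default `sigma`, and the farthest-point seeding are all assumptions made for illustration; the global and local SVC stages are not shown.

```python
import math


def gaussian_kernel_matrix(points, sigma):
    """K[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    n = len(points)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            K[i][j] = K[j][i] = math.exp(-d2 / (2.0 * sigma ** 2))
    return K


def farthest_point_seeds(K, k):
    """Greedy seeding (an assumption, not from the paper).

    For a Gaussian kernel the squared feature-space distance is
    2 - 2 * K[i][j], so the farthest point has the smallest kernel value.
    """
    seeds = [0]
    while len(seeds) < k:
        # Pick the point whose most similar seed is least similar.
        nxt = min(range(len(K)), key=lambda i: max(K[i][s] for s in seeds))
        seeds.append(nxt)
    return seeds


def kernel_kmeans(points, k, sigma=1.0, max_iter=100):
    """Return one cluster label per point using kernel K-Means."""
    K = gaussian_kernel_matrix(points, sigma)
    n = len(points)
    seeds = farthest_point_seeds(K, k)
    # Initial labels: each point joins its most similar seed.
    labels = [max(range(k), key=lambda c: K[i][seeds[c]]) for i in range(n)]
    for _ in range(max_iter):
        members = [[i for i in range(n) if labels[i] == c] for c in range(k)]
        # Within-cluster term: (1 / |C|^2) * sum of K[j][l] over j, l in C.
        within = [
            sum(K[j][l] for j in m for l in m) / len(m) ** 2 if m else 0.0
            for m in members
        ]
        new_labels = []
        for i in range(n):
            dists = []
            for c, m in enumerate(members):
                if not m:
                    dists.append(float("inf"))  # empty cluster: never chosen
                    continue
                cross = sum(K[i][j] for j in m) / len(m)
                # Squared distance from phi(x_i) to the centroid of cluster c.
                dists.append(K[i][i] - 2.0 * cross + within[c])
            new_labels.append(min(range(k), key=lambda c: dists[c]))
        if new_labels == labels:  # converged: assignments no longer change
            break
        labels = new_labels
    return labels


if __name__ == "__main__":
    blob_a = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.3), (0.1, -0.2)]
    blob_b = [(5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.3)]
    print(kernel_kmeans(blob_a + blob_b, k=2, sigma=1.0))
```

The centroid is never formed explicitly: distances to it are computed from kernel values alone, which is what lets K-Means operate in the same Gaussian feature space that SVC uses. In the pipeline the summary describes, this step would run on the refined data set, after the global SVC has removed outliers.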

