An EM-Based Approach for Mining Word Senses from Corpora

Thatsanee CHAROENPORN  Canasai KRUENGKRAI  Thanaruk THEERAMUNKONG  Virach SORNLERTLAMVANICH  

Publication
IEICE TRANSACTIONS on Information and Systems   Vol.E90-D   No.4   pp.775-782
Publication Date: 2007/04/01
Online ISSN: 1745-1361
DOI: 10.1093/ietisy/e90-d.4.775
Print ISSN: 0916-8532
Type of Manuscript: PAPER
Category: Natural Language Processing
Keyword: corpus-based lexicography, word sense discrimination, clustering, EM algorithm, principal component analysis

Summary: 
Manually collecting contexts of a target word and grouping them by meaning yields a set of word senses, but the task is quite tedious. Towards automated lexicography, this paper proposes a word-sense discrimination method based on two modern techniques: the EM algorithm and principal component analysis (PCA). The spherical Gaussian EM algorithm, enhanced with PCA for robust initialization, is proposed to cluster the senses of a target word automatically. Three variants of the algorithm, namely PCA, sGEM, and PCA-sGEM, are investigated using a gold-standard dataset of two polysemous words. The clustering results are evaluated using purity and entropy as well as a more recent measure, normalized mutual information (NMI). The experimental results indicate that the proposed algorithms achieve promising performance in discriminating word senses and that PCA-sGEM outperforms the other two methods to some extent.
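
To make the three evaluation measures concrete, the following is a minimal sketch of how purity, entropy, and NMI might be computed from a clustering and a gold-standard sense labelling. The summary does not give the formulas, so the definitions here are assumptions: purity as the fraction of contexts that fall in the majority sense of their cluster, entropy as the size-weighted per-cluster sense entropy normalized by the log of the number of senses, and NMI as mutual information divided by the geometric mean of the two entropies. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def purity(clusters, senses):
    """Fraction of contexts belonging to the majority sense of their cluster."""
    joint = Counter(zip(clusters, senses))
    best = {}
    for (c, _), count in joint.items():
        best[c] = max(best.get(c, 0), count)
    return sum(best.values()) / len(clusters)


def entropy(clusters, senses):
    """Size-weighted sense entropy per cluster, normalized by log(#senses).

    Assumed definition: 0 is a perfect clustering, 1 is the worst case.
    """
    n = len(clusters)
    q = len(set(senses))
    if q < 2:
        return 0.0
    joint = Counter(zip(clusters, senses))
    sizes = Counter(clusters)
    total = 0.0
    for (c, _), count in joint.items():
        p = count / sizes[c]
        total -= (sizes[c] / n) * p * math.log(p)
    return total / math.log(q)


def nmi(clusters, senses):
    """Mutual information over sqrt(H(clusters) * H(senses)) (assumed variant)."""
    n = len(clusters)
    joint = Counter(zip(clusters, senses))
    pc, ps = Counter(clusters), Counter(senses)
    mi = sum(
        (count / n) * math.log(count * n / (pc[c] * ps[s]))
        for (c, s), count in joint.items()
    )
    hc = -sum((v / n) * math.log(v / n) for v in pc.values())
    hs = -sum((v / n) * math.log(v / n) for v in ps.values())
    if hc == 0.0 or hs == 0.0:
        return 0.0  # one side has a single group, so nothing is shared
    return mi / math.sqrt(hc * hs)


if __name__ == "__main__":
    # Hypothetical toy data: six contexts, two gold senses, two clusters.
    gold = ["sense_A", "sense_A", "sense_A", "sense_B", "sense_B", "sense_B"]
    pred = [0, 0, 1, 1, 1, 1]
    print(purity(pred, gold), entropy(pred, gold), nmi(pred, gold))
```

Higher purity and NMI and lower entropy indicate a clustering that agrees more closely with the gold-standard senses. Other normalizations of NMI exist, so values from this sketch are not necessarily comparable with those reported in the paper.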