An EM-Based Approach for Mining Word Senses from Corpora
Thatsanee CHAROENPORN, Canasai KRUENGKRAI, Thanaruk THEERAMUNKONG, Virach SORNLERTLAMVANICH
IEICE TRANSACTIONS on Information and Systems
Publication Date: 2007/04/01
Online ISSN: 1745-1361
Print ISSN: 0916-8532
Type of Manuscript: PAPER
Category: Natural Language Processing
Keyword: corpus-based lexicography, word sense discrimination, clustering, EM algorithm, principal component analysis
Manually collecting the contexts of a target word and grouping them by meaning yields a set of word senses, but the task is tedious. Towards automated lexicography, this paper proposes a word-sense discrimination method based on two techniques: the EM algorithm and principal component analysis (PCA). A spherical Gaussian EM algorithm, enhanced with PCA for robust initialization, is proposed to cluster the word senses of a target word automatically. Three variants of the algorithm, namely PCA, sGEM, and PCA-sGEM, are investigated using a gold-standard dataset of two polysemous words. The clustering results are evaluated using purity and entropy, as well as a more recent measure, normalized mutual information (NMI). The experimental results indicate that the proposed algorithms perform promisingly at discriminating word senses and that PCA-sGEM outperforms the other two methods to some extent.
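The three evaluation measures named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's exact formulas, so the definitions below are common textbook variants and should be read as assumptions: entropy is the size-weighted class entropy of each cluster divided by the log of the number of gold senses, and NMI is normalized by the geometric mean of the two label entropies. The function names and the toy labels are hypothetical and are not taken from the paper's dataset.

```python
import math
from collections import Counter


def purity(labels_true, labels_pred):
    """Fraction of items that carry the majority gold sense of their cluster."""
    joint = Counter(zip(labels_pred, labels_true))
    best = {}
    for (cluster, _sense), count in joint.items():
        best[cluster] = max(best.get(cluster, 0), count)
    return sum(best.values()) / len(labels_true)


def entropy(labels_true, labels_pred):
    """Size-weighted entropy of gold senses within clusters (0 is best).

    Assumption: normalized by log(number of gold senses) to fall in [0, 1].
    """
    n = len(labels_true)
    n_senses = len(set(labels_true))
    if n_senses < 2:
        return 0.0
    joint = Counter(zip(labels_pred, labels_true))
    sizes = Counter(labels_pred)
    total = 0.0
    for (cluster, _sense), count in joint.items():
        # (n_k / n) * -(p log p) with p = count / n_k simplifies to this term.
        total -= (count / n) * math.log(count / sizes[cluster])
    return total / math.log(n_senses)


def _label_entropy(labels):
    """Shannon entropy (natural log) of a flat label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(labels_true, labels_pred):
    """Mutual information divided by sqrt(H(gold) * H(clusters)).

    Assumption: geometric-mean normalization; returns 0.0 by convention
    when either labeling has zero entropy.
    """
    n = len(labels_true)
    joint = Counter(zip(labels_pred, labels_true))
    sizes = Counter(labels_pred)
    senses = Counter(labels_true)
    mutual = 0.0
    for (cluster, sense), count in joint.items():
        mutual += (count / n) * math.log(count * n / (sizes[cluster] * senses[sense]))
    denom = math.sqrt(_label_entropy(labels_true) * _label_entropy(labels_pred))
    return mutual / denom if denom > 0 else 0.0


# Toy example (invented labels): six contexts of one target word, two gold senses.
gold = ["money", "money", "money", "river", "river", "river"]
clusters = [0, 0, 1, 1, 1, 1]
print("purity :", purity(gold, clusters))
print("entropy:", entropy(gold, clusters))
print("NMI    :", nmi(gold, clusters))
```

Higher purity and NMI and lower entropy indicate clusters that agree more closely with the gold senses. Purity alone rewards over-splitting, which is one reason a measure such as NMI is often reported alongside it.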