Unsupervised Word-Sense Disambiguation Using Bilingual Comparable Corpora

Hiroyuki KAJI  Yasutsugu MORIMOTO  

IEICE TRANSACTIONS on Information and Systems   Vol.E88-D   No.2   pp.289-301
Publication Date: 2005/02/01
DOI: 10.1093/ietisy/e88-d.2.289
Print ISSN: 0916-8532
Type of Manuscript: PAPER
Category: Natural Language Processing
Keyword: word-sense disambiguation, unsupervised learning, comparable corpora

An unsupervised method for word-sense disambiguation using bilingual comparable corpora was developed. First, it extracts word associations, i.e., statistically significant pairs of associated words, from the corpus of each language. Then, it aligns the word associations across the two languages by consulting a bilingual dictionary and calculates the correlation between each sense of a target polysemous word and each of its associated words, which serve as clues for identifying the sense of the target word. To overcome the disparity in topical coverage between the corpora of the two languages, as well as the ambiguity in word-association alignment, an algorithm that iteratively calculates a sense-vs.-clue correlation matrix for each target word was devised. Each instance of the target word is disambiguated by selecting the sense that maximizes a score, namely the weighted sum of the correlations between that sense and the clues appearing in the context of the instance. An experiment using Wall Street Journal and Nihon Keizai Shimbun corpora together with the EDR bilingual dictionary showed promising performance: the F-measure of sense selection was 74.6%, compared with a baseline of 62.8%. The method could be extended into a fully unsupervised method that features automatic division and definition of word senses.
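
The sense-selection step described in the abstract can be illustrated with a minimal Python sketch. It assumes the sense-vs.-clue correlation matrix has already been computed and is stored as a nested dictionary. The function name `select_sense`, the toy target word "bank", its sense labels, the correlation values, and the default uniform clue weights are all hypothetical choices made for illustration; the abstract does not specify how the weights are defined, and this sketch does not reproduce the iterative matrix calculation.

```python
from typing import Dict, Iterable, Optional

# Hypothetical sense-vs.-clue correlation matrix for the target word "bank".
# The values are invented for illustration, not taken from the paper.
CORRELATION: Dict[str, Dict[str, float]] = {
    "financial_institution": {"loan": 0.8, "interest": 0.6, "water": 0.05},
    "river_side": {"loan": 0.1, "interest": 0.2, "water": 0.7},
}


def select_sense(
    correlation: Dict[str, Dict[str, float]],
    context_clues: Iterable[str],
    weights: Optional[Dict[str, float]] = None,
) -> str:
    """Return the sense with the highest weighted sum of clue correlations.

    Clues absent from the matrix contribute 0; clues without an explicit
    weight get weight 1.0 (an assumption of this sketch).
    """
    clues = list(context_clues)
    weights = weights or {}
    scores = {
        sense: sum(weights.get(c, 1.0) * clue_corr.get(c, 0.0) for c in clues)
        for sense, clue_corr in correlation.items()
    }
    # Pick the sense whose score is maximal.
    return max(scores, key=lambda s: scores[s])


if __name__ == "__main__":
    print(select_sense(CORRELATION, ["loan", "interest"]))
    print(select_sense(CORRELATION, ["water", "loan"], weights={"water": 3.0}))
```

In this sketch, a clue that appears in the context but not in the matrix simply adds nothing to any sense's score, so it cannot change the selection.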