Automatic Induction of Romanization Systems from Bilingual Corpora
Keiko TAGUCHI Andrew FINCH Seiichi YAMAMOTO Eiichiro SUMITA
IEICE TRANSACTIONS on Information and Systems
Publication Date: 2015/02/01
Online ISSN: 1745-1361
Type of Manuscript: PAPER
Category: Artificial Intelligence, Data Mining
Keywords: romanization, transliteration, input method
In this article we present a novel corpus-based method for inducing romanization systems for languages through a bilingual alignment of transliteration word pairs. First, the word pairs are aligned using a non-parametric Bayesian approach, and then for each grapheme sequence to be romanized, a particular romanization is selected according to a user-specified criterion. As far as we are aware, this paper is the only one to describe a method for automatically deriving complete romanization systems. Unlike existing human-derived romanization systems, the proposed method is able to induce romanization systems tailored for specific purposes, for example, for use in data mining or in efficient user input methods. Our experiments study the romanization of four very different languages: Russian, Japanese, Hindi and Myanmar. The first two languages already have standard romanization systems in regular use, Hindi has a large number of diverse systems, and Myanmar has no standard system for romanization. We compare our induced romanization systems to the existing systems for Russian and Japanese. We find that the induced system is almost identical to the existing system for Russian, and 69% identical to the existing system for Japanese. We applied our approach to the task of transliteration mining, and used Levenshtein distance as the romanization selection criterion. Our experiments show that our induced romanization system was able to match the performance of the human-created system for Russian, and to offer substantially improved mining performance for Japanese. We provide an analysis of the mechanism our approach uses to improve mining performance, and also analyse the differences in characteristics between the induced system for Japanese and the official Japanese Nihon-shiki system. In order to investigate the limits of our approach, we studied the romanization of Myanmar, a low-resource language with a large vocabulary of graphemes. We estimate the approximate corpus size required to effectively romanize the k most frequent graphemes in the language for all values of k up to 1800.
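The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the alignment stage has already produced (grapheme, Latin segment) pairs, and it assumes one plausible reading of a Levenshtein-based criterion, namely choosing for each grapheme the observed Latin segment with the smallest total edit distance to all segments aligned to that grapheme. The function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def induce_romanization(aligned_segments):
    """Pick one Latin segment per grapheme from aligned (grapheme, latin) pairs.

    Assumed criterion (illustrative only): the candidate that minimises the
    count-weighted sum of Levenshtein distances to every observed segment.
    """
    observed = defaultdict(Counter)
    for grapheme, latin in aligned_segments:
        observed[grapheme][latin] += 1

    table = {}
    for grapheme, counts in observed.items():
        def total_cost(candidate):
            return sum(n * levenshtein(candidate, latin)
                       for latin, n in counts.items())
        # sorted() makes tie-breaking deterministic
        table[grapheme] = min(sorted(counts), key=total_cost)
    return table


# Hypothetical alignment output for two Cyrillic graphemes.
toy = [("ш", "sh"), ("ш", "sh"), ("ш", "sch"), ("ш", "s"),
       ("к", "k"), ("к", "c"), ("к", "k")]
table = induce_romanization(toy)
print(table)
```

Swapping `total_cost` for a different scoring function (for example, one that prefers shorter segments for an input method) would correspond to changing the user-specified criterion mentioned in the abstract.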