Automatic Induction of Romanization Systems from Bilingual Corpora

Keiko TAGUCHI  Andrew FINCH  Seiichi YAMAMOTO  Eiichiro SUMITA  

IEICE TRANSACTIONS on Information and Systems   Vol.E98-D   No.2   pp.381-393
Publication Date: 2015/02/01
Publicized: 2014/11/14
Online ISSN: 1745-1361
DOI: 10.1587/transinf.2014EDP7236
Type of Manuscript: PAPER
Category: Artificial Intelligence, Data Mining
Keyword: romanization, transliteration, input method

In this article we present a novel corpus-based method for inducing romanization systems for languages through bilingual alignment of transliteration word pairs. First, the word pairs are aligned using a non-parametric Bayesian approach; then, for each grapheme sequence to be romanized, a particular romanization is selected according to a user-specified criterion. As far as we are aware, this paper is the only one to describe a method for automatically deriving complete romanization systems. Unlike existing human-derived romanization systems, the proposed method can induce romanization systems tailored for specific purposes, for example for use in data mining or in efficient user input methods. Our experiments study the romanization of four very different languages: Russian, Japanese, Hindi and Myanmar. The first two languages already have standard romanization systems in regular use, Hindi has a large number of diverse systems, and Myanmar has no standard system for romanization. We compare our induced romanization systems to existing systems for Russian and Japanese, and find that the induced system is almost identical to the existing system for Russian and 69% identical to the existing system for Japanese. We applied our approach to the task of transliteration mining, using Levenshtein distance as the romanization selection criterion. Our experiments show that the induced romanization system matched the performance of the human-created system for Russian and offered substantially improved mining performance for Japanese. We provide an analysis of the mechanism by which our approach improves mining performance, and also analyse the differences in characteristics between the induced system for Japanese and the official Japanese Nihon-shiki system. To investigate the limits of our approach, we studied the romanization of Myanmar, a low-resource language with a large vocabulary of graphemes. We estimate the approximate corpus size required to effectively romanize the k most frequent graphemes in the language, for all values of k up to 1800.
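
The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the alignment step has already produced (source grapheme sequence, Latin grapheme sequence) pairs, and it uses a simplified stand-in criterion that picks, for each source sequence, the candidate with the smallest count-weighted total Levenshtein distance to all of its aligned Latin segments. The function names and the sample pairs are hypothetical.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Standard edit distance (insert, delete, substitute; each costs 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def induce_romanization(aligned_pairs):
    """Pick one romanization per source grapheme sequence.

    aligned_pairs: iterable of (source_segment, latin_segment) tuples,
    assumed to come from a prior alignment step (not shown here).
    Assumed criterion: minimise the count-weighted sum of Levenshtein
    distances to every Latin segment aligned with the same source segment.
    """
    candidates = defaultdict(lambda: defaultdict(int))
    for src, tgt in aligned_pairs:
        candidates[src][tgt] += 1

    table = {}
    for src, counts in candidates.items():
        def total_cost(cand):
            return sum(n * levenshtein(cand, other)
                       for other, n in counts.items())
        # Ties are broken alphabetically so the result is deterministic.
        table[src] = min(sorted(counts), key=lambda c: (total_cost(c), c))
    return table


# Hypothetical aligned segments for two Japanese kana.
pairs = [("し", "shi"), ("し", "shi"), ("し", "si"),
         ("つ", "tsu"), ("つ", "tu"), ("つ", "tsu")]
print(induce_romanization(pairs))  # {'し': 'shi', 'つ': 'tsu'}
```

Swapping total_cost for a different scoring function (for example, one that favours shorter strings for an input method) mirrors the idea of a user-specified selection criterion.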