Splitting Input for Machine Translation Using N-gram Language Model Together with Utterance Similarity

Takao DOI  Eiichiro SUMITA  

IEICE TRANSACTIONS on Information and Systems   Vol.E88-D   No.6   pp.1256-1264
Publication Date: 2005/06/01
DOI: 10.1093/ietisy/e88-d.6.1256
Print ISSN: 0916-8532
Type of Manuscript: PAPER
Category: Natural Language Processing
Keyword: corpus-based machine translation, utterance splitting, N-gram language model, similarity, edit-distance


Splitting an input utterance is a promising technique for improving the translation quality of corpus-based MT systems for speech translation. Many previous methods rely on word-sequence characteristics, such as N-gram clues around splitting positions. In this paper, to supplement splitting methods based on word-sequence characteristics, we introduce another clue: similarity based on edit-distance. Our method generates candidates for utterance splitting based on N-grams and selects the best one by measuring utterance similarity against a corpus. This selection rests on the assumption that a corpus-based MT system can correctly translate an utterance that is similar to an utterance in its training corpus. We conducted experiments using three MT systems: two EBMT systems, one using a phrase as its translation unit and the other using an utterance, and an SMT system. The translation results under various conditions were evaluated with objective measures and a subjective measure. The experimental results demonstrate that the proposed method is valuable for all three systems and that using utterance similarity can improve translation quality.
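The two-stage idea in the abstract (propose split candidates from N-gram clues, then pick the candidate whose pieces best resemble corpus utterances) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names, the toy `boundary_score` table standing in for an N-gram language model, the threshold, the normalization of edit-distance into a similarity in [0, 1], and the length-weighted averaging over segments. The corpus is assumed to be non-empty.

```python
from itertools import combinations

def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete wa
                           cur[j - 1] + 1,             # insert wb
                           prev[j - 1] + (wa != wb)))  # substitute or match
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer utterance."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def candidate_splits(words, boundary_score, threshold):
    """Yield every splitting that cuts only at positions the (hypothetical)
    N-gram boundary score rates at or above the threshold."""
    positions = [i for i in range(1, len(words))
                 if boundary_score(words[i - 1], words[i]) >= threshold]
    for k in range(len(positions) + 1):
        for cut in combinations(positions, k):
            bounds = [0, *cut, len(words)]
            yield [words[s:e] for s, e in zip(bounds, bounds[1:])]

def split_score(segments, corpus):
    """Length-weighted mean of each segment's best similarity to the corpus."""
    total = sum(len(seg) for seg in segments)
    return sum(len(seg) * max(similarity(seg, c) for c in corpus)
               for seg in segments) / total

def best_split(words, corpus, boundary_score, threshold=0.5):
    """Select the candidate splitting most similar to corpus utterances."""
    return max(candidate_splits(words, boundary_score, threshold),
               key=lambda segs: split_score(segs, corpus))

# Toy data: a two-utterance corpus and a hand-written boundary table.
corpus = ["thank you very much".split(), "where is the station".split()]
toy_scores = {("much", "where"): 0.9}  # stand-in for an N-gram model

def boundary_score(left, right):
    return toy_scores.get((left, right), 0.1)

words = "thank you very much where is the station".split()
result = best_split(words, corpus, boundary_score)
print(result)
```

In this toy case the unsplit utterance scores 0.5 against the corpus while the two-segment candidate scores 1.0, so the split after "much" is selected. A real system would replace the lookup table with an actual N-gram language model and would limit the number of candidates, since enumerating all subsets of boundary positions grows exponentially.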