Splitting Input for Machine Translation Using N-gram Language Model Together with Utterance Similarity
Takao DOI, Eiichiro SUMITA
IEICE TRANSACTIONS on Information and Systems
Publication Date: 2005/06/01
Print ISSN: 0916-8532
Type of Manuscript: PAPER
Category: Natural Language Processing
Keyword: corpus-based machine translation, utterance splitting, N-gram language model, similarity, edit-distance
Splitting an input utterance appears to be a promising technique for improving the translation quality of corpus-based MT systems for speech translation. Many previous methods relied on word-sequence characteristics, such as N-gram clues at splitting positions. To supplement such methods, this paper introduces an additional clue: utterance similarity based on edit distance. Our splitting method generates candidates for utterance splitting based on N-grams and selects the best one by measuring utterance similarity against a corpus. This selection rests on the assumption that a corpus-based MT system can correctly translate an utterance that is similar to an utterance in its training corpus. We conducted experiments using three MT systems: two EBMT systems, one using a phrase as its translation unit and the other using an utterance, and an SMT system. The translation results under various conditions were evaluated with objective measures and a subjective measure. The experimental results show that the proposed method is valuable for all three systems and that using utterance similarity can improve translation quality.
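The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the similarity formula or the way segment scores are combined, so the normalization (one minus word-level edit distance divided by the longer length), the length-weighted average over segments, and all function names below are assumptions made for illustration. The N-gram-based candidate generation is not shown; the candidates are simply passed in.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Word-level Levenshtein distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # delete word_a
                            curr[j - 1] + 1,      # insert word_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Assumed normalization: 1 - distance / length of the longer utterance."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def best_corpus_similarity(segment: Sequence[str],
                           corpus: List[List[str]]) -> float:
    """Similarity of a segment to its closest utterance in the corpus."""
    return max((similarity(segment, utt) for utt in corpus), default=0.0)


def score_split(segments: List[List[str]], corpus: List[List[str]]) -> float:
    """Assumed combination: length-weighted mean of per-segment similarity."""
    total = sum(len(seg) for seg in segments)
    if total == 0:
        return 0.0
    weighted = sum(len(seg) * best_corpus_similarity(seg, corpus)
                   for seg in segments)
    return weighted / total


def select_split(candidates: List[List[List[str]]],
                 corpus: List[List[str]]) -> List[List[str]]:
    """Pick the candidate split whose segments best resemble the corpus."""
    return max(candidates, key=lambda segs: score_split(segs, corpus))


if __name__ == "__main__":
    # Toy corpus and hand-written candidates (hypothetical data).
    corpus = [["thank", "you"], ["where", "is", "the", "station"]]
    words = "thank you where is the station".split()
    candidates = [
        [words],                  # no split
        [words[:2], words[2:]],   # split after "you"
        [words[:3], words[3:]],   # split after "where"
    ]
    print(select_split(candidates, corpus))
```

On this toy data the split after "you" is selected, because both of its segments match corpus utterances exactly. A real system would obtain the candidates from an N-gram language model and would likely need a faster corpus search than the linear scan used here.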