Training Set Selection for Building Compact and Efficient Language Models

Keiji YASUDA  Hirofumi YAMAMOTO  Eiichiro SUMITA  

IEICE TRANSACTIONS on Information and Systems   Vol.E92-D   No.3   pp.506-511
Publication Date: 2009/03/01
Online ISSN: 1745-1361
DOI: 10.1587/transinf.E92.D.506
Print ISSN: 0916-8532
Type of Manuscript: PAPER
Category: Natural Language Processing
Keywords: speech translation, sentence clustering, language modeling, large size corpus, TC-STAR


For statistical language model training, corpora matched to the target domain are required. However, training corpora sometimes include both sentences that match the target domain and sentences that do not. In such cases, training set selection is effective both for reducing model size and for improving model performance. In this paper, a training set selection method for statistical language model training is described. The method provides two advantages for training a language model: it can improve the language model's performance, and it can reduce the computational load of the language model. The method has four steps. 1) Sentence clustering is applied to all available corpora. 2) A language model is trained on each cluster. 3) Perplexity on the development set is calculated using each of these language models. 4) For the final language model training, we use only the clusters whose language models yield low perplexities. The experimental results indicate that the language model trained on the data selected by our method gives lower perplexity on an open test set than a language model trained on all available corpora.
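The four steps above can be sketched in miniature. The snippet below is an illustrative toy, not the paper's implementation: it assumes clusters are already given (step 1), uses add-one-smoothed unigram models in place of the paper's language models, and keeps the `keep` lowest-perplexity clusters; all function names and the toy corpora are invented for illustration.

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Step 2 (toy): train an add-one-smoothed unigram LM on one cluster."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves probability mass for unseen words
    return lambda w: (counts[w] + 1) / (total + vocab)

def perplexity(model, sentences):
    """Step 3 (toy): per-word perplexity of a model on a held-out dev set."""
    words = [w for s in sentences for w in s.split()]
    log_sum = sum(math.log(model(w)) for w in words)
    return math.exp(-log_sum / len(words))

def select_clusters(clusters, dev_set, keep):
    """Step 4 (toy): keep the `keep` clusters whose LMs score the dev set
    with the lowest perplexity; train the final LM on their union."""
    ranked = sorted(clusters, key=lambda c: perplexity(train_unigram(c), dev_set))
    return ranked[:keep]

# Toy data: one cluster matched to the dev domain (travel), one unmatched.
matched = ["book a flight to tokyo", "reserve a hotel room"]
unmatched = ["the stock market fell sharply", "interest rates rose"]
dev = ["book a hotel in tokyo"]

selected = select_clusters([matched, unmatched], dev, keep=1)
print(selected[0] is matched)  # the domain-matched cluster is selected
```

With real data the clusters would come from a sentence-clustering step and the per-cluster models would be full n-gram language models, but the ranking-by-dev-set-perplexity logic is the same.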