Automatically Extracting Parallel Sentences from Wikipedia Using Sequential Matching of Language Resources

Juryong CHEON  Youngjoong KO  

Publication
IEICE TRANSACTIONS on Information and Systems   Vol.E100-D   No.2   pp.405-408
Publication Date: 2017/02/01
Online ISSN: 1745-1361
Type of Manuscript: LETTER
Category: Natural Language Processing
Keyword: automatic parallel corpus construction, language resources, sentence similarity calculation, Wikipedia

Summary: 
In this paper, we propose a method for finding similar sentences based on language resources, with the goal of building an English-Korean parallel corpus from Wikipedia. We use a Wiki-dictionary consisting of document titles from Wikipedia and bilingual example sentence pairs from a Web dictionary instead of a traditional machine-readable dictionary. We calculate the similarity between sentences by sequentially matching these language resources, and then evaluate the extracted parallel sentences. In the experiments, the proposed parallel sentence extraction method achieves an F1-score of 65.4%.
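
The summary does not spell out the matching procedure, so the following is only a minimal sketch of what sequential matching of language resources could look like, not the authors' implementation. It assumes that each resource is a plain mapping from an English token to a set of Korean candidates, that resources are consulted in a fixed order (a token left unmatched by one resource falls through to the next), and that similarity is a Dice-style ratio of matched tokens. The names `sequential_match`, `similarity`, `wiki_dict`, and `example_dict`, as well as the toy entries and the pre-tokenized sentences, are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical resource type: English token -> set of Korean candidate tokens.
Resource = Dict[str, Set[str]]


def sequential_match(src_tokens: List[str],
                     tgt_tokens: List[str],
                     resources: List[Resource]) -> int:
    """Count source tokens that can be aligned to a target token.

    Resources are tried in order; only tokens that the earlier resources
    failed to match are passed on to the later ones (an assumption).
    """
    unmatched_src = list(src_tokens)
    remaining_tgt = list(tgt_tokens)
    matched = 0
    for resource in resources:
        still_unmatched = []
        for token in unmatched_src:
            candidates = resource.get(token, set())
            # Take the first unused target token that this resource licenses.
            hit = next((t for t in remaining_tgt if t in candidates), None)
            if hit is None:
                still_unmatched.append(token)
            else:
                remaining_tgt.remove(hit)  # each target token is used once
                matched += 1
        unmatched_src = still_unmatched
    return matched


def similarity(src_tokens: List[str],
               tgt_tokens: List[str],
               resources: List[Resource]) -> float:
    """Dice-style similarity: 2 * matches / (|source| + |target|)."""
    if not src_tokens or not tgt_tokens:
        return 0.0
    matched = sequential_match(src_tokens, tgt_tokens, resources)
    return 2 * matched / (len(src_tokens) + len(tgt_tokens))


# Toy, made-up entries standing in for the two kinds of resources.
wiki_dict: Resource = {"seoul": {"서울"}, "korea": {"한국", "대한민국"}}
example_dict: Resource = {"capital": {"수도"}, "is": {"이다"}}

english = ["seoul", "is", "the", "capital", "of", "korea"]
korean = ["서울", "은", "한국", "의", "수도", "이다"]  # assumed pre-tokenized

score = similarity(english, korean, [wiki_dict, example_dict])
print(f"similarity = {score:.3f}")
```

In a sketch like this, a sentence pair would be kept as parallel when its score exceeds some threshold; the threshold value and the tokenization of Korean text are left open here because the summary does not specify them.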