Automatically Extracting Parallel Sentences from Wikipedia Using Sequential Matching of Language Resources
Juryong CHEON, Youngjoong KO
IEICE TRANSACTIONS on Information and Systems
Publication Date: 2017/02/01
Online ISSN: 1745-1361
Type of Manuscript: LETTER
Category: Natural Language Processing
Keyword: automatic parallel corpus construction, language resources, sentence similarity calculation, Wikipedia
In this paper, we propose a method that uses language resources to find similar sentences in Wikipedia for building an English-Korean parallel corpus. Instead of a traditional machine-readable dictionary, we use a Wiki-dictionary consisting of Wikipedia document titles, together with bilingual example sentence pairs from a Web dictionary. We calculate the similarity between sentences by sequentially matching these language resources, and then evaluate the extracted parallel sentences. In the experiments, the proposed parallel sentence extraction method achieves an F1-score of 65.4%.
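
The abstract does not specify how the sequential matching is carried out, so the following Python sketch is only one plausible reading, not the authors' implementation. It assumes that each language resource can be represented as a mapping from an English term to a set of Korean translations, that the resources are consulted in a fixed order, and that the similarity score is the fraction of English tokens whose translation appears in the Korean sentence. The function name `sequential_match_similarity`, the toy dictionaries, and the scoring rule are all hypothetical.

```python
from typing import Dict, Sequence, Set

# Assumption: a language resource maps a lowercased English term
# to the set of Korean strings that may translate it.
Resource = Dict[str, Set[str]]


def sequential_match_similarity(
    en_tokens: Sequence[str],
    ko_sentence: str,
    resources: Sequence[Resource],
) -> float:
    """Return the fraction of English tokens matched in the Korean sentence.

    For each token, the resources are tried in the given order; the token
    counts as matched as soon as one resource supplies a translation that
    occurs as a substring of the Korean sentence. (Hypothetical scoring
    rule, chosen only for illustration.)
    """
    if not en_tokens:
        return 0.0
    matched = 0
    for token in en_tokens:
        key = token.lower()
        for resource in resources:
            candidates = resource.get(key, set())
            if any(candidate in ko_sentence for candidate in candidates):
                matched += 1
                break  # stop at the first resource that yields a match
    return matched / len(en_tokens)


# Toy resources (invented for this example, not taken from the paper).
wiki_dictionary: Resource = {"seoul": {"서울"}}
example_sentence_dictionary: Resource = {
    "capital": {"수도"},
    "korea": {"한국", "대한민국"},
}

score = sequential_match_similarity(
    ["Seoul", "is", "the", "capital", "of", "Korea"],
    "서울은 한국의 수도이다",
    [wiki_dictionary, example_sentence_dictionary],
)
print(score)
```

In a setup like this, a sentence pair would be kept as a parallel candidate when its score exceeds a chosen threshold; the threshold value and any weighting of the individual resources are left open here because the abstract does not state them.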