A Machine Learning Approach for an Indonesian-English Cross Language Question Answering System


IEICE TRANSACTIONS on Information and Systems   Vol.E90-D   No.11   pp.1841-1852
Publication Date: 2007/11/01
Online ISSN: 1745-1361
DOI: 10.1093/ietisy/e90-d.11.1841
Print ISSN: 0916-8532
Type of Manuscript: PAPER
Category: Natural Language Processing
Keywords: Cross Language Question Answering, Indonesian-English CLQA, limited resource language, machine learning

We have built a CLQA (Cross Language Question Answering) system for a source language with limited data resources (e.g., Indonesian) using a machine learning approach. The CLQA system consists of four modules: a question analyzer, a keyword translator, a passage retriever, and an answer finder. We used machine learning in two modules: the question classifier (part of the question analyzer) and the answer finder. In the question classifier, we classify the EAT (Expected Answer Type) of a question using an SVM (Support Vector Machine). The features for the classification module are mainly the output of our shallow question parsing module. To improve the classification score, we use statistical information extracted from our Indonesian corpus. In the answer finder module, instead of the common approach, in which the answer is located by matching the named entities of the word corpus against the EAT of the question, we locate the answer by text chunking the word corpus. The features for the SVM-based text chunking process consist of question features, word corpus features, and similarity scores between the word corpus and the question keywords. In this way, we eliminate the named entity tagging process for the target document. In the keyword translator module, we use an Indonesian-English dictionary to translate Indonesian keywords into English. We also use some simple patterns to transform some borrowed English words. The keywords are then combined into boolean queries in order to retrieve relevant passages using IDF scores. We first conducted an experiment using 2,837 questions (about 10% of which were used as test data) obtained from 18 Indonesian college students. We next conducted a similar experiment using the NTCIR (NII Test Collection for IR Systems) 2005 CLQA task by translating the English questions into Indonesian. Compared with the Japanese-English and Chinese-English CLQA results in NTCIR 2005, our system was superior to all others except for one system that uses rich data resources, employing three dictionaries. Further, a rough comparison with two other Indonesian-English CLQA systems showed that our system achieved a higher accuracy score.
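
The keyword translation and passage retrieval steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy dictionary, the function names, the whitespace tokenizer, the AND-of-ORs query shape, and the sum-of-IDF scoring are all assumptions made for illustration, since the abstract does not specify how queries are formed or how IDF scores are applied. Translations are assumed to be single words.

```python
import math

# Hypothetical toy dictionary; the paper's actual Indonesian-English
# dictionary is not reproduced here.
TOY_DICT = {"presiden": ["president"], "negara": ["country", "state"]}


def translate_keywords(keywords, dictionary):
    """Map each Indonesian keyword to its candidate English translations.

    Unknown words are kept unchanged (an assumed fallback)."""
    return [dictionary.get(k.lower(), [k.lower()]) for k in keywords]


def boolean_query(translated):
    """Build an assumed query shape: AND across keywords, OR across translations."""
    groups = ["(" + " OR ".join(alts) + ")" for alts in translated]
    return " AND ".join(groups)


def idf_table(passages):
    """Compute IDF = log(N / df) for every whitespace token in the passages."""
    n = len(passages)
    df = {}
    for passage in passages:
        for term in set(passage.lower().split()):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n / count) for term, count in df.items()}


def rank_passages(translated, passages):
    """Keep passages that satisfy the boolean query, ordered by summed IDF."""
    idf = idf_table(passages)
    scored = []
    for passage in passages:
        tokens = set(passage.lower().split())
        # Boolean filter: every keyword needs at least one matching translation.
        if all(any(alt in tokens for alt in alts) for alts in translated):
            score = sum(
                idf.get(alt, 0.0)
                for alts in translated
                for alt in alts
                if alt in tokens
            )
            scored.append((score, passage))
    scored.sort(key=lambda pair: -pair[0])
    return [passage for _, passage in scored]
```

For example, `translate_keywords(["Presiden", "negara"], TOY_DICT)` yields one list of candidate translations per keyword, which `boolean_query` joins into a single query string and `rank_passages` uses to filter and order a list of English passages.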