Multi-Stage Automatic NE and PoS Annotation Using Pattern-Based and Statistical-Based Techniques for Thai Corpus Construction

Nattapong TONGTEP  Thanaruk THEERAMUNKONG  

Publication
IEICE TRANSACTIONS on Information and Systems   Vol.E96-D   No.10   pp.2245-2256
Publication Date: 2013/10/01
Online ISSN: 1745-1361
DOI: 10.1587/transinf.E96.D.2245
Print ISSN: 0916-8532
Type of Manuscript: PAPER
Category: Natural Language Processing
Keyword: multi-stage annotation, named entity, part of speech, corpus construction, syllabic alphabetic language

Summary: 
Automated or semi-automated annotation is a practical solution for large-scale corpus construction. However, the special characteristics of the Thai language, such as the lack of word-boundary and sentence-boundary markers, raise several issues in automatic corpus annotation. This paper presents a multi-stage annotation framework consisting of two chunking stages followed by three tagging stages. The two chunking stages are pattern matching-based named entity (NE) extraction and dictionary-based word segmentation, while the three succeeding tagging stages are dictionary-, pattern- and statistical-based tagging. Applying a heuristic of ambiguity priority, NE extraction is performed first on the original text using a set of patterns, in the order of pattern ambiguity. Next, the remaining text is segmented into words with a dictionary. The obtained chunks are then tagged with types of named entities or parts of speech (PoS) using dictionaries, patterns and statistics. Focusing on the reduction of human intervention in corpus construction, our experimental results show that the dictionary-based tagging process can assign unique tags to 64.92% of the words, leaving 24.14% as unknown words and 10.94% as ambiguously tagged words. Subsequently, the pattern-based tagging reduces the unknown words to only 13.34%, while the statistical-based tagging reduces the ambiguously tagged words to only 3.01%.
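
The first three stages described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation. The patterns, the dictionary, the tag names, and the unspaced Latin-alphabet input are all hypothetical stand-ins for Thai text. Two details are assumptions the summary does not confirm: patterns are applied from least to most ambiguous, and segmentation uses longest matching. The pattern-based and statistical-based tagging stages are omitted.

```python
import re

# Hypothetical NE patterns, listed from least to most ambiguous (assumed order).
NE_PATTERNS = [
    ("DATE", re.compile(r"\d{4}/\d{2}/\d{2}")),
    ("PERSON", re.compile(r"mr(?:smith|jones)")),
]

# Hypothetical dictionary mapping each word to its possible PoS tags.
DICTIONARY = {
    "saw": ["NOUN", "VERB"],
    "the": ["DET"],
    "on": ["PREP"],
}


def extract_nes(text):
    """Chunking stage 1: split text into NE chunks and untagged remainders."""
    chunks = [(text, None)]
    for ne_type, pattern in NE_PATTERNS:
        next_chunks = []
        for chunk, tag in chunks:
            if tag is not None:  # already claimed by an earlier pattern
                next_chunks.append((chunk, tag))
                continue
            pos = 0
            for m in pattern.finditer(chunk):
                if m.start() > pos:
                    next_chunks.append((chunk[pos:m.start()], None))
                next_chunks.append((m.group(), ne_type))
                pos = m.end()
            if pos < len(chunk):
                next_chunks.append((chunk[pos:], None))
        chunks = next_chunks
    return chunks


def segment(chunk):
    """Chunking stage 2: dictionary-based longest-matching segmentation."""
    words, unknown, i = [], "", 0
    while i < len(chunk):
        # Try the longest candidate starting at position i first.
        match = next(
            (chunk[i:j] for j in range(len(chunk), i, -1) if chunk[i:j] in DICTIONARY),
            None,
        )
        if match is None:
            unknown += chunk[i]  # collect characters no dictionary word covers
            i += 1
            continue
        if unknown:
            words.append(unknown)
            unknown = ""
        words.append(match)
        i += len(match)
    if unknown:
        words.append(unknown)
    return words


def tag_word(word):
    """Tagging stage 1: dictionary lookup yields unique, ambiguous, or unknown."""
    tags = DICTIONARY.get(word)
    if tags is None:
        return "UNKNOWN"
    if len(tags) == 1:
        return tags[0]
    return "AMBIGUOUS:" + "|".join(tags)


def annotate(text):
    """Run NE extraction, then segment and tag whatever text remains."""
    result = []
    for chunk, tag in extract_nes(text):
        if tag is not None:
            result.append((chunk, tag))
        else:
            result.extend((w, tag_word(w)) for w in segment(chunk))
    return result


if __name__ == "__main__":
    for chunk, tag in annotate("mrsmithsawthexyzon2013/10/01"):
        print(chunk, tag)
```

In this toy run, "xyz" stays unknown and "saw" stays ambiguous. These are the two residual classes that the later pattern-based and statistical-based tagging stages are reported to reduce.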