Pattern-Based Features vs. Statistical-Based Features in Decision Trees for Word Segmentation

Thanaruk THEERAMUNKONG  Thanasan TANHERMHONG  

Publication
IEICE TRANSACTIONS on Information and Systems   Vol.E87-D   No.5   pp.1254-1260
Publication Date: 2004/05/01
Print ISSN: 0916-8532
Type of Manuscript: PAPER
Category: Natural Language Processing
Keyword: 
word segmentation, decision tree induction, statistics, Thai character cluster

Summary: 
This paper proposes two alternative approaches that do not use a dictionary but instead utilize different types of learned features to segment words in a language that has no explicit word boundaries. Both methods use decision trees, acquired from a training corpus, as the knowledge representation for the segmentation process. The first method, a language-dependent technique, applies a set of constructed feature patterns based on character types to generate a set of heuristic segmentation rules. It separates running text into a sequence of small chunks based on the given patterns and constructs a decision tree for word segmentation. The second method extracts statistics of character sequences from a training corpus and uses them as features for constructing a set of rules by decision tree induction. The latter requires no linguistic knowledge. In experiments on Thai, both methods achieve relatively high accuracy, but the latter performs much better.
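
The general idea behind the second, statistics-based method can be illustrated with a minimal sketch. The summary does not say which character-sequence statistics or which tree learner the authors use, so everything below is an assumption made for illustration: the single feature is the fraction of times a character bigram straddles a word boundary in the training corpus, the "tree" is a depth-one decision tree (a stump) with one learned threshold, and unspaced English strings stand in for Thai text. All function names (`collect_stats`, `boundary_ratio`, `fit_stump`, `segment`) are hypothetical.

```python
from collections import Counter


def cut_positions(words):
    """Return the character offsets at which one word ends and the next begins."""
    cuts, pos = set(), 0
    for w in words[:-1]:
        pos += len(w)
        cuts.add(pos)
    return cuts


def collect_stats(sentences):
    """Count, per character bigram, total occurrences and boundary-straddling ones."""
    boundary, total = Counter(), Counter()
    for words in sentences:
        text = "".join(words)
        cuts = cut_positions(words)
        for i in range(1, len(text)):
            bigram = text[i - 1:i + 1]
            total[bigram] += 1
            if i in cuts:
                boundary[bigram] += 1
    return boundary, total


def boundary_ratio(text, i, boundary, total):
    """Feature for the gap before text[i]: how often this bigram was split in training."""
    bigram = text[i - 1:i + 1]
    seen = total[bigram]
    # Assumption: an unseen bigram is treated as "never split".
    return boundary[bigram] / seen if seen else 0.0


def fit_stump(sentences, boundary, total):
    """Learn a depth-one tree: pick the threshold with the fewest training errors."""
    examples = []
    for words in sentences:
        text = "".join(words)
        cuts = cut_positions(words)
        for i in range(1, len(text)):
            examples.append((boundary_ratio(text, i, boundary, total), i in cuts))
    best_t, best_err = None, None
    for t in sorted({r for r, _ in examples}):
        err = sum((r >= t) != is_cut for r, is_cut in examples)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def segment(text, boundary, total, threshold):
    """Split text wherever the learned rule predicts a word boundary."""
    words, start = [], 0
    for i in range(1, len(text)):
        if boundary_ratio(text, i, boundary, total) >= threshold:
            words.append(text[start:i])
            start = i
    words.append(text[start:])
    return words


# Toy training corpus (hypothetical data, already segmented into words).
train = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
boundary, total = collect_stats(train)
threshold = fit_stump(train, boundary, total)
print(segment("thecatran", boundary, total, threshold))  # ['the', 'cat', 'ran']
```

A single bigram statistic fails as soon as an unseen character pair appears at a boundary, which is one reason a realistic system would feed several statistics over longer character contexts into a full decision tree learner rather than a one-threshold stump.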