For Full-Text PDF, please login, if you are a member of IEICE,|
or go to Pay Per View on menu list, if you are a nonmember of IEICE.
Pattern-Based Features vs. Statistical-Based Features in Decision Trees for Word Segmentation
Thanaruk THEERAMUNKONG Thanasan TANHERMHONG
IEICE TRANSACTIONS on Information and Systems
Publication Date: 2004/05/01
Print ISSN: 0916-8532
Type of Manuscript: PAPER
Category: Natural Language Processing
word segmentation, decision tree induction, statistics, Thai character cluster,
Full Text: PDF>>
This paper proposes two alternative approaches that do not make use of a dictionary but instead utilizes different types of learned features to segment words in a language that has no explicit word boundary. Both methods utilize decision trees as knowledge representation acquired from a training corpus in the segmentation process. The first method, a language-dependent technique, applies a set of constructed features patterns based on character types to generate a set of heuristic segmentation rules. It separates a running text into a sequence of small chunks based on the given patterns, and constructs a decision tree for word segmentation. The second method extracts statistics of character sequences from a training corpus and uses them as features for the process of constructing a set of rules by decision tree induction. The latter needs no linguistic knowledge. By experiments on Thai language, both methods achieve relatively high accuracy but the latter performs much better.