Extracting Partial Parsing Rules from Tree-Annotated Corpus: Toward Deterministic Global Parsing

Myung-Seok CHOI  Kong-Joo LEE  Key-Sun CHOI  Gil Chang KIM  

IEICE TRANSACTIONS on Information and Systems   Vol.E88-D   No.6   pp.1248-1255
Publication Date: 2005/06/01
Online ISSN: 
Print ISSN: 0916-8532
Type of Manuscript: PAPER
Category: Natural Language Processing
partial parsing,  deterministic parsing,  automatic rule extraction,  underspecification,  decision tree,  Korean,  

Full Text: PDF(587KB)
>>Buy this Article

It is not always possible to find a global parse for an input sentence owing to problems such as errors of a sentence, incompleteness of lexicon and grammar. Partial parsing is an alternative approach to respond to these problems. Partial parsing techniques try to recover syntactic information efficiently and reliably by sacrificing completeness and depth of analysis. One of the difficulties in partial parsing is how the grammar might be automatically extracted. In this paper we present a method of automatically extracting partial parsing rules from a tree-annotated corpus using the decision tree method. Our goal is deterministic global parsing using partial parsing rules, in other words, to extract partial parsing rules with higher accuracy and broader expansion. First, we define a rule template that enables to learn a subtree for a given substring, so that the resultant rules can be more specific and stricter to apply. Second, rule candidates extracted from a training corpus are enriched with contextual and lexical information using the decision tree method and verified through cross-validation. Last, we underspecify non-deterministic rules by merging substructures with ambiguity in those rules. The learned grammar is similar to phrase structure grammar with contextual and lexical information, but allows building structures of depth one or more. Thanks to automatic learning, the partial parsing rules can be consistent and domain-independent. Partial parsing with this grammar processes an input sentence deterministically using longest-match heuristics, and recursively applies rules to an input sentence. The experiments showed that the partial parser using automatically extracted rules is not only accurate and efficient but also achieves reasonable coverage for Korean.