Robust n-Gram Model of Japanese Character and Its Application to Document Recognition

Hiroki MORI  Hirotomo ASO  Shozo MAKINO  

IEICE TRANSACTIONS on Information and Systems   Vol.E79-D   No.5   pp.471-476
Publication Date: 1996/05/25
Print ISSN: 0916-8532
Type of Manuscript: Special Section PAPER (Special Issue on Character Recognition and Document Understanding)
Category: Postprocessing
Keyword: character recognition, n-gram model, postprocessing, deleted interpolation method


A new postprocessing method for Japanese documents using an interpolated n-gram model is proposed. The method has advantages over conventional approaches in that it enables high-speed, knowledge-free processing. When estimating the parameters of an n-gram model over a large vocabulary, it is difficult to obtain sufficient training samples. To overcome this scarcity of samples, two smoothing methods for a Japanese character trigram model are evaluated, and the superiority of the deleted interpolation method is shown in terms of perplexity. A document recognition system based on the trigram model is constructed, which finds maximum likelihood solutions using the Viterbi algorithm. Experimental results for three kinds of documents show that performance is high when the deleted interpolation method is used for smoothing: 90% of OCR errors are corrected for documents similar to the training text data, and 75% of errors are corrected for documents not so similar to the training text data.
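
The ideas named in the abstract (an interpolated character trigram model, perplexity, and Viterbi search over OCR candidates) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The class InterpolatedTrigram, the function viterbi_decode, the padding symbol, and the lattice format (a list of candidate characters with OCR scores per position) are hypothetical names and choices made for illustration. In particular, the interpolation weights are fixed by hand here, whereas deleted interpolation would estimate them from held-out portions of the training data.

```python
import math
from collections import Counter

BOS = "^"  # hypothetical padding symbol marking the start of the text


class InterpolatedTrigram:
    """Character trigram model smoothed by linear interpolation."""

    def __init__(self, text, lambdas=(0.6, 0.3, 0.09, 0.01)):
        # Weights for the trigram, bigram, unigram and uniform estimates.
        # Fixed by hand (an assumption); deleted interpolation would
        # estimate them on held-out data instead.
        self.lambdas = lambdas
        self.vocab_size = len(set(text))
        self.total = len(text)
        self.tri, self.bi, self.uni = Counter(), Counter(), Counter()
        self.ctx2, self.ctx1 = Counter(), Counter()
        padded = BOS * 2 + text
        for i in range(2, len(padded)):
            a, b, c = padded[i - 2], padded[i - 1], padded[i]
            self.tri[a, b, c] += 1
            self.ctx2[a, b] += 1  # how often (a, b) occurs as a context
            self.bi[b, c] += 1
            self.ctx1[b] += 1     # how often b occurs as a context
            self.uni[c] += 1

    def prob(self, c, a, b):
        """Interpolated estimate of P(c | a, b); always greater than zero."""
        l3, l2, l1, l0 = self.lambdas
        # An unseen context contributes zero, so the mixture is slightly
        # deficient there; acceptable for a sketch.
        p3 = self.tri[a, b, c] / self.ctx2[a, b] if self.ctx2[a, b] else 0.0
        p2 = self.bi[b, c] / self.ctx1[b] if self.ctx1[b] else 0.0
        p1 = self.uni[c] / self.total
        p0 = 1.0 / self.vocab_size
        return l3 * p3 + l2 * p2 + l1 * p1 + l0 * p0

    def perplexity(self, text):
        """Per-character perplexity of text under the model."""
        padded = BOS * 2 + text
        log_sum = sum(
            math.log(self.prob(padded[i], padded[i - 2], padded[i - 1]))
            for i in range(2, len(padded))
        )
        return math.exp(-log_sum / len(text))


def viterbi_decode(model, lattice):
    """Pick the most likely string from per-position OCR candidates.

    lattice is a list over character positions; each entry is a list of
    (char, ocr_score) pairs with ocr_score > 0 (hypothetical format).
    The search state is the pair of the two most recent characters.
    """
    best = {(BOS, BOS): (0.0, "")}  # state -> (log score, best path)
    for candidates in lattice:
        nxt = {}
        for (a, b), (score, path) in best.items():
            for c, ocr_score in candidates:
                s = score + math.log(model.prob(c, a, b)) + math.log(ocr_score)
                if (b, c) not in nxt or s > nxt[b, c][0]:
                    nxt[b, c] = (s, path + c)
        best = nxt
    return max(best.values(), key=lambda v: v[0])[1]


# Toy usage: the OCR slightly prefers "o" over "c" at the fifth position.
model = InterpolatedTrigram("the cat sat on the mat. " * 20)
lattice = [[("t", 1.0)], [("h", 1.0)], [("e", 1.0)], [(" ", 1.0)],
           [("o", 0.5), ("c", 0.4)], [("a", 1.0)], [("t", 1.0)]]
print(viterbi_decode(model, lattice))
```

In the toy usage, the trigram evidence from the training text outweighs the small OCR preference for the wrong character, which is the kind of correction the postprocessing stage is meant to perform. A real Japanese system would work over a vocabulary of several thousand characters, where the choice of smoothing matters far more than in this small example.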