Robust n-Gram Model of Japanese Character and Its Application to Document Recognition
Hiroki MORI Hirotomo ASO Shozo MAKINO
IEICE TRANSACTIONS on Information and Systems
Publication Date: 1996/05/25
Print ISSN: 0916-8532
Type of Manuscript: Special Section PAPER (Special Issue on Character Recognition and Document Understanding)
character recognition, n-gram model, postprocessing, deleted interpolation method
A new postprocessing method for Japanese documents using an interpolated n-gram model is proposed. Its advantage over conventional approaches is that it enables high-speed, knowledge-free processing. When estimating the parameters of an n-gram model over a large vocabulary, it is difficult to obtain sufficient training samples. To overcome this scarcity of samples, two smoothing methods for a Japanese character trigram model are evaluated, and the superiority of the deleted interpolation method is shown in terms of perplexity. A document recognition system based on the trigram model is constructed, which finds maximum likelihood solutions using the Viterbi algorithm. Experimental results for three kinds of documents show that performance is high when the deleted interpolation method is used for smoothing: 90% of OCR errors are corrected for documents similar to the training text, and 75% of errors are corrected for documents less similar to the training text.
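The pipeline described in the abstract (a smoothed character trigram model, then a Viterbi search over OCR candidates) might be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the class and function names, the padding symbol, the assumed vocabulary size, and the toy Latin-letter data are all hypothetical. In particular, the interpolation weights are fixed by hand here, whereas deleted interpolation would estimate them from held-out portions of the training text.

```python
import math
from collections import Counter

BOS = "^"  # hypothetical start-of-text padding symbol


class InterpolatedTrigram:
    """Character trigram model smoothed by linear interpolation.

    Assumption: the weights are fixed constants. Deleted interpolation
    would instead estimate them from held-out training data.
    """

    def __init__(self, text, weights=(0.1, 0.3, 0.6), vocab_size=3000):
        self.l1, self.l2, self.l3 = weights
        self.vocab_size = vocab_size  # assumed size of the character set
        self.total = len(text)
        self.uni = Counter(text)
        self.ctx1, self.bi = Counter(), Counter()
        self.ctx2, self.tri = Counter(), Counter()
        padded = BOS * 2 + text
        for i in range(2, len(padded)):
            a, b, c = padded[i - 2], padded[i - 1], padded[i]
            self.ctx1[b] += 1       # occurrences of b as a one-character context
            self.bi[b, c] += 1
            self.ctx2[a, b] += 1    # occurrences of (a, b) as a two-character context
            self.tri[a, b, c] += 1

    def prob(self, a, b, c):
        """Interpolated estimate of P(c | a, b); always greater than zero."""
        # Add-one unigram estimate keeps unseen characters above zero.
        p1 = (self.uni[c] + 1) / (self.total + self.vocab_size)
        p2 = self.bi[b, c] / self.ctx1[b] if self.ctx1[b] else 0.0
        p3 = self.tri[a, b, c] / self.ctx2[a, b] if self.ctx2[a, b] else 0.0
        return self.l1 * p1 + self.l2 * p2 + self.l3 * p3


def viterbi_correct(model, lattice):
    """Pick the most likely string from per-position OCR candidates.

    lattice: list of positions, each a list of (char, ocr_prob) pairs
    with ocr_prob > 0. The search state is the last two characters.
    """
    best = {(BOS, BOS): (0.0, "")}  # state -> (log score, decoded string)
    for candidates in lattice:
        nxt = {}
        for (a, b), (score, path) in best.items():
            for c, ocr_p in candidates:
                s = score + math.log(model.prob(a, b, c)) + math.log(ocr_p)
                if (b, c) not in nxt or s > nxt[b, c][0]:
                    nxt[b, c] = (s, path + c)
        best = nxt
    return max(best.values())[1]


# Toy usage: the OCR engine prefers "x" at the second position,
# but the trigram model trained on "abcabc..." favours "b".
model = InterpolatedTrigram("abcabcabcabc")
lattice = [[("a", 0.9)], [("x", 0.6), ("b", 0.4)], [("c", 0.9)]]
print(viterbi_correct(model, lattice))  # abc
```

Keeping only the best path per two-character state is what makes the search linear in document length. The toy example uses Latin letters purely for readability; the same code applies to any character set, with `vocab_size` adjusted accordingly.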