Application of a Word-Based Text Compression Method to Japanese and Chinese Texts

Shigeru YOSHIDA, Takashi MORIHARA, Hironori YAHAGI, Noriko ITANI

IEICE TRANSACTIONS on Fundamentals of Electronics, Communications and Computer Sciences, Vol.E85-A, No.12, pp.2933-2938
Publication Date: 2002/12/01
Print ISSN: 0916-8508
Type of Manuscript: PAPER
Category: Information Theory
Keywords: lossless, text compression, language, word-based


Conventional text compression schemes that sample 8 bits at a time cannot compress 16-bit Asian language codes well. Previously, we reported the application of a word-based text compression method that uses 16-bit sampling to the compression of Japanese texts. This paper describes our further efforts in applying a word-based method with a static canonical Huffman encoder to both Japanese and Chinese texts. The method is designed to support a multilingual environment: the word dictionary and the canonical Huffman code table are replaced with those appropriate to each language. A computer simulation showed that this method is effective for both languages. The compression ratio obtained was slightly below 0.5 when the Markov context was not taken into account, and around 0.4 when the first-order Markov context was taken into account.
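
As an illustration of the static canonical Huffman step mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes the text has already been segmented into words by some dictionary-based tokenizer (not shown), and that word frequencies are known in advance so that the code table is static. The function names, the sample words, and the frequencies are all hypothetical.

```python
import heapq
from itertools import count


def huffman_code_lengths(freqs):
    """Return {word: code length in bits} for a Huffman code over `freqs`."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    tie = count()  # tie-breaker so the heap never compares the word lists
    heap = [(f, next(tie), [w]) for w, f in freqs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        f1, _, group1 = heapq.heappop(heap)
        f2, _, group2 = heapq.heappop(heap)
        # Every word under the merged node gets one bit longer.
        for w in group1 + group2:
            lengths[w] += 1
        heapq.heappush(heap, (f1 + f2, next(tie), group1 + group2))
    return lengths


def canonical_codes(lengths):
    """Assign canonical codes: sort by (length, word), then count upward."""
    codes = {}
    code = 0
    prev_len = 0
    for word, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= length - prev_len  # append zeros when the length grows
        codes[word] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes


def encode(words, codes):
    """Concatenate the code of each word into one bit string."""
    return "".join(codes[w] for w in words)


# Hypothetical word frequencies for a tiny Japanese vocabulary.
sample_freqs = {"の": 5, "は": 2, "圧縮": 1, "符号": 1}
sample_codes = canonical_codes(huffman_code_lengths(sample_freqs))
sample_bits = encode(["圧縮", "の", "符号", "は"], sample_codes)
```

Because canonical codes are fully determined by the code lengths, a decoder only needs the per-word lengths to rebuild the table. This is one reason a canonical table is convenient to swap per language alongside the word dictionary. A compression ratio in the sense of the abstract could then be estimated by dividing `len(sample_bits)` by the size of the original text in bits, here assuming 16 bits per character.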