Modification of LZSS by Using Structures of Hangul Characters for Hangul Text Compression

Jae Young LEE  Keong Mo SUNG  

IEICE TRANSACTIONS on Fundamentals of Electronics, Communications and Computer Sciences   Vol.E79-A   No.11   pp.1904-1910
Publication Date: 1996/11/25
Online ISSN: 
Print ISSN: 0916-8508
Type of Manuscript: PAPER
Category: Information Theory and Coding Theory
information theory,  coding theory,  text compression,  hangul processing,  

Full Text: PDF(653.4KB)>>
Buy this Article

This paper suggests modified LZSS which is suitable for compressing Hangul data by Hangul character token and the string token with small size based on Hangul properties. The Hangul properties can be described in 2 ways. 1) The structure of a Hangul character consists of 3 letters: The first sound letter, the middle sound letter, and the last sound letter which are called Cho-seong, Jung-seong, and Jong-seong, respectively. 2) The code of Hangul is represented by 2 bytes. The first property is used for making the character token processing Hangul characters which occupies most of the unmatched characters. That is, the unmatched Hangul characters are replaced with one Hangul character token represented by Huffman codes of Cho-seong, Jung-seong, and Jong-seong in regular sequence, instead of 2 character tokens. The second property is used to shorten the size of the string token processing matched string. In other words, since more than 75% of Hangul data are Hangul and Hangul codes are constructed in 2 bytes, the addresses of the window of LZSS can be assigned in 2-byte unit. As a result, the distance field and the length field of the string token can be lessened by one bit each. After compressing Hangul data through these tokens, about 3% of improvement could be made in compression ratio.