Modification of LZSS by Using Structures of Hangul Characters for Hangul Text Compression
Jae Young LEE, Keong Mo SUNG
IEICE TRANSACTIONS on Fundamentals of Electronics, Communications and Computer Sciences
Publication Date: 1996/11/25
Print ISSN: 0916-8508
Type of Manuscript: PAPER
Category: Information Theory and Coding Theory
Keyword: information theory, coding theory, text compression, Hangul processing
This paper proposes a modified LZSS suited to compressing Hangul data. It introduces a Hangul character token and a shorter string token, both based on two properties of Hangul. 1) A Hangul character consists of three letters: the first sound letter, the middle sound letter, and the last sound letter, called Cho-seong, Jung-seong, and Jong-seong, respectively. 2) A Hangul character code occupies 2 bytes. The first property is used to build a character token for Hangul characters, which account for most of the unmatched characters. That is, each unmatched Hangul character is replaced with one Hangul character token, consisting of the Huffman codes of its Cho-seong, Jung-seong, and Jong-seong in that order, instead of two character tokens. The second property is used to shorten the string token, which encodes a matched string. Since more than 75% of Hangul data consists of Hangul characters and each Hangul code is 2 bytes long, the addresses in the LZSS window can be assigned in 2-byte units. As a result, the distance field and the length field of the string token can each be shortened by one bit. Compressing Hangul data with these tokens improved the compression ratio by about 3%.
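The two properties in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes Unicode precomposed Hangul syllables (U+AC00 to U+D7A3), whereas the paper most likely worked with a 2-byte Korean code of its time. The function names `decompose` and `field_bits` and the 4096-byte window size are hypothetical, chosen only for illustration. The first function splits a syllable into its Cho-seong, Jung-seong, and Jong-seong indices, which are the three symbols that would each receive a Huffman code. The second shows why addressing a window in 2-byte units saves one bit in a field.

```python
# Illustrative sketch only; assumes Unicode precomposed Hangul syllables,
# which is not necessarily the 2-byte code used in the paper.

HANGUL_BASE = 0xAC00      # first precomposed syllable
NUM_CHO = 19              # number of Cho-seong (first sound letters)
NUM_JUNG = 21             # number of Jung-seong (middle sound letters)
NUM_JONG = 28             # number of Jong-seong, including "none"
NUM_SYLLABLES = NUM_CHO * NUM_JUNG * NUM_JONG  # 11172


def decompose(ch):
    """Return (cho, jung, jong) indices of one precomposed Hangul syllable."""
    index = ord(ch) - HANGUL_BASE
    if not 0 <= index < NUM_SYLLABLES:
        raise ValueError("not a precomposed Hangul syllable: %r" % ch)
    cho, rest = divmod(index, NUM_JUNG * NUM_JONG)
    jung, jong = divmod(rest, NUM_JONG)
    return cho, jung, jong


def field_bits(num_values):
    """Number of bits needed to address num_values distinct positions."""
    return max(1, (num_values - 1).bit_length())


if __name__ == "__main__":
    # U+D55C is the syllable "han": Cho-seong 18, Jung-seong 0, Jong-seong 4.
    print(decompose(chr(0xD55C)))

    window_bytes = 4096  # hypothetical window size, not taken from the paper
    print(field_bits(window_bytes))       # byte-unit addressing
    print(field_bits(window_bytes // 2))  # 2-byte-unit addressing
```

Halving the number of addressable positions removes one bit from the field, and the same argument applies to the length field, which matches the one-bit saving per field described in the abstract.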