
For FullText PDF, please login, if you are a member of IEICE,
or go to Pay Per View on menu list, if you are a nonmember of IEICE.

Approximate Frequent Pattern Discovery in Compressed Space
Shouhei FUKUNAGA Yoshimasa TAKABATAKE Tomohiro I Hiroshi SAKAMOTO
Publication
IEICE TRANSACTIONS on Information and Systems
Vol.E101D
No.3
pp.593601 Publication Date: 2018/03/01
Online ISSN: 17451361 Type of Manuscript: Special Section PAPER (Special Section on Foundations of Computer Science — Frontiers of Theoretical Computer Science —) Category: Keyword: grammar compression, online algorithm, approximate frequent pattern discovery,
Full Text: PDF(775KB) >>Buy this Article
Summary:
A grammar compression is a restricted contextfree grammar (CFG) that derives a single string deterministically. The goal of a grammar compression algorithm is to develop a smaller CFG by finding and removing duplicate patterns, which is simply a frequent pattern discovery process. Any frequent pattern can be obtained in linear time; however, a huge working space is required for longer patterns, and the entire string must be preloaded into memory. We propose an online algorithm to address this problem approximately within compressed space. For an input sequence of symbols, a_{1},a_{2},..., let G_{i} be a grammar compression for the string a_{1}a_{2}…a_{i}. In this study, an online algorithm is considered one that can compute G_{i+1} from (G_{i},a_{i+1}) without explicitly decompressing G_{i}. Here, let G be a grammar compression for string S. We say that variable X approximates a substring P of S within approximation ratio δ iff for any interval [i,j] with P=S[i,j], the parse tree of G has a node labeled with X that derives S[l,r] for a subinterval [l,r] of [i,j] satisfying [l,r]≥δ[i,j]. Then, G solves the frequent pattern discovery problem approximately within δ iff for any frequent pattern P of S, there exists a variable that approximates P within δ. Here, δ is called the approximation ratio of G for S. Previously, the best approximation ratio obtained by a polynomial time algorithm was Ω(1/lg^{2}P). The main contribution of this work is to present a new lower bound Ω(1/<^{*}SlgP) that is smaller than the previous bound when lg^{*}S<lgP. Experimental results demonstrate that the proposed algorithm extracts sufficiently long frequent patterns and significantly reduces memory consumption compared to the offline algorithm in the previous work.

