

Unified Likelihood Ratio Estimation for High- to Zero-Frequency N-Grams
Masato KIKUCHI, Kento KAWAKAMI, Kazuho WATANABE, Mitsuo YOSHIDA, Kyoji UMEMURA
Publication
IEICE TRANSACTIONS on Fundamentals of Electronics, Communications and Computer Sciences
Vol.E104-A
No.8
pp.1059-1074
Publication Date: 2021/08/01
Publicized: 2021/02/08
Online ISSN: 1745-1337
DOI: 10.1587/transfun.2020EAP1088
Type of Manuscript: PAPER
Category: Mathematical Systems Science
Keyword: likelihood ratio, the low-frequency problem, the zero-frequency problem, uLSIF
Summary:
Likelihood ratios (LRs), which are commonly used for probabilistic data processing, are often estimated based on the frequency counts of individual elements obtained from samples. In natural language processing, an element can be a continuous sequence of N items, called an N-gram, in which each item is a word, letter, etc. In this paper, we attempt to estimate LRs based on N-gram frequency information. A naive estimation approach that uses only N-gram frequencies is sensitive to low-frequency (rare) N-grams and not applicable to zero-frequency (unobserved) N-grams; these are known as the low- and zero-frequency problems, respectively. To address these problems, we propose a method for decomposing N-grams into item units and then applying their frequencies along with the original N-gram frequencies. Our method can obtain the estimates of unobserved N-grams by using the unit frequencies. Although using only unit frequencies ignores dependencies between items, our method takes advantage of the fact that certain items often co-occur in practice and therefore maintains their dependencies by using the relevant N-gram frequencies. We also introduce a regularization to achieve robust estimation for rare N-grams. Our experimental results demonstrate that our method is effective at solving both problems and can effectively control dependencies.
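The two baselines contrasted in the summary can be illustrated with a minimal sketch. It is not the estimator proposed in the paper: it shows only a naive LR computed from N-gram relative frequencies, and a unit-only LR that multiplies per-item ratios as if the items were independent. The function names, the toy samples, and the choice to return `None` for an undefined ratio are assumptions made for illustration; the paper's combination of unit and N-gram frequencies and its regularization are not reproduced here.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-item sequence in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def naive_lr(ngram, counts_p, counts_q):
    """Ratio of relative frequencies; None when the ratio is undefined.

    The ratio is undefined when the N-gram is unobserved in the
    denominator sample (the zero-frequency problem).
    """
    total_p = sum(counts_p.values())
    total_q = sum(counts_q.values())
    if total_p == 0 or counts_q[ngram] == 0:
        return None
    return (counts_p[ngram] / total_p) / (counts_q[ngram] / total_q)


def unit_lr(ngram, unit_p, unit_q):
    """Product of per-item ratios, assuming the items are independent."""
    ratio = 1.0
    for item in ngram:
        item_ratio = naive_lr((item,), unit_p, unit_q)
        if item_ratio is None:
            return None
        ratio *= item_ratio
    return ratio


# Hypothetical toy samples drawn from two sources.
sample_p = "a b a b a c".split()
sample_q = "a b c a c b".split()

p_bi, q_bi = ngram_counts(sample_p, 2), ngram_counts(sample_q, 2)
p_uni, q_uni = ngram_counts(sample_p, 1), ngram_counts(sample_q, 1)

seen = ("a", "b")    # observed in both samples
unseen = ("b", "b")  # observed in neither sample

print(naive_lr(seen, p_bi, q_bi))
print(naive_lr(unseen, p_bi, q_bi))   # undefined for an unobserved bigram
print(unit_lr(unseen, p_uni, q_uni))  # defined, but ignores item dependencies
```

The unit-only ratio stays defined for the unobserved bigram, which is the property the summary relies on, but it discards any dependency between the two items. The naive ratio for the observed bigram also rests on counts of one or two occurrences, which is the sensitivity the summary calls the low-frequency problem.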


