Monotone Increasing Binary Similarity and Its Application to Automatic Document-Acquisition of a Category

Izumi SUZUKI  Yoshiki MIKAMI  Ario OHSATO  

IEICE TRANSACTIONS on Information and Systems   Vol.E91-D   No.11   pp.2545-2551
Publication Date: 2008/11/01
Online ISSN: 1745-1361
DOI: 10.1093/ietisy/e91-d.11.2545
Print ISSN: 0916-8532
Type of Manuscript: Special Section PAPER (Special Section on Knowledge, Information and Creativity Support System)
Category: Knowledge Acquisition
Keywords: Web mining, text categorization, cosine similarity, document indexing, language identification

A technique is introduced that acquires documents belonging to the same category as a given short text. Treating the given text as a training document, the system marks the most similar document, or all sufficiently similar documents, in the document domain (or the entire Web). The system then adds the marked documents to the training set and relearns it, and this process is repeated until no more documents are marked. Imposing a monotone increasing property on the similarity as the system learns enables it to 1) detect the point at which no documents remain to be marked and 2) decide the threshold value that the classifier uses. In addition, under the condition that normalization is limited to dividing term weights by a p-norm of the weights, the linear classifier in which training documents are indexed in a binary manner is the only instance that satisfies the monotone increasing property. The feasibility of the proposed technique was confirmed through an examination of binary similarity using English and German documents randomly selected from the Web.
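
The acquisition loop described in the abstract can be illustrated with a minimal Python sketch. It is one possible reading of the abstract, not the authors' implementation. The following are assumptions made for illustration: whitespace tokenization, raw term frequencies as document weights, a fixed caller-supplied threshold (the abstract says the monotone property helps decide the threshold but does not say how), and the hypothetical names `tokenize`, `binary_similarity`, and `acquire`. The training set is indexed in a binary manner as the set of terms seen so far, and each candidate document is normalized by the p-norm of its own weights.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for real indexing.
    return text.lower().split()


def binary_similarity(doc_tokens, vocab, p=2):
    """Score a document against a binary-indexed training set.

    The training set is represented only by `vocab`, the set of terms it
    contains (weight 1 if present, 0 otherwise). The document's term
    frequencies are divided by their own p-norm.
    """
    tf = Counter(doc_tokens)
    norm = sum(count ** p for count in tf.values()) ** (1.0 / p)
    if norm == 0:
        return 0.0
    return sum(count for term, count in tf.items() if term in vocab) / norm


def acquire(seed_text, domain, threshold, p=2):
    """Repeatedly mark sufficiently similar documents and add them to training.

    Returns the indices of acquired documents in the order they were marked.
    Assumption: `threshold` is fixed and supplied by the caller.
    """
    vocab = set(tokenize(seed_text))
    remaining = {i: tokenize(doc) for i, doc in enumerate(domain)}
    acquired = []
    while True:
        marked = [i for i, tokens in remaining.items()
                  if binary_similarity(tokens, vocab, p) >= threshold]
        if not marked:
            break  # no more documents are marked, so the process ends
        for i in marked:
            vocab.update(remaining.pop(i))  # learn the marked document
            acquired.append(i)
    return acquired


seed = "web mining text"
domain = [
    "text mining of web pages",
    "der hund rennt schnell",
    "pages of text categorization",
]
print(acquire(seed, domain, threshold=1.0))  # prints [0, 2]
```

In this sketch the vocabulary only grows and the weights are nonnegative, so the score of any fixed document never decreases as training documents are added. That is the monotone increasing behavior the abstract refers to. In the example, the third document scores 0.5 against the seed alone and is marked only after the first document has been learned, while the German document never shares a term and is left unmarked.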