A Sampling Method Using Search API and Wikipedia for Social Media Analysis

Shohei OHSAWA  Yutaka MATSUO  

Publication
D - Abstracts of IEICE TRANSACTIONS on Information and Systems (Japanese Edition)   Vol.J100-D   No.10   pp.870-881
Publication Date: 2017/10/01
Online ISSN: 1881-0225
Type of Manuscript: PAPER
Keyword: dictionary-based sampling, Facebook, Wikipedia, estimated Jaccard coefficient

Full Text(in Japanese): PDF(679.9KB)


Summary: 
In social media analysis, researchers often sample from the APIs (application programming interfaces) provided by social media services such as Facebook and Twitter to collect attribute information about the entities under analysis. Few studies report methods for sampling through a search API, so it is not obvious how to sample from such an API efficiently. This paper presents a method that improves sampling efficiency by using the Wikipedia ontology. The method generates multiple dictionaries from a given ontology and adaptively switches the dictionary in use according to the target topic. We also propose the estimated Jaccard coefficient as an evaluation criterion for a dictionary. In our experiment, the method sampled 18 million entities, 25.8% of all entities on Facebook, and the variant using the estimated Jaccard coefficient outperformed existing methods.
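
The idea of scoring dictionaries against a target topic can be illustrated with a small sketch. The abstract does not define the estimated Jaccard coefficient, so the code below uses the plain Jaccard coefficient on a small sample as a stand-in; the function names, the dictionary contents, and the topic sample are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Plain Jaccard coefficient |a & b| / |a | b| (0.0 when both sets are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def pick_dictionary(dictionaries: Dict[str, Set[str]], topic_sample: Set[str]) -> str:
    """Return the name of the dictionary that overlaps most with the topic sample."""
    return max(dictionaries, key=lambda name: jaccard(dictionaries[name], topic_sample))


# Hypothetical toy dictionaries, standing in for term lists derived from an ontology.
dictionaries = {
    "musicians": {"the beatles", "queen", "nirvana", "radiohead"},
    "cities": {"tokyo", "paris", "london", "queen creek"},
}

# Hypothetical small sample of entity names already known to belong to the target topic.
topic_sample = {"queen", "nirvana", "oasis"}

best = pick_dictionary(dictionaries, topic_sample)
```

Here the selected dictionary would then supply the query terms sent to the search API. In practice the overlap would have to be estimated from a limited number of API responses, which is presumably where the paper's estimated variant differs from this simplification.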