Capacity Control of Social Media Diffusion for Real-Time Analysis System

Miki ENOKI  Issei YOSHIDA  Masato OGUCHI  

Publication
IEICE TRANSACTIONS on Information and Systems   Vol.E100-D   No.4   pp.776-784
Publication Date: 2017/04/01
Online ISSN: 1745-1361
Type of Manuscript: Special Section PAPER (Special Section on Data Engineering and Information Management)
Keyword: information diffusion, social media, in-memory database, microblogging, stream processing

Summary: 
In Twitter-like services, countless messages are posted every second from all around the world. Timely knowledge of what kinds of information are diffusing through social media is important. For example, in emergencies such as earthquakes, users report their situation instantly through social media. The collective intelligence of social media is therefore useful as a means of detecting information that complements conventional observation. We have developed a system that monitors and analyzes information diffusion data in real time by tracking retweeted tweets. A tweet retweeted by many users indicates that they find its content interesting and impactful. Analysts who use this system can find tweets retweeted by many users and identify key people: those who are frequently retweeted by many users or who have retweeted tweets about particular topics. However, bursts occur in which thousands of social media messages are suddenly posted at once, and when machine resources are insufficient to handle such bursts, the system's query performance degrades. Since our system is designed to be used interactively in real time by many analysts, waiting more than one second for query results is not acceptable. To maintain acceptable query performance, we propose a capacity control method that filters incoming tweets using extra attribute information from the tweets themselves. Conventionally, there is a trade-off between query performance and the accuracy of the analysis results. We show that our proposed method improves query performance and that it maintains query accuracy better than existing methods.
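
The summary does not say which tweet attribute the capacity control method uses or how the filter is applied, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a hypothetical `retweet_count` attribute carried on each incoming retweet and a hypothetical per-window `capacity`; when a burst exceeds the capacity, retweets of the most widely retweeted tweets are kept first.

```python
from dataclasses import dataclass


@dataclass
class Retweet:
    tweet_id: str       # ID of the original tweet being retweeted
    user: str           # user who performed the retweet
    retweet_count: int  # ASSUMPTION: attribute carried on the message itself


def capacity_filter(batch, capacity):
    """Keep at most `capacity` retweets from one time window.

    If the window fits within capacity, nothing is dropped. During a
    burst, retweets whose original tweet has the highest retweet_count
    are kept, and the survivors are returned in arrival order.
    """
    if len(batch) <= capacity:
        return list(batch)
    # sorted() is stable, so ties are resolved in favor of earlier arrivals.
    ranked = sorted(batch, key=lambda r: r.retweet_count, reverse=True)
    kept_ids = {id(r) for r in ranked[:capacity]}
    return [r for r in batch if id(r) in kept_ids]


# Hypothetical window of four retweets with room for only two.
window = [
    Retweet("a", "u1", 5),
    Retweet("b", "u2", 1200),
    Retweet("c", "u3", 3),
    Retweet("a", "u4", 40),
]
kept = capacity_filter(window, capacity=2)
```

Ranking by an attribute already present on the message means the filter needs no lookup against stored data, which matters when the goal is to protect query latency during a burst. Whether this particular ranking preserves analysis accuracy is exactly the trade-off the paper evaluates, and the sketch makes no claim about it.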