O-means: An Optimized Clustering Method for Analyzing Spam Based Attacks

Jungsuk SONG  Daisuke INOUE  Masashi ETO  Hyung Chan KIM  Koji NAKAO  

IEICE TRANSACTIONS on Fundamentals of Electronics, Communications and Computer Sciences   Vol.E94-A   No.1   pp.245-254
Publication Date: 2011/01/01
Online ISSN: 1745-1337
DOI: 10.1587/transfun.E94.A.245
Print ISSN: 0916-8508
Type of Manuscript: Special Section PAPER (Special Section on Cryptography and Information Security)
Category: Network Security
spam,  clustering,  feature,  K-means clustering method,  

Full Text: PDF(1023.4KB)>>
Buy this Article

In recent years, the number of spam emails has been dramatically increasing and spam is recognized as a serious internet threat. Most recent spam emails are being sent by bots which often operate with others in the form of a botnet, and skillful spammers try to conceal their activities from spam analyzers and spam detection technology. In addition, most spam messages contain URLs that lure spam receivers to malicious Web servers for the purpose of carrying out various cyber attacks such as malware infection, phishing attacks, etc. In order to cope with spam based attacks, there have been many efforts made towards the clustering of spam emails based on similarities between them. The spam clusters obtained from the clustering of spam emails can be used to identify the infrastructure of spam sending systems and malicious Web servers, and how they are grouped and correlate with each other, and to minimize the time needed for analyzing Web pages. Therefore, it is very important to improve the accuracy of the spam clustering as much as possible so as to analyze spam based attacks more accurately. In this paper, we present an optimized spam clustering method, called O-means, based on the K-means clustering method, which is one of the most widely used clustering methods. By examining three weeks of spam gathered in our SMTP server, we observed that the accuracy of the O-means clustering method is about 87% which is superior to the previous clustering methods. In addition, we define 12 statistical features to compare similarity between spam emails, and we determined a set of optimized features which makes the O-means clustering method more effective.