Event De-Noising Convolutional Neural Network for Detecting Malicious URL Sequences from Proxy Logs

Toshiki SHIBAHARA  Kohei YAMANISHI  Yuta TAKATA  Daiki CHIBA  Taiga HOKAGUCHI  Mitsuaki AKIYAMA  Takeshi YAGI  Yuichi OHSITA  Masayuki MURATA  

Publication
IEICE TRANSACTIONS on Fundamentals of Electronics, Communications and Computer Sciences   Vol.E101-A   No.12   pp.2149-2161
Publication Date: 2018/12/01
Online ISSN: 1745-1337
DOI: 10.1587/transfun.E101.A.2149
Type of Manuscript: Special Section PAPER (Special Section on Information Theory and Its Applications)
Category: Cryptography and Information Security
Keyword: 
drive-by download attack,  communication log analysis,  deep neural network,  data augmentation,  

Full Text: PDF(3.9MB)
>>Buy this Article


Summary: 
The number of infected hosts on enterprise networks has been increased by drive-by download attacks. In these attacks, users of compromised popular websites are redirected toward websites that exploit vulnerabilities of a browser and its plugins. To prevent damage, detection of infected hosts on the basis of proxy logs rather than blacklist-based filtering has started to be researched. This is because blacklists have become difficult to create due to the short lifetime of malicious domains and concealment of exploit code. To detect accesses to malicious websites from proxy logs, we propose a system for detecting malicious URL sequences on the basis of three key ideas: focusing on sequences of URLs that include artifacts of malicious redirections, designing new features related to software other than browsers, and generating new training data with data augmentation. To find an effective approach for classifying URL sequences, we compared three approaches: an individual-based approach, a convolutional neural network (CNN), and our new event de-noising CNN (EDCNN). Our EDCNN reduces the negative effects of benign URLs redirected from compromised websites included in malicious URL sequences. Evaluation results show that only our EDCNN with proposed features and data augmentation achieved a practical classification performance: a true positive rate of 99.1%, and a false positive rate of 3.4%.