For Full-Text PDF, please login, if you are a member of IEICE,|
or go to Pay Per View on menu list, if you are a nonmember of IEICE.
Event De-Noising Convolutional Neural Network for Detecting Malicious URL Sequences from Proxy Logs
Toshiki SHIBAHARA Kohei YAMANISHI Yuta TAKATA Daiki CHIBA Taiga HOKAGUCHI Mitsuaki AKIYAMA Takeshi YAGI Yuichi OHSITA Masayuki MURATA
IEICE TRANSACTIONS on Fundamentals of Electronics, Communications and Computer Sciences
Publication Date: 2018/12/01
Online ISSN: 1745-1337
Type of Manuscript: Special Section PAPER (Special Section on Information Theory and Its Applications)
Category: Cryptography and Information Security
drive-by download attack, communication log analysis, deep neural network, data augmentation,
Full Text: PDF(3.9MB)>>
The number of infected hosts on enterprise networks has been increased by drive-by download attacks. In these attacks, users of compromised popular websites are redirected toward websites that exploit vulnerabilities of a browser and its plugins. To prevent damage, detection of infected hosts on the basis of proxy logs rather than blacklist-based filtering has started to be researched. This is because blacklists have become difficult to create due to the short lifetime of malicious domains and concealment of exploit code. To detect accesses to malicious websites from proxy logs, we propose a system for detecting malicious URL sequences on the basis of three key ideas: focusing on sequences of URLs that include artifacts of malicious redirections, designing new features related to software other than browsers, and generating new training data with data augmentation. To find an effective approach for classifying URL sequences, we compared three approaches: an individual-based approach, a convolutional neural network (CNN), and our new event de-noising CNN (EDCNN). Our EDCNN reduces the negative effects of benign URLs redirected from compromised websites included in malicious URL sequences. Evaluation results show that only our EDCNN with proposed features and data augmentation achieved a practical classification performance: a true positive rate of 99.1%, and a false positive rate of 3.4%.