Finding Important People in a Video Using Deep Neural Networks with Conditional Random Fields

Mayu OTANI  Atsushi NISHIDA  Yuta NAKASHIMA  Tomokazu SATO  Naokazu YOKOYA  

IEICE TRANSACTIONS on Information and Systems   Vol.E101-D   No.10   pp.2509-2517
Publication Date: 2018/10/01
Online ISSN: 1745-1361
DOI: 10.1587/transinf.2018EDP7029
Type of Manuscript: PAPER
Category: Image Recognition, Computer Vision
Keywords: neural network, conditional random field, important people classification

Finding important regions is essential for applications such as content-aware video compression and video retargeting, which automatically crops a region of a video for display on small screens. Since people are one of the main subjects when taking a video, some methods for finding important regions use a visual attention model based on face/pedestrian detection to incorporate the knowledge that people are important. However, such methods usually do not distinguish important people from passers-by and bystanders, which results in false positives. In this paper, we propose a deep neural network (DNN)-based method that classifies each person as important or unimportant, given a video captured with a hand-held camera in which a single frame contains multiple people. Intuitively, the important/unimportant labels of two people are highly correlated when their spatial motions are similar. Based on this assumption, we propose to boost the performance of our important/unimportant classification by using conditional random fields (CRFs) built upon the DNN, which can be trained in an end-to-end manner. Our experimental results show that our method successfully classifies important people and that combining the DNN with CRFs improves the accuracy.
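The idea that people with similar motions tend to share a label can be illustrated with a small sketch. The abstract does not specify the CRF's potentials, its inference procedure, or the DNN's outputs, so everything below is an assumption made for illustration: a Gaussian affinity on 2-D motion vectors, a fully connected binary CRF, and a generic mean-field update applied to per-person scores that stand in for DNN outputs. The names `motion_affinity`, `mean_field_refine`, `weight`, and `sigma` are hypothetical, and the sketch performs inference only; it does not show the end-to-end training described above.

```python
import math


def motion_affinity(m_a, m_b, sigma=1.0):
    """Gaussian similarity between two 2-D motion vectors (assumed form)."""
    d2 = (m_a[0] - m_b[0]) ** 2 + (m_a[1] - m_b[1]) ** 2
    return math.exp(-d2 / (2.0 * sigma ** 2))


def mean_field_refine(unary, motions, weight=1.0, sigma=1.0, iters=10):
    """Refine per-person P(important) with a fully connected binary CRF.

    unary   : list of P(important), each strictly between 0 and 1
              (stand-ins for classifier outputs; hypothetical values)
    motions : list of (dx, dy) motion vectors, one per person
    Returns the refined P(important) for each person.
    """
    n = len(unary)
    # Pairwise affinities; a person sends no message to themselves.
    aff = [[motion_affinity(motions[i], motions[j], sigma) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    q = list(unary)
    for _ in range(iters):
        new_q = []
        for i in range(n):
            # Support for each label from people who move similarly.
            msg_pos = sum(aff[i][j] * q[j] for j in range(n))
            msg_neg = sum(aff[i][j] * (1.0 - q[j]) for j in range(n))
            # Unary log-odds plus the weighted pairwise agreement term.
            logit = (math.log(unary[i]) - math.log(1.0 - unary[i])
                     + weight * (msg_pos - msg_neg))
            new_q.append(1.0 / (1.0 + math.exp(-logit)))
        q = new_q  # parallel update of all marginals
    return q


# Person 1 has an ambiguous score but moves almost exactly like person 0;
# person 2 moves very differently from both.
unary = [0.9, 0.45, 0.1]
motions = [(1.0, 0.0), (1.1, 0.0), (-3.0, 2.0)]
refined = mean_field_refine(unary, motions)
```

In this toy input, the ambiguous person is pulled toward the label of the confidently important person whose motion it shares, while the person with dissimilar motion is left essentially unchanged. How strongly labels propagate depends on the assumed `weight` and `sigma`.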