Attentive Sequences Recurrent Network for Social Relation Recognition from Video

Jinna LV  Bin WU  Yunlei ZHANG  Yunpeng XIAO  

IEICE TRANSACTIONS on Information and Systems   Vol.E102-D   No.12   pp.2568-2576
Publication Date: 2019/12/01
Publicized: 2019/09/02
Online ISSN: 1745-1361
DOI: 10.1587/transinf.2019EDP7104
Type of Manuscript: PAPER
Category: Image Recognition, Computer Vision
Keywords: social relation recognition, video analysis, deep learning, LSTM, attention mechanism

Recently, social relation analysis has received increasing attention, with studies extending from text to image data. However, social relation analysis from video, although an important problem, is largely missing from the current literature. Two challenges remain: 1) it is hard to learn a satisfactory mapping function from low-level pixels to the high-level social relation space; 2) it is hard to efficiently select the most relevant information from noisy and unsegmented video. In this paper, we present an Attentive Sequences Recurrent Network model, called ASRN, to address these challenges. First, to exploit multiple cues, we design a Multiple Feature Attention (MFA) mechanism that fuses multiple visual features (i.e., image, motion, body, and face). In this way, we can generate an appropriate mapping function from low-level video pixels to the high-level social relation space. Second, we design a sequence recurrent network based on a Global and Local Attention (GLA) mechanism. Specifically, GLA uses an attention mechanism to integrate the global feature with local sequence features and thereby select the sequences most relevant to the recognition task. As a result, the GLA module can better deal with noisy and unsegmented video. Finally, extensive experiments on the SRIV dataset demonstrate the effectiveness of our ASRN model.
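
The abstract gives no equations, so the following is only a minimal sketch of the two attention steps it describes, not the authors' implementation. It assumes scaled dot-product scoring, random stand-ins for learned parameters, and a mean over segments as the global feature; the LSTM that would produce the local sequence features is omitted. All names (`attention_pool`, `feature_query`, `global_feat`) are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(items, query):
    """Weight the rows of `items` (n, d) by similarity to `query` (d,).

    Returns the weighted sum (d,) and the attention weights (n,).
    Scaled dot-product scoring is an assumption, not taken from the paper.
    """
    scores = items @ query / np.sqrt(items.shape[1])
    weights = softmax(scores)
    return weights @ items, weights

rng = np.random.default_rng(0)
num_segments, num_features, dim = 5, 4, 8

# Hypothetical per-segment features: image, motion, body, face.
features = rng.normal(size=(num_segments, num_features, dim))

# Stand-in for a learned query that scores the four feature types.
feature_query = rng.normal(size=dim)

# Step 1 (MFA-like): fuse the four visual features of each segment.
fused = np.stack(
    [attention_pool(features[t], feature_query)[0] for t in range(num_segments)]
)

# Step 2 (GLA-like): use a global feature to weight the local segments.
# The mean over segments is an assumed stand-in for the global feature.
global_feat = fused.mean(axis=0)
video_repr, segment_weights = attention_pool(fused, global_feat)

print(video_repr.shape, segment_weights.round(3))
```

In this sketch, segments whose fused feature aligns poorly with the global feature receive small weights, which is one way an attention step can down-weight noisy parts of an unsegmented video.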