Self-Supervised Learning of Video Representation for Anticipating Actions in Early Stage

Yinan LIU  Qingbo WU  Liangzhi TANG  Linfeng XU  

IEICE TRANSACTIONS on Information and Systems   Vol.E101-D   No.5   pp.1449-1452
Publication Date: 2018/05/01
Publicized: 2018/02/21
Online ISSN: 1745-1361
DOI: 10.1587/transinf.2018EDL8013
Type of Manuscript: LETTER
Category: Pattern Recognition
action anticipation, video frame encoding, convolutional neural network

In this paper, we propose a novel self-supervised method for learning video representations that can anticipate the category of a video from only a short clip. The key idea is to employ a Siamese convolutional network and to model self-supervised feature learning as two different image matching problems. By using frame encoding, the proposed video representation can be extracted at different temporal scales. We refine the training process with a motion-based temporal segmentation strategy. The learned video representations can be applied not only to action anticipation but also to action recognition. We verify the effectiveness of the proposed approach on both tasks using two datasets, UCF101 and HMDB51. The experiments show that the approach achieves results comparable to those of state-of-the-art self-supervised learning methods on both tasks.
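
To illustrate the general idea of a Siamese network applied to image matching, the following is a minimal sketch and not the authors' implementation. It assumes a single shared linear layer with ReLU as a stand-in for the convolutional branch, and it uses a contrastive loss, which the abstract does not specify. All names (`embed`, `contrastive_loss`, `margin`) and the toy frame vectors are hypothetical.

```python
import math
import random


def embed(frame, weights):
    """Shared branch: one linear layer plus ReLU (a stand-in for a CNN)."""
    return [max(0.0, sum(w * x for w, x in zip(row, frame))) for row in weights]


def distance(a, b):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def contrastive_loss(frame_a, frame_b, is_match, weights, margin=1.0):
    """Assumed loss: pull matching pairs together, push others past a margin."""
    # Both inputs pass through the SAME weights; this sharing is what makes
    # the network Siamese.
    d = distance(embed(frame_a, weights), embed(frame_b, weights))
    if is_match:
        return d ** 2
    return max(0.0, margin - d) ** 2


random.seed(0)
# Hypothetical 3x4 weight matrix and two toy 4-dimensional "frames".
weights = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
frame = [0.2, 0.5, 0.1, 0.9]
other = [0.9, 0.1, 0.8, 0.0]

print(contrastive_loss(frame, frame, True, weights))  # identical frames: 0.0
```

In a self-supervised setting, the match labels would come from the video itself rather than from human annotation; how the two matching problems are actually constructed is described in the full paper, not here.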