Face Retrieval in Large-Scale News Video Datasets

Thanh Duc NGO  Hung Thanh VU  Duy-Dinh LE  Shin'ichi SATOH  

IEICE TRANSACTIONS on Information and Systems   Vol.E96-D   No.8   pp.1811-1825
Publication Date: 2013/08/01
Online ISSN: 1745-1361
DOI: 10.1587/transinf.E96.D.1811
Print ISSN: 0916-8532
Type of Manuscript: PAPER
Category: Image Recognition, Computer Vision
face-track extraction,  face-track matching,  large-scale,  news video,  

Full Text: PDF(6.3MB)>>
Buy this Article

Face retrieval in news video has been identified as a challenging task due to the huge variations in the visual appearance of the human face. Although several approaches have been proposed to deal with this problem, their extremely high computational cost limits their scalability to large-scale video datasets that may contain millions of faces of hundreds of characters. In this paper, we introduce approaches for face retrieval that are scalable to such datasets while maintaining competitive performances with state-of-the-art approaches. To utilize the variability of face appearances in video, we use a set of face images called face-track to represent the appearance of a character in a video shot. Our first proposal is an approach for extracting face-tracks. We use a point tracker to explore the connections between detected faces belonging to the same character and then group them into one face-track. We present techniques to make the approach robust against common problems caused by flash lights, partial occlusions, and scattered appearances of characters in news videos. In the second proposal, we introduce an efficient approach to match face-tracks for retrieval. Instead of using all the faces in the face-tracks to compute their similarity, our approach obtains a representative face for each face-track. The representative face is computed from faces that are sampled from the original face-track. As a result, we significantly reduce the computational cost of face-track matching while taking into account the variability of faces in face-tracks to achieve high matching accuracy. Experiments are conducted on two face-track datasets extracted from real-world news videos, of such scales that have never been considered in the literature. One dataset contains 1,497 face-tracks of 41 characters extracted from 370 hours of TRECVID videos. The other dataset provides 5,567 face-tracks of 111 characters observed from a television news program (NHK News 7) over 11 years. We make both datasets publically accessible by the research community. The experimental results show that our proposed approaches achieved a remarkable balance between accuracy and efficiency.