Segmentation of the Speaker's Face Region with Audiovisual Correlation

Yuyu LIU  Yoichi SATO  

Publication
IEICE TRANSACTIONS on Information and Systems   Vol.E93-D   No.7   pp.1965-1975
Publication Date: 2010/07/01
Online ISSN: 1745-1361
DOI: 10.1587/transinf.E93.D.1965
Print ISSN: 0916-8532
Type of Manuscript: PAPER
Category: Multimedia Pattern Processing
Keyword: speaker detection, audiovisual analysis, segmentation, graph cut

Summary: 
The ability to find the speaker's face region in a video is useful for various applications. In this work, we develop a novel technique that finds this region within different time windows and is robust to changes in view, scale, and background. The main thrust of our technique is to integrate audiovisual correlation analysis into a video segmentation framework. We analyze the audiovisual correlation locally by computing the quadratic mutual information between our audiovisual features. This computation is based on probability density functions estimated by kernel density estimation with adaptive kernel bandwidth. The results of the audiovisual correlation analysis are incorporated into graph cut-based video segmentation to obtain a globally optimal extraction of the speaker's face region. No heuristic threshold needs to be set in this segmentation, because the correlation distributions of the speaker and the background are learned by expectation maximization. Experimental results demonstrate that our method detects the speaker's face region accurately and robustly across different views, scales, and backgrounds.
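
As an illustration of the correlation measure named in the summary, the following is a minimal sketch of quadratic mutual information between two scalar feature sequences, estimated with Gaussian kernel density estimation. It rests on several assumptions not stated in the summary: it uses the Euclidean-distance form of quadratic mutual information, one-dimensional features, and a fixed kernel bandwidth (the paper uses an adaptive bandwidth). The function names and default bandwidths are hypothetical and chosen only for this example.

```python
import numpy as np


def gaussian_gram(v, sigma):
    """Pairwise Gaussian kernel values for 1-D samples.

    The variance is 2 * sigma**2 because integrating the product of two
    Gaussian kernels of width sigma yields a Gaussian of doubled variance.
    """
    d = v[:, None] - v[None, :]
    var = 2.0 * sigma ** 2
    return np.exp(-d ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)


def quadratic_mutual_information(x, y, sigma_x=0.5, sigma_y=0.5):
    """Euclidean-distance QMI between paired 1-D samples x and y.

    Assumption: fixed bandwidths sigma_x and sigma_y (hypothetical defaults),
    not the adaptive bandwidth described in the summary.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape or x.ndim != 1:
        raise ValueError("x and y must be 1-D arrays of equal length")

    kx = gaussian_gram(x, sigma_x)
    ky = gaussian_gram(y, sigma_y)

    # Integral of the squared joint density estimate.
    v_joint = np.mean(kx * ky)
    # Integral of the squared product of the marginal estimates.
    v_marginal = np.mean(kx) * np.mean(ky)
    # Cross term between the joint and the product of marginals.
    v_cross = np.mean(kx.mean(axis=1) * ky.mean(axis=1))

    return v_joint - 2.0 * v_cross + v_marginal


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=500)                        # stand-in audio feature
    visual_speaker = audio + 0.1 * rng.normal(size=500)  # correlated with audio
    visual_background = rng.normal(size=500)             # independent of audio

    print("speaker   :", quadratic_mutual_information(audio, visual_speaker))
    print("background:", quadratic_mutual_information(audio, visual_background))
```

In this synthetic setting the visual feature that follows the audio feature should score higher than the independent one, which is the kind of local evidence a segmentation stage could then consume. The data here are random stand-ins, not the audiovisual features used in the paper.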