Speaker Diarization and Source Number Estimation Based on Audio-Visual Integration

Yukoh WAKABAYASHI  Koji INOUE  Masato NAKAYAMA  Takanobu NISHIURA  Yoichi YAMASHITA  Hiromasa YOSHIMOTO  Tatsuya KAWAHARA  

Publication
D - Abstracts of IEICE TRANSACTIONS on Information and Systems (Japanese Edition)   Vol.J99-D   No.3   pp.326-336
Publication Date: 2016/03/01
Online ISSN: 1881-0225
DOI: 
Type of Manuscript: Special Section PAPER (Special Section on Student Research)
Category: 
Keyword: 
speaker diarization, sound source localization, multi-modal, source number estimation, multi-party conversation

Full Text(in Japanese): PDF(1.4MB)


Summary: 
We present a method for speaker diarization and source number estimation based on audio-visual integration in multi-party conversation. Speaker diarization is the task of estimating "who speaks when." It plays an important role in understanding utterance content and in analyzing human-human interaction, such as turn-taking and the timing of backchannels. We integrate sound source localization obtained from audio information with participants' head locations obtained from visual information. Moreover, we use audio-visual integration to estimate the number of sound sources, which is essential for improving sound source localization. Previous work has assumed that this number is known; however, it is difficult to know in advance in natural conversations. Experimental results show that the proposed method improves both diarization accuracy and source number estimation accuracy compared with conventional methods.
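
As a rough illustration of what audio-visual integration can mean here, the sketch below attributes sound-direction estimates to participants by comparing them with head directions. It is not the method of the paper, whose details the summary does not give. The function names, the azimuth-only representation, and the 15-degree tolerance are all assumptions made for illustration.

```python
from typing import Dict, List


def angular_distance(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def diarize_frame(
    doa_peaks: List[float],
    head_azimuths: Dict[str, float],
    tolerance_deg: float = 15.0,  # hypothetical threshold, not from the paper
) -> List[str]:
    """Return the participants judged to be speaking in one frame.

    doa_peaks: sound-direction estimates (azimuth, degrees) from the audio side.
    head_azimuths: participant name -> head azimuth (degrees) from the visual side.
    A peak is attributed to the nearest head, and only if it lies within
    tolerance_deg of that head; other peaks are treated as non-speech.
    """
    active = set()
    if not head_azimuths:
        return []
    for peak in doa_peaks:
        nearest = min(
            head_azimuths,
            key=lambda name: angular_distance(peak, head_azimuths[name]),
        )
        if angular_distance(peak, head_azimuths[nearest]) <= tolerance_deg:
            active.add(nearest)
    return sorted(active)


def diarize(
    frames: List[List[float]],
    head_azimuths: Dict[str, float],
    tolerance_deg: float = 15.0,
) -> List[List[str]]:
    """Apply diarize_frame to every frame; the length of each result
    is a simple per-frame estimate of the number of active sources."""
    return [diarize_frame(peaks, head_azimuths, tolerance_deg) for peaks in frames]


if __name__ == "__main__":
    heads = {"A": 30.0, "B": 150.0, "C": 270.0}       # made-up head directions
    frames = [[32.0], [28.0, 155.0], [200.0], []]     # made-up audio peaks
    for t, speakers in enumerate(diarize(frames, heads)):
        print(t, speakers, "sources:", len(speakers))
```

In this toy setting the visual side both labels each sound direction with a participant and discards directions where no participant is present, so the count of matched participants serves as a simple per-frame source number estimate.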