Accent Sandhi Estimation of Tokyo Dialect of Japanese Using Conditional Random Fields

Masayuki SUZUKI  Ryo KUROIWA  Keisuke INNAMI  Shumpei KOBAYASHI  Shinya SHIMIZU  Nobuaki MINEMATSU  Keikichi HIROSE  

IEICE TRANSACTIONS on Information and Systems   Vol.E100-D   No.4   pp.655-661
Publication Date: 2017/04/01
Publicized: 2016/12/08
Online ISSN: 1745-1361
DOI: 10.1587/transinf.2016AWI0004
Type of Manuscript: INVITED PAPER (Special Section on Award-winning Papers)
Keywords: Japanese text-to-speech, accent sandhi, accent phrase boundary estimation, accent type estimation, conditional random field

When synthesizing speech from Japanese text, correctly assigning accent nuclei to input text with arbitrary content is indispensable for obtaining natural-sounding synthetic speech. A phenomenon called accent sandhi occurs in Japanese utterances: when a word is uttered in a sentence, its accent nucleus may change depending on the preceding and succeeding words. This paper describes a statistical method for automatically predicting the accent nucleus changes caused by accent sandhi. First, as the basis of the research, a database of Japanese text was constructed and labeled with the accent phrase boundaries and accent nucleus positions that occur when the text is uttered in sentences. A single native speaker of Tokyo dialect Japanese annotated all the labels for 6,344 Japanese sentences. Then, using this database, a conditional-random-field-based method was developed to predict accent phrase boundaries and accent nuclei. The proposed method predicted accent nucleus positions for accent phrases with 94.66% accuracy, clearly surpassing the 87.48% accuracy obtained using our rule-based method. A listening experiment was also conducted to compare synthetic speech obtained using the proposed method with that obtained using the rule-based method. The results show that our method significantly improved the naturalness of synthetic speech.
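
As an illustration of how a linear-chain conditional random field can label accent phrase boundaries, the following minimal sketch decodes the best tag sequence with the Viterbi algorithm. It is not the implementation described in the paper. The tag set (`B` for a word that starts an accent phrase, `I` for a word that continues one), the function name `viterbi`, and all scores in `EMISSIONS` and `TRANSITIONS` are hypothetical values chosen for the example; in a trained model they would be computed from learned feature weights.

```python
from typing import Dict, List, Tuple

# Hypothetical tag set: B = word starts an accent phrase, I = word continues one.
LABELS = ["B", "I"]


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float]) -> List[str]:
    """Return the highest-scoring label sequence of a linear-chain model."""
    if not emissions:
        return []
    # best[t][y] is the best score of any path that ends in label y at word t.
    best = [{y: emissions[0][y] for y in LABELS}]
    back = []  # back[t - 1][y] is the best previous label for y at word t
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in LABELS:
            prev = max(LABELS, key=lambda p: best[t - 1][p] + transitions[(p, y)])
            scores[y] = best[t - 1][prev] + transitions[(prev, y)] + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label to the first word.
    path = [max(LABELS, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy scores for a four-word sentence (illustrative values, not learned weights).
EMISSIONS = [
    {"B": 2.0, "I": -2.0},
    {"B": -1.0, "I": 1.0},
    {"B": 1.5, "I": 0.0},
    {"B": -0.5, "I": 0.5},
]
TRANSITIONS = {
    ("B", "B"): 0.0, ("B", "I"): 0.5,
    ("I", "B"): 0.0, ("I", "I"): -0.5,
}

print(viterbi(EMISSIONS, TRANSITIONS))  # ['B', 'I', 'B', 'I']
```

With these toy scores the four words are grouped into two accent phrases of two words each. The same decoding step could in principle be applied to accent nucleus labels, with a tag set that encodes nucleus positions instead of boundaries.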