A System for the Synthesis of High-Quality Speech from Texts on General Weather Conditions

Keikichi HIROSE  Hiroya FUJISAKI  

IEICE TRANSACTIONS on Fundamentals of Electronics, Communications and Computer Sciences   Vol.E76-A   No.11   pp.1971-1980
Publication Date: 1993/11/25
Online ISSN: 
Print ISSN: 0916-8508
Type of Manuscript: Special Section PAPER (Special Section on Speech Synthesis: Current Technologies and Thier Application)
text-to-speech conversion system,  linguistic processing,  phonological processing,  syntactic boundary,  prosodic symbols,  fundamental frequency contour,  terminal-analog speech synthesizer,  

Full Text: PDF(987.6KB)>>
Buy this Article

A text-to-speech conversion system for Japanese has been developed for the purpose of producing high-quality speech output. This system consists of four processing stages: 1) linguistic processing, 2) phonological processing, 3) control parameter generation, and 4) speech waveform generation. Although the processing at the first stage is restricted to the texts on general weather conditions, the other three stages can also cope with texts of news and narrations on other topics. Since the prosodic features of speech are largely related to the linguistic information, such as word accent, syntactic structure and discourse structure, linguistic processing of a wider range than ever, at least a sentence, is indispensable to obtain good quality speech with respect to the prosody. From this point of view, input text was restricted to the weather forecast sentences and a method for linguistic processing was developed to conduct morpheme, syntactic and semantic analyses simultaneously. A quantitative model for generating fundamental frequency contours was adopted to make a good reflection of the linguistic information on the prosody of synthetic speech. A set of prosodic rules was constructed to generate prosodic symbols representing prosodic structures of the text from the linguistic information obtained at the first stage. A new speech synthesizer based on the terminal analog method was also developed to improve the segmental quality of synthetic speech. It consists of four paths of cascade connection of pole/zero filters and three waveform generators. The four paths are respectively used for the synthesis of vowels and vowel-like sounds, nasal murmur and buzz bar, friction, and plosion, while the three generators produce voicing source waveform approximated by polynomials, white Gaussian noise source for fricatives and impulse source for plosives. The validity of the approach above has been confirmed by the listening tests using speech synthesized by the developed system. Improvements both in the quality of prosodic features and in the quality of segmental features were realized for the synthetic speech.