PolyTTS: Polyglot text-to-speech synthesis

The task of a polyglot text-to-speech (TTS) synthesis system is to transform mixed-lingual text into appropriate speech. First steps towards such a system have been made with the projects POSSY/TTS'99 (diphone library for polyglot TTS) and LESAN (lexical and syntactic analysis of mixed-lingual sentences).

In this project, further steps towards a complete polyglot TTS synthesis system will be made, as outlined below.

Our monolingual TTS system SVOX can easily be configured for a new language by replacing all its databases (lexica, grammars, rule sets for accentuation and phrasing, neural network for fundamental frequency control, diphone library, etc.) with those of the new language. In contrast, a polyglot TTS system must hold the databases of a given set of languages simultaneously and apply them appropriately (see paper [PR03]). Our monolingual TTS system therefore needed a major redesign. The new system will be able to handle language mixing phenomena in all processing steps and is therefore called polySVOX (for further information, please see paper [RP06], article [RP07], or thesis [Rom09b]). Some audio examples can be found at the polySVOX demo site.
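The difference between swapping databases and holding several at once can be illustrated with a minimal Python sketch. It is not taken from polySVOX: the class names (LanguageResources, Segment, PolyglotResources), the lookup method, and the placeholder lexicon entries are all hypothetical, and the input is assumed to be already split into language-tagged segments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageResources:
    """Hypothetical bundle standing in for one language's databases."""
    language: str
    lexicon: dict  # word -> placeholder entry (a real lexicon would hold transcriptions)


@dataclass(frozen=True)
class Segment:
    """A stretch of text assumed to be already tagged with its language."""
    text: str
    language: str


class PolyglotResources:
    """Holds the resources of several languages at the same time."""

    def __init__(self, bundles):
        self._by_language = {bundle.language: bundle for bundle in bundles}

    def lookup(self, segments):
        """Look up every word in the lexicon of its own segment's language."""
        results = []
        for segment in segments:
            resources = self._by_language[segment.language]
            for word in segment.text.split():
                entry = resources.lexicon.get(word.lower(), "<unknown>")
                results.append((word, segment.language, entry))
        return results


if __name__ == "__main__":
    # Placeholder entries only; no real phonetic data is implied.
    store = PolyglotResources([
        LanguageResources("de", {"das": "de-entry-das", "ist": "de-entry-ist", "ein": "de-entry-ein"}),
        LanguageResources("en", {"update": "en-entry-update"}),
    ])
    sentence = [Segment("Das ist ein", "de"), Segment("Update", "en")]
    for word, language, entry in store.lookup(sentence):
        print(word, language, entry)
```

In a monolingual configuration, the dictionary of bundles would contain a single entry that is replaced wholesale for a new language. In the polyglot case, all bundles stay loaded and the choice is made per segment.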

One of the most difficult problems of polyglot TTS synthesis is the generation of adequate prosody. Investigations in project LESAN have shown that foreign inclusions are phonetically and prosodically assimilated to the base language. However, the degree of assimilation of the embedded language to the base language depends strongly on the size of the inclusion, and in particular differs sharply between the language regions of Switzerland: in the French-speaking part, the assimilation is much stronger than in the German-speaking part. In other words, a polyglot TTS system intended for the German-speaking part of Switzerland has to distinguish very strongly between the pronunciation of German sentences (base language) and the pronunciation of the foreign inclusions.
These very general rules are far from sufficient for prosody generation in a polyglot TTS system. Therefore, appropriate investigations will be carried out in this project to gain a more complete understanding of this issue. Furthermore, we will try to use statistical models (particularly neural networks) for prosody control (see [RPB05]). This approach has proven very successful in the monolingual case. Although many questions remain open, we consider this approach very promising.

Supported by: This project was partly supported by the NCCR IM2 (i.e. by the Swiss National Science Foundation).

Last updated: Mon Nov 20 15:00:45 CET 2017 by: Beat Pfister

!!! This document is stored in the ETH Web archive and is no longer maintained !!!