Speech-controlled machines can be deployed successfully in quiet environments. However, they fail in situations of high background noise, and in particular in the presence of other voices. In such situations the error rate of the speech recognition component rises drastically and, even worse, the system cannot distinguish the user's spoken commands from background speech.
The aim of this project is to improve the noise robustness of speech-based human-machine interaction (HMI) by exploiting information from the visual channel. For example, by observing the user's mouth, distinguishing user speech from background speech becomes much more reliable. The basic idea is to extract helpful information from the visual channel and combine it with information from the audio channel. This principle can be applied to several tasks that are central to speech-based HMI, such as voice activity detection, speech recognition, and user verification.
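The audio-visual combination described above can be illustrated with a minimal late-fusion voice activity detector. The features, scores, weights, and threshold below are illustrative assumptions for the sketch, not the project's actual method: an energy-based audio score is combined with a mouth-motion score from tracked mouth openings, so that loud background noise alone (with a still mouth) is not mistaken for user speech.

```python
# Minimal sketch of audio-visual voice activity detection via late fusion.
# All features and parameters here are illustrative assumptions.
import numpy as np

def audio_score(frame, noise_floor=1e-3):
    """Map the mean energy of one audio frame to a speech likelihood in [0, 1)."""
    energy = np.mean(np.asarray(frame) ** 2)
    return energy / (energy + noise_floor)

def visual_score(mouth_heights, motion_floor=1e-4):
    """Map mouth-opening variability (e.g. tracked lip distance per video
    frame) to a speech likelihood in [0, 1): a moving mouth suggests speech."""
    var = np.var(np.asarray(mouth_heights))
    return var / (var + motion_floor)

def fused_vad(audio_frame, mouth_heights, w_audio=0.5, threshold=0.5):
    """Late fusion: weighted average of the per-modality scores,
    thresholded to a speech/non-speech decision."""
    score = (w_audio * audio_score(audio_frame)
             + (1.0 - w_audio) * visual_score(mouth_heights))
    return score > threshold, score

# Loud audio plus a moving mouth -> detected as user speech.
speaking, _ = fused_vad(np.full(160, 0.1), [0.0, 0.02, 0.0, 0.02])
# Equally loud audio but a still mouth -> rejected as background noise.
noise_only, _ = fused_vad(np.full(160, 0.1), [0.01, 0.01, 0.01, 0.01])
```

The point of the sketch is that the visual cue vetoes audio-only false alarms: with a still mouth the fused score stays below the threshold even when the acoustic energy is high.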
This is a joint project of the Computer Vision Laboratory and the speech processing group.
Technical reports of the work done so far are: [PN10], [Nag11], [Nag12], [Nag13], and [Nag14].
Publications (speech processing group): [NP12a], [NP12b], [NHP13a], [NP14], [NHP15], and [TGPV16].