Audio-Visual Speech Recognition Using MPEG-4 Compliant Visual Features

Aleksic, Petar S.; Williams, Jay J.; Wu, Zhilin; Katsaggelos, Aggelos K.

doi:10.1155/S1110865702206162

Research Article
Published: 28 November 2002

EURASIP Journal on Advances in Signal Processing volume 2002, Article number: 150948 (2002)

Abstract

We describe an audio-visual automatic continuous speech recognition system that significantly improves speech recognition performance over a wide range of acoustic noise levels, as well as under clean audio conditions. The system uses facial animation parameters (FAPs) supported by the MPEG-4 standard for the visual representation of speech. We also describe a robust, automatic algorithm we have developed to extract FAPs from visual data, which requires neither hand labeling nor extensive training procedures. Principal component analysis (PCA) was performed on the FAPs to reduce the dimensionality of the visual feature vectors, and the derived projection weights were used as visual features in the audio-visual automatic speech recognition (ASR) experiments. Both single-stream and multistream hidden Markov models (HMMs) were used to model the ASR system, integrate audio and visual information, and perform relatively large-vocabulary (approximately 1000 words) speech recognition experiments. The experiments use clean audio data and audio data corrupted by stationary white Gaussian noise at various SNRs. The proposed system reduces the word error rate (WER) by 20% to 23% relative to the audio-only WER at SNRs from 0 to 30 dB with additive white Gaussian noise, and by 19% relative to the audio-only WER under clean audio conditions.
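Two quantities in the abstract lend themselves to a small worked example: corrupting audio with white Gaussian noise at a chosen SNR, and expressing an improvement as a relative WER reduction. The Python sketch below is illustrative only. The function names, the fixed random seed, and the choice to scale the noise from the average signal power are assumptions made here, not details taken from the article.

```python
import math
import random


def add_white_gaussian_noise(signal, snr_db, seed=0):
    """Return a noisy copy of `signal` at roughly `snr_db` dB SNR.

    Assumption: noise power is derived from the average power of the
    whole signal; the article does not state how its noise was scaled.
    """
    rng = random.Random(seed)
    signal_power = sum(s * s for s in signal) / len(signal)
    # SNR(dB) = 10 * log10(P_signal / P_noise), solved for P_noise.
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean signal and its noisy version."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(signal_power / noise_power)


def relative_wer_reduction(wer_audio_only, wer_audio_visual):
    """Relative WER reduction, as a fraction of the audio-only WER."""
    return (wer_audio_only - wer_audio_visual) / wer_audio_only


# Hypothetical test tone standing in for speech (not data from the article).
clean = [math.sin(2.0 * math.pi * 440.0 * t / 16000.0) for t in range(20000)]
noisy = add_white_gaussian_noise(clean, snr_db=10.0)
```

With made-up WERs of 50% (audio-only) and 40% (audio-visual), `relative_wer_reduction(50.0, 40.0)` gives 0.2, a 20% relative reduction even though the absolute drop is 10 percentage points. This is the sense in which the abstract reports its 19% to 23% figures.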

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, Northwestern University, 2145 North Sheridan Road, Evanston, IL, 60208-3118, USA
Petar S. Aleksic, Jay J. Williams, Zhilin Wu & Aggelos K. Katsaggelos

Corresponding author

Correspondence to Petar S. Aleksic.

About this article

Cite this article

Aleksic, P.S., Williams, J.J., Wu, Z. et al. Audio-Visual Speech Recognition Using MPEG-4 Compliant Visual Features. EURASIP J. Adv. Signal Process. 2002, 150948 (2002). https://doi.org/10.1155/S1110865702206162

Received: 03 December 2001
Revised: 19 May 2002
Published: 28 November 2002
DOI: https://doi.org/10.1155/S1110865702206162
