A Novel Lip Descriptor for Audio-Visual Keyword Spotting Based on Adaptive Decision Fusion
Journal article, IEEE Transactions on Multimedia, Year: 2016

Abstract

Keyword spotting remains challenging in real-world environments where noise changes dramatically. Recent studies show that audio-visual integration helps, since visual speech is not affected by acoustic noise. In visual speech recognition, however, individual speaking mannerisms can cause confusion and false recognition. To address this problem, this paper presents a novel lip descriptor that combines geometry-based and appearance-based features. Specifically, a set of geometry-based features is built on an advanced facial landmark localization method. To obtain a robust and discriminative representation, a spatiotemporal lip feature is proposed that considers similarities among textons and maps the feature to an intra-class subspace. Moreover, a parallel two-step keyword spotting strategy based on decision fusion is proposed to make the best use of audio-visual speech and to adapt to diverse noise conditions. Weights generated by a neural network combine the acoustic and visual contributions. Experimental results on the OuluVS and PKU-AV datasets demonstrate that the proposed lip descriptor is competitive with the state of the art. Additionally, the proposed audio-visual keyword spotting method based on decision-level fusion significantly improves noise robustness, outperforms feature-level fusion, and adapts to various noisy conditions.
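
The decision-level fusion idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract states only that a neural network generates the weights, so the logistic mapping below is a hypothetical stand-in, and the names fusion_weight, fuse_scores, spot_keyword, the reliability input, and the threshold are all assumptions made for illustration. Scores are assumed to be normalized to [0, 1].

```python
import math


def fusion_weight(reliability, slope=1.0, offset=0.0):
    """Map an audio reliability measure to a weight in (0, 1).

    Hypothetical stand-in for the weight-generating neural network
    mentioned in the abstract: a single logistic unit.
    """
    return 1.0 / (1.0 + math.exp(-(slope * reliability - offset)))


def fuse_scores(audio_score, visual_score, audio_weight):
    """Decision-level fusion: convex combination of the two stream scores."""
    return audio_weight * audio_score + (1.0 - audio_weight) * visual_score


def spot_keyword(audio_score, visual_score, reliability, threshold=0.5):
    """Return (detected, fused_score) for one candidate keyword.

    The threshold is an assumed value, not one taken from the paper.
    """
    weight = fusion_weight(reliability)
    fused = fuse_scores(audio_score, visual_score, weight)
    return fused >= threshold, fused


if __name__ == "__main__":
    # Low audio reliability (noisy audio): the visual score dominates.
    print(spot_keyword(audio_score=0.2, visual_score=0.9, reliability=-20.0))
    # High audio reliability (clean audio): the audio score dominates.
    print(spot_keyword(audio_score=0.2, visual_score=0.9, reliability=20.0))
```

In this sketch each stream is scored independently and only the final scores are combined, which is what distinguishes decision-level fusion from feature-level fusion, where audio and visual features would be concatenated before classification.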
Main file
AVkeyword-doublecolumn.pdf (4.35 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-02535026, version 1 (07-04-2020)

Cite

Pingping Wu, Hong Liu, Xiaofei Li, Ting Fan, Xuewu Zhang. A Novel Lip Descriptor for Audio-Visual Keyword Spotting Based on Adaptive Decision Fusion. IEEE Transactions on Multimedia, 2016, 18 (3), pp.326-338. ⟨10.1109/TMM.2016.2520091⟩. ⟨hal-02535026⟩