Vision-language integration using constrained local semantic features

This paper addresses two recent and promising topics in computer vision: the integration of linguistic and visual information, and the use of semantic features to represent image content. Semantic features represent an image in terms of visual concepts that a set of base classifiers detects in it. Recent work reports competitive performance in image classification and retrieval using such features. We propose to rely on this type of image description to facilitate its integration with linguistic data. More precisely, the contribution of this paper is threefold. First, we propose to automatically determine the most useful dimensions of a semantic representation according to the actual image content. This yields a level of sparsity for the semantic features that is adapted to each image independently. Our model takes into account both the confidence in each base classifier and the global amount of information in the semantic signature, defined in the Shannon sense. This contribution is further extended to better reflect the detection of a visual concept at a local scale. Second, we introduce a new strategy for learning an efficient CNN-based mid-level representation that boosts the performance of semantic signatures. Last, we propose several schemes to integrate a visual representation based on semantic features with linguistic information, leading to the nesting of linguistic information at two levels of the visual features. Experimental validation is conducted on four benchmarks (VOC 2007, VOC 2012, NUS-WIDE and MIT Indoor) for classification, three of them for retrieval and two of them for bi-modal classification. The proposed semantic feature achieves state-of-the-art performance on three classification benchmarks and on all retrieval benchmarks. Our vision-language integration method achieves state-of-the-art performance in bi-modal classification.

Keywords

Image classification; Image retrieval; Bi-modal classification; Semantic features; Concept-based sparsification; Constrained local regions; Vision-language integration; Common latent space; Pure concept space

Domains

Computer Science [cs] / Computer Vision and Pattern Recognition [cs.CV]

Contributor: Léna Le Roy

https://cea.hal.science/cea-01803830

Submitted on: Thursday, May 31, 2018, 07:55:55

Last modified on: Wednesday, April 3, 2024, 11:14:12

Dates and versions

cea-01803830, version 1 (31-05-2018)

Identifiers

HAL Id: cea-01803830, version 1
DOI: 10.1016/j.cviu.2017.05.017

Cite

Youssef Tamaazousti, Hervé Le Borgne, Adrian Popescu, Etienne Gadeski, Alexandru Ginsca, et al. Vision-language integration using constrained local semantic features. Computer Vision and Image Understanding, 2017, 163, pp.41-57. ⟨10.1016/j.cviu.2017.05.017⟩. ⟨cea-01803830⟩


Collections

CEA CENTRALESUPELEC DRT MICS CEA-UPSAY UNIV-PARIS-SACLAY LIST GS-ENGINEERING GS-COMPUTER-SCIENCE GS-SPORT-HUMAN-MOVEMENT
