U. Ahsan and I. Essa, Clustering social event images using kernel canonical correlation analysis, Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPRW '14, pp.814-819, 2014.

K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman, The devil is in the details: an evaluation of recent feature encoding methods, British Machine Vision Conference, 2011.

K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, Return of the devil in the details: Delving deep into convolutional nets, British Machine Vision Conference, 2014.

X. Chen and C. L. Zitnick, Mind's eye: A recurrent visual representation for image caption generation, CVPR, 2015.

J. C. Pereira, E. Coviello, G. Doyle, N. Rasiwasia, G. Lanckriet et al., On the role of correlation and abstraction in cross-modal multimedia retrieval, TPAMI, vol.36, issue.3, pp.521-535, 2014.

J. Dong, W. Xia, Q. Chen, J. Feng, Z. Huang et al., Subcategory-aware object classification, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.827-834, 2013.

F. Feng, X. Wang, and R. Li, Cross-modal retrieval with correspondence autoencoder, Proc. of ACM Intl. Conf. on Multimedia, MM '14, 2014.

A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean et al., DeViSE: A deep visual-semantic embedding model, NIPS, pp.2121-2129, 2013.

Y. Gong, Q. Ke, M. Isard, and S. Lazebnik, A multi-view embedding space for modeling internet images, tags, and their semantics, IJCV, vol.106, issue.2, pp.210-233, 2014.

D. R. Hardoon, S. R. Szedmak, and J. R. Shawe-Taylor, Canonical correlation analysis: An overview with application to learning methods, Neural Comput., vol.16, issue.12, pp.2639-2664, 2004.

K. He, X. Zhang, S. Ren, and J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, TPAMI, vol.37, issue.9, pp.1904-1916, 2015.

M. Hodosh, P. Young, and J. Hockenmaier, Framing image description as a ranking task: Data, models and evaluation metrics, Journal of Artificial Intelligence Research, vol.47, pp.853-899, 2013.

Y. Huang, Z. Wu, L. Wang, and T. Tan, Feature coding in image classification: A comprehensive study, TPAMI, vol.36, issue.3, pp.493-506, 2014.

S. J. Hwang and K. Grauman, Learning the relative importance of objects from tagged images for retrieval and crossmodal search, IJCV, vol.100, issue.2, pp.134-153, 2012.

S. J. Hwang and K. Grauman, Reading between the lines: Object localization using implicit cues from image tags, TPAMI, vol.34, issue.6, pp.1145-1158, 2012.

H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez et al., Aggregating local image descriptors into compact codes, TPAMI, vol.34, issue.9, pp.1704-1716, 2012.
URL : https://hal.archives-ouvertes.fr/inria-00633013

A. Karpathy and L. Fei-Fei, Deep visual-semantic alignments for generating image descriptions, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

A. Karpathy, A. Joulin, and F. F. Li, Deep fragment embeddings for bidirectional image sentence mapping, Advances in neural information processing systems, pp.1889-1897, 2014.

T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, Distributed representations of words and phrases and their compositionality, CoRR, abs/1310.4546, 2013.

J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee et al., Multimodal deep learning, Proceedings of the 28th international conference on machine learning (ICML-11), pp.689-696, 2011.

F. Perronnin and D. Larlus, Fisher vectors meet neural networks: A hybrid classification architecture, CVPR, 2015.

C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier, Collecting image annotations using Amazon's Mechanical Turk, Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, CSLDAMT '10, pp.139-147, 2010.

J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek, Image classification with the Fisher vector: Theory and practice, IJCV, vol.105, issue.3, pp.222-245, 2013.

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, CoRR, abs/1409.1556, 2014.

R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng, Grounded compositional semantics for finding and describing images with sentences, Transactions of the Association for Computational Linguistics, vol.2, pp.207-218, 2014.

N. Srivastava and R. R. Salakhutdinov, Multimodal learning with deep boltzmann machines, Advances in neural information processing systems, pp.2222-2230, 2012.

W. Wang, R. Arora, K. Livescu, and J. Bilmes, On deep multi-view representation learning, International Conference on Machine Learning, 2015.

Y. Wei, W. Xia, J. Huang, B. Ni, J. Dong et al., CNN: Single-label to multi-label, CoRR, abs/1406.5726, 2014.

P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, TACL, vol.2, pp.67-78, 2014.

A. Znaidia, A. Shabou, H. Le Borgne, C. Hudelot, and N. Paragios, Bag-of-multimedia-words for image classification, ICPR, pp.1509-1512, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00825187