J. C. Pereira, E. Coviello, G. Doyle, N. Rasiwasia, G. Lanckriet et al., On the role of correlation and abstraction in cross-modal multimedia retrieval, TPAMI, vol.36, issue.3, pp.521-535, 2014.

F. Feng, X. Wang, and R. Li, Cross-modal retrieval with correspondence autoencoder, Proc. of ACM Intl. Conf. on Multimedia, MM '14, 2014.

Y. Feng and M. Lapata, Topic models for image annotation and text illustration, Human Language Technologies: 2010 Annual Conf. of the North American Chapter of the Association for Computational Linguistics, HLT '10, pp.831-839, 2010.

Y. Gong, Q. Ke, M. Isard, and S. Lazebnik, A multi-view embedding space for modeling internet images, tags, and their semantics, IJCV, vol.106, issue.2, pp.210-233, 2014.

D. R. Hardoon, S. R. Szedmak, and J. R. Shawe-Taylor, Canonical correlation analysis: An overview with application to learning methods, Neural Comput., vol.16, issue.12, pp.2639-2664, 2004.

H. Hotelling, Relations between two sets of variables, Biometrika, vol.28, pp.321-377, 1936.

S. J. Hwang and K. Grauman, Learning the relative importance of objects from tagged images for retrieval and cross-modal search, IJCV, vol.100, issue.2, pp.134-153, 2012.

X. Mao, B. Lin, D. Cai, X. He, and J. Pei, Parallel field alignment for cross media retrieval, Proc. of ACM Intl. Conf. on Multimedia, MM '13, 2013.

P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus et al., OverFeat: Integrated recognition, localization and detection using convolutional networks, 2013.