Clustering and variable selection evaluation of 13 unsupervised methods for multi-omics data integration

Morgane Pierre-Jean; Jean-François Deleuze; Edith Le Floch; Florence Mauger

doi:10.1093/bib/bbz138

Journal article, Briefings in Bioinformatics, Year: 2019



Morgane Pierre-Jean

Role: Author

Centre National de Recherche en Génomique Humaine

Jean-François Deleuze

Role: Author
PersonId : 1015006

Centre National de Génotypage

Centre National de Recherche en Génomique Humaine

Centre d'Etude du Polymorphisme Humain

Edith Le Floch

Role: Author

Centre National de Recherche en Génomique Humaine

Florence Mauger

Role: Author

Centre National de Génotypage

Abstract

Recent advances in next-generation sequencing (NGS), microarrays and mass spectrometry for omics data production have enabled the generation and collection of different modalities of high-dimensional molecular data. Integrating multiple omics datasets is a statistical challenge because of the limited number of individuals, the high number of variables and the heterogeneity of the datasets to integrate. Many tools have recently been developed to address this problem, including canonical correlation analysis, matrix factorization and SM. These commonly used techniques aim to analyze two or more types of omics simultaneously. In this article, we compare a panel of 13 unsupervised methods based on these different approaches to integrate various types of multi-omics datasets: iClusterPlus, regularized generalized canonical correlation analysis (RGCCA), sparse generalized canonical correlation analysis (SGCCA), multiple co-inertia analysis (MCIA), integrative-NMF (intNMF), SNF, MoCluster, mixKernel, CIMLR, LRAcluster, ConsensusClustering, PINSPlus and multi-omics factor analysis (MOFA). We evaluate the ability of the methods to recover the subgroups and the variables that drive the clustering on eight simulated benchmarks. MOFA does not provide any results on these benchmarks. For clustering, SNF, MoCluster, CIMLR, LRAcluster, ConsensusClustering and intNMF provide the best results. For variable selection, MoCluster outperforms the others. However, the performance of the methods seems to depend on the heterogeneity of the datasets (especially for MCIA, intNMF and iClusterPlus). Finally, we apply the methods to three real studies with heterogeneous data and various phenotypes. We conclude that MoCluster is the best method to analyze these omics data. Availability: An R package named CrIMMix is available on GitHub at https://github.com/CNRGH/crimmix to reproduce all the results of this article.
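To make the two evaluation tasks in the abstract concrete, the following is a minimal Python sketch of how subgroup recovery and variable recovery could be scored on a simulated benchmark where the truth is known. The abstract does not state which metrics the authors use, so the adjusted Rand index for clustering and true/false positive rates for variable selection are assumptions chosen for illustration. The function names `adjusted_rand_index` and `selection_rates` are hypothetical and are not part of the CrIMMix package.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Agreement between two partitions of the same samples, corrected for chance.

    Assumed metric for illustration; the abstract does not name one.
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label vectors must have the same length")
    n = len(true_labels)
    if n < 2:
        return 1.0  # no pair of samples to disagree on

    # Count samples per (true, predicted) cell and per margin.
    cell_counts = Counter(zip(true_labels, pred_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)

    # Number of sample pairs grouped together in both, or in each, partition.
    pairs_both = sum(comb(c, 2) for c in cell_counts.values())
    pairs_true = sum(comb(c, 2) for c in true_counts.values())
    pairs_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = pairs_true * pairs_pred / comb(n, 2)
    maximum = (pairs_true + pairs_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are trivial
    return (pairs_both - expected) / (maximum - expected)


def selection_rates(selected, driving, n_variables):
    """True and false positive rates of a set of selected variable indices.

    `driving` holds the indices of the variables that truly drive the
    simulated clusters; `n_variables` is the total number of variables.
    """
    selected, driving = set(selected), set(driving)
    n_negative = n_variables - len(driving)
    tpr = len(selected & driving) / len(driving) if driving else 0.0
    fpr = len(selected - driving) / n_negative if n_negative else 0.0
    return tpr, fpr


if __name__ == "__main__":
    # Toy data: cluster labels are arbitrary, so a relabelled but identical
    # partition still counts as a perfect recovery.
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
    print(selection_rates({0, 1, 5}, {0, 1, 2}, n_variables=10))
```

Both scores need the simulated ground truth, which is why this kind of evaluation applies to simulated benchmarks and not directly to real studies where the true subgroups and driving variables are unknown.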

Keywords

multi-omics; unsupervised integrative methods; benchmarks; real data; performance evaluation

Domains

Statistics [math.ST]; Applications [stat.AP]

Main file

bbz138_Approval.pdf (1.26 MB)

Origin: Publisher files allowed on an open archive


https://cea.hal.science/cea-02393847

Submitted on: Wednesday, December 4, 2019, 15:18:39

Last modified on: Thursday, April 4, 2024, 03:10:23

Long-term archiving on: Thursday, March 5, 2020, 20:23:06

Dates and versions

cea-02393847, version 1 (04-12-2019)

Identifiers

HAL Id: cea-02393847, version 1
DOI: 10.1093/bib/bbz138

Cite

Morgane Pierre-Jean, Jean-François Deleuze, Edith Le Floch, Florence Mauger. Clustering and variable selection evaluation of 13 unsupervised methods for multi-omics data integration. Briefings in Bioinformatics, 2019, ⟨10.1093/bib/bbz138⟩. ⟨cea-02393847⟩


Collections

CEA UNIV-PARIS7 USPC UNIV-PARIS-SACLAY JACOB CEA-DRF CNRGH

209 views

880 downloads
