Graph sketching-based Space-efficient Data Clustering

Anne Morvan 1 Krzysztof Choromanski 2 Cedric Gouy-Pailler 1 Jamal Atif 3, 4
1 LADIS - Laboratoire d'analyse des données et d'intelligence des systèmes
DM2I - Département Métrologie Instrumentation & Information : DRT/LIST/DM2I
4 MILES
LAMSADE - Laboratoire d'analyse et modélisation de systèmes pour l'aide à la décision
Abstract : In this paper, we address the problem of recovering arbitrary-shaped data clusters from datasets under tight space constraints, as is the case in many real-world applications where analysis algorithms are deployed directly on the resource-limited mobile devices that collect the data. We present DBMSTClu, a new space-efficient, density-based, non-parametric method working on a Minimum Spanning Tree (MST) recovered from a limited number of linear measurements, i.e. a sketched version of the dissimilarity graph G between the N objects to cluster. Unlike the k-means, k-medians or k-medoids algorithms, it does not fail to distinguish clusters of particular shapes, thanks to the ability of the MST to express the underlying structure of a graph. No input parameter is needed, unlike DBSCAN or Spectral Clustering. An approximate MST is retrieved by following the dynamic semi-streaming model: the dissimilarity graph G is handled as a stream of edge weight updates, which is sketched in one pass over the data into a compact structure requiring O(N polylog(N)) space, far better than the theoretical memory cost O(N^2) of G. Taking the recovered approximate MST T as input, DBMSTClu then successfully detects the right number of nonconvex clusters by performing relevant cuts on T in time linear in N. We provide theoretical guarantees on the quality of the clustering partition and also demonstrate its advantage over the existing state of the art on several datasets.
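The general idea of clustering by cutting edges of an MST can be illustrated with a minimal sketch. It is not DBMSTClu itself: the graph is not sketched (distances are computed exactly with Prim's algorithm), and the cut rule, which drops edges heavier than an assumed `factor` times the median MST edge weight, is a simple stand-in for the paper's parameter-free criterion. The function names `prim_mst` and `cut_and_label` and the `factor` parameter are hypothetical.

```python
import math

def prim_mst(points):
    """Return the MST edges (u, v, w) of the complete Euclidean graph (Prim's algorithm)."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge weight linking each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the node outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        # Distances are computed on the fly, so no N x N matrix is stored.
        for v in range(n):
            if not in_tree[v]:
                w = math.dist(points[u], points[v])
                if w < best[v]:
                    best[v], parent[v] = w, u
    return edges

def cut_and_label(n, edges, factor=3.0):
    """Cut MST edges heavier than factor * median weight (an assumed rule, not the
    paper's criterion) and label the resulting connected components."""
    if not edges:
        return [0] * n
    weights = sorted(w for _, _, w in edges)
    threshold = factor * weights[len(weights) // 2]
    root = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    for u, v, w in edges:
        if w <= threshold:
            root[find(u)] = find(v)
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
edges = prim_mst(points)
labels = cut_and_label(len(points), edges)
print(labels)
```

On the two well-separated groups above, the single long MST edge joining them is cut and each group receives its own label. Because the tree has only N - 1 edges, the cutting step works on far less data than the full dissimilarity graph, which is the property the abstract relies on.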

https://hal-cea.archives-ouvertes.fr/cea-01838501
Contributor : Marie-France Robbe
Submitted on : Friday, July 13, 2018 - 2:09:33 PM
Last modification on : Wednesday, August 14, 2019 - 11:30:02 AM

Citation

Anne Morvan, Krzysztof Choromanski, Cedric Gouy-Pailler, Jamal Atif. Graph sketching-based Space-efficient Data Clustering. 2018 SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, May 2018, San Diego, United States. ⟨10.1137/1.9781611975321.2⟩. ⟨cea-01838501⟩
