https://hal-cea.archives-ouvertes.fr/cea-01223434Krzakala, F.F.KrzakalaSU - Sorbonne UniversitéMoore, C.C.MoorePercy Fitzpatrick Institute - University of Cape TownMossel, E.E.MosselNeeman, J.J.NeemanSly, A.A.SlyZdeborova, LenkaLenkaZdeborovaIPHT - Institut de Physique Théorique - UMR CNRS 3681 - CEA - Commissariat à l'énergie atomique et aux énergies alternatives - Université Paris-Saclay - CNRS - Centre National de la Recherche ScientifiqueZhang, P.P.ZhangLaboratory of Infrared Material and Devices - NBU - Ningbo UniversitySpectral redemption in clustering sparse networksHAL CCSD2015[PHYS] Physics [physics]De Laborderie, Emmanuelle2019-02-07 11:19:282023-02-10 04:37:042019-02-07 11:21:38enJournal articleshttps://hal-cea.archives-ouvertes.fr/cea-01223434/document10.1073/pnas.1312486110application/pdf1Spectral algorithms are classic approaches to clustering and community detection in networks. However, for sparse networks the standard versions of these algorithms are suboptimal, in some cases completely failing to detect communities even when other algorithms such as belief propagation can do so. Here we introduce a new class of spectral algorithms based on a non-backtracking walk on the directed edges of the graph. The spectrum of this operator is much better-behaved than that of the adjacency matrix or other commonly used matrices, maintaining a strong separation between the bulk eigenvalues and the eigenvalues relevant to community structure even in the sparse case. We show that our algorithm is optimal for graphs generated by the stochastic block model, detecting communities all the way down to the theoretical limit. We also show the spectrum of the non-backtracking operator for some real-world networks, illustrating its advantages over traditional spectral clustering. Detecting communities or modules is a central task in the study of social, biological, and technological networks. Two of the most popular approaches are statistical inference, where we fix a generative model such as the stochastic block model to the network [1, 2]; and spectral methods, where we classify vertices according to the eigenvectors of a matrix associated with the network such as its adjacency matrix or Laplacian [3]. Both statistical inference and spectral methods have been shown to work well in networks that are sufficiently dense, or when the graph is regular [4-8]. However, for sparse networks with widely varying degrees, the community detection problem is harder. Indeed, it was recently shown [9-11] that there is a phase transition below which communities present in the underlying block model are impossible for any algorithm to detect. While standard spectral algorithms succeed down to this transition when the network is sufficiently dense, with an average degree growing as a function of network size [8], in the case where the average degree is constant these methods fail significantly above the transition [12]. Thus there is a large regime in which statistical inference succeeds in detecting communities, but where current spectral algorithms fail. It was conjectured in [11] that this gap is artificial and that there exists a spectral algorithm that succeeds all the way to the detectability transition even in the sparse case. Here, we propose an algorithm based on a linear operator considerably different from the adjacency matrix or its variants: namely, a matrix that represents a walk on the directed edges of the network, with backtracking prohibited. We give strong evidence that this algorithm indeed closes the gap. The fact that this operator has better spectral properties than, for instance, the standard random walk operator has been used in the past in the context of random matrices and random graphs [13-15]. In the theory of zeta functions of graphs, it is known as the edge adjacency operator, or the Hashimoto matrix [16]. It has been used to show fast mixing for the non-backtracking random walk [17], and arises in connection to belief propagation [18, 19], in particular to rigorously analyze the behavior of belief propagation for clustering problems on regular graphs [5]. It has also been used as a feature vector to classify graphs [20]. However, using this operator as a foundation for spectral clustering and community detection appears to be novel. We show that the resulting spectral algorithms are optimal for networks generated by the stochastic block model, finding communities all the way down to the detectability transition. That is, at any point above this transition, there is a gap between the eigenvalues related to the community structure and the bulk distribution of eigenvalues coming from the random graph structure, allowing us to find a labeling correlated with the true communities. In addition to our analytic results on stochastic block models, we also illustrate the advantages of the non-backtracking operator over existing approaches for some real networks.