Efficient structural node similarity computation on billion-scale graphs

Abstract

Structural node similarity is widely used in analyzing complex networks. Among structural node similarity metrics, role similarity has the merit of indicating automorphism (isomorphism). Existing algorithms for computing role similarity (e.g., RoleSim and NED) suffer from severe performance bottlenecks and thus cannot handle large real-world graphs. In this paper, we propose a new framework, StructSim, to compute nodes’ role similarity. Under this framework, we first prove that StructSim based on the maximum matching is an admissible role similarity metric. Because the maximum matching is still too costly to scale, we then devise the BinCount matching, which is efficient to compute and still guarantees the admissibility of StructSim. BinCount-based StructSim admits a precomputed index that answers a query on a single pair of nodes in \(O(k\log D)\) time, where k is a small user-defined parameter and D is the maximum node degree. To build the index, we further devise an FM-sketch-based technique that can handle graphs with billions of edges. Extensive empirical studies show that StructSim outperforms existing works in both effectiveness and efficiency when computing structural node similarities on real-world graphs.

Notes

  1. It indicates isomorphism when similarity is computed between nodes in two different graphs.

  2. In the following, we use the name of a metric (e.g., RoleSim) interchangeably for the metric and for the algorithm that computes it.

  3. In [23], RoleSim has a third initialization, namely the “ALL-1” initialization, which yields the same similarity scores as the degree-ratio initialization.

  4. https://dblp.uni-trier.de/xml/.

  5. http://www.anac.gov.br/.

  6. http://transtats.bts.gov/.

  7. http://ec.europa.eu/.

  8. We tried \(k=1\), \(k=5\) and \(k=8\) for k-NN and adopted \(k=5\), as all the baselines achieved better results with this value.

References

  1. Ahmed, A., Shervashidze, N., Narayanamurthy, S.M., Josifovski, V., Smola, A.J.: Distributed large-scale natural graph factorization. In: Proceedings of the 22nd International Conference on World Wide Web, pp. 37–48 (2013)

  2. Antonellis, I., Garcia-Molina, H., Chang, C.: Simrank++: query rewriting through link analysis of the click graph. Proc. VLDB Endow. 1(1), 408–421 (2008)

  3. Avis, D.: A survey of heuristics for the weighted matching problem. Networks 13(4), 475–493 (1983)

  4. Belkin, M., Niyogi, P.: Laplacian eigenmaps and spectral techniques for embedding and clustering. In: Advances in Neural Information Processing Systems, NIPS, pp. 585–591 (2001)

  5. BlogCatalog. https://github.com/quark0/TAE/tree/master/data/BlogCatalog-dataset

  6. Boldi, P., Vigna, S.: The webgraph framework I: compression techniques. In: Proceedings of the 13th International Conference on World Wide Web, pp. 595–602 (2004)

  7. Cao, S., Lu, W., Xu, Q.: Grarep: Learning graph representations with global structural information. In: Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pp. 891–900 (2015)

  8. Cao, S., Lu, W., Xu, Q.: Deep neural networks for learning graph representations. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 1145–1152 (2016)

  9. Chamberlain, B.P., Clough, J.R., Deisenroth, M.P.: Neural embeddings of graphs in hyperbolic space. CoRR, abs/1705.10359 (2017)

  10. Chen, X., Lai, L., Qin, L., Lin, X.: Structsim: querying structural node similarity at billion scale. In: 36th IEEE International Conference on Data Engineering, ICDE 2020, Dallas, TX, USA, April 20–24, 2020, pp. 1950–1953 (2020)

  11. Conte, A., Ferraro, G., Grossi, R., Marino, A., Sadakane, K., Uno, T.: Node similarity with q-grams for real-world labeled networks. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1282–1291 (2018)

  12. Davis, D., Yaveroğlu, Ö.N., Malod-Dognin, N., Stojmirovic, A., Pržulj, N.: Topology-function conservation in protein–protein interaction networks. Bioinformatics 31(10), 1632–1639 (2015)

  13. Distinguishability, C.: A theoretical analysis of normalized discounted cumulative gain (ndcg) ranking measures

  14. Donnat, C., Zitnik, M., Hallac, D., Leskovec, J.: Learning structural node embeddings via diffusion wavelets. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1320–1329 (2018)

  15. Flajolet, P., Martin, G.N.: Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci. 31(2), 182–209 (1985)

  16. Fogaras, D., Rácz, B.: Scaling link-based similarity search. In: Proceedings of the 14th International Conference on World Wide Web, pp. 641–650 (2005)

  17. Fujiwara, Y., Nakatsuji, M., Shiokawa, H., Onizuka, M.: Efficient search algorithm for simrank. In: 29th IEEE International Conference on Data Engineering, pp. 589–600 (2013)

  18. Grover, A., Leskovec, J.: node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864 (2016)

  19. Hamilton, W.L., Ying, R., Leskovec, J.: Representation learning on graphs: methods and applications. IEEE Data Eng. Bull. 40(3), 52–74 (2017)

  20. Henderson, K., Gallagher, B., Eliassi-Rad, T., Tong, H., Basu, S., Akoglu, L., Koutra, D., Faloutsos, C., Li, L.: Rolx: structural role extraction & mining in large graphs. In: The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1231–1239 (2012)

  21. Henderson, K., Gallagher, B., Li, L., Akoglu, L., Eliassi-Rad, T., Tong, H., Faloutsos, C.: It’s who you know: graph mining using recursive structural features. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 663–671 (2011)

  22. Jeh, G., Widom, J.: Simrank: a measure of structural-context similarity. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 538–543 (2002)

  23. Jin, R., Lee, V.E., Hong, H.: Axiomatic ranking of network role similarity. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 922–930 (2011)

  24. Jin, R., Lee, V.E., Li, L.: Scalable and axiomatic ranking of network role similarity. ACM Trans. Knowl. Discov. Data 8(1), 3:1–3:37 (2014)

  25. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. J. ACM 46(5), 604–632 (1999)

  26. Kuhn, H.W.: The hungarian method for the assignment problem. In: 50 Years of Integer Programming 1958-2008, pp. 29–47 (2010)

  27. Kusumoto, M., Maehara, T., Kawarabayashi, K.: Scalable similarity search for simrank. In: Proceedings of the 2014 International Conference on Management of Data, pp. 325–336 (2014)

  28. Leicht, E.A., Holme, P., Newman, M.E.: Vertex similarity in networks. Phys. Rev. E 73(2), 026120 (2006)

  29. Leskovec, J., Krevl, A.: SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data (2014)

  30. Li, C., Han, J., He, G., Jin, X., Sun, Y., Yu, Y., Wu, T.: Fast computation of simrank for static and dynamic information networks. In: Proceedings of the 13th International Conference on Extending Database Technology, pp. 465–476 (2010)

  31. Lin, X., Yuan, Y., Zhang, Q., Zhang, Y.: Selecting stars: the k most representative skyline operator. In: Proceedings of the 23rd International Conference on Data Engineering, pp. 86–95 (2007)

  32. Lin, Z., Lyu, M.R., King, I.: Matchsim: a novel neighbor-based similarity measure with maximum neighborhood matching. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 1613–1616 (2009)

  33. Liu, D., Huang, J., Lin, C.: Recommendation with social roles. IEEE Access 6, 36420–36427 (2018)

  34. Liu, Y., Zheng, B., He, X., Wei, Z., Xiao, X., Zheng, K., Lu, J.: Probesim: scalable single-source and top-k simrank computations on dynamic graphs. Proc. VLDB Endow. 11(1), 14–26 (2017)

  35. Lorrain, F., White, H.C.: Structural equivalence of individuals in social networks. J. Math. Sociol. 1(1), 49–80 (1971)

  36. Lyu, T., Zhang, Y., Zhang, Y.: Enhancing the network embedding quality with structural similarity. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 147–156 (2017)

  37. Optimization and approximation in deterministic sequencing and scheduling: a survey. Volume 5 of Annals of Discrete Mathematics, pp. 287–326 (1979)

  38. Ou, M., Cui, P., Pei, J., Zhang, Z., Zhu, W.: Asymmetric transitivity preserving graph embedding. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1105–1114 (2016)

  39. Perozzi, B., Al-Rfou, R., Skiena, S.: Deepwalk: online learning of social representations. In: The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710 (2014)

  40. Perozzi, B., Kulkarni, V., Skiena, S.: Walklets: multiscale graph embeddings for interpretable network classification. CoRR, abs/1605.02115 (2016)

  41. Ribeiro, L.F., Saverese, P.H., Figueiredo, D.R.: struc2vec: learning node representations from structural identity. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 385–394 (2017)

  42. Rosenberg, A., Hirschberg, J.: V-measure: a conditional entropy-based external cluster evaluation measure. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 410–420 (2007)

  43. Rossi, R.A., Gallagher, B., Neville, J., Henderson, K.: Modeling dynamic behavior in large evolving graphs. In: Sixth ACM International Conference on Web Search and Data Mining, pp. 667–676 (2013)

  44. Serrano, M.A., Boguná, M.: Topology of the world trade web. Phys. Rev. E 68(1), 015101 (2003)

  45. Strehl, A., Ghosh, J.: Cluster ensembles - a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3, 583–617 (2002)

  46. Tang, J., Qu, M., Wang, M., Zhang, M., Yan, , Mei, Q.: LINE: large-scale information network embedding. In: Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077 (2015)

  47. Tian, B., Xiao, X.: SLING: a near-optimal index structure for simrank. In: Proceedings of the 2016 International Conference on Management of Data, pp. 1859–1874 (2016)

  48. Wang, D., Cui, P., Zhu, W.: Structural deep network embedding. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1225–1234 (2016)

  49. Wang, X., Tang, L., Gao, H., Liu, H.: Discovering overlapping groups in social media. In: 2010 IEEE International Conference on Data Mining. IEEE, pp. 569–578 (2010)

  50. Wang, Y., Lian, X., Chen, L.: Efficient simrank tracking in dynamic graphs. In: 2018 IEEE 34th International Conference on Data Engineering, pp. 545–556 (2018)

  51. Wasserman, S., Faust, K.: Social Network Analysis: Methods and Applications, vol. 8. Cambridge University Press, Cambridge (1994)

  52. Yu, W., Lin, X., Zhang, W.: Towards efficient simrank computation on large networks. In: 29th IEEE International Conference on Data Engineering, pp. 601–612 (2013)

  53. Yu, W., Lin, X., Zhang, W., Chang, L., Pei, J.: More is simpler: effectively and efficiently assessing node-pair similarities based on hyperlinks. Proc. VLDB Endow. 7(1), 13–24 (2013)

  54. Yu, W., Lin, X., Zhang, W., Pei, J., McCann, J.A.: Simrank: effective and scalable pairwise similarity search based on graph topology. VLDB J. 28(3), 401–426 (2019)

  55. Yu, W., McCann, J.A.: Efficient partial-pairs simrank search for large networks. Proc. VLDB Endow. 8(5), 569–580 (2015)

  56. Yu, W., McCann, J.A.: High quality graph-based similarity search. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 83–92 (2015)

  57. Zhang, K., Statman, R., Shasha, D.: On the editing distance between unordered labeled trees. Inf. Process. Lett. 42(3), 133–139 (1992)

  58. Zhao, P., Han, J., Sun, Y.: P-rank: a comprehensive structural similarity measure over information networks. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 553–562 (2009)

  59. Zheng, W., Zou, L., Feng, Y., Chen, L., Zhao, D.: Efficient simrank-based similarity join over large graphs. Proc. VLDB Endow. 6(7), 493–504 (2013)

  60. Zhu, H., Meng, X., Kollios, G.: NED: an inter-graph node metric based on edit distance. Proc. VLDB Endow. 10(6), 697–708 (2017)

Acknowledgements

Xuemin Lin is supported by NSFC61232006, 2018YFB1003504, ARC DP200101338, ARC DP180103096 and ARC DP170101628. Lu Qin is supported by ARC FT200100787.

Author information

Corresponding author

Correspondence to Longbin Lai.

About this article

Cite this article

Chen, X., Lai, L., Qin, L. et al. Efficient structural node similarity computation on billion-scale graphs. The VLDB Journal (2021). https://doi.org/10.1007/s00778-021-00654-9

Keywords

  • Node similarity
  • Role similarity
  • Efficiency
  • Link analysis