Extraction of Co-authorship Networks

  • Miloš Savić
  • Mirjana Ivanović
  • Lakhmi C. Jain
Part of the Intelligent Systems Reference Library book series (ISRL, volume 148)


Extracting a co-authorship network from a set of bibliographic records in which articles and authors are uniquely identified is straightforward. However, in the vast majority of bibliographic databases authors are identified by their names. Ambiguous author names therefore make it difficult to identify the nodes of a co-authorship network correctly. In this chapter we present an overview of initial-based, heuristic and machine learning approaches to the name disambiguation problem. We then study the performance of various string similarity measures for detecting name synonyms in bibliographic records. After that, we propose a novel method for disambiguating author names that is based on reference similarity networks and community detection techniques. Finally, we present a case study investigating the impact of name disambiguation on the structure of co-authorship networks.
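
To illustrate why the uniquely identified case is the simple one, the following minimal Python sketch builds a weighted co-authorship network from records keyed by article identifier. It is an illustration only, not the chapter's implementation: the function name `extract_coauthorship_network`, the record layout and the identifiers `a1`, `art1` and so on are all hypothetical. The sketch assumes every author already has a unique identifier, which is exactly the assumption that fails when authors are identified by name.

```python
from collections import Counter
from itertools import combinations


def extract_coauthorship_network(records):
    """Build a co-authorship network from uniquely identified records.

    `records` maps an article ID to a list of author IDs (a hypothetical
    layout chosen for this sketch). Returns the set of nodes and a Counter
    that maps each unordered author pair to the number of joint articles.
    """
    nodes = set()
    edges = Counter()
    for author_ids in records.values():
        # Deduplicate and sort so that every pair has one canonical form.
        unique_authors = sorted(set(author_ids))
        nodes.update(unique_authors)
        # Each pair of co-authors of the same article shares one link.
        for a, b in combinations(unique_authors, 2):
            edges[(a, b)] += 1
    return nodes, edges


# Hypothetical toy data: three articles, four authors.
records = {
    "art1": ["a1", "a2", "a3"],
    "art2": ["a1", "a2"],
    "art3": ["a4"],
}
nodes, edges = extract_coauthorship_network(records)
```

In this toy input the single-author article contributes an isolated node and no links. With name-based records, the same loop would merge distinct authors who share a name and split one author who appears under several name variants, which is the problem the chapter addresses.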



Copyright information

© Springer International Publishing AG, part of Springer Nature 2019

Authors and Affiliations

  • Miloš Savić (1)
  • Mirjana Ivanović (1)
  • Lakhmi C. Jain (2)
  1. Faculty of Sciences, Department of Mathematics and Informatics, University of Novi Sad, Novi Sad, Serbia
  2. Centre for Artificial Intelligence, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, Australia