A fast and integrative algorithm for clustering performance evaluation in author name disambiguation

  • Jinseok Kim
Article

Abstract

Clustering results in author name disambiguation are often evaluated with measures such as Cluster-F, K-metric, Pairwise-F, Splitting and Lumping Error, and B-cubed. Although these measures take different evaluation approaches, this paper shows that they can all be calculated within a single framework through a common set of steps. These steps compare truth and predicted clusters using two hash tables that record name instances with their predicted cluster indices and the frequencies of those indices per truth cluster. This integrative calculation greatly reduces runtime and scales to clustering tasks involving millions of name instances, which can be evaluated within a few seconds. During the integration process, B-cubed and K-metric are shown to produce the same precision and recall scores. In addition, name instance pairs for Pairwise-F are counted using a heuristic, which enables the proposed method to run faster than a state-of-the-art algorithm. Details of the integrative calculation are described with examples and pseudo-code to help scholars implement each measure easily and validate the correctness of their implementations. The integrative calculation will help scholars compare the similarities and differences of multiple measures before selecting those that best characterize the clustering performance of their disambiguation methods.
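
To make the idea of a shared calculation concrete, the following Python sketch computes B-cubed, K-metric, Pairwise-F, and Cluster-F from one pass over the data. It is not the paper's pseudo-code, which is not reproduced on this page. It assumes one plausible reading of the two hash tables (instance to predicted cluster index, and truth cluster to frequencies of predicted indices), uses the standard textbook definitions of each measure, and counts pairs with the closed form n(n-1)/2, which may differ from the heuristic the paper describes. The function name `evaluate_clusters`, its dictionary inputs, and the convention of returning 0.0 when a denominator is zero are all assumptions made for illustration. Splitting and Lumping Error is omitted.

```python
from collections import Counter, defaultdict
from math import sqrt


def evaluate_clusters(truth, pred):
    """Hypothetical helper. truth, pred: dicts of instance id -> cluster label."""
    # Table 1 (assumed): instance -> predicted cluster index; `pred` plays that role.
    # Table 2 (assumed): truth cluster -> frequency of each predicted index in it.
    freq = defaultdict(Counter)
    for inst, t in truth.items():
        freq[t][pred[inst]] += 1

    truth_size = {t: sum(c.values()) for t, c in freq.items()}
    pred_size = Counter(pred[inst] for inst in truth)
    n = len(truth)

    def pairs(k):
        # Number of unordered pairs among k instances.
        return k * (k - 1) // 2

    def ratio(a, b):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return a / b if b else 0.0

    def f1(p, r):
        return ratio(2 * p * r, p + r)

    # (truth cluster, predicted cluster, overlap count) triples from table 2.
    cells = [(t, j, k) for t, c in freq.items() for j, k in c.items()]

    # B-cubed precision/recall, averaged over instances.
    b3_p = sum(k * k / pred_size[j] for _, j, k in cells) / n
    b3_r = sum(k * k / truth_size[t] for t, _, k in cells) / n

    # K-metric: average cluster purity (ACP) and average author purity (AAP).
    # Under these definitions they are the same sums as B-cubed precision/recall.
    acp, aap = b3_p, b3_r

    # Pairwise counts derived from overlap sizes rather than enumerating pairs.
    tp = sum(pairs(k) for _, _, k in cells)
    pw_p = ratio(tp, sum(pairs(s) for s in pred_size.values()))
    pw_r = ratio(tp, sum(pairs(s) for s in truth_size.values()))

    # Cluster-level: predicted clusters that exactly equal a truth cluster.
    exact = sum(1 for t, j, k in cells if k == truth_size[t] == pred_size[j])
    cl_p = ratio(exact, len(pred_size))
    cl_r = ratio(exact, len(truth_size))

    return {
        "bcubed": (b3_p, b3_r, f1(b3_p, b3_r)),
        "k_metric": (acp, aap, sqrt(acp * aap)),
        "pairwise": (pw_p, pw_r, f1(pw_p, pw_r)),
        "cluster": (cl_p, cl_r, f1(cl_p, cl_r)),
    }


if __name__ == "__main__":
    truth = {"a": "A", "b": "A", "c": "A", "d": "B", "e": "B"}
    pred = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 3}
    for name, scores in evaluate_clusters(truth, pred).items():
        print(name, scores)
```

Because every measure is derived from the same overlap counts, the cost is linear in the number of name instances plus the number of non-empty truth/predicted cluster overlaps, and no instance pairs are ever enumerated.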

Keywords

Author name disambiguation, Entity resolution, Clustering, Evaluation measure, Pairwise-F

Notes

Acknowledgements

This work was supported by grants from the National Science Foundation (#1561687 and #1535370), the Alfred P. Sloan Foundation and the Ewing Marion Kauffman Foundation.

Copyright information

© Akadémiai Kiadó, Budapest, Hungary 2019

Authors and Affiliations

  1. Institute for Research on Innovation and Science, Survey Research Center, Institute for Social Research, University of Michigan, Ann Arbor, USA
