Ethnicity Sensitive Author Disambiguation Using Semi-supervised Learning

  • Conference paper
Knowledge Engineering and Semantic Web (KESW 2016)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 649)

Abstract

Building on more than one million crowdsourced annotations that we publicly release, we propose a new automated disambiguation solution exploiting this data (i) to learn an accurate classifier for identifying coreferring authors and (ii) to guide the clustering of scientific publications by distinct authors in a semi-supervised way. To the best of our knowledge, our analysis is the first to be carried out on data of this size and coverage. With respect to the state of the art, we validate the general pipeline used in most existing solutions, and improve by: (i) proposing new phonetic-based blocking strategies, thereby increasing recall; (ii) adding strong ethnicity-sensitive features for learning a linkage function, thereby tailoring disambiguation to non-Western author names whenever necessary; and (iii) showing the importance of balancing negative and positive examples when learning the linkage function.
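The phonetic blocking idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes American Soundex as the phonetic key and groups signatures by surname only. The function names `soundex` and `block_by_surname`, the "Surname, Given" name format, and the sample names are all hypothetical and are not taken from the paper, which may use a different phonetic algorithm or blocking key.

```python
from collections import defaultdict

# Soundex digit for each consonant; vowels, H, W and Y have no digit.
_CODES = {}
for _letters, _digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")]:
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return the 4-character American Soundex key (ASCII letters only)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        # H and W do not separate two letters that share a digit.
        if ch not in "HW":
            prev = digit
    return (code + "000")[:4]


def block_by_surname(signatures):
    """Group 'Surname, Given' strings by the Soundex key of the surname."""
    blocks = defaultdict(list)
    for full_name in signatures:
        surname = full_name.split(",")[0]
        blocks[soundex(surname)].append(full_name)
    return dict(blocks)


# Hypothetical signatures: spelling variants land in the same block.
signatures = ["Muller, K.", "Mueller, Klaus", "Miller, J.",
              "Smith, A.", "Smyth, Alan"]
blocks = block_by_surname(signatures)
```

Compared with blocking on the exact surname string, a phonetic key places spelling variants such as "Muller" and "Mueller" in the same block, which is how such a strategy can raise recall. The cost is larger blocks that also contain distinct names such as "Miller", which the later linkage and clustering steps then have to separate.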

Notes

  1. \((n,m)\)-TF-IDF vectors are TF-IDF vectors computed from \(n\)-, \((n+1)\)-, ..., \(m\)-grams.

  2. https://github.com/inspirehep/beard

  3. https://github.com/glouppe/paper-author-disambiguation

  4. This holds for the data we extracted, but may in the future, with the rise of non-Western researchers, be an underestimate of the ambiguous cases.
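As an illustration of the \((n,m)\)-TF-IDF vectors defined in note 1, the sketch below builds them from character n-grams in plain Python. The choice of character (rather than word) n-grams, the smoothed IDF formula, the L2 normalisation, and the helper names `char_ngrams`, `tfidf_vectors` and `cosine` are assumptions made for this example, not details confirmed by the source.

```python
import math
from collections import Counter


def char_ngrams(text, n, m):
    """Return all character k-grams of text for k = n, n+1, ..., m."""
    grams = []
    for size in range(n, m + 1):
        grams.extend(text[i:i + size] for i in range(len(text) - size + 1))
    return grams


def tfidf_vectors(docs, n, m):
    """Return one L2-normalised sparse (n,m)-TF-IDF vector (dict) per doc."""
    counts = [Counter(char_ngrams(d.lower(), n, m)) for d in docs]
    # Document frequency: number of docs in which each n-gram occurs.
    df = Counter(g for c in counts for g in c)
    total = len(docs)
    vectors = []
    for c in counts:
        # Assumed smoothed IDF: ln((1 + N) / (1 + df)) + 1.
        vec = {g: tf * (math.log((1 + total) / (1 + df[g])) + 1.0)
               for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * v.get(g, 0.0) for g, w in u.items())


# Hypothetical name strings, compared with (2,4)-TF-IDF vectors.
names = ["louppe, g", "louppe, gilles", "susik, m"]
v0, v1, v2 = tfidf_vectors(names, 2, 4)
same_author_score = cosine(v0, v1)
other_author_score = cosine(v0, v2)
```

With these inputs the two "louppe" strings share most of their n-grams, so their similarity is higher than that between "louppe, g" and "susik, m". A similarity of this kind could serve as one input feature to a linkage function.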

Author information

Corresponding author

Correspondence to Mateusz Susik.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Louppe, G., Al-Natsheh, H.T., Susik, M., Maguire, E.J. (2016). Ethnicity Sensitive Author Disambiguation Using Semi-supervised Learning. In: Ngonga Ngomo, AC., Křemen, P. (eds) Knowledge Engineering and Semantic Web. KESW 2016. Communications in Computer and Information Science, vol 649. Springer, Cham. https://doi.org/10.1007/978-3-319-45880-9_21

  • DOI: https://doi.org/10.1007/978-3-319-45880-9_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-45879-3

  • Online ISBN: 978-3-319-45880-9

  • eBook Packages: Computer Science, Computer Science (R0)
