Volume 98, Issue 3, pp 2255–2274

Robust hybrid name disambiguation framework for large databases

  • Jia Zhu
  • Yi Yang
  • Qing Xie
  • Liwei Wang
  • Saeed-Ul Hassan


In many databases, such as scientific bibliography databases, the name attribute is the identifier most commonly used to identify entities. However, names are often ambiguous and not always unique, which causes problems in many fields. Name disambiguation is a non-trivial data management task that aims to distinguish different entities that share the same name. It is particularly difficult in large databases such as digital libraries, where only limited information is available to identify an author. In digital libraries, ambiguous author names arise when multiple authors share the same name or when one person appears under different name variations. Most previous work on this problem employs hierarchical clustering based on information inside the citation records, e.g. co-authors and publication titles. In this paper, we propose a robust hybrid name disambiguation framework that is applicable to digital libraries and can also be easily extended to other applications based on different data sources. We propose a web page genre identification component that identifies the genre of a web page, e.g. whether the page is a personal homepage. In addition, we propose a re-clustering model based on multidimensional scaling that can further improve name disambiguation performance. We evaluated our approach on known corpora, and the favorable experimental results indicate that the proposed framework is feasible.
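To make the clustering idea concrete, the following is a minimal sketch of the kind of hierarchical clustering over citation records that the abstract attributes to previous work. It is not the framework proposed in the paper. The record layout, the feature set (co-author names plus title words), the Jaccard distance, the single-link merge rule, and the stopping threshold are all assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b|; two empty sets count as maximally distant."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def record_features(record):
    # Hypothetical feature set: co-author names plus lower-cased title words.
    coauthors = {"co:" + name for name in record["coauthors"]}
    words = {"w:" + word for word in record["title"].lower().split()}
    return coauthors | words


def single_link_clusters(records, threshold):
    """Agglomerative single-link clustering that stops merging once the
    closest pair of clusters is farther apart than `threshold`."""
    feats = [record_features(r) for r in records]
    clusters = [{i} for i in range(len(records))]
    while len(clusters) > 1:
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            # Single link: cluster distance is the closest pair of members.
            d = min(jaccard_distance(feats[i], feats[j])
                    for i in clusters[x] for j in clusters[y])
            if best is None or d < best[0]:
                best = (d, x, y)
        d, x, y = best
        if d > threshold:
            break
        clusters[x] |= clusters[y]
        del clusters[y]  # y > x, so index x stays valid
    return sorted(sorted(c) for c in clusters)


# Toy citation records that all carry the same ambiguous author name.
records = [
    {"title": "Graph clustering methods", "coauthors": ["A. Smith", "B. Lee"]},
    {"title": "Scalable graph clustering", "coauthors": ["A. Smith"]},
    {"title": "Protein folding dynamics", "coauthors": ["C. Gomez"]},
    {"title": "Protein folding simulation", "coauthors": ["C. Gomez", "D. Park"]},
]
clusters = single_link_clusters(records, threshold=0.7)
print(clusters)  # [[0, 1], [2, 3]]
```

Each resulting cluster is read as one real-world author. With the assumed threshold of 0.7, the two graph-clustering records and the two protein-folding records end up in separate groups. A lower threshold leaves more records unmerged.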


Name disambiguation, Multidimensional scaling, Genre identification, Clustering



Copyright information

© Akadémiai Kiadó, Budapest, Hungary 2013

Authors and Affiliations

  • Jia Zhu (1)
  • Yi Yang (2)
  • Qing Xie (3)
  • Liwei Wang (4)
  • Saeed-Ul Hassan (5)

  1. School of Computer Science, South China Normal University, Guangzhou, China
  2. School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
  3. Division of CEMSE, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
  4. Wuhan University, Wuhan, China
  5. COMSATS Institute of Information Technology, Lahore, Pakistan
