Abstract
Many projects, such as the DBLP bibliography, have to use names as identifiers for persons. Names, however, are not unique, nor is it guaranteed that a person is referred to by only one name. This causes inconsistencies that reduce the data quality of a collection. Although there are many algorithmic approaches to this problem, little is known about the properties of the inconsistent entities. We show how to extract a large number of past name inconsistencies from the DBLP data set. We analyze the social network properties of these names and of the communities they belong to. We evaluate how useful different properties are for distinguishing defective from non-defective names and present an approach that predicts the probability that a name will need correction in the future.
Acknowledgements
We thank Manh Cuong Pham and Ralf Klamma for providing us with the thematic clustering data.
Copyright information
© 2013 Springer-Verlag Wien
About this chapter
Cite this chapter
Reitz, F., Hoffmann, O. (2013). Learning from the Past: An Analysis of Person Name Corrections in the DBLP Collection and Social Network Properties of Affected Entities. In: Özyer, T., Rokne, J., Wagner, G., Reuser, A. (eds) The Influence of Technology on Social Network Analysis and Mining. Lecture Notes in Social Networks, vol 6. Springer, Vienna. https://doi.org/10.1007/978-3-7091-1346-2_19
DOI: https://doi.org/10.1007/978-3-7091-1346-2_19
Publisher Name: Springer, Vienna
Print ISBN: 978-3-7091-1345-5
Online ISBN: 978-3-7091-1346-2
eBook Packages: Computer Science, Computer Science (R0)