Learning from the Past: An Analysis of Person Name Corrections in the DBLP Collection and Social Network Properties of Affected Entities

Reitz, Florian; Hoffmann, Oliver

doi:10.1007/978-3-7091-1346-2_19


Chapter
First Online: 21 December 2012

Part of the book series: Lecture Notes in Social Networks (LNSN, volume 6)

Abstract

Many projects, such as the DBLP bibliography, have to use names as identifiers for persons. Names, however, are not unique, nor is it guaranteed that a person is referred to by only one name. This causes inconsistencies that reduce the data quality of a collection. Although a large number of algorithmic approaches address this problem, little is known about the properties of the inconsistent entities. We show how to extract a large number of past name inconsistencies from the DBLP data set. We analyze the social network properties of these names and of the communities they belong to. We evaluate how useful different properties are for distinguishing defective from non-defective names, and we present an approach that predicts the probability that a name will need correction in the future.
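To make "social network properties of names" concrete, the following is a minimal sketch of how such properties could be computed on a co-authorship graph. It is not the authors' method: the toy publication list, the function names, and the choice of degree, local clustering coefficient, and shared co-authors as example properties are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: each publication is a list of author name strings.
# "A. Smith" and "Alice Smith" stand in for a possible split of one person.
publications = [
    ["A. Smith", "B. Jones", "C. Lee"],
    ["A. Smith", "D. Kim"],
    ["Alice Smith", "B. Jones"],
    ["C. Lee", "D. Kim"],
]


def build_coauthor_graph(pubs):
    """Map each name to the set of names it has co-authored with."""
    graph = defaultdict(set)
    for authors in pubs:
        for name in authors:
            graph[name]  # make sure single-author names appear as nodes
        for a, b in combinations(set(authors), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def degree(graph, name):
    """Number of distinct co-authors of a name."""
    return len(graph[name])


def clustering_coefficient(graph, name):
    """Fraction of pairs of co-authors that are co-authors of each other."""
    neighbours = graph[name]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
    return 2.0 * links / (k * (k - 1))


def shared_coauthors(graph, name_a, name_b):
    """Co-authors that two names have in common."""
    return graph[name_a] & graph[name_b]
```

In this toy graph, two names that share co-authors but never appear on the same publication are the kind of pair a curator might inspect as a possible split of one person. Whether such a signal is actually useful is what the chapter evaluates.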

Notes

1. http://dblp.uni-trier.de/xml/hdblp.xml.gz


Acknowledgements

We thank Manh Cuong Pham and Ralf Klamma for providing us with the thematic clustering data.

Author information

Authors and Affiliations

University of Trier, Universitätsring 1, Trier, Germany
Florian Reitz & Oliver Hoffmann
Schloss Dagstuhl – Leibniz-Zentrum für Informatik GmbH, Wadern, Germany
Oliver Hoffmann

Corresponding author

Correspondence to Florian Reitz.

Editor information

Editors and Affiliations

Department of Computer Engineering, TOBB University, Sogutozu Cad No. 43, Sogutozu Ankara, Turkey
Tansel Özyer
Computer Science, University of Calgary, University Dr. NW 2500, Calgary, T2N 1N4, Canada
Jon Rokne
IPSC, European Commission Joint Research Cent., Via Enrico Fermi 2749, Ispra, 21027, Italy
Gerhard Wagner
De Wetstraat 16, Leiden, 2332 XT, Netherlands
Arno H.P. Reuser

About this chapter

Cite this chapter

Reitz, F., Hoffmann, O. (2013). Learning from the Past: An Analysis of Person Name Corrections in the DBLP Collection and Social Network Properties of Affected Entities. In: Özyer, T., Rokne, J., Wagner, G., Reuser, A. (eds) The Influence of Technology on Social Network Analysis and Mining. Lecture Notes in Social Networks, vol 6. Springer, Vienna. https://doi.org/10.1007/978-3-7091-1346-2_19

DOI: https://doi.org/10.1007/978-3-7091-1346-2_19
Published: 21 December 2012
Publisher Name: Springer, Vienna
Print ISBN: 978-3-7091-1345-5
Online ISBN: 978-3-7091-1346-2
eBook Packages: Computer Science, Computer Science (R0)
