Metadata Enrichment via Topic Models for Author Name Disambiguation

Bernardi, Raffaella; Le, Dieu-Thu

doi:10.1007/978-3-642-23160-5_7

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6699)

Abstract

This paper tackles the well-known problem of Author Name Disambiguation (AND) in Digital Libraries (DL). Following [14,13], we assume that an individual tends to create a distinctively coherent body of work, which can hence form a single cluster containing all of his/her articles while distinguishing them from those of everyone else with the same name. Still, we believe the information contained in a DL may not be sufficient to allow automatic detection of such clusters. This lack of information becomes even more evident in federated digital libraries, where the labels assigned by librarians may belong to different controlled vocabularies or different classification systems, and in digital libraries on the web, where records may be assigned neither subject headings nor classification numbers. Hence, we exploit Topic Models, extracted from Wikipedia, to enhance record metadata, and we use Agglomerative Clustering to disambiguate ambiguous author names by clustering together similar records; records in different clusters are assumed to have been written by different people. We investigate the following two research questions: (a) Are the Classification System and Subject Heading labels manually assigned by librarians general and informative enough to disambiguate Author Names via clustering techniques? (b) Do Topic Models induce from large corpora the conceptual information necessary for automatically labelling DL metadata and grasp topic similarities of the records? To answer these questions, we will use the Library Catalogue of the Bolzano University Library as a case study.
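
The clustering step described above can be illustrated with a minimal sketch. It assumes that each record sharing an ambiguous author name has already been mapped to a vector of topic proportions (for example, inferred by a topic model); the toy vectors, the cosine similarity measure, the average linkage, and the stopping threshold are all illustrative assumptions and are not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def agglomerative(vectors, threshold):
    """Average-linkage agglomerative clustering over topic vectors.

    Starts with one cluster per record and repeatedly merges the two most
    similar clusters until no pair reaches `threshold` (a hypothetical
    stopping criterion chosen for this sketch).
    Returns a sorted list of clusters, each a sorted list of record indices.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        # Mean pairwise similarity between the members of two clusters.
        sims = [cosine(vectors[i], vectors[j]) for i in a for j in b]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        (x, y), best = max(
            (((a, b), linkage(clusters[a], clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        merged = sorted(clusters[x] + clusters[y])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(clusters)


# Toy topic distributions for four records that share one author name
# (made-up values, three topics each).
records = [
    [0.90, 0.05, 0.05],
    [0.80, 0.15, 0.05],
    [0.05, 0.10, 0.85],
    [0.10, 0.05, 0.85],
]
clusters = agglomerative(records, threshold=0.8)
print(clusters)
```

Under these assumptions, records 0 and 1 end up in one cluster and records 2 and 3 in another, which would be read as two distinct authors with the same name. In practice the threshold would have to be tuned on labelled data.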

References

1. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
2. Di Lauro, T., Choudhury, G.S., Patton, M., Warner, J.W., Brown, E.W.: Automated name authority control and enhanced searching in the levy collection. D-Lib Magazine 7(4) (2001)
3. Han, H., Zha, H., Lee Giles, C.: Name disambiguation in author citations using a k-way spectral clustering method. In: JCDL 2005: Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 334–343. ACM, New York (2005)
4. Heinrich, G.: Parameter estimation for text analysis. Technical report, University of Leipzig (2008)
5. Herskovic, J.R., Tanaka, L.Y., Hersh, W., Bernstam, E.V.: A day in the life of pubmed: analysis of typical days’ query log. J. Amer. Med. Inform. Ass. 14, 212–220 (2007)
6. Huang, J., Ertekin, S., Giles, C.L.: Efficient name disambiguation for large-scale databases. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) PKDD 2006. LNCS (LNAI), vol. 4213, pp. 536–544. Springer, Heidelberg (2006)
7. Le, D.-T., Nguyen, C.-T., Ha, Q.-T., Phan, X.H., Horiguchi, S.: Matching and ranking with hidden topics towards online contextual advertising. In: Web Intelligence, Sydney, NSW, Australia, pp. 888–891 (2008)
8. Newman, D., Hagedorn, K., Chemudugunta, C., Smyth, P.: Subject metadata enrichment using statistical topic models. In: JCDL 2007: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 366–375. ACM, New York (2007)
9. On, B.-W., Lee, D., Kang, J., Mitra, P.: Comparative study of name disambiguation problem using a scalable blocking-based framework. In: JCDL 2005: Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 344–353. ACM, New York (2005)
10. Pearson, K.: On lines and planes of closest fit to systems of points in space. London, Edinburgh and Dublin Philosophical Magazine and Journal of Science 2(11), 559–572 (1901)
11. Phan, X.-H., Nguyen, C.-T., Le, D.-T., Nguyen, L.-M., Horiguchi, S., Ha, Q.-T.: A hidden topic-based framework towards building applications with short web documents. IEEE Transactions on Knowledge and Data Engineering 99 (2010) (prePrints)
12. Steyvers, M., Griffiths, T.: Probabilistic topic models. In: Landauer, T., McNamara, D., Dennis, S., Kintsch, W. (eds.) Latent Semantic Analysis: A Road to Meaning. Lawrence Erlbaum, Mahwah (2006)
13. Torvik, V.I., Smalheiser, N.R.: Author name disambiguation in medline. ACM Trans. Knowl. Discov. Data 3(3), 1–29 (2009)
14. Torvik, V.I., Weeber, M., Swanson, D.R., Smalheiser, N.R.: A probabilistic similarity metric for medline records: A model for author name disambiguation: Research articles. J. Am. Soc. Inf. Sci. Technol. 56(2), 140–158 (2005)

Author information

Authors and Affiliations

DISI, University of Trento, Italy
Raffaella Bernardi & Dieu-Thu Le

Editor information

Editors and Affiliations

University of Trento, Povo, Italy
Raffaella Bernardi & Ilya Zaihrayeu
The European Library, c/o De Koninklijke Bibliotheek, The National Library of the Netherlands, The Hague, The Netherlands
Sally Chambers
University of Bremen, Germany
Björn Gottfried
Xerox Research Centre Europe, Meylan, France
Frédérique Segond

About this paper

Cite this paper

Bernardi, R., Le, DT. (2011). Metadata Enrichment via Topic Models for Author Name Disambiguation. In: Bernardi, R., Chambers, S., Gottfried, B., Segond, F., Zaihrayeu, I. (eds) Advanced Language Technologies for Digital Libraries. NLP4DL 2009, AT4DL 2009. Lecture Notes in Computer Science, vol 6699. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23160-5_7

DOI: https://doi.org/10.1007/978-3-642-23160-5_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23159-9
Online ISBN: 978-3-642-23160-5
eBook Packages: Computer Science, Computer Science (R0)
