Named Entity Matching in Publication Databases

Szczuka, Marcin; Betliński, Paweł; Herba, Kamil

doi:10.1007/978-3-642-32115-3_20

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7413)

Included in the following conference series:

International Conference on Rough Sets and Current Trends in Computing

Abstract

We present a case study in approximate data matching for a database system that contains information about scientific publications. The approximate matching process is meant to determine whether several records in the database are in fact repeated instances of the same real-world object. In our case study we are concerned with matching instances of objects such as XML documents, persons’ names, affiliations, journal names, and so on. The particular data we are dealing with is a representation of the PubMed Central document corpus within the data warehouse that is a part of the SONCA system. The SONCA system is being developed as one of the components of the general scientific information platform SYNAT.

This work was supported by grant N N516 077837 from the Ministry of Science and Higher Education of the Republic of Poland, by Polish National Science Centre grant 2011/01/B/ST6/03867, and by the Polish National Centre for Research and Development (NCBiR) under Grant No. SP/I/1/77065/10 within the strategic scientific research and experimental development program “Interdisciplinary System for Interactive Scientific and Scientific-Technical Information”.

Author information

Authors and Affiliations

Institute of Mathematics, The University of Warsaw, Banacha 2, 02-097, Warsaw, Poland
Marcin Szczuka, Paweł Betliński & Kamil Herba

Editor information

Editors and Affiliations

Department of Computer Science, University of Regina, S4S 0A2, Regina, SK, Canada
JingTao Yao
School of Information Science and Technology, Southwest Jiaotong University, 610031, Chengdu, P.R. China
Yan Yang
Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965, Poznan, Poland
Roman Słowiński
Faculty of Economics, University of Catania, Corso Italia, 55, 95129, Catania, Italy
Salvatore Greco
School of Management and Engineering, Nanjing University, 210093, Nanjing, Jiangsu, P.R. China
Huaxiong Li
Machine Intelligence Unit, Indian Statistical Institute (ISI), 700108, Kolkata, India
Sushmita Mitra
Polish-Japanese Institute of Information Technology, Koszykowa 86, 02-008, Warsaw, Poland
Lech Polkowski

About this paper

Cite this paper

Szczuka, M., Betliński, P., Herba, K. (2012). Named Entity Matching in Publication Databases. In: Yao, J., et al. Rough Sets and Current Trends in Computing. RSCTC 2012. Lecture Notes in Computer Science, vol 7413. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32115-3_20

DOI: https://doi.org/10.1007/978-3-642-32115-3_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32114-6
Online ISBN: 978-3-642-32115-3
eBook Packages: Computer Science (R0)
