Abstract
We present a case study in approximate data matching for a database system that contains information about scientific publications. The approximate matching process is meant to identify whether several records in the database are in fact repeated instances of the same real-world object. In our case study we are concerned with matching instances of objects such as XML documents, persons’ names, affiliations, journal names, and so on. The data we work with is a representation of the PubMed Central document corpus within the data warehouse that is part of the SONCA system. The SONCA system is being developed as one of the components of the general scientific information platform SYNAT.
This work was supported by grant N N516 077837 from the Ministry of Science and Higher Education of the Republic of Poland, by the Polish National Science Centre grant 2011/01/B/ST6/03867, and by the Polish National Centre for Research and Development (NCBiR) under Grant No. SP/I/1/77065/10 within the strategic scientific research and experimental development program: “Interdisciplinary System for Interactive Scientific and Scientific-Technical Information”.
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Szczuka, M., Betliński, P., Herba, K. (2012). Named Entity Matching in Publication Databases. In: Yao, J., et al. Rough Sets and Current Trends in Computing. RSCTC 2012. Lecture Notes in Computer Science, vol 7413. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32115-3_20
DOI: https://doi.org/10.1007/978-3-642-32115-3_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32114-6
Online ISBN: 978-3-642-32115-3
eBook Packages: Computer Science (R0)