Entity Deduplication on ScholarlyData

  • Ziqi Zhang
  • Andrea Giovanni Nuzzolese
  • Anna Lisa Gentile
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10249)

Abstract

ScholarlyData is the new and currently the largest reference linked dataset of the Semantic Web community about papers, people, organisations, and events related to its academic conferences. Originally started from the Semantic Web Dog Food (SWDF), it addressed multiple issues on data representation and maintenance by (i) adopting a novel data model and (ii) establishing an open source workflow to support the addition of new data from the community. Nevertheless, the major issue with the current dataset is the presence of multiple URIs for the same entities, typically persons and organisations. In this work we: (i) perform entity deduplication on the whole dataset, using supervised classification methods; (ii) devise a protocol to choose the most representative URI for an entity and deprecate duplicated ones, while ensuring backward compatibility for them; (iii) incorporate the automatic deduplication step in the general workflow to reduce the creation of duplicate URIs when adding new data. Our early experiments focused on person and organisation URIs, and the results show significant improvement over state-of-the-art solutions. On the entire dataset, we consolidated over 100 pairs of duplicate person URIs and over 800 pairs of duplicate organisation URIs, together with their associated triples (over 1,800 and 5,000 respectively), significantly improving the overall quality and connectivity of the data graph. With deduplication integrated into the ScholarlyData data publishing workflow, we believe that this work serves as a major step towards the creation of clean, high-quality scholarly linked data on the Semantic Web.
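
The consolidation step in (ii) can be illustrated with a minimal sketch. It is not the authors' protocol, which the abstract does not detail. It assumes that a classifier has already produced pairs of URIs judged to be duplicates, groups them into clusters, and picks one representative per cluster. The selection rule (the URI with the most associated triples, ties broken alphabetically), the function name `consolidate`, and the `ex:` URIs are all hypothetical.

```python
from collections import defaultdict


def find(parent, x):
    # Union-find lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def consolidate(duplicate_pairs, triple_counts):
    """Map each deprecated URI to the representative of its duplicate cluster.

    duplicate_pairs: iterable of (uri_a, uri_b) judged to be the same entity.
    triple_counts:   dict of uri -> number of triples mentioning that URI.
    """
    parent = {}
    for a, b in duplicate_pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters

    clusters = defaultdict(list)
    for uri in parent:
        clusters[find(parent, uri)].append(uri)

    mapping = {}
    for members in clusters.values():
        # Hypothetical rule: most triples wins, ties broken by URI string.
        canonical = min(members, key=lambda u: (-triple_counts.get(u, 0), u))
        for uri in members:
            if uri != canonical:
                mapping[uri] = canonical
    return mapping


pairs = [("ex:a", "ex:b"), ("ex:b", "ex:c"), ("ex:x", "ex:y")]
counts = {"ex:a": 5, "ex:b": 12, "ex:c": 3, "ex:x": 2, "ex:y": 2}
deprecated_to_canonical = consolidate(pairs, counts)
```

In a sketch like this, the resulting mapping could be used to keep each deprecated URI resolvable by linking it to its representative, which is one possible way to preserve backward compatibility.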

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Ziqi Zhang (1)
  • Andrea Giovanni Nuzzolese (2)
  • Anna Lisa Gentile (3, corresponding author)
  1. Nottingham Trent University, Nottingham, UK
  2. Semantic Technology Lab, ISTC-CNR, Rome, Italy
  3. IBM Research Almaden, San Jose, USA
