DBkWik: extracting and integrating knowledge from thousands of Wikis

  • Sven Hertling
  • Heiko Paulheim
Regular Paper


Popular cross-domain knowledge graphs, such as DBpedia and YAGO, are built from Wikipedia and are therefore similar in coverage. In contrast, Wikifarms like Fandom contain Wikis for specific topics, which are often complementary to the information contained in Wikipedia, and thus to DBpedia and YAGO. Extracting these Wikis with the DBpedia extraction framework is possible, but results in many isolated knowledge graphs. In this paper, we show how to create one consolidated knowledge graph, called DBkWik, from thousands of Wikis. We perform entity resolution and schema matching, and show that the resulting large-scale knowledge graph is complementary to DBpedia. Furthermore, we discuss the potential use of DBkWik as a benchmark for knowledge graph matching.


Knowledge graph creation, Information extraction, Linked open data, Knowledge graph matching



We would like to thank Alexandra Hofmann, Samresh Perchani, and Jan Portisch, who helped develop the first prototype of DBkWik in the course of a student project.



Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2019

Authors and Affiliations

  1. Data and Web Science Group, University of Mannheim, Mannheim, Germany
