
BTC-2019: The 2019 Billion Triple Challenge Dataset

  • José-Miguel Herrera
  • Aidan Hogan
  • Tobias Käfer
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11779)

Abstract

Six datasets have been published under the title of Billion Triple Challenge (BTC) since 2008. Each such dataset contains billions of triples extracted from millions of documents crawled from hundreds of domains. While these datasets were originally motivated by the annual ISWC competition from which they take their name, they have since become widely used in other contexts, forming a key resource for a variety of research works concerned with managing and/or analysing diverse, real-world RDF data as found natively on the Web. Given that the last BTC dataset was published in 2014, we prepare and publish a new version, BTC-2019, containing 2.2 billion quads parsed from 2.6 million documents on 394 pay-level domains. This paper first motivates the BTC datasets with a survey of research works using these datasets. Next, we provide details of how the BTC-2019 crawl was configured. We then present and discuss a variety of statistics that aim to give insights into the content of BTC-2019. Finally, we discuss the hosting of the dataset and the ways in which it can be accessed, remixed and used.

Resource DOI:  https://doi.org/10.5281/zenodo.2634588

Resource type: Dataset

Notes

Acknowledgements

This work was supported by Fondecyt Grant No. 1181896 and by the Millennium Institute for Foundational Research on Data (IMFD).

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • José-Miguel Herrera (1)
  • Aidan Hogan (1)
  • Tobias Käfer (2)
  1. IMFD; DCC, Universidad de Chile, Santiago, Chile
  2. Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
