Anytime Large-Scale Analytics of Linked Open Data

  • Arnaud Soulet
  • Fabian M. Suchanek
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11778)


Analytical queries are queries with numerical aggregators: computing the average number of objects per property, identifying the most frequent subjects, etc. Such queries are essential to monitor the quality and the content of the Linked Open Data (LOD) cloud. Many analytical queries cannot be executed directly on the SPARQL endpoints, because the fair use policy cuts off expensive queries. In this paper, we show how to rewrite such queries into a set of queries that each satisfy the fair use policy. We then show how to execute these queries in such a way that the result provably converges to the exact query answer. Our algorithm is an anytime algorithm, meaning that it can give intermediate approximate results at any time point. Our experiments show that the approach converges rapidly towards the exact solution, and that it can compute even complex indicators at the scale of the LOD cloud.
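
The anytime behaviour described above can be illustrated with a minimal sketch. It is not the authors' algorithm: it only assumes that an expensive analytical query (here, the average number of distinct objects per property) can be decomposed into one cheap sub-query per property, each small enough to pass a fair use policy. The in-memory `TRIPLES` list and the helper names `count_objects` and `anytime_average` are hypothetical stand-ins for a real SPARQL endpoint and its sub-queries.

```python
import random

# Toy in-memory stand-in for an endpoint: (subject, property, object) triples.
TRIPLES = [
    ("s1", "p1", "o1"), ("s1", "p1", "o2"), ("s2", "p1", "o1"),
    ("s1", "p2", "o3"), ("s3", "p2", "o4"),
    ("s2", "p3", "o5"),
]


def count_objects(prop):
    """Hypothetical cheap sub-query: distinct objects of a single property."""
    return len({o for (_, p, o) in TRIPLES if p == prop})


def anytime_average(properties, seed=0):
    """Yield a running estimate after each sub-query.

    Sub-queries are issued in random order, so every intermediate value is
    an estimate from a random sample of properties; once all properties are
    processed, the value equals the exact average.
    """
    order = list(properties)
    random.Random(seed).shuffle(order)
    total = 0
    for done, prop in enumerate(order, start=1):
        total += count_objects(prop)
        yield total / done  # usable at any time as the current answer


properties = sorted({p for (_, p, _) in TRIPLES})
estimates = list(anytime_average(properties))
exact = sum(count_objects(p) for p in properties) / len(properties)
```

A caller can stop consuming the generator at any point and keep the last value as an approximate answer. In this toy setting the final estimate coincides with `exact` because every partition has been visited by then.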



This work was partially supported by the grant ANR-16-CE23-0007-01 (“DICOS”).



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Université de Tours, LIFAT, Blois, France
  2. Telecom Paris, Institut Polytechnique de Paris, Paris, France
