PathQuery Pregel: high-performance graph query with bulk synchronous processing

Abstract

High-performance graph query systems are a scalable way to mine information in Knowledge Graphs, especially when the queries benefit from a high-level expressive query language. This paper presents techniques to algorithmically compile queries expressed in a high-level language (e.g., Datalog) into a directed acyclic graph query plan and details how these queries can be run on a Pregel graph vertex-centric compute system. Our solution, called PathQuery Pregel, creates plans for any conjunctive or disjunctive queries with aggregation and negation; we describe how the query execution extracts graph results optimally while avoiding many join operations where parallel map execution is permitted. We provide details of how we scaled this system out to execute large set of queries in parallel over the Google Knowledge Graph, a graph of 70B edges, or facts; we describe our production experience with PathQuery Pregel.

This is a preview of subscription content, log in to check access.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7

References

  1. 1.

    Aberger CR, Lamb A, Tu S, Nötzli A, Olukotun K, Ré C (2017) Emptyheaded: a relational engine for graph processing. ACM Trans Database Syst 42(4):20:1–20:44. https://doi.org/10.1145/3129246

    MathSciNet  Article  Google Scholar 

  2. 2.

    Angles R, Gutiérrez C (2008) Survey of graph database models. ACM Comput Surv 40(1):1:1–1:39. https://doi.org/10.1145/1322432.1322433

    Article  Google Scholar 

  3. 3.

    Baeza-Yates RA, Ribeiro-Neto BA (1999) Modern information retrieval. ACM Press, Addison-Wesley, Boston

    Google Scholar 

  4. 4.

    Bast H, Buchhold B, Haussmann E (2016) Semantic search on text and knowledge bases. Found Trends Inf Retr 10(2–3):119–271. https://doi.org/10.1561/1500000032

    Article  Google Scholar 

  5. 5.

    Bollacker KD, Evans C, Paritosh P, Sturge T, Taylor J (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In: Wang JT (ed) Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD 2008, Vancouver, BC, Canada, June 10–12, 2008. ACM, pp 1247–1250. https://doi.org/10.1145/1376616.1376746

  6. 6.

    Ceri S, Gottlob G, Tanca L (1989) What you always wanted to know about datalog (and never dared to ask). IEEE Trans Knowl Data Eng 1(1):146–166. https://doi.org/10.1109/69.43410

    Article  Google Scholar 

  7. 7.

    Chassel RJ. The mapcar function. https://www.gnu.org/software/emacs/manual/html_node/eintr/mapcar.html

  8. 8.

    Chavarría-Miranda DG, Castellana VG, Morari A, Haglin D, Feo J (2016) Graql: a query language for high-performance attributed graph databases. In: 2016 IEEE international parallel and distributed processing symposium workshops, IPDPS workshops 2016, Chicago, IL, USA, May 23–27, 2016. IEEE Computer Society, pp 1453–1462. https://doi.org/10.1109/IPDPSW.2016.216

  9. 9.

    Colmerauer A, Roussel P (1993) The birth of prolog. In: Lee JAN, Sammet JE (eds) History of programming languages conference (HOPL-II), Preprints, Cambridge, Massachusetts, USA, April 20–23, 1993. ACM, pp 37–52. https://doi.org/10.1145/154766.155362

  10. 10.

    Contributors W (2015) Synchronous circuit—Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Synchronous_circuit&oldid=696626873. Online; Accessed 15 Feb 2018

  11. 11.

    Contributors W (2017) Call-with-current-continuation—Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Call-with-current-continuation&oldid=811008297. Online; Accessed 16 Feb 2018

  12. 12.

    Contributors W (2017) Inverted index—Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/Inverted_index

  13. 13.

    Contributors W (2017) Topological sorting—Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Topological_sorting&oldid=805893309. Online; Accessed 15 Feb 2018

  14. 14.

    Contributors W (2018) Backtracking—Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Backtracking&oldid=824587461. Online; Accessed 16 Feb 2018

  15. 15.

    Contributors W (2018) Kevin Bacon—Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Kevin_Bacon&oldid=823782926. Online; Accessed 16 Feb 2018

  16. 16.

    Contributors W (2018) Knowledge Graph—Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Knowledge_Graph&oldid=822449387. Online; Accessed 15 Feb 2018

  17. 17.

    Cuzzocrea A, Cosulschi M, Virgilio RD (2016) An effective and efficient mapreduce algorithm for computing bfs-based traversals of large-scale RDF graphs. Algorithms 9(1):7. https://doi.org/10.3390/a9010007

    MathSciNet  Article  MATH  Google Scholar 

  18. 18.

    Dean J, Ghemawat S (2004) Mapreduce: simplified data processing on large clusters. In: Brewer EA, Chen P (eds) 6th Symposium on operating system design and implementation (OSDI 2004), San Francisco, California, USA, December 6–8, 2004. USENIX Association, pp 137–150. http://www.usenix.org/events/osdi04/tech/dean.html

  19. 19.

    Facebook: Apache giraph. http://giraph.apache.org/

  20. 20.

    Fan J, Raj AGS, Patel JM (2015) The case against specialized graph analytics engines. In: CIDR 2015, seventh biennial conference on innovative data systems research, Asilomar, CA, USA, January 4–7, 2015, online proceedings. www.cidrdb.org. http://cidrdb.org/cidr2015/Papers/CIDR15_Paper20.pdf

  21. 21.

    Group RW Rdf. https://www.w3.org/RDF/

  22. 22.

    Hillis WD (1989) The connection machine. MIT Press, Cambridge

    Google Scholar 

  23. 23.

    Hong S, Salihoglu S, Widom J, Olukotun K (2014) Simplifying scalable graph processing with a domain-specific language. In: Kaeli DR, Moseley T (eds) 12th Annual IEEE/ACM international symposium on code generation and optimization, CGO 2014, Orlando, FL, USA, February 15–19, 2014. ACM, p 208. https://doi.org/10.1145/2544137.2544162

  24. 24.

    Junghanns M, Petermann A, Neumann M, Rahm E (2017) Management and analysis of big graph data: current systems and open challenges. In: Zomaya AY, Sakr S (eds) Handbook of Big Data technologies. Springer, New York, pp 457–505. https://doi.org/10.1007/978-3-319-49340-4_14

  25. 25.

    Kim K, Moon B, Kim H (2014) Rg-index: An RDF graph index for efficient SPARQL query processing. Expert Syst Appl 41(10):4596–4607. https://doi.org/10.1016/j.eswa.2014.01.027

    Article  Google Scholar 

  26. 26.

    LinkedIn (2018) Linkedin economic graph. https://economicgraph.linkedin.com/

  27. 27.

    Lipsett R, Schaefer CF, Ussery C (1989) VHDL: hardware description and design. Kluwer, Dordrecht

    Google Scholar 

  28. 28.

    Low Y, Gonzalez J, Kyrola A, Bickson D, Guestrin C, Hellerstein JM (2012) Distributed graphlab: a framework for machine learning in the cloud. PVLDB 5(8):716–727

    Google Scholar 

  29. 29.

    Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Elmagarmid AK, Agrawal D (eds) Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6–10, 2010. ACM, pp 135–146. https://doi.org/10.1145/1807167.1807184

  30. 30.

    Marcus G (2012) The web gets smarter. https://www.newyorker.com/culture/culture-desk/the-web-gets-smarter. Online; Accessed 15 Feb 2018

  31. 31.

    Moustafa WE, Papavasileiou V, Yocum K, Deutsch A (2016) Datalography: scaling datalog graph analytics on graph processing systems. In: Joshi J, Karypis G, Liu L, Hu X, Ak R, Xia Y, Xu W, Sato A,  Rachuri S, Ungar LH, Yu PS, Govindaraju R, Suzumura T (eds) 2016 IEEE international conference on Big Data, BigData 2016, Washington DC, USA, December 5–8, 2016. IEEE, pp 56–65. https://doi.org/10.1109/BigData.2016.7840589

  32. 32.

    Reference O Moore’s law. http://www.oxfordreference.com/view/10.1093/oi/authority.20110803100208256

  33. 33.

    Rohloff K, Schantz RE (2010) High-performance, massively scalable distributed systems using the mapreduce software framework: the SHARD triple-store. In: Tilevich E, Eugster P (eds) SPLASH workshop on programming support innovations for emerging distributed applications (PSI EtA-\(\Psi\)Theta 2010), October 17, 2010, Reno/Tahoe, Nevada, USA. ACM, p 4. https://doi.org/10.1145/1940747.1940751

  34. 34.

    Seo J, Park J, Shin J, Lam MS (2013) Distributed socialite: a datalog-based language for large-scale graph analysis. PVLDB 6(14):1906–1917

    Google Scholar 

  35. 35.

    Singhal A. Introducing the knowledge graph: things, not strings. https://googleblog.blogspot.com/2012/05/introducing-knowledge-graph-things-not.html

  36. 36.

    Steele LS Jr, Hillis WD (1986) Connection machine LISP: fine-grained parallel symbolic processing. In: LISP and functional programming, pp 279–297

  37. 37.

    Valiant LG (1990) A bridging model for parallel computation. Commun ACM 33(8):103–111. https://doi.org/10.1145/79173.79181

    Article  Google Scholar 

  38. 38.

    Vincent J (2018) Apple boasts about sales; google boasts about how good its AI is . https://www.theverge.com/2016/10/4/13122406/google-phone-event-stats. Online; Accessed 15 Feb 2018

  39. 39.

    Wang J, Balazinska M, Halperin D (2015) Asynchronous and fault-tolerant recursive datalog evaluation in shared-nothing engines. PVLDB 8(12):1542–1553

    Google Scholar 

  40. 40.

    Yan D, Cheng J, Lu Y, Ng W (2015) Effective techniques for message reduction and load balancing in distributed graph computation. In: Gangemi A, Leonardi S, Panconesi A (eds) Proceedings of the 24th international conference on World Wide Web, WWW 2015, Florence, Italy, May 18–22, 2015. ACM, pp 1307–1317. https://doi.org/10.1145/2736277.2741096

  41. 41.

    Yan D, Cheng J, Xing K, Lu Y, Ng W, Bu Y (2014) Pregel algorithms for graph connectivity problems with performance guarantees. PVLDB 7(14):1821–1832

    Google Scholar 

  42. 42.

    Yan Z, Liu M (1996) The RTL binding and mapping approach of VHDL high-level synthesis system HLS/BIT. J. Comput Sci Technol 11(6):562–569. https://doi.org/10.1007/BF02951619

    Article  Google Scholar 

  43. 43.

    Zhang Q, Yan D, Cheng J (2016) Quegel: a general-purpose system for querying big graphs. In: Özcan F, Koutrika G, Madden S (eds) Proceedings of the 2016 international conference on management of data, SIGMOD conference 2016, San Francisco, CA, USA, June 26–July 01, 2016. ACM, pp 2189–2192. https://doi.org/10.1145/2882903.2899398

Download references

Acknowledgements

None of this would have happened without the passion, knowledge and vision of our late friend and colleague Warren Harris. We regret his passing immensely and we dedicate this summary of our work to his memory. We are proud of what we have achieved together and of what we learned from Warren.

Author information

Affiliations

Authors

Corresponding author

Correspondence to Bogdan Arsintescu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Bogdan Arsintescu: Currently employed by LinkedIn; work performed 100% while at Google.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Arsintescu, B., Deo, S. & Harris, W. PathQuery Pregel: high-performance graph query with bulk synchronous processing. Pattern Anal Applic 23, 1493–1504 (2020). https://doi.org/10.1007/s10044-019-00841-z

Download citation

Keywords

  • Distributed graph compute
  • Pregel
  • Graph query
  • Bulk synchronous parallel computing
  • Graph database