High-performance graph query systems are a scalable way to mine information in Knowledge Graphs, especially when the queries benefit from a high-level expressive query language. This paper presents techniques to algorithmically compile queries expressed in a high-level language (e.g., Datalog) into a directed acyclic graph query plan and details how these queries can be run on a Pregel graph vertex-centric compute system. Our solution, called PathQuery Pregel, creates plans for any conjunctive or disjunctive queries with aggregation and negation; we describe how the query execution extracts graph results optimally while avoiding many join operations where parallel map execution is permitted. We provide details of how we scaled this system out to execute large set of queries in parallel over the Google Knowledge Graph, a graph of 70B edges, or facts; we describe our production experience with PathQuery Pregel.
This is a preview of subscription content, log in to check access.
Buy single article
Instant access to the full article PDF.
Price includes VAT for USA
Subscribe to journal
Immediate online access to all issues from 2019. Subscription will auto renew annually.
This is the net price. Taxes to be calculated in checkout.
Aberger CR, Lamb A, Tu S, Nötzli A, Olukotun K, Ré C (2017) Emptyheaded: a relational engine for graph processing. ACM Trans Database Syst 42(4):20:1–20:44. https://doi.org/10.1145/3129246
Angles R, Gutiérrez C (2008) Survey of graph database models. ACM Comput Surv 40(1):1:1–1:39. https://doi.org/10.1145/1322432.1322433
Baeza-Yates RA, Ribeiro-Neto BA (1999) Modern information retrieval. ACM Press, Addison-Wesley, Boston
Bast H, Buchhold B, Haussmann E (2016) Semantic search on text and knowledge bases. Found Trends Inf Retr 10(2–3):119–271. https://doi.org/10.1561/1500000032
Bollacker KD, Evans C, Paritosh P, Sturge T, Taylor J (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In: Wang JT (ed) Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD 2008, Vancouver, BC, Canada, June 10–12, 2008. ACM, pp 1247–1250. https://doi.org/10.1145/1376616.1376746
Ceri S, Gottlob G, Tanca L (1989) What you always wanted to know about datalog (and never dared to ask). IEEE Trans Knowl Data Eng 1(1):146–166. https://doi.org/10.1109/69.43410
Chassel RJ. The mapcar function. https://www.gnu.org/software/emacs/manual/html_node/eintr/mapcar.html
Chavarría-Miranda DG, Castellana VG, Morari A, Haglin D, Feo J (2016) Graql: a query language for high-performance attributed graph databases. In: 2016 IEEE international parallel and distributed processing symposium workshops, IPDPS workshops 2016, Chicago, IL, USA, May 23–27, 2016. IEEE Computer Society, pp 1453–1462. https://doi.org/10.1109/IPDPSW.2016.216
Colmerauer A, Roussel P (1993) The birth of prolog. In: Lee JAN, Sammet JE (eds) History of programming languages conference (HOPL-II), Preprints, Cambridge, Massachusetts, USA, April 20–23, 1993. ACM, pp 37–52. https://doi.org/10.1145/154766.155362
Contributors W (2015) Synchronous circuit—Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Synchronous_circuit&oldid=696626873. Online; Accessed 15 Feb 2018
Contributors W (2017) Call-with-current-continuation—Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Call-with-current-continuation&oldid=811008297. Online; Accessed 16 Feb 2018
Contributors W (2017) Inverted index—Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/Inverted_index
Contributors W (2017) Topological sorting—Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Topological_sorting&oldid=805893309. Online; Accessed 15 Feb 2018
Contributors W (2018) Backtracking—Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Backtracking&oldid=824587461. Online; Accessed 16 Feb 2018
Contributors W (2018) Kevin Bacon—Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Kevin_Bacon&oldid=823782926. Online; Accessed 16 Feb 2018
Contributors W (2018) Knowledge Graph—Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Knowledge_Graph&oldid=822449387. Online; Accessed 15 Feb 2018
Cuzzocrea A, Cosulschi M, Virgilio RD (2016) An effective and efficient mapreduce algorithm for computing bfs-based traversals of large-scale RDF graphs. Algorithms 9(1):7. https://doi.org/10.3390/a9010007
Dean J, Ghemawat S (2004) Mapreduce: simplified data processing on large clusters. In: Brewer EA, Chen P (eds) 6th Symposium on operating system design and implementation (OSDI 2004), San Francisco, California, USA, December 6–8, 2004. USENIX Association, pp 137–150. http://www.usenix.org/events/osdi04/tech/dean.html
Facebook: Apache giraph. http://giraph.apache.org/
Fan J, Raj AGS, Patel JM (2015) The case against specialized graph analytics engines. In: CIDR 2015, seventh biennial conference on innovative data systems research, Asilomar, CA, USA, January 4–7, 2015, online proceedings. www.cidrdb.org. http://cidrdb.org/cidr2015/Papers/CIDR15_Paper20.pdf
Group RW Rdf. https://www.w3.org/RDF/
Hillis WD (1989) The connection machine. MIT Press, Cambridge
Hong S, Salihoglu S, Widom J, Olukotun K (2014) Simplifying scalable graph processing with a domain-specific language. In: Kaeli DR, Moseley T (eds) 12th Annual IEEE/ACM international symposium on code generation and optimization, CGO 2014, Orlando, FL, USA, February 15–19, 2014. ACM, p 208. https://doi.org/10.1145/2544137.2544162
Junghanns M, Petermann A, Neumann M, Rahm E (2017) Management and analysis of big graph data: current systems and open challenges. In: Zomaya AY, Sakr S (eds) Handbook of Big Data technologies. Springer, New York, pp 457–505. https://doi.org/10.1007/978-3-319-49340-4_14
Kim K, Moon B, Kim H (2014) Rg-index: An RDF graph index for efficient SPARQL query processing. Expert Syst Appl 41(10):4596–4607. https://doi.org/10.1016/j.eswa.2014.01.027
LinkedIn (2018) Linkedin economic graph. https://economicgraph.linkedin.com/
Lipsett R, Schaefer CF, Ussery C (1989) VHDL: hardware description and design. Kluwer, Dordrecht
Low Y, Gonzalez J, Kyrola A, Bickson D, Guestrin C, Hellerstein JM (2012) Distributed graphlab: a framework for machine learning in the cloud. PVLDB 5(8):716–727
Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Elmagarmid AK, Agrawal D (eds) Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6–10, 2010. ACM, pp 135–146. https://doi.org/10.1145/1807167.1807184
Marcus G (2012) The web gets smarter. https://www.newyorker.com/culture/culture-desk/the-web-gets-smarter. Online; Accessed 15 Feb 2018
Moustafa WE, Papavasileiou V, Yocum K, Deutsch A (2016) Datalography: scaling datalog graph analytics on graph processing systems. In: Joshi J, Karypis G, Liu L, Hu X, Ak R, Xia Y, Xu W, Sato A, Rachuri S, Ungar LH, Yu PS, Govindaraju R, Suzumura T (eds) 2016 IEEE international conference on Big Data, BigData 2016, Washington DC, USA, December 5–8, 2016. IEEE, pp 56–65. https://doi.org/10.1109/BigData.2016.7840589
Reference O Moore’s law. http://www.oxfordreference.com/view/10.1093/oi/authority.20110803100208256
Rohloff K, Schantz RE (2010) High-performance, massively scalable distributed systems using the mapreduce software framework: the SHARD triple-store. In: Tilevich E, Eugster P (eds) SPLASH workshop on programming support innovations for emerging distributed applications (PSI EtA-\(\Psi\)Theta 2010), October 17, 2010, Reno/Tahoe, Nevada, USA. ACM, p 4. https://doi.org/10.1145/1940747.1940751
Seo J, Park J, Shin J, Lam MS (2013) Distributed socialite: a datalog-based language for large-scale graph analysis. PVLDB 6(14):1906–1917
Singhal A. Introducing the knowledge graph: things, not strings. https://googleblog.blogspot.com/2012/05/introducing-knowledge-graph-things-not.html
Steele LS Jr, Hillis WD (1986) Connection machine LISP: fine-grained parallel symbolic processing. In: LISP and functional programming, pp 279–297
Valiant LG (1990) A bridging model for parallel computation. Commun ACM 33(8):103–111. https://doi.org/10.1145/79173.79181
Vincent J (2018) Apple boasts about sales; google boasts about how good its AI is . https://www.theverge.com/2016/10/4/13122406/google-phone-event-stats. Online; Accessed 15 Feb 2018
Wang J, Balazinska M, Halperin D (2015) Asynchronous and fault-tolerant recursive datalog evaluation in shared-nothing engines. PVLDB 8(12):1542–1553
Yan D, Cheng J, Lu Y, Ng W (2015) Effective techniques for message reduction and load balancing in distributed graph computation. In: Gangemi A, Leonardi S, Panconesi A (eds) Proceedings of the 24th international conference on World Wide Web, WWW 2015, Florence, Italy, May 18–22, 2015. ACM, pp 1307–1317. https://doi.org/10.1145/2736277.2741096
Yan D, Cheng J, Xing K, Lu Y, Ng W, Bu Y (2014) Pregel algorithms for graph connectivity problems with performance guarantees. PVLDB 7(14):1821–1832
Yan Z, Liu M (1996) The RTL binding and mapping approach of VHDL high-level synthesis system HLS/BIT. J. Comput Sci Technol 11(6):562–569. https://doi.org/10.1007/BF02951619
Zhang Q, Yan D, Cheng J (2016) Quegel: a general-purpose system for querying big graphs. In: Özcan F, Koutrika G, Madden S (eds) Proceedings of the 2016 international conference on management of data, SIGMOD conference 2016, San Francisco, CA, USA, June 26–July 01, 2016. ACM, pp 2189–2192. https://doi.org/10.1145/2882903.2899398
None of this would have happened without the passion, knowledge and vision of our late friend and colleague Warren Harris. We regret his passing immensely and we dedicate this summary of our work to his memory. We are proud of what we have achieved together and of what we learned from Warren.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Bogdan Arsintescu: Currently employed by LinkedIn; work performed 100% while at Google.
About this article
Cite this article
Arsintescu, B., Deo, S. & Harris, W. PathQuery Pregel: high-performance graph query with bulk synchronous processing. Pattern Anal Applic 23, 1493–1504 (2020). https://doi.org/10.1007/s10044-019-00841-z
- Distributed graph compute
- Graph query
- Bulk synchronous parallel computing
- Graph database