
World Wide Web, Volume 22, Issue 4, pp 1611–1638

GStar: an efficient framework for answering top-k star queries on billion-node knowledge graphs

  • Jiahui Jin
  • Junzhou Luo
  • Samamon Khemmarat
  • Fang Dong
  • Lixin Gao

Abstract

Massive knowledge graphs, such as Linked Open Data or Freebase, contain billions of labeled entities and relationships. Star queries, which aim to identify an entity given a set of related entities, are common on massive knowledge graphs. Answering star queries well is important, and the task can be treated as a graph pattern-matching problem. Because knowledge graphs are noisy and incomplete by nature, we must find answers that match the star pattern closely, and extract a precise match if possible. Thus, we propose GStar, a framework to identify the top-k best answers for a star query. GStar effectively and efficiently answers top-k star queries on billion-node graphs through a novel query model, an index-free query algorithm, and a distributed query system. We evaluate GStar through experiments on real-world knowledge graphs. Experimental results show that our query model effectively answers real-life star-pattern queries; our query algorithm can answer top-k queries in a near-real-time manner without requiring expensive graph indices; and the distributed system scales well with both the graph size and the number of machines used for computation.
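
As a rough illustration of what a top-k star query asks for, the sketch below ranks candidate entities of a requested type by how closely they connect to every entity given in the query. It is not GStar's query model or algorithm: the scoring function (a sum of inverse hop distances), the hop bound `max_hops`, the helper names, and the toy graph are all assumptions made here for illustration, and the brute-force search would not scale to billion-node graphs.

```python
import heapq
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
        graph.setdefault(v, []).append(u)
    return graph


def hop_distances(graph, source, max_hops):
    """Breadth-first search; returns {node: hops} for nodes within max_hops."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand past the hop bound
        for nbr in graph.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def top_k_star(graph, labels, pivot_label, leaves, k, max_hops=2):
    """Rank nodes labeled pivot_label by closeness to the query's leaf entities.

    Hypothetical scoring: each leaf contributes 1/hops if it reaches the
    candidate within max_hops, so a direct edge counts more than a detour.
    """
    scores = {}
    for leaf in leaves:
        for node, hops in hop_distances(graph, leaf, max_hops).items():
            if hops > 0 and labels.get(node) == pivot_label:
                scores[node] = scores.get(node, 0.0) + 1.0 / hops
    # Ties on score are broken by node name so the ranking is deterministic.
    return heapq.nlargest(k, scores.items(), key=lambda item: (item[1], item[0]))


# Toy, made-up knowledge graph: film_y lacks a direct edge to bob (an
# "incomplete" graph) but is still reachable through studio_z.
graph = build_graph([
    ("alice", "film_x"), ("bob", "film_x"),
    ("alice", "film_y"), ("bob", "studio_z"), ("studio_z", "film_y"),
    ("bob", "film_w"),
])
labels = {
    "alice": "person", "bob": "person", "studio_z": "studio",
    "film_x": "film", "film_y": "film", "film_w": "film",
}

result = top_k_star(graph, labels, "film", ["alice", "bob"], k=2)
print(result)
```

In this toy setting, `film_x` matches the star pattern exactly, while `film_y` is only an approximate match; a closeness-based score still lets it rank above `film_w`, which is connected to just one of the query entities.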

Keywords

  • Graph pattern matching
  • Knowledge graphs
  • Billion-node graphs
  • Top-k query
  • Big data
  • Distributed system

Notes

Acknowledgements

This work is supported by the National Key R&D Program of China (2017YFB1003000); the National Natural Science Foundation of China under Grants No. 61702096, No. 61632008, No. 61320106007, No. 61572129, No. 61602112, No. 61502097, No. 61370207 and No. 61702097; the International S&T Cooperation Program of China (No. 2015DFA10490); the Natural Science Foundation of Jiangsu Province under Grants BK20170689 and BK20160695; the Jiangsu Provincial Key Laboratory of Network and Information Security under Grant No. BM2003201; the Key Laboratory of Computer Network and Information Integration of the Ministry of Education of China under Grant No. 93K-9; and the Fundamental Research Funds for the Central Universities. It is partially supported by the Collaborative Innovation Center of Novel Software Technology and Industrialization and the Collaborative Innovation Center of Wireless Communications Technology, and also by U.S. NSF grants CNS-1217284 and CCF-1018114. Jiahui Jin was a visiting student at UMass Amherst, supported by the China Scholarship Council, when this work was performed. Any opinions, findings, conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the sponsors. A preliminary version [14] of this paper appeared in the Proceedings of the 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS ’14).

References

  1. Akiba, T., Sommer, C., Kawarabayashi, K.-i.: Shortest-path queries for complex networks: Exploiting low tree-width outside the core. In: Proceedings of the 15th International Conference on Extending Database Technology, EDBT ’12, pp. 144–155. ACM, New York (2012)
  2. Akiba, T., Iwata, Y., Yoshida, Y.: Fast exact shortest-path distance queries on large networks by pruned landmark labeling. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD ’13, pp. 349–360. ACM, New York (2013)
  3. Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hellmann, S.: DBpedia - a crystallization point for the Web of data. Web Semant. 7(3), 154–165 (2009)
  4. Brandes, U.: A faster algorithm for betweenness centrality. J. Math. Sociol. 25, 163–177 (2001)
  5. Chakrabarti, D., Zhan, Y., Faloutsos, C.: R-MAT: A recursive model for graph mining. In: Proceedings of the Fourth SIAM International Conference on Data Mining, SDM ’04, pp. 442–446 (2004)
  6. Checconi, F., Petrini, F.: Traversing trillions of edges in real time: Graph exploration on large-scale parallel machines. In: Proceedings of the 28th IEEE International Parallel and Distributed Processing Symposium, IPDPS ’14, pp. 425–434 (2014)
  7. Cheng, J., Zeng, X., Yu, J.X.: Top-k graph pattern matching over large graphs. In: Proceedings of the 29th IEEE International Conference on Data Engineering, ICDE ’13, pp. 1033–1044 (2013)
  8. Dong, X., Gabrilovich, E., Heitz, G., Horn, W., Ni, L., Murphy, K., Strohmann, T., Sun, S., Zhang, W.: Knowledge vault: A Web-scale approach to probabilistic knowledge fusion. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, pp. 601–610 (2014)
  9. Fagin, R., Lotem, A., Naor, M.: Optimal aggregation algorithms for middleware. In: Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’01, pp. 102–113. ACM, New York (2001)
  10. Fan, W., Li, J., Ma, S., Tang, N., Wu, Y., Wu, Y.: Graph pattern matching: From intractable to polynomial time. PVLDB 3(1–2), 264–275 (2010)
  11. Han, W.-S., Lee, J., Pham, M.-D., Yu, J.X.: igraph: A framework for comparisons of disk-based graph indexing techniques. PVLDB 3(1), 449–459 (2010)
  12. He, H., Wang, H., Yang, J., Yu, P.S.: Blinks: Ranked keyword searches on graphs. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD ’07, pp. 305–316. ACM, New York (2007)
  13. Ilyas, I.F., Beskales, G., Soliman, M.A.: A survey of top-k query processing techniques in relational database systems. ACM Comput. Surv. 40(4), 11:1–11:58 (2008)
  14. Jin, J., Khemmarat, S., Gao, L., Luo, J.: A distributed approach for top-k star queries on massive information networks. In: Proceedings of the 20th IEEE International Conference on Parallel and Distributed Systems, ICPADS ’14, pp. 9–16 (2014)
  15. Jin, J., Luo, J., Khemmarat, S., Dong, F., Gao, L.: Supplementary file of gstar: An efficient framework for answering top-k star queries on billion-node knowledge graphs. http://cse.seu.edu.cn/PersonalPage/jhjin/upload/supplementary-file-wwwj.pdf (2017)
  16. Jin, J., Luo, J., Khemmarat, S., Gao, L.: Querying Web-scale knowledge graphs through effective pruning of search space. IEEE Trans Parallel Distrib Syst 28(8), 2342–2356 (2017)
  17. Khan, A., Li, N., Yan, X., Guan, Z., Chakraborty, S., Tao, S.: Neighborhood based fast graph search in large networks. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD ’11, pp. 901–912. ACM, New York (2011)
  18. Khan, A., Wu, Y., Aggarwal, C.C., Yan, X.: Nema: Fast graph search with label similarity. PVLDB 6(3), 181–192 (2013)
  19. Khemmarat, S., Gao, L.: Fast top-k path-based relevance query on massive graphs. In: Proceedings of the 30th IEEE International Conference on Data Engineering, ICDE ’14, pp. 316–327 (2014)
  20. Lee, J., Han, W.-S., Kasperovics, R., Lee, J.-H.: An in-depth comparison of subgraph isomorphism algorithms in graph databases. PVLDB 6(2), 133–144 (2012)
  21. Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., Hellerstein, J.M.: Distributed graphlab: A framework for machine learning in the cloud. PVLDB 5(8), 716–727 (2012)
  22. Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: A system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, pp. 135–146. ACM, New York (2010)
  23. Neumann, T., Weikum, G.: Rdf-3x: A risc-style engine for rdf. PVLDB 1(1), 647–659 (2008)
  24. Neumann, T., Bender, M., Michel, S., Schenkel, R., Triantafillou, P., Weikum, G.: Distributed top-k aggregation queries at large. Distrib Parallel Datab 26(1), 3–27 (2009)
  25. Power, R., Li, J.: Piccolo: Building fast, distributed programs with partitioned tables. In: Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI ’10, pp. 1–14. USENIX Association, Berkeley (2010)
  26. Qiu, T., Qiao, R., Han, M., Sangaiah, A.K., Lee, I.: A lifetime-enhanced data collecting scheme for internet of things. IEEE Commun. Mag. 55(11), 132–137 (2017)
  27. Qiu, T., Zhao, A., Xia, F., Si, W., Wu, D.: ROSE: Robustness strategy for scale-free wireless sensor networks. IEEE/ACM Trans. Network. 25(5), 2944–2959 (2017)
  28. Shang, H., Zhang, Y., Lin, X., Yu, J.X.: Taming verification hardness: An efficient algorithm for testing subgraph isomorphism. Proc. VLDB Endow. 1(1), 364–375 (2008)
  29. Stanton, I., Kliot, G.: Streaming graph partitioning for large distributed graphs. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, pp. 1222–1230. ACM, New York (2012)
  30. Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: A core of semantic knowledge. In: Proceedings of the 16th International Conference on World Wide Web, WWW ’07, pp. 697–706. ACM, New York (2007)
  31. Sun, Z., Wang, H., Wang, H., Shao, B., Li, J.: Efficient subgraph matching on billion node graphs. PVLDB 5(9), 788–799 (2012)
  32. Tian, Y., Patel, J.M.: Tale: A tool for approximate large graph matching. In: Proceedings of the 24th IEEE International Conference on Data Engineering, ICDE ’08, pp. 963–972 (2008)
  33. Tong, H., Faloutsos, C., Gallagher, B., Eliassi-Rad, T.: Fast best-effort pattern matching in large attributed graphs. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’07, pp. 737–746. ACM, New York (2007)
  34. Vrandečić, D., Krötzsch, M.: Wikidata: A free collaborative knowledgebase. Commun. ACM 57(10), 78–85 (2014)
  35. Yan, D., Cheng, J., Yang, F., Lu, Y., Lui, J.C.S., Zhang, Q., Ng, W.: A general-purpose query-centric framework for querying big graphs. PVLDB 9(7), 564–575 (2016)
  36. Zeng, K., Yang, J., Wang, H., Shao, B., Wang, Z.: A distributed graph engine for Web scale rdf data. PVLDB, 265–276 (2013)
  37. Zhang, Y., Gao, Q., Gao, L., Wang, C.: Maiter: An asynchronous graph processing framework for delta-based accumulative iterative computation. IEEE Trans Parallel Distrib Syst 25(8), 2091–2100 (2014)
  38. Zou, L., Chen, L., Tamer Özsu, M.: Distance-join: Pattern match query in a large graph database. PVLDB 2(1), 886–897 (2009)
  39. Zou, L., Mo, J., Chen, L., Tamer Özsu, M., Dongyan, Z.: gStore: Answering SPARQL queries via subgraph matching. PVLDB 4(8), 482–493 (2011)

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. School of Computer Science and Engineering, Southeast University, Nanjing, China
  2. Department of Electrical and Computer Engineering, University of Massachusetts Amherst, Amherst, USA
