Similarity Search in Large-Scale Graph Databases

Zhao, Peixiang

doi:10.1007/978-3-319-49340-4_15


Abstract

Graphs are ubiquitous and play an essential role in modeling and representing complex structures in real-world networked applications. Given a graph database that comprises a large collection of graphs, fast and flexible search for structurally similar graphs is a fundamental requirement. In this paper, we survey recent graph similarity search techniques, focusing on work based on the graph edit distance (GED) metric. State-of-the-art approaches to GED-based similarity search typically adopt a pruning-and-verification framework. They first exploit easy-to-compute lower bounds of the graph edit distance, and use graph indexing structures to evaluate these lower bounds efficiently between the query graph and the graphs in the database. Graphs whose lower bound already exceeds the GED threshold are filtered out and excluded from further investigation. The costly GED verification is then performed only on the graphs that pass the lower-bound evaluation. We examine existing GED lower bounds, graph index structures, and similarity search algorithms in detail, and compare similarity search methods in terms of index construction cost, similarity search performance, and applicability to real-world graph databases. Finally, we discuss future research directions related to similarity search and high-performance query processing in large-scale graph databases.
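The pruning-and-verification framework described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than a method taken from the chapter: graphs are plain dicts with a "labels" list and an "edges" set of vertex-index pairs, edges are unlabeled, every edit operation costs 1, the filter is a simple label-multiset and edge-count lower bound, and the verifier is a brute-force search that is only practical for very small graphs. The function names are hypothetical.

```python
from collections import Counter
from itertools import permutations


def label_lower_bound(g, h):
    """Cheap lower bound on uniform-cost GED from label multisets and edge counts."""
    common = sum((Counter(g["labels"]) & Counter(h["labels"])).values())
    # Every vertex without a same-label partner needs a relabel, insert or delete.
    vertex_part = max(len(g["labels"]), len(h["labels"])) - common
    # Edges are unlabeled here, so only the difference in edge counts is forced.
    edge_part = abs(len(g["edges"]) - len(h["edges"]))
    return vertex_part + edge_part


def exact_ged(g, h):
    """Brute-force uniform-cost GED. Exponential time: tiny graphs only."""
    n = max(len(g["labels"]), len(h["labels"]))
    # Pad the smaller graph with dummy vertices (None) so a bijection exists.
    g_labels = list(g["labels"]) + [None] * (n - len(g["labels"]))
    h_labels = list(h["labels"]) + [None] * (n - len(h["labels"]))
    target = {frozenset(e) for e in h["edges"]}
    best = None
    for perm in permutations(range(n)):
        # Vertex cost: one edit whenever the matched labels differ
        # (a dummy on either side means an insertion or a deletion).
        cost = sum(1 for i in range(n) if g_labels[i] != h_labels[perm[i]])
        # Edge cost: edges present on one side of the mapping but not the other.
        mapped = {frozenset((perm[u], perm[v])) for u, v in g["edges"]}
        cost += len(mapped ^ target)
        if best is None or cost < best:
            best = cost
    return best


def similarity_search(database, query, tau):
    """Return (candidates, answers) for the threshold query GED(g, query) <= tau."""
    # Pruning: a graph whose lower bound exceeds tau cannot be an answer.
    candidates = [g for g in database if label_lower_bound(g, query) <= tau]
    # Verification: run the expensive exact computation on survivors only.
    answers = [g for g in candidates if exact_ged(g, query) <= tau]
    return candidates, answers


query = {"labels": ["C", "C", "O"], "edges": {(0, 1), (1, 2)}}
database = [
    {"labels": ["C", "C", "N"], "edges": {(0, 1), (1, 2)}},
    {"labels": ["C", "O", "C"], "edges": {(0, 1), (1, 2)}},
    {"labels": ["N", "N", "N", "N"], "edges": {(0, 1)}},
]
candidates, answers = similarity_search(database, query, tau=1)
print(len(candidates), len(answers))
```

In this toy database the third graph is pruned by the lower bound alone, while the second passes the filter and is rejected only at verification. That false positive is the cost of a loose bound, and it is why tighter lower bounds and index structures matter for this kind of search.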



Author information

Authors and Affiliations

Department of Computer Science, Florida State University, 1017 Academic Way, James Love Building, Tallahassee, FL, 32306, USA
Peixiang Zhao


Corresponding author

Correspondence to Peixiang Zhao.

Editor information

Editors and Affiliations

School of Information Technologies, The University of Sydney, Sydney, New South Wales, Australia
Albert Y. Zomaya
The School of Computer Science, The University of New South Wales, Eveleigh, New South Wales, Australia
Sherif Sakr


About this chapter

Cite this chapter

Zhao, P. (2017). Similarity Search in Large-Scale Graph Databases. In: Zomaya, A., Sakr, S. (eds) Handbook of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-49340-4_15

DOI: https://doi.org/10.1007/978-3-319-49340-4_15
Published: 26 February 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49339-8
Online ISBN: 978-3-319-49340-4
eBook Packages: Computer Science, Computer Science (R0)
