Semantic-Aware Metadata Organization for Exact-Matching Queries

Hua, Yu; Liu, Xue

doi:10.1007/978-981-13-2721-6_4


Abstract

Existing data storage systems based on the hierarchical directory-tree organization do not meet the scalability and functionality requirements of exponentially growing datasets and increasingly complex metadata queries in large-scale, exabyte-level file systems with billions of files. This chapter proposes a novel decentralized semantic-aware metadata organization, called SmartStore, which exploits the semantics of files’ metadata to judiciously aggregate correlated files into semantic-aware groups by using information retrieval tools. The key idea of SmartStore is to limit the search scope of a complex metadata query to a single or a minimal number of semantically correlated groups, thereby avoiding or alleviating brute-force search over the entire system. The decentralized design of SmartStore can improve system scalability and reduce query latency for complex queries (including range and top-k queries). It is also conducive to constructing semantic-aware caching and to supporting conventional filename-based point queries. We have implemented a prototype of SmartStore, and extensive experiments based on real-world traces show that SmartStore significantly improves system scalability and reduces query latency over database approaches. To the best of our knowledge, this is the first study on the implementation of complex queries in large-scale file systems (© 2012 IEEE. Reprinted, with permission, from Ref. [1]).
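The idea of limiting a query to a few correlated groups can be illustrated with a minimal sketch. This is not SmartStore's algorithm: the greedy centroid-based grouping below is a simple stand-in for the information retrieval tools mentioned in the abstract, and every name (`Group`, `build_groups`, `range_query`, the two-attribute metadata vectors, the `threshold` value) is a hypothetical choice made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """A group of files summarized by a per-attribute bounding box."""
    files: list = field(default_factory=list)  # (name, vector) pairs
    lo: list = None  # per-attribute minimum over the group
    hi: list = None  # per-attribute maximum over the group

    def add(self, name, vec):
        self.files.append((name, vec))
        self.lo = list(vec) if self.lo is None else [min(a, b) for a, b in zip(self.lo, vec)]
        self.hi = list(vec) if self.hi is None else [max(a, b) for a, b in zip(self.hi, vec)]

    def centroid(self):
        n = len(self.files)
        return [sum(v[i] for _, v in self.files) / n for i in range(len(self.lo))]


def build_groups(files, threshold):
    """Greedy stand-in for semantic grouping (an assumption, not the
    chapter's method): join the nearest group if its centroid lies
    within `threshold`, otherwise start a new group."""
    groups = []
    for name, vec in files:
        best, best_d = None, None
        for g in groups:
            d = sum((a - b) ** 2 for a, b in zip(g.centroid(), vec)) ** 0.5
            if best_d is None or d < best_d:
                best, best_d = g, d
        if best is not None and best_d <= threshold:
            best.add(name, vec)
        else:
            g = Group()
            g.add(name, vec)
            groups.append(g)
    return groups


def range_query(groups, q_lo, q_hi):
    """Return (sorted matching file names, number of groups scanned)."""
    hits, scanned = [], 0
    for g in groups:
        # Skip groups whose bounding box cannot intersect the query box.
        if any(h < ql or l > qh for l, h, ql, qh in zip(g.lo, g.hi, q_lo, q_hi)):
            continue
        scanned += 1
        hits += [n for n, v in g.files
                 if all(ql <= x <= qh for x, ql, qh in zip(v, q_lo, q_hi))]
    return sorted(hits), scanned


# Hypothetical two-attribute metadata vectors, e.g. (size, age).
FILES = [
    ("a.log", (1.0, 10.0)), ("b.log", (1.2, 11.0)), ("c.log", (0.9, 9.5)),
    ("x.dat", (50.0, 80.0)), ("y.dat", (52.0, 79.0)),
]
GROUPS = build_groups(FILES, threshold=5.0)
```

With this toy data the files fall into two groups, and `range_query(GROUPS, (0.95, 9.0), (1.5, 12.0))` scans only the group containing the `.log` files; the other group is pruned by its bounding box without inspecting any of its members.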

Notes

1. When the context is clear, we simply write top-k queries in place of top-k NN queries.

References

Y. Hua, H. Jiang, Y. Zhu, D. Feng, L. Tian, Semantic-aware metadata organization paradigm in next-generation file systems. IEEE Trans. Parallel Distrib. Syst. (TPDS) 2, 337–344 (2012)
J. Nunez, High end computing file system and I/O R&D gaps roadmap, in High Performance Computer Science Week, ASCR Computer Science Research (2008)
J.R. Douceur, J. Howell, Distributed directory service in the farsite file system, in Proceedings of the OSDI (2006), pp. 321–334
S.A. Weil, S.A. Brandt, E.L. Miller, D.D.E. Long, C. Maltzahn, Ceph: a scalable, high-performance distributed file system, in Proceedings of the OSDI (2006)
D. Agrawal, S. Das, A.E. Abbadi, Big data and cloud computing: new wine or just new bottles? in VLDB tutorial (2010)
M. Stonebraker, U. Cetintemel, One size fits all: an idea whose time has come and gone, in Proceedings of the ICDE (2005)
A.W. Leung, M. Shao, T. Bisson, S. Pasupathy, E.L. Miller, Spyglass: fast, scalable metadata search for large-scale storage systems, in Proceedings of the FAST (2009)
D. Roselli, J. Lorch, T. Anderson, A comparison of file system workloads, in Proceedings of the USENIX Conference (2000), pp. 41–54
A. Traeger, E. Zadok, N. Joukov, C. Wright, A nine year study of file system and storage benchmarking. ACM Trans. Storage 2, 1–56 (2008)
A. Szalay, New challenges in petascale scientific databases, in Keynote Talk in Scientific and Statistical Database Management Conference (SSDBM) (2008)
M. Seltzer, N. Murphy, Hierarchical file systems are dead, in Proceedings of the HotOS (2009)
F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, R. Gruber, Bigtable: a distributed storage system for structured data, in Proceedings of the OSDI (2006)
D.K. Gifford, P. Jouvelot, M.A. Sheldon, J.W. O'Toole, Semantic file systems, in Proceedings of the SOSP (1991)
P. Gu, J. Wang, Y. Zhu, H. Jiang, P. Shang, A novel weighted-graph-based grouping algorithm for metadata prefetching. IEEE Trans. Comput. 1, 1–15 (2010)
S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman, Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41, 391–407 (1990)
C. Papadimitriou, P. Raghavan, H. Tamaki, S. Vempala, Latent semantic indexing: a probabilistic analysis. J. Comput. Syst. Sci. 61(2), 217–235 (2000)
S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman, Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41(6), 391–407 (1990)
T. Hofmann, Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst. (TOIS) 22(1), 89–115 (2004)
T. Hofmann, Probabilistic latent semantic indexing, in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1999), pp. 50–57
S. Doraimani, A. Iamnitchi, File grouping for scientific data management: lessons from experimenting with real traces, in Proceedings of the HPDC (2008)
A. Leung, S. Pasupathy, G. Goodson, E. Miller, Measurement and analysis of large-scale network file system workloads, in Proceedings of the USENIX Conference (2008)
P. Xia, D. Feng, H. Jiang, L. Tian, F. Wang, FARMER: a novel approach to file access correlation mining and evaluation reference model for optimizing peta-scale file systems performance, in Proceedings of the HPDC (2008)
E. Riedel, M. Kallahalla, R. Swaminathan, A framework for evaluating storage system security, in Proceedings of the FAST (2002)
S. Kavalanekar, B. Worthington, Q. Zhang, V. Sharda, Characterization of storage workload traces from production Windows servers, in Proceedings of the IEEE International Symposium on Workload Characterization (IISWC) (2008)
D. Ellard, J. Ledlie, P. Malkani, M. Seltzer, Passive NFS tracing of email and research workloads, in Proceedings of the FAST (2003)
Y. Hua, H. Jiang, Y. Zhu, D. Feng, L. Tian, SmartStore: a new metadata organization paradigm with metadata semantic-awareness for next-generation file systems, Technical Report (University of Nebraska-Lincoln, TR-UNL-CSE-2008-0012, November 2008)
Y. Hua, H. Jiang, Y. Zhu, D. Feng, L. Tian, SmartStore: a new metadata organization paradigm with semantic-awareness. FAST Work-in-Progress Report and Poster Session (February, 2009)
B. Zhu, K. Li, H. Patterson, Avoiding the disk bottleneck in the data domain deduplication file system, in Proceedings of the FAST (2008)
M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, P. Camble, Sparse indexing: large scale, inline deduplication using sampling and locality, in Proceedings of the FAST (2009)
X. Liu, A. Aboulnaga, K. Salem, X. Li, CLIC: client-informed caching for storage servers, in Proceedings of the FAST (2009)
M. Li, E. Varki, S. Bhatia, A. Merchant, TaP: table-based prefetching for storage caches, in Proceedings of the FAST (2008)
A. Guttman, R-trees: a dynamic index structure for spatial searching, in Proceedings of the SIGMOD (1984)
B. Bloom, Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)
Y. Hua, Y. Zhu, H. Jiang, D. Feng, L. Tian, Scalable and adaptive metadata management in ultra large-scale file systems, in Proceedings of the ICDCS (2008)
J. Hartigan, M. Wong, Algorithm AS 136: a K-means clustering algorithm. Appl. Stat. 28, 100–108 (1979)
G. Salton, A. Wong, C. Yang, A vector space model for information retrieval. J. Am. Soc. Inf. Retr. 3, 613–620 (1975)
M. Berry, Z. Drmac, E. Jessup, Matrices, vector spaces, and information retrieval. SIAM Rev. 41, 335–362 (1999)
G. Golub, C. Van Loan, Matrix Computations (Johns Hopkins University Press, USA, 1996)
G. McLachlan, T. Krishnan, The EM Algorithm and Extensions (Wiley, New York, 1997)
A. Dempster, N. Laird, D. Rubin et al., Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 39(1), 1–38 (1977)
P. Moreno, P. Ho, N. Vasconcelos, A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications, in Advances in Neural Information Processing Systems (2004)
Z. Rached, F. Alajaji, L. Campbell, The Kullback-Leibler divergence rate between Markov sources. IEEE Trans. Inf. Theory 50(5), 917–921 (2004)
Y. Hua, H. Jiang, Y. Zhu, D. Feng, L. Tian, SmartStore: a new metadata organization paradigm with semantic-awareness for next-generation file systems, in Proceedings of ACM/IEEE Supercomputing Conference (SC) (2009)
P. Indyk, R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, in STOC (1998), pp. 604–613
V. Gaede, O. Guenther, Multidimensional access methods. ACM Comput. Surv. 30(2), 170–231 (1998)
C.A.N. Soules, G.R. Goodson, J.D. Strunk, G.R. Ganger, Metadata efficiency in versioning file systems, in Proceedings of the FAST (2003)
L. Fan, P. Cao, J. Almeida, A.Z. Broder, Summary cache: a scalable wide area web cache sharing protocol, IEEE/ACM Trans. Netw. 8(3) (2000)
Y. Zhu, H. Jiang, J. Wang, F. Xian, HBA: distributed metadata management for large cluster-based storage systems. IEEE Trans. Parallel Distrib. Syst. 19(4), 1–14 (2008)
D. Comer, The Ubiquitous B-tree. ACM Comput. Surv. 11(2), 121–137 (1979)
C. du Mouza, W. Litwin, P. Rigaux, SD-Rtree: a scalable distributed Rtree, in Proceedings of the IEEE ICDE (2007), pp. 296–305
A.J. Menezes, P.C. van Oorschot, S.A. Vanstone, Handbook of Applied Cryptography (CRC Press, Boca Raton, 1997)
D.K. Gifford, P. Jouvelot, M.A. Sheldon, J.W. O'Toole Jr., Semantic file systems, in Proceedings of the SOSP (1991)
Google Desktop, http://www.desktop.google.com/
C. Soules, G. Ganger, Connections: using context to enhance file search, in Proceedings of the SOSP (2005)
J. Kleinberg, Authoritative sources in a hyperlinked environment. J. ACM 46(5), 604–632 (1999)
Google, http://www.google.com/
S. Patil, G. Gibson, GIGA+: scalable directories for shared file systems. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-110 (2008)

Author information

Authors and Affiliations

Huazhong University of Science and Technology, Wuhan, Hubei, China
Yu Hua
McGill University, Montreal, QC, Canada
Xue Liu

Corresponding author

Correspondence to Yu Hua.

Cite this chapter

Hua, Y., Liu, X. (2019). Semantic-Aware Metadata Organization for Exact-Matching Queries. In: Searchable Storage in Cloud Computing. Springer, Singapore. https://doi.org/10.1007/978-981-13-2721-6_4

DOI: https://doi.org/10.1007/978-981-13-2721-6_4
Published: 09 February 2019
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-2720-9
Online ISBN: 978-981-13-2721-6
eBook Packages: Computer Science, Computer Science (R0)
