Abstract
Existing data storage systems based on hierarchical directory-tree organization do not meet the scalability and functionality requirements of exponentially growing datasets and increasingly complex metadata queries in large-scale, exabyte-level file systems with billions of files. This chapter proposes a novel decentralized semantic-aware metadata organization, called SmartStore, which exploits the semantics of files' metadata to judiciously aggregate correlated files into semantic-aware groups by using information retrieval tools. The key idea of SmartStore is to limit the search scope of a complex metadata query to a single or a minimal number of semantically correlated groups, thereby avoiding or alleviating brute-force search over the entire system. The decentralized design of SmartStore can improve system scalability and reduce query latency for complex queries (including range and top-k queries). Moreover, it is also conducive to constructing semantic-aware caching and to supporting conventional filename-based point queries. We have implemented a prototype of SmartStore, and extensive experiments based on real-world traces show that SmartStore significantly improves system scalability and reduces query latency over database approaches. To the best of our knowledge, this is the first study on the implementation of complex queries in large-scale file systems (© 2012 IEEE. Reprinted, with permission, from Ref. [1].).
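The idea of limiting a query's search scope to a few correlated groups can be illustrated with a minimal sketch. It is not SmartStore's actual algorithm: the grouping is assigned by hand rather than derived with information retrieval tools, the attributes (`size_kb`, `mtime_hour`) and the helper names (`Group`, `overlaps`, `range_query`) are hypothetical, and each group is summarized by a simple bounding box so that a range query can skip groups that cannot contain a match.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """A hand-built group of correlated files (illustrative only)."""
    files: dict = field(default_factory=dict)  # name -> (size_kb, mtime_hour)

    def add(self, name, attrs):
        self.files[name] = attrs

    def mbr(self):
        """Bounding box over the attribute vectors; assumes a non-empty group."""
        cols = list(zip(*self.files.values()))
        return [min(c) for c in cols], [max(c) for c in cols]


def overlaps(mbr, lo, hi):
    """True if the query box [lo, hi] intersects the group's bounding box."""
    g_lo, g_hi = mbr
    return all(gl <= h and l <= gh for gl, gh, l, h in zip(g_lo, g_hi, lo, hi))


def range_query(groups, lo, hi):
    """Return matching file names and the number of groups actually scanned."""
    hits, scanned = [], 0
    for g in groups:
        if not overlaps(g.mbr(), lo, hi):
            continue  # prune: no file in this group can fall inside the query box
        scanned += 1
        for name, attrs in g.files.items():
            if all(l <= a <= h for a, l, h in zip(attrs, lo, hi)):
                hits.append(name)
    return sorted(hits), scanned


small_recent = Group()
small_recent.add("a.log", (4, 10))
small_recent.add("b.log", (6, 12))
large_old = Group()
large_old.add("x.iso", (4000, 900))
large_old.add("y.iso", (5200, 950))

# Range query: size_kb in [0, 5] and mtime_hour in [0, 24].
result, scanned = range_query([small_recent, large_old], lo=(0, 0), hi=(5, 24))
print(result, scanned)  # ['a.log'] 1
```

Only one of the two groups is scanned here, because the bounding box of the second group lies entirely outside the query range. A brute-force search would have examined every file in both.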
Notes
1. Given a clear context, we will simply use top-k queries in place of top-k NN queries.
References
Y. Hua, H. Jiang, Y. Zhu, D. Feng, L. Tian, Semantic-aware metadata organization paradigm in next-generation file systems. IEEE Trans. Parallel Distrib. Syst. (TPDS) 2, 337–344 (2012)
J. Nunez, High end computing file system and I/O R&D gaps roadmap, in High Performance Computer Science Week, ASCR Computer Science Research (2008)
J.R. Douceur, J. Howell, Distributed directory service in the farsite file system, in Proceedings of the OSDI (2006), pp. 321–334
S.A. Weil, S.A. Brandt, E.L. Miller, D.D.E. Long, C. Maltzahn, Ceph: a scalable, high-performance distributed file system, in Proceedings of the OSDI (2006)
D. Agrawal, S. Das, A.E. Abbadi, Big data and cloud computing: new wine or just new bottles? in VLDB tutorial (2010)
M. Stonebraker, U. Cetintemel, One size fits all: an idea whose time has come and gone, in Proceedings of the ICDE (2005)
A.W. Leung, M. Shao, T. Bisson, S. Pasupathy, E.L. Miller, Spyglass: fast, scalable metadata search for large-scale storage systems, in Proceedings of the FAST (2009)
D. Roselli, J. Lorch, T. Anderson, A comparison of file system workloads, in Proceedings of the USENIX Conference (2000), pp. 41–54
A. Traeger, E. Zadok, N. Joukov, C. Wright, A nine year study of file system and storage benchmarking. ACM Trans. Storage 2, 1–56 (2008)
A. Szalay, New challenges in petascale scientific databases, in Keynote Talk in Scientific and Statistical Database Management Conference (SSDBM) (2008)
M. Seltzer, N. Murphy, Hierarchical file systems are dead, in Proceedings of the HotOS (2009)
F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, R. Gruber, Bigtable: a distributed storage system for structured data, in Proceedings of the OSDI (2006)
D.K. Gifford, P. Jouvelot, M.A. Sheldon, J.W. O'Toole, Semantic file systems, in Proceedings of the SOSP (1991)
P. Gu, J. Wang, Y. Zhu, H. Jiang, P. Shang, A novel weighted-graph-based grouping algorithm for metadata prefetching. IEEE Trans. Comput. 1, 1–15 (2010)
S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman, Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41, 391–407 (1990)
C. Papadimitriou, P. Raghavan, H. Tamaki, S. Vempala, Latent semantic indexing: a probabilistic analysis. J. Comput. Syst. Sci. 61(2), 217–235 (2000)
S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman, Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41(6), 391–407 (1990)
T. Hofmann, Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst. (TOIS) 22(1), 89–115 (2004)
T. Hofmann, Probabilistic latent semantic indexing, in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1999), pp. 50–57
S. Doraimani, A. Iamnitchi, File grouping for scientific data management: lessons from experimenting with real traces, in Proceedings of the HPDC (2008)
A. Leung, S. Pasupathy, G. Goodson, E. Miller, Measurement and analysis of large-scale network file system workloads, in Proceedings of the USENIX Conference (2008)
P. Xia, D. Feng, H. Jiang, L. Tian, F. Wang, FARMER: a novel approach to file access correlation mining and evaluation reference model for optimizing peta-scale file systems performance, in Proceedings of the HPDC (2008)
E. Riedel, M. Kallahalla, R. Swaminathan, A framework for evaluating storage system security, in Proceedings of the FAST (2002)
S. Kavalanekar, B. Worthington, Q. Zhang, V. Sharda, Characterization of storage workload traces from production Windows servers, in Proceedings of the IEEE International Symposium on Workload Characterization (IISWC) (2008)
D. Ellard, J. Ledlie, P. Malkani, M. Seltzer, Passive NFS tracing of email and research workloads, in Proceedings of the FAST (2003)
Y. Hua, H. Jiang, Y. Zhu, D. Feng, L. Tian, SmartStore: a new metadata organization paradigm with metadata semantic-awareness for next-generation file systems, Technical Report (University of Nebraska-Lincoln, TR-UNL-CSE-2008-0012, November, 2008)
Y. Hua, H. Jiang, Y. Zhu, D. Feng, L. Tian, SmartStore: a new metadata organization paradigm with semantic-awareness. FAST Work-in-Progress Report and Poster Session (February, 2009)
B. Zhu, K. Li, H. Patterson, Avoiding the disk bottleneck in the data domain deduplication file system, in Proceedings of the FAST (2008)
M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, P. Camble, Sparse indexing: large scale, inline deduplication using sampling and locality, in Proceedings of the FAST (2009)
X. Liu, A. Aboulnaga, K. Salem, X. Li, CLIC: client-informed caching for storage servers, in Proceedings of the FAST (2009)
M. Li, E. Varki, S. Bhatia, A. Merchant, TaP: table-based prefetching for storage caches, in Proceedings of the FAST (2008)
A. Guttman, R-trees: a dynamic index structure for spatial searching, in Proceedings of the SIGMOD (1984)
B. Bloom, Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)
Y. Hua, Y. Zhu, H. Jiang, D. Feng, L. Tian, Scalable and adaptive metadata management in ultra large-scale file systems, in Proceedings of the ICDCS (2008)
J. Hartigan, M. Wong, Algorithm AS 136: a K-means clustering algorithm. Appl. Stat. 28, 100–108 (1979)
G. Salton, A. Wong, C. Yang, A vector space model for information retrieval. J. Am. Soc. Inf. Retr. 3, 613–620 (1975)
M. Berry, Z. Drmac, E. Jessup, Matrices, vector spaces, and information retrieval. SIAM Rev. 41, 335–362 (1999)
G. Golub, C. Van Loan, Matrix Computations (Johns Hopkins University Press, USA, 1996)
G. McLachlan, T. Krishnan, The EM Algorithm and Extensions (Wiley, New York, 1997)
A. Dempster, N. Laird, D. Rubin et al., Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 39(1), 1–38 (1977)
P. Moreno, P. Ho, N. Vasconcelos, A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications, in Advances in Neural Information Processing Systems (2004)
Z. Rached, F. Alajaji, L. Campbell, The Kullback-Leibler divergence rate between Markov sources. IEEE Trans. Inf. Theory 50(5), 917–921 (2004)
Y. Hua, H. Jiang, Y. Zhu, D. Feng, L. Tian, SmartStore: a new metadata organization paradigm with semantic-awareness for next-generation file systems, in Proceedings of ACM/IEEE Supercomputing Conference (SC) (2009)
P. Indyk, R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, in STOC (1998), pp. 604–613
V. Gaede, O. Guenther, Multidimensional access methods. ACM Comput. Surv. 30(2), 170–231 (1998)
C.A.N. Soules, G.R. Goodson, J.D. Strunk, G.R. Ganger, Metadata efficiency in versioning file systems, in Proceedings of the FAST (2003)
L. Fan, P. Cao, J. Almeida, A.Z. Broder, Summary cache: a scalable wide area web cache sharing protocol. IEEE/ACM Trans. Netw. 8(3) (2000)
Y. Zhu, H. Jiang, J. Wang, F. Xian, HBA: distributed metadata management for large cluster-based storage systems. IEEE Trans. Parallel Distrib. Syst. 19(4), 1–14 (2008)
D. Comer, The Ubiquitous B-tree. ACM Comput. Surv. 11(2), 121–137 (1979)
C. du Mouza, W. Litwin, P. Rigaux, SD-Rtree: a scalable distributed Rtree, in Proceedings of the IEEE ICDE (2007), pp. 296–305
A.J. Menezes, P.C. van Oorschot, S.A. Vanstone, Handbook of Applied Cryptography (CRC Press, Boca Raton, 1997)
D.K. Gifford, P. Jouvelot, M.A. Sheldon, J.W. O'Toole Jr., Semantic file systems, in Proceedings of the SOSP (1991)
Google Desktop, http://www.desktop.google.com/
C. Soules, G. Ganger, Connections: using context to enhance file search, in Proceedings of the SOSP (2005)
J. Kleinberg, Authoritative sources in a hyperlinked environment. J. ACM 46(5), 604–632 (1999)
Google, http://www.google.com/
S. Patil, G. Gibson, GIGA+: scalable directories for shared file systems. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-110 (2008)
© 2019 Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Hua, Y., Liu, X. (2019). Semantic-Aware Metadata Organization for Exact-Matching Queries. In: Searchable Storage in Cloud Computing. Springer, Singapore. https://doi.org/10.1007/978-981-13-2721-6_4
DOI: https://doi.org/10.1007/978-981-13-2721-6_4
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-2720-9
Online ISBN: 978-981-13-2721-6
eBook Packages: Computer Science, Computer Science (R0)