Semantic-Aware Metadata Organization for Exact-Matching Queries

Searchable Storage in Cloud Computing

Abstract

Existing data storage systems based on the hierarchical directory-tree organization do not meet the scalability and functionality requirements of exponentially growing datasets and increasingly complex metadata queries in large-scale, exabyte-level file systems with billions of files. This chapter proposes a novel decentralized semantic-aware metadata organization, called SmartStore, which exploits the semantics of files' metadata to judiciously aggregate correlated files into semantic-aware groups by using information retrieval tools. The key idea of SmartStore is to limit the search scope of a complex metadata query to a single group, or a minimal number of semantically correlated groups, and thereby avoid or alleviate brute-force search across the entire system. The decentralized design of SmartStore can improve system scalability and reduce query latency for complex queries (including range and top-k queries). Moreover, it is conducive to constructing semantic-aware caching and to supporting conventional filename-based point queries. We have implemented a prototype of SmartStore, and extensive experiments based on real-world traces show that SmartStore significantly improves system scalability and reduces query latency compared with database approaches. To the best of our knowledge, this is the first study on the implementation of complex queries in large-scale file systems (© 2012 IEEE. Reprinted, with permission, from Ref. [1]).
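
The key idea in the abstract, restricting a range query to the few groups that can contain matches, can be illustrated with a minimal Python sketch. It is not the SmartStore implementation: the names `Group`, `build_groups`, and `range_query` are hypothetical, metadata is reduced to small numeric tuples, and nearest-centroid assignment is used here as a simple stand-in for the information retrieval tools the chapter relies on. Each group keeps a per-attribute bounding box so that a query can skip a whole group without inspecting its files.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """Hypothetical semantic group: member files plus a bounding box."""
    files: list = field(default_factory=list)  # entries are (name, attrs)
    lo: list = None                            # per-attribute minimum
    hi: list = None                            # per-attribute maximum

    def add(self, name, attrs):
        self.files.append((name, attrs))
        if self.lo is None:
            self.lo, self.hi = list(attrs), list(attrs)
        else:
            self.lo = [min(a, b) for a, b in zip(self.lo, attrs)]
            self.hi = [max(a, b) for a, b in zip(self.hi, attrs)]

    def overlaps(self, q_lo, q_hi):
        # True when the query box intersects this group's bounding box.
        return self.lo is not None and all(
            l <= qh and ql <= h
            for l, h, ql, qh in zip(self.lo, self.hi, q_lo, q_hi))


def build_groups(files, centroids):
    """Assign each file to its nearest centroid (assumed stand-in grouping)."""
    groups = [Group() for _ in centroids]
    for name, attrs in files:
        dists = [sum((a - c) ** 2 for a, c in zip(attrs, cen))
                 for cen in centroids]
        groups[dists.index(min(dists))].add(name, attrs)
    return groups


def range_query(groups, q_lo, q_hi):
    """Return matching file names and the number of groups scanned."""
    hits, scanned = [], 0
    for g in groups:
        if not g.overlaps(q_lo, q_hi):
            continue  # prune the whole group without touching its files
        scanned += 1
        for name, attrs in g.files:
            if all(ql <= a <= qh for a, ql, qh in zip(attrs, q_lo, q_hi)):
                hits.append(name)
    return sorted(hits), scanned


# Illustrative data: attrs are (size, age) in arbitrary units.
files = [("a.log", (10, 1)), ("b.log", (12, 2)),
         ("x.iso", (900, 50)), ("y.iso", (950, 55))]
groups = build_groups(files, centroids=[(10, 1), (900, 50)])
print(range_query(groups, (5, 0), (15, 3)))  # (['a.log', 'b.log'], 1)
```

In this toy run only one of the two groups is scanned, which is the effect the abstract describes; a brute-force search would have examined every file.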

Notes

  1. Given a clear context, we will simply use top-k queries in place of top-k NN queries.

References

  1. Y. Hua, H. Jiang, Y. Zhu, D. Feng, L. Tian, Semantic-aware metadata organization paradigm in next-generation file systems. IEEE Trans. Parallel Distrib. Syst. (TPDS) 2, 337–344 (2012)

  2. J. Nunez, High end computing file system and I/O R&D gaps roadmap, in High Performance Computer Science Week, ASCR Computer Science Research (2008)

  3. J.R. Douceur, J. Howell, Distributed directory service in the Farsite file system, in Proceedings of the OSDI (2006), pp. 321–334

  4. S.A. Weil, S.A. Brandt, E.L. Miller, D.D.E. Long, C. Maltzahn, Ceph: a scalable, high-performance distributed file system, in Proceedings of the OSDI (2006)

  5. D. Agrawal, S. Das, A.E. Abbadi, Big data and cloud computing: new wine or just new bottles? in VLDB tutorial (2010)

  6. M. Stonebraker, U. Cetintemel, One size fits all: an idea whose time has come and gone, in Proceedings of the ICDE (2005)

  7. A.W. Leung, M. Shao, T. Bisson, S. Pasupathy, E.L. Miller, Spyglass: fast, scalable metadata search for large-scale storage systems, in Proceedings of the FAST (2009)

  8. D. Roselli, J. Lorch, T. Anderson, A comparison of file system workloads, in Proceedings of the USENIX Conference (2000), pp. 41–54

  9. A. Traeger, E. Zadok, N. Joukov, C. Wright, A nine year study of file system and storage benchmarking. ACM Trans. Storage 2, 1–56 (2008)

  10. A. Szalay, New challenges in petascale scientific databases, in Keynote Talk in Scientific and Statistical Database Management Conference (SSDBM) (2008)

  11. M. Seltzer, N. Murphy, Hierarchical file systems are dead, in Proceedings of the HotOS (2009)

  12. F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, R. Gruber, Bigtable: a distributed storage system for structured data, in Proceedings of the OSDI (2006)

  13. D.K. Gifford, P. Jouvelot, M.A. Sheldon, J.W. O'Toole, Semantic file systems, in Proceedings of the SOSP (1991)

  14. P. Gu, J. Wang, Y. Zhu, H. Jiang, P. Shang, A novel weighted-graph-based grouping algorithm for metadata prefetching. IEEE Trans. Comput. 1, 1–15 (2010)

  15. S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman, Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41, 391–407 (1990)

  16. C. Papadimitriou, P. Raghavan, H. Tamaki, S. Vempala, Latent semantic indexing: a probabilistic analysis. J. Comput. Syst. Sci. 61(2), 217–235 (2000)

  17. S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman, Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41(6), 391–407 (1990)

  18. T. Hofmann, Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst. (TOIS) 22(1), 89–115 (2004)

  19. T. Hofmann, Probabilistic latent semantic indexing, in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1999), pp. 50–57

  20. S. Doraimani, A. Iamnitchi, File grouping for scientific data management: lessons from experimenting with real traces, in Proceedings of the HPDC (2008)

  21. A. Leung, S. Pasupathy, G. Goodson, E. Miller, Measurement and analysis of large-scale network file system workloads, in Proceedings of the USENIX Conference (2008)

  22. P. Xia, D. Feng, H. Jiang, L. Tian, F. Wang, FARMER: a novel approach to file access correlation mining and evaluation reference model for optimizing peta-scale file systems performance, in Proceedings of the HPDC (2008)

  23. E. Riedel, M. Kallahalla, R. Swaminathan, A framework for evaluating storage system security, in Proceedings of the FAST (2002)

  24. S. Kavalanekar, B. Worthington, Q. Zhang, V. Sharda, Characterization of storage workload traces from production Windows servers, in Proceedings of the IEEE International Symposium on Workload Characterization (IISWC) (2008)

  25. D. Ellard, J. Ledlie, P. Malkani, M. Seltzer, Passive NFS tracing of email and research workloads, in Proceedings of the FAST (2003)

  26. Y. Hua, H. Jiang, Y. Zhu, D. Feng, L. Tian, SmartStore: a new metadata organization paradigm with metadata semantic-awareness for next-generation file systems, Technical Report (University of Nebraska-Lincoln, TR-UNL-CSE-2008-0012, November, 2008)

  27. Y. Hua, H. Jiang, Y. Zhu, D. Feng, L. Tian, SmartStore: a new metadata organization paradigm with semantic-awareness. FAST Work-in-Progress Report and Poster Session (February, 2009)

  28. B. Zhu, K. Li, H. Patterson, Avoiding the disk bottleneck in the Data Domain deduplication file system, in Proceedings of the FAST (2008)

  29. M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, P. Camble, Sparse indexing: large scale, inline deduplication using sampling and locality, in Proceedings of the FAST (2009)

  30. X. Liu, A. Aboulnaga, K. Salem, X. Li, CLIC: client-informed caching for storage servers, in Proceedings of the FAST (2009)

  31. M. Li, E. Varki, S. Bhatia, A. Merchant, TaP: table-based prefetching for storage caches, in Proceedings of the FAST (2008)

  32. A. Guttman, R-trees: a dynamic index structure for spatial searching, in Proceedings of the SIGMOD (1984)

  33. B. Bloom, Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)

  34. Y. Hua, Y. Zhu, H. Jiang, D. Feng, L. Tian, Scalable and adaptive metadata management in ultra large-scale file systems, in Proceedings of the ICDCS (2008)

  35. J. Hartigan, M. Wong, Algorithm AS 136: a K-means clustering algorithm. Appl. Stat. 28, 100–108 (1979)

  36. G. Salton, A. Wong, C. Yang, A vector space model for information retrieval. J. Am. Soc. Inf. Retr. 3, 613–620 (1975)

  37. M. Berry, Z. Drmac, E. Jessup, Matrices, vector spaces, and information retrieval. SIAM Rev. 41, 335–362 (1999)

  38. G. Golub, C. Van Loan, Matrix Computations (Johns Hopkins University Press, USA, 1996)

  39. G. McLachlan, T. Krishnan, The EM Algorithm and Extensions (Wiley, New York, 1997)

  40. A. Dempster, N. Laird, D. Rubin et al., Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 39(1), 1–38 (1977)

  41. P. Moreno, P. Ho, N. Vasconcelos, A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications, in Advances in Neural Information Processing Systems (2004)

  42. Z. Rached, F. Alajaji, L. Campbell, The Kullback-Leibler divergence rate between Markov sources. IEEE Trans. Inf. Theory 50(5), 917–921 (2004)

  43. Y. Hua, H. Jiang, Y. Zhu, D. Feng, L. Tian, SmartStore: a new metadata organization paradigm with semantic-awareness for next-generation file systems, in Proceedings of ACM/IEEE Supercomputing Conference (SC) (2009)

  44. P. Indyk, R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, in STOC (1998), pp. 604–613

  45. V. Gaede, O. Guenther, Multidimensional access methods. ACM Comput. Surv. 30(2), 170–231 (1998)

  46. C.A.N. Soules, G.R. Goodson, J.D. Strunk, G.R. Ganger, Metadata efficiency in versioning file systems, in Proceedings of the FAST (2003)

  47. L. Fan, P. Cao, J. Almeida, A.Z. Broder, Summary cache: a scalable wide area web cache sharing protocol. IEEE/ACM Trans. Netw. 8(3) (2000)

  48. Y. Zhu, H. Jiang, J. Wang, F. Xian, HBA: distributed metadata management for large cluster-based storage systems. IEEE Trans. Parallel Distrib. Syst. 19(4), 1–14 (2008)

  49. D. Comer, The Ubiquitous B-tree. ACM Comput. Surv. 11(2), 121–137 (1979)

  50. C. du Mouza, W. Litwin, P. Rigaux, SD-Rtree: a scalable distributed Rtree, in Proceedings of the IEEE ICDE (2007), pp. 296–305

  51. A.J. Menezes, P.C. van Oorschot, S.A. Vanstone, Handbook of Applied Cryptography (CRC Press, Boca Raton, 1997)

  52. D.K. Gifford, P. Jouvelot, M.A. Sheldon, J.W. O'Toole Jr., Semantic file systems, in Proceedings of the SOSP (1991)

  53. Google Desktop, http://www.desktop.google.com/

  54. C. Soules, G. Ganger, Connections: using context to enhance file search, in Proceedings of the SOSP (2005)

  55. J. Kleinberg, Authoritative sources in a hyperlinked environment. J. ACM 46(5), 604–632 (1999)

  56. Google, http://www.google.com/

  57. S. Patil, G. Gibson, GIGA+: scalable directories for shared file systems. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-110 (2008)

Author information

Corresponding author

Correspondence to Yu Hua.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Hua, Y., Liu, X. (2019). Semantic-Aware Metadata Organization for Exact-Matching Queries. In: Searchable Storage in Cloud Computing. Springer, Singapore. https://doi.org/10.1007/978-981-13-2721-6_4

  • DOI: https://doi.org/10.1007/978-981-13-2721-6_4

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-2720-9

  • Online ISBN: 978-981-13-2721-6

  • eBook Packages: Computer Science, Computer Science (R0)
