Abstract
The explosive growth in data volume and complexity creates an increasing need for semantic queries. A semantic query can be interpreted as correlation-aware retrieval that may return approximate results. Existing cloud storage systems largely fail to offer adequate support for semantic queries. Since the true value of data depends heavily on how efficiently semantic search can be carried out on it in (near-) real-time, large fractions of data lose much or all of their value as they become stale. To address this problem, we propose FAST, a near real-time and cost-effective methodology for semantic queries. The idea behind FAST is to explore and exploit the semantic correlation within and among datasets via correlation-aware hashing and manageable flat-structured addressing, which significantly reduces processing latency while incurring an acceptably small loss of data-search accuracy. The near real-time property of FAST enables rapid identification of correlated files and significantly narrows the scope of data to be processed. FAST supports several types of data analytics, which can be implemented in existing searchable storage systems. We evaluate a real-world use case in which children reported missing in an extremely crowded environment (e.g., a highly popular scenic spot on a peak tourist day) are identified in a timely fashion by analyzing 60 million images using FAST. FAST is further improved by using a semantic-aware namespace to provide dynamic and adaptive namespace management for ultra-large storage systems. Extensive experimental results demonstrate the efficiency and efficacy of FAST (© 2016 IEEE. Reprinted, with permission, from Ref. [1]).
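The abstract does not spell out how the correlation-aware hashing or the flat-structured addressing are built, so the following is only a minimal sketch of the general idea under stated assumptions: a random-hyperplane locality-sensitive hash stands in for the correlation-aware hashing, and a plain hash table keyed by the signature stands in for flat addressing. The class `HyperplaneLSH`, its parameters, and the image identifiers are hypothetical and are not taken from FAST.

```python
import random
from collections import defaultdict


class HyperplaneLSH:
    """Toy locality-sensitive index: similar vectors tend to share a bucket.

    Illustrative stand-in only; not the hashing scheme used by FAST.
    """

    def __init__(self, dim, num_bits=8, seed=0):
        rng = random.Random(seed)
        # Each random Gaussian hyperplane contributes one signature bit.
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(num_bits)]
        # Flat addressing: signature -> list of item ids, no tree to traverse.
        self.buckets = defaultdict(list)

    def signature(self, vec):
        """Pack the side of each hyperplane that vec falls on into an int."""
        bits = 0
        for plane in self.planes:
            dot = sum(p * x for p, x in zip(plane, vec))
            bits = (bits << 1) | (1 if dot >= 0 else 0)
        return bits

    def insert(self, item_id, vec):
        self.buckets[self.signature(vec)].append(item_id)

    def query(self, vec):
        # Returns candidates only; a real system would re-rank them
        # by their true distance to the query vector.
        return list(self.buckets.get(self.signature(vec), []))


# Hypothetical feature vectors for two images.
index = HyperplaneLSH(dim=4)
index.insert("img-001", [0.9, 0.1, 0.0, 0.2])
index.insert("img-002", [-0.5, 0.8, -0.3, 0.1])

# A query pointing in the same direction as img-001 hashes to its bucket.
candidates = index.query([1.8, 0.2, 0.0, 0.4])
```

A lookup touches a single bucket rather than scanning the whole collection, which is where the latency saving comes from. The price is approximation: two similar vectors can still fall on opposite sides of some hyperplane and land in different buckets.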
References
Y. Hua, H. Jiang, D. Feng, Real-time semantic search using approximate methodology for large-scale storage systems. IEEE Trans. Parallel Distrib. Syst. (TPDS) 27(4), 1212–1225 (2016)
M. Armbrust, A. Fox, R. Griffith, A.D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, M. Zaharia, A view of cloud computing. Commun. ACM 53(4), 50–58 (2010)
A. Marathe, R. Harris, D.K. Lowenthal, B.R. de Supinski, B. Rountree, M. Schulz, X. Yuan, A comparative study of high-performance computing on the cloud, in Proceedings of HPDC (2013)
P. Nath, B. Urgaonkar, A. Sivasubramaniam, Evaluating the usefulness of content addressable storage for high-performance data intensive applications, in Proceedings of HPDC (2008)
Gartner, Inc., Forecast: consumer digital storage needs, 2010–2016 (2012)
Storage Newsletter, 7% of consumer content in cloud storage in 2011, 36% in 2016 (2012)
J. Gantz, D. Reinsel, The digital universe in 2020: big data, bigger digital shadows, and biggest growth in the far east, in International Data Corporation (IDC) iView, Dec 2012
Y. Hua, W. He, X. Liu, D. Feng, SmartEye: real-time and efficient cloud image sharing for disaster environments, in Proceedings of INFOCOM (2015)
Y. Ke, R. Sukthankar, PCA-SIFT: a more distinctive representation for local image descriptors, in Proceedings of CVPR (2004)
Y. Ke, R. Sukthankar, L. Huston, Efficient near-duplicate detection and sub-image retrieval, in Proceedings of ACM Multimedia (2004)
J. Liu, Z. Huang, H.T. Shen, H. Cheng, Y. Chen, Presenting diverse location views with real-time near-duplicate photo elimination, in Proceedings of ICDE (2013)
D. Zhan, H. Jiang, S.C. Seth, CLU: co-optimizing locality and utility in thread-aware capacity management for shared last level caches. IEEE Trans. Comput. 63(7), 1656–1667 (2014)
P. Indyk, R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, in Proceedings of STOC (1998), pp. 604–613
R. Pagh, F. Rodler, Cuckoo hashing, in Proceedings of ESA (2001), pp. 121–133
Y. Hua, H. Jiang, Y. Zhu, D. Feng, L. Xu, SANE: semantic-aware namespace in ultra-large-scale file systems. IEEE Trans. Parallel Distrib. Syst. (TPDS) 25(5), 1328–1338 (2014)
Changewave Research. http://www.changewaveresearch.com (2011)
X. Tan, S. Chen, Z.-H. Zhou, F. Zhang, Face recognition from a single image per person: a survey. Pattern Recognit. 39(9), 1725–1745 (2006)
T. Ahonen, A. Hadid, M. Pietikainen, Face description with local binary patterns: application to face recognition. IEEE Trans. Pattern Anal. Mach. Intell. 28(12), 2037–2041 (2006)
X. Tan, B. Triggs, Enhanced local texture feature sets for face recognition under difficult lighting conditions. IEEE Trans. Image Process. 19(6), 1635–1650 (2010)
J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell. 31(2), 210–227 (2009)
Y. Hua, X. Liu, Scheduling heterogeneous flows with delay-aware deduplication for avionics applications. IEEE Trans. Parallel Distrib. Syst. 23(9), 1790–1802 (2012)
A.W. Leung, M. Shao, T. Bisson, S. Pasupathy, E.L. Miller, Spyglass: fast, scalable metadata search for large-scale storage systems, in Proceedings of FAST (2009)
Y. Hua, H. Jiang, Y. Zhu, D. Feng, L. Tian, SmartStore: a new metadata organization paradigm with semantic-awareness for next-generation file systems, in Proceedings of SC (2009)
E. Riedel, M. Kallahalla, R. Swaminathan, A framework for evaluating storage system security, in Proceedings of FAST (2002), pp. 15–30
S. Kavalanekar, B. Worthington, Q. Zhang, V. Sharda, Characterization of storage workload traces from production Windows servers, in Proceedings of the IEEE International Symposium on Workload Characterization (IISWC) (2008)
D. Ellard, J. Ledlie, P. Malkani, M. Seltzer, Passive NFS tracing of email and research workloads, in Proceedings of FAST (2003), pp. 203–216
J.L. Hellerstein, Google cluster data. http://googleresearch.blogspot.com/2010/01/google-cluster-data.html, Jan 2010
D. Lowe, Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
B. Bloom, Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)
A. Andoni, P. Indyk, Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM 51(1), 117–122 (2008)
Q. Lv, W. Josephson, Z. Wang, M. Charikar, K. Li, Multi-probe LSH: efficient indexing for high-dimensional similarity search, in Proceedings of VLDB (2007), pp. 950–961
B. Debnath, S. Sengupta, J. Li, ChunkStash: speeding up inline storage deduplication using flash memory, in Proceedings of USENIX ATC (2010)
Y. Hua, H. Jiang, D. Feng, FAST: near real-time searchable data analytics for the cloud, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC) (2014)
D. Lowe, Object recognition from local scale-invariant features, in Proceedings of IEEE ICCV (1999)
A. Gionis, P. Indyk, R. Motwani, Similarity search in high dimensions via hashing, in Proceedings of VLDB (1999), pp. 518–529
M. Datar, N. Immorlica, P. Indyk, V. Mirrokni, Locality-sensitive hashing scheme based on p-stable distributions, in Proceedings of the Annual Symposium on Computational Geometry (2004)
Y. Tao, K. Yi, C. Sheng, P. Kalnis, Quality and efficiency in high-dimensional nearest neighbor search, in Proceedings of SIGMOD (2009)
A. Guttman, R-trees: a dynamic index structure for spatial searching, in Proceedings of ACM SIGMOD (1984), pp. 47–57
Y. Liu, L. Guo, F. Li, S. Chen, An empirical evaluation of battery power consumption for streaming data transmission to mobile devices, in Proceedings of ACM Multimedia (2011), pp. 473–482
Monsoon Power Monitor. http://www.msoon.com (2012)
A. Viswanathan, A. Hussain, J. Mirkovic, S. Schwab, J. Wroclawski, A semantic framework for data analysis in networked systems, in Proceedings of NSDI (2011)
S. Lakshminarasimhan, J. Jenkins, I. Arkatkar, Z. Gong, H. Kolla, S.-H. Ku, S. Ethier, J. Chen, C.-S. Chang, S. Klasky et al., ISABELA-QA: query-driven analytics with ISABELA-compressed extreme-scale scientific data, in Proceedings of SC (2011)
M. Mihailescu, G. Soundararajan, C. Amza, MixApart: decoupled analytics for shared storage systems, in Proceedings of FAST (2013)
J.C. Bennett, H. Abbasi, P.-T. Bremer, R. Grout, A. Gyulassy, T. Jin, S. Klasky, H. Kolla, M. Parashar, V. Pascucci et al., Combining in-situ and in-transit processing to enable extreme-scale scientific analysis, in Proceedings of SC (2012)
H. Huang, N. Zhang, W. Wang, G. Das, A. Szalay, Just-in-time analytics on large file systems, in Proceedings of FAST (2011)
S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman, Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41(6), 391–407 (1990)
C. Papadimitriou, P. Raghavan, H. Tamaki, S. Vempala, Latent semantic indexing: a probabilistic analysis. J. Comput. Syst. Sci. 61(2), 217–235 (2000)
S. Weil, S.A. Brandt, E.L. Miller, D.D.E. Long, C. Maltzahn, Ceph: a scalable, high-performance distributed file system, in Proceedings of OSDI (2006)
C. Maltzahn, E. Molina-Estolano, A. Khurana, A.J. Nelson, S.A. Brandt, S. Weil, Ceph as a scalable alternative to the Hadoop distributed file system, in ;login: The USENIX Magazine, Aug 2010
J. Chou, K. Wu, O. Rubel, M. Howison, J. Qiang, B. Austin, E.W. Bethel, R.D. Ryne, A. Shoshani et al., Parallel index and query for large scale data analysis, in Proceedings of SC (2011)
Y. Hua, B. Xiao, B. Veeravalli, D. Feng, Locality-sensitive bloom filter for approximate membership query. IEEE Trans. Comput. 61(6), 817–830 (2012)
J.B. Buck, N. Watkins, J. LeFevre, K. Ioannidou, C. Maltzahn, N. Polyzotis, S. Brandt, SciHadoop: array-based query processing in Hadoop, in Proceedings of SC (2011)
B. Zhu, K. Li, H. Patterson, Avoiding the disk bottleneck in the data domain deduplication file system, in Proceedings of FAST (2008)
D. Bhagwat, K. Eshghi, D. Long, M. Lillibridge, Extreme binning: scalable, parallel deduplication for chunk-based file backup, in Proceedings of IEEE MASCOTS (2009)
W. Xia, H. Jiang, D. Feng, Y. Hua, SiLo: a similarity-locality based near-exact deduplication scheme with low RAM overhead and high throughput, in Proceedings of USENIX ATC (2011)
W. Dong, F. Douglis, K. Li, H. Patterson, S. Reddy, P. Shilane, Tradeoffs in scalable data routing for deduplication clusters, in Proceedings of FAST (2011)
M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, P. Camble, Sparse indexing: large scale, inline deduplication using sampling and locality, in Proceedings of FAST (2009)
A. Muthitacharoen, B. Chen, D. Mazieres, A low-bandwidth network file system, in Proceedings of SOSP (2001)
D. Meister, J. Kaiser, A. Brinkmann, T. Cortes, M. Kuhn, J. Kunkel, A study on data deduplication in HPC storage systems, in Proceedings of SC (2012)
B. Aggarwal, A. Akella, A. Anand, A. Balachandran, P. Chitnis, C. Muthukrishnan, R. Ramjee, G. Varghese, EndRE: an end-system redundancy elimination service for enterprises, in Proceedings of NSDI (2010)
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Hua, Y., Liu, X. (2019). Near Real-Time Searchable Analytics for Images. In: Searchable Storage in Cloud Computing. Springer, Singapore. https://doi.org/10.1007/978-981-13-2721-6_6
DOI: https://doi.org/10.1007/978-981-13-2721-6_6
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-2720-9
Online ISBN: 978-981-13-2721-6
eBook Packages: Computer Science, Computer Science (R0)