Near Real-Time Searchable Analytics for Images

  • Yu Hua
  • Xue Liu


The explosive growth in data volume and complexity creates an increasing need for semantic queries. Semantic queries can be interpreted as correlation-aware retrieval that may return approximate results. Existing cloud storage systems largely fail to offer adequate support for semantic queries. Because the value of data depends heavily on how efficiently semantic search can be carried out on it in (near-) real time, large fractions of data lose much or all of their value as they become stale. To address this problem, we propose FAST, a near real-time and cost-effective methodology for semantic queries. The idea behind FAST is to explore and exploit the semantic correlation within and among datasets via correlation-aware hashing and manageable flat-structured addressing, which significantly reduces processing latency while incurring an acceptably small loss of data-search accuracy. The near real-time property of FAST enables rapid identification of correlated files and significantly narrows the scope of data to be processed. FAST supports several types of data analytics, which can be implemented in existing searchable storage systems. We present a real-world use case in which children reported missing in an extremely crowded environment (e.g., a highly popular scenic spot on a peak tourist day) are identified in a timely fashion by analyzing 60 million images using FAST. FAST is further improved by using a semantic-aware namespace to provide dynamic and adaptive namespace management for ultra-large storage systems. Extensive experimental results demonstrate the efficiency and efficacy of FAST (© 2016 IEEE. Reprinted, with permission, from Ref. [1]).
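The abstract does not spell out the hash functions FAST uses, so the following is only a minimal sketch of the general idea of correlation-aware hashing into a flat table. It assumes random-hyperplane locality-sensitive hashing over image feature vectors, with a single hash table keyed by the bit signature. The function names (`make_hyperplanes`, `signature`, `build_index`, `query`) and the toy vectors are hypothetical and are not taken from FAST.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplanes (Gaussian normals) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def build_index(items, planes):
    """Flat addressing: a single hash table from signature to item names."""
    index = defaultdict(list)
    for name, vec in items.items():
        index[signature(vec, planes)].append(name)
    return index


def query(index, planes, vec):
    """Return only the items in the query's bucket; no scan of the full set."""
    return index.get(signature(vec, planes), [])


# Toy feature vectors (hypothetical); real image descriptors are much longer.
planes = make_hyperplanes(num_bits=8, dim=3)
items = {
    "a": [1.0, 2.0, 3.0],
    "a_scaled": [2.0, 4.0, 6.0],      # same direction as "a"
    "opposite": [-1.0, -2.0, -3.0],   # opposite direction to "a"
}
index = build_index(items, planes)
candidates = query(index, planes, [0.5, 1.0, 1.5])
print(candidates)
```

In this sketch, vectors pointing in the same direction always share a bucket, vectors at a small angle share one with high probability, and opposite vectors never do. A lookup therefore touches one bucket rather than the whole dataset, at the cost of occasionally missing a correlated item. That is the kind of latency versus accuracy trade-off the abstract describes.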


  1. Y. Hua, H. Jiang, D. Feng, Real-time semantic search using approximate methodology for large-scale storage systems. Trans. Parallel Distrib. Syst. (TPDS) 27(4), 1212–1225 (2016)
  2. M. Armbrust, A. Fox, R. Griffith, A.D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, M. Zaharia, A view of cloud computing. Commun. ACM 53(4), 50–58 (2010)
  3. A. Marathe, R. Harris, D.K. Lowenthal, B.R. de Supinski, B. Rountree, M. Schulz, X. Yuan, A comparative study of high-performance computing on the cloud, in Proceedings of HPDC (2013)
  4. P. Nath, B. Urgaonkar, A. Sivasubramaniam, Evaluating the usefulness of content addressable storage for high-performance data intensive applications, in Proceedings of HPDC (2008)
  5. Gartner, Inc., Forecast: consumer digital storage needs, 2010–2016 (2012)
  6. Storage Newsletter, 7% of consumer content in cloud storage in 2011, 36% in 2016 (2012)
  7. J. Gantz, D. Reinsel, The digital universe in 2020: big data, bigger digital shadows, and biggest growth in the far east, in International Data Corporation (IDC) iView, Dec 2012
  8. Y. Hua, W. He, X. Liu, D. Feng, SmartEye: real-time and efficient cloud image sharing for disaster environments, in Proceedings of INFOCOM (2015)
  9. Y. Ke, R. Sukthankar, PCA-SIFT: a more distinctive representation for local image descriptors, in Proceedings of CVPR (2004)
  10. Y. Ke, R. Sukthankar, L. Huston, Efficient near-duplicate detection and sub-image retrieval, in Proceedings of ACM Multimedia (2004)
  11. J. Liu, Z. Huang, H.T. Shen, H. Cheng, Y. Chen, Presenting diverse location views with real-time near-duplicate photo elimination, in Proceedings of ICDE (2013)
  12. D. Zhan, H. Jiang, S.C. Seth, CLU: co-optimizing locality and utility in thread-aware capacity management for shared last level caches. IEEE Trans. Comput. 63(7), 1656–1667 (2014)
  13. P. Indyk, R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, in Proceedings of STOC (1998), pp. 604–613
  14. R. Pagh, F. Rodler, Cuckoo hashing, in Proceedings of ESA (2001), pp. 121–133
  15. Y. Hua, H. Jiang, Y. Zhu, D. Feng, L. Xu, SANE: semantic-aware namespace in ultra-large-scale file systems. IEEE Trans. Parallel Distrib. Syst. (TPDS) 25(5), 1328–1338 (2014)
  16. Changewave Research. (2011)
  17. X. Tan, S. Chen, Z.-H. Zhou, F. Zhang, Face recognition from a single image per person: a survey. Pattern Recognit. 39(9), 1725–1745 (2006)
  18. T. Ahonen, A. Hadid, M. Pietikainen, Face description with local binary patterns: application to face recognition. IEEE Trans. Pattern Anal. Mach. Intell. 28(12), 2037–2041 (2006)
  19. X. Tan, B. Triggs, Enhanced local texture feature sets for face recognition under difficult lighting conditions. IEEE Trans. Image Process. 19(6), 1635–1650 (2010)
  20. J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell. 31(2), 210–227 (2009)
  21. Y. Hua, X. Liu, Scheduling heterogeneous flows with delay-aware deduplication for avionics applications. IEEE Trans. Parallel Distrib. Syst. 23(9), 1790–1802 (2012)
  22. A.W. Leung, M. Shao, T. Bisson, S. Pasupathy, E.L. Miller, Spyglass: fast, scalable metadata search for large-scale storage systems, in Proceedings of FAST (2009)
  23. Y. Hua, H. Jiang, Y. Zhu, D. Feng, L. Tian, SmartStore: a new metadata organization paradigm with semantic-awareness for next-generation file systems, in Proceedings of SC (2009)
  24. E. Riedel, M. Kallahalla, R. Swaminathan, A framework for evaluating storage system security, in Proceedings of FAST (2002), pp. 15–30
  25. S. Kavalanekar, B. Worthington, Q. Zhang, V. Sharda, Characterization of storage workload traces from production Windows servers, in Proceedings of the IEEE International Symposium on Workload Characterization (IISWC) (2008)
  26. D. Ellard, J. Ledlie, P. Malkani, M. Seltzer, Passive NFS tracing of email and research workloads, in Proceedings of FAST (2003), pp. 203–216
  27. J.L. Hellerstein, Google cluster data, Jan 2010
  28. D. Lowe, Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
  29. B. Bloom, Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)
  30. A. Andoni, P. Indyk, Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM 51(1), 117–122 (2008)
  31. Q. Lv, W. Josephson, Z. Wang, M. Charikar, K. Li, Multi-probe LSH: efficient indexing for high-dimensional similarity search, in Proceedings of VLDB (2007), pp. 950–961
  32. B. Debnath, S. Sengupta, J. Li, ChunkStash: speeding up inline storage deduplication using flash memory, in Proceedings of USENIX ATC (2010)
  33.
  34. Y. Hua, H. Jiang, D. Feng, FAST: near real-time searchable data analytics for the cloud, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC) (2014)
  35. D. Lowe, Object recognition from local scale-invariant features, in Proceedings of IEEE ICCV (1999)
  36. A. Gionis, P. Indyk, R. Motwani, Similarity search in high dimensions via hashing, in Proceedings of VLDB (1999), pp. 518–529
  37. M. Datar, N. Immorlica, P. Indyk, V. Mirrokni, Locality-sensitive hashing scheme based on p-stable distributions, in Proceedings of the Annual Symposium on Computational Geometry (2004)
  38. Y. Tao, K. Yi, C. Sheng, P. Kalnis, Quality and efficiency in high-dimensional nearest neighbor search, in Proceedings of SIGMOD (2009)
  39. A. Guttman, R-trees: a dynamic index structure for spatial searching, in Proceedings of ACM SIGMOD (1984), pp. 47–57
  40. Y. Liu, L. Guo, F. Li, S. Chen, An empirical evaluation of battery power consumption for streaming data transmission to mobile devices, in Proceedings of Multimedia (2011), pp. 473–482
  41. Monsoon Power Monitor. (2012)
  42. A. Viswanathan, A. Hussain, J. Mirkovic, S. Schwab, J. Wroclawski, A semantic framework for data analysis in networked systems, in Proceedings of NSDI (2011)
  43. S. Lakshminarasimhan, J. Jenkins, I. Arkatkar, Z. Gong, H. Kolla, S.-H. Ku, S. Ethier, J. Chen, C.-S. Chang, S. Klasky et al., ISABELA-QA: query-driven analytics with ISABELA-compressed extreme-scale scientific data, in Proceedings of SC (2011)
  44. M. Mihailescu, G. Soundararajan, C. Amza, MixApart: decoupled analytics for shared storage systems, in Proceedings of FAST (2013)
  45. J.C. Bennett, H. Abbasi, P.-T. Bremer, R. Grout, A. Gyulassy, T. Jin, S. Klasky, H. Kolla, M. Parashar, V. Pascucci et al., Combining in-situ and in-transit processing to enable extreme-scale scientific analysis, in Proceedings of SC (2012)
  46. H. Huang, N. Zhang, W. Wang, G. Das, A. Szalay, Just-in-time analytics on large file systems, in Proceedings of FAST (2011)
  47. S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman, Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci., 391–407 (1990)
  48. C. Papadimitriou, P. Raghavan, H. Tamaki, S. Vempala, Latent semantic indexing: a probabilistic analysis. J. Comput. Syst. Sci. 61(2), 217–235 (2000)
  49. S. Weil, S.A. Brandt, E.L. Miller, D.D.E. Long, C. Maltzahn, Ceph: a scalable, high-performance distributed file system, in Proceedings of OSDI (2006)
  50. C. Maltzahn, E. Molina-Estolano, A. Khurana, A.J. Nelson, S.A. Brandt, S. Weil, Ceph as a scalable alternative to the Hadoop distributed file system, in login: The USENIX Magazine, August 2010
  51. J. Chou, K. Wu, O. Rubel, M. Howison, J. Qiang, B. Austin, E.W. Bethel, R.D. Ryne, A. Shoshani et al., Parallel index and query for large scale data analysis, in Proceedings of SC (2011)
  52. Y. Hua, B. Xiao, B. Veeravalli, D. Feng, Locality-sensitive bloom filter for approximate membership query. IEEE Trans. Comput. 61(6), 817–830 (2012)
  53. J.B. Buck, N. Watkins, J. LeFevre, K. Ioannidou, C. Maltzahn, N. Polyzotis, S. Brandt, SciHadoop: array-based query processing in Hadoop, in Proceedings of SC (2011)
  54. B. Zhu, K. Li, H. Patterson, Avoiding the disk bottleneck in the data domain deduplication file system, in Proceedings of FAST (2008)
  55. D. Bhagwat, K. Eshghi, D. Long, M. Lillibridge, Extreme binning: scalable, parallel deduplication for chunk-based file backup, in Proceedings of IEEE MASCOTS (2009)
  56. W. Xia, H. Jiang, D. Feng, Y. Hua, SiLo: a similarity-locality based near-exact deduplication scheme with low RAM overhead and high throughput, in Proceedings of USENIX ATC (2011)
  57. W. Dong, F. Douglis, K. Li, H. Patterson, S. Reddy, P. Shilane, Tradeoffs in scalable data routing for deduplication clusters, in Proceedings of FAST (2011)
  58. M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, P. Camble, Sparse indexing: large scale, inline deduplication using sampling and locality, in Proceedings of FAST (2009)
  59. A. Muthitacharoen, B. Chen, D. Mazieres, A low-bandwidth network file system, in Proceedings of SOSP (2001)
  60. D. Meister, J. Kaiser, A. Brinkmann, T. Cortes, M. Kuhn, J. Kunkel, A study on data deduplication in HPC storage systems, in Proceedings of SC (2012)
  61. B. Aggarwal, A. Akella, A. Anand, A. Balachandran, P. Chitnis, C. Muthukrishnan, R. Ramjee, G. Varghese, EndRE: an end-system redundancy elimination service for enterprises, in Proceedings of NSDI (2010)

Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. Huazhong University of Science and Technology, Wuhan, China
  2. McGill University, Montreal, Canada
