Distributed and Parallel Databases

Volume 37, Issue 1, pp 177–208

PLI\(^+\): efficient clustering of cloud databases

  • Dai Hai Ton That
  • James Wagner
  • Alexander Rasin
  • Tanu Malik
Part of the following topical collections:
  1. Special Issue on Scientific and Statistical Data Management


Commercial cloud database services increase the availability of data and provide reliable access to it. Routine database maintenance tasks such as clustering, however, increase the cost of hosting data on commercial cloud instances. Clustering causes an I/O burst: clustering in one shot depletes the I/O credits accumulated by an instance and increases the cost of hosting data. An unclustered database, in turn, degrades query performance because queries scan large amounts of data, gradually depleting I/O credits. In this paper, we introduce Physical Location Index Plus (PLI\(^+\)), an indexing method for databases hosted on commercial clouds. PLI\(^+\) relies on internal knowledge of the data layout to build a physical location index, which maps a range of physical locations to a range of attribute values, creating approximately sorted buckets. As new data is inserted, writes are partitioned in memory based on the incoming data distribution. The data is then written to physical locations on disk in block-based partitions to favor large-granularity I/O. Incoming SQL queries on indexed attribute values are rewritten in terms of the physical location ranges. As a result, PLI\(^+\) does not decrease query performance on an unclustered cloud database instance, and DBAs may cluster the instance once they have accumulated sufficient I/O credits, thus delaying the need for clustering. We evaluate query performance over PLI\(^+\) by comparing it with clustered indexes, unclustered (secondary) indexes, and log-structured merge trees on real datasets. Experiments show that PLI\(^+\) significantly delays clustering without degrading query performance, achieving a higher level of sortedness than unclustered indexes and log-structured merge trees. We also evaluate the quality of clustering, by introducing a measure of interval sortedness, and the size of the index.
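The mechanism summarized above (in-memory partitioning of writes, block-based flushes, and query rewriting over physical location ranges) can be illustrated with a minimal sketch. This is not the authors' implementation. The class `ToyPLI`, its method names, the fixed bucket boundaries, and the use of Python lists as stand-ins for disk blocks are all assumptions made for illustration only.

```python
from bisect import bisect_right
from collections import defaultdict


class ToyPLI:
    """Hypothetical toy index: maps attribute-value buckets to block-id ranges."""

    def __init__(self, boundaries, rows_per_block=4):
        # boundaries: sorted split points (assumed known up front);
        # bucket i holds values v with boundaries[i-1] <= v < boundaries[i].
        self.boundaries = list(boundaries)
        self.rows_per_block = rows_per_block
        self.buffers = defaultdict(list)  # bucket id -> rows buffered in memory
        self.blocks = []                  # simulated disk: one list per block
        self.index = defaultdict(list)    # bucket id -> [(first_block, last_block)]

    def _bucket(self, value):
        return bisect_right(self.boundaries, value)

    def insert(self, value):
        # Partition the write in memory; flush a bucket once it fills a block.
        b = self._bucket(value)
        self.buffers[b].append(value)
        if len(self.buffers[b]) == self.rows_per_block:
            self._flush(b)

    def _flush(self, b):
        # Append the buffered rows as one new block and record its location.
        block_id = len(self.blocks)
        self.blocks.append(self.buffers.pop(b))
        ranges = self.index[b]
        if ranges and ranges[-1][1] == block_id - 1:
            ranges[-1] = (ranges[-1][0], block_id)  # extend a contiguous range
        else:
            ranges.append((block_id, block_id))

    def flush_all(self):
        # Write out every partially filled buffer.
        for b in sorted(self.buffers):
            self._flush(b)

    def rewrite(self, low, high):
        # Translate a value predicate [low, high] into candidate block ranges.
        ranges = []
        for b in range(self._bucket(low), self._bucket(high) + 1):
            ranges.extend(self.index.get(b, []))
        return sorted(ranges)

    def query(self, low, high):
        # Scan only the candidate blocks, then filter rows by the predicate.
        hits = []
        for start, end in self.rewrite(low, high):
            for block in self.blocks[start:end + 1]:
                hits.extend(v for v in block if low <= v <= high)
        return sorted(hits)


pli = ToyPLI([10, 20, 30], rows_per_block=2)
for v in [5, 25, 7, 12, 27, 15, 31, 3]:
    pli.insert(v)
pli.flush_all()
print(pli.rewrite(11, 26))
print(pli.query(11, 26))
```

In this sketch each block holds values from a single bucket, so the data is only approximately sorted: blocks of one bucket may be scattered, and rows inside a block are unordered. A range query still touches only the blocks listed for the overlapping buckets rather than the whole table.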


Clustered indexes · Relational databases · Scientific data and computing




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. College of Computing and Digital Media, DePaul University, Chicago, USA
