Heterogeneity-Aware Data Placement in Hybrid Clouds

  • Jack D. Marquez
  • Juan D. Gonzalez
  • Oscar H. Mondragon
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11513)


In next-generation cloud computing clusters, the performance of data-intensive applications will be limited, among other factors, by disk data transfer rates. To mitigate this performance impact, cloud systems offering hierarchical storage architectures are becoming commonplace. The Hadoop Distributed File System (HDFS) offers a collection of storage policies that exploit different storage types such as RAM_DISK, SSD, HDD, and ARCHIVE. However, developing algorithms that leverage heterogeneous storage through efficient data placement has been challenging. This work presents an intelligent algorithm based on genetic programming that allows finding the optimal mapping of input datasets to storage types on a Hadoop file system.
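To make the idea of searching for a dataset-to-storage-type mapping concrete, the following is a minimal sketch of a generic genetic algorithm. It is not the authors' algorithm. The tier speeds and capacities, the dataset sizes and read counts, the cost model (estimated read time plus a capacity penalty), and the names `cost` and `evolve` are all assumptions made for illustration only.

```python
import random

# Hypothetical tiers: relative read speed (higher is faster) and capacity in GB.
TIERS = {
    "RAM_DISK": {"speed": 10.0, "capacity": 8},
    "SSD": {"speed": 4.0, "capacity": 32},
    "HDD": {"speed": 1.0, "capacity": 256},
}
TIER_NAMES = list(TIERS)

# Hypothetical input datasets: size in GB and expected number of reads.
DATASETS = [
    {"size": 6, "reads": 50},
    {"size": 20, "reads": 30},
    {"size": 40, "reads": 5},
    {"size": 4, "reads": 80},
    {"size": 10, "reads": 10},
]


def cost(assignment):
    """Assumed cost model: total read time plus a penalty per GB of overflow."""
    used = {name: 0 for name in TIER_NAMES}
    read_time = 0.0
    for ds, tier_idx in zip(DATASETS, assignment):
        name = TIER_NAMES[tier_idx]
        used[name] += ds["size"]
        read_time += ds["reads"] * ds["size"] / TIERS[name]["speed"]
    overflow = sum(max(0, used[n] - TIERS[n]["capacity"]) for n in TIER_NAMES)
    return read_time + 1000.0 * overflow


def evolve(generations=200, pop_size=30, mutation_rate=0.1, seed=0):
    """Evolve a list of tier indices, one gene per dataset."""
    rng = random.Random(seed)
    n = len(DATASETS)
    pop = [[rng.randrange(len(TIER_NAMES)) for _ in range(n)]
           for _ in range(pop_size)]
    # Seed one individual with the all-HDD baseline placement.
    pop[0] = [len(TIER_NAMES) - 1] * n
    for _ in range(generations):
        pop.sort(key=cost)
        elite = pop[: pop_size // 2]  # elitism: keep the better half
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(n):
                if rng.random() < mutation_rate:  # random-reset mutation
                    child[i] = rng.randrange(len(TIER_NAMES))
            children.append(child)
        pop = elite + children
    best = min(pop, key=cost)
    return [TIER_NAMES[i] for i in best], cost(best)


placement, total = evolve()
print(placement, total)
```

Because the better half of each generation is carried over unchanged and the population starts with the all-HDD placement, the returned cost can never be worse than that baseline. A real deployment would replace the assumed cost model with measured transfer rates and would apply the resulting mapping through HDFS storage policies.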


Hadoop, HDFS, Integer linear programming, Genetic algorithm, Data placement



Results presented in this paper were obtained using the Chameleon testbed supported by the U.S. National Science Foundation.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Universidad Autonoma de Occidente, Cali, Valle del Cauca, Colombia
