Grouping-Aware Data Placement in HDFS for Data-Intensive Applications Based on Graph Clustering

Vengadeswaran, S.; Balasundaram, S. R.

doi:10.1007/978-981-10-3773-3_3

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 554)

Abstract

The time taken to execute a query and return the results increases exponentially as the data size increases, leading to longer waiting times for the user. Hadoop, with its distributed processing capability, can be considered an efficient solution for processing such large data. Hadoop’s default data placement strategy (HDDPS) places data blocks randomly across the cluster nodes without considering any execution parameters. However, most data-intensive applications show grouping semantics: during any query execution, only a part of the big data set is utilized. Because the default placement ignores this grouping behavior, it performs poorly, leading to increased execution time and query latency. Hence, an optimal data placement strategy based on grouping semantics is proposed. First, the user history log is analyzed to identify the access pattern, which is depicted as an execution graph. The Markov clustering algorithm is then applied to identify the grouping pattern of the data. Finally, an optimal data placement algorithm based on statistical measures is proposed, which reorganizes the default data layouts in HDFS. This in turn increases parallel execution, resulting in improved data locality and reduced query execution time compared to HDDPS. The experimental results support the proposed algorithm, which proved more efficient for big data sets processed in a heterogeneous distributed environment.
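The pipeline outlined in the abstract (access log, execution graph, Markov clustering, regrouped placement) can be illustrated with a small sketch. This is not the chapter's implementation, and every name in it is hypothetical: the query log is assumed to be a list of sets of block IDs, edge weights are assumed to be co-access counts, and the final round-robin spreading step is a simple stand-in for the chapter's placement algorithm based on statistical measures, which the abstract does not specify.

```python
from itertools import combinations


def build_access_graph(query_log):
    """Co-access graph: weights[i][j] counts the queries that read both blocks."""
    blocks = sorted({b for query in query_log for b in query})
    index = {b: i for i, b in enumerate(blocks)}
    n = len(blocks)
    weights = [[0.0] * n for _ in range(n)]
    for query in query_log:
        for a, b in combinations(sorted(set(query)), 2):
            i, j = index[a], index[b]
            weights[i][j] += 1.0
            weights[j][i] += 1.0
    return blocks, weights


def normalize_columns(m):
    """Scale every column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(weights, inflation=2.0, max_iter=100, tol=1e-9):
    """Basic Markov clustering: alternate expansion and inflation."""
    n = len(weights)
    # Self-loops keep every column non-empty before normalization.
    m = normalize_columns(
        [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)])
    for _ in range(max_iter):
        expanded = matmul(m, m)  # expansion: two-step random walk
        inflated = normalize_columns(  # inflation: boost strong flows
            [[x ** inflation for x in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Simplified read-out (an assumption of this sketch): each block joins
    # the row that holds most of its column's flow, i.e. its attractor.
    groups = {}
    for j in range(n):
        attractor = max(range(n), key=lambda i: m[i][j])
        groups.setdefault(attractor, []).append(j)
    return sorted(groups.values())


def spread_groups(groups, blocks, nodes):
    """Hypothetical placement: put each group's blocks on distinct nodes
    round-robin, so one query over a group can run in parallel."""
    placement = {}
    for group in groups:
        for k, j in enumerate(group):
            placement[blocks[j]] = nodes[k % len(nodes)]
    return placement


if __name__ == "__main__":
    log = [{"b1", "b2"}, {"b2", "b3"}, {"b1", "b2", "b3"},
           {"b4", "b5"}, {"b4", "b6"}, {"b4", "b5"}, {"b4", "b6"}]
    blocks, weights = build_access_graph(log)
    groups = markov_cluster(weights)
    print(spread_groups(groups, blocks, ["node1", "node2", "node3"]))
```

The inflation parameter controls cluster granularity: larger values tend to produce more, smaller groups. A real deployment would also have to respect HDFS replication and node capacity, which this sketch ignores.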

Acknowledgements

The research work reported in this paper is supported by the Department of Electronics and Information Technology (DeitY), a division of the Ministry of Communications and IT, Government of India, under the Visvesvaraya PhD Scheme for Electronics and IT.

Author information

Authors and Affiliations

National Institute of Technology, Tiruchirappalli, 620015, Tamil Nadu, India
S. Vengadeswaran & S. R. Balasundaram

Corresponding author

Correspondence to S. Vengadeswaran.

Editor information

Editors and Affiliations

Department of Mathematics and Computer Science, University of Missouri, St. Louis, Missouri, USA
Sanjiv K. Bhatia
Department of Computer Science and Engineering, Motilal Nehru National Institute of Technology, Allahabad, Uttar Pradesh, India
Krishn K. Mishra
CSED, ABES Engineering College, Ghaziabad, Uttar Pradesh, India
Shailesh Tiwari
Department of Computer Science, Banaras Hindu University, Varanasi, Uttar Pradesh, India
Vivek Kumar Singh

About this paper

Cite this paper

Vengadeswaran, S., Balasundaram, S.R. (2018). Grouping-Aware Data Placement in HDFS for Data-Intensive Applications Based on Graph Clustering. In: Bhatia, S., Mishra, K., Tiwari, S., Singh, V. (eds) Advances in Computer and Computational Sciences. Advances in Intelligent Systems and Computing, vol 554. Springer, Singapore. https://doi.org/10.1007/978-981-10-3773-3_3

DOI: https://doi.org/10.1007/978-981-10-3773-3_3
Published: 29 September 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-3772-6
Online ISBN: 978-981-10-3773-3
eBook Packages: Engineering, Engineering (R0)
