Computing the Split Points for Learning Decision Tree in MapReduce

Zhu, Mingdong; Shen, Derong; Yu, Ge; Kou, Yue; Nie, Tiezheng

doi:10.1007/978-3-642-37450-0_26

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7826)

Abstract

The explosive growth of data brings ever more challenges and opportunities to data mining. Decision tree learning is a common data mining method, and determining split points is its key problem. Existing methods for calculating split points over large data in a distributed setting either (1) incur high communication overhead or (2) do not generalize across different levels of skew in the data distribution. In this paper, we study the properties of Gini impurity, a measure used to determine split points, and design new algorithms for calculating split points in MapReduce. Empirical evaluation demonstrates that our method outperforms existing state-of-the-art techniques in communication cost and universality.
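To make the notion of a Gini-based split point concrete, the following is a minimal Python sketch of the textbook approach: each data partition emits class counts per candidate threshold (a "map" step), and the counts are summed to pick the threshold with the lowest weighted Gini impurity (a "reduce" step). It is not the algorithm proposed in the paper, whose details are not given in the abstract. The function names, the fixed candidate list, and the in-memory partitions are all assumptions made for illustration.

```python
from collections import Counter


def gini(counts):
    """Gini impurity 1 - sum(p_k^2) for a mapping of class label -> count."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def map_partition(rows, candidates):
    """Hypothetical 'map' step over one partition of (value, label) rows.

    Returns, per candidate threshold, the class counts of rows with
    value <= threshold, plus the class counts of the whole partition.
    """
    left = {t: Counter() for t in candidates}
    total = Counter()
    for value, label in rows:
        total[label] += 1
        for t in candidates:
            if value <= t:
                left[t][label] += 1
    return left, total


def reduce_partitions(partials, candidates):
    """Hypothetical 'reduce' step: sum the partial counts and return
    (threshold, weighted Gini) for the best candidate.

    Assumes at least one row was seen across all partitions.
    """
    left = {t: Counter() for t in candidates}
    total = Counter()
    for part_left, part_total in partials:
        total.update(part_total)
        for t in candidates:
            left[t].update(part_left[t])

    n = sum(total.values())
    best = None
    for t in candidates:
        right = total - left[t]  # class counts of rows with value > t
        n_left = sum(left[t].values())
        n_right = n - n_left
        score = (n_left * gini(left[t]) + n_right * gini(right)) / n
        if best is None or score < best[1]:
            best = (t, score)
    return best


# Two illustrative partitions of (attribute value, class label) rows.
partition_1 = [(1.0, "a"), (2.0, "a"), (3.0, "b")]
partition_2 = [(4.0, "b"), (5.0, "b"), (2.5, "a")]
candidates = [1.5, 2.5, 3.5]

partials = [map_partition(p, candidates) for p in (partition_1, partition_2)]
best_threshold, best_score = reduce_partitions(partials, candidates)
```

In this sketch every partition sends one count vector per candidate threshold, so the amount of data exchanged grows with the number of candidates. That growth is the kind of communication overhead the abstract refers to.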

Author information

Authors and Affiliations

College of Information Science & Engineering, Northeastern University, China
Mingdong Zhu, Derong Shen, Ge Yu, Yue Kou & Tiezheng Nie

Editor information

Editors and Affiliations

Department of Computer Science, Binghamton University, 13902, Binghamton, NY, USA
Weiyi Meng
Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, China
Ling Feng
Department of Computer Science, National University of Singapore, 117417, Singapore
Stéphane Bressan
Research Group Data Analytics and Computing, University of Vienna, 1090, Vienna, Austria
Werner Winiwarter
School of Computer, Wuhan University, 430072, Wuhan, China
Wei Song

Cite this paper

Zhu, M., Shen, D., Yu, G., Kou, Y., Nie, T. (2013). Computing the Split Points for Learning Decision Tree in MapReduce. In: Meng, W., Feng, L., Bressan, S., Winiwarter, W., Song, W. (eds) Database Systems for Advanced Applications. DASFAA 2013. Lecture Notes in Computer Science, vol 7826. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37450-0_26

DOI: https://doi.org/10.1007/978-3-642-37450-0_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37449-4
Online ISBN: 978-3-642-37450-0
eBook Packages: Computer Science, Computer Science (R0)
