Abstract
The explosive growth of data brings increasing challenges and opportunities to data mining. Learning decision trees is a common data mining method, and determining split points is its key problem. Existing methods for calculating split points over large data in a distributed setting either (1) incur high communication overhead or (2) do not generalize across different levels of skewness in the data distribution. In this paper, we study the properties of Gini impurity, a measure used to determine split points, and design new algorithms for calculating split points in MapReduce. Empirical evaluation demonstrates that our method outperforms existing state-of-the-art techniques in communication cost and universality.
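For readers unfamiliar with the measure, the following is a minimal single-machine sketch of how Gini impurity can score candidate split points on one numeric attribute. It uses the textbook definition G = 1 - sum_k p_k^2 and is not the MapReduce algorithm proposed in the paper; the function names `gini` and `best_split` and the midpoint-threshold convention are assumptions made for illustration only.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def _gini_from_counts(counts, n):
    """Gini impurity computed from per-class counts and their total n."""
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best binary split.

    Candidate thresholds are midpoints between consecutive distinct
    attribute values (an illustrative convention, not taken from the paper).
    Returns (None, inf) when no split is possible.
    """
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    left = Counter()                        # class counts left of the split
    right = Counter(y for _, y in pairs)    # class counts right of the split
    best = (None, float("inf"))
    for i in range(n - 1):
        v, y = pairs[i]
        left[y] += 1
        right[y] -= 1
        if pairs[i + 1][0] == v:
            continue                        # no threshold between equal values
        n_left, n_right = i + 1, n - (i + 1)
        weighted = (n_left * _gini_from_counts(left, n_left)
                    + n_right * _gini_from_counts(right, n_right)) / n
        if weighted < best[1]:
            best = ((v + pairs[i + 1][0]) / 2, weighted)
    return best


if __name__ == "__main__":
    print(best_split([1, 2, 3, 4], ["a", "a", "b", "b"]))
```

The sketch sorts the whole attribute column in memory, which is exactly the step that becomes expensive when the data is partitioned across many machines; the abstract's concern with communication cost arises from that setting.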
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhu, M., Shen, D., Yu, G., Kou, Y., Nie, T. (2013). Computing the Split Points for Learning Decision Tree in MapReduce. In: Meng, W., Feng, L., Bressan, S., Winiwarter, W., Song, W. (eds) Database Systems for Advanced Applications. DASFAA 2013. Lecture Notes in Computer Science, vol 7826. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37450-0_26
DOI: https://doi.org/10.1007/978-3-642-37450-0_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37449-4
Online ISBN: 978-3-642-37450-0
eBook Packages: Computer Science (R0)