Incremental Optimization Mechanism for Constructing a Balanced Very Fast Decision Tree for Big Data

Innovative Research Methodologies in Management

Abstract

Big data is a popular topic that attracts considerable attention from researchers all over the world. How to mine valuable information from such huge volumes of data remains an open problem. Decision trees are among the most widely used techniques for this task, but imperfect data streams lead to tree size explosion and degraded accuracy. Over-fitting and imbalanced class distributions further reduce the performance of the original decision tree algorithm for stream mining. In this chapter, we propose an Optimized Very Fast Decision Tree (OVFDT) that possesses an optimized node-splitting control mechanism using the Hoeffding bound. Accuracy, tree size, and learning time are the significant factors influencing the algorithm’s performance; naturally, a bigger tree takes longer to compute. OVFDT is a pioneering model equipped with an incremental optimization mechanism that seeks a balance between accuracy and tree size for data stream mining. OVFDT operates incrementally using a test-then-train approach. Two new methods of functional tree leaves are proposed to improve the accuracy with which the tree model makes predictions for new data in the testing phase, while the optimized node-splitting mechanism controls tree growth in the training phase. The experiments show that OVFDT obtains an optimal tree structure on both numeric and nominal datasets.

Acknowledgment

The authors are thankful for the financial support from the research grants “Temporal Data Stream Mining by Using Incrementally Optimized Very Fast Decision Forest (iOVFDF)”, Grant no. MYRG2015-00128-FST offered by the University of Macau, FST, and RDAO, and “A scalable data stream mining methodology: stream-based holistic analytics and reasoning in parallel”, Grant no. FDCT-126/2014/A3, offered by FDCT Macau.

© 2018 The Author(s)

About this chapter

Cite this chapter

Yang, H., Fong, S. (2018). Incremental Optimization Mechanism for Constructing a Balanced Very Fast Decision Tree for Big Data. In: Moutinho, L., Sokele, M. (eds) Innovative Research Methodologies in Management. Palgrave Macmillan, Cham. https://doi.org/10.1007/978-3-319-64394-6_6
