Abstract
Big data attracts the attention of researchers worldwide, yet how to mine valuable information from such huge volumes of data remains an open problem. Decision trees are one of the most widely used techniques for this task, but imperfect data streams cause tree-size explosion and degrade accuracy. Over-fitting and imbalanced class distributions further reduce the performance of the original decision tree algorithm for stream mining. In this chapter, we propose an Optimized Very Fast Decision Tree (OVFDT) with an optimized node-splitting control mechanism based on the Hoeffding bound. Accuracy, tree size, and learning time are the main factors that determine the algorithm’s performance; naturally, a bigger tree takes longer to compute. OVFDT is a pioneering model equipped with an incremental optimization mechanism that seeks a balance between accuracy and tree size in data stream mining. It operates incrementally using a test-then-train approach. Two new types of functional tree leaves are proposed to improve the accuracy with which the tree model predicts new data stream instances in the testing phase, while the optimized node-splitting mechanism controls tree growth in the training phase. The experiments show that OVFDT obtains an optimal tree structure on both numeric and nominal datasets.
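For readers unfamiliar with the Hoeffding bound mentioned above, the following is a minimal sketch of the standard split test used by Hoeffding-tree learners such as VFDT. It is not the optimized OVFDT mechanism proposed in this chapter. The function names, the parameter names, and the default tie threshold of 0.05 are assumptions chosen for illustration only.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Standard Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n)).

    value_range (R): range of the split heuristic (e.g. 1.0 for information
                     gain on a two-class problem).
    delta:           allowed probability that the bound does not hold.
    n:               number of samples observed at the leaf.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float, value_range: float,
                 delta: float, n: int, tie_threshold: float = 0.05) -> bool:
    """Illustrative split test (names and default threshold are hypothetical).

    Split when the best attribute beats the runner-up by more than epsilon,
    or when epsilon has shrunk below the tie threshold, meaning the two
    candidates are too close to separate and waiting longer gains little.
    """
    eps = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain) > eps or eps < tie_threshold


if __name__ == "__main__":
    # With 200 samples the bound is still wide, so only a clear winner splits.
    print(should_split(0.30, 0.05, value_range=1.0, delta=1e-7, n=200))
    print(should_split(0.30, 0.25, value_range=1.0, delta=1e-7, n=200))
```

The bound shrinks as more samples arrive at a leaf, so a fixed tie threshold eventually forces a split even between near-equal attributes. This is one way in which a tree can keep growing on noisy streams.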
Acknowledgment
The authors are thankful for the financial support from the research grants “Temporal Data Stream Mining by Using Incrementally Optimized Very Fast Decision Forest (iOVFDF)”, Grant no. MYRG2015-00128-FST offered by the University of Macau, FST, and RDAO, and “A scalable data stream mining methodology: stream-based holistic analytics and reasoning in parallel”, Grant no. FDCT-126/2014/A3, offered by FDCT Macau.
Copyright information
© 2018 The Author(s)
About this chapter
Cite this chapter
Yang, H., Fong, S. (2018). Incremental Optimization Mechanism for Constructing a Balanced Very Fast Decision Tree for Big Data. In: Moutinho, L., Sokele, M. (eds) Innovative Research Methodologies in Management. Palgrave Macmillan, Cham. https://doi.org/10.1007/978-3-319-64394-6_6
DOI: https://doi.org/10.1007/978-3-319-64394-6_6
Publisher Name: Palgrave Macmillan, Cham
Print ISBN: 978-3-319-64393-9
Online ISBN: 978-3-319-64394-6
eBook Packages: Business and Management (R0)