A Semi-Random Multiple Decision-Tree Algorithm for Mining Data Streams

Hu, Xue-Gang; Li, Pei-Pei; Wu, Xin-Dong; Wu, Gong-Qing

doi:10.1007/s11390-007-9084-9

Regular Paper
Published: 25 September 2007

Journal of Computer Science and Technology, Volume 22, pages 711–724 (2007)

Xue-Gang Hu¹, Pei-Pei Li¹, Xin-Dong Wu¹,² & Gong-Qing Wu¹

Abstract

Mining streaming data is an active topic in data mining. When classifying data streams, traditional decision-tree algorithms such as ID3 and C4.5 are relatively inefficient in both time and space because of the characteristics of streaming data. Random decision trees offer advantages in both respects. This paper proposes SRMTDS (Semi-Random Multiple decision Trees for Data Streams), an incremental algorithm for mining data streams that is based on random decision trees. SRMTDS uses the Hoeffding bound inequality to choose the minimum number of split examples, a heuristic method that computes information gain to obtain split thresholds for numerical attributes, and a Naïve Bayes classifier to estimate the class labels at tree leaves. Our extensive experimental study shows that SRMTDS improves on VFDTc, a state-of-the-art decision-tree algorithm for classifying data streams, in time, space, accuracy, and noise tolerance.
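
As a rough illustration of how a Hoeffding bound can fix a minimum number of examples before a node is split, the sketch below uses the standard form of the bound: after n observations of a variable with range R, the sample mean lies within ε = sqrt(R² ln(1/δ) / (2n)) of the true mean with probability at least 1 − δ. The function names, the tolerance ε = 0.05, and the confidence parameter δ = 10⁻⁷ are assumptions chosen for illustration. The abstract does not give the exact rule or parameter values used by SRMTDS, so this should not be read as the authors' implementation.

```python
import math


def hoeffding_epsilon(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound: deviation epsilon that holds with probability 1 - delta
    after n independent observations of a variable with the given range."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def min_examples(value_range: float, delta: float, epsilon: float) -> int:
    """Smallest n for which the Hoeffding deviation is at most epsilon
    (the bound above solved for n and rounded up)."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    # Assumed setting: information gain on a two-class problem has range log2(2) = 1.
    gain_range = math.log2(2)
    n = min_examples(gain_range, delta=1e-7, epsilon=0.05)
    print("examples needed before splitting:", n)
    print("epsilon at that n:", hoeffding_epsilon(gain_range, 1e-7, n))
```

A tighter tolerance or a smaller δ raises the required count. Since n grows with 1/ε², halving ε roughly quadruples the number of examples a leaf must see before it is split.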

Author information

Authors and Affiliations

School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, 230009, China
Xue-Gang Hu, Pei-Pei Li, Xin-Dong Wu & Gong-Qing Wu
Department of Computer Science, University of Vermont, Burlington, VT, 05405, USA
Xin-Dong Wu

Corresponding author

Correspondence to Xue-Gang Hu.

Additional information

This research is supported by the National Natural Science Foundation of China (Grant No. 60573174) and the Natural Science Foundation of Anhui Province of China (Grant No. 050420207).

Electronic supplementary material

Supplementary material - Chinese Abstract (PDF 90 kb)

About this article

Cite this article

Hu, XG., Li, PP., Wu, XD. et al. A Semi-Random Multiple Decision-Tree Algorithm for Mining Data Streams. J Comput Sci Technol 22, 711–724 (2007). https://doi.org/10.1007/s11390-007-9084-9

Received: 15 January 2007
Revised: 16 July 2007
Published: 25 September 2007
Issue Date: September 2007
DOI: https://doi.org/10.1007/s11390-007-9084-9
