A Semi-Random Multiple Decision-Tree Algorithm for Mining Data Streams

  • Regular Paper
Journal of Computer Science and Technology

Abstract

Mining streaming data is an active topic in data mining. When classifying data streams, traditional decision-tree algorithms such as ID3 and C4.5 are relatively inefficient in both time and space because of the characteristics of streaming data, whereas random decision trees offer some advantages on both counts. This paper proposes SRMTDS (Semi-Random Multiple decision Trees for Data Streams), an incremental algorithm for mining data streams that is based on random decision trees. SRMTDS uses the Hoeffding bound inequality to choose the minimum number of split examples, a heuristic method to compute the information gain for obtaining the split thresholds of numerical attributes, and a Naïve Bayes classifier to estimate the class labels of tree leaves. Our extensive experimental study shows that SRMTDS improves on VFDTc, a state-of-the-art decision-tree algorithm for classifying data streams, in time, space, accuracy, and noise tolerance.
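
As a rough illustration of two ingredients named in the abstract, the Python sketch below computes the standard Hoeffding bound, the smallest sample count that meets a target bound, and the information gain of one candidate threshold on a numerical attribute. It is not the SRMTDS implementation: the function names, the parameters value_range, delta, and epsilon, and the toy data are assumptions made for illustration, and the paper's own heuristic for selecting thresholds is not reproduced here.

```python
import math
from collections import Counter


def hoeffding_bound(value_range, delta, n):
    """Return epsilon such that, with probability at least 1 - delta, the true
    mean of a variable with the given range lies within epsilon of the mean
    observed over n independent samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def min_examples(value_range, delta, epsilon):
    """Return the smallest n for which the Hoeffding bound is at most epsilon
    (obtained by solving the bound for n and rounding up)."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def info_gain_at_threshold(values, labels, threshold):
    """Information gain of the binary split value <= threshold versus
    value > threshold on one numerical attribute."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty separates nothing
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder


if __name__ == "__main__":
    # Hypothetical settings: a quantity with range 1, confidence 1 - 1e-7.
    n = min_examples(value_range=1.0, delta=1e-7, epsilon=0.1)
    print("examples needed:", n)
    print("bound at that n:", hoeffding_bound(1.0, 1e-7, n))

    # Toy numerical attribute with two classes.
    values = [1.0, 2.0, 3.0, 4.0]
    labels = ["a", "a", "b", "b"]
    print("gain at threshold 2.0:", info_gain_at_threshold(values, labels, 2.0))
```

In a stream setting, a count like the one returned by min_examples could serve as the number of examples a leaf accumulates before a split is attempted; how SRMTDS sets the range, delta, and epsilon is described in the full paper, not here.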

Author information

Corresponding author

Correspondence to Xue-Gang Hu.

Additional information

This research is supported by the National Natural Science Foundation of China (Grant No. 60573174) and the Natural Science Foundation of Anhui Province of China (Grant No. 050420207).

About this article

Cite this article

Hu, XG., Li, PP., Wu, XD. et al. A Semi-Random Multiple Decision-Tree Algorithm for Mining Data Streams. J Comput Sci Technol 22, 711–724 (2007). https://doi.org/10.1007/s11390-007-9084-9

  • DOI: https://doi.org/10.1007/s11390-007-9084-9
