Part of the book series: Intelligent Systems Reference Library (ISRL, volume 24)

Abstract

We show that ChiMerge, a commonly used sampling-theoretical attribute discretization algorithm, can be implemented efficiently in the online setting. ChiMerge is efficient, statistically justified, and robust to noise; it can be made to produce low-arity partitions, and it has been observed empirically to work well in practice.
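
For orientation, the merging criterion behind ChiMerge can be written out as follows. This is the standard \(\chi^2\) statistic for a pair of adjacent intervals as it is usually stated for ChiMerge; the symbols \(A_{ij}\), \(R_i\), \(C_j\), \(N\), and \(k\) are notation introduced here for illustration and are not taken from the chapter.

\[
\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{k} \frac{(A_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{N},
\]

where \(A_{ij}\) is the number of examples of class \(j\) in interval \(i\), \(R_i = \sum_j A_{ij}\), \(C_j = \sum_i A_{ij}\), \(N = R_1 + R_2\), and \(k\) is the number of classes. The adjacent pair with the lowest \(\chi^2\) value is merged as long as that value stays below a chosen significance threshold.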

The worst-case time requirement of the batch version of ChiMerge's bottom-up interval merging is \(O(n\lg n)\) per attribute. We show that ChiMerge can be implemented in the online setting so that only logarithmic time is required to update the relevant data structures upon an insertion. Hence, discretizing a data stream in which the examples fall into \(n\) bins takes the same \(O(n\lg n)\) total time as in the batch setting. However, maintaining just one binary search tree is not enough; other data structures are also needed. Moreover, to guarantee equal discretization results, an up-to-date discretization cannot always be kept available; instead, the updates must be delayed to happen at periodic intervals. We also provide a comparative evaluation of the proposed algorithm.




Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Lehtinen, P., Saarela, M., Elomaa, T. (2012). Online ChiMerge Algorithm. In: Holmes, D.E., Jain, L.C. (eds) Data Mining: Foundations and Intelligent Paradigms. Intelligent Systems Reference Library, vol 24. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23241-1_10

  • DOI: https://doi.org/10.1007/978-3-642-23241-1_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23240-4

  • Online ISBN: 978-3-642-23241-1

  • eBook Packages: Engineering, Engineering (R0)
