Online ChiMerge Algorithm

Lehtinen, Petri; Saarela, Matti; Elomaa, Tapio

doi:10.1007/978-3-642-23241-1_10

Part of the book series: Intelligent Systems Reference Library ((ISRL,volume 24))

Abstract

We show that ChiMerge, a commonly used attribute discretization algorithm grounded in sampling theory, can be implemented efficiently in the online setting. ChiMerge is efficient, statistically justified, and robust to noise; it can be made to produce low-arity partitions and has been observed empirically to work well in practice.

The worst-case time requirement of the bottom-up interval merging in the batch version of ChiMerge is \(O(n\lg n)\) per attribute. We show that ChiMerge can be implemented in the online setting so that only logarithmic time is required to update the relevant data structures when an example is inserted. Hence, discretizing a data stream whose examples fall into \(n\) bins takes the same \(O(n\lg n)\) total time as in the batch setting. However, maintaining a single binary search tree is not enough; other data structures are needed as well. Moreover, to guarantee that the discretization results equal those of the batch version, an up-to-date discretization cannot always be kept available; instead, the updates must be delayed to happen at periodic intervals. We also provide a comparative evaluation of the proposed algorithm.
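For orientation, the bottom-up interval merging mentioned above can be sketched roughly as follows. This is a naive batch illustration of the general ChiMerge idea (start with one interval per distinct value, then repeatedly merge the adjacent pair with the lowest chi-square statistic until every pair exceeds a threshold). It is not the online algorithm of this chapter, and it rescans all pairs on every merge, so it does not reach the \(O(n\lg n)\) bound. The function names `chi2` and `chimerge`, the `threshold` parameter, and the choice to skip cells with zero expected count are assumptions made for this sketch.

```python
from collections import Counter


def chi2(a, b, classes):
    """Pearson chi-square statistic for the class counts of two adjacent intervals."""
    total = sum(a.values()) + sum(b.values())
    stat = 0.0
    for row in (a, b):
        row_sum = sum(row.values())
        for c in classes:
            expected = row_sum * (a[c] + b[c]) / total
            # Assumption for this sketch: cells with zero expected count are skipped.
            if expected > 0:
                stat += (row[c] - expected) ** 2 / expected
    return stat


def chimerge(examples, threshold):
    """Naive batch ChiMerge over a list of (value, label) pairs.

    Returns a list of (lower_bound, class_counts) intervals in ascending order.
    """
    classes = sorted({label for _, label in examples})
    bins = {}
    for value, label in examples:
        bins.setdefault(value, Counter())[label] += 1
    # One initial interval per distinct attribute value.
    intervals = [(v, bins[v]) for v in sorted(bins)]
    while len(intervals) > 1:
        stats = [
            chi2(intervals[i][1], intervals[i + 1][1], classes)
            for i in range(len(intervals) - 1)
        ]
        i = min(range(len(stats)), key=stats.__getitem__)
        if stats[i] > threshold:
            break  # every adjacent pair differs significantly; stop merging
        lower, left_counts = intervals[i]
        intervals[i:i + 2] = [(lower, left_counts + intervals[i + 1][1])]
    return intervals


if __name__ == "__main__":
    data = [(1, "a"), (2, "a"), (3, "b"), (4, "b")]
    # 2.706 is the chi-square critical value for 1 degree of freedom at the 0.10 level.
    for lower, counts in chimerge(data, threshold=2.706):
        print(lower, dict(counts))
```

On the toy data in the usage example, the two pure-class pairs are merged first and the remaining pair is kept apart, leaving one interval starting at 1 and one starting at 3. An efficient implementation would presumably keep the pair statistics in a priority structure instead of recomputing them on every pass.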

Author information

Authors and Affiliations

Department of Software Systems, Tampere University of Technology, P.O. Box 553, Korkeakoulunkatu 1, FI-33101, Tampere, Finland
Petri Lehtinen, Matti Saarela & Tapio Elomaa

Editor information

Editors and Affiliations

Department of Statistics and Applied Probability, University of California, Santa Barbara, CA 93106, USA
Dawn E. Holmes
Knowledge-Based Engineering, University of South Australia, 5095, Adelaide Mawson Lakes, SA, Australia
Lakhmi C. Jain

About this chapter

Cite this chapter

Lehtinen, P., Saarela, M., Elomaa, T. (2012). Online ChiMerge Algorithm. In: Holmes, D.E., Jain, L.C. (eds) Data Mining: Foundations and Intelligent Paradigms. Intelligent Systems Reference Library, vol 24. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23241-1_10

DOI: https://doi.org/10.1007/978-3-642-23241-1_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23240-4
Online ISBN: 978-3-642-23241-1
eBook Packages: Engineering, Engineering (R0)
