Handling Numeric Attributes in Hoeffding Trees

Pfahringer, Bernhard; Holmes, Geoffrey; Kirkby, Richard

doi:10.1007/978-3-540-68125-0_27

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5012)

Abstract

For conventional machine learning classification algorithms, handling numeric attributes is relatively straightforward. Unsupervised and supervised solutions exist that either segment the data into pre-defined bins or sort the data and search for the best split points. Unfortunately, none of these solutions carries over particularly well to a data stream environment. Solutions for data streams have been proposed by several authors, but as yet none have been compared empirically. In this paper we investigate a range of methods for multi-class tree-based classification where the handling of numeric attributes takes place as the tree is constructed. To this end, we extend an existing approach based on simple Gaussian approximation. We then compare this method with four approaches from the literature, arriving at eight final algorithm configurations for testing. The solutions cover a range of options, from perfectly accurate and memory intensive to highly approximate. All methods are tested using the Hoeffding tree classification algorithm. Surprisingly, the experimental comparison shows that the most approximate methods produce the most accurate trees by allowing for faster tree growth.
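To make the Gaussian approximation idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: it assumes one normal distribution is fitted per class for a numeric attribute, updated incrementally with a Welford-style running mean and variance, and that a fixed number of evenly spaced candidate thresholds between the observed minimum and maximum are scored by information gain. The names `GaussianEstimator`, `NumericAttributeObserver`, `best_split`, and `num_candidates` are hypothetical and chosen only for this example.

```python
import math
from collections import defaultdict


class GaussianEstimator:
    """Running count, mean and variance (Welford-style incremental update)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def count_below(self, split):
        """Estimated number of seen values <= split under the normal fit."""
        if self.n == 0:
            return 0.0
        sd = self.std()
        if sd == 0.0:
            return float(self.n) if self.mean <= split else 0.0
        z = (split - self.mean) / (sd * math.sqrt(2.0))
        return self.n * 0.5 * (1.0 + math.erf(z))


def entropy(counts):
    total = sum(counts)
    if total <= 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


class NumericAttributeObserver:
    """Hypothetical per-attribute summary: one Gaussian per class plus range."""

    def __init__(self):
        self.per_class = defaultdict(GaussianEstimator)
        self.lo = math.inf
        self.hi = -math.inf

    def observe(self, value, label):
        self.per_class[label].add(value)
        self.lo = min(self.lo, value)
        self.hi = max(self.hi, value)

    def best_split(self, num_candidates=10):
        """Return (split_value, info_gain) over evenly spaced candidates."""
        estimators = list(self.per_class.values())
        totals = [est.n for est in estimators]
        n = sum(totals)
        best = (None, 0.0)
        if n == 0 or self.hi <= self.lo:
            return best
        base = entropy(totals)
        step = (self.hi - self.lo) / (num_candidates + 1)
        for i in range(1, num_candidates + 1):
            split = self.lo + i * step
            left = [est.count_below(split) for est in estimators]
            right = [t - l for t, l in zip(totals, left)]
            gain = (base
                    - sum(left) / n * entropy(left)
                    - sum(right) / n * entropy(right))
            if gain > best[1]:
                best = (split, gain)
        return best


# Usage: two well-separated classes on a single numeric attribute.
obs = NumericAttributeObserver()
for v in [1.0, 1.2, 0.8, 1.1]:
    obs.observe(v, "a")
for v in [5.0, 5.3, 4.9, 5.1]:
    obs.observe(v, "b")
split, gain = obs.best_split()
```

The memory cost of this summary is constant per class and attribute (three numbers per Gaussian plus the observed range), which is what makes an approximation of this kind attractive in a stream setting compared with storing or sorting the raw values.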

References

Agrawal, R., Imielinski, T., Swami, A.: Database mining: A performance perspective. IEEE Transactions on Knowledge and Data Engineering 5(6), 914–925 (1993)
Agrawal, R., Swami, A.: A one-pass space-efficient algorithm for finding quantiles. In: International Conference on Management of Data (1995)
Alsabti, K., Ranka, S., Singh, V.: A one-pass algorithm for accurately estimating quantiles for disk-resident data. In: International Conference on Very Large Databases, pp. 346–355 (1997)
Chan, T.F., Lewis, J.G.: Computing standard deviations: Accuracy. Communications of the ACM 22(9), 526–531 (1979)
Domingos, P., Hulten, G.: Mining high-speed data streams. In: International Conference on Knowledge Discovery and Data Mining, pp. 71–80 (2000)
Gama, J., Medas, P., Rocha, R.: Forest trees for on-line data. In: ACM Symposium on Applied Computing, pp. 632–636 (2004)
Gama, J., Rocha, R., Medas, P.: Accurate decision trees for mining high-speed data streams. In: International Conference on Knowledge Discovery and Data Mining, pp. 523–528 (2003)
Greenwald, M., Khanna, S.: Space-efficient online computation of quantile summaries. In: ACM Special Interest Group on Management Of Data Conference, pp. 58–66 (2001)
Holmes, G., Kirkby, R., Pfahringer, B.: Stress-testing Hoeffding trees. In: European Conference on Principles and Practice of Knowledge Discovery in Databases, pp. 495–502 (2005)
Hulten, G., Domingos, P.: VFML – a toolkit for mining high-speed time-changing data streams (2003), http://www.cs.washington.edu/dm/vfml/
Jain, R., Chlamtac, I.: The P² algorithm for dynamic calculation of quantiles and histograms without storing observations. Communications of the ACM 28(10), 1076–1085 (1985)
Manku, G.S., Rajagopalan, S., Lindsay, B.G.: Approximate medians and other quantiles in one pass and with limited memory. In: ACM Special Interest Group on Management Of Data Conference, pp. 426–435 (1998)
Munro, J.I., Paterson, M.: Selection and sorting with limited storage. Theoretical Computer Science 12, 315–323 (1980)
Quinlan, J.R.: Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research 4, 77–90 (1996)
Vitter, J.S.: Random sampling with a reservoir. ACM Transactions on Mathematical Software 11(1), 37–57 (1985)
Welford, B.P.: Note on a method for calculating corrected sums of squares and products. Technometrics 4(3), 419–420 (1962)

Author information

Authors and Affiliations

University of Waikato, Hamilton, New Zealand
Bernhard Pfahringer, Geoffrey Holmes & Richard Kirkby

Editor information

Takashi Washio, Einoshin Suzuki, Kai Ming Ting, Akihiro Inokuchi

About this paper

Cite this paper

Pfahringer, B., Holmes, G., Kirkby, R. (2008). Handling Numeric Attributes in Hoeffding Trees. In: Washio, T., Suzuki, E., Ting, K.M., Inokuchi, A. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2008. Lecture Notes in Computer Science, vol 5012. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68125-0_27

DOI: https://doi.org/10.1007/978-3-540-68125-0_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68124-3
Online ISBN: 978-3-540-68125-0
eBook Packages: Computer Science, Computer Science (R0)
