A Two Layers Incremental Discretization Based on Order Statistics

  • Conference paper

Abstract

Large amounts of data are produced today: network logs, web data, social network data… The volume of these data and the speed at which they arrive make it impossible to store them. Such data are called streaming data. A stream has two specific properties: (i) each data item is visible only once and (ii) items are ordered by arrival time. Because these data cannot be kept in memory and read again later, usual data mining techniques do not apply. Building a classifier in this context therefore requires doing it incrementally and/or keeping a subset of the information seen and then building the classifier from it. This paper focuses on the second option and proposes a two-layer approach based on order statistics. The first layer uses the Greenwald and Khanna quantile summary, and the second layer uses a supervised method such as MODL.
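As a rough illustration of the two-layer idea, the sketch below keeps a bounded summary of a numeric stream in a first layer and runs a supervised step on that summary in a second layer. It is not the authors' implementation. The names `GKSummary` and `best_entropy_cut` are hypothetical, storing class counts inside each summary tuple is an assumption, the compression step omits the band structure of the original Greenwald and Khanna algorithm, and the second layer uses a single entropy-based cut as a simple stand-in for MODL.

```python
import bisect
import math
import random
from collections import Counter


class GKSummary:
    """Layer 1 (sketch): simplified Greenwald-Khanna style quantile summary.

    Each tuple keeps a value, g (gap in minimum rank to the previous tuple),
    delta (rank uncertainty), and class counts of the items it absorbed
    (the class counts are an assumption made for this illustration).
    """

    def __init__(self, eps):
        self.eps = eps
        self.n = 0
        self.period = max(1, int(1 / (2 * eps)))  # compress every `period` inserts
        self.values, self.g, self.delta, self.labels = [], [], [], []

    def insert(self, v, label):
        i = bisect.bisect_left(self.values, v)
        if i == 0 or i == len(self.values):
            d = 0  # a new minimum or maximum has an exactly known rank
        else:
            d = self.g[i] + self.delta[i] - 1  # bounded by the successor's max rank
        self.values.insert(i, v)
        self.g.insert(i, 1)
        self.delta.insert(i, d)
        self.labels.insert(i, Counter({label: 1}))
        self.n += 1
        if self.n % self.period == 0:
            self.compress()

    def compress(self):
        """Merge a tuple into its successor while the error budget allows it."""
        limit = 2 * self.eps * self.n
        i = len(self.values) - 2
        while i >= 1:  # the first and last tuples are always kept
            if self.g[i] + self.g[i + 1] + self.delta[i + 1] <= limit:
                self.g[i + 1] += self.g[i]
                self.labels[i + 1] += self.labels[i]
                del self.values[i], self.g[i], self.delta[i], self.labels[i]
            i -= 1

    def quantile(self, phi):
        """Return a stored value whose rank is close to phi * n."""
        bound = math.ceil(phi * self.n) + self.eps * self.n
        rmin = 0
        for i in range(len(self.values)):
            rmin += self.g[i]
            if rmin + self.delta[i] > bound:
                return self.values[max(i - 1, 0)]
        return self.values[-1]


def entropy(counts):
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)


def best_entropy_cut(summary):
    """Layer 2 (stand-in for MODL): best single cut by information gain."""
    total = sum(summary.labels, Counter())
    base, n = entropy(total), summary.n
    left, best_gain, best_cut = Counter(), 0.0, None
    for v, counts in zip(summary.values[:-1], summary.labels[:-1]):
        left = left + counts
        right = total - left
        n_left = sum(left.values())
        gain = base - n_left / n * entropy(left) - (n - n_left) / n * entropy(right)
        if gain > best_gain:
            best_gain, best_cut = gain, v
    return best_cut


# Usage on a synthetic stream: the class changes at x = 0.3.
random.seed(0)
summary = GKSummary(eps=0.005)
for _ in range(10_000):
    x = random.random()
    summary.insert(x, "a" if x < 0.3 else "b")
cut = best_entropy_cut(summary)
```

The point of the design is that the second layer never sees the raw stream: it only reads the tuples kept by the first layer, so its cost depends on the summary size rather than on the number of items observed.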

Notes

  1. http://largescale.ml.tu-berlin.de.

  2. http://explo.cs.ucl.ac.uk/.

References

  • Ben-Haim, Y., & Tom-Tov, E. (2010). A streaming parallel decision tree algorithm. Journal of Machine Learning Research, 11, 849–872.

  • Boullé, M. (2005). A Bayes optimal approach for partitioning the values of categorical attributes. Journal of Machine Learning Research, 6(04), 1431–1452.

  • Boullé, M. (2006). MODL: A Bayes optimal discretization method for continuous attributes. Machine Learning, 65(1), 131–165.

  • Cormode, G., & Muthukrishnan, S. (2005). An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1), 58–75.

  • Domingos, P., & Hulten, G. (2000). Mining high-speed data streams. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 71–80). New York, NY: ACM.

  • Dougherty, J., Kohavi, R., & Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. In Proceedings of the twelfth international conference on machine learning (pp. 194–202). San Francisco: Morgan Kaufmann.

  • Fayyad, U., & Irani, K. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the international joint conference on uncertainty in AI (pp. 1022–1027).

  • Gama, J., & Pinto, C. (2006). Discretization from data streams: applications to histograms and data mining. In Proceedings of the 2006 ACM symposium on applied computing (pp. 662–667).

  • Greenwald, M., & Khanna, S. (2001). Approximate medians and other quantiles in one pass and with limited memory. ACM SIGMOD Record, 27(2), 426–435.

  • Manku, G. S., Rajagopalan, S., & Lindsay, B. G. (1998). Lecture Notes in Computer Science Volume 5012, New York.

  • Pfahringer, B., Holmes, G., & Kirkby, R. (2008). Handling numeric attributes in hoeffding trees. Advances in Knowledge Discovery and Data Mining, 296–307.

  • Provost, F., & Domingos, P. (2003). Tree induction for probability-based ranking. Machine Learning, 52(3), 199–215.

  • Vitter, J. S. (1985). Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1), 37–57.

Corresponding author

Correspondence to Christophe Salperwyck.

Copyright information

© 2013 Springer International Publishing Switzerland

About this paper

Cite this paper

Salperwyck, C., Lemaire, V. (2013). A Two Layers Incremental Discretization Based on Order Statistics. In: Giudici, P., Ingrassia, S., Vichi, M. (eds) Statistical Models for Data Analysis. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Heidelberg. https://doi.org/10.1007/978-3-319-00032-9_36
