A Two Layers Incremental Discretization Based on Order Statistics

Salperwyck, Christophe; Lemaire, Vincent

doi:10.1007/978-3-319-00032-9_36

Conference paper
First Online: 01 January 2013

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Abstract

Large amounts of data are produced today: network logs, web data, social network data, and so on. Their volume and arrival speed make it impossible to store them. Such data are called streaming data. Streams have two specific properties: (i) each data item is visible only once and (ii) items are ordered by arrival time. Because these data cannot be kept in memory and read again later, the usual data mining techniques do not apply. Building a classifier in this context therefore requires learning it incrementally and/or keeping a subset of the information seen and then building the classifier from that subset. This paper focuses on the second option and proposes a two-layer approach based on order statistics. The first layer uses the Greenwald and Khanna quantile summary, and the second layer uses a supervised method such as MODL.
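
The two-layer idea described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. The first layer below is a simplified Greenwald-Khanna style summary (plain merge rule, no band-based compression). Storing per-class counts in each tuple is an assumption made here so that a supervised second layer has something to work with. The second layer, `merge_by_majority`, is a deliberately naive stand-in and is NOT MODL: it only fuses adjacent tuples that share the same majority class. All class, function, and parameter names are hypothetical.

```python
import math
import random
from bisect import bisect_right


class GKSummary:
    """Simplified Greenwald-Khanna style summary (layer 1).

    Each tuple is [value, g, delta, class_counts]. The class counts are an
    assumption of this sketch, not part of the original GK algorithm.
    """

    def __init__(self, eps, n_classes):
        self.eps = eps
        self.n_classes = n_classes
        self.n = 0
        self.tuples = []
        self.period = max(1, int(1 / (2 * eps)))  # compress roughly every 1/(2*eps) inserts

    def insert(self, value, label):
        keys = [t[0] for t in self.tuples]
        i = bisect_right(keys, value)
        if i == 0 or i == len(self.tuples):
            delta = 0  # new minimum or maximum: its rank is known exactly
        else:
            succ = self.tuples[i]
            delta = succ[1] + succ[2] - 1  # inherit the successor's rank uncertainty
        counts = [0] * self.n_classes
        counts[label] = 1
        self.tuples.insert(i, [value, 1, delta, counts])
        self.n += 1
        if self.n % self.period == 0:
            self._compress()

    def _compress(self):
        limit = math.floor(2 * self.eps * self.n)
        i = len(self.tuples) - 2
        while i >= 1:  # index 0 (the minimum) is never merged away
            cur, nxt = self.tuples[i], self.tuples[i + 1]
            if cur[1] + nxt[1] + nxt[2] <= limit:
                nxt[1] += cur[1]  # the right neighbour absorbs the tuple
                nxt[3] = [a + b for a, b in zip(cur[3], nxt[3])]
                del self.tuples[i]
            i -= 1

    def quantile(self, phi):
        """Return the stored value whose rank bounds are closest to rank phi*n."""
        rank = max(1, math.ceil(phi * self.n))
        best_value, best_err, rmin = None, None, 0
        for value, g, delta, _ in self.tuples:
            rmin += g
            err = max(rank - rmin, rmin + delta - rank)
            if best_err is None or err < best_err:
                best_value, best_err = value, err
        return best_value


def majority(counts):
    """Index of the most frequent class (ties go to the lowest index)."""
    return counts.index(max(counts))


def merge_by_majority(summary):
    """Layer 2 stand-in (NOT MODL): fuse adjacent tuples with the same majority class."""
    intervals = []  # each item: [upper_bound, class_counts]
    for value, _, _, counts in summary.tuples:
        if intervals and majority(intervals[-1][1]) == majority(counts):
            intervals[-1][0] = value
            intervals[-1][1] = [a + b for a, b in zip(intervals[-1][1], counts)]
        else:
            intervals.append([value, list(counts)])
    return intervals


# Toy stream: 1000 distinct values in random order, class 1 for values >= 500.
random.seed(0)
xs = list(range(1000))
random.shuffle(xs)
summary = GKSummary(eps=0.01, n_classes=2)
for x in xs:
    summary.insert(x, 0 if x < 500 else 1)

median = summary.quantile(0.5)
intervals = merge_by_majority(summary)
print(len(summary.tuples), median, len(intervals))
```

The point of the split is that the first layer bounds memory while keeping rank errors within eps*n, and the second layer only ever sees the compact summary rather than the raw stream. Because tuples absorb their neighbours during compression, the per-class counts attached to a tuple are approximate with respect to value ranges.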

References

Ben-Haim, Y., & Tom-Tov, E. (2010). A streaming parallel decision tree algorithm. Journal of Machine Learning Research, 11, 849–872.
Boullé, M. (2005). A Bayes optimal approach for partitioning the values of categorical attributes. Journal of Machine Learning Research, 6(04), 1431–1452.
Boullé, M. (2006). MODL: A Bayes optimal discretization method for continuous attributes. Machine Learning, 65(1), 131–165.
Cormode, G., & Muthukrishnan, S. (2005). An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1), 58–75.
Domingos, P., & Hulten, G. (2000). Mining high-speed data streams. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 71–80). New York, NY: ACM.
Dougherty, J., Kohavi, R., & Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. In Proceedings of the twelfth international conference on machine learning (pp. 194–202). San Francisco: Morgan Kaufmann.
Fayyad, U., & Irani, K. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the international joint conference on uncertainty in AI (pp. 1022–1027).
Gama, J., & Pinto, C. (2006). Discretization from data streams: applications to histograms and data mining. In Proceedings of the 2006 ACM symposium on applied computing (pp. 662–667).
Greenwald, M., & Khanna, S. (2001). Approximate medians and other quantiles in one pass and with limited memory. ACM SIGMOD Record, 27(2), 426–435.
Manku, G. S., Rajagopalan, S., & Lindsay, B. G. (1998). Lecture Notes in Computer Science Volume 5012, New York.
Pfahringer, B., Holmes, G., & Kirkby, R. (2008). Handling numeric attributes in Hoeffding trees. Advances in Knowledge Discovery and Data Mining, 296–307.
Provost, F., & Domingos, P. (2003). Tree induction for probability-based ranking. Machine Learning, 52(3), 199–215.
Vitter, J. S. (1985). Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1), 37–57.

Author information

Authors and Affiliations

Orange Labs, Lannion, France
Christophe Salperwyck & Vincent Lemaire
LIFL, Université de Lille 3, Villeneuve d’Ascq, France
Christophe Salperwyck & Vincent Lemaire

Corresponding author

Correspondence to Christophe Salperwyck.

Editor information

Editors and Affiliations

Department of Economics and Management, University of Pavia, Via San Felice 7, Pavia, 27100, Italy
Paolo Giudici
Department of Economics and Business, University of Catania, Corso Italia 55, Catania, 95129, Italy
Salvatore Ingrassia
Department of Statistics, University of Rome "La Sapienza", Piazzale Aldo Moro 5, Rome, 00185, Italy
Maurizio Vichi

About this paper

Cite this paper

Salperwyck, C., Lemaire, V. (2013). A Two Layers Incremental Discretization Based on Order Statistics. In: Giudici, P., Ingrassia, S., Vichi, M. (eds) Statistical Models for Data Analysis. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Heidelberg. https://doi.org/10.1007/978-3-319-00032-9_36

DOI: https://doi.org/10.1007/978-3-319-00032-9_36
Published: 22 May 2013
Publisher Name: Springer, Heidelberg
Print ISBN: 978-3-319-00031-2
Online ISBN: 978-3-319-00032-9
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
