A Grouping Method for Categorical Attributes Having Very Large Number of Values

  • Conference paper
Machine Learning and Data Mining in Pattern Recognition (MLDM 2005)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3587)

Abstract

In supervised machine learning, partitioning the values of a categorical attribute (also called grouping) aims to construct a new synthetic attribute that preserves the information of the initial attribute while reducing its number of values. When the number of values is very large, the risk of overfitting the data increases sharply and building good groupings becomes difficult. In this paper, we propose two new grouping methods founded on a Bayesian approach, leading to Bayes optimal groupings. The first method exploits a standard schema for grouping models; the second extends this schema by managing a “garbage” group dedicated to the least frequent values. Extensive comparative experiments demonstrate that the new grouping methods build high-quality groupings in terms of predictive quality, robustness, and small number of groups.
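
To make the idea of a “garbage” group concrete, the following minimal Python sketch merges the least frequent values of a categorical attribute into a single shared group. It is not the Bayesian method proposed in the paper: the abstract does not state how the garbage group is selected, so the fixed frequency threshold `min_frequency`, the label `__garbage__`, and the function name `group_with_garbage` are all assumptions made purely for illustration.

```python
from collections import Counter


def group_with_garbage(values, min_frequency=2, garbage_label="__garbage__"):
    """Return a mapping from each distinct value to its group.

    Values seen at least `min_frequency` times keep their own group;
    all rarer values share one garbage group. The threshold rule is a
    hypothetical stand-in, not the criterion used in the paper.
    """
    counts = Counter(values)
    return {
        value: (value if count >= min_frequency else garbage_label)
        for value, count in counts.items()
    }


# Toy attribute with two frequent values and two values seen only once.
values = ["a", "a", "a", "b", "b", "c", "d"]
mapping = group_with_garbage(values, min_frequency=2)
grouped = [mapping[v] for v in values]
```

Here the four initial values collapse into three groups, since `c` and `d` each occur only once and fall into the garbage group. A fixed threshold ignores the class labels entirely, which is exactly the kind of choice a supervised, Bayesian criterion would make in a principled way.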

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Boullé, M. (2005). A Grouping Method for Categorical Attributes Having Very Large Number of Values. In: Perner, P., Imiya, A. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2005. Lecture Notes in Computer Science, vol 3587. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11510888_23

  • DOI: https://doi.org/10.1007/11510888_23

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-26923-6

  • Online ISBN: 978-3-540-31891-0

  • eBook Packages: Computer Science, Computer Science (R0)
