A Grouping Method for Categorical Attributes Having Very Large Number of Values

Boullé, Marc

doi:10.1007/11510888_23

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3587)

Included in the following conference series:

International Workshop on Machine Learning and Data Mining in Pattern Recognition

Abstract

In supervised machine learning, partitioning the values of a categorical attribute (also called grouping) aims at constructing a new synthetic attribute that keeps the information of the initial attribute while reducing the number of its values. When the number of values is very large, the risk of overfitting the data increases sharply and building good groupings becomes difficult. In this paper, we propose two new grouping methods based on a Bayesian approach, leading to Bayes optimal groupings. The first method exploits a standard schema for grouping models, and the second extends this schema by managing a “garbage” group dedicated to the least frequent values. Extensive comparative experiments demonstrate that the new grouping methods build high-quality groupings in terms of predictive quality, robustness, and small number of groups.
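To make the idea of a “garbage” group concrete, the following is a minimal sketch of the preprocessing step only: values whose frequency falls below a threshold are merged into a single group. It is not the Bayesian method proposed in the paper, which selects the grouping by optimizing a criterion. The function name `garbage_group_mapping`, the `min_count` threshold, and the `"GARBAGE"` label are hypothetical choices made for illustration.

```python
from collections import Counter


def garbage_group_mapping(values, min_count, garbage_label="GARBAGE"):
    """Map each distinct value to itself, or to a shared garbage group
    when it occurs fewer than `min_count` times (illustrative threshold)."""
    counts = Counter(values)
    return {
        value: (value if count >= min_count else garbage_label)
        for value, count in counts.items()
    }


# Toy categorical attribute: "c" and "d" each occur only once.
values = ["a", "a", "a", "b", "b", "c", "d"]
mapping = garbage_group_mapping(values, min_count=2)

# Recode the attribute with the reduced set of values.
grouped = [mapping[v] for v in values]
```

In this toy example the four initial values collapse to three: the frequent values `a` and `b` keep their own groups, while the rare values `c` and `d` share the garbage group.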

Author information

Authors and Affiliations

France Telecom R&D, 2, Avenue Pierre Marzin, 22300, Lannion, France
Marc Boullé

Editor information

Editors and Affiliations

Institute of Computer Vision and applied Computer Sciences, IBaI, Germany
Petra Perner
Institute of Media and Information Technology, Chiba University, Japan
Atsushi Imiya

Cite this paper

Boullé, M. (2005). A Grouping Method for Categorical Attributes Having Very Large Number of Values. In: Perner, P., Imiya, A. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2005. Lecture Notes in Computer Science, vol 3587. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11510888_23

DOI: https://doi.org/10.1007/11510888_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26923-6
Online ISBN: 978-3-540-31891-0
eBook Packages: Computer Science, Computer Science (R0)