How to extract predictive binary attributes from a categorical one

  • I. C. Lerman
  • J. F. Pinto da Costa
Part of the Studies in Classification, Data Analysis, and Knowledge Organization book series (STUDIES CLASS)


In this work we present new ways of dealing with categorical attributes, in particular for their use in binary decision trees. We consider two main operations. The first uses the joint distribution of two or more categorical attributes to improve the final performance of the decision tree. The second, and most important, operation extracts a small number of predictive binary attributes from a categorical attribute, especially when the latter has a large number of values. With more than two classes to predict, most existing binary decision tree software must test an exponential number of binary attributes for each categorical attribute: an attribute with K values admits 2^(K-1) - 1 distinct binary splits, which quickly becomes prohibitive. Our method, ARCADE, is independent of the number of classes to predict, and it starts by significantly reducing the number of values of the initial categorical attribute. This is done by clustering the initial values with a hierarchical classification method; each cluster of values then becomes a value of a new categorical attribute, which is used in the decision tree instead of the initial one. Moreover, not all of the binary attributes associated with this new categorical attribute are used, only the predictive ones. The reduction in the complexity of the search for the best binary split is therefore enormous, as will be seen in the application we consider: the old and still lively problem of protein secondary structure prediction.
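To make the reduction concrete, below is a minimal Python sketch of the two-stage idea, under stated assumptions: it is not the authors' ARCADE code, SciPy's average-linkage agglomerative clustering stands in for the hierarchical classification method the paper builds on, and the function names (`reduce_categories`, `binary_splits`) and the class-profile representation of each value are illustrative choices, not taken from the paper.

```python
# Illustrative sketch (not the authors' ARCADE implementation): collapse a
# categorical attribute with many values into a few clusters before the
# binary-split search, as the abstract describes.
from itertools import combinations

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def reduce_categories(values, labels, n_clusters=4):
    """Cluster the values of a categorical attribute by the similarity of
    their class-conditional profiles P(class | value).

    Average-linkage clustering is used here as a stand-in for the
    hierarchical classification method of the paper (an assumption).
    """
    cats = sorted(set(values))
    classes = sorted(set(labels))
    # One row per original value: empirical distribution of the class labels.
    profiles = np.array([
        [np.mean([lab == c for v, lab in zip(values, labels) if v == cat])
         for c in classes]
        for cat in cats
    ])
    tree = linkage(profiles, method="average")
    cluster_ids = fcluster(tree, t=n_clusters, criterion="maxclust")
    return dict(zip(cats, cluster_ids))


def binary_splits(ids):
    """Enumerate the 2**(K-1) - 1 non-trivial binary splits of a K-valued
    attribute; after reduction, K is the number of clusters rather than
    the number of original values."""
    ids = sorted(set(ids))
    first, rest = ids[0], ids[1:]
    # Only subsets containing the first value, excluding the full set,
    # so each two-sided split is generated exactly once.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = {first, *extra}
            yield left, set(ids) - left


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    values = rng.integers(0, 20, size=1000).tolist()  # 20 original values
    labels = rng.integers(0, 3, size=1000).tolist()   # 3 classes to predict
    mapping = reduce_categories(values, labels, n_clusters=4)
    reduced = [mapping[v] for v in values]
    # 2**3 - 1 = 7 candidate splits instead of 2**19 - 1 = 524287.
    print(sum(1 for _ in binary_splits(reduced)))
```

The final ARCADE step, retaining only those binary attributes judged predictive, is omitted here; the sketch only illustrates how clustering the values shrinks the split-search space.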


Decision trees · Categorical attributes · Binarization · Hierarchical clustering





Copyright information

© Springer-Verlag Berlin Heidelberg 1998

Authors and Affiliations

  • I. C. Lerman (1)
  • J. F. Pinto da Costa (2, 3)

  1. IRISA-INRIA, Rennes, France
  2. Dep. Matemática Aplicada, Universidade do Porto, Portugal
  3. LIACC, Universidade do Porto, Portugal
