Robust impurity measures in decision trees
Tree-based methods are statistical procedures for automatic learning from data, their main characteristic being the simplicity of the results obtained. Their virtue is also their defect, since the tree-growing process is very dependent on the data: small fluctuations in the data may cause a large change in the tree-growing process. Our main objective was to define data diagnostics to prevent internal instability in the tree-growing process before a particular split is made. We present a general formulation for the impurity of a node as a function of the proximity between the individuals in the node and its representative. We then compute a stability measure for a split, which allows us to define more robust splits. We have also studied the theoretical complexity of this algorithm and its applicability to large data sets.
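As a minimal sketch of the idea of impurity as proximity to a representative, the Python below assumes squared Euclidean distance as the proximity and the node centroid as the representative. Neither choice is stated in the abstract. Under these assumptions the impurity of a regression node is the variance of the response, and the impurity of a classification node with one-hot coded classes is the Gini index. The function names are hypothetical, and `leave_one_out_reductions` is only an illustrative sensitivity check, not the stability measure defined in the paper.

```python
import numpy as np

def node_impurity(points):
    """Mean squared Euclidean distance from each individual to the node centroid.

    Assumption: the centroid is the node's representative and squared
    Euclidean distance is the proximity.
    """
    points = np.asarray(points, dtype=float)
    if points.ndim == 1:
        points = points[:, None]
    if len(points) == 0:
        raise ValueError("a node must contain at least one individual")
    centroid = points.mean(axis=0)
    return float(((points - centroid) ** 2).sum(axis=1).mean())

def one_hot(labels, n_classes):
    """Code integer class labels 0..n_classes-1 as indicator vectors."""
    return np.eye(n_classes)[np.asarray(labels)]

def impurity_reduction(points, go_left):
    """Parent impurity minus the size-weighted impurity of the two children."""
    points = np.asarray(points, dtype=float)
    go_left = np.asarray(go_left, dtype=bool)
    left, right = points[go_left], points[~go_left]
    weighted = (len(left) * node_impurity(left)
                + len(right) * node_impurity(right)) / len(points)
    return node_impurity(points) - weighted

def leave_one_out_reductions(points, go_left):
    """Impurity reduction of a fixed split recomputed with each individual removed.

    Hypothetical diagnostic for illustration only: a wide spread of values
    suggests the split depends strongly on single individuals.
    """
    points = np.asarray(points, dtype=float)
    go_left = np.asarray(go_left, dtype=bool)
    keep_masks = ~np.eye(len(points), dtype=bool)
    return np.array([impurity_reduction(points[m], go_left[m]) for m in keep_masks])

# Classification node: impurity of one-hot coded labels equals the Gini index.
labels = [0, 0, 1, 1, 1, 2]
gini = node_impurity(one_hot(labels, 3))

# Regression node: impurity equals the (population) variance of the response.
y = [1.0, 2.0, 3.0, 4.0]
split = [True, True, False, False]
reduction = impurity_reduction(y, split)
loo = leave_one_out_reductions(y, split)
```

With one-hot coding, the centroid is the vector of class proportions, so the mean squared distance to it is one minus the sum of squared proportions, which is the Gini index. Other proximities or representatives would give other impurity measures within the same scheme.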
Keywords: Regression Tree, Child Node, Classification Tree, Convex Polygon, Gini Index