A New Sampling Strategy for Building Decision Trees from Large Databases

  • J. H. Chauchat
  • R. Rakotomalala
Part of the Studies in Classification, Data Analysis, and Knowledge Organization book series (STUDIES CLASS)


We propose a fast and efficient sampling strategy for building decision trees from very large databases, even when there are many numerical attributes that must be discretized at each step. Successive samples are used, one at each tree node. Applying the method to a simulated database of virtually infinite size confirms that, when the database is large and contains many numerical attributes, our strategy of fast sampling at each node (with sample sizes of about n = 300 to 500) speeds up the mining process while maintaining the accuracy of the classifier.
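The node-level sampling idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the information-gain criterion, the midpoint-threshold search (a crude stand-in for a proper supervised discretization), and all function names and parameter defaults are simplifying assumptions. The key point it shows is that a fresh random sample is drawn at every node, so the cost of the split search is bounded no matter how many records reach the node.

```python
import random
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(sample, attr):
    """Midpoint threshold on `attr` maximizing information gain over the
    sampled records (a simplified discretization for illustration)."""
    base = entropy([r[-1] for r in sample])
    best_gain, best_t = 0.0, None
    values = sorted({r[attr] for r in sample})
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [r[-1] for r in sample if r[attr] <= t]
        right = [r[-1] for r in sample if r[attr] > t]
        gain = base - (len(left) * entropy(left)
                       + len(right) * entropy(right)) / len(sample)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

def grow(records, attrs, node_sample=300, min_gain=0.01, rng=random):
    """Grow a tree recursively; a FRESH sample is drawn at every node,
    so only `node_sample` records are scanned to choose each split."""
    labels = [r[-1] for r in records]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:
        return majority
    sample = (rng.sample(records, node_sample)
              if len(records) > node_sample else records)
    attr, t, gain = max(((a,) + best_threshold(sample, a) for a in attrs),
                        key=lambda x: x[2])
    if t is None or gain < min_gain:
        return majority
    left = [r for r in records if r[attr] <= t]
    right = [r for r in records if r[attr] > t]
    if not left or not right:
        return majority
    return (attr, t,
            grow(left, attrs, node_sample, min_gain, rng),
            grow(right, attrs, node_sample, min_gain, rng))

def classify(tree, rec):
    """Route a record down the tree to a leaf label."""
    while isinstance(tree, tuple):
        attr, t, l, r = tree
        tree = l if rec[attr] <= t else r
    return tree
```

Note that the split threshold is chosen from the sample, but the resulting partition is applied to all records reaching the node, so child nodes still receive the full data and can draw their own samples.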


Keywords: Decision Tree · Association Rule · Continuous Attribute · Tree Node · Numerical Attribute





Copyright information

© Springer-Verlag Berlin · Heidelberg 2000

Authors and Affiliations

  • J. H. Chauchat (Université Lumière Lyon 2, Bron Cedex, France)
  • R. Rakotomalala (Université Lumière Lyon 2, Bron Cedex, France)
