Matching Data Mining Algorithm Suitability to Data Characteristics Using a Self-Organizing Map

  • Kate A. Smith
  • Frederick Woo
  • Vic Ciesielski
  • Remzi Ibrahim
Part of the Advances in Soft Computing book series (AINSC, volume 14)


The vast range of data mining algorithms available for learning classification problems has encouraged a trial-and-error approach to finding the best model. This problem is exacerbated by the fact that little is known about which techniques are suited to which types of problems. This paper provides some insights into the data characteristics that suit particular data mining algorithms. Our approach consists of four main stages. First, the performance of six leading data mining algorithms is examined across a collection of 57 well-known classification problems from the machine learning literature. Second, a collection of statistics that describe each of the 57 problems in terms of data complexity is collated. Third, a self-organising map (SOM) is used to cluster the 57 problems based on these measures of complexity, so that each cluster represents a group of classification problems with similar data characteristics. Finally, the performance of each data mining algorithm within each cluster is examined to provide both quantitative and qualitative insights into which techniques perform best on certain problem types.
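The clustering and per-cluster comparison stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a small hand-written SOM in NumPy, treats each map unit as a cluster (a simplification), and runs on synthetic data. The counts of 57 problems and six algorithms come from the abstract; the number of complexity measures (5), the 3×3 map size, and all names such as `train_som` and `assign_units` are assumptions made for illustration.

```python
import numpy as np

def train_som(data, rows=3, cols=3, n_iter=2000, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small rectangular SOM with the classic online update rule."""
    rng = np.random.default_rng(seed)
    # Grid coordinates of every map unit, shape (rows*cols, 2).
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise unit weights from randomly chosen data points.
    weights = data[rng.integers(0, len(data), size=rows * cols)].astype(float)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)              # learning rate decays linearly
        sigma = sigma0 * (1.0 - frac) + 0.1  # neighbourhood width stays > 0
        x = data[rng.integers(0, len(data))]
        # Best-matching unit: the unit whose weights are closest to x.
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
        # Gaussian neighbourhood around the BMU, measured on the map grid.
        d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
        h = np.exp(-d2 / (2.0 * sigma ** 2))
        weights += lr * h[:, None] * (x - weights)
    return weights

def assign_units(data, weights):
    """Return the index of the best-matching unit for every row of data."""
    d = ((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

# Synthetic stand-ins (assumption): complexity measures and accuracies.
rng = np.random.default_rng(42)
n_problems, n_measures, n_algorithms = 57, 5, 6
measures = rng.normal(size=(n_problems, n_measures))
accuracy = rng.uniform(0.5, 1.0, size=(n_problems, n_algorithms))

# Standardise each measure so no single statistic dominates the distance.
z = (measures - measures.mean(axis=0)) / measures.std(axis=0)

weights = train_som(z)
units = assign_units(z, weights)

# Final stage: compare mean algorithm accuracy within each occupied unit.
for u in np.unique(units):
    members = units == u
    mean_acc = accuracy[members].mean(axis=0)
    print(f"unit {u}: {members.sum()} problems, "
          f"best algorithm index {mean_acc.argmax()}")
```

With real data, `measures` would hold the collated complexity statistics and `accuracy` the measured performance of each algorithm on each problem; the printed summary then indicates which algorithm tends to do best on each group of similar problems.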


Keywords: Classification Problem, Canonical Correlation, Data Characteristic, Data Mining Algorithm, Qualitative Insight





Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Kate A. Smith (1)
  • Frederick Woo (1)
  • Vic Ciesielski (2)
  • Remzi Ibrahim
  1. School of Business Systems, Monash University, Victoria, Australia
  2. School of Computer Science and Information Technology, Royal Melbourne Institute of Technology, Victoria, Australia
