Matching Data Mining Algorithm Suitability to Data Characteristics Using a Self-Organizing Map
The vast range of data mining algorithms available for learning classification problems has encouraged a trial-and-error approach to finding the best model. The problem is exacerbated because little is known about which techniques suit which types of problems. This paper provides some insights into the data characteristics that suit particular data mining algorithms. Our approach consists of four main stages. First, the performance of six leading data mining algorithms is examined across a collection of 57 well-known classification problems from the machine learning literature. Second, a collection of statistics that describe each of the 57 problems in terms of data complexity is collated. Third, a self-organizing map (SOM) is used to cluster the 57 problems based on these measures of complexity, so that each cluster represents a group of classification problems with similar data characteristics. Finally, the performance of each data mining algorithm within each cluster is examined to provide both quantitative and qualitative insights into which techniques perform best on certain problem types.
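The third and fourth stages can be illustrated with a minimal sketch, given here under stated assumptions and not as the authors' implementation. The abstract does not specify the SOM software, the map size, or the complexity statistics, so the 3x3 grid, the five features, the training schedule, and every function name below are hypothetical. The feature and accuracy arrays are random placeholders standing in for the 57 problems and six algorithms, not real results, and each SOM unit is treated as one cluster for simplicity.

```python
import numpy as np


def train_som(X, rows=3, cols=3, n_iter=500, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small rectangular SOM; returns weights of shape (rows*cols, n_features)."""
    rng = np.random.default_rng(seed)
    n_units = rows * cols
    # Grid coordinates of each unit, used by the neighbourhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise unit weights from randomly chosen samples.
    weights = X[rng.integers(0, len(X), size=n_units)].astype(float)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                 # linearly decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 1e-3    # linearly shrinking neighbourhood radius
        x = X[rng.integers(0, len(X))]          # one randomly drawn sample
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))  # best-matching unit
        grid_dist2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
        h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))        # Gaussian neighbourhood
        weights += lr * h[:, None] * (x - weights)
    return weights


def assign_units(X, weights):
    """Index of the best-matching SOM unit for every row of X."""
    dist2 = ((X[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    return dist2.argmin(axis=1)


def per_cluster_mean(scores, clusters):
    """Mean score of each algorithm (column) within each cluster."""
    return {int(c): scores[clusters == c].mean(axis=0) for c in np.unique(clusters)}


# Placeholder inputs (random, NOT the paper's data):
# 57 problems x 5 hypothetical complexity statistics, and 57 x 6 accuracies.
rng = np.random.default_rng(42)
features = rng.normal(size=(57, 5))
accuracy = rng.uniform(0.5, 1.0, size=(57, 6))

# Standardise each statistic so that no single measure dominates the distance.
features = (features - features.mean(axis=0)) / features.std(axis=0)

weights = train_som(features)
clusters = assign_units(features, weights)
summary = per_cluster_mean(accuracy, clusters)

for unit, means in summary.items():
    print(f"cluster {unit}: best algorithm index = {int(means.argmax())}")
```

Treating each map unit as a cluster keeps the sketch short. A fuller analysis would typically train a larger map and then group neighbouring units into clusters before comparing algorithm performance within each group.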
Keywords: Classification Problem, Canonical Correlation, Data Characteristic, Data Mining Algorithm, Qualitative Insight