# The Impact of Overfitting and Overgeneralization on the Classification Accuracy in Data Mining

Many classification studies conclude with a summary table that presents the performance of various data mining approaches on different datasets. No single method outperforms all others all the time. Furthermore, the performance of a classification method in terms of its false-positive and false-negative rates may be totally unpredictable, and attempts to minimize one of these two rates may increase the other. If the model allows new data to be deemed unclassifiable when there is not adequate information to classify them, then the previous two error rates may be very low while, at the same time, the rate of unclassifiable new examples is very high. The root of this critical problem is the overfitting and overgeneralization behavior of a given classification approach when it processes a particular dataset. Although this situation is of fundamental importance to data mining, it has not been studied from a comprehensive point of view. Thus, this chapter analyzes the above issues in depth. It also proposes a new approach, called the Homogeneity-Based Algorithm (HBA), for optimally controlling the previous three error rates. This is done by first formulating an optimization problem. The key development in this chapter is a special way of analyzing the space of the training data and then partitioning it according to the data density of its different regions. Next, the classification task is pursued based on this partitioning of the training space. In this way, the three error rates can be controlled in a comprehensive manner. Some preliminary computational results indicate that the proposed approach has significant potential to fill a critical gap in current data mining methodologies.

Key words: classification, prediction, overfitting, overgeneralization, false-positive, false-negative, homogeneous set, homogeneity degree, optimization
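The chapter body (not reproduced in this preview) formulates the control of the three error rates as an optimization problem. As a hedged sketch of what such an objective can look like, one can weigh the three rates by penalty costs; the symbols $c_{FP}$, $c_{FN}$, and $c_{UC}$ below are illustrative assumptions, not necessarily the chapter's own notation:

$$
\min \; c_{FP}\cdot \mathrm{RATE}_{FP} \;+\; c_{FN}\cdot \mathrm{RATE}_{FN} \;+\; c_{UC}\cdot \mathrm{RATE}_{UC}
$$

The density-based partitioning idea can be illustrated in the same spirit. The following is a minimal sketch of the general principle of refusing to classify queries that fall in sparse regions of the training space; it is not the HBA itself, and the function name, distance rule, and thresholds are all assumptions made for illustration:

```python
# Minimal sketch (NOT the chapter's HBA): refuse to classify queries that
# fall in sparse regions of the training space, trading unclassifiable
# examples against false positives/negatives. All names and thresholds
# here are illustrative assumptions.
import numpy as np

def classify_with_rejection(X_train, y_train, x_query, radius=1.0, min_points=5):
    """Majority-vote classification restricted to dense regions.

    Returns the majority label among training points within `radius` of
    `x_query`, or None ("unclassifiable") if fewer than `min_points`
    training points fall inside that ball.
    """
    dists = np.linalg.norm(X_train - x_query, axis=1)
    inside = dists <= radius
    if inside.sum() < min_points:        # sparse region: refuse to classify
        return None
    labels, counts = np.unique(y_train[inside], return_counts=True)
    return labels[np.argmax(counts)]     # dense region: majority vote

# Tiny usage example with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(classify_with_rejection(X, y, np.array([0.1, 0.2])))  # dense: a label
print(classify_with_rejection(X, y, np.array([8.0, 8.0])))  # sparse: None
```

Tightening `radius` or raising `min_points` lowers the false-positive and false-negative rates on the points that do get classified, at the price of more unclassifiable examples, which mirrors the three-way trade-off the abstract describes.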

