
The Impact of Overfitting and Overgeneralization on the Classification Accuracy in Data Mining

  • Huy Nguyen Anh Pham
  • Evangelos Triantaphyllou

Many classification studies conclude with a summary table that presents the performance of various data mining approaches on different datasets. No single method outperforms all others all the time. Furthermore, the performance of a classification method in terms of its false-positive and false-negative rates may be highly unpredictable, and attempts to minimize one of these two rates may increase the other. If the model is allowed to deem new data unclassifiable when there is not adequate information to classify them, then the previous two error rates may be very low while, at the same time, the rate of unclassifiable new examples becomes very high. The root of this critical problem is the overfitting and overgeneralization behavior of a given classification approach when it processes a particular dataset. Although this situation is of fundamental importance to data mining, it has not been studied from a comprehensive point of view. Thus, this chapter analyzes the above issues in depth. It also proposes a new approach, called the Homogeneity-Based Algorithm (or HBA), for optimally controlling the previous three error rates. This is done by first formulating an optimization problem. The key development in this chapter is a special way of analyzing the space of the training data and then partitioning it according to the data density of its different regions. Next, the classification task is pursued based on this partitioning of the training space. In this way, the three error rates can be controlled in a comprehensive manner. Some preliminary computational results indicate that the proposed approach has significant potential to fill a critical gap in current data mining methodologies.

Key words: classification, prediction, overfitting, overgeneralization, false-positive, false-negative, homogeneous set, homogeneity degree, optimization
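To make the trade-off among the three rates concrete, the following minimal Python sketch (not the authors' HBA; the `classify`, `error_rates`, `radius`, and `min_neighbors` names are hypothetical illustrations) uses a nearest-neighbor rule that abstains in sparse regions of the training space and then tallies the false-positive, false-negative, and unclassifiable rates.

```python
# Hypothetical sketch only: a 1-nearest-neighbor classifier that abstains when
# the local density of training points around a query is too low, plus a tally
# of the three rates discussed in the chapter (false-positive, false-negative,
# unclassifiable). The radius and neighbor threshold are illustrative choices.

from math import dist

def classify(train, x, radius=1.0, min_neighbors=2):
    """Return +1/-1 from the nearest neighbor, or None (unclassifiable)
    when fewer than `min_neighbors` training points fall within `radius`."""
    neighbors = [(dist(p, x), label) for p, label in train]
    if sum(1 for d, _ in neighbors if d <= radius) < min_neighbors:
        return None                      # sparse region: abstain
    return min(neighbors)[1]             # label of the closest training point

def error_rates(train, test, **kw):
    """Tally false-positive, false-negative, and unclassifiable rates."""
    fp = fn = un = 0
    for x, truth in test:
        pred = classify(train, x, **kw)
        if pred is None:
            un += 1
        elif pred == +1 and truth == -1:
            fp += 1
        elif pred == -1 and truth == +1:
            fn += 1
    n = len(test)
    return fp / n, fn / n, un / n

if __name__ == "__main__":
    # Tiny synthetic dataset: positives near (0, 0), negatives near (3, 3).
    train = [((0.0, 0.1), +1), ((0.2, 0.0), +1),
             ((3.0, 3.1), -1), ((2.9, 3.0), -1)]
    test = [((0.1, 0.1), +1), ((3.0, 3.0), -1), ((1.5, 1.5), +1)]
    # A stricter density requirement trades lower FP/FN for more abstentions.
    for min_nb in (1, 2, 3):
        print(min_nb, error_rates(train, test, radius=1.0, min_neighbors=min_nb))
```

Raising `min_neighbors` in this toy setting mirrors the point made in the abstract: the two misclassification rates can be driven down, but only by allowing more new examples to remain unclassified.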




Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Huy Nguyen Anh Pham (1)
  • Evangelos Triantaphyllou (1)
  1. Department of Computer Science, Louisiana State University, Baton Rouge, USA
