Linear and Order Statistics Combiners for Pattern Classification

  • Amanda J. C. Sharkey
Part of the Perspectives in Neural Computing book series (PERSPECT.NEURAL)

Summary

Several researchers have experimentally shown that substantial improvements can be obtained in difficult pattern recognition problems by combining or integrating the outputs of multiple classifiers. This chapter provides an analytical framework to quantify the improvements in classification results due to combining. The results apply to both linear combiners and order statistics combiners. We first show that, to a first-order approximation, the error rate obtained over and above the Bayes error rate is directly proportional to the variance of the actual decision boundaries around the Bayes optimum boundary. Combining classifiers in output space reduces this variance, and hence reduces the “added” error. If N unbiased classifiers are combined by simple averaging, the added error rate can be reduced by a factor of N if the individual errors in approximating the decision boundaries are uncorrelated. Expressions are then derived for linear combiners that are biased or correlated, and the effect of output correlations on ensemble performance is quantified. For order-statistics-based non-linear combiners, we derive expressions that indicate how much the median, the maximum, and in general the i-th order statistic can improve classifier performance. The analysis presented here facilitates the understanding of the relationships among error rates, classifier boundary distributions, and combining in output space. Experimental results on several public domain data sets are provided to illustrate the benefits of combining and to support the analytical results.
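
The factor-of-N claim in the summary is easy to illustrate numerically. The sketch below is hypothetical code, not taken from the chapter: it assumes each classifier's decision boundary misses the Bayes-optimal boundary by an unbiased, uncorrelated Gaussian offset (the ensemble size, offset spread, and variable names are illustrative choices) and compares the offset variance of a single classifier, the averaging combiner, and the median combiner.

```python
import numpy as np

# Illustrative sketch of the boundary-offset model summarized above.
# Each classifier's boundary misses the Bayes boundary by a random offset b_i;
# the "added" error grows with the variance of that offset, so combiners that
# shrink the variance shrink the added error.

rng = np.random.default_rng(0)
N = 7            # ensemble size (arbitrary choice for this demo)
sigma = 1.0      # std. dev. of each classifier's boundary offset
trials = 200_000

# Unbiased, uncorrelated Gaussian boundary offsets for N classifiers.
b = rng.normal(0.0, sigma, size=(trials, N))

var_single = b[:, 0].var()                    # one classifier on its own
var_mean = b.mean(axis=1).var()               # linear (simple averaging) combiner
var_median = np.median(b, axis=1).var()       # order-statistics (median) combiner

print(f"single classifier: {var_single:.4f}")
print(f"mean of {N}:       {var_mean:.4f}  (theory: sigma^2/N = {sigma**2 / N:.4f})")
print(f"median of {N}:     {var_median:.4f}")
```

Under these assumptions the averaged offsets have variance close to sigma^2/N, while the median combiner also shrinks the variance, though less sharply for Gaussian offsets, mirroring the linear-versus-order-statistics comparison the chapter develops analytically.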

Copyright information

© Springer-Verlag London Limited 1999

Authors and Affiliations

  • Amanda J. C. Sharkey
  1. Department of Computer Science, University of Sheffield, Sheffield, UK
