Combining Artificial Neural Nets, pp. 127–161

# Linear and Order Statistics Combiners for Pattern Classification

## Summary

Several researchers have experimentally shown that substantial improvements can be obtained in difficult pattern recognition problems by combining or integrating the outputs of multiple classifiers. This chapter provides an analytical framework to *quantify* the improvements in classification results due to combining. The results apply to both linear combiners and order statistics combiners. We first show that, to a first-order approximation, the error rate obtained over and above the Bayes error rate is directly proportional to the variance of the actual decision boundaries around the Bayes optimum boundary. Combining classifiers in output space reduces this variance, and hence reduces the “added” error. If *N* unbiased classifiers are combined by simple averaging, the added error rate can be reduced by a factor of *N* if the individual errors in approximating the decision boundaries are uncorrelated. Expressions are then derived for linear combiners that are biased or correlated, and the effect of output correlations on ensemble performance is quantified. For non-linear combiners based on order statistics, we derive expressions that indicate how much the median, the maximum, and in general the *i*th order statistic can improve classifier performance. The analysis presented here facilitates the understanding of the relationships among error rates, classifier boundary distributions, and combining in output space. Experimental results on several public domain data sets are provided to illustrate the benefits of combining and to support the analytical results.
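The central quantitative claim above — that averaging *N* unbiased classifiers with uncorrelated errors reduces the added (variance-driven) error by a factor of *N* — can be checked numerically. The following is a minimal sketch, not the chapter's own derivation: it assumes each classifier's output at a fixed input is the true posterior plus i.i.d. zero-mean Gaussian noise, and all names and parameter values (`N`, `noise_std`, the choice of the 3rd order statistic) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each of N classifiers emits a noisy estimate of the
# true posterior P(class | x) at the same input x. Noise terms are i.i.d.
# zero-mean Gaussian, matching the "unbiased, uncorrelated" assumption.
N = 10                 # ensemble size (illustrative)
true_posterior = 0.7   # assumed true P(class | x)
noise_std = 0.1
samples = 100_000

# Each row is one draw of the N classifier outputs for the same input.
outputs = true_posterior + rng.normal(0.0, noise_std, size=(samples, N))

# Linear combiner: simple average of the N outputs.
ave = outputs.mean(axis=1)

# Order-statistics combiners: median, maximum, and a general i-th order
# statistic obtained by sorting each row of outputs.
med = np.median(outputs, axis=1)
mx = outputs.max(axis=1)
sorted_out = np.sort(outputs, axis=1)
ith = sorted_out[:, 2]   # e.g. the 3rd order statistic (i = 3)

# Variance of one classifier vs. the averaging combiner: the ratio should
# be close to N when the individual errors are uncorrelated.
var_single = outputs[:, 0].var()
var_ave = ave.var()
print(f"variance ratio single/average: {var_single / var_ave:.2f}")
print(f"median combiner variance:      {med.var():.5f}")
print(f"max combiner mean (biased up): {mx.mean():.3f}")
```

Note the contrast the chapter analyzes: the average and the median remain (nearly) unbiased while shrinking variance, whereas the maximum is biased upward, so its usefulness depends on the error distribution around the decision boundary.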

## Keywords

Decision Boundary, Pattern Classification, Individual Classifier, Error Reduction, Output Space
