Combining Multiple Learners: Data Fusion and Ensemble Learning

  • Ke-Lin Du
  • M. N. S. Swamy


According to the no-free-lunch theorem, no single method performs best in every domain. In practice, many methods are available for solving a given problem, each with its own limitations. A popular way of dealing with difficult problems is brainstorming, in which participants share knowledge from different viewpoints and reach a collective decision by voting. Data fusion follows the same idea: it combines the results of these individual methods by means of ensemble learning. This chapter deals with ensemble learning.
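As an illustration of the voting idea described above, the following minimal sketch fuses the outputs of several learners by plurality vote. It is not taken from the chapter: the function name `majority_vote` and the example labels are hypothetical, and ties are assumed to be broken in favour of the label seen first.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of learners.

    Assumption: ties go to the label that appears first in `predictions`
    (Counter.most_common keeps first-seen order for equal counts).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    return Counter(predictions).most_common(1)[0][0]


# Three hypothetical learners classify the same input.
votes = ["spam", "ham", "spam"]
fused = majority_vote(votes)
print(fused)
```

Unweighted voting of this kind is only the simplest combination rule; weighted votes or learned combiners would replace the counting step.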



Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2019

Authors and Affiliations

  1. Department of Electrical and Computer Engineering, Concordia University, Montreal, Canada
  2. Xonlink Inc., Hangzhou, China
