Combining Multiple Learners: Data Fusion and Ensemble Learning


Abstract

Different learning algorithms achieve different accuracies on a given problem. The no-free-lunch theorem asserts that no single learning algorithm achieves the best performance in every domain; multiple learners can, however, be combined to attain higher accuracy. Data fusion is the process of fusing multiple records that represent the same real-world object into a single, consistent, and clean representation. Fusing data to improve prediction accuracy and reliability is an important problem in machine learning.
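As a concrete illustration of the idea described above (not part of the chapter itself), the following minimal sketch combines three learners with different inductive biases by hard majority voting, assuming scikit-learn is available; the synthetic dataset, the choice of base learners, and the hyperparameters are arbitrary assumptions made only for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score

# Synthetic two-class problem (a stand-in for any real dataset).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Three base learners with different inductive biases.
base_learners = [
    ("logreg", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ("nb", GaussianNB()),
]

# Accuracy of each base learner on its own.
for name, clf in base_learners:
    clf.fit(X_train, y_train)
    print(f"{name:7s} accuracy: {accuracy_score(y_test, clf.predict(X_test)):.3f}")

# Hard majority voting over the three learners; the combined decision
# often matches or exceeds the best individual learner.
ensemble = VotingClassifier(estimators=base_learners, voting="hard")
ensemble.fit(X_train, y_train)
print(f"ensemble accuracy: {accuracy_score(y_test, ensemble.predict(X_test)):.3f}")
```

Majority voting is only the simplest combination rule; weighted summing, stacking, and boosting replace the vote with learned combination weights.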

Keywords

Covariance Shrinkage Summing 


Copyright information

© Springer-Verlag London 2014

Authors and Affiliations

  1. Enjoyor Labs, Enjoyor Inc., Hangzhou, China
  2. Department of Electrical and Computer Engineering, Concordia University, Montreal, Canada
