Fundamentals of Machine Learning

  • Ke-Lin Du
  • M. N. S. Swamy


This chapter deals with the fundamental concepts and theories of machine learning. It first introduces various learning and inference methods, followed by learning and generalization, model selection, and neural networks as universal machines. Some other important topics are also covered.
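One of the themes above, model selection, is often carried out in practice by K-fold cross-validation: candidate models are compared by their average held-out error, and the one with the lowest estimate is selected. The following sketch is illustrative only (the synthetic dataset, the two candidate models, and the fold count are assumptions, not content from this chapter):

```python
# Minimal sketch of model selection by K-fold cross-validation.
# The dataset, candidate models, and k=5 are illustrative assumptions.
import random

def kfold_mse(xs, ys, fit, k=5):
    """Average held-out mean-squared error of a model over k folds."""
    idx = list(range(len(xs)))
    random.Random(0).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # disjoint folds covering all points
    total = 0.0
    for fold in folds:
        held = set(fold)
        train = [i for i in idx if i not in held]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        total += sum((predict(xs[i]) - ys[i]) ** 2 for i in fold) / len(fold)
    return total / k

def fit_constant(xs, ys):
    """Candidate 1: predict the training mean everywhere."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """Candidate 2: ordinary least-squares line a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return lambda x, a=my - b * mx, b=b: a + b * x

# Synthetic data: y = 2x + 1 plus small Gaussian noise.
rng = random.Random(1)
xs = [i / 10 for i in range(100)]
ys = [2 * x + 1 + rng.gauss(0, 0.1) for x in xs]

scores = {name: kfold_mse(xs, ys, f)
          for name, f in [("constant", fit_constant), ("linear", fit_linear)]}
best = min(scores, key=scores.get)  # the linear model should win here
```

Selecting the model with the lowest cross-validated error is the basic recipe; the chapter's cited literature (e.g., on the choice of K and the variance of the estimator) discusses when this estimate is reliable.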



Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2019

Authors and Affiliations

  1. Department of Electrical and Computer Engineering, Concordia University, Montreal, Canada
  2. Xonlink Inc., Hangzhou, China
