Machine Learning with Shallow Neural Networks

Abstract

Conventional machine learning often uses optimization and gradient-descent methods for learning parameterized models. Examples of such models include linear regression, support vector machines, logistic regression, dimensionality reduction, and matrix factorization. Neural networks are also parameterized models that are learned with continuous optimization methods.
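
As a minimal illustration of the gradient-descent learning that this chapter builds on, the sketch below fits a linear regression model with the Widrow-Hoff (delta) rule, updating the parameters one training instance at a time. The synthetic data, learning rate, and epoch count are illustrative assumptions rather than details taken from the chapter.

```python
import numpy as np

# Minimal sketch: stochastic gradient descent with the Widrow-Hoff (least-squares)
# update rule on a synthetic linear-regression problem. All settings are illustrative.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))                 # 200 instances, 5 features
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.standard_normal(200)   # noisy linear targets

w = np.zeros(5)                                   # parameter vector to learn
eta = 0.01                                        # learning rate
for epoch in range(50):
    for i in rng.permutation(len(y)):
        error = y[i] - X[i] @ w                   # residual on one instance
        w += eta * error * X[i]                   # Widrow-Hoff update: w <- w + eta*(y - w.x)*x

print(np.round(w, 2))                             # should be close to true_w
```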

Keywords

  • Least-squares Classification
  • CBOW Model
  • Gradient Descent Update
  • Widrow-Hoff Learning
  • Probabilistic Latent Semantic Analysis

Notes

  1.

    In recent years, the sigmoid unit has fallen out of favor compared to the ReLU.

  2.

    In order to obtain exactly the same direction as the Fisher method with Equation 2.8, it is important to mean-center both the feature variables and the binary targets. As a result, each binary target becomes one of two real values with different signs, whose magnitude is the fraction of instances belonging to the other class. Alternatively, one can use a bias neuron to absorb the constant offsets. A numerical sketch of this equivalence appears after these notes.

  3.

    This subspace is defined by the top-k singular vectors of the singular value decomposition. However, the optimization problem does not impose orthogonality constraints, and therefore the columns of V might use a different, non-orthogonal basis system to represent the same subspace. A small numerical check of this property appears after these notes.

  4.

    There is no loss in reconstruction accuracy in several special cases like the single-layer case discussed here, even on the training data. In other cases, the loss of accuracy is only on the training data, but the autoencoder tends to better reconstruct out-of-sample data because of the regularization effects of parameter footprint reduction.

  5.

    The t-SNE method works on the principle that it is impossible to preserve all pairwise similarities and dissimilarities with the same level of accuracy in a low-dimensional embedding. Therefore, unlike dimensionality reduction methods or autoencoders that try to faithfully reconstruct the data, it uses a loss function that treats similarity and dissimilarity asymmetrically. This type of asymmetric loss function is particularly helpful for separating out different manifolds during visualization, which is why t-SNE might perform better than autoencoders at this task. A brief usage example appears after these notes.

  6.

    The work in [287] does point out a number of implicit relationships with matrix factorization, but not the more direct ones pointed out in this book. Some of these relationships are also pointed out in [6].

  7.

    There is a slight abuse of notation in the updates adding \(\overline{u}_{i}\) and \(\overline{v}_{j}\), because \(\overline{u}_{i}\) is a row vector and \(\overline{v}_{j}\) is a column vector. Throughout this section, we omit the explicit transposition of one of these two vectors to avoid notational clutter, since the updates are intuitively clear. A code sketch of these updates appears after these notes.

  8.

    This fact is not evident in the toy example of Figure 2.17. In practice, the degree of a node is a tiny fraction of the total number of nodes. For example, a person might have 100 friends in a social network of millions of nodes.

  9.

    The weighted degree of node j is \(\sum_{r} c_{rj}\).
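
The sketch below illustrates note 2 numerically: least-squares regression on mean-centered features, with signed targets given by the fraction of instances in the other class, produces a direction parallel to the Fisher discriminant. The two-class Gaussian data and class sizes are illustrative assumptions, not taken from the chapter.

```python
import numpy as np

# Sketch of note 2: least-squares regression with mean-centered features and
# signed targets recovers the Fisher discriminant direction. Data is illustrative.
rng = np.random.default_rng(1)
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=60)   # class 0
X1 = rng.multivariate_normal([2, 1], [[1.0, 0.3], [0.3, 1.0]], size=40)   # class 1
X = np.vstack([X0, X1])
n0, n1, n = len(X0), len(X1), len(X0) + len(X1)

# Fisher direction: S_W^{-1} (mu1 - mu0), with S_W the within-class scatter matrix.
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S_W = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
w_fisher = np.linalg.solve(S_W, mu1 - mu0)

# Least-squares direction: mean-centered features, zero-mean signed targets whose
# magnitudes are the fractions of instances in the *other* class (as in note 2).
y = np.concatenate([np.full(n0, -n1 / n), np.full(n1, n0 / n)])
Xc = X - X.mean(axis=0)
w_ls, *_ = np.linalg.lstsq(Xc, y, rcond=None)

# The two directions are parallel: cosine similarity should be 1 up to round-off.
cos = w_ls @ w_fisher / (np.linalg.norm(w_ls) * np.linalg.norm(w_fisher))
print(round(float(cos), 6))
```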
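
The next sketch illustrates note 3: a rank-k factorization learned by plain gradient descent, with no orthogonality constraint, spans the same subspace as the top-k right singular vectors, even though its basis vectors are not orthogonal. The data matrix and hyperparameters are illustrative assumptions.

```python
import numpy as np

# Sketch of note 3: an unconstrained factorization D ~ U V^T spans the same
# subspace as the top-k right singular vectors of D. Settings are illustrative.
rng = np.random.default_rng(2)
k = 3
D = rng.standard_normal((100, k)) @ rng.standard_normal((k, 20))
D += 0.01 * rng.standard_normal(D.shape)             # dominant rank-k structure plus noise

U = 0.01 * rng.standard_normal((100, k))
V = 0.01 * rng.standard_normal((20, k))
eta = 0.001
for _ in range(5000):
    E = D - U @ V.T                                  # residual matrix
    U, V = U + eta * E @ V, V + eta * E.T @ U        # unconstrained gradient steps

_, _, Vt = np.linalg.svd(D, full_matrices=False)
Q_svd = Vt[:k].T                                     # orthonormal top-k basis from SVD
Q_fac, _ = np.linalg.qr(V)                           # orthonormalized learned basis
# Compare the projection matrices onto the two subspaces; the norm should be ~0.
print(np.linalg.norm(Q_svd @ Q_svd.T - Q_fac @ Q_fac.T))
```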
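
For note 5, the snippet below shows a typical off-the-shelf t-SNE call using scikit-learn (listed as [87] in the bibliography) to produce a two-dimensional visualization. The dataset and parameter settings are illustrative assumptions, not prescribed by the chapter.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Sketch of note 5: t-SNE reduces the 64-dimensional digit images to 2 dimensions
# for visualization. Parameter choices are illustrative.
X, y = load_digits(return_X_y=True)
emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
print(emb.shape)   # (1797, 2); the embedding tends to separate the digit classes
```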
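
Finally, the sketch below spells out the kind of updates referenced in note 7: stochastic gradient-descent updates of the factor vectors \(\overline{u}_{i}\) and \(\overline{v}_{j}\) in a matrix factorization over observed ratings. The toy ratings matrix, regularization weight, and learning rate are illustrative assumptions.

```python
import numpy as np

# Sketch of the updates in note 7: SGD for a factorization R ~ U V^T restricted
# to observed entries, with L2 regularization. All settings are illustrative.
rng = np.random.default_rng(3)
R = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0],
              [0.0, 1.0, 5.0, 4.0]])              # 0 marks an unobserved rating
observed = np.argwhere(R > 0)
k, eta, lam = 2, 0.01, 0.02
U = 0.1 * rng.standard_normal((R.shape[0], k))    # user factors (rows u_i)
V = 0.1 * rng.standard_normal((R.shape[1], k))    # item factors (rows v_j)

for _ in range(2000):
    i, j = observed[rng.integers(len(observed))]  # sample one observed entry
    ui, vj = U[i].copy(), V[j].copy()
    e = R[i, j] - ui @ vj                         # prediction error on that entry
    U[i] += eta * (e * vj - lam * ui)             # update u_i
    V[j] += eta * (e * ui - lam * vj)             # update v_j

print(np.round(U @ V.T, 2))                       # reconstructed ratings
```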

Bibliography

  1. C. Aggarwal. Data mining: The textbook. Springer, 2015.

  2. C. Aggarwal. Recommender systems: The textbook. Springer, 2016.

  3. C. Aggarwal. Machine learning for text. Springer, 2018.

  4. E. Aljalbout, V. Golkov, Y. Siddiqui, and D. Cremers. Clustering with deep learning: Taxonomy and new methods. arXiv:1801.07648, 2018. https://arxiv.org/abs/1801.07648

  5. R. Al-Rfou, B. Perozzi, and S. Skiena. Polyglot: Distributed word representations for multilingual NLP. arXiv:1307.1662, 2013. https://arxiv.org/abs/1307.1662

  6. C. M. Bishop. Pattern recognition and machine learning. Springer, 2007.

  7. C. M. Bishop. Neural networks for pattern recognition. Oxford University Press, 1995.

  8. H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4), pp. 291–294, 1988.

  9. S. Chang, W. Han, J. Tang, G. Qi, C. Aggarwal, and T. Huang. Heterogeneous network embedding via deep architectures. ACM KDD Conference, pp. 119–128, 2015.

  10. J. Chen, S. Sathe, C. Aggarwal, and D. Turaga. Outlier detection with autoencoder ensembles. SIAM Conference on Data Mining, 2017.

  11. Y. Chen and M. Zaki. KATE: K-Competitive Autoencoder for Text. ACM KDD Conference, 2017.

  12. A. Coates, A. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. AAAI Conference, pp. 215–223, 2011.

  13. C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3), pp. 273–297, 1995.

  14. M. Denil, B. Shakibi, L. Dinh, M. A. Ranzato, and N. de Freitas. Predicting parameters in deep learning. NIPS Conference, pp. 2148–2156, 2013.

  15. F. Despagne and D. Massart. Neural networks in multivariate calibration. Analyst, 123(11), pp. 157R–178R, 1998.

  16. C. Ding, T. Li, and W. Peng. On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Computational Statistics and Data Analysis, 52(8), pp. 3913–3927, 2008.

  17. C. Doersch. Tutorial on variational autoencoders. arXiv:1606.05908, 2016. https://arxiv.org/abs/1606.05908

  18. A. Elkahky, Y. Song, and X. He. A multi-view deep learning approach for cross domain user modeling in recommendation systems. WWW Conference, pp. 278–288, 2015.

  19. R. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7: pp. 179–188, 1936.

  20. F. Girosi and T. Poggio. Networks and the best approximation property. Biological Cybernetics, 63(3), pp. 169–176, 1990.

  21. A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. ACM KDD Conference, pp. 855–864, 2016.

  22. M. Gutmann and A. Hyvarinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. AISTATS, 1(2), pp. 6, 2010.

  23. T. Hastie and R. Tibshirani. Generalized additive models. CRC Press, 1990.

  24. S. Hawkins, H. He, G. Williams, and R. Baxter. Outlier detection using replicator neural networks. International Conference on Data Warehousing and Knowledge Discovery, pp. 170–180, 2002.

  25. X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T. S. Chua. Neural collaborative filtering. WWW Conference, pp. 173–182, 2017.

  26. G. Hinton. Connectionist learning procedures. Artificial Intelligence, 40(1–3), pp. 185–234, 1989.

  27. G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313, (5766), pp. 504–507, 2006.

  28. T. Hofmann. Probabilistic latent semantic indexing. ACM SIGIR Conference, pp. 50–57, 1999.

  29. C. Johnson. Logistic matrix factorization for implicit feedback data. NIPS Conference, 2014.

  30. D. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv:1312.6114, 2013. https://arxiv.org/abs/1312.6114

  31. Y. Koren. Factor in the neighbors: Scalable and accurate collaborative filtering. ACM Transactions on Knowledge Discovery from Data (TKDD), 4(1), 1, 2010.

  32. Q. Le and T. Mikolov. Distributed representations of sentences and documents. ICML Conference, pp. 1188–1196, 2014.

  33. Q. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Ng, On optimization methods for deep learning. ICML Conference, pp. 265–272, 2011.

  34. Q. Le, W. Zou, S. Yeung, and A. Ng. Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. CVPR Conference, 2011.

  35. Y. LeCun. Modeles connexionnistes de l’apprentissage. Doctoral Dissertation, Universite Paris, 1987.

  36. H. Lee, C. Ekanadham, and A. Ng. Sparse deep belief net model for visual area V2. NIPS Conference, 2008.

  37. O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. NIPS Conference, pp. 2177–2185, 2014.

  38. O. Levy, Y. Goldberg, and I. Dagan. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3, pp. 211–225, 2015.

  39. D. Liben-Nowell, and J. Kleinberg. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7), pp. 1019–1031, 2007.

  40. L. Maaten and G. E. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9, pp. 2579–2605, 2008.

  41. A. Makhzani and B. Frey. K-sparse autoencoders. arXiv:1312.5663, 2013. https://arxiv.org/abs/1312.5663

  42. A. Makhzani and B. Frey. Winner-take-all autoencoders. NIPS Conference, pp. 2791–2799, 2015.

  43. C. Manning and R. Socher. CS224N: Natural language processing with deep learning. Stanford University School of Engineering, 2017. https://www.youtube.com/watch?v=OQQ-W_63UgQ

  44. P. McCullagh and J. Nelder. Generalized linear models. CRC Press, 1989.

  45. G. McLachlan. Discriminant analysis and statistical pattern recognition. John Wiley & Sons, 2004.

  46. T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv:1301.3781, 2013. https://arxiv.org/abs/1301.3781

  47. T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. NIPS Conference, pp. 3111–3119, 2013.

  48. G. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4), pp. 235–312, 1990. https://wordnet.princeton.edu/

  49. A. Mnih and G. Hinton. A scalable hierarchical distributed language model. NIPS Conference, pp. 1081–1088, 2009.

  50. A. Mnih and K. Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. NIPS Conference, pp. 2265–2273, 2013.

  51. A. Mnih and Y. Teh. A fast and simple algorithm for training neural probabilistic language models. arXiv:1206.6426, 2012. https://arxiv.org/abs/1206.6426

  52. F. Morin and Y. Bengio. Hierarchical Probabilistic Neural Network Language Model. AISTATS, pp. 246–252, 2005.

  53. A. Ng. Sparse autoencoder. CS294A Lecture notes, 2011. https://nlp.stanford.edu/~socherr/sparseAutoencoder_2011new.pdf https://web.stanford.edu/class/cs294a/sparseAutoencoder_2011new.pdf

  54. J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Ng. Multimodal deep learning. ICML Conference, pp. 689–696, 2011.

  55. J. Pennington, R. Socher, and C. Manning. Glove: Global Vectors for Word Representation. EMNLP, pp. 1532–1543, 2014.

  56. B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning of social representations. ACM KDD Conference, pp. 701–710, 2014.

  57. R. Rehurek and P. Sojka. Software framework for topic modelling with large corpora. LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50, 2010. https://radimrehurek.com/gensim/index.html

  58. S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. ICML Conference, pp. 833–840, 2011.

  59. D. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv:1401.4082, 2014. https://arxiv.org/abs/1401.4082

  60. R. Rifkin. Everything old is new again: a fresh look at historical approaches in machine learning. Ph.D. Thesis, Massachusetts Institute of Technology, 2002.

  61. R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5, pp. 101–141, 2004.

  62. X. Rong. word2vec parameter learning explained. arXiv:1411.2738, 2014. https://arxiv.org/abs/1411.2738

  63. F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386, 1958.

  64. D. Ruck, S. Rogers, and M. Kabrisky. Feature selection using a multilayer perceptron. Journal of Neural Network Computing, 2(2), pp. 40–88, 1990.

  65. D. Rumelhart, G. Hinton, and R. Williams. Learning representations by back-propagating errors. Nature, 323 (6088), pp. 533–536, 1986.

  66. R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering. ICML Conference, pp. 791–798, 2007.

  67. S. Sedhain, A. K. Menon, S. Sanner, and L. Xie. Autorec: Autoencoders meet collaborative filtering. WWW Conference, pp. 111–112, 2015.

  68. A. Shashua. On the equivalence between the support vector machine for classification and sparsified Fisher’s linear discriminant. Neural Processing Letters, 9(2), pp. 129–139, 1999.

  69. S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1), pp. 3–30, 2011.

  70. Y. Song, A. Elkahky, and X. He. Multi-rate deep learning for temporal recommendation. ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 909–912, 2016.

  71. J. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. arXiv:1412.6806, 2014. https://arxiv.org/abs/1412.6806

  72. N. Srivastava and R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. NIPS Conference, pp. 2222–2230, 2012.

  73. F. Strub and J. Mary. Collaborative filtering with stacked denoising autoencoders and sparse inputs. NIPS Workshop on Machine Learning for eCommerce, 2015.

  74. A. Tikhonov and V. Arsenin. Solution of ill-posed problems. Winston and Sons, 1977.

  75. P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol. Extracting and composing robust features with denoising autoencoders. ICML Conference, pp. 1096–1103, 2008.

  76. D. Wang, P. Cui, and W. Zhu. Structural deep network embedding. ACM KDD Conference, pp. 1225–1234, 2016.

  77. H. Wang, N. Wang, and D. Yeung. Collaborative deep learning for recommender systems. ACM KDD Conference, pp. 1235–1244, 2015.

  78. K. Weinberger, B. Packer, and L. Saul. Nonlinear Dimensionality Reduction by Semidefinite Programming and Kernel Matrix Factorization. AISTATS, 2005.

  79. J. Weston and C. Watkins. Multi-class support vector machines. Technical Report CSD-TR-98-04, Department of Computer Science, Royal Holloway, University of London, May, 1998.

  80. B. Widrow and M. Hoff. Adaptive switching circuits. IRE WESCON Convention Record, 4(1), pp. 96–104, 1960.

  81. Y. Wu, C. DuBois, A. Zheng, and M. Ester. Collaborative denoising auto-encoders for top-n recommender systems. Web Search and Data Mining, pp. 153–162, 2016.

  82. W. Yu, W. Cheng, C. Aggarwal, K. Zhang, H. Chen, and W. Wang. NetWalk: A flexible deep embedding approach for anomaly detection in dynamic networks. ACM KDD Conference, 2018.

  83. W. Yu, C. Zheng, W. Cheng, C. Aggarwal, D. Song, B. Zong, H. Chen, and W. Wang. Learning deep network representations with adversarially regularized autoencoders. ACM KDD Conference, 2018.

  84. D. Zhang, Z.-H. Zhou, and S. Chen. Non-negative matrix factorization on kernels. Trends in Artificial Intelligence, pp. 404–412, 2006.

  85. S. Zhang, L. Yao, and A. Sun. Deep learning based recommender system: A survey and new perspectives. arXiv:1707.07435, 2017. https://arxiv.org/abs/1707.07435

  86. C. Zhou and R. Paffenroth. Anomaly detection with robust deep autoencoders. ACM KDD Conference, pp. 665–674, 2017.

  87. http://scikit-learn.org/

  88. http://clic.cimec.unitn.it/composes/toolkit/

  89. https://github.com/stanfordnlp/GloVe

  90. https://deeplearning4j.org/

  91. https://code.google.com/archive/p/word2vec/

  92. https://www.tensorflow.org/tutorials/word2vec/

  93. https://github.com/aditya-grover/node2vec

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter

Cite this chapter

Aggarwal, C.C. (2018). Machine Learning with Shallow Neural Networks. In: Neural Networks and Deep Learning. Springer, Cham. https://doi.org/10.1007/978-3-319-94463-0_2

  • DOI: https://doi.org/10.1007/978-3-319-94463-0_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-94462-3

  • Online ISBN: 978-3-319-94463-0

  • eBook Packages: Computer Science, Computer Science (R0)