Abstract
Conventional machine learning often learns parameterized models with gradient-based optimization methods. Examples of such models include linear regression, support vector machines, logistic regression, dimensionality reduction, and matrix factorization. Neural networks are also parameterized models that are learned with continuous optimization methods.
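As a minimal sketch of this idea (synthetic data and invented variable names, not code from the chapter), the snippet below learns one such parameterized model, logistic regression, by plain gradient descent on the log-loss:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)  # synthetic labels

w = np.zeros(3)   # model parameters
lr = 0.1          # learning rate (step size)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # sigmoid predictions
    grad = X.T @ (p - y) / len(y)        # gradient of the average log-loss
    w -= lr * grad                       # gradient-descent step

print(np.round(w, 2))  # approximately recovers the separating direction, up to scale
```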
Notes
- 1.
In recent years, the sigmoid unit has fallen out of favor relative to the ReLU.
- 2.
In order to obtain exactly the same direction as the Fisher method with Equation 2.8, it is important to mean-center both the feature variables and the binary targets. Each binary target then takes one of two real values with different signs, where the magnitude of each value is the fraction of instances belonging to the other class. Alternatively, one can use a bias neuron to absorb the constant offsets (see the first sketch following these notes).
- 3.
This subspace is defined by the top-k singular vectors of the singular value decomposition. However, the optimization problem does not impose orthogonality constraints, and therefore the columns of V might represent this subspace using a different, non-orthogonal basis (see the second sketch following these notes).
- 4.
In several special cases, such as the single-layer case discussed here, there is no loss in reconstruction accuracy even on the training data. In other cases, accuracy is lost only on the training data, and the autoencoder tends to reconstruct out-of-sample data better because of the regularization effect of its reduced parameter footprint.
- 5.
The t-SNE method works on the principle that it is impossible to preserve all pairwise similarities and dissimilarities with the same level of accuracy in a low-dimensional embedding. Therefore, unlike dimensionality reduction methods or autoencoders, which try to faithfully reconstruct the data, it uses a loss function that treats similarity and dissimilarity asymmetrically. This asymmetry is particularly helpful for separating out different manifolds during visualization, which is why t-SNE might perform better than autoencoders at visualization (see the third sketch following these notes).
- 6.
- 7.
There is a slight abuse of notation in updates that add \(\overline{u}_{i}\) and \(\overline{v}_{j}\), because \(\overline{u}_{i}\) is a row vector and \(\overline{v}_{j}\) is a column vector. Throughout this section, we omit the explicit transposition of one of the two vectors to avoid notational clutter, since the updates are intuitively clear (see the fourth sketch following these notes).
- 8.
This fact is not evident in the toy example of Figure 2.17. In practice, the degree of a node is a tiny fraction of the total number of nodes. For example, a person might have 100 friends in a social network of millions of nodes.
- 9.
The weighted degree of node \(j\) is \(\sum_{r} c_{rj}\).
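To make note 2 concrete, here is a minimal numpy sketch of the equivalence (synthetic data and invented variable names, not the chapter's code): least-squares regression on mean-centered features and mean-centered binary targets recovers the Fisher discriminant direction up to scale.

```python
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal(loc=0.0, size=(60, 5))   # class 0
X1 = rng.normal(loc=1.0, size=(40, 5))   # class 1
X = np.vstack([X0, X1])
y = np.array([0.0] * 60 + [1.0] * 40)

# Mean-center features and binary targets; each centered target is one of
# two signed values whose magnitude is the fraction of the *other* class
Xc = X - X.mean(axis=0)
yc = y - y.mean()                        # +0.6 for class 1, -0.4 for class 0

# Direction from least-squares regression on the centered data
w_ls, *_ = np.linalg.lstsq(Xc, yc, rcond=None)

# Fisher direction: S_W^{-1} (mu1 - mu0), with within-class scatter S_W
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S_W = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
w_fisher = np.linalg.solve(S_W, mu1 - mu0)

# The two directions agree up to scale (cosine similarity ~ 1)
cos = w_ls @ w_fisher / (np.linalg.norm(w_ls) * np.linalg.norm(w_fisher))
print(f"cosine similarity: {cos:.6f}")
```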
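Note 3 can be checked numerically as well. The sketch below (again on synthetic data) shows that replacing the orthonormal top-k singular vectors with any invertible recombination of them leaves the reconstruction error unchanged, because both matrices span the same subspace.

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.normal(size=(50, 10))

k = 3
_, _, Vt = np.linalg.svd(D, full_matrices=False)
Vk = Vt[:k].T                            # (10, k), orthonormal columns

# An invertible recombination: same column space, non-orthogonal columns
P = rng.normal(size=(k, k))
V = Vk @ P

def reconstruction_error(D, V):
    # Squared error of projecting the rows of D onto the column space of V
    projection = D @ V @ np.linalg.pinv(V)
    return np.linalg.norm(D - projection) ** 2

print(reconstruction_error(D, Vk))       # optimal rank-k error
print(reconstruction_error(D, V))        # identical, up to round-off
```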
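For note 5, the following scikit-learn call is one way to produce such a visualization; the hyperparameter `perplexity=30` is an illustrative default rather than a setting from the chapter.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# The different digit manifolds separate into distinct clusters
plt.scatter(embedding[:, 0], embedding[:, 1], c=y, s=5, cmap="tab10")
plt.title("t-SNE embedding of the digits data")
plt.show()
```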
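Finally, the notational issue in note 7 disappears in code, where both factor matrices are naturally stored row-wise. Below is a hedged sketch of one stochastic gradient-descent step for a factorization \(R \approx UV^{T}\); the learning rate and regularization constant are illustrative choices, not the chapter's.

```python
import numpy as np

def sgd_step(U, V, i, j, r_ij, lr=0.01, lam=0.02):
    """One SGD step for R ~ U @ V.T on the observed entry (i, j)."""
    e = r_ij - U[i] @ V[j]                 # prediction error on entry (i, j)
    u_old = U[i].copy()                    # keep the pre-update row factor
    U[i] += lr * (e * V[j] - lam * U[i])   # update row i of U
    V[j] += lr * (e * u_old - lam * V[j])  # update row j of V
    return e

# Example usage on randomly initialized factors
rng = np.random.default_rng(2)
U, V = rng.normal(size=(100, 8)), rng.normal(size=(200, 8))
print(sgd_step(U, V, i=3, j=17, r_ij=4.0))
```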