Abstract
This chapter provides a more in-depth description of supervised learning, deep learning, and neural networks—presenting the foundational mathematical and statistical learning concepts and explaining how they relate to real-world examples in trading, risk management, and investment management. These applications present challenges for forecasting and model design and appear as a recurring theme throughout the book. This chapter moves toward a more engineering-style exposition of neural networks, applying concepts from the previous chapters to elucidate various model design choices.
Notes
- 1.
Note that there is a potential degeneracy in this case: there may exist “flat directions”—hyper-surfaces in the parameter space along which the loss function is exactly the same.
- 2.
There is some redundancy in the construction of the network and around 50 units are needed.
- 3.
The parameterized softplus function \(\sigma (x;t)=\frac {1}{t}\ln (1+\exp \{tx\})\), with a model parameter t ≫ 1, converges to the ReLU function in the limit t →∞.
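This convergence is easy to check numerically. The following is a minimal NumPy sketch (function names are illustrative, not from the text); the maximum gap between the parameterized softplus and the ReLU occurs at x = 0, where it equals ln(2)∕t.

```python
import numpy as np

def softplus_t(x, t):
    """Parameterized softplus: (1/t) * ln(1 + exp(t*x))."""
    # logaddexp(0, t*x) computes ln(1 + exp(t*x)) stably for large t*x
    return np.logaddexp(0.0, t * x) / t

def relu(x):
    return np.maximum(x, 0.0)

x = np.linspace(-2.0, 2.0, 401)
for t in (1.0, 10.0, 100.0):
    gap = np.max(np.abs(softplus_t(x, t) - relu(x)))
    print(f"t = {t:6.1f}: max |softplus - ReLU| = {gap:.4f}")
```

The printed gap shrinks like ln(2)∕t as t grows, illustrating the limit t →∞.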
References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., et al. (2016). TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI’16 (pp. 265–283).
Adams, R., Wallach, H., & Ghahramani, Z. (2010). Learning the structure of deep sparse graphical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (pp. 1–8).
Andrews, D. (1989). A unified theory of estimation and inference for nonlinear dynamic models (A. R. Gallant and H. White). Econometric Theory, 5(1), 166–171.
Baillie, R. T., & Kapetanios, G. (2007). Testing for neglected nonlinearity in long-memory models. Journal of Business & Economic Statistics,25(4), 447–461.
Barber, D., & Bishop, C. M. (1998). Ensemble learning in Bayesian neural networks. Neural Networks and Machine Learning,168, 215–238.
Bartlett, P., Harvey, N., Liaw, C., & Mehrabian, A. (2017a). Nearly-tight VC-dimension bounds for piecewise linear neural networks. CoRR,abs/1703.02930.
Bartlett, P., Harvey, N., Liaw, C., & Mehrabian, A. (2017b). Nearly-tight VC-dimension bounds for piecewise linear neural networks. CoRR,abs/1703.02930.
Bengio, Y., Roux, N. L., Vincent, P., Delalleau, O., & Marcotte, P. (2006). Convex neural networks. In Y. Weiss, B. Schölkopf, & J. C. Platt (Eds.), Advances in neural information processing systems 18 (pp. 123–130). MIT Press.
Bishop, C. M. (2006). Pattern recognition and machine learning (information science and statistics). Berlin, Heidelberg: Springer-Verlag.
Blundell, C., Cornebise, J., Kavukcuoglu, K., & Wierstra, D. (2015a, May). Weight uncertainty in neural networks. arXiv:1505.05424 [cs, stat].
Blundell, C., Cornebise, J., Kavukcuoglu, K., & Wierstra, D. (2015b). Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424.
Chataigner, M., Crépey, S., & Dixon, M. F. (2020). Deep local volatility.
Chen, J., Flood, M. D., & Sowers, R. B. (2017). Measuring the unmeasurable: an application of uncertainty quantification to treasury bond portfolios. Quantitative Finance,17(10), 1491–1507.
Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., et al. (2012). Large scale distributed deep networks. In Advances in neural information processing systems (pp. 1223–1231).
Dixon, M., Klabjan, D., & Bang, J. H. (2016). Classification-based financial markets prediction using deep neural networks. CoRR,abs/1603.08604.
Feng, G., He, J., & Polson, N. G. (2018, Apr). Deep learning for predicting asset returns. arXiv e-prints, arXiv:1804.09314.
Frey, B. J., & Hinton, G. E. (1999). Variational learning in nonlinear Gaussian belief networks. Neural Computation,11(1), 193–213.
Gal, Y. (2015). A theoretically grounded application of dropout in recurrent neural networks. arXiv:1512.05287.
Gal, Y. (2016). Uncertainty in deep learning. Ph.D. thesis, University of Cambridge.
Gal, Y., & Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In international Conference on Machine Learning (pp. 1050–1059).
Gallant, A., & White, H. (1988, July). There exists a neural network that does not make avoidable mistakes. In IEEE 1988 International Conference on Neural Networks (Vol. 1, pp. 657–664).
Graves, A. (2011). Practical variational inference for neural networks. In Advances in Neural Information Processing Systems (pp. 2348–2356).
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: data mining, inference and prediction. Springer.
Heaton, J. B., Polson, N. G., & Witte, J. H. (2017). Deep learning for finance: deep portfolios. Applied Stochastic Models in Business and Industry,33(1), 3–12.
Hernández-Lobato, J. M., & Adams, R. (2015). Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning (pp. 1861–1869).
Hinton, G. E., & Sejnowski, T. J. (1983). Optimal perceptual inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 448–453). IEEE New York.
Hinton, G. E., & Van Camp, D. (1993). Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory (pp. 5–13). ACM.
Hornik, K., Stinchcombe, M., & White, H. (1989, July). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
Horvath, B., Muguruza, A., & Tomas, M. (2019, Jan). Deep learning volatility. arXiv e-prints, arXiv:1901.09647.
Hutchinson, J. M., Lo, A. W., & Poggio, T. (1994). A nonparametric approach to pricing and hedging derivative securities via learning networks. The Journal of Finance,49(3), 851–889.
Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Kuan, C.-M., & White, H. (1994). Artificial neural networks: an econometric perspective. Econometric Reviews,13(1), 1–91.
Lawrence, N. (2005). Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research,6(Nov), 1783–1816.
Liang, S., & Srikant, R. (2016). Why deep neural networks for function approximation? CoRR, abs/1610.04161.
Lo, A. (1994). Neural networks and other nonparametric techniques in economics and finance. In AIMR Conference Proceedings, Number 9.
MacKay, D. J. (1992a). A practical Bayesian framework for backpropagation networks. Neural Computation,4(3), 448–472.
MacKay, D. J. C. (1992b, May). A practical Bayesian framework for backpropagation networks. Neural Computation,4(3), 448–472.
Martin, C. H., & Mahoney, M. W. (2018). Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. CoRR, abs/1810.01075.
Mhaskar, H., Liao, Q., & Poggio, T. A. (2016). Learning real and Boolean functions: When is deep better than shallow. CoRR, abs/1603.00988.
Mnih, A., & Gregor, K. (2014). Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030.
Montúfar, G., Pascanu, R., Cho, K., & Bengio, Y. (2014, Feb). On the number of linear regions of deep neural networks. arXiv e-prints, arXiv:1402.1869.
Mullainathan, S., & Spiess, J. (2017). Machine learning: An applied econometric approach. Journal of Economic Perspectives,31(2), 87–106.
Neal, R. M. (1990). Learning stochastic feedforward networks, Vol. 64. Technical report, Department of Computer Science, University of Toronto.
Neal, R. M. (1992). Bayesian training of backpropagation networks by the hybrid Monte Carlo method. Technical report, CRG-TR-92-1, Dept. of Computer Science, University of Toronto.
Neal, R. M. (2012). Bayesian learning for neural networks, Vol. 118. Springer Science & Business Media.
Nesterov, Y. (2013). Introductory lectures on convex optimization: A basic course, Vol. 87. Springer Science & Business Media.
Poggio, T. (2016). Deep learning: mathematics and neuroscience. A sponsored supplement to Science: Brain-inspired intelligent robotics: The intersection of robotics and neuroscience (pp. 9–12).
Polson, N., & Rockova, V. (2018, Mar). Posterior concentration for sparse deep learning. arXiv e-prints, arXiv:1803.09138.
Polson, N. G., Willard, B. T., & Heidari, M. (2015). A statistical theory of deep learning via proximal splitting. arXiv:1509.06061.
Racine, J. (2001). On the nonlinear predictability of stock returns using financial and economic variables. Journal of Business & Economic Statistics,19(3), 380–382.
Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.
Ruiz, F. R., Aueb, M. T. R., & Blei, D. (2016). The generalized reparameterization gradient. In Advances in Neural Information Processing Systems (pp. 460–468).
Salakhutdinov, R. (2008). Learning and evaluating Boltzmann machines. Technical Report UTML TR 2008-002, Department of Computer Science, University of Toronto.
Salakhutdinov, R., & Hinton, G. (2009). Deep Boltzmann machines. In Artificial Intelligence and Statistics (pp. 448–455).
Saul, L. K., Jaakkola, T., & Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research,4, 61–76.
Sirignano, J., Sadhwani, A., & Giesecke, K. (2016, July). Deep learning for mortgage risk. ArXiv e-prints.
Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 194–281). Cambridge, MA, USA: MIT Press.
Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research,15(1), 1929–1958.
Swanson, N. R., & White, H. (1995). A model-selection approach to assessing the information in the term structure using linear models and artificial neural networks. Journal of Business & Economic Statistics,13(3), 265–275.
Telgarsky, M. (2016). Benefits of depth in neural networks. CoRR, abs/1602.04485.
Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning (pp. 1064–1071). ACM.
Tishby, N., & Zaslavsky, N. (2015). Deep learning and the information bottleneck principle. CoRR, abs/1503.02406.
Tran, D., Hoffman, M. D., Saurous, R. A., Brevdo, E., Murphy, K., & Blei, D. M. (2017, January). Deep probabilistic programming. arXiv:1701.03757 [cs, stat].
Vapnik, V. N. (1998). Statistical learning theory. Wiley-Interscience.
Welling, M., Rosen-Zvi, M., & Hinton, G. E. (2005). Exponential family harmoniums with an application to information retrieval. In Advances in Neural Information Processing Systems (pp. 1481–1488).
Williams, C. K. (1997). Computing with infinite networks. In Advances in Neural Information Processing systems (pp. 295–301).
Appendix
1.1 Answers to Multiple Choice Questions
Question 1
Answer: 1, 2, 3, 4. All answers are found in the text.
Question 2
Answer: 1, 2. A feedforward architecture is always convex w.r.t. each input variable if every activation function is convex and the weights are constrained to be either all positive or all negative. Simply using convex activation functions is not sufficient, since convexity is not, in general, preserved under composition: composing a convex function with an affine transformation of another convex function need not yield a convex function. For example, if σ(x) = x², w = −1, and b = 1, then σ(wσ(x) + b) = (−x² + 1)² is not convex in x.
A feedforward architecture with positive weights is a monotonically increasing function of the input for any choice of monotonically increasing activation function.
The weights of a feedforward architecture need not be constrained for the output of a feedforward network to be bounded. For example, activating the output with a softmax function will bound the output. Only if the output is not activated, should the weights and bias in the final layer be bounded to ensure bounded output.
The bias terms in a network shift the output but also affect the derivatives of the output w.r.t. the input when the layer is activated.
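The counterexample for Question 2 can be checked numerically. The sketch below (the helper names are illustrative, not from the text) verifies that h(x) = σ(wσ(x) + b), with σ(x) = x², w = −1, b = 1, violates the midpoint-convexity inequality h((a + c)∕2) ≤ (h(a) + h(c))∕2.

```python
# h(x) = sigma(w * sigma(x) + b) with sigma(x) = x**2, w = -1, b = 1,
# i.e. h(x) = (1 - x**2)**2, is not convex: its chord from x = -1 to
# x = 1 lies strictly below the function at the midpoint x = 0.

def sigma(x):
    return x ** 2

def h(x, w=-1.0, b=1.0):
    return sigma(w * sigma(x) + b)

a, c = -1.0, 1.0
mid = h((a + c) / 2)       # h(0) = (1 - 0)^2 = 1
chord = (h(a) + h(c)) / 2  # (0 + 0)/2 = 0
print(f"h(midpoint) = {mid}, chord value = {chord}")
assert mid > chord  # convexity would require mid <= chord
```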
Question 3
Answer: 1, 2, 3, 4. The training of a neural network involves minimizing a loss function w.r.t. the weights and biases over the training data. L1 regularization is used during model selection to penalize models with too many parameters: the loss function is augmented with a penalty term on the L1 norm of the weights. In deep learning, regularization can be applied to each layer of the network, so each layer has an associated regularization parameter. Back-propagation uses the chain rule to update the weights of the network but is not guaranteed to converge to a unique minimum, because the loss function is not convex w.r.t. the weights. Stochastic gradient descent is a type of optimization method which is implemented with back-propagation. There are variants of SGD, however, such as adding Nesterov’s momentum term, ADAM, or RMSProp.
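To make the L1-regularized SGD answer concrete, here is a minimal NumPy sketch (not from the text; the data, penalty strength λ, and learning rate are all illustrative) that minimizes a per-sample least-squares loss plus an L1 penalty using the penalty's subgradient:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = X @ w_true + noise, with a sparse w_true
n, d = 200, 5
w_true = np.array([2.0, 0.0, 0.0, -3.0, 0.0])
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

lam, lr = 0.1, 0.01  # L1 strength and learning rate (illustrative)
w = np.zeros(d)
for _ in range(200):
    for i in rng.permutation(n):  # stochastic: one sample at a time
        resid = X[i] @ w - y[i]
        # subgradient of 0.5*resid**2 + lam*||w||_1
        grad = resid * X[i] + lam * np.sign(w)
        w -= lr * grad

print(np.round(w, 2))  # the zero entries of w_true are shrunk toward zero
```

Note the characteristic effect of the L1 penalty: the recovered nonzero weights are shrunk slightly toward zero (by roughly λ), while the irrelevant coordinates stay near zero.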
1.2 Back-Propagation
Let us consider a feedforward architecture with an input layer, L − 1 hidden layers, and one output layer, with K units in the output layer for classification of K categories. As a result, we have L sets of weights and biases (W (ℓ), b (ℓ)) for ℓ = 1, …, L, corresponding to the layer inputs Z (ℓ−1) and outputs Z (ℓ) for ℓ = 1, …, L. Recall that each layer is an activation of a semi-affine transformation, I (ℓ)(Z (ℓ−1)) := W (ℓ)Z (ℓ−1) + b (ℓ). The corresponding activation functions are denoted as σ (ℓ). The activation function for the output layer is a softmax function, σ s(x).
Here we use the cross-entropy as the loss function, which is defined as

\(\mathcal {L}(Y,\hat {Y})=-\sum _{k=1}^{K}Y_{k}\ln \hat {Y}_{k},\)

where Y is the one-hot encoded label and \(\hat {Y}=Z^{(L)}\) is the network output.
The relationship between the layers, for ℓ ∈{1, …, L}, is

\(Z^{(\ell )}=\sigma ^{(\ell )}\left (I^{(\ell )}(Z^{(\ell -1)})\right ),\qquad Z^{(0)}=X.\)
The update rules for the weights and biases are

\(W^{(\ell )}\rightarrow W^{(\ell )}-\gamma \nabla _{W^{(\ell )}}\mathcal {L},\qquad b^{(\ell )}\rightarrow b^{(\ell )}-\gamma \nabla _{b^{(\ell )}}\mathcal {L},\)

for a learning rate γ > 0.
We now begin the back-propagation, tracking the intermediate calculations carefully using Einstein summation notation.
For the gradient of \(\mathcal {L}\) w.r.t. W (L) we have, by the chain rule,

\(\frac {\partial \mathcal {L}}{\partial W_{ij}^{(L)}}=\frac {\partial \mathcal {L}}{\partial I_{i}^{(L)}}\frac {\partial I_{i}^{(L)}}{\partial W_{ij}^{(L)}}=\frac {\partial \mathcal {L}}{\partial I_{i}^{(L)}}Z_{j}^{(L-1)}.\)

But

\(\frac {\partial \mathcal {L}}{\partial I_{i}^{(L)}}=-\sum _{k=1}^{K}Y_{k}\frac {\partial \ln \hat {Y}_{k}}{\partial I_{i}^{(L)}}=-\sum _{k=1}^{K}Y_{k}\left (\delta _{ki}-\hat {Y}_{i}\right )=\hat {Y}_{i}-Y_{i},\)

where we have used the fact that \(\sum _{k=1}^{K}Y_{k}=1\) in the last equality. Similarly for b (L), we have

\(\nabla _{b^{(L)}}\mathcal {L}=\hat {Y}-Y.\)

It follows that

\(\nabla _{W^{(L)}}\mathcal {L}=(\hat {Y}-Y)\otimes Z^{(L-1)},\)

where ⊗ denotes the outer product.
Now for the gradient of \(\mathcal {L}\) w.r.t. W (L−1) we have

\(\frac {\partial \mathcal {L}}{\partial W_{ij}^{(L-1)}}=\frac {\partial \mathcal {L}}{\partial I_{k}^{(L)}}\frac {\partial I_{k}^{(L)}}{\partial Z_{i}^{(L-1)}}\frac {\partial Z_{i}^{(L-1)}}{\partial W_{ij}^{(L-1)}}=(\hat {Y}_{k}-Y_{k})W_{ki}^{(L)}\sigma ^{(L-1)\prime }(I_{i}^{(L-1)})Z_{j}^{(L-2)},\)

with summation over the repeated index k implied. If we assume that σ (ℓ)(x) = sigmoid(x), ℓ ∈{1, …, L − 1}, then

\(\sigma ^{\prime }(x)=\sigma (x)(1-\sigma (x)),\qquad \text {so that}\qquad \sigma ^{(L-1)\prime }(I_{i}^{(L-1)})=Z_{i}^{(L-1)}(1-Z_{i}^{(L-1)}).\)

Similarly we have

\(\frac {\partial \mathcal {L}}{\partial b_{i}^{(L-1)}}=(\hat {Y}_{k}-Y_{k})W_{ki}^{(L)}Z_{i}^{(L-1)}(1-Z_{i}^{(L-1)}).\)
It follows that we can define the following recursion relation for the loss gradient:

\(\delta ^{(L)}=\hat {Y}-Y,\qquad \delta ^{(L-1)}=\left ((W^{(L)})^{T}\delta ^{(L)}\right )\circ Z^{(L-1)}\circ (1-Z^{(L-1)}),\)

where ∘ denotes the Hadamard product (element-wise multiplication). This recursion relation can be generalized to all layers and any choice of activation function. To see this, let the back-propagation error be \(\delta ^{(\ell )}:=\nabla _{b^{(\ell )}}\mathcal {L}\), and since

\(\delta _{i}^{(\ell )}=\delta _{j}^{(\ell +1)}W_{ji}^{(\ell +1)}\sigma ^{(\ell )\prime }(I_{i}^{(\ell )}),\)

or equivalently in matrix–vector form

\(\delta ^{(\ell )}=\left ((W^{(\ell +1)})^{T}\delta ^{(\ell +1)}\right )\circ \sigma ^{(\ell )\prime }(I^{(\ell )}),\)

we can write, in general, for any choice of activation function for the hidden layer,

\(\nabla _{W^{(\ell )}}\mathcal {L}=\delta ^{(\ell )}\otimes Z^{(\ell -1)}\)

and

\(\nabla _{b^{(\ell )}}\mathcal {L}=\delta ^{(\ell )}.\)
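The back-propagation recursion can be validated on a toy network. The following NumPy sketch (dimensions, seed, and variable names are arbitrary, not from the text) computes the gradients of a two-layer sigmoid/softmax network with cross-entropy loss via the δ recursion and compares one entry against a finite-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

# Tiny network: input dim 4 -> hidden 3 (sigmoid) -> output 2 (softmax)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)
X = rng.normal(size=4)
Y = np.array([1.0, 0.0])  # one-hot label

def forward(W1, b1, W2, b2):
    Z1 = sigmoid(W1 @ X + b1)
    Yhat = softmax(W2 @ Z1 + b2)
    return Z1, Yhat

def loss(W1, b1, W2, b2):
    _, Yhat = forward(W1, b1, W2, b2)
    return -np.sum(Y * np.log(Yhat))  # cross-entropy

# Back-propagation via the recursion: delta_L = Yhat - Y and
# delta_l = (W_{l+1}^T delta_{l+1}) o sigma'(I_l)
Z1, Yhat = forward(W1, b1, W2, b2)
delta2 = Yhat - Y                         # output-layer error
dW2 = np.outer(delta2, Z1)                # grad wrt W2; db2 = delta2
delta1 = (W2.T @ delta2) * Z1 * (1 - Z1)  # sigmoid': Z1 * (1 - Z1)
dW1 = np.outer(delta1, X)

# Finite-difference check on one entry of W1
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
num = (loss(W1p, b1, W2, b2) - loss(W1, b1, W2, b2)) / eps
print(f"analytic {dW1[0, 0]:.6f} vs numerical {num:.6f}")
```

The analytic and finite-difference gradients agree to several decimal places, confirming the recursion.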
1.3 Proof of Theorem 4.2
Using the same deep structure shown in Fig. 4.9, Liang and Srikant (2016) find the binary expansion sequence {x 0, …, x n}. In this step, they use n binary step units in total. Then they rewrite \(g_{m+1}(\sum _{i=0}^{n}\frac {x_{i}}{2^{i}})\),
Clearly Eq. 4.57 defines iterations between the outputs of neighboring layers. Define the output of the multilayer neural network as \(\hat {f}(x)=\sum _{i=0}^{p}a_{i}g_{i}\left (\sum _{j=0}^{n}\frac {x_{j}}{2^{j}}\right ).\) For this multilayer network, the approximation error is
This indicates that, to achieve an ε-approximation error, one should choose \(n=\left \lceil \log \frac {p}{\varepsilon }\right \rceil +1\). Moreover, since \(\mathcal {O}(n+p)\) layers with \(\mathcal {O}(n)\) binary step units and \(\mathcal {O}(pn)\) ReLU units are used in total, this multilayer neural network thus has \(\mathcal {O}\left (p+\log \frac {p}{\varepsilon }\right )\) layers, \(\mathcal {O}\left (\log \frac {p}{\varepsilon }\right )\) binary step units, and \(\mathcal {O}\left (p\log \frac {p}{\varepsilon }\right )\) ReLU units.
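As a quick illustration of this bound (assuming, consistent with the binary-expansion argument, that the logarithm is base 2), the required n grows only logarithmically as the tolerance ε shrinks:

```python
import math

def required_n(p, eps):
    # n = ceil(log2(p / eps)) + 1, per the bound in the text;
    # the base-2 log is an assumption matching the binary expansion
    return math.ceil(math.log2(p / eps)) + 1

for eps in (0.1, 0.01, 0.001):
    print(f"p = 10, eps = {eps}: n = {required_n(10, eps)}")
```

Shrinking ε by a factor of 10 increases n by only three or four units.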
1.4 Proof of Lemmas from Telgarsky (2016)
Proof (Proof of 4.1)
Let \(\mathcal {I}_{f}\) denote the partition of \(\mathbb {R}\) corresponding to f, and \(\mathcal {I}_{g}\) denote the partition of \(\mathbb {R}\) corresponding to g.
First consider f + g, and moreover any intervals \(U_{f}\in \mathcal {I}_{f}\) and \(U_{g}\in \mathcal {I}_{g}\). Necessarily, f + g has a single slope along U f ∩ U g. Consequently, f + g is \(|\mathcal {I}|\)-sawtooth, where \(\mathcal {I}\) is the set of all intersections of intervals from \(\mathcal {I}_{f}\) and \(\mathcal {I}_{g}\), meaning \(\mathcal {I}:=\{U_{f}\cap U_{g}:U_{f}\in \mathcal {I}_{f},U_{g}\in \mathcal {I}_{g}\}\). By sorting the left endpoints of elements of \(\mathcal {I}_{f}\) and \(\mathcal {I}_{g}\), it follows that \(|\mathcal {I}|\leq k+l\) (the other intersections are empty).
For example, consider the example in Fig. 4.11 with partitions given in Table 4.2. The set of all intersections of intervals from \(\mathcal {I}_{f}\) and \(\mathcal {I}_{g}\) contains 3 elements.
Now consider f ∘ g, and in particular consider the image f(g(U g)) for some interval \(U_{g}\in \mathcal {I}_{g}\). g is affine with a single slope along U g; therefore, f is being considered along a single unbroken interval g(U g). However, nothing prevents g(U g) from hitting all the elements of \(\mathcal {I}_{f}\); since U g was arbitrary, it holds that f ∘ g is \((|\mathcal {I}_{f}|\cdot |\mathcal {I}_{g}|)\)-sawtooth. □
Proof
Recall the notation \(\tilde f(x) := [f(x) \geq 1/2]\), whereby \(\mathcal {E}(f) := \frac {1}{n}\sum _i [y_i\neq \tilde f(x_i)]\). Since f is piecewise monotonic with a corresponding partition of \(\mathbb {R}\) having at most t pieces, f has at most 2t − 1 crossings of 1∕2: at most one within each interval of the partition, and at most one at the right endpoint of all but the last interval. Consequently, \(\tilde f\) is piecewise constant, where the corresponding partition of \(\mathbb {R}\) is into at most 2t intervals. This means n points with alternating labels must land in 2t buckets, thus the total number of points landing in buckets with at least three points is at least n − 4t. □
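The composition bound in the first lemma can be illustrated numerically with the tent (triangle) map, a 2-sawtooth function: each self-composition multiplies the number of affine pieces. The sketch below (the helper names are illustrative, not from Telgarsky's paper) counts pieces by detecting slope changes on a fine grid:

```python
import numpy as np

def tent(x):
    """Tent (triangle) map: a 2-sawtooth function on [0, 1]."""
    return np.where(x < 0.5, 2.0 * x, 2.0 - 2.0 * x)

def count_pieces(f, lo=0.0, hi=1.0, n=20001):
    """Count the affine pieces of f on [lo, hi] by detecting slope changes."""
    x = np.linspace(lo, hi, n)
    slopes = np.diff(f(x)) / np.diff(x)
    jumps = np.flatnonzero(np.abs(np.diff(slopes)) > 1e-3)
    if jumps.size == 0:
        return 1
    # adjacent flagged positions belong to the same breakpoint
    return 2 + int(np.sum(np.diff(jumps) > 1))

f1 = tent
f2 = lambda x: tent(tent(x))        # at most 2 * 2 = 4 pieces
f3 = lambda x: tent(tent(tent(x)))  # at most 2 * 4 = 8 pieces
for name, f in [("tent", f1), ("tent o tent", f2), ("tent o tent o tent", f3)]:
    print(name, count_pieces(f))
```

The counts double with each composition, saturating the \(|\mathcal {I}_{f}|\cdot |\mathcal {I}_{g}|\) bound.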
1.5 Python Notebooks
The notebooks provided in the accompanying source code repository are designed to give insight into classification on toy datasets. They provide examples of deep feedforward classification, back-propagation, and Bayesian network classifiers. Further details of the notebooks are included in the README.md file.
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Dixon, M.F., Halperin, I., Bilokon, P. (2020). Feedforward Neural Networks. In: Machine Learning in Finance. Springer, Cham. https://doi.org/10.1007/978-3-030-41068-1_4
DOI: https://doi.org/10.1007/978-3-030-41068-1_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-41067-4
Online ISBN: 978-3-030-41068-1
eBook Packages: Mathematics and Statistics (R0)