Feedforward Neural Networks

Machine Learning in Finance

Abstract

This chapter provides a more in-depth description of supervised learning, deep learning, and neural networks, presenting the foundational mathematical and statistical learning concepts and explaining how they relate to real-world examples in trading, risk management, and investment management. These applications present challenges for forecasting and model design and appear as a recurring theme throughout the book. The chapter moves towards a more engineering-style exposition of neural networks, applying concepts from the previous chapters to elucidate various model design choices.


Notes

  1.

    Note that there is a potential degeneracy in this case: there may exist “flat directions”, hyper-surfaces in the parameter space along which the loss function takes exactly the same value.

  2.

    There is some redundancy in the construction of the network and around 50 units are needed.

  3.

    The parameterized softplus function \(\sigma (x;t)=\frac {1}{t}\ln (1+\exp \{tx\})\), with a model parameter \(t \gg 1\), converges to the ReLU function in the limit \(t \to \infty\), as illustrated by the short numerical check below.
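A quick numerical check of this limit (a minimal sketch using NumPy; the helper name `softplus_t` is ours):

```python
import numpy as np

def softplus_t(x, t):
    """Parameterized softplus sigma(x; t) = (1/t) * ln(1 + exp(t*x))."""
    # np.logaddexp(0, t*x) computes ln(1 + exp(t*x)) stably
    return np.logaddexp(0.0, t * x) / t

x = np.linspace(-2.0, 2.0, 9)
relu = np.maximum(x, 0.0)
for t in (1.0, 10.0, 100.0):
    err = np.max(np.abs(softplus_t(x, t) - relu))
    print(f"t = {t:>6}: max |softplus - ReLU| = {err:.4f}")
# The maximum gap (attained at x = 0) equals ln(2)/t and shrinks as t grows.
```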

References

  • Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., et al. (2016). TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI’16 (pp. 265–283).
  • Adams, R., Wallach, H., & Ghahramani, Z. (2010). Learning the structure of deep sparse graphical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (pp. 1–8).
  • Andrews, D. (1989). A unified theory of estimation and inference for nonlinear dynamic models (A. R. Gallant and H. White). Econometric Theory, 5(1), 166–171.
  • Baillie, R. T., & Kapetanios, G. (2007). Testing for neglected nonlinearity in long-memory models. Journal of Business & Economic Statistics, 25(4), 447–461.
  • Barber, D., & Bishop, C. M. (1998). Ensemble learning in Bayesian neural networks. Neural Networks and Machine Learning, 168, 215–238.
  • Bartlett, P., Harvey, N., Liaw, C., & Mehrabian, A. (2017a). Nearly-tight VC-dimension bounds for piecewise linear neural networks. CoRR, abs/1703.02930.
  • Bartlett, P., Harvey, N., Liaw, C., & Mehrabian, A. (2017b). Nearly-tight VC-dimension bounds for piecewise linear neural networks. CoRR, abs/1703.02930.
  • Bengio, Y., Roux, N. L., Vincent, P., Delalleau, O., & Marcotte, P. (2006). Convex neural networks. In Y. Weiss, B. Schölkopf, & J. C. Platt (Eds.), Advances in neural information processing systems 18 (pp. 123–130). MIT Press.
  • Bishop, C. M. (2006). Pattern recognition and machine learning (information science and statistics). Berlin, Heidelberg: Springer-Verlag.
  • Blundell, C., Cornebise, J., Kavukcuoglu, K., & Wierstra, D. (2015a, May). Weight uncertainty in neural networks. arXiv:1505.05424 [cs, stat].
  • Blundell, C., Cornebise, J., Kavukcuoglu, K., & Wierstra, D. (2015b). Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424.
  • Chataigner, Crépey, & Dixon (2020). Deep local volatility.
  • Chen, J., Flood, M. D., & Sowers, R. B. (2017). Measuring the unmeasurable: An application of uncertainty quantification to treasury bond portfolios. Quantitative Finance, 17(10), 1491–1507.
  • Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., et al. (2012). Large scale distributed deep networks. In Advances in neural information processing systems (pp. 1223–1231).
  • Dixon, M., Klabjan, D., & Bang, J. H. (2016). Classification-based financial markets prediction using deep neural networks. CoRR, abs/1603.08604.
  • Feng, G., He, J., & Polson, N. G. (2018, April). Deep learning for predicting asset returns. arXiv e-prints, arXiv:1804.09314.
  • Frey, B. J., & Hinton, G. E. (1999). Variational learning in nonlinear Gaussian belief networks. Neural Computation, 11(1), 193–213.
  • Gal, Y. (2015). A theoretically grounded application of dropout in recurrent neural networks. arXiv:1512.05287.
  • Gal, Y. (2016). Uncertainty in deep learning. Ph.D. thesis, University of Cambridge.
  • Gal, Y., & Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning (pp. 1050–1059).
  • Gallant, A., & White, H. (1988, July). There exists a neural network that does not make avoidable mistakes. In IEEE 1988 International Conference on Neural Networks (Vol. 1, pp. 657–664).
  • Graves, A. (2011). Practical variational inference for neural networks. In Advances in Neural Information Processing Systems (pp. 2348–2356).
  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference and prediction. Springer.
  • Heaton, J. B., Polson, N. G., & Witte, J. H. (2017). Deep learning for finance: Deep portfolios. Applied Stochastic Models in Business and Industry, 33(1), 3–12.
  • Hernández-Lobato, J. M., & Adams, R. (2015). Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning (pp. 1861–1869).
  • Hinton, G. E., & Sejnowski, T. J. (1983). Optimal perceptual inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 448–453). IEEE, New York.
  • Hinton, G. E., & Van Camp, D. (1993). Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory (pp. 5–13). ACM.
  • Hornik, K., Stinchcombe, M., & White, H. (1989, July). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
  • Horvath, B., Muguruza, A., & Tomas, M. (2019, January). Deep learning volatility. arXiv e-prints, arXiv:1901.09647.
  • Hutchinson, J. M., Lo, A. W., & Poggio, T. (1994). A nonparametric approach to pricing and hedging derivative securities via learning networks. The Journal of Finance, 49(3), 851–889.
  • Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
  • Kuan, C.-M., & White, H. (1994). Artificial neural networks: An econometric perspective. Econometric Reviews, 13(1), 1–91.
  • Lawrence, N. (2005). Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6(Nov), 1783–1816.
  • Liang, S., & Srikant, R. (2016). Why deep neural networks? CoRR, abs/1610.04161.
  • Lo, A. (1994). Neural networks and other nonparametric techniques in economics and finance. In AIMR Conference Proceedings, Number 9.
  • MacKay, D. J. (1992a). A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3), 448–472.
  • MacKay, D. J. C. (1992b, May). A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3), 448–472.
  • Martin, C. H., & Mahoney, M. W. (2018). Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. CoRR, abs/1810.01075.
  • Mhaskar, H., Liao, Q., & Poggio, T. A. (2016). Learning real and Boolean functions: When is deep better than shallow. CoRR, abs/1603.00988.
  • Mnih, A., & Gregor, K. (2014). Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030.
  • Montúfar, G., Pascanu, R., Cho, K., & Bengio, Y. (2014, February). On the number of linear regions of deep neural networks. arXiv e-prints, arXiv:1402.1869.
  • Mullainathan, S., & Spiess, J. (2017). Machine learning: An applied econometric approach. Journal of Economic Perspectives, 31(2), 87–106.
  • Neal, R. M. (1990). Learning stochastic feedforward networks (Vol. 64). Technical report, Department of Computer Science, University of Toronto.
  • Neal, R. M. (1992). Bayesian training of backpropagation networks by the hybrid Monte Carlo method. Technical Report CRG-TR-92-1, Department of Computer Science, University of Toronto.
  • Neal, R. M. (2012). Bayesian learning for neural networks (Vol. 118). Springer Science & Business Media.
  • Nesterov, Y. (2013). Introductory lectures on convex optimization: A basic course (Vol. 87). Springer Science & Business Media.
  • Poggio, T. (2016). Deep learning: Mathematics and neuroscience. A sponsored supplement to Science, Brain-inspired intelligent robotics: The intersection of robotics and neuroscience (pp. 9–12).
  • Polson, N., & Rockova, V. (2018, March). Posterior concentration for sparse deep learning. arXiv e-prints, arXiv:1803.09138.
  • Polson, N. G., Willard, B. T., & Heidari, M. (2015). A statistical theory of deep learning via proximal splitting. arXiv:1509.06061.
  • Racine, J. (2001). On the nonlinear predictability of stock returns using financial and economic variables. Journal of Business & Economic Statistics, 19(3), 380–382.
  • Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.
  • Ruiz, F. R., Aueb, M. T. R., & Blei, D. (2016). The generalized reparameterization gradient. In Advances in Neural Information Processing Systems (pp. 460–468).
  • Salakhutdinov, R. (2008). Learning and evaluating Boltzmann machines. Technical Report UTML TR 2008-002, Department of Computer Science, University of Toronto.
  • Salakhutdinov, R., & Hinton, G. (2009). Deep Boltzmann machines. In Artificial Intelligence and Statistics (pp. 448–455).
  • Saul, L. K., Jaakkola, T., & Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61–76.
  • Sirignano, J., Sadhwani, A., & Giesecke, K. (2016, July). Deep learning for mortgage risk. arXiv e-prints.
  • Smolensky, P. (1986). Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 194–281). Cambridge, MA: MIT Press.
  • Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958.
  • Swanson, N. R., & White, H. (1995). A model-selection approach to assessing the information in the term structure using linear models and artificial neural networks. Journal of Business & Economic Statistics, 13(3), 265–275.
  • Telgarsky, M. (2016). Benefits of depth in neural networks. CoRR, abs/1602.04485.
  • Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning (pp. 1064–1071). ACM.
  • Tishby, N., & Zaslavsky, N. (2015). Deep learning and the information bottleneck principle. CoRR, abs/1503.02406.
  • Tran, D., Hoffman, M. D., Saurous, R. A., Brevdo, E., Murphy, K., & Blei, D. M. (2017, January). Deep probabilistic programming. arXiv:1701.03757 [cs, stat].
  • Vapnik, V. N. (1998). Statistical learning theory. Wiley-Interscience.
  • Welling, M., Rosen-Zvi, M., & Hinton, G. E. (2005). Exponential family harmoniums with an application to information retrieval. In Advances in Neural Information Processing Systems (pp. 1481–1488).
  • Williams, C. K. (1997). Computing with infinite networks. In Advances in Neural Information Processing Systems (pp. 295–301).


Appendix

1.1 Answers to Multiple Choice Questions

Question 1

Answer: 1, 2, 3, 4. All answers are found in the text.

Question 2

Answer: 1, 2. A feedforward architecture is always convex w.r.t. each input variable if every activation function is convex and the weights are constrained to be either all positive or all negative. Simply using convex activation functions is not sufficient, since convexity is not preserved under composition with an affine transformation of a convex function. For example, if σ(x) = x², w = −1, and b = 1, then σ(wσ(x) + b) = (1 − x²)² is not convex in x.
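The non-convexity of this composition can be checked numerically with a second-difference test (a minimal sketch; the grid point and step size are our own illustrative choices):

```python
# Counterexample from the answer: sigma(x) = x**2, w = -1, b = 1, so that
# sigma(w * sigma(x) + b) = (1 - x**2)**2.
f = lambda x: (1.0 - x ** 2) ** 2

# For a convex function, the second difference f(x+h) - 2 f(x) + f(x-h)
# is non-negative for every x and every h > 0.
x, h = 0.0, 1e-3
print(f(x + h) - 2 * f(x) + f(x - h))  # approximately -4e-6 < 0: not convex at x = 0
```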

A feedforward architecture with positive weights is a monotonically increasing function of the input for any choice of monotonically increasing activation function.

The weights of a feedforward architecture need not be constrained for the output of the network to be bounded. For example, activating the output with a softmax function bounds the output. Only if the output is not activated do the weights and bias in the final layer need to be bounded to ensure a bounded output.

The bias terms in a network shift the output but also affect the derivatives of the output w.r.t. the input when the layer is activated.

Question 3

Answer: 1, 2, 3, 4. The training of a neural network involves minimizing a loss function w.r.t. the weights and biases over the training data. L1 regularization is used during model selection to penalize models with too many effective parameters: the loss function is augmented with a penalty proportional to the sum of the absolute values of the weights, which drives many of them to zero. In deep learning, regularization can be applied to each layer of the network, so each layer has an associated regularization parameter. Back-propagation uses the chain rule to update the weights of the network but is not guaranteed to converge to a unique minimum, because the loss function is not convex w.r.t. the weights. Stochastic gradient descent is a type of optimization method that is implemented with back-propagation. There are variants of SGD, however, such as adding Nesterov's momentum term, ADAM, or RMSProp.
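A minimal sketch of these ideas in TensorFlow/Keras (the library choice, layer sizes, and regularization strength are our own illustrative assumptions): each Dense layer carries its own L1 penalty, and the optimizer is SGD with Nesterov momentum; Adam or RMSProp could be swapped in.

```python
import tensorflow as tf

# Each layer has its own L1 regularization parameter (here the same value).
l1 = tf.keras.regularizers.l1(1e-4)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", kernel_regularizer=l1,
                          input_shape=(10,)),
    tf.keras.layers.Dense(64, activation="relu", kernel_regularizer=l1),
    tf.keras.layers.Dense(3, activation="softmax"),   # K = 3 classes
])

# Stochastic gradient descent with Nesterov momentum; back-propagation
# supplies the gradients. tf.keras.optimizers.Adam() or RMSprop() are
# drop-in alternatives.
opt = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
model.compile(optimizer=opt, loss="categorical_crossentropy")
```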

1.2 Back-Propagation

Let us consider a feedforward architecture with an input layer, L − 1 hidden layers, and one output layer, with K units in the output layer for classification of K categories. As a result, we have L sets of weights and biases \((W^{(\ell)}, \mathbf{b}^{(\ell)})\) for \(\ell=1,\dots,L\), corresponding to the layer inputs \(Z^{(\ell-1)}\) and outputs \(Z^{(\ell)}\) for \(\ell=1,\dots,L\). Recall that each layer is an activation of a semi-affine transformation, \(I^{(\ell)}(Z^{(\ell-1)}) := W^{(\ell)}Z^{(\ell-1)} + \mathbf{b}^{(\ell)}\). The corresponding activation functions are denoted \(\sigma^{(\ell)}\). The activation function for the output layer is a softmax function, \(\sigma_s(x)\).

Here we use the cross-entropy as the loss function, which is defined as

$$\displaystyle \begin{aligned}\mathcal{L}:= -\sum_{k=1}^{K}Y_{k}\log \hat{Y}_{k}.\end{aligned} $$

The relationships between the layers, for \(\ell \in \{1, \dots, L\}\), are

$$\displaystyle \begin{aligned} \hat{Y} (X) & = Z^{(L)}=\sigma_s(I^{(L)}) \in [0,1]^{K},\\ Z^{(\ell)} & = \sigma^{(\ell)} \left ( I^{(\ell)} \right ), ~\ell=1,\dots,L-1,\\ Z^{(0)} & = X.\end{aligned} $$

The update rules for the weights and biases are

$$\displaystyle \begin{aligned} \Delta W^{(\ell)} &= - \gamma \nabla_{W^{(\ell)}}\mathcal{L},\\ \Delta {\mathbf{b}}^{(\ell)} &= - \gamma \nabla_{{\mathbf{b}}^{(\ell)}}\mathcal{L}. \end{aligned} $$
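For concreteness, the forward relations and the cross-entropy loss can be written directly in NumPy (a minimal sketch of our own; the helper names and the sigmoid choice for the hidden layers anticipate the derivation below):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))            # shift for numerical stability
    return e / e.sum()

def forward(X, weights, biases):
    """Z^(0) = X; Z^(l) = sigma(W^(l) Z^(l-1) + b^(l)); softmax on the output layer."""
    Z = [X]
    L = len(weights)
    for l in range(L):
        I = weights[l] @ Z[-1] + biases[l]
        Z.append(softmax(I) if l == L - 1 else sigmoid(I))
    return Z                             # Z[l] is the output of layer l

def cross_entropy(Y, Y_hat):
    """L = -sum_k Y_k log Y_hat_k, for a one-hot label Y."""
    return -np.sum(Y * np.log(Y_hat + 1e-12))
```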

We now begin the back-propagation, tracking the intermediate calculations carefully using explicit index notation.

For the gradient of \(\mathcal {L}\) w.r.t. W (L) we have

$$\displaystyle \begin{aligned} \frac{\partial \mathcal{L}}{\partial w_{ij}^{(L)}} &= \sum_{k=1}^{K}\frac{\partial \mathcal{L}}{\partial Z_{k}^{(L)}} \frac{\partial Z_{k}^{(L)}}{\partial w_{ij}^{(L)}}\\ &= \sum_{k=1}^{K}\frac{\partial \mathcal{L}}{\partial Z_{k}^{(L)}} \sum_{m=1}^{K}\frac{\partial Z_{k}^{(L)}}{\partial I_{m}^{(L)}} \frac{\partial I_{m}^{(L)}}{\partial w_{ij}^{(L)}} \end{aligned} $$

But

$$\displaystyle \begin{aligned} \frac{\partial \mathcal{L}}{\partial Z_{k}^{(L)}} &= -\frac{Y_{k}}{Z_{k}^{(L)}}\\ \frac{\partial Z_{k}^{(L)}}{\partial I_{m}^{(L)}} &= \frac{\partial}{\partial I_{m}^{(L)}}[\sigma(I^{(L)})]_{k}\\ &= \frac{\partial}{\partial I_{m}^{(L)}} \frac{\exp[I_{k}^{(L)}]}{\sum_{n=1}^{K}\exp[I_{n}^{(L)}]}\\ &= \begin{cases} -\frac{\exp[I_{k}^{(L)}]}{\sum_{n=1}^{K}\exp[I_{n}^{(L)}]} \frac{\exp[I_{m}^{(L)}]}{\sum_{n=1}^{K}\exp[I_{n}^{(L)}]} & \text{if } k \neq m \\ \frac{\exp[I_{k}^{(L)}]}{\sum_{n=1}^{K}\exp[I_{n}^{(L)}]} - \frac{\exp[I_{k}^{(L)}]} {\sum_{n=1}^{K}\exp[I_{n}^{(L)}]} \frac{\exp[I_{m}^{(L)}]}{\sum_{n=1}^{K}\exp[I_{n}^{(L)}]} & \text{otherwise} \end{cases}\\ &= \begin{cases} -\sigma_{k}\sigma_{m}& \text{if } k \neq m \\ \sigma_k(1 - \sigma_m) & \text{otherwise} \end{cases}\\ &= \sigma_k(\delta_{km} - \sigma_m) \quad \text{where} \, \delta_{km} \, \text{is the Kronecker's Delta}\\ \frac{\partial I_{m}^{(L)}}{\partial w_{ij}^{(L)}} &= \delta_{mi}Z_{j}^{(L-1)}\\ \implies \frac{\partial \mathcal{L}}{\partial w_{ij}^{(L)}} &= -\sum_{k=1}^{K}\frac{Y_{k}}{Z_{k}^{(L)}} \sum_{m=1}^{K} Z_{m}^{(L)}(\delta_{km} - Z_{m}^{(L)}) \delta_{mi}Z_{j}^{(L-1)}\\ &= -Z_{j}^{(L-1)} \sum_{k=1}^{K}Y_{k} (\delta_{ki} - Z_{i}^{(L)}) \\ &= Z_{j}^{(L-1)} (Z_{i}^{(L)}-Y_{i}), \end{aligned} $$

where we have used the fact that \(\sum _{k=1}^{K}Y_{k}=1\) in the last equality. Similarly for b (L), we have

$$\displaystyle \begin{aligned} \frac{\partial \mathcal{L}}{\partial b_{i}^{(L)}} &= \sum_{k=1}^{K}\frac{\partial \mathcal{L}}{\partial Z_{k}^{(L)}} \sum_{m=1}^{K}\frac{\partial Z_{k}^{(L)}}{\partial I_{m}^{(L)}} \frac{\partial I_{m}^{(L)}}{\partial b_{i}^{(L)}}\\ &= Z_{i}^{(L)}-Y_{i} \end{aligned} $$

It follows that

$$\displaystyle \begin{aligned} \nabla_{{\mathbf{b}}^{(L)}}\mathcal{L} &= Z^{(L)}-Y\\ \nabla_{W^{(L)}}\mathcal{L} &= \nabla_{{\mathbf{b}}^{(L)}}\mathcal{L} \otimes {Z^{(L-1)}}, \end{aligned} $$

where ⊗ denotes the outer product.

Now for the gradient of \(\mathcal {L}\) w.r.t. W (L−1) we have

$$\displaystyle \begin{aligned} \frac{\partial \mathcal{L}}{\partial w_{ij}^{(L-1)}} &= \sum_{k=1}^{K}\frac{\partial \mathcal{L}}{\partial Z_{k}^{(L)}} \frac{\partial Z_{k}^{(L)}}{\partial w_{ij}^{(L-1)}}\\ &= \sum_{k=1}^{K}\frac{\partial \mathcal{L}}{\partial Z_{k}^{(L)}} \sum_{m=1}^{K}\frac{\partial Z_{k}^{(L)}}{\partial I_{m}^{(L)}} \sum_{n=1}^{n^{(L-1)}} \frac{\partial I_{m}^{(L)}}{\partial Z_{n}^{(L-1)}} \sum_{p=1}^{n^{(L-1)}} \frac{\partial Z_{n}^{(L-1)}}{\partial I_{p}^{(L-1)}} \frac{\partial I_{p}^{(L-1)}}{\partial w_{ij}^{(L-1)}} \end{aligned} $$

If we assume that \(\sigma^{(\ell)}(x) = \mathrm{sigmoid}(x)\), \(\ell \in \{1, \dots, L-1\}\), then

$$\displaystyle \begin{aligned} \frac{\partial I_{m}^{(L)}}{\partial Z_{n}^{(L-1)}} &= w_{mn}^{(L)}\\ \frac{\partial Z_{n}^{(L-1)}}{\partial I_{p}^{(L-1)}} &= \frac{\partial}{\partial I_{p}^{(L-1)}}\bigg(\frac{1}{1+\exp(-I_{n}^{(L-1)})}\bigg)\\ &= \frac{1}{1+\exp(-I_{n}^{(L-1)})} \frac{\exp(-I_{n}^{(L-1)})}{1+\exp(-I_{n}^{(L-1)})} \, \delta_{np} \\ &= Z_{n}^{(L-1)} (1-Z_{n}^{(L-1)}) \, \delta_{np} = \sigma^{(L-1)}_n(1-\sigma^{(L-1)}_n)\delta_{np} \\ \frac{\partial I_{p}^{(L-1)}}{\partial w_{ij}^{(L-1)}} &= \delta_{pi} Z_{j}^{(L-2)} \\ \implies \frac{\partial \mathcal{L}}{\partial w_{ij}^{(L-1)}} &= -\sum_{k=1}^{K}\frac{Y_{k}}{Z_{k}^{(L)}} \sum_{m=1}^{K}Z_{k}^{(L)}(\delta_{km} - Z_{m}^{(L)})\\ &\qquad \sum_{n=1}^{n^{(L-1)}} w_{mn}^{(L)} \sum_{p=1}^{n^{(L-1)}} Z_{n}^{(L-1)} (1-Z_{n}^{(L-1)}) \, \delta_{np} \delta_{pi} Z_{j}^{(L-2)} \\ &= -\sum_{k=1}^{K}Y_{k} \sum_{m=1}^{K}(\delta_{km} - Z_{m}^{(L)}) \sum_{n=1}^{n^{(L-1)}} w_{mn}^{(L)} Z_{n}^{(L-1)} (1-Z_{n}^{(L-1)}) \, \delta_{ni} Z_{j}^{(L-2)} \\ &= -\sum_{k=1}^{K}Y_{k} \sum_{m=1}^{K}(\delta_{km} - Z_{m}^{(L)}) w_{mi}^{(L)} Z_{i}^{(L-1)} (1-Z_{i}^{(L-1)}) Z_{j}^{(L-2)} \\ &= -Z_{j}^{(L-2)}Z_{i}^{(L-1)}(1-Z_{i}^{(L-1)}) \sum_{m=1}^{K} w_{mi}^{(L)} \sum_{k=1}^{K}(\delta_{km}Y_{k} - Z_{m}^{(L)}Y_{k}) \\ &= Z_{j}^{(L-2)}Z_{i}^{(L-1)} (1-Z_{i}^{(L-1)}) \sum_{m=1}^{K} w_{mi}^{(L)} (Z_{m}^{(L)} - Y_{m}) \\ &= Z_{j}^{(L-2)}Z_{i}^{(L-1)} (1-Z_{i}^{(L-1)}) (Z^{(L)} - Y)^{T} {\mathbf{w}}_{,i}^{(L)} \\ \end{aligned} $$

Similarly we have

$$\displaystyle \begin{aligned}\frac{\partial \mathcal{L}}{\partial b_{i}^{(L-1)}} = Z_{i}^{(L-1)} (1-Z_{i}^{(L-1)}) (Z^{(L)} - Y)^{T} {\mathbf{w}}_{,i}^{(L)}.\end{aligned}$$

It follows that we can define the following recursion relation for the loss gradient:

$$\displaystyle \begin{aligned} \nabla_{b^{(L-1)}}\mathcal{L} &= Z^{(L-1)} \circ (\mathbf{1}-Z^{(L-1)}) \circ ({W^{(L)}}^{T} \nabla_{b^{(L)}}\mathcal{L}) \\ \nabla_{W^{(L-1)}}\mathcal{L} &= \nabla_{b^{(L-1)}}\mathcal{L} \otimes Z^{(L-2)}\\ & = \left[Z^{(L-1)} \circ (\mathbf{1}-Z^{(L-1)}) \circ ({W^{(L)}}^{T} \nabla_{b^{(L)}}\mathcal{L})\right] \otimes Z^{(L-2)}, \end{aligned} $$

where ∘ denotes the Hadamard product (element-wise multiplication). This recursion relation generalizes to all layers and to any choice of activation function. To see this, define the back-propagation error \(\delta ^{(\ell )}:=\nabla _{b^{(\ell )}}\mathcal {L}\); since, for the sigmoid activation,

$$\displaystyle \begin{aligned} \left[\frac{\partial \sigma^{(\ell)}}{\partial I^{(\ell)}}\right]_{ij}&=\frac{\partial \sigma_i^{(\ell)}}{\partial I_j^{(\ell)}}\\ &=\sigma_i^{(\ell)}(1-\sigma_i^{(\ell)})\delta_{ij} \end{aligned} $$

or equivalently in matrix–vector form

$$\displaystyle \begin{aligned}\nabla_{I^{(\ell)}} \sigma^{(\ell)}=\text{diag}(\sigma^{(\ell)} \circ (\mathbf{1}-\sigma^{(\ell)})),\end{aligned}$$

we can write, in general, for any choice of activation function for the hidden layer,

$$\displaystyle \begin{aligned}\delta^{(\ell)}=\nabla_{I^{(\ell)}} \sigma^{(\ell)}(W^{(\ell+1)})^T\delta^{(\ell+1)},\end{aligned}$$

and

$$\displaystyle \begin{aligned}\nabla_{W^{(\ell)}}\mathcal{L} = \delta^{(\ell)} \otimes Z^{(\ell-1)}.\end{aligned}$$
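This recursion translates directly into code. The following NumPy sketch (our own, consuming the list of layer outputs produced by the forward-pass sketch above and assuming sigmoid hidden layers) sets \(\delta^{(L)} = Z^{(L)} - Y\) and propagates it backwards to obtain all weight and bias gradients:

```python
import numpy as np

def backprop(Z, Y, weights):
    """Gradients of the cross-entropy loss, given the layer outputs
    Z = [Z^(0)=X, Z^(1), ..., Z^(L)] from a forward pass, the one-hot
    label Y, and the list of weight matrices W^(1), ..., W^(L)."""
    L = len(weights)
    grads_W, grads_b = [None] * L, [None] * L

    delta = Z[L] - Y                         # delta^(L) = Z^(L) - Y
    for l in range(L - 1, -1, -1):           # python index l <-> layer l+1
        grads_b[l] = delta                   # nabla_b^(l+1) L = delta^(l+1)
        grads_W[l] = np.outer(delta, Z[l])   # nabla_W^(l+1) L = delta^(l+1) (x) Z^(l)
        if l > 0:                            # sigmoid Jacobian: Z o (1 - Z)
            delta = Z[l] * (1.0 - Z[l]) * (weights[l].T @ delta)
    return grads_W, grads_b
```

A gradient-descent step is then `weights[l] -= gamma * grads_W[l]` and `biases[l] -= gamma * grads_b[l]`, matching the update rules stated above.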

1.3 Proof of Theorem 4.2

Using the same deep structure shown in Fig. 4.9, Liang and Srikant (2016) find the binary expansion sequence \(\{x_0, \dots, x_n\}\). In this step, they use n binary step units in total. Then they rewrite \(g_{m+1}\left(\sum _{i=0}^{n}\frac {x_{i}}{2^{i}}\right)\),

$$\displaystyle \begin{aligned} g_{m+1}\left(\sum_{i=0}^{n}\frac{x_{i}}{2^{i}}\right)&=\sum_{j=0}^{n}\left[x_{j}\cdot\frac{1}{2^{j}}g_{m}\left(\sum_{i=0}^{n}\frac{x_{i}}{2^{i}}\right)\right]\\ &=\sum_{j=0}^{n}\max\left[2(x_{j}-1)+\frac{1}{2^j}g_{m}\left(\sum_{i=0}^{n}\frac{x_{i}}{2^{i}}\right),0\right].{} \end{aligned} $$
(4.57)

Clearly, Eq. 4.57 defines an iteration between the outputs of neighboring layers. Define the output of the multilayer neural network as \(\hat {f}(x)=\sum _{i=0}^{p}a_{i}g_{i}\left (\sum _{j=0}^{n}\frac {x_{j}}{2^{j}}\right ).\) For this multilayer network, the approximation error is

$$\displaystyle \begin{aligned} |f(x)-\hat{f}(x)|&=\left|\sum_{i=0}^{p}a_{i}g_{i}\left(\sum_{j=0}^{n}\frac{x_{j}}{2^{j}}\right)-\sum_{i=0}^{p}a_{i}x^{i}\right|\\ &\le \sum_{i=0}^{p}\left[|a_{i}|\cdot\left|g_{i}\left(\sum_{j=0}^{n}\frac{x_{j}}{2^{j}}\right)-x^{i}\right|~\right]\le\frac{p}{2^{n-1}}. \end{aligned} $$

This indicates that, to achieve an ε-approximation error, one should choose \(n=\left \lceil \log \frac {p}{\varepsilon }\right \rceil +1\). Moreover, since \(\mathcal {O}(n+p)\) layers with \(\mathcal {O}(n)\) binary step units and \(\mathcal {O}(pn)\) ReLU units are used in total, this multilayer neural network has \(\mathcal {O}\left (p+\log \frac {p}{\varepsilon }\right )\) layers, \(\mathcal {O}\left (\log \frac {p}{\varepsilon }\right )\) binary step units, and \(\mathcal {O}\left (p\log \frac {p}{\varepsilon }\right )\) ReLU units.
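As a quick sanity check of the bound (our own arithmetic sketch, taking the logarithm to be base 2, consistent with the binary expansion, and with illustrative values of p and ε):

```python
import math

p, eps = 8, 1e-3                        # polynomial degree and target error
n = math.ceil(math.log2(p / eps)) + 1   # n = ceil(log2(p / eps)) + 1
print(n, p / 2 ** (n - 1) <= eps)       # the error bound p / 2^(n-1) <= eps holds
```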

1.4 Proof of Lemmas from Telgarsky (2016)

Proof (of 4.1)

Let \(\mathcal{I}_f\) denote the partition of \(\mathbb {R}\) corresponding to f, and \(\mathcal{I}_g\) denote the partition of \(\mathbb {R}\) corresponding to g.

First consider f + g, and moreover any intervals \(U_f \in \mathcal{I}_f\) and \(U_g \in \mathcal{I}_g\). Necessarily, f + g has a single slope along \(U_f \cap U_g\). Consequently, f + g is \(|\mathcal{I}|\)-sawtooth, where \(\mathcal{I}\) is the set of all intersections of intervals from \(\mathcal{I}_f\) and \(\mathcal{I}_g\), meaning \(\mathcal{I} := \{U_f \cap U_g : U_f \in \mathcal{I}_f,\, U_g \in \mathcal{I}_g\}\). By sorting the left endpoints of elements of \(\mathcal{I}_f\) and \(\mathcal{I}_g\), it follows that \(|\mathcal{I}| \leq k + l\) (the other intersections are empty).

For example, consider the example in Fig. 4.11 with partitions given in Table 4.2. The set of all intersections of intervals from \(\mathcal{I}_f\) and \(\mathcal{I}_g\) contains 3 elements:

$$\displaystyle \begin{aligned} \mathcal{I}=\left\{\left[0,\tfrac{1}{4}\right] \cap \left[0,\tfrac{1}{2}\right],\; \left(\tfrac{1}{4},1\right] \cap \left[0,\tfrac{1}{2}\right],\; \left(\tfrac{1}{4},1\right] \cap \left(\tfrac{1}{2},1\right]\right\} \end{aligned} $$
(4.58)
Table 4.2 Definitions of the functions f(x) and g(x)

Now consider f ∘ g, and in particular consider the image \(f(g(U_g))\) for some interval \(U_g \in \mathcal{I}_g\). g is affine with a single slope along \(U_g\); therefore, f is being considered along a single unbroken interval \(g(U_g)\). However, nothing prevents \(g(U_g)\) from hitting all the elements of \(\mathcal{I}_f\); since \(U_g\) was arbitrary, it holds that f ∘ g is \((|\mathcal{I}_f| \cdot |\mathcal{I}_g|)\)-sawtooth. □
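The counting argument can be illustrated numerically (a minimal sketch; the two 2-sawtooth examples and the slope-counting helper are our own). We count the affine pieces of f + g and f ∘ g on a fine grid and compare them with the bounds k + l and k · l:

```python
import numpy as np

# Two simple 2-sawtooth functions on [0, 1]: f peaks at 1/2, g peaks at 1/4.
f = lambda x: np.minimum(2 * x, 2 - 2 * x)
g = lambda x: np.where(x <= 0.25, 4 * x, (4 - 4 * x) / 3)

def count_pieces(h, a=0.0, b=1.0, n=200001):
    """Approximate the number of affine pieces of h on [a, b] by counting
    jumps in the discrete slope between neighboring grid intervals."""
    x = np.linspace(a, b, n)
    slopes = np.diff(h(x)) / np.diff(x)
    return 1 + int(np.sum(np.abs(np.diff(slopes)) > 1e-6))

k, l = count_pieces(f), count_pieces(g)                        # 2 and 2
print(count_pieces(lambda x: f(x) + g(x)), "pieces for f + g; bound k + l =", k + l)
print(count_pieces(lambda x: f(g(x))), "pieces for f o g; bound k * l =", k * l)
```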

Proof

Recall the notation \(\tilde f(x) := [f(x) \geq 1/2]\), whereby \(\mathcal {E}(f) := \frac {1}{n}\sum _i [y_i\neq \tilde f(x_i)]\). Since f is piecewise monotonic, with a corresponding partition of \(\mathbb{R}\) having at most t pieces, f has at most 2t − 1 crossings of 1/2: at most one within each interval of the partition, and at most one at the right endpoint of all but the last interval. Consequently, \(\tilde f\) is piecewise constant, where the corresponding partition of \(\mathbb {R}\) is into at most 2t intervals. This means that n points with alternating labels must land in 2t buckets; thus the total number of points landing in buckets with at least three points is at least n − 4t. □

1.5 Python Notebooks

The notebooks provided in the accompanying source code repository are designed to build insight using toy classification datasets. They provide examples of deep feedforward classification, back-propagation, and Bayesian network classifiers. Further details of the notebooks are included in the README.md file.


Copyright information

© 2020 Springer Nature Switzerland AG


Cite this chapter

Dixon, M.F., Halperin, I., Bilokon, P. (2020). Feedforward Neural Networks. In: Machine Learning in Finance. Springer, Cham. https://doi.org/10.1007/978-3-030-41068-1_4
