Feedforward Neural Networks

Machine Learning in Finance

Abstract

This chapter provides a more in-depth description of supervised learning, deep learning, and neural networks, presenting the foundational mathematical and statistical learning concepts and explaining how they relate to real-world examples in trading, risk management, and investment management. These applications present challenges for forecasting and model design and appear as a recurring theme throughout the book. The chapter moves towards a more engineering-style exposition of neural networks, applying concepts from the previous chapters to elucidate various model design choices.


Notes

  1.

    Note that there is a potential degeneracy in this case: there may exist “flat directions”, hyper-surfaces in the parameter space along which the loss function takes exactly the same value.

  2.

    There is some redundancy in the construction of the network and around 50 units are needed.

  3.

    The parameterized softplus function \(\sigma (x;t)=\frac {1}{t}\ln (1+\exp \{tx\})\), with a model parameter \(t \gg 1\), converges to the ReLU function in the limit \(t \to \infty\), as illustrated by the short numerical check below.
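A quick numerical check of this limit (a minimal sketch using NumPy; the helper name `softplus_t` is ours):

```python
import numpy as np

def softplus_t(x, t):
    """Parameterized softplus sigma(x; t) = (1/t) * ln(1 + exp(t*x))."""
    # np.logaddexp(0, t*x) computes ln(1 + exp(t*x)) stably
    return np.logaddexp(0.0, t * x) / t

x = np.linspace(-2.0, 2.0, 9)
relu = np.maximum(x, 0.0)
for t in (1.0, 10.0, 100.0):
    err = np.max(np.abs(softplus_t(x, t) - relu))
    print(f"t = {t:>6}: max |softplus - ReLU| = {err:.4f}")
# The maximum gap (attained at x = 0) equals ln(2)/t and shrinks as t grows.
```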

References

  • Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., et al. (2016). TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI’16 (pp. 265–283).
  • Adams, R., Wallach, H., & Ghahramani, Z. (2010). Learning the structure of deep sparse graphical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (pp. 1–8).
  • Andrews, D. (1989). A unified theory of estimation and inference for nonlinear dynamic models (A. R. Gallant and H. White). Econometric Theory, 5(1), 166–171.
  • Baillie, R. T., & Kapetanios, G. (2007). Testing for neglected nonlinearity in long-memory models. Journal of Business & Economic Statistics, 25(4), 447–461.
  • Barber, D., & Bishop, C. M. (1998). Ensemble learning in Bayesian neural networks. Neural Networks and Machine Learning, 168, 215–238.
  • Bartlett, P., Harvey, N., Liaw, C., & Mehrabian, A. (2017a). Nearly-tight VC-dimension bounds for piecewise linear neural networks. CoRR, abs/1703.02930.
  • Bartlett, P., Harvey, N., Liaw, C., & Mehrabian, A. (2017b). Nearly-tight VC-dimension bounds for piecewise linear neural networks. CoRR, abs/1703.02930.
  • Bengio, Y., Roux, N. L., Vincent, P., Delalleau, O., & Marcotte, P. (2006). Convex neural networks. In Y. Weiss, B. Schölkopf, & J. C. Platt (Eds.), Advances in neural information processing systems 18 (pp. 123–130). MIT Press.
  • Bishop, C. M. (2006). Pattern recognition and machine learning (information science and statistics). Berlin, Heidelberg: Springer-Verlag.
  • Blundell, C., Cornebise, J., Kavukcuoglu, K., & Wierstra, D. (2015a, May). Weight uncertainty in neural networks. arXiv:1505.05424 [cs, stat].
  • Blundell, C., Cornebise, J., Kavukcuoglu, K., & Wierstra, D. (2015b). Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424.
  • Chataigner, Crépey, & Dixon (2020). Deep local volatility.
  • Chen, J., Flood, M. D., & Sowers, R. B. (2017). Measuring the unmeasurable: An application of uncertainty quantification to treasury bond portfolios. Quantitative Finance, 17(10), 1491–1507.
  • Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., et al. (2012). Large scale distributed deep networks. In Advances in neural information processing systems (pp. 1223–1231).
  • Dixon, M., Klabjan, D., & Bang, J. H. (2016). Classification-based financial markets prediction using deep neural networks. CoRR, abs/1603.08604.
  • Feng, G., He, J., & Polson, N. G. (2018, April). Deep learning for predicting asset returns. arXiv e-prints, arXiv:1804.09314.
  • Frey, B. J., & Hinton, G. E. (1999). Variational learning in nonlinear Gaussian belief networks. Neural Computation, 11(1), 193–213.
  • Gal, Y. (2015). A theoretically grounded application of dropout in recurrent neural networks. arXiv:1512.05287.
  • Gal, Y. (2016). Uncertainty in deep learning. Ph.D. thesis, University of Cambridge.
  • Gal, Y., & Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning (pp. 1050–1059).
  • Gallant, A., & White, H. (1988, July). There exists a neural network that does not make avoidable mistakes. In IEEE 1988 International Conference on Neural Networks (Vol. 1, pp. 657–664).
  • Graves, A. (2011). Practical variational inference for neural networks. In Advances in Neural Information Processing Systems (pp. 2348–2356).
  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference and prediction. Springer.
  • Heaton, J. B., Polson, N. G., & Witte, J. H. (2017). Deep learning for finance: Deep portfolios. Applied Stochastic Models in Business and Industry, 33(1), 3–12.
  • Hernández-Lobato, J. M., & Adams, R. (2015). Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning (pp. 1861–1869).
  • Hinton, G. E., & Sejnowski, T. J. (1983). Optimal perceptual inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 448–453). IEEE, New York.
  • Hinton, G. E., & Van Camp, D. (1993). Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory (pp. 5–13). ACM.
  • Hornik, K., Stinchcombe, M., & White, H. (1989, July). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
  • Horvath, B., Muguruza, A., & Tomas, M. (2019, January). Deep learning volatility. arXiv e-prints, arXiv:1901.09647.
  • Hutchinson, J. M., Lo, A. W., & Poggio, T. (1994). A nonparametric approach to pricing and hedging derivative securities via learning networks. The Journal of Finance, 49(3), 851–889.
  • Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
  • Kuan, C.-M., & White, H. (1994). Artificial neural networks: An econometric perspective. Econometric Reviews, 13(1), 1–91.
  • Lawrence, N. (2005). Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6(Nov), 1783–1816.
  • Liang, S., & Srikant, R. (2016). Why deep neural networks? CoRR, abs/1610.04161.
  • Lo, A. (1994). Neural networks and other nonparametric techniques in economics and finance. In AIMR Conference Proceedings, Number 9.
  • MacKay, D. J. (1992a). A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3), 448–472.
  • MacKay, D. J. C. (1992b, May). A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3), 448–472.
  • Martin, C. H., & Mahoney, M. W. (2018). Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. CoRR, abs/1810.01075.
  • Mhaskar, H., Liao, Q., & Poggio, T. A. (2016). Learning real and Boolean functions: When is deep better than shallow. CoRR, abs/1603.00988.
  • Mnih, A., & Gregor, K. (2014). Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030.
  • Montúfar, G., Pascanu, R., Cho, K., & Bengio, Y. (2014, February). On the number of linear regions of deep neural networks. arXiv e-prints, arXiv:1402.1869.
  • Mullainathan, S., & Spiess, J. (2017). Machine learning: An applied econometric approach. Journal of Economic Perspectives, 31(2), 87–106.
  • Neal, R. M. (1990). Learning stochastic feedforward networks (Vol. 64). Technical report, Department of Computer Science, University of Toronto.
  • Neal, R. M. (1992). Bayesian training of backpropagation networks by the hybrid Monte Carlo method. Technical Report CRG-TR-92-1, Department of Computer Science, University of Toronto.
  • Neal, R. M. (2012). Bayesian learning for neural networks (Vol. 118). Springer Science & Business Media.
  • Nesterov, Y. (2013). Introductory lectures on convex optimization: A basic course (Vol. 87). Springer Science & Business Media.
  • Poggio, T. (2016). Deep learning: Mathematics and neuroscience. A sponsored supplement to Science, Brain-inspired intelligent robotics: The intersection of robotics and neuroscience (pp. 9–12).
  • Polson, N., & Rockova, V. (2018, March). Posterior concentration for sparse deep learning. arXiv e-prints, arXiv:1803.09138.
  • Polson, N. G., Willard, B. T., & Heidari, M. (2015). A statistical theory of deep learning via proximal splitting. arXiv:1509.06061.
  • Racine, J. (2001). On the nonlinear predictability of stock returns using financial and economic variables. Journal of Business & Economic Statistics, 19(3), 380–382.
  • Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.
  • Ruiz, F. R., Aueb, M. T. R., & Blei, D. (2016). The generalized reparameterization gradient. In Advances in Neural Information Processing Systems (pp. 460–468).
  • Salakhutdinov, R. (2008). Learning and evaluating Boltzmann machines. Technical Report UTML TR 2008-002, Department of Computer Science, University of Toronto.
  • Salakhutdinov, R., & Hinton, G. (2009). Deep Boltzmann machines. In Artificial Intelligence and Statistics (pp. 448–455).
  • Saul, L. K., Jaakkola, T., & Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61–76.
  • Sirignano, J., Sadhwani, A., & Giesecke, K. (2016, July). Deep learning for mortgage risk. arXiv e-prints.
  • Smolensky, P. (1986). Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 194–281). Cambridge, MA: MIT Press.
  • Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958.
  • Swanson, N. R., & White, H. (1995). A model-selection approach to assessing the information in the term structure using linear models and artificial neural networks. Journal of Business & Economic Statistics, 13(3), 265–275.
  • Telgarsky, M. (2016). Benefits of depth in neural networks. CoRR, abs/1602.04485.
  • Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning (pp. 1064–1071). ACM.
  • Tishby, N., & Zaslavsky, N. (2015). Deep learning and the information bottleneck principle. CoRR, abs/1503.02406.
  • Tran, D., Hoffman, M. D., Saurous, R. A., Brevdo, E., Murphy, K., & Blei, D. M. (2017, January). Deep probabilistic programming. arXiv:1701.03757 [cs, stat].
  • Vapnik, V. N. (1998). Statistical learning theory. Wiley-Interscience.
  • Welling, M., Rosen-Zvi, M., & Hinton, G. E. (2005). Exponential family harmoniums with an application to information retrieval. In Advances in Neural Information Processing Systems (pp. 1481–1488).
  • Williams, C. K. (1997). Computing with infinite networks. In Advances in Neural Information Processing Systems (pp. 295–301).


Appendix

1.1 Answers to Multiple Choice Questions

Question 1

Answer: 1, 2, 3, 4. All answers are found in the text.

Question 2

Answer: 1, 2. A feedforward architecture is always convex w.r.t. each input variable if every activation function is convex and the weights are constrained to be either all positive or all negative. Simply using convex activation functions is not sufficient, since convexity is not preserved under composition with an affine transformation of a convex function. For example, if σ(x) = x², w = −1, and b = 1, then σ(wσ(x) + b) = (1 − x²)² is not convex in x.
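The non-convexity of this composition can be checked numerically with a second-difference test (a minimal sketch; the grid point and step size are our own illustrative choices):

```python
# Counterexample from the answer: sigma(x) = x**2, w = -1, b = 1, so that
# sigma(w * sigma(x) + b) = (1 - x**2)**2.
f = lambda x: (1.0 - x ** 2) ** 2

# For a convex function, the second difference f(x+h) - 2 f(x) + f(x-h)
# is non-negative for every x and every h > 0.
x, h = 0.0, 1e-3
print(f(x + h) - 2 * f(x) + f(x - h))  # approximately -4e-6 < 0: not convex at x = 0
```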

A feedforward architecture with positive weights is a monotonically increasing function of the input for any choice of monotonically increasing activation function.

The weights of a feedforward architecture need not be constrained for the output of the network to be bounded. For example, activating the output with a softmax function bounds the output. Only if the output is not activated do the weights and bias in the final layer need to be bounded to ensure a bounded output.

The bias terms in a network shift the output but also affect the derivatives of the output w.r.t. the input when the layer is activated.

Question 3

Answer: 1, 2, 3, 4. The training of a neural network involves minimizing a loss function w.r.t. the weights and biases over the training data. L1 regularization is used during model selection to penalize models with too many effective parameters: the loss function is augmented with a penalty proportional to the sum of the absolute values of the weights, which drives many of them to zero. In deep learning, regularization can be applied to each layer of the network, so each layer has an associated regularization parameter. Back-propagation uses the chain rule to update the weights of the network but is not guaranteed to converge to a unique minimum, because the loss function is not convex w.r.t. the weights. Stochastic gradient descent is a type of optimization method that is implemented with back-propagation. There are variants of SGD, however, such as adding Nesterov's momentum term, ADAM, or RMSProp.
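A minimal sketch of these ideas in TensorFlow/Keras (the library choice, layer sizes, and regularization strength are our own illustrative assumptions): each Dense layer carries its own L1 penalty, and the optimizer is SGD with Nesterov momentum; Adam or RMSProp could be swapped in.

```python
import tensorflow as tf

# Each layer has its own L1 regularization parameter (here the same value).
l1 = tf.keras.regularizers.l1(1e-4)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", kernel_regularizer=l1,
                          input_shape=(10,)),
    tf.keras.layers.Dense(64, activation="relu", kernel_regularizer=l1),
    tf.keras.layers.Dense(3, activation="softmax"),   # K = 3 classes
])

# Stochastic gradient descent with Nesterov momentum; back-propagation
# supplies the gradients. tf.keras.optimizers.Adam() or RMSprop() are
# drop-in alternatives.
opt = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
model.compile(optimizer=opt, loss="categorical_crossentropy")
```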

1.2 Back-Propagation

Let us consider a feedforward architecture with an input layer, L − 1 hidden layers, and one output layer, with K units in the output layer for classification of K categories. As a result, we have L sets of weights and biases \((W^{(\ell)}, \mathbf{b}^{(\ell)})\) for \(\ell=1,\dots,L\), corresponding to the layer inputs \(Z^{(\ell-1)}\) and outputs \(Z^{(\ell)}\) for \(\ell=1,\dots,L\). Recall that each layer is an activation of a semi-affine transformation, \(I^{(\ell)}(Z^{(\ell-1)}) := W^{(\ell)}Z^{(\ell-1)} + \mathbf{b}^{(\ell)}\). The corresponding activation functions are denoted \(\sigma^{(\ell)}\). The activation function for the output layer is a softmax function, \(\sigma_s(x)\).

Here we use the cross-entropy as the loss function, which is defined as

$$\displaystyle \begin{aligned}\mathcal{L}:= -\sum_{k=1}^{K}Y_{k}\log \hat{Y}_{k}.\end{aligned} $$

The relationships between the layers, for \(\ell \in \{1, \dots, L\}\), are

$$\displaystyle \begin{aligned} \hat{Y} (X) & = Z^{(L)}=\sigma_s(I^{(L)}) \in [0,1]^{K},\\ Z^{(\ell)} & = \sigma^{(\ell)} \left ( I^{(\ell)} \right ), ~\ell=1,\dots,L-1,\\ Z^{(0)} & = X.\end{aligned} $$

The update rules for the weights and biases are

$$\displaystyle \begin{aligned} \Delta W^{(\ell)} &= - \gamma \nabla_{W^{(\ell)}}\mathcal{L},\\ \Delta {\mathbf{b}}^{(\ell)} &= - \gamma \nabla_{{\mathbf{b}}^{(\ell)}}\mathcal{L}. \end{aligned} $$
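For concreteness, the forward relations and the cross-entropy loss can be written directly in NumPy (a minimal sketch of our own; the helper names and the sigmoid choice for the hidden layers anticipate the derivation below):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))            # shift for numerical stability
    return e / e.sum()

def forward(X, weights, biases):
    """Z^(0) = X; Z^(l) = sigma(W^(l) Z^(l-1) + b^(l)); softmax on the output layer."""
    Z = [X]
    L = len(weights)
    for l in range(L):
        I = weights[l] @ Z[-1] + biases[l]
        Z.append(softmax(I) if l == L - 1 else sigmoid(I))
    return Z                             # Z[l] is the output of layer l

def cross_entropy(Y, Y_hat):
    """L = -sum_k Y_k log Y_hat_k, for a one-hot label Y."""
    return -np.sum(Y * np.log(Y_hat + 1e-12))
```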

We now begin the back-propagation, tracking the intermediate calculations carefully using explicit index notation.

For the gradient of \(\mathcal {L}\) w.r.t. W (L) we have

$$\displaystyle \begin{aligned} \frac{\partial \mathcal{L}}{\partial w_{ij}^{(L)}} &= \sum_{k=1}^{K}\frac{\partial \mathcal{L}}{\partial Z_{k}^{(L)}} \frac{\partial Z_{k}^{(L)}}{\partial w_{ij}^{(L)}}\\ &= \sum_{k=1}^{K}\frac{\partial \mathcal{L}}{\partial Z_{k}^{(L)}} \sum_{m=1}^{K}\frac{\partial Z_{k}^{(L)}}{\partial I_{m}^{(L)}} \frac{\partial I_{m}^{(L)}}{\partial w_{ij}^{(L)}} \end{aligned} $$

But

$$\displaystyle \begin{aligned} \frac{\partial \mathcal{L}}{\partial Z_{k}^{(L)}} &= -\frac{Y_{k}}{Z_{k}^{(L)}}\\ \frac{\partial Z_{k}^{(L)}}{\partial I_{m}^{(L)}} &= \frac{\partial}{\partial I_{m}^{(L)}}[\sigma(I^{(L)})]_{k}\\ &= \frac{\partial}{\partial I_{m}^{(L)}} \frac{\exp[I_{k}^{(L)}]}{\sum_{n=1}^{K}\exp[I_{n}^{(L)}]}\\ &= \begin{cases} -\frac{\exp[I_{k}^{(L)}]}{\sum_{n=1}^{K}\exp[I_{n}^{(L)}]} \frac{\exp[I_{m}^{(L)}]}{\sum_{n=1}^{K}\exp[I_{n}^{(L)}]} & \text{if } k \neq m \\ \frac{\exp[I_{k}^{(L)}]}{\sum_{n=1}^{K}\exp[I_{n}^{(L)}]} - \frac{\exp[I_{k}^{(L)}]} {\sum_{n=1}^{K}\exp[I_{n}^{(L)}]} \frac{\exp[I_{m}^{(L)}]}{\sum_{n=1}^{K}\exp[I_{n}^{(L)}]} & \text{otherwise} \end{cases}\\ &= \begin{cases} -\sigma_{k}\sigma_{m}& \text{if } k \neq m \\ \sigma_k(1 - \sigma_m) & \text{otherwise} \end{cases}\\ &= \sigma_k(\delta_{km} - \sigma_m) \quad \text{where} \, \delta_{km} \, \text{is the Kronecker's Delta}\\ \frac{\partial I_{m}^{(L)}}{\partial w_{ij}^{(L)}} &= \delta_{mi}Z_{j}^{(L-1)}\\ \implies \frac{\partial \mathcal{L}}{\partial w_{ij}^{(L)}} &= -\sum_{k=1}^{K}\frac{Y_{k}}{Z_{k}^{(L)}} \sum_{m=1}^{K} Z_{m}^{(L)}(\delta_{km} - Z_{m}^{(L)}) \delta_{mi}Z_{j}^{(L-1)}\\ &= -Z_{j}^{(L-1)} \sum_{k=1}^{K}Y_{k} (\delta_{ki} - Z_{i}^{(L)}) \\ &= Z_{j}^{(L-1)} (Z_{i}^{(L)}-Y_{i}), \end{aligned} $$

where we have used the fact that \(\sum _{k=1}^{K}Y_{k}=1\) in the last equality. Similarly for b (L), we have

$$\displaystyle \begin{aligned} \frac{\partial \mathcal{L}}{\partial b_{i}^{(L)}} &= \sum_{k=1}^{K}\frac{\partial \mathcal{L}}{\partial Z_{k}^{(L)}} \sum_{m=1}^{K}\frac{\partial Z_{k}^{(L)}}{\partial I_{m}^{(L)}} \frac{\partial I_{m}^{(L)}}{\partial b_{i}^{(L)}}\\ &= Z_{i}^{(L)}-Y_{i} \end{aligned} $$

It follows that

$$\displaystyle \begin{aligned} \nabla_{{\mathbf{b}}^{(L)}}\mathcal{L} &= Z^{(L)}-Y\\ \nabla_{W^{(L)}}\mathcal{L} &= \nabla_{{\mathbf{b}}^{(L)}}\mathcal{L} \otimes {Z^{(L-1)}}, \end{aligned} $$

where ⊗ denotes the outer product.

Now for the gradient of \(\mathcal {L}\) w.r.t. W (L−1) we have

$$\displaystyle \begin{aligned} \frac{\partial \mathcal{L}}{\partial w_{ij}^{(L-1)}} &= \sum_{k=1}^{K}\frac{\partial \mathcal{L}}{\partial Z_{k}^{(L)}} \frac{\partial Z_{k}^{(L)}}{\partial w_{ij}^{(L-1)}}\\ &= \sum_{k=1}^{K}\frac{\partial \mathcal{L}}{\partial Z_{k}^{(L)}} \sum_{m=1}^{K}\frac{\partial Z_{k}^{(L)}}{\partial I_{m}^{(L)}} \sum_{n=1}^{n^{(L-1)}} \frac{\partial I_{m}^{(L)}}{\partial Z_{n}^{(L-1)}} \sum_{p=1}^{n^{(L-1)}} \frac{\partial Z_{n}^{(L-1)}}{\partial I_{p}^{(L-1)}} \frac{\partial I_{p}^{(L-1)}}{\partial w_{ij}^{(L-1)}} \end{aligned} $$

If we assume that \(\sigma^{(\ell)}(x) = \mathrm{sigmoid}(x)\), \(\ell \in \{1, \dots, L-1\}\), then

$$\displaystyle \begin{aligned} \frac{\partial I_{m}^{(L)}}{\partial Z_{n}^{(L-1)}} &= w_{mn}^{(L)}\\ \frac{\partial Z_{n}^{(L-1)}}{\partial I_{p}^{(L-1)}} &= \frac{\partial}{\partial I_{p}^{(L-1)}}\bigg(\frac{1}{1+\exp(-I_{n}^{(L-1)})}\bigg)\\ &= \frac{1}{1+\exp(-I_{n}^{(L-1)})} \frac{\exp(-I_{n}^{(L-1)})}{1+\exp(-I_{n}^{(L-1)})} \, \delta_{np} \\ &= Z_{n}^{(L-1)} (1-Z_{n}^{(L-1)}) \, \delta_{np} = \sigma^{(L-1)}_n(1-\sigma^{(L-1)}_n)\delta_{np} \\ \frac{\partial I_{p}^{(L-1)}}{\partial w_{ij}^{(L-1)}} &= \delta_{pi} Z_{j}^{(L-2)} \\ \implies \frac{\partial \mathcal{L}}{\partial w_{ij}^{(L-1)}} &= -\sum_{k=1}^{K}\frac{Y_{k}}{Z_{k}^{(L)}} \sum_{m=1}^{K}Z_{k}^{(L)}(\delta_{km} - Z_{m}^{(L)})\\ &\qquad \sum_{n=1}^{n^{(L-1)}} w_{mn}^{(L)} \sum_{p=1}^{n^{(L-1)}} Z_{n}^{(L-1)} (1-Z_{n}^{(L-1)}) \, \delta_{np} \delta_{pi} Z_{j}^{(L-2)} \\ &= -\sum_{k=1}^{K}Y_{k} \sum_{m=1}^{K}(\delta_{km} - Z_{m}^{(L)}) \sum_{n=1}^{n^{(L-1)}} w_{mn}^{(L)} Z_{n}^{(L-1)} (1-Z_{n}^{(L-1)}) \, \delta_{ni} Z_{j}^{(L-2)} \\ &= -\sum_{k=1}^{K}Y_{k} \sum_{m=1}^{K}(\delta_{km} - Z_{m}^{(L)}) w_{mi}^{(L)} Z_{i}^{(L-1)} (1-Z_{i}^{(L-1)}) Z_{j}^{(L-2)} \\ &= -Z_{j}^{(L-2)}Z_{i}^{(L-1)}(1-Z_{i}^{(L-1)}) \sum_{m=1}^{K} w_{mi}^{(L)} \sum_{k=1}^{K}(\delta_{km}Y_{k} - Z_{m}^{(L)}Y_{k}) \\ &= Z_{j}^{(L-2)}Z_{i}^{(L-1)} (1-Z_{i}^{(L-1)}) \sum_{m=1}^{K} w_{mi}^{(L)} (Z_{m}^{(L)} - Y_{m}) \\ &= Z_{j}^{(L-2)}Z_{i}^{(L-1)} (1-Z_{i}^{(L-1)}) (Z^{(L)} - Y)^{T} {\mathbf{w}}_{,i}^{(L)} \\ \end{aligned} $$

Similarly we have

$$\displaystyle \begin{aligned}\frac{\partial \mathcal{L}}{\partial b_{i}^{(L-1)}} = Z_{i}^{(L-1)} (1-Z_{i}^{(L-1)}) (Z^{(L)} - Y)^{T} {\mathbf{w}}_{,i}^{(L)}.\end{aligned}$$

It follows that we can define the following recursion relation for the loss gradient:

$$\displaystyle \begin{aligned} \nabla_{b^{(L-1)}}\mathcal{L} &= Z^{(L-1)} \circ (\mathbf{1}-Z^{(L-1)}) \circ ({W^{(L)}}^{T} \nabla_{b^{(L)}}\mathcal{L}) \\ \nabla_{W^{(L-1)}}\mathcal{L} &= \nabla_{b^{(L-1)}}\mathcal{L} \otimes Z^{(L-2)}\\ & = \left[Z^{(L-1)} \circ (\mathbf{1}-Z^{(L-1)}) \circ ({W^{(L)}}^{T} \nabla_{b^{(L)}}\mathcal{L})\right] \otimes Z^{(L-2)}, \end{aligned} $$

where ∘ denotes the Hadamard product (element-wise multiplication). This recursion relation generalizes to all layers and to any choice of activation function. To see this, define the back-propagation error \(\delta ^{(\ell )}:=\nabla _{b^{(\ell )}}\mathcal {L}\); since, for the sigmoid activation,

$$\displaystyle \begin{aligned} \left[\frac{\partial \sigma^{(\ell)}}{\partial I^{(\ell)}}\right]_{ij}&=\frac{\partial \sigma_i^{(\ell)}}{\partial I_j^{(\ell)}}\\ &=\sigma_i^{(\ell)}(1-\sigma_i^{(\ell)})\delta_{ij} \end{aligned} $$

or equivalently in matrix–vector form

$$\displaystyle \begin{aligned}\nabla_{I^{(\ell)}} \sigma^{(\ell)}=\text{diag}(\sigma^{(\ell)} \circ (\mathbf{1}-\sigma^{(\ell)})),\end{aligned}$$

we can write, in general, for any choice of activation function for the hidden layer,

$$\displaystyle \begin{aligned}\delta^{(\ell)}=\nabla_{I^{(\ell)}} \sigma^{(\ell)}(W^{(\ell+1)})^T\delta^{(\ell+1)},\end{aligned}$$

and

$$\displaystyle \begin{aligned}\nabla_{W^{(\ell)}}\mathcal{L} = \delta^{(\ell)} \otimes Z^{(\ell-1)}.\end{aligned}$$
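This recursion translates directly into code. The following NumPy sketch (our own, consuming the list of layer outputs produced by the forward-pass sketch above and assuming sigmoid hidden layers) sets \(\delta^{(L)} = Z^{(L)} - Y\) and propagates it backwards to obtain all weight and bias gradients:

```python
import numpy as np

def backprop(Z, Y, weights):
    """Gradients of the cross-entropy loss, given the layer outputs
    Z = [Z^(0)=X, Z^(1), ..., Z^(L)] from a forward pass, the one-hot
    label Y, and the list of weight matrices W^(1), ..., W^(L)."""
    L = len(weights)
    grads_W, grads_b = [None] * L, [None] * L

    delta = Z[L] - Y                         # delta^(L) = Z^(L) - Y
    for l in range(L - 1, -1, -1):           # python index l <-> layer l+1
        grads_b[l] = delta                   # nabla_b^(l+1) L = delta^(l+1)
        grads_W[l] = np.outer(delta, Z[l])   # nabla_W^(l+1) L = delta^(l+1) (x) Z^(l)
        if l > 0:                            # sigmoid Jacobian: Z o (1 - Z)
            delta = Z[l] * (1.0 - Z[l]) * (weights[l].T @ delta)
    return grads_W, grads_b
```

A gradient-descent step is then `weights[l] -= gamma * grads_W[l]` and `biases[l] -= gamma * grads_b[l]`, matching the update rules stated above.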

1.3 Proof of Theorem 4.2

Using the same deep structure shown in Fig. 4.9, Liang and Srikant (2016) find the binary expansion sequence \(\{x_0, \dots, x_n\}\). In this step, they use n binary step units in total. Then they rewrite \(g_{m+1}\left(\sum _{i=0}^{n}\frac {x_{i}}{2^{i}}\right)\),

$$\displaystyle \begin{aligned} g_{m+1}\left(\sum_{i=0}^{n}\frac{x_{i}}{2^{i}}\right)&=\sum_{j=0}^{n}\left[x_{j}\cdot\frac{1}{2^{j}}g_{m}\left(\sum_{i=0}^{n}\frac{x_{i}}{2^{i}}\right)\right]\\ &=\sum_{j=0}^{n}\max\left[2(x_{j}-1)+\frac{1}{2^j}g_{m}\left(\sum_{i=0}^{n}\frac{x_{i}}{2^{i}}\right),0\right].{} \end{aligned} $$
(4.57)

Clearly, Eq. 4.57 defines an iteration between the outputs of neighboring layers. Define the output of the multilayer neural network as \(\hat {f}(x)=\sum _{i=0}^{p}a_{i}g_{i}\left (\sum _{j=0}^{n}\frac {x_{j}}{2^{j}}\right ).\) For this multilayer network, the approximation error is

$$\displaystyle \begin{aligned} |f(x)-\hat{f}(x)|&=\left|\sum_{i=0}^{p}a_{i}g_{i}\left(\sum_{j=0}^{n}\frac{x_{j}}{2^{j}}\right)-\sum_{i=0}^{p}a_{i}x^{i}\right|\\ &\le \sum_{i=0}^{p}\left[|a_{i}|\cdot\left|g_{i}\left(\sum_{j=0}^{n}\frac{x_{j}}{2^{j}}\right)-x^{i}\right|~\right]\le\frac{p}{2^{n-1}}. \end{aligned} $$

This indicates that, to achieve an ε-approximation error, one should choose \(n=\left \lceil \log \frac {p}{\varepsilon }\right \rceil +1\). Moreover, since \(\mathcal {O}(n+p)\) layers with \(\mathcal {O}(n)\) binary step units and \(\mathcal {O}(pn)\) ReLU units are used in total, this multilayer neural network has \(\mathcal {O}\left (p+\log \frac {p}{\varepsilon }\right )\) layers, \(\mathcal {O}\left (\log \frac {p}{\varepsilon }\right )\) binary step units, and \(\mathcal {O}\left (p\log \frac {p}{\varepsilon }\right )\) ReLU units.
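As a quick sanity check of the bound (our own arithmetic sketch, taking the logarithm to be base 2, consistent with the binary expansion, and with illustrative values of p and ε):

```python
import math

p, eps = 8, 1e-3                        # polynomial degree and target error
n = math.ceil(math.log2(p / eps)) + 1   # n = ceil(log2(p / eps)) + 1
print(n, p / 2 ** (n - 1) <= eps)       # the error bound p / 2^(n-1) <= eps holds
```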

1.4 Proof of Lemmas from Telgarsky (2016)

Proof (of 4.1)

Let \(\mathcal{I}_f\) denote the partition of \(\mathbb {R}\) corresponding to f, and \(\mathcal{I}_g\) denote the partition of \(\mathbb {R}\) corresponding to g.

First consider f + g, and moreover any intervals \(U_f \in \mathcal{I}_f\) and \(U_g \in \mathcal{I}_g\). Necessarily, f + g has a single slope along \(U_f \cap U_g\). Consequently, f + g is \(|\mathcal{I}|\)-sawtooth, where \(\mathcal{I}\) is the set of all intersections of intervals from \(\mathcal{I}_f\) and \(\mathcal{I}_g\), meaning \(\mathcal{I} := \{U_f \cap U_g : U_f \in \mathcal{I}_f,\, U_g \in \mathcal{I}_g\}\). By sorting the left endpoints of elements of \(\mathcal{I}_f\) and \(\mathcal{I}_g\), it follows that \(|\mathcal{I}| \leq k + l\) (the other intersections are empty).

For example, consider the example in Fig. 4.11 with partitions given in Table 4.2. The set of all intersections of intervals from \(\mathcal{I}_f\) and \(\mathcal{I}_g\) contains 3 elements:

$$\displaystyle \begin{aligned} \mathcal{I}=\left\{\left[0,\tfrac{1}{4}\right] \cap \left[0,\tfrac{1}{2}\right],\; \left(\tfrac{1}{4},1\right] \cap \left[0,\tfrac{1}{2}\right],\; \left(\tfrac{1}{4},1\right] \cap \left(\tfrac{1}{2},1\right]\right\} \end{aligned} $$
(4.58)
Table 4.2 Definitions of the functions f(x) and g(x)

Now consider f ∘ g, and in particular consider the image \(f(g(U_g))\) for some interval \(U_g \in \mathcal{I}_g\). g is affine with a single slope along \(U_g\); therefore, f is being considered along a single unbroken interval \(g(U_g)\). However, nothing prevents \(g(U_g)\) from hitting all the elements of \(\mathcal{I}_f\); since \(U_g\) was arbitrary, it holds that f ∘ g is \((|\mathcal{I}_f| \cdot |\mathcal{I}_g|)\)-sawtooth. □
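The counting argument can be illustrated numerically (a minimal sketch; the two 2-sawtooth examples and the slope-counting helper are our own). We count the affine pieces of f + g and f ∘ g on a fine grid and compare them with the bounds k + l and k · l:

```python
import numpy as np

# Two simple 2-sawtooth functions on [0, 1]: f peaks at 1/2, g peaks at 1/4.
f = lambda x: np.minimum(2 * x, 2 - 2 * x)
g = lambda x: np.where(x <= 0.25, 4 * x, (4 - 4 * x) / 3)

def count_pieces(h, a=0.0, b=1.0, n=200001):
    """Approximate the number of affine pieces of h on [a, b] by counting
    jumps in the discrete slope between neighboring grid intervals."""
    x = np.linspace(a, b, n)
    slopes = np.diff(h(x)) / np.diff(x)
    return 1 + int(np.sum(np.abs(np.diff(slopes)) > 1e-6))

k, l = count_pieces(f), count_pieces(g)                        # 2 and 2
print(count_pieces(lambda x: f(x) + g(x)), "pieces for f + g; bound k + l =", k + l)
print(count_pieces(lambda x: f(g(x))), "pieces for f o g; bound k * l =", k * l)
```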

Proof

Recall the notation \(\tilde f(x) := [f(x) \geq 1/2]\), whereby \(\mathcal {E}(f) := \frac {1}{n}\sum _i [y_i\neq \tilde f(x_i)]\). Since f is piecewise monotonic, with a corresponding partition of \(\mathbb{R}\) having at most t pieces, f has at most 2t − 1 crossings of 1/2: at most one within each interval of the partition, and at most one at the right endpoint of all but the last interval. Consequently, \(\tilde f\) is piecewise constant, where the corresponding partition of \(\mathbb {R}\) is into at most 2t intervals. This means that n points with alternating labels must land in 2t buckets; thus the total number of points landing in buckets with at least three points is at least n − 4t. □

1.5 Python Notebooks

The notebooks provided in the accompanying source code repository are designed to build insight using toy classification datasets. They provide examples of deep feedforward classification, back-propagation, and Bayesian network classifiers. Further details of the notebooks are included in the README.md file.


Copyright information

© 2020 Springer Nature Switzerland AG


Cite this chapter

Dixon, M.F., Halperin, I., Bilokon, P. (2020). Feedforward Neural Networks. In: Machine Learning in Finance. Springer, Cham. https://doi.org/10.1007/978-3-030-41068-1_4
