Data science and the art of modelling

Lettera Matematica

Abstract

Data-centric enthusiasm is growing across a variety of domains. Whilst data science asks unquestionably exciting scientific questions, we argue that its contributions should not be extrapolated beyond the scientific context in which they originate. In particular, we suggest that the simple-minded idea that data can replace scientific modelling is not tenable. By recalling some well-known examples from dynamical systems, we conclude that data science performs at its best when coupled with the subtle art of modelling.

Notes

  1. See http://www.tylervigen.com/spurious-correlations for sources, as well as [3].

  2. See for instance the World Economic Forum article “A brief history of big data everyone should read”, https://www.weforum.org/agenda/2015/02/a-brief-history-of-big-data-everyone-should-read/. For a critical point of view, see instead the book by Viktor Mayer-Schönberger and Thomas Ramge [20].

  3. This is the motivation given by Volterra for his study [24]:

    Dr. Umberto D’Ancona had repeatedly discussed with me statistics he was collecting about fishing during the war and in the periods before and after it, asking me if it were possible to give a mathematical explanation of the results he was obtaining on the percentages of the various species in these different periods. This request led me to pose the problem as I do in these pages and to solve it by establishing various laws whose statement can be found here.

References

  1. Anderson, C.: The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. Wired (2008) http://www.wired.com/2008/06/pb-theory/

  2. Bacaër, N.: Histoire de Mathématiques et de Populations. Cassini, Paris (2008)

  3. Calude, C.S., Longo, G.: The deluge of spurious correlations in big data. Found. Sci. 21, 1 (2016)

  4. Cecconi, F., Cencini, M., Falcioni, M., Vulpiani, A.: The prediction of future from the past: an old problem from a modern perspective. Am. J. Phys. 80, 1001–1008 (2012)

  5. Cecconi, F., Cencini, M., Sylos Labini, F.: Si può prevedere il futuro? Scienze 538, 32–35 (2013)

  6. Dahan Dalmedico, A.: History and epistemology of models: meteorology as a case study. Arch. Hist. Exact Sci. 55, 395–422 (2001)

  7. Giaquinta, M., Hosni, H.: Mathematics in the social sciences: reflections on the theory of social choice and welfare. Lett. Mat. Int. Ed. 3, 101–109 (2015)

  8. Guerraggio, A.: 15 Grandi Idee Matematiche. Bruno Mondadori, Milano (2013)

  9. Guerraggio, A., Paoloni, G.: Vito Volterra. Franco Muzzio Editore, Roma (2008)

  10. Hey, T., Tansley, S., Tolle, K. (eds.): The Fourth Paradigm: Data Intensive Scientific Discovery. Microsoft Research, Redmond (2009)

  11. Hosni, H., Vulpiani, A.: Forecasting in the light of Big Data. Philos. Technol. (2017). https://doi.org/10.1007/s13347-017-0265-3

  12. Kac, M.: On the notion of recurrence in discrete stochastic processes. Bull. Am. Math. Soc. 53, 1002–1010 (1947)

  13. Licitra, L., Trama, A., Hosni, H.: Benefits and risks of machine learning decision support systems. JAMA 318(23), 2354 (2017). https://doi.org/10.1001/jama.2017.16627

  14. Lorenz, E.N.: Deterministic nonperiodic flow. J. Atmos. Sci. 20, 130–148 (1963)

  15. Lorenz, E.N.: Atmospheric predictability as revealed by naturally occurring analogues. J. Atmos. Sci. 26, 636–646 (1969)

  16. Lorenz, E.N.: Three approaches to atmospheric predictability. Bull. Am. Meteorol. Soc. 50, 345–349 (1969)

  17. Lotka, A.J.: Analytical note on certain rhythmic relations in organic systems. Proc. Natl. Acad. Sci. 6, 410–415 (1920)

  18. Lynch, P.: The Emergence of Numerical Weather Prediction: Richardson’s Dream. Cambridge University Press, Cambridge (2006)

  19. Ma, S.K.: Statistical Mechanics. World Scientific, Singapore (1985)

  20. Mayer-Schönberger, V., Ramge, T.: Reinventing Capitalism in the Age of Big Data. Basic Books, New York (2018)

  21. Onsager, L., Machlup, S.: Fluctuations and irreversible processes. Phys. Rev. 91, 1505–1512 (1953)

  22. Popkin, G.: A twisted path to equation-free prediction. Quanta Mag. (2015)

  23. Rényi, A.: Dialogues on Mathematics. Holden-Day, San Francisco (1967)

  24. Volterra, V.: Variazioni e fluttuazioni del numero d’individui in specie animali conviventi. Memorie del R. Comitato talassografico italiano, Mem. CXXXI (1927). Also in: Volterra, V.: Opere matematiche: memorie e note. Vol. 5: 1926–1940. Accademia nazionale dei Lincei, Roma (1962)

  25. Vulpiani, A.: Lewis Fry Richardson: scientist, visionary and pacifist. Lett. Mat. Int. Ed. 2, 121–128 (2014)

  26. Weigend, A.S., Gershenfeld, N.A. (eds.): Time Series Prediction: Forecasting the Future and Understanding the Past. Addison-Wesley, Boston (1994)

  27. Ye, H., Beamish, R.J., Glaser, S.M., Grant, S.C.H., Hsieh, C., Richards, L.J., Schnute, J.T., Sugihara, G.: Equation-free mechanistic ecosystem forecasting using empirical dynamic modeling. Proc. Natl. Acad. Sci. 112(13), E1569–E1576 (2015). https://doi.org/10.1073/pnas.1417063112

Correspondence to Hykel Hosni or Angelo Vulpiani.

Appendices

Appendix 1: Lorenz’s model

A rather common problem in many applications is the following: given a nonlinear partial differential equation

$$\begin{aligned} \partial _t \psi (\mathbf{x},t) = \mathcal{L}[\psi (\mathbf{x},t), \nabla \psi (\mathbf{x},t), \Delta \psi (\mathbf{x},t)] \end{aligned}$$
(5)

where \(\psi\) is a vector field, we want to find a finite set of ordinary differential equations that approximates (5). As an example, we may consider the Navier–Stokes equations with \(\psi = (\mathbf{u}, \rho , p, T)\), where \(\mathbf{u}\), \(\rho\), \(p\) and \(T\) denote, respectively, the velocity field, density, pressure and temperature.

A widely used procedure (the so-called Galerkin method) consists in approximating \(\psi (\mathbf{x},t)\) in the form

$$\begin{aligned} \psi (\mathbf{x},t)=\sum _{n=1}^{N} a_n(t) \phi _n(\mathbf{x}), \end{aligned}$$
(6)

where \(\{\phi _n\}\) is a suitable complete set of orthonormal functions. Substituting (6) into (5) and projecting onto each \(\phi _n\), we obtain a set of ordinary differential equations for \(\{a_n\}\):

$$\begin{aligned} {d a_n \over dt}=F_n(a_1,a_2,\ldots , a_N)\quad n=1,2,\ldots ,N. \end{aligned}$$
(7)

Of course, if we want good quantitative agreement, then N has to be very large, so that the set of differential equations (7) is a good approximation of (5). This is what is done in meteorology or engineering, where the value of N easily reaches \(10^9\) or more.
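
The step from (6) to (7) can be made explicit. As a sketch, assume that the \(\phi _n\) are orthonormal with respect to an inner product \(\langle f, g \rangle = \int f(\mathbf{x})\, g(\mathbf{x})\, d\mathbf{x}\) over the domain of the problem (the inner product and the symbol \(\psi _N\) for the truncated field are notation introduced here, not taken from the text). Projecting (5) onto each basis function then gives

$$\begin{aligned} {d a_n \over dt} = \left\langle \phi _n, \mathcal{L}[\psi _N, \nabla \psi _N, \Delta \psi _N] \right\rangle \equiv F_n(a_1,a_2,\ldots , a_N), \qquad \psi _N(\mathbf{x},t)=\sum _{m=1}^{N} a_m(t) \phi _m(\mathbf{x}). \end{aligned}$$

For a quadratic nonlinearity, such as the advection term of the Navier–Stokes equations, each \(F_n\) is therefore a quadratic polynomial in the coefficients \(a_m\).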

In his famous 1963 paper [14], Lorenz, studying the problem of convection in a fluid heated from below, used for N the smallest value that could give non-periodic behaviour, that is \(N = 3\): in particular, one harmonic for the velocity field and two for the temperature. Here is his famous, deceptively innocuous model:

$$\begin{aligned} {dx \over dt}= -\sigma x + \sigma y, \quad {dy \over dt}= -xz+rx-y, \quad {dz \over dt}= xy -bz, \end{aligned}$$
(8)

where \((x, y, z)\) are proportional to \((a_1, a_2, a_3)\), and \(\sigma\), \(b\) and \(r\) are constants related to the properties of the fluid; in particular, \(r\) is proportional to the Rayleigh number.

It is important to emphasise that these equations were not invented, but were derived, albeit through a very drastic truncation, from the equations of fluid dynamics. Of course, the value \(N = 3\) does not allow quantitative agreement with the original equations.

The importance of Lorenz’s model lies in having shown that chaotic behaviour is possible even in low-dimensional systems: the complexity of the temporal evolution that occurs in turbulent fluids is not necessarily a mere superposition of many elementary events (say, many Fourier harmonics), but comes from the nonlinear structure of the equations.
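
The chaotic behaviour of (8) is easy to observe numerically. The following is a minimal sketch, not part of the original paper: it integrates (8) with a fourth-order Runge–Kutta scheme and follows two trajectories whose initial conditions differ by \(10^{-8}\). The parameter values \(\sigma = 10\), \(r = 28\), \(b = 8/3\) are the classical ones used by Lorenz; the initial condition, the time step and the function names are illustrative assumptions.

```python
import numpy as np


def lorenz_rhs(state, sigma=10.0, r=28.0, b=8.0 / 3.0):
    """Right-hand side of Eq. (8) with the classical parameter values."""
    x, y, z = state
    return np.array([-sigma * x + sigma * y,
                     -x * z + r * x - y,
                     x * y - b * z])


def rk4_step(f, state, dt):
    """One step of the classical fourth-order Runge-Kutta scheme."""
    k1 = f(state)
    k2 = f(state + 0.5 * dt * k1)
    k3 = f(state + 0.5 * dt * k2)
    k4 = f(state + dt * k3)
    return state + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)


def integrate(f, state0, dt, n_steps):
    """Return the trajectory as an array of shape (n_steps + 1, dim)."""
    traj = np.empty((n_steps + 1, len(state0)))
    traj[0] = state0
    for i in range(n_steps):
        traj[i + 1] = rk4_step(f, traj[i], dt)
    return traj


def separation_history(delta0=1e-8, dt=0.01, n_steps=4000):
    """Integrate two nearby initial conditions (illustrative choice) and
    return the reference trajectory and the distance between the two."""
    start = np.array([1.0, 1.0, 1.0])
    reference = integrate(lorenz_rhs, start, dt, n_steps)
    perturbed = integrate(lorenz_rhs, start + np.array([delta0, 0.0, 0.0]),
                          dt, n_steps)
    return reference, np.linalg.norm(reference - perturbed, axis=1)


if __name__ == "__main__":
    _, dist = separation_history()
    for step in (0, 1000, 2000, 3000, 4000):
        print(f"t = {step * 0.01:5.1f}   separation = {dist[step]:.3e}")
```

The separation grows roughly exponentially until it becomes comparable with the size of the attractor, while each trajectory remains bounded: three variables and two quadratic terms are enough to make long-term prediction impossible in practice.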

Appendix 2: The role of theory and of right variables in weather forecasting

In the 1950s, Charney and von Neumann, working on the Meteorological Project at the Institute for Advanced Study in Princeton, realised that the equations originally proposed by Richardson, although correct, are not suitable for weather forecasting. The apparently paradoxical reason is that they are too accurate, as mentioned above. It was thus necessary to construct effective equations that eliminate the fast variables. The introduction of the filtering procedure, which separates the meteorologically relevant part from the irrelevant one, has a clear practical advantage: numerical instabilities are less severe, and therefore a relatively large integration step \(\Delta t\) can be used, which makes the numerical calculations more efficient.

Besides the computational aspect, it is important to note that effective equations for the slow dynamics make it possible to identify the most important ingredients, which remain hidden in the detailed description of the system given by the original equations. The equations used are called quasi-geostrophic; the simplest case is the barotropic one, in which the pressure depends only on the horizontal coordinates.

To give an idea of the construction of effective equations for slow variables, consider a (rather academic) case in which the state of the system \(\mathbf{X}\) consists of slow variables \(\mathbf{X}_s\), with a characteristic time \(O(1)\), and fast variables \(\mathbf{X}_f\), with a characteristic time \(O(\epsilon ) \ll 1\), which evolve according to the set of differential equations

$$\begin{aligned} {d\mathbf{X}_s \over dt} = \mathbf{F}(\mathbf{X}_s, \mathbf{X}_f) \quad {d\mathbf{X}_f \over dt} = {1 \over \epsilon } \mathbf{G}(\mathbf{X}_s, \mathbf{X}_f). \end{aligned}$$
(9)

Note that even a numerical study of this problem is not simple: it would require an extremely small integration step \(\Delta t\), that is, one much smaller than \(\epsilon\).

On the other hand, if we are only interested in the slow variables, it is sufficient to write a closed equation for \(\mathbf{X}_s\) alone:

$$\begin{aligned} {d\mathbf{X}_s \over dt} = \mathbf{F}_{ eff}(\mathbf{X}_s), \end{aligned}$$

which takes into account the effect of the fast variables on the slow ones; in this way, we can use a \(\Delta t\) that is not too small.
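
As an illustration, consider a toy instance of (9) that is not taken from the paper: one slow and one fast variable with \(F = -X_f\) and \(G = X_s - X_f\). Here the fast variable relaxes to \(X_f \approx X_s\), so setting \(G = 0\) (adiabatic elimination, which works for this example but is not a general recipe) gives \(F_{eff}(X_s) = -X_s\). The sketch below, with the hypothetical value \(\epsilon = 10^{-3}\) and illustrative function names, compares the full system integrated with a step smaller than \(\epsilon\), the effective equation integrated with a step fifty times larger than \(\epsilon\), and the full system integrated with that same large step.

```python
import numpy as np

EPS = 1e-3  # time-scale separation epsilon (hypothetical value)


def full_rhs(state):
    """Toy instance of Eq. (9): F = -x_f and G = x_s - x_f."""
    x_s, x_f = state
    return np.array([-x_f, (x_s - x_f) / EPS])


def effective_rhs(state):
    """Effective slow equation: G = 0 gives x_f = x_s, hence F_eff = -x_s."""
    return -state


def rk4(f, state0, dt, t_end):
    """Integrate ds/dt = f(s) from t = 0 to t_end with fixed-step RK4."""
    state = np.array(state0, dtype=float)
    n_steps = int(round(t_end / dt))
    for _ in range(n_steps):
        k1 = f(state)
        k2 = f(state + 0.5 * dt * k1)
        k3 = f(state + 0.5 * dt * k2)
        k4 = f(state + dt * k3)
        state = state + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return state


def compare(t_end=1.0):
    """Return x_s(t_end) from the three integrations described above."""
    full_fine = rk4(full_rhs, [1.0, 1.0], EPS / 5.0, t_end)[0]
    effective_coarse = rk4(effective_rhs, [1.0], 0.05, t_end)[0]
    full_coarse = rk4(full_rhs, [1.0, 1.0], 0.05, t_end)[0]
    return full_fine, effective_coarse, full_coarse


if __name__ == "__main__":
    full_fine, effective_coarse, full_coarse = compare()
    print(f"full system, dt = eps/5 : x_s(1) = {full_fine:.6f}")
    print(f"effective eq., dt = 0.05: x_s(1) = {effective_coarse:.6f}")
    print(f"full system, dt = 0.05  : x_s(1) = {full_coarse:.3e}")
```

The first two results agree up to a difference of order \(\epsilon\), whereas the third integration is numerically unstable: the large step is only usable once the fast variable has been eliminated.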

Unfortunately, there are no systematic procedures for finding effective equations, and even separating the variables into slow and fast ones is not easy.

Finally, in the case studied by Charney and von Neumann, things are even more difficult since we have partial differential equations.

Perhaps the most famous example of the elimination of fast variables is given by the Langevin equation, which describes the motion of a colloidal particle in a liquid. Such particles are much larger than the liquid molecules, their size being of the order of microns, and much slower; the effect of the molecules reduces to a friction force and a fluctuating force (white noise).
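
For reference, in its standard textbook form for one velocity component \(v\) of a particle of mass \(m\), the Langevin equation reads as follows (the symbols \(m\), \(\gamma\), \(k_B\), \(T\) and \(\eta\) do not appear in the text above and follow the usual conventions: friction coefficient, Boltzmann constant, temperature of the liquid and unit white noise):

$$\begin{aligned} m {dv \over dt} = -\gamma v + \sqrt{2 \gamma k_B T}\, \eta (t), \quad \langle \eta (t) \rangle = 0, \quad \langle \eta (t) \eta (t') \rangle = \delta (t-t'). \end{aligned}$$

The fast molecular degrees of freedom enter only through the friction term and the noise, whose amplitude is fixed by requiring \(m \langle v^2 \rangle = k_B T\) at equilibrium.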

Translated from the Italian by Daniele A. Gewurz

Hosni, H., Vulpiani, A. Data science and the art of modelling. Lett Mat Int 6, 121–129 (2018). https://doi.org/10.1007/s40329-018-0225-5
