Data-Informed Modeling in the Health Sciences

Chapter in: Mathematics of Planet Earth. Part of the book series: Mathematics of Planet Earth (MPE, volume 5).

Abstract

The adoption of automation and technology by health professionals is triggering an explosion of databases and data streams in that sector. The emergence of this data torrent creates the pressing need to mine it for value, which in turn requires investment for the development of modeling and analysis tools. In view of this, dynamicists are presented with the terrific opportunity to enrich their discipline by supplying it with new tools, expanding its scope, and elevating its social impact. This chapter is written in that spirit, examining three concrete case studies encountered in the field: quantifying the salmonellosis risk posed by distinct food sources, assimilating genetic data into a dynamical model for avian influenza transmission, and statistically decontaminating gas chromatography/mass spectroscopy time series. We review available prototypical models and build on them guided by data and mathematical abstraction, demonstrating in the process how to root a model into data. This takes us quite naturally into the realm of probabilistic and statistical modeling and reopens a decades-old discussion on the role of discrete models in applied mathematics. We also touch briefly on the timely subject of mathematicians being employed as such outside math departments and attempt a short outlook on their prospects and opportunities.


References

  1. Amari, S., Nagaoka, H.: Methods of Information Geometry. American Mathematical Society, Providence (2000). ISBN: 0-8218-0531-2

  2. Barto, A.G.: Discrete and continuous models. Int. J. Gen. Syst. 4(3), 163–177 (1978). https://doi.org/10.1080/03081077808960681

  3. Benaglia, T., Chauveau, D., Hunter, D.R., et al.: mixtools: an R package for analyzing mixture models. J. Stat. Softw. 32(6) (2010). https://doi.org/10.18637/jss.v032.i06

  4. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Berlin (2006). ISBN: 0-387-31073-8

  5. Boender, G.J., Hagenaars, T.J., Bouma, A., et al.: Risk maps for the spread of highly pathogenic avian influenza in poultry. PLoS Comput. Biol. 3(4), 704–712 (2007). https://doi.org/10.1371/journal.pcbi.0030071

  6. Box, G.E.P.: Science and statistics. J. Amer. Stat. Assoc. 71(356), 791–799 (1976). https://doi.org/10.1080/01621459.1976.10480949

  7. Bromham, L., Dinnage, R., Hua, X.: Interdisciplinary research has consistently lower funding success. Nature 534(7609) (2016). https://doi.org/10.1038/nature18315

  8. Busch, R., Neese, R.A., Awada, M., et al.: Measurement of cell proliferation by heavy water labeling. Nat. Prot. 2(12), 3045–3057 (2007). https://doi.org/10.1038/nprot.2007.420

  9. Council of the European Communities: Council Directive 2005/94/EC of 20 December 2005 on Community measures for the control of avian influenza and repealing Directive 92/40/EEC. Off. J. Eur. Union 49, L10/16–65 (2006). ISSN: 1725-2555

  10. Cox, D.R.: Principles of Statistical Inference. Cambridge University Press, Cambridge (2006). ISBN: 978-0-521-86673-6

  11. Cramér, H.: Mathematical Methods of Statistics. Princeton University Press, Princeton (1946)

  12. Dorado-García, A., Smid, J.H., van Pelt, W., et al.: Molecular relatedness of ESBL/AmpC-producing Escherichia coli from humans, animals, food and the environment: a pooled analysis. J. Antimicrob. Chemother. 73(2), 339–347 (2018). https://doi.org/10.1093/jac/dkx397

  13. Fisher, R.A.: Presidential address. Sankhyā Ind. J. Stat. 4(1), 14–17 (1938)

  14. GitHub repository. https://github.com/azagaris

  15. Gutenkunst, R.N., Waterfall, J.J., Casey, F.P., et al.: Universally sloppy parameter sensitivities in systems biology models. PLoS Comput. Biol. 3, 1871–1878 (2007). https://doi.org/10.1371/journal.pcbi.0030189

  16. Hald, T., Wegener, H.C.: Quantitative assessment of the sources of human salmonellosis attributable to pork. In: Proceedings of the 3rd ISECSP, pp. 200–205 (1999)

  17. Hald, T., Vose, D., Wegener, H.C., et al.: A Bayesian approach to quantify the contribution of animal–food sources to human salmonellosis. Risk Anal. 24, 255–269 (2004). https://doi.org/10.1111/j.0272-4332.2004.00427.x

  18. Hamming, R.W.: Toward a lean and lively calculus: report of the conference/workshop to develop curriculum and teaching methods for calculus at the college level. Am. Math. Mon. 95(5), 466–471 (1988). https://doi.org/10.1080/00029890.1988.11972034

  19. Karch, H., Denamur, E., Dobrindt, U., et al.: The enemy within us: lessons from the 2011 European Escherichia coli O104:H4 outbreak. EMBO Mol. Med. 4, 841–848 (2012). https://doi.org/10.1002/emmm.201201662

  20. Kermack, W.O., McKendrick, A.G.: A contribution to the mathematical theory of epidemics. Proc. R. Soc. A 115, 700–721 (1927). https://doi.org/10.1098/rspa.1927.0118

  21. Kimura, M.: Estimation of evolutionary distances between homologous nucleotide sequences. Proc. Natl. Acad. Sci. 78, 454–458 (1981)

  22. Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951). https://doi.org/10.1214/aoms/1177729694

  23. Pearl, J.: Causality: Models, Reasoning and Inference. Cambridge University Press, New York (2000). ISBN: 978-0521895606

  24. Raue, A., Kreutz, C., Maiwald, T., et al.: Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood. Bioinformatics 25, 1923–1929 (2009). https://doi.org/10.1093/bioinformatics/btp358

  25. Schervish, M.J.: Theory of Statistics. Springer, New York (1995). ISBN: 978-1-4612-8708-7

  26. Snow, J.: On the Mode of Communication of Cholera. John Churchill, London (1855)

  27. Sorg, L.: Forward-looking panel tackles issues of the Mathematics of Planet Earth. SIAM News Blog (2016)

  28. Stegeman, A., Bouma, A., Elbers, A.R.W., et al.: Avian Influenza A Virus (H7N7) epidemic in The Netherlands in 2003: course of the epidemic and effectiveness of control measures. J. Infect. Dis. 190(12), 2088–2095 (2004). https://doi.org/10.1086/425583

  29. Tan, C.Y., Iglewicz, B.: Measurement-methods comparisons and linear statistical relationship. Technometrics 41(3), 192–201 (1999). https://doi.org/10.1080/00401706.1999.10485668

  30. Tufte, E.R.: Visual Explanations: Images and Quantities, Evidence and Narrative. Graphics Press, Cheshire (1997). ISBN: 978-0961392123

  31. Vrisekoop, N., den Braber, I., de Boer, A.B., et al.: Sparse production but preferential incorporation of recently produced naïve T cells in the human peripheral pool. Proc. Natl. Acad. Sci. 105(16), 6115–6120 (2008). https://doi.org/10.1073/pnas.0709713105

  32. Waterfall, J.J., Casey, F.P., Gutenkunst, R.N., et al.: Sloppy-model universality class and the Vandermonde matrix. Phys. Rev. Lett. 97, 150601 (2006). https://doi.org/10.1103/PhysRevLett.97.150601

  33. Wilson, E.O.: Letters to a Young Scientist. Liveright, New York (2013). ISBN: 978-0871403858

  34. Zilversmit, D.B., Entenman, C., Fishler, M.C.: On the calculation of “turnover time” and “turnover rate” from experiments involving the use of labeling agents. J. Gen. Physiol. 26(3), 325–331 (1943)

Acknowledgements

The work in Sect. 6.3 was initiated and supervised by Gert-Jan Boender and Thomas Hagenaars (Bacteriology and Epidemiology, Wageningen University and Research). The work in Sect. 6.4 was initiated by and done in collaboration with Rob de Boer (Theoretical Biology and Bioinformatics, Utrecht University), José Borghans (University Medical Center Utrecht), Ad Koets, and Lars Ravesloot (Bacteriology and Epidemiology, Wageningen University and Research). The author thanks them dearly for opening up a world of scientific opportunity and scholarship to him.

Author information

Corresponding author

Correspondence to Antonios Zagaris.

Appendix: A Short Primer on Parameter Estimation

The fundamental belief underpinning any modeling endeavor is that system measurements can be approximately generated by a specific model. In general terms, inference uses such measurements to mitigate uncertainty present in the underlying model. In this short appendix, we assume a well-defined class of candidate models that differ only in particulars; our task is to locate among them the one that best fits the available measurements (data). Here, these models share a common functional form containing finitely many parameters, so we speak of a parametric family and parametric inference. Lifting the uncertainty surrounding the parameter values is the inferential task par excellence.

Parameter values can be inferred in various ways joined by a common thread. Typically, unknown values are obtained as solutions to an optimization problem involving the model class and the available data; in the problems treated here, those data consist of model outputs such as values of the dependent variables. For a deterministic model, a reasonable minimal requirement for an estimator would seemingly be self-consistency: given data generated by simulating the model with specific parameter values, a self-consistent estimator would return those precise parameter values, i.e., invert the simulation. Imposing that condition is reasonable, as long as distinct parameter values yield well-defined, distinct data (parameter identifiability [24]). However, the models treated in this chapter are probabilistic: specific parameter settings only have a certain probability of generating specific data. This makes the correspondence between parameter values and data both one-to-many and many-to-one, and it necessitates rethinking what can reasonably be expected from an estimator.
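To make the self-consistency requirement concrete, here is a minimal sketch in Python with a hypothetical one-parameter decay model chosen purely for illustration (the model, variable names, and observation time are all assumptions, not taken from the chapter): the estimator inverts the simulation exactly, so feeding it simulated data returns the generating parameter value.

```python
import math

# Hypothetical deterministic model, for illustration only: y(t) = exp(-k * t),
# observed at a single time t.
def simulate(k, t):
    return math.exp(-k * t)

# A self-consistent estimator inverts the simulation exactly.
def estimate(y, t):
    return -math.log(y) / t

k_true, t_obs = 0.7, 2.0
y = simulate(k_true, t_obs)
k_hat = estimate(y, t_obs)
print(abs(k_hat - k_true) < 1e-12)  # True: the simulation is inverted (up to round-off)
```

For a probabilistic model no such exact inversion exists, which is precisely the point of the discussion above.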

To address this problem, we start with univariate r.v.s X 1, …, X N defined on a common sample space Ω and having distributions \(f_{X_1},\ldots ,f_{X_N}\). We then write \(X = (X_1,\ldots ,X_N) : \varOmega \to \mathcal {X}\) for the multivariate r.v. collecting them, and we recognize \(\mathcal {X}\) as the space where the data resides. This data space is equipped with an induced joint probability distribution \(f_X\), and each point x = (x 1, …, x N) in it corresponds to a full set of system measurements. In general, this joint distribution does not follow trivially from the marginals \(f_{X_1},\ldots ,f_{X_N}\); determining it may be a sizable part of the modeling process, and a closed-form expression may be out of reach if the problem does not possess additional structure. A favorable case occurs when X 1, …, X N are mutually independent, as f X then has the product decomposition \(f_X(x) = \prod _{n=1}^N f_{X_n}(x_n)\); another, trivial case occurs when the r.v. components are algebraically constrained. Often, neither is true and modeling f X is nontrivial. As a concrete example, the reader should derive the sampling distribution of \(\bar{X} = \sum _{n=1}^N X_n/N\) (the sample mean) corresponding to i.i.d. Gaussian r.v.s X 1, …, X N.
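The closing exercise can at least be checked numerically. A Monte Carlo sketch in plain Python (standard library only; the parameter values are arbitrary choices for illustration): for i.i.d. Gaussian X 1, …, X N with mean μ and standard deviation σ, the sample mean should again be Gaussian, with mean μ and variance σ²/N.

```python
import random
import statistics

# Monte Carlo check: the sample mean of N i.i.d. Gaussians with mean mu and
# s.d. sigma has mean mu and variance sigma^2 / N.
random.seed(0)
mu, sigma, N, reps = 1.5, 2.0, 25, 50_000

# Each replicate draws a fresh sample of size N and records its mean.
means = [statistics.fmean(random.gauss(mu, sigma) for _ in range(N))
         for _ in range(reps)]

emp_mean = statistics.fmean(means)
emp_var = statistics.variance(means)
print(emp_mean, emp_var)  # close to mu = 1.5 and sigma^2 / N = 0.16
```

The full derivation (that the sampling distribution is itself Gaussian, not merely matching these two moments) follows from the closure of the Gaussian family under linear combinations.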

We now assume that f X depends on a set of parameters Θ = (θ 1, …, θ M) ∈ Δ and write f X|Θ(⋅|θ) to reflect this. The parameter values θ are the subject of inference, i.e., of mapping data to parameter values by means of an estimator \(\hat {\varTheta } : \mathcal {X} \to \varDelta \). This function will unambiguously (i.e., deterministically) map specific data to specific parameter values without recourse to the parameter values that generated the data. It is in this sense that parameter estimation reverse-engineers data generation. To proceed intelligently with estimator design, we note that parameter values generate data probabilistically—by sampling f X|Θ(⋅|θ)—but \(\hat {\varTheta }\) maps these to parameter estimates deterministically. The combination of sampling and estimation is therefore probabilistic in nature, meaning that a fixed set of parameter values generates different data and thus gives rise to various estimates of those values. In fact, the composite map \(\hat {\varTheta } \circ X : \varOmega \to \varDelta \) is a transformed version of X and hence automatically an r.v. in its own right. Indeed, any measurable set U in parameter space Δ is assigned the measure of its pre-image \(\hat {\varTheta }^{-1}(U)\) in data space \(\mathcal {X}\) which, in turn, inherits that of \(X^{-1}(\hat {\varTheta }^{-1}(U))\) in sample space Ω.

Being an r.v., the estimator is distributed according to some sampling distribution \(f_{\hat {\varTheta }\vert \varTheta }\) that depends on the unknown parameter values. This observation suggests adapting the deterministic notion of self-consistency to the notion of an unbiased estimator, which amounts to demanding that

$$\displaystyle \begin{aligned} \int_{\varDelta} \hat{\theta} \, f_{\hat{\varTheta}\vert\varTheta}(\hat{\theta}\vert\theta) \, \mathrm{d}\hat{\theta} \ = \ \int_{\mathcal{X}} \hat{\varTheta}(x) \, f_{X\vert\varTheta}(x\vert\theta) \, \mathrm{d}x \ = \ \theta , \quad \theta\in\varDelta . \end{aligned} $$
(6.4.11)

If this condition holds, then the expected parameter estimates match the true parameter values, i.e., the estimator is correct on average although individual estimates inevitably deviate from the truth. That deviation can be quantified (again on average) using the variance of \(f_{\hat {\varTheta }\vert \varTheta }\), which one would like to keep as low as possible; note that some variance is inevitable, see the Cramér–Rao bound [11]. These notions of estimator bias and variance permeate estimation theory fundamentally. For example, the aforementioned variance bound links to information theory and geometry [1], whereas modern machine learning work often involves biased estimators that trade off accuracy for precision.
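A standard textbook illustration of estimator bias, not taken from the chapter but consistent with condition (6.4.11): for Gaussian data, the maximum likelihood variance estimator divides the sum of squared deviations by N and is biased low, with expectation (N − 1)σ²/N, while Bessel's correction (dividing by N − 1) removes the bias. A quick simulation in Python:

```python
import random
import statistics

# Bias demonstration: with sigma^2 = 4 and N = 5, the /N estimator has
# expectation (N-1)/N * sigma^2 = 3.2, while the /(N-1) estimator has
# expectation sigma^2 = 4.0.
random.seed(1)
sigma, N, reps = 2.0, 5, 100_000

biased, unbiased = [], []
for _ in range(reps):
    x = [random.gauss(0.0, sigma) for _ in range(N)]
    m = statistics.fmean(x)
    ss = sum((xi - m) ** 2 for xi in x)
    biased.append(ss / N)          # MLE-style estimator: biased low
    unbiased.append(ss / (N - 1))  # Bessel's correction: unbiased

print(statistics.fmean(biased), statistics.fmean(unbiased))  # near 3.2 and 4.0
```

Individual estimates scatter widely around these averages; that scatter is exactly the estimator variance bounded from below by the Cramér–Rao inequality.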

In our work in this chapter, we employed the likelihood L(θ|x) = f X|Θ(x|θ) with which parameter values θ ∈ Δ generate given data \(x\in \mathcal {X}\). We specifically used the maximum likelihood estimator (MLE),

$$\displaystyle \begin{aligned} \hat{\varTheta}(x) = \mathrm{arg} \, \max_\theta L(\theta \vert x) = \mathrm{arg} \, \max_\theta f_{X\vert\varTheta}(x\vert\theta) , \quad x\in\mathcal{X} . {} \end{aligned} $$
(6.4.12)

In words, the estimate for the parameter value generating given data is the value maximizing the probability (likelihood) of generating that data. The evident circularity in this statement reflects the fact that sampling and inference run in opposite directions. Note that neither existence nor uniqueness of the MLE is automatic (nor universal) and that the MLE is often biased. However, if X 1, …, X N are i.i.d. and N →∞, then \(f_{\hat {\varTheta }\vert \varTheta }(\cdot \vert \theta )\) is approximately Gaussian and centered at θ by the central limit theorem (CLT). For more detailed introductions to parameter inference at two different levels, we refer the reader to [10, 25].
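As a self-contained sketch of (6.4.12), using an exponential model chosen for this illustration rather than one from the chapter: for i.i.d. exponential data with rate θ, the likelihood \(L(\theta \vert x) = \prod _{n=1}^N \theta \, e^{-\theta x_n}\) is maximized at the closed-form MLE \(\hat {\theta } = 1/\bar {x}\), which we confirm against a brute-force scan of the log-likelihood.

```python
import math
import random
import statistics

# MLE sketch, assuming i.i.d. exponential data with rate theta:
# log L(theta | x) = N*log(theta) - theta * sum(x), maximized at 1 / mean(x).
random.seed(2)
theta_true, N = 2.0, 10_000
x = [random.expovariate(theta_true) for _ in range(N)]
sx = sum(x)  # precompute the sufficient statistic

def log_lik(theta):
    return N * math.log(theta) - theta * sx

theta_closed = 1.0 / statistics.fmean(x)        # closed-form MLE
grid = [0.5 + 0.001 * i for i in range(3000)]   # scan theta over [0.5, 3.5)
theta_grid = max(grid, key=log_lik)             # brute-force maximizer

print(theta_closed, theta_grid)  # both near theta_true = 2.0
```

Repeating the experiment with fresh data and histogramming the resulting estimates would also exhibit the approximate Gaussian sampling distribution promised by the CLT.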

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Zagaris, A. (2019). Data-Informed Modeling in the Health Sciences. In: Kaper, H., Roberts, F. (eds) Mathematics of Planet Earth. Mathematics of Planet Earth, vol 5. Springer, Cham. https://doi.org/10.1007/978-3-030-22044-0_6
