Abstract
The adoption of automation and technology by health professionals is triggering an explosion of databases and data streams in that sector. The emergence of this data torrent creates the pressing need to mine it for value, which in turn requires investment for the development of modeling and analysis tools. In view of this, dynamicists are presented with the terrific opportunity to enrich their discipline by supplying it with new tools, expanding its scope, and elevating its social impact. This chapter is written in that spirit, examining three concrete case studies encountered in the field: quantifying the salmonellosis risk posed by distinct food sources, assimilating genetic data into a dynamical model for avian influenza transmission, and statistically decontaminating gas chromatography/mass spectroscopy time series. We review available prototypical models and build on them guided by data and mathematical abstraction, demonstrating in the process how to root a model into data. This takes us quite naturally into the realm of probabilistic and statistical modeling and reopens a decades-old discussion on the role of discrete models in applied mathematics. We also touch briefly on the timely subject of mathematicians being employed as such outside math departments and attempt a short outlook on their prospects and opportunities.
References
Amari, S., Nagaoka, H.: Methods of Information Geometry. American Mathematical Society, Providence (2000). ISBN: 0-8218-0531-2
Barto, A.G.: Discrete and continuous models. Int. J. Gen. Syst. 4(3), 163–177 (1978). https://doi.org/10.1080/03081077808960681
Benaglia, T., Chauveau, D., Hunter, D.R., et al.: mixtools: an R package for analyzing mixture models. J. Stat. Softw. 32(6) (2010). https://doi.org/10.18637/jss.v032.i06
Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Berlin (2006). ISBN: 0-387-31073-8
Boender, G.J., Hagenaars, T.J., Bouma, A., et al.: Risk maps for the spread of highly pathogenic avian influenza in poultry. PLoS Comput. Biol. 3(4), 704–712 (2007). https://doi.org/10.1371/journal.pcbi.0030071
Box, G.E.P.: Science and statistics. J. Amer. Stat. Assoc. 71(356), 791–799 (1976). https://doi.org/10.1080/01621459.1976.10480949
Bromham, L., Dinnage, R., Hua, X.: Interdisciplinary research has consistently lower funding success. Nature 534(7609) (2016). https://doi.org/10.1038/nature18315
Busch, R., Neese, R.A., Awada, M., et al.: Measurement of cell proliferation by heavy water labeling. Nat. Prot. 2(12), 3045–3057 (2007). https://doi.org/10.1038/nprot.2007.420
Council of the European Communities: Council directive 2005/94/ec of 20 December 2005 on community measures for the control of avian influenza and repealing directive 92/40/eec. Off. J. Eur. Union 49, L10/16–65 (2006). ISSN: 1725-2555
Cox, D.R.: Principles of Statistical Inference. Cambridge University Press, Cambridge (2006). ISBN: 978-0-521-86673-6
Cramér, H.: Mathematical Methods of Statistics. Princeton University Press, Princeton (1946)
Dorado-García, A., Smid, J.H., van Pelt, W., et al.: Molecular relatedness of ESBL/AmpC-producing Escherichia coli from humans, animals, food and the environment: a pooled analysis. J. Antimicrob. Chemother. 73(2), 339–347 (2018). https://doi.org/10.1093/jac/dkx397
Fisher, R.A.: Presidential address. Sankhyā Ind. J. Stat. 4(1), 14–17 (1938)
GitHub repository. https://github.com/azagaris
Gutenkunst, R.N., Waterfall, J.J., Casey, F.P., et al.: Universally sloppy parameter sensitivities in systems biology models. PLoS Comp. Biol. 3, 1871–1878 (2007). https://doi.org/10.1371/journal.pcbi.0030189
Hald, T., Wegener, H.C.: Quantitative assessment of the sources of human salmonellosis attributable to pork. In: Proceedings of the 3rd ISECSP, pp. 200–205 (1999)
Hald, T., Vose, D., Wegener, H.C., et al.: A Bayesian approach to quantify the contribution of animal–food sources to human salmonellosis. Risk Anal. 24, 255–269 (2004). https://doi.org/10.1111/j.0272-4332.2004.00427.x
Hamming, R.W.: Toward a lean and lively calculus: report of the conference/workshop to develop curriculum and teaching methods for calculus at the college level. Am. Math. Mon. 95(5), 466–471 (1988). https://doi.org/10.1080/00029890.1988.11972034
Karch, H., Denamur, E., Dobrindt, U., et al.: The enemy within us: lessons from the 2011 European Escherichia coli O104:H4 outbreak. EMBO Mol. Med. 4, 841–848 (2012). https://doi.org/10.1002/emmm.201201662
Kermack, W.O., McKendrick, A.G.: A contribution to the mathematical theory of epidemics. Proc. R. Soc. A 115, 700–721 (1927). https://doi.org/10.1098/rspa.1927.0118
Kimura, M.: Estimation of evolutionary distances between homologous nucleotide sequences. Proc. Natl. Acad. Sci. 78, 454–458 (1981)
Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951). https://doi.org/10.1214/aoms/1177729694
Pearl, J.: Causality: Models, Reasoning and Inference. Cambridge University Press, New York (2000). ISBN: 978-0521895606
Raue, A., Kreutz, C., Maiwald, T., et al.: Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood. Bioinformatics 25, 1923–1929 (2009). https://doi.org/10.1093/bioinformatics/btp358
Schervish, M.J.: Theory of Statistics. Springer, New York (1995). ISBN: 978-1-4612-8708-7
Snow, J.: On the Mode of Communication of Cholera. John Churchill, London (1855)
Sorg, L.: Forward-looking panel tackles issues of the Mathematics of Planet Earth. SIAM News Blog (2016)
Stegeman, A., Bouma, A., Elbers, A.R.W., et al.: Avian Influenza A Virus (H7N7) epidemic in The Netherlands in 2003: course of the epidemic and effectiveness of control measures. J. Infect. Dis. 190(12), 2088–2095 (2004). https://doi.org/10.1086/425583
Tan, C.Y., Iglewicz, B.: Measurement-methods comparisons and linear statistical relationship. Technometrics 41(3), 192–201 (1999). https://doi.org/10.1080/00401706.1999.10485668
Tufte, E.R.: Visual Explanations: Images and Quantities, Evidence and Narrative. Graphics Press, Cheshire (1997). ISBN: 978-0961392123
Vrisekoop, N., den Braber, I., de Boer, A.B., et al.: Sparse production but preferential incorporation of recently produced naïve T cells in the human peripheral pool. Proc. Natl. Acad. Sci. 105(16), 6115–6120 (2008). https://doi.org/10.1073/pnas.0709713105
Waterfall, J.J., Casey, F.P., Gutenkunst, R.N., et al.: Sloppy-model universality class and the Vandermonde matrix. Phys. Rev. Lett. 97, 150601 (2006). https://doi.org/10.1103/PhysRevLett.97.150601
Wilson, E.O.: Letters to a Young Scientist. Liveright, New York (2013). ISBN: 978-0871403858
Zilversmit, D.B., Entenman, C., Fishler, M.C.: On the calculation of “turnover time” and “turnover rate” from experiments involving the use of labeling agents. J. Gen. Physiol. 26(3), 325–331 (1943)
Acknowledgements
The work in Sect. 6.3 was initiated and supervised by Gert-Jan Boender and Thomas Hagenaars (Bacteriology and Epidemiology, Wageningen University and Research). The work in Sect. 6.4 was initiated by and done in collaboration with Rob de Boer (Theoretical Biology and Bioinformatics, Utrecht University), José Borghans (University Medical Center Utrecht), Ad Koets, and Lars Ravesloot (Bacteriology and Epidemiology, Wageningen University and Research). The author thanks them dearly for opening up a world of scientific opportunity and scholarship to him.
Appendix: A Short Primer on Parameter Estimation
The fundamental belief underpinning any modeling endeavor is that system measurements can be approximately generated by a specific model. In general terms, inference uses such measurements to mitigate uncertainty present in the underlying model. In this short appendix, we assume a well-defined class of candidate models that differ only in particulars; our task is to locate among them the one that best fits the available measurements (data). Here, these models share a common functional form containing finitely many parameters, so we speak of a parametric family and parametric inference. Lifting the uncertainty surrounding the parameter values is the inferential task par excellence.
Parameter values can be inferred in various ways joined by a common thread. Typically, unknown values are obtained as solutions to an optimization problem involving the model class and available data; in the problems treated here, that data is model outputs such as values of the dependent variables. For a deterministic model, a reasonable minimal requirement for an estimator would seemingly be self-consistency: given data generated by simulating a model with specific parameter values, a self-consistent estimator would return those precise parameter values, i.e., invert the simulation. Imposing that condition is reasonable, as long as distinct parameter values yield well-defined, distinct data (parameter identifiability [24]). However, the models treated in this chapter are probabilistic: specific parameter settings only have a certain probability to generate specific data. This makes the correspondence between parameter values and data both one-to-many and many-to-one, and it necessitates rethinking what can be reasonably expected from an estimator.
To address this problem, we start with univariate r.v.s \(X_1,\ldots,X_N\) defined on a common sample space \(\varOmega\) and having distributions \(f_{X_1},\ldots ,f_{X_N}\). We then write \(X = (X_1,\ldots ,X_N) : \varOmega \to \mathcal {X}\) for the multivariate r.v. collecting them, and we recognize \(\mathcal {X}\) as the space where data resides. This data space is equipped with an induced joint probability distribution \(f_X\), and each point \(x = (x_1,\ldots,x_N)\) in it corresponds to a full set of system measurements. In general, this joint distribution does not follow trivially from the marginals \(f_{X_1},\ldots ,f_{X_N}\); determining it may be a sizable part of the modeling process, and a closed-form expression may be out of reach if the problem does not possess additional structure. A favorable case occurs when \(X_1,\ldots,X_N\) are pairwise independent, as \(f_X\) then has the product decomposition \(f_X(x) = \prod _{n=1}^N f_{X_n}(x_n)\); another, trivial case occurs when the r.v. components are algebraically constrained. Often, neither is true and modeling \(f_X\) is nontrivial. As a concrete example, the reader should derive the sampling distribution of the sample mean \(\bar{X} = \sum _{n=1}^N X_n/N\) corresponding to i.i.d. Gaussian r.v.s \(X_1,\ldots,X_N\).
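The closing exercise can also be checked numerically. The sketch below (not from the chapter; a minimal NumPy simulation with illustrative values of the mean, standard deviation, and sample size) draws many independent data sets of i.i.d. Gaussian r.v.s and confirms that the sample mean again has mean \(\mu\) and variance \(\sigma^2/N\):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N = 2.0, 3.0, 25   # illustrative parameter values
trials = 200_000              # number of independent data sets

# Each row is one data set of size N; the sample mean collapses each row.
samples = rng.normal(mu, sigma, size=(trials, N))
xbar = samples.mean(axis=1)

# Empirically, the sample mean has mean mu and variance sigma^2 / N.
print(xbar.mean())  # close to mu = 2.0
print(xbar.var())   # close to sigma^2 / N = 9/25 = 0.36
```

The full derivation (that the sample mean is itself Gaussian) follows from the closure of Gaussians under linear combinations; the simulation only verifies its first two moments.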
We now assume that \(f_X\) depends on a set of parameters \(\varTheta = (\theta_1,\ldots,\theta_M)\in\varDelta\) and write \(f_{X\vert\varTheta}(\cdot\vert\theta)\) to reflect this. The parameter values \(\theta\) are the subject of inference, i.e., of mapping data to parameter values by means of an estimator \(\hat {\varTheta } : \mathcal {X} \to \varDelta \). This function maps specific data unambiguously (i.e., deterministically) to specific parameter values, without recourse to the parameter values that generated the data. It is in this sense that parameter estimation reverse-engineers data generation. To proceed intelligently with estimator design, we note that parameter values generate data probabilistically, by sampling \(f_{X\vert\varTheta}(\cdot\vert\theta)\), but \(\hat {\varTheta }\) maps these to parameter estimates deterministically. The combination of sampling and estimation is therefore probabilistic in nature, meaning that a fixed set of parameter values generates different data and thus gives rise to various estimates of those values. In fact, the composite map \(\hat {\varTheta } \circ X : \varOmega \to \varDelta \) is a transformed version of \(X\) and hence automatically an r.v. in its own right. Indeed, any measurable set \(U\) in parameter space \(\varDelta\) is assigned the measure of its pre-image \(\hat {\varTheta }^{-1}(U)\) in data space \(\mathcal {X}\), which in turn inherits that of \(X^{-1}(\hat {\varTheta }^{-1}(U))\) in sample space \(\varOmega\).
Being an r.v., the estimator is distributed according to some sampling distribution \(f_{\hat {\varTheta }\vert \varTheta }\) that depends on the unknown parameter values. This observation suggests adapting the deterministic notion of self-consistency to that of an unbiased estimator, which amounts to demanding that
\[ \mathbb{E}\big[\hat{\varTheta}\,\big\vert\,\theta\big] \,=\, \int_{\mathcal{X}} \hat{\varTheta}(x)\, f_{X\vert\varTheta}(x\vert\theta)\,\mathrm{d}x \,=\, \theta \quad\text{for all}\ \theta\in\varDelta. \]
If this condition holds, then the expected parameter estimates match the true parameter values, i.e., the estimator is correct on average although individual estimates inevitably deviate from the truth. That deviation can be quantified (again on average) using the variance of \(f_{\hat {\varTheta }\vert \varTheta }\), which one would like to keep as low as possible; note that some variance is inevitable, see the Cramér–Rao bound [11]. These notions of estimator bias and variance permeate estimation theory fundamentally. For example, the aforementioned variance bound links to information theory and geometry [1], whereas modern machine learning work often involves biased estimators that trade off accuracy for precision.
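The bias–variance trade-off mentioned above can be made concrete with a standard textbook example (not one of the chapter's case studies): the MLE of a Gaussian variance divides by \(N\) and is biased downward by the factor \((N-1)/N\), while the unbiased estimator divides by \(N-1\) but has larger variance. A minimal NumPy sketch, with illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2, N = 0.0, 4.0, 10  # illustrative parameter values
trials = 300_000              # number of independent data sets

data = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))

# MLE of the variance divides by N (ddof=0); the unbiased estimator by N-1 (ddof=1).
var_mle = data.var(axis=1, ddof=0)
var_unb = data.var(axis=1, ddof=1)

print(var_mle.mean())  # close to (N-1)/N * sigma2 = 3.6 (biased low)
print(var_unb.mean())  # close to sigma2 = 4.0 (unbiased)
print(var_mle.var() < var_unb.var())  # the biased estimator is more precise
```

Since the two estimators differ only by the constant factor \((N-1)/N\), the variance of the MLE is smaller by \(((N-1)/N)^2\): a small sacrifice in accuracy buys a gain in precision, exactly the trade-off invoked in the text.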
In our work in this chapter, we employed the likelihood \(L(\theta\vert x) = f_{X\vert\varTheta}(x\vert\theta)\) with which parameter values \(\theta\in\varDelta\) generate given data \(x\in \mathcal {X}\). We specifically used the maximum likelihood estimator (MLE),
\[ \hat{\varTheta}(x) \,=\, \operatorname*{arg\,max}_{\theta\in\varDelta}\, L(\theta\vert x). \]
In words, the estimate for the parameter value generating given data is the value maximizing the probability (likelihood) of generating that data. The evident circularity in this statement reflects that sampling and inference run contrary to each other. Note that neither existence nor uniqueness of the MLE is automatic, and that the MLE is often biased. However, if \(X_1,\ldots,X_N\) are i.i.d. and \(N \to\infty\), then \(f_{\hat {\varTheta }\vert \varTheta }(\cdot \vert \theta )\) is approximately Gaussian and centered at \(\theta\), by the central limit theorem (CLT). For more detailed introductions to parameter inference, at two different levels, we refer the reader to [10, 25].
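As a worked illustration of maximum likelihood (not one of the chapter's case studies; parameter values are illustrative), consider i.i.d. exponential data with rate \(\lambda\). The log-likelihood is \(N\log\lambda - \lambda\sum_n x_n\), and setting its derivative to zero gives the closed-form MLE \(\hat\lambda = 1/\bar{x}\). The sketch below maximizes the log-likelihood by brute force over a grid and checks that the result agrees with the closed form:

```python
import numpy as np

rng = np.random.default_rng(2)
lam_true, N = 1.5, 400  # illustrative rate and sample size

# Data sampled from an exponential distribution with rate lam_true
# (NumPy parameterizes by the scale 1/lambda).
x = rng.exponential(1.0 / lam_true, size=N)

# Log-likelihood of the exponential model: N*log(lam) - lam*sum(x).
def log_lik(lam):
    return N * np.log(lam) - lam * x.sum()

# Brute-force maximization over a fine grid vs. the closed-form MLE 1/mean(x).
grid = np.linspace(0.01, 5.0, 100_000)
lam_grid = grid[np.argmax(log_lik(grid))]
lam_closed = 1.0 / x.mean()

print(abs(lam_grid - lam_closed) < 1e-3)  # numerical and analytic MLE agree
```

With \(N = 400\) the estimate lands close to the true rate, in line with the CLT statement above; note also that \(1/\bar{x}\) is a biased estimator of \(\lambda\) for finite \(N\), illustrating the earlier remark that the MLE is often biased.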
Copyright information
© 2019 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Zagaris, A. (2019). Data-Informed Modeling in the Health Sciences. In: Kaper, H., Roberts, F. (eds) Mathematics of Planet Earth. Mathematics of Planet Earth, vol 5. Springer, Cham. https://doi.org/10.1007/978-3-030-22044-0_6
DOI: https://doi.org/10.1007/978-3-030-22044-0_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-22043-3
Online ISBN: 978-3-030-22044-0
eBook Packages: Mathematics and Statistics (R0)