Abstract
The adoption of automation and technology by health professionals is triggering an explosion of databases and data streams in that sector. The emergence of this data torrent creates the pressing need to mine it for value, which in turn requires investment for the development of modeling and analysis tools. In view of this, dynamicists are presented with the terrific opportunity to enrich their discipline by supplying it with new tools, expanding its scope, and elevating its social impact. This chapter is written in that spirit, examining three concrete case studies encountered in the field: quantifying the salmonellosis risk posed by distinct food sources, assimilating genetic data into a dynamical model for avian influenza transmission, and statistically decontaminating gas chromatography/mass spectroscopy time series. We review available prototypical models and build on them guided by data and mathematical abstraction, demonstrating in the process how to root a model into data. This takes us quite naturally into the realm of probabilistic and statistical modeling and reopens a decades-old discussion on the role of discrete models in applied mathematics. We also touch briefly on the timely subject of mathematicians being employed as such outside math departments and attempt a short outlook on their prospects and opportunities.
References
Amari, S., Nagaoka, H.: Methods of Information Geometry. American Mathematical Society, Providence (2000). ISBN: 0-8218-0531-2
Barto, A.G.: Discrete and continuous models. Int. J. Gen. Syst. 4(3), 163–177 (1978). https://doi.org/10.1080/03081077808960681
Benaglia, T., Chauveau, D., Hunter, D.R., et al.: mixtools: an R package for analyzing mixture models. J. Stat. Softw. 32(6) (2010). https://doi.org/10.18637/jss.v032.i06
Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Berlin (2006). ISBN: 0-387-31073-8
Boender, G.J., Hagenaars, T.J., Bouma, A., et al.: Risk maps for the spread of highly pathogenic avian influenza in poultry. PLoS Comput. Biol. 3(4), 704–712 (2007). https://doi.org/10.1371/journal.pcbi.0030071
Box, G.E.P.: Science and statistics. J. Amer. Stat. Assoc. 71(356), 791–799 (1976). https://doi.org/10.1080/01621459.1976.10480949
Bromham, L., Dinnage, R., Hua, X.: Interdisciplinary research has consistently lower funding success. Nature 534(7609) (2016). https://doi.org/10.1038/nature18315
Busch, R., Neese, R.A., Awada, M., et al.: Measurement of cell proliferation by heavy water labeling. Nat. Prot. 2(12), 3045–3057 (2007). https://doi.org/10.1038/nprot.2007.420
Council of the European Communities: Council directive 2005/94/ec of 20 December 2005 on community measures for the control of avian influenza and repealing directive 92/40/eec. Off. J. Eur. Union 49, L10/16–65 (2006). ISSN: 1725-2555
Cox, D.R.: Principles of Statistical Inference. Cambridge University Press, Cambridge (2006). ISBN: 978-0-521-86673-6
Cramér, H.: Mathematical Methods of Statistics. Princeton University Press, Princeton (1946)
Dorado-García, A., Smid, J.H., van Pelt, W., et al.: Molecular relatedness of ESBL/AmpC-producing Escherichia coli from humans, animals, food and the environment: a pooled analysis. J. Antimicrob. Chemother. 73(2), 339–347 (2018). https://doi.org/10.1093/jac/dkx397
Fisher, R.A.: Presidential address. Sankhyā Ind. J. Stat. 4(1), 14–17 (1938)
GitHub repository. https://github.com/azagaris
Gutenkunst, R.N., Waterfall, J.J., Casey, F.P., et al.: Universally sloppy parameter sensitivities in systems biology models. PLoS Comp. Biol. 3, 1871–1878 (2007). https://doi.org/10.1371/journal.pcbi.0030189
Hald, T., Wegener, H.C.: Quantitative assessment of the sources of human salmonellosis attributable to pork. In: Proceedings of the 3rd ISECSP, pp. 200–205 (1999)
Hald, T., Vose, D., Wegener, H.C., et al.: A Bayesian approach to quantify the contribution of animal–food sources to human salmonellosis. Risk Anal. 24, 255–269 (2004). https://doi.org/10.1111/j.0272-4332.2004.00427.x
Hamming, R.W.: Toward a lean and lively calculus: report of the conference/workshop to develop curriculum and teaching methods for calculus at the college level. Am. Math. Mon. 95(5), 466–471 (1988). https://doi.org/10.1080/00029890.1988.11972034
Karch, H., Denamur, E., Dobrindt, U., et al.: The enemy within us: lessons from the 2011 European Escherichia coli O104:H4 outbreak. EMBO Mol. Med. 4, 841–848 (2012). https://doi.org/10.1002/emmm.201201662
Kermack, W.O., McKendrick, A.G.: A contribution to the mathematical theory of epidemics. Proc. R. Soc. A 115, 700–721 (1927). https://doi.org/10.1098/rspa.1927.0118
Kimura, M.: Estimation of evolutionary distances between homologous nucleotide sequences. Proc. Natl. Acad. Sci. 78, 454–458 (1981)
Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951). https://doi.org/10.1214/aoms/1177729694
Pearl, J.: Causality: Models, Reasoning and Inference. Cambridge University Press, New York (2000). ISBN: 978-0521895606
Raue, A., Kreutz, C., Maiwald, T., et al.: Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood. Bioinformatics 25, 1923–1929 (2009). https://doi.org/10.1093/bioinformatics/btp358
Schervish, M.J.: Theory of Statistics. Springer, New York (1995). ISBN: 978-1-4612-8708-7
Snow, J.: On the Mode of Communication of Cholera. John Churchill, London (1855)
Sorg, L.: Forward-looking panel tackles issues of the Mathematics of Planet Earth. SIAM News Blog (2016)
Stegeman, A., Bouma, A., Elbers, A.R.W., et al.: Avian Influenza A Virus (H7N7) epidemic in The Netherlands in 2003: course of the epidemic and effectiveness of control measures. J. Infect. Dis. 190(12), 2088–2095 (2004). https://doi.org/10.1086/425583
Tan, C.Y., Iglewicz, B.: Measurement-methods comparisons and linear statistical relationship. Technometrics 41(3), 192–201 (1999). https://doi.org/10.1080/00401706.1999.10485668
Tufte, E.R.: Visual Explanations: Images and Quantities, Evidence and Narrative. Graphics Press, Cheshire (1997). ISBN: 978-0961392123
Vrisekoop, N., den Braber, I., de Boer, A.B., et al.: Sparse production but preferential incorporation of recently produced naïve T cells in the human peripheral pool. Proc. Natl. Acad. Sci. 105(16), 6115–6120 (2008). https://doi.org/10.1073/pnas.0709713105
Waterfall, J.J., Casey, F.P., Gutenkunst, R.N., et al.: Sloppy-model universality class and the Vandermonde matrix. Phys. Rev. Lett. 97, 150601 (2006). https://doi.org/10.1103/PhysRevLett.97.150601
Wilson, E.O.: Letters to a Young Scientist. Liveright, New York (2013). ISBN: 978-0871403858
Zilversmit, D.B., Entenman, C., Fishler, M.C.: On the calculation of “turnover time” and “turnover rate” from experiments involving the use of labeling agents. J. Gen. Physiol. 26(3), 325–331 (1943)
Acknowledgements
The work in Sect. 6.3 was initiated and supervised by Gert-Jan Boender and Thomas Hagenaars (Bacteriology and Epidemiology, Wageningen University and Research). The work in Sect. 6.4 was initiated by and done in collaboration with Rob de Boer (Theoretical Biology and Bioinformatics, Utrecht University), José Borghans (University Medical Center Utrecht), Ad Koets, and Lars Ravesloot (Bacteriology and Epidemiology, Wageningen University and Research). The author thanks them dearly for opening up a world of scientific opportunity and scholarship to him.
Appendix: A Short Primer on Parameter Estimation
The fundamental belief underpinning any modeling endeavor is that system measurements can be approximately generated by a specific model. In general terms, inference uses such measurements to mitigate uncertainty present in the underlying model. In this short appendix, we assume a well-defined class of candidate models that differ only in particulars; our task is to locate among them the one that best fits the available measurements (data). Here, these models share a common functional form containing finitely many parameters, so we speak of a parametric family and parametric inference. Lifting the uncertainty surrounding the parameter values is the inferential task par excellence.
Parameter values can be inferred in various ways joined by a common thread. Typically, unknown values are obtained as solutions to an optimization problem involving the model class and available data; in the problems treated here, that data is model outputs such as values of the dependent variables. For a deterministic model, a reasonable minimal requirement for an estimator would seemingly be self-consistency: given data generated by simulating a model with specific parameter values, a self-consistent estimator would return those precise parameter values, i.e., invert the simulation. Imposing that condition is reasonable, as long as distinct parameter values yield well-defined, distinct data (parameter identifiability [24]). However, the models treated in this chapter are probabilistic: specific parameter settings only have a certain probability to generate specific data. This makes the correspondence between parameter values and data both one-to-many and many-to-one, and it necessitates rethinking what can be reasonably expected from an estimator.
To address this problem, we start with univariate r.v.s \(X_1,\ldots,X_N\) defined on a common sample space \(\varOmega\) and having distributions \(f_{X_1},\ldots ,f_{X_N}\). We then write \(X = (X_1,\ldots ,X_N) : \varOmega \to \mathcal {X}\) for the multivariate r.v. collecting them, and we recognize \(\mathcal {X}\) as the space where data resides. This data space is equipped with an induced joint probability distribution \(f_X\), and each point \(x = (x_1,\ldots,x_N)\) in it corresponds to a full set of system measurements. In general, this joint distribution does not follow trivially from the marginals \(f_{X_1},\ldots ,f_{X_N}\); determining it may be a sizable part of the modeling process, and a closed-form expression may be out of reach if the problem does not possess additional structure. A favorable case occurs when \(X_1,\ldots,X_N\) are pairwise independent, as \(f_X\) then has the product decomposition \(f_X(x) = \prod _{n=1}^N f_{X_n}(x_n)\); another, trivial case occurs when the r.v. components are algebraically constrained. Often, neither is true and modeling \(f_X\) is nontrivial. As a concrete example, the reader should derive the sampling distribution of the sample mean \(\bar{X} = \sum _{n=1}^N X_n/N\) corresponding to i.i.d. Gaussian r.v.s \(X_1,\ldots,X_N\).
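The closing exercise can also be checked numerically. The sketch below (not from the chapter; a minimal NumPy simulation with illustrative values of the mean, standard deviation, and sample size) draws many independent data sets of i.i.d. Gaussian r.v.s and confirms that the sample mean again has mean \(\mu\) and variance \(\sigma^2/N\):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N = 2.0, 3.0, 25   # illustrative parameter values
trials = 200_000              # number of independent data sets

# Each row is one data set of size N; the sample mean collapses each row.
samples = rng.normal(mu, sigma, size=(trials, N))
xbar = samples.mean(axis=1)

# Empirically, the sample mean has mean mu and variance sigma^2 / N.
print(xbar.mean())  # close to mu = 2.0
print(xbar.var())   # close to sigma^2 / N = 9/25 = 0.36
```

The full derivation (that the sample mean is itself Gaussian) follows from the closure of Gaussians under linear combinations; the simulation only verifies its first two moments.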
We now assume that \(f_X\) depends on a set of parameters \(\varTheta = (\theta_1,\ldots,\theta_M)\in\varDelta\) and write \(f_{X\vert\varTheta}(\cdot\vert\theta)\) to reflect this. The parameter values \(\theta\) are the subject of inference, i.e., of mapping data to parameter values by means of an estimator \(\hat {\varTheta } : \mathcal {X} \to \varDelta \). This function maps specific data unambiguously (i.e., deterministically) to specific parameter values, without recourse to the parameter values that generated the data. It is in this sense that parameter estimation reverse-engineers data generation. To proceed intelligently with estimator design, we note that parameter values generate data probabilistically, by sampling \(f_{X\vert\varTheta}(\cdot\vert\theta)\), but \(\hat {\varTheta }\) maps these to parameter estimates deterministically. The combination of sampling and estimation is therefore probabilistic in nature, meaning that a fixed set of parameter values generates different data and thus gives rise to various estimates of those values. In fact, the composite map \(\hat {\varTheta } \circ X : \varOmega \to \varDelta \) is a transformed version of \(X\) and hence automatically an r.v. in its own right. Indeed, any measurable set \(U\) in parameter space \(\varDelta\) is assigned the measure of its pre-image \(\hat {\varTheta }^{-1}(U)\) in data space \(\mathcal {X}\), which in turn inherits that of \(X^{-1}(\hat {\varTheta }^{-1}(U))\) in sample space \(\varOmega\).
Being an r.v., the estimator is distributed according to some sampling distribution \(f_{\hat {\varTheta }\vert \varTheta }\) that depends on the unknown parameter values. This observation suggests adapting the deterministic notion of self-consistency to that of an unbiased estimator, which amounts to demanding that
\[ \mathbb{E}\big[\hat{\varTheta}\,\big\vert\,\theta\big] \,=\, \int_{\mathcal{X}} \hat{\varTheta}(x)\, f_{X\vert\varTheta}(x\vert\theta)\,\mathrm{d}x \,=\, \theta \quad\text{for all}\ \theta\in\varDelta. \]
If this condition holds, then the expected parameter estimates match the true parameter values, i.e., the estimator is correct on average although individual estimates inevitably deviate from the truth. That deviation can be quantified (again on average) using the variance of \(f_{\hat {\varTheta }\vert \varTheta }\), which one would like to keep as low as possible; note that some variance is inevitable, see the Cramér–Rao bound [11]. These notions of estimator bias and variance permeate estimation theory fundamentally. For example, the aforementioned variance bound links to information theory and geometry [1], whereas modern machine learning work often involves biased estimators that trade off accuracy for precision.
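The bias–variance trade-off mentioned above can be made concrete with a standard textbook example (not one of the chapter's case studies): the MLE of a Gaussian variance divides by \(N\) and is biased downward by the factor \((N-1)/N\), while the unbiased estimator divides by \(N-1\) but has larger variance. A minimal NumPy sketch, with illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2, N = 0.0, 4.0, 10  # illustrative parameter values
trials = 300_000              # number of independent data sets

data = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))

# MLE of the variance divides by N (ddof=0); the unbiased estimator by N-1 (ddof=1).
var_mle = data.var(axis=1, ddof=0)
var_unb = data.var(axis=1, ddof=1)

print(var_mle.mean())  # close to (N-1)/N * sigma2 = 3.6 (biased low)
print(var_unb.mean())  # close to sigma2 = 4.0 (unbiased)
print(var_mle.var() < var_unb.var())  # the biased estimator is more precise
```

Since the two estimators differ only by the constant factor \((N-1)/N\), the variance of the MLE is smaller by \(((N-1)/N)^2\): a small sacrifice in accuracy buys a gain in precision, exactly the trade-off invoked in the text.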
In our work in this chapter, we employed the likelihood \(L(\theta\vert x) = f_{X\vert\varTheta}(x\vert\theta)\) with which parameter values \(\theta\in\varDelta\) generate given data \(x\in \mathcal {X}\). We specifically used the maximum likelihood estimator (MLE),
\[ \hat{\varTheta}(x) \,=\, \operatorname*{arg\,max}_{\theta\in\varDelta}\, L(\theta\vert x). \]
In words, the estimate for the parameter value generating given data is the value maximizing the probability (likelihood) of generating that data. The evident circularity in this statement reflects that sampling and inference run contrary to each other. Note that neither existence nor uniqueness of the MLE is automatic, and that the MLE is often biased. However, if \(X_1,\ldots,X_N\) are i.i.d. and \(N \to\infty\), then \(f_{\hat {\varTheta }\vert \varTheta }(\cdot \vert \theta )\) is approximately Gaussian and centered at \(\theta\), by the central limit theorem (CLT). For more detailed introductions to parameter inference, at two different levels, we refer the reader to [10, 25].
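As a worked illustration of maximum likelihood (not one of the chapter's case studies; parameter values are illustrative), consider i.i.d. exponential data with rate \(\lambda\). The log-likelihood is \(N\log\lambda - \lambda\sum_n x_n\), and setting its derivative to zero gives the closed-form MLE \(\hat\lambda = 1/\bar{x}\). The sketch below maximizes the log-likelihood by brute force over a grid and checks that the result agrees with the closed form:

```python
import numpy as np

rng = np.random.default_rng(2)
lam_true, N = 1.5, 400  # illustrative rate and sample size

# Data sampled from an exponential distribution with rate lam_true
# (NumPy parameterizes by the scale 1/lambda).
x = rng.exponential(1.0 / lam_true, size=N)

# Log-likelihood of the exponential model: N*log(lam) - lam*sum(x).
def log_lik(lam):
    return N * np.log(lam) - lam * x.sum()

# Brute-force maximization over a fine grid vs. the closed-form MLE 1/mean(x).
grid = np.linspace(0.01, 5.0, 100_000)
lam_grid = grid[np.argmax(log_lik(grid))]
lam_closed = 1.0 / x.mean()

print(abs(lam_grid - lam_closed) < 1e-3)  # numerical and analytic MLE agree
```

With \(N = 400\) the estimate lands close to the true rate, in line with the CLT statement above; note also that \(1/\bar{x}\) is a biased estimator of \(\lambda\) for finite \(N\), illustrating the earlier remark that the MLE is often biased.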
Copyright information
© 2019 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Zagaris, A. (2019). Data-Informed Modeling in the Health Sciences. In: Kaper, H., Roberts, F. (eds) Mathematics of Planet Earth. Mathematics of Planet Earth, vol 5. Springer, Cham. https://doi.org/10.1007/978-3-030-22044-0_6
DOI: https://doi.org/10.1007/978-3-030-22044-0_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-22043-3
Online ISBN: 978-3-030-22044-0
eBook Packages: Mathematics and Statistics (R0)