# Statistical inference in mechanistic models: time warping for improved gradient matching

## Abstract

Inference in mechanistic models of non-linear differential equations is a challenging problem in current computational statistics. Due to the high computational costs of numerically solving the differential equations in every step of an iterative parameter adaptation scheme, approximate methods based on gradient matching have become popular. However, these methods critically depend on the smoothing scheme for function interpolation. The present article adapts an idea from manifold learning and demonstrates that a time warping approach aiming to homogenize intrinsic length scales can lead to a significant improvement in parameter estimation accuracy. We demonstrate the effectiveness of this scheme on noisy data from two dynamical systems with periodic limit cycle, a biopathway, and an application from soft-tissue mechanics. Our study also provides a comparative evaluation on a wide range of signal-to-noise ratios.

## Keywords

Differential equations · Reproducing kernel Hilbert space · Dynamical systems · Objective function

## 1 Introduction

The scientific landscape is changing, with an increasing number of traditionally qualitative disciplines becoming quantitative and adopting mathematical modelling techniques. This change is most dramatically witnessed in the life sciences (Cohen 2004). One of the most widely used modelling paradigms is based on coupled ordinary or partial differential equations (DEs). These equations are typically non-linear, so that a closed-form solution is intractable and numerical solutions are needed. This usually does not pose any restrictions on the forward problem: given the parameters, generate data from the model. However, it does provide challenges for the backward problem: given the data, infer the parameters.

The simplest approach to parameter inference for DEs is to compare the solution of the equations, for some given parameter set, to noisy observations of the signal based on some appropriate noise model. Parameter estimation can then be carried out by minimizing the discrepancy between the predicted solution of the DEs and the data. Robinson (2004) gives an introduction to obtaining explicit solutions of differential equations and, amongst many other topics, discusses Euler’s method and the Runge–Kutta scheme for obtaining solutions numerically. Inference could be carried out on a system of DEs by using either of these two methods (with a reasonably small step size) to solve the equations numerically, and then using least squares estimation to infer the parameters that best describe the data signal. Xue et al. (2010) discuss the influence of the numerical approximation to the DEs (employing the 4-stage Runge–Kutta algorithm in their studies). They argue that previous studies took the numerical solution as being the ground truth and only considered the measurement error when estimating the parameters. The authors show that when the maximum step size of a *p*-order numerical algorithm goes to zero at a rate faster than \(n^{-1/(p\wedge 4)}\), where *n* is the sample size and \(p\wedge 4=\min (p,4)\), the numerical error is negligible in comparison to the measurement error. This provides some guidance in selecting the step size when numerically solving DEs.
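The solve-and-compare approach described above can be sketched in a few lines. The following is a minimal illustration only: the scalar decay model, the function names (`rk4_solve`, `decay`, `discrepancy`) and the crude grid search are assumptions made for this example and are not taken from any of the cited works.

```python
import math

def rk4_solve(f, x0, t0, t1, n_steps):
    """Integrate dx/dt = f(t, x) from t0 to t1 with n_steps classical RK4 steps."""
    h = (t1 - t0) / n_steps
    t, x = t0, list(x0)
    path = [(t, list(x))]
    for _ in range(n_steps):
        k1 = f(t, x)
        k2 = f(t + h / 2, [xi + h / 2 * ki for xi, ki in zip(x, k1)])
        k3 = f(t + h / 2, [xi + h / 2 * ki for xi, ki in zip(x, k2)])
        k4 = f(t + h, [xi + h * ki for xi, ki in zip(x, k3)])
        x = [xi + h / 6 * (a + 2 * b + 2 * c + d)
             for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]
        t += h
        path.append((t, list(x)))
    return path

def decay(theta):
    # Toy model (an assumption for illustration): dx/dt = -theta * x
    return lambda t, x: [-theta * x[0]]

def discrepancy(theta, y, t1=2.0):
    """Sum of squared differences between the numerical solution and the data y."""
    path = rk4_solve(decay(theta), [1.0], 0.0, t1, len(y) - 1)
    return sum((x[0] - yi) ** 2 for (_, x), yi in zip(path, y))

# Noise-free synthetic data generated with theta = 0.7 on 21 equidistant points
n = 21
y = [math.exp(-0.7 * 2.0 * i / (n - 1)) for i in range(n)]

# Crude grid search standing in for a proper least squares optimiser
grid = [0.1 * k for k in range(1, 16)]
theta_hat = min(grid, key=lambda th: discrepancy(th, y))
```

Every evaluation of `discrepancy` requires a full numerical integration, which is the computational cost that motivates the approximate methods discussed next.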

A different integration-based approach, which aims at avoiding explicitly solving the DEs, is to first smooth the data with a chosen interpolation method. This interpolant acts as a proxy for the solution of the DEs and the parameters can then be inferred with non-linear least squares. It is demonstrated in Xue et al. (2010) that a sieve estimator (a sequence of finite-dimensional models of increasing complexity) is asymptotically normal and has the same asymptotic covariance as when the true solution is known if the parameters are constant over time. A typical example of sieve regression is a spline (Hansen 2014). Dattner and Klaassen (2015) look at DEs where the systems are linear in the parameters. Taking advantage of the linearity in the model, the authors are able to develop a two-step estimation approach that does not require repeated integration of the system. By reformulating the minimization function in terms of integrals instead of derivatives, the authors obtain closed form estimates of the parameters of the system. These estimates are shown to be consistent. Dattner and Klaassen consider two types of interpolation schemes: a local polynomial estimator and a step function estimator (which is obtained by averaging repeated measurements). The method using a local polynomial estimator was shown to outperform the two-step gradient matching approach of Liang and Wu (2008), whilst it was unable to outperform the gradient matching method of Ramsay et al. (2007). The accuracy of Dattner and Klaassen’s method using a step function estimator did not change much even when the number of repeated measures was quite small. Bayesian smooth-and-match is a related method that avoids explicitly solving the DEs and instead indirectly solves the system by numerically integrating the interpolated signals. Ranciati et al. (2016) employ this approach, smoothing the data with penalized splines, and use ridge regression to infer the parameters of the DEs. Again, this approach focuses on systems that are linear in the parameters. In order to achieve a fully probabilistic generative model, the authors take a similar approach to Barber and Wang (2014) and as a consequence the vector of observations appears twice in the graphical model. The upshot of this is that the method is unable to deal with partially observed systems and the two observation vectors are coupled by a common nuisance (variance) parameter. Ranciati et al. (2016) demonstrate that the method is fast, with a built-in quantification of uncertainty about the DE solution. The results obtained, for a fully observed system that is linear in the parameters, are accurate and robust to dataset size and noise level.
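To illustrate why linearity in the parameters is convenient, the sketch below matches derivatives (rather than integrals, as in Dattner and Klaassen) for the Lotka–Volterra system in its textbook form, which is an assumption of this example. Central differences on noise-free states stand in for the smoothing step, and all names (`lotka_volterra`, `rk4_path`, `theta_hat`) are hypothetical.

```python
import numpy as np

def lotka_volterra(x, theta):
    # Textbook form (assumed here): theta = (alpha, beta, gamma, delta)
    a, b, c, d = theta
    return np.array([a * x[0] - b * x[0] * x[1],
                     -c * x[1] + d * x[0] * x[1]])

def rk4_path(theta, x0, dt, n):
    """Generate n RK4 steps of the system; used only to create toy data."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(n):
        x = xs[-1]
        k1 = lotka_volterra(x, theta)
        k2 = lotka_volterra(x + dt / 2 * k1, theta)
        k3 = lotka_volterra(x + dt / 2 * k2, theta)
        k4 = lotka_volterra(x + dt * k3, theta)
        xs.append(x + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4))
    return np.array(xs)

theta_true = (2.0, 1.0, 4.0, 1.0)
dt = 0.01
x = rk4_path(theta_true, [5.0, 3.0], dt, 500)

# Stand-in for the smoothing step: central differences at interior points
dx = (x[2:] - x[:-2]) / (2 * dt)
xm = x[1:-1]

# Each equation is linear in its own parameters, so least squares is closed form
A1 = np.column_stack([xm[:, 0], -xm[:, 0] * xm[:, 1]])
A2 = np.column_stack([-xm[:, 1], xm[:, 0] * xm[:, 1]])
alpha, beta = np.linalg.lstsq(A1, dx[:, 0], rcond=None)[0]
gamma, delta = np.linalg.lstsq(A2, dx[:, 1], rcond=None)[0]
theta_hat = np.array([alpha, beta, gamma, delta])
```

No repeated integration is needed once the states and their derivatives are available. With noisy data, the quality of the derivative estimates becomes the limiting factor.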

*N* variables even though the total number of variables in the network may be large. At step one, the data is smoothed using penalized splines. At step two, the state variables and derivatives are substituted into the aforementioned SA-ODE model, producing a pseudo-sparse additive model (PSA). A truncated series expansion with B-spline bases is used to approximate the additive components of the PSA model. The number of basis functions is chosen as large as possible with the intention to correct for this at the fifth step. At step three, the group LASSO is used to identify significant functions in the model. The penalty parameter at this step is estimated using BIC. The group LASSO penalty treats the coefficients from each group equally, which is typically suboptimal. Hence, at step four, an adaptive group LASSO is applied to allow different levels of shrinkage to exist for different coefficients. Finally, at step five, a regular/adaptive LASSO is applied to account for the under-smoothing from step two (due to selecting more bases than are probably necessary). Wu et al. demonstrate in their simulation studies that the method is able to obtain a high true positive rate, when the sample size is sufficiently large, and can more closely match the true underlying signal (the noise-free signal) than the method by Lu et al. (2011), which assumes a linear DE model and uses the smoothly clipped absolute deviation penalized likelihood method of Fan and Li (2001) for variable selection. A variety of other frameworks have also been developed in this context, including local linear and quadratic regression (Liang and Wu 2008), Gaussian processes (Calderhead et al. 2009; Dondelinger et al. 2013; Barber and Wang 2014; Macdonald et al. 2015), penalized smoothing and regression splines (Ramsay et al. 2007; Xun et al. 2013), and reproducing kernel Hilbert spaces (González et al. 2013, 2014).

A problem common to all of these approaches is the critical dependence of the inference scheme on the form of the interpolant. Small “wiggles”, which are hardly discernible at the level of the interpolant itself, can have dramatic effects at the level of the derivatives, which determine the parameter estimation. For noisy data, an adequate smoothing scheme is essential. However, any smoothing scheme is based on intrinsic length scales and these length scales may vary in time. Consider, for instance, estimating an oscillating signal with varying frequency using a Gaussian process (GP). If the length scale is tuned to the high-frequency domain, overfitting will typically result in the low frequency domain; if it is tuned to the low frequency domain, over-smoothing will affect the high frequency domain. In either case, the estimation of the derivatives will be poor, hampering DE parameter estimation.
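The sensitivity of derivatives to small wiggles can be made concrete with a toy calculation. In the sketch below, the true signal is \(\sin (t)\) and the interpolant carries an additional ripple \(\varepsilon \sin (\omega t)\); the values of `eps` and `omega` are illustrative assumptions.

```python
import math

def max_abs(values):
    return max(abs(v) for v in values)

eps, omega = 0.01, 50.0                  # small, fast wiggle (illustrative values)
ts = [i * 0.001 for i in range(6284)]    # grid covering roughly [0, 2*pi]

# Interpolant minus truth is eps*sin(omega*t); its derivative is eps*omega*cos(omega*t)
err_function = max_abs([eps * math.sin(omega * t) for t in ts])
err_derivative = max_abs([eps * omega * math.cos(omega * t) for t in ts])
```

The ripple changes the function by at most 1% of the signal amplitude, yet the derivative error reaches `eps * omega = 0.5`, i.e. half the amplitude of the true derivative \(\cos (t)\).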

The motivation for our work is given by the work of Calandra et al. (2016) in which the authors present examples where the smoothness assumptions upon which standard GPs are based are too restrictive. This limitation can be alleviated by mapping the data into a feature space. The authors integrate this map into what they call a manifold GP, and propose a joint inference scheme for learning both the transformation of the data and the GP regression from the feature space to the observed space.

The mapping proposed in Calandra et al. (2016) is, by the very nature of the inference scheme, a “black box”; for their practical work, the authors use a feedforward neural network. The modification we propose in the present article is to develop a map that explicitly targets changes in the length scales of oscillating signals. Periodic signals with varying length scales correspond to nonisotropic periodic limit cycles, and are characteristic of a large class of non-linear DEs (non-chaotic DEs without a stable fixed point).

In the present work, we implement the proposed warping scheme in the specific framework of reproducing kernel Hilbert space (RKHS) regression. We would like to emphasize, though, that this choice is rather arbitrary, and other regularized regression frameworks, like penalized splines or GPs, could also be chosen. The second point to notice is that although our framework has been motivated by oscillating functions, it turns out to be equally effective for non-periodic non-chaotic systems. We provide an example in the Results section (biopathway).

## 2 Background

### 2.1 Dynamical systems

*r* interacting states \(x_s\), \(1\le s\le r\), whose time evolution is governed by a set of coupled non-linear ordinary differential equations (DEs):

*n* noisy observations \(\varvec{y}_s=(y_{s1},\ldots ,y_{sn})'\) of the states \(\varvec{x}_s=(x_{s1},\ldots ,x_{sn})'\), subject to iid additive Gaussian noise \(\varvec{\epsilon }_{k} \sim N(0, \sigma ^2\varvec{I})\):

### 2.2 RKHS approach to inference in DEs

*g* defined over a set \({\mathbb {D}} \subset {\mathbb {R}}^m\). \({\mathcal {H}}\) is said to be a Reproducing Kernel Hilbert Space (RKHS) if and only if there exists a function \(k(\cdot ,\cdot ): {\mathbb {D}} \times {\mathbb {D}} \rightarrow {\mathbb {R}}\) such that for all \(t \in {\mathbb {D}}\) and all \(g \in {\mathcal {H}}\) the inner product \(\langle g( \cdot ), k(t,\cdot )\rangle \) is equal to \(g(t)\) and the kernel function \(k(t,\cdot )\) is in \({\mathcal {H}} \) (Aronszajn 1950). When working with an RKHS approach for function estimation, functions are expressed as a linear combination of kernel functions evaluated at the data points

*s*th component of the dynamical system at time *t* (which implies \(m=1\)) can be modelled as

^{1} and the regularization term \(||\varvec{q}_s||^2\) is the squared norm of \({\mathcal {H}}_s\):

*s*th component of the function defined in (1). However, this approach critically depends on the expressive power of the linear combination of kernels to represent the solution of the DE system. A representation with limited flexibility can degrade the performance in estimating DE parameters. For instance, in the case of an RBF kernel, rapid changes in the signals require a lengthscale parameter (which is included in \(\varvec{\varphi }_s\)) that is short enough to have sufficient flexibility to accommodate these changes. As a result, flat parts of the signal will be modelled with an unnaturally short lengthscale. This leads to overfitting, a poor estimation of the gradient and, consequently, a poor performance of gradient matching for DE parameter estimation (see Fig. 11 of the Appendix). In the next section, we describe a novel RKHS-based time warping approach to overcome this limitation.
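As a point of reference, standard kernel ridge regression with an RBF kernel, together with the analytic derivative of the resulting interpolant, can be sketched as follows. This is a generic textbook formulation and not the authors' R implementation; the function names, the lengthscale `ell` and the ridge weight `lam` are assumptions chosen for illustration.

```python
import numpy as np

def rbf(t, u, ell):
    """RBF kernel matrix with entries exp(-(t_i - u_j)^2 / (2 ell^2))."""
    d = t[:, None] - u[None, :]
    return np.exp(-d ** 2 / (2 * ell ** 2))

def fit_krr(t, y, ell, lam):
    """Solve (K + lam I) b = y for the kernel weights b."""
    K = rbf(t, t, ell)
    return np.linalg.solve(K + lam * np.eye(len(t)), y)

def predict(t_new, t, b, ell):
    """Interpolant: a linear combination of kernels centred at the data points."""
    return rbf(t_new, t, ell) @ b

def predict_derivative(t_new, t, b, ell):
    """Analytic time derivative of the interpolant."""
    d = t_new[:, None] - t[None, :]
    return (-d / ell ** 2 * rbf(t_new, t, ell)) @ b

t = np.linspace(0.0, 2 * np.pi, 30)
y = np.sin(t)                                       # noise-free toy signal
b = fit_krr(t, y, ell=1.0, lam=1e-6)

t_mid = np.array([np.pi])
g_mid = predict(t_mid, t, b, 1.0)[0]                # approximates sin(pi)
dg_mid = predict_derivative(t_mid, t, b, 1.0)[0]    # approximates cos(pi)
```

A single lengthscale `ell` is shared by the whole time axis, which is exactly the restriction discussed above: for a signal with both fast and flat regions, no single value suits every region.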

## 3 Methods

*s* of the dynamical system, time *t* via a bijection \(\tilde{t}=w_s(t)\) such that in warped time \(\tilde{t}\), the unknown solutions \(x_s\) of the dynamical system show less variation in their intrinsic length scales. More specifically, we target oscillating functions and aim to transform them into a regular sinusoid by exploiting the fact that a sinusoid is closed under second-order differentiation (subject to a rescaling). We define the transformation of time as

*n* can, in principle, be treated as a model selection problem. In practice, we found that setting *n* to the actual number of observations gave satisfactory results (as reported in Sect. 5). In the original time domain, the *s*th variable of the dynamical system, \(x_s(t)\), is approximated by the smooth interpolant \(g_s(t)\). This function is now transformed, by virtue of the bijection (11), into \(q_s(\tilde{t})\), where

*Step 1: Initialization* We initialize the system with standard kernel ridge regression, i.e. by solving Eqs. (8–9). This gives us the smooth interpolants \(g_s(t)\) in the original time domain *t*. We then initialize \(\tilde{t}=t\) and \(g_s(t) = q_s(\tilde{t})\), for each of the variables *s* of the dynamical system in turn.^{2}

*Step 2: Time warping* The bijection between the original time domain \(t \in [T_0,T_1]\) and the warped domain \(\tilde{t} \in [\tilde{T}_0,\tilde{T}_1]\) is obtained by minimizing the objective function

^{3} The integral in (13) is analytically intractable and needs to be solved numerically, e.g. with the trapezoid or Simpson’s method. However, in practice, we only need the functional form of the bijection \(w_s(.)\) at the observed time points \(t_i, 1 \le i \le n\). This motivates the following simplification of the objective function (recall that \(\tilde{t}_i=w_s(t_i)\)):

*Step 3: Interpolation* The second layer deals with function interpolation. The original data points \(y_s(t_i)\) are mapped to the warped time points, \(y_s({\tilde{t}}_i)\). We then apply standard kernel ridge regression with an RBF kernel in the warped domain, resulting in a smooth interpolant \(q_s({\tilde{t}})\), for each of the variables *s* in the dynamical system:

*t* is straightforward. Since \(w_s(t)\) is bijective, we have \(g_s(t)=q_s({\tilde{t}})\), and

*Step 4: Gradient matching* Finally, we estimate the DE parameters with standard gradient matching, i.e. by minimizing the following objective function^{4} with respect to \(\varvec{\theta }\):

## 4 Software

We have provided an implementation of the method to allow for reproducibility of our results. The code has been built in a modular, object-oriented manner, allowing flexibility and maximizing the opportunities for code re-use. The R package is available at http://dx.doi.org/10.5525/gla.researchdata.383.

## 5 Simulations

The objective of our simulation study is to compare the performance of the novel two-level time warping method proposed in Sect. 3 with the standard RKHS gradient matching method summarized in Sect. 2.2. We refer to these methods as RKGW (W for warping) and RKG, respectively. Unless stated otherwise, we use an RBF kernel. For the comparative evaluation, we have generated time series from two well-known dynamical systems and a biopathway, and strain/stress data from a soft tissue mechanical model. To ensure a robust comparison, we have repeatedly and independently subjected these data to additive iid Gaussian noise, over a range of signal-to-noise ratios (SNR). The computational costs of the two approaches over the different DE models are shown in Table 8 of the Appendix.
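The noise protocol described above can be sketched as follows. The sketch assumes the common definition SNR (in dB) \(=10\log _{10}(\mathrm {var}(\text {signal})/\sigma ^2)\), which may differ from the convention used in the study; the function name `add_noise` and the toy sinusoidal signal are hypothetical.

```python
import math
import random

def add_noise(signal, snr_db, rng):
    """Add iid Gaussian noise so that var(signal) / sigma^2 = 10^(snr_db / 10)."""
    mean = sum(signal) / len(signal)
    var_signal = sum((s - mean) ** 2 for s in signal) / len(signal)
    sigma = math.sqrt(var_signal / 10 ** (snr_db / 10))
    noisy = [s + rng.gauss(0.0, sigma) for s in signal]
    return noisy, sigma

rng = random.Random(0)                                  # fixed seed for repeatability
signal = [math.sin(0.1 * i) for i in range(200)]        # toy noise-free signal

# 50 independent noisy instantiations per SNR level
datasets = {snr: [add_noise(signal, snr, rng)[0] for _ in range(50)]
            for snr in (10, 20, 30, 40)}
```

Under this definition, every 20 dB increase in SNR reduces the noise standard deviation by a factor of 10.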

*Lotka–Volterra* The Lotka–Volterra equations describe the dynamics of ecological systems with predator-prey interactions (Lotka 1920):

*FitzHugh–Nagumo* The FitzHugh–Nagumo system is a two-dimensional dynamical system used for modelling spike generation in axons (FitzHugh 1955). It has two state variables, \(x_1\) and \(x_2\), and three parameters: *a*, *b* and *c*.

*Biopathway* A model for the interactions of five protein isoforms, *S*, *dS*, *R*, *RS*, *Rpp*, in a signal transduction pathway was studied by Vyshemirsky and Girolami (2008). The model describes interactions between the isoforms using both mass action and Michaelis–Menten kinetics:

*Soft tissue mechanics* We finally consider a soft-tissue mechanical model of the strain distribution in arteries that connect the human blood vessel network to the left ventricle of the heart. The arteries are modelled as a thick-walled non-linear elastic circular cylindrical tube. The deformation and the hyperelastic stress response of the arterial tissue material are described by the constitutive law proposed by Holzapfel and Ogden (2009), leading to

*r* is the radius of the tube.

*H* is the indicator function, i.e. \(H=1\) if \(I_4>1\) and 0 otherwise. The constants \(\lambda _z,R_i,R_0,k,r_i\) define known physiological properties that are predefined. The four patient-specific material parameters \(a, b, a_f , b_f\) are of medical interest and need to be inferred from the experimental data; see Holzapfel et al. (2000).

For the soft tissue mechanics model, the solution of the DE system, which shows the strain in arteries in response to changes of the blood vessel radius, is depicted in Fig. 2d. The DEs were numerically integrated and we chose \(n=20\) equidistant radius values. The signal was corrupted with additive noise with SNR equal to 10 dB, as assumed by our biological collaborators, and again we generated 50 independent data instantiations.

## 6 Comparison with alternative state of the art methods

We have compared the proposed method with two related state-of-the-art methods from the recent literature: an alternative method also based on reproducing kernel Hilbert space regression (RKHS), proposed by González et al. (2013, 2014), and a method based on a graphical model representation with Gaussian processes, proposed by Barber and Wang (2014).

The alternative RKHS approach, henceforth referred to as the GON method (after the first author, González), is based on an explicit representation of the regularization operator \(\varvec{K}_s\) in Eq. (8) in terms of the differential operator (a product of the differential operator and its adjoint operator). Solutions of the homogeneous DE system are eigenfunctions, the so-called Green’s functions, of this operator. In practice, a closed-form expression of the Green’s functions is rarely available, and the differential operator has to be approximated by a finite difference operator. Additionally, the theory does not include non-homogeneous DEs with a non-linear function *f*(.) in Eq. (1). To make the method applicable to the general case, the authors linearize the system by replacing the state variables \(\mathbf{x}(t)\) in the non-linear part of *f*(.) in Eq. (1) by fixed surrogates, obtained from, for example, a splines-based non-linear interpolation applied to the raw data.
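The idea of replacing a differential operator by a finite difference operator, and forming a penalty from the operator and its adjoint, can be illustrated on a toy linear homogeneous DE. The sketch below is not the implementation of González et al.; the DE \(\dot{x}=-\theta x\), the forward-difference scheme and all names are assumptions made for this example.

```python
import numpy as np

def forward_difference(n, dt):
    """(n-1) x n matrix D with (D x)_i = (x[i+1] - x[i]) / dt."""
    D = np.zeros((n - 1, n))
    i = np.arange(n - 1)
    D[i, i] = -1.0 / dt
    D[i, i + 1] = 1.0 / dt
    return D

n, dt, theta = 200, 0.01, 1.5
t = dt * np.arange(n)

# Discretised operator: (P x)_i approximates dx/dt + theta * x at the left node
P = forward_difference(n, dt) + theta * np.eye(n)[:-1]
R = P.T @ P                       # adjoint times operator: quadratic penalty matrix

x_solution = np.exp(-theta * t)   # solves the toy DE
x_other = np.sin(t)               # does not solve it
pen_solution = x_solution @ R @ x_solution
pen_other = x_other @ R @ x_other
```

The penalty is close to zero for the DE solution (up to discretisation error) and several orders of magnitude larger for the function that violates the DE, which is what makes such an operator usable as a regularizer.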

The Gaussian process based approach, referred to by the authors as GPode, is based on a similar concept. Drawing on the analytical tractability of Gaussian processes, the state variables \(\mathbf{x}(t)\) are first integrated out in closed form, to obtain the conditional probability of a noisy observation given the time derivatives of the state variables, \({{\dot{\mathbf{x}}}}(t)\), which can be directly linked to the explicit form of the DEs via Eq. (1). The graphical model is then conditioned on surrogates of the state variables \(\mathbf{x}(t)\), which enter the DEs via Eq. (1).

## 7 Results

*p*-values shown in Tables 1, 2, 3, 4, 5, 6, 7 of Appendix A. They confirm that the observed trends are statistically significant.

The comparison with GPode (Barber and Wang 2014) has been relegated to Appendix A. A naive application of this method, starting from a vague prior and no knowledge of the noise variance, consistently led to singularities with negative infinite log likelihoods, presumably due to the approximations inherent in GPode (integrating out the state variables and then reinserting them via surrogate variables; see Sect. 6). To get GPode to work, we had to use additional prior information (noise variance assumed to be known, informative parameter priors and informative parameter initialization). Still, we found that RKGW outperformed GPode on the Lotka–Volterra data, while for the other data, both methods were on a par (see Figs. 13, 14, 15, 16 in Appendix C). Note that RKGW achieved this performance without the inclusion of additional prior information.

## 8 Discussion

Inference in complex systems described by coupled differential equations (DEs) using gradient matching is challenging when the intrinsic length scales of functional change vary along the abscissa (time for dynamical systems, radius for the soft tissue mechanical model). In this article, we have proposed a time warping scheme to homogenize these length scales, based on an objective function that encourages functional invariance with respect to second-order differentiation. Applications to noisy data from three dynamical systems (Lotka–Volterra, FitzHugh–Nagumo, biopathway) have demonstrated consistent improvement over no warping for higher SNRs (30 and 40 dB). For lower SNRs (10 and 20 dB), the performance was significantly improved in several cases, and never worse than for the standard scheme. For a soft tissue mechanical model with SNR \(=\) 10 dB, the proposed method significantly outperformed all other methods in function space, and for 3 out of 4 of the parameters.
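The principle of functional invariance under second-order differentiation can be illustrated with a toy chirp. In the sketch below, the signal \(\sin (t^2)\) has a length scale that shrinks with time, whereas in the warped time \(\tilde{t}=t^2\) it becomes a regular sinusoid satisfying \(q''=-q\). The warp is assumed known here purely for illustration (in the proposed scheme it is learned from data), and the residual measure `invariance_residual` is a hypothetical diagnostic, not the objective function of the method.

```python
import math

def invariance_residual(f, d2f):
    """Relative residual of the best fit d2f = -c * f (zero for a pure sinusoid)."""
    c = -sum(a * b for a, b in zip(f, d2f)) / sum(a * a for a in f)
    num = sum((b + c * a) ** 2 for a, b in zip(f, d2f))
    return num / sum(b * b for b in d2f)

# Original time: chirp x(t) = sin(t^2) with its analytic second derivative
ts = [0.5 + 2.5 * i / 400 for i in range(401)]
x = [math.sin(t * t) for t in ts]
d2x = [2 * math.cos(t * t) - 4 * t * t * math.sin(t * t) for t in ts]

# Warped time u = t^2 (a bijection for t > 0): q(u) = sin(u), q''(u) = -sin(u)
us = [0.25 + 8.75 * i / 400 for i in range(401)]
q = [math.sin(u) for u in us]
d2q = [-math.sin(u) for u in us]

res_original = invariance_residual(x, d2x)
res_warped = invariance_residual(q, d2q)
```

In the original time domain the second derivative is far from proportional to the signal, whereas the residual vanishes in the warped domain, where a single lengthscale is adequate for the whole signal.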

The motivation for the proposed scheme comes from the idea of manifold Gaussian processes (Calandra et al. 2016). The objective of the paper by Calandra et al. (2016) is to alleviate the problem of learning complex functions by transforming the data into a feature space such that the regression task becomes easier in the new latent representation. This latent feature space is learned along with the actual function in a supervised manner. Typical applications where the proposed approach achieves improved results are high-dimensional processes confined to low-dimensional manifolds, as their successful identification reduces the effect of the curse of dimensionality. The authors also demonstrate that their approach can learn time warpings that alleviate function regression. Common to many regression methods, like Gaussian processes and kernel ridge regression, are smoothness assumptions about the functions to be modelled. These assumptions are too restrictive if the smoothness characteristics change in time, leading to poor interpolants that do not match the true underlying functions. Warping the original time axis into a transformed space in which the smoothness characteristics are more uniform can then lead to improved regression results, as both Calandra et al. (2016) and we show in our papers. The essential difference between the two approaches is shown in Figs. 1 and 8. In Calandra et al. (2016), the model used for performing the time warping (e.g. a multilayer perceptron, as used by the authors) has to figure out the warping strategy on its own, as part of an overall supervised learning process. Note that time warping is only one of many applications of the authors’ method, along with manifold learning and the identification of low-dimensional subspaces for high-dimensional functions, as described above. 
Our method, on the other hand, is solely focussed on learning scalar functions in time, as part of the wider problem of parameter inference in systems of coupled differential equations. For that reason, we encapsulate the homogenization strategy—the strategy that renders the smoothness characteristics more homogeneous in time—in a separate objective function. While our approach lacks the universal nature of manifold learning, it is ideally suited for temporal regression, as the homogenization of smoothness characteristics is the very objective of learning and does not have to be figured out by the learning machine on its own. To paraphrase that: Since we are not interested in manifold learning in general, but in parameter estimation of differential equations, we use a transformation into a ‘feature space’ that is solely focussed on time warping. Due to this focussed nature, the training scheme can make use of additional ‘prior knowledge’ (i.e. the homogenization strategy), which is encapsulated in a separate objective function.

*t*, and the second trajectory is formulated as a function of warped time \(\tilde{t}\), which is a smooth bijective function of the real time axis into itself. A standard approach for finding the optimal warping function \(\tilde{t}(t)\), referred to as ‘registration’ in Su et al. (2014), is to minimize the following functional:

Finally, as discussed in Section 5 of Su et al. (2014), it is natural to generalize the Euclidean metric to the geodesics of an arbitrary Riemannian manifold, e.g. in trajectories of images in video surveillance. However, this is less of an issue for low-dimensional functions in time. A closer investigation of this aspect could provide a topic for future research.

A natural continuation of our work would be a model extension along the lines of the hierarchical Bayesian modelling framework proposed in Section 3 of Xun et al. (2013), whereby the DEs shape the prior distribution over the parameters. This framework would naturally benefit from the homogenization of the intrinsic functional length scales achieved with the proposed scheme. Our investigations have provided a first proof-of-principle study. They also provide a quantification of the improvement in the accuracy of inference that can be achieved, over a wide range of signal-to-noise ratios.

## Footnotes

- 1. The dependency on \(\varvec{\varphi }_s\) is via \(k_s\) (which has not been made explicit in the notation).
- 2. It would be more accurate to write \(t_s\) and \(\tilde{t}_s\) instead of *t* and \(\tilde{t}\), which we avoid to reduce notational opacity.
- 3. The practical procedure is to increase \(\lambda _t\) until the results are invariant with respect to a further increase.
- 4. Recall that \(t_i\) depends on *s*, so a more accurate (but cumbersome) notation would be \(g_s(t_i) \rightarrow g_s(t_i^s)\).

## Notes

### Acknowledgements

This work was supported by EPSRC (EP/L020319/1).

## References

- Aronszajn N (1950) Theory of reproducing kernels. Trans Am Math Soc 68(3):337–404
- Barber D, Wang Y (2014) Gaussian processes for Bayesian estimation in ordinary differential equations. In: Proceedings of the 31st international conference on machine learning (ICML-14), pp 1485–1493
- Bishop CM (2006) Pattern recognition and machine learning. Springer, Singapore
- Calandra R, Peters J, Rasmussen CE, Deisenroth MP (2016) Manifold Gaussian processes for regression. In: 2016 International joint conference on neural networks (IJCNN). IEEE, pp 3338–3345
- Calderhead B, Girolami M, Lawrence ND (2009) Accelerating Bayesian inference over nonlinear differential equations with Gaussian processes. In: Proceedings of the 21st international conference on neural information processing systems (NIPS), pp 217–224
- Cohen JE (2004) Mathematics is biology’s next microscope, only better; biology is mathematics’ next physics, only better. PLoS Biol 2(12):e439
- Dattner IM, Klaassen CAJ (2015) Optimal rate of direct estimators in systems of ordinary differential equations linear in functions of the parameters. Electron J Stat 9(2):1939–1973
- Dondelinger F, Husmeier D, Rogers S, Filippone M (2013) ODE parameter inference using adaptive gradient matching with Gaussian processes. AISTATS 31:216–228
- Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360
- FitzHugh R (1955) Mathematical models of threshold phenomena in the nerve membrane. Bull Math Biophys 17(4):257–278
- González J, Vujačić I, Wit E (2013) Inferring latent gene regulatory network kinetics. Stat Appl Genet Mol Biol 12(1):109–127
- González J, Vujačić I, Wit E (2014) Reproducing kernel Hilbert space based estimation of systems of ordinary differential equations. Pattern Recognit Lett 45:26–32
- Hansen BE (2014) Nonparametric sieve regression: least squares, averaging least squares, and cross-validation. In: The Oxford handbook of applied nonparametric and semiparametric econometrics and statistics, chap 8. Oxford University Press, Oxford
- Holzapfel GA, Ogden RW (2009) Constitutive modelling of passive myocardium: a structurally based framework for material characterization. Philos Trans R Soc Lond A Math Phys Eng Sci 367(1902):3445–3475
- Holzapfel GA, Gasser TC, Ogden RW (2000) A new constitutive framework for arterial wall mechanics and a comparative study of material models. J Elast Phys Sci Solids 61(1–3):1–48
- Liang H, Wu H (2008) Parameter estimation for differential equation models using a framework of measurement error in regression models. J Am Stat Assoc 103(484):1570–1583
- Lotka AJ (1920) Analytical note on certain rhythmic relations in organic systems. Proc Natl Acad Sci USA 6(7):410
- Lu T, Liang H, Li H, Wu H (2011) High-dimensional ODEs coupled with mixed-effects modeling techniques for dynamic gene regulatory network identification. J Am Stat Assoc 106(496):1242–1258
- Macdonald B, Higham C, Husmeier D (2015) Controversy in mechanistic modelling with Gaussian processes. In: Proceedings of the 32nd international conference on machine learning, PMLR, vol 37, pp 1539–1547
- Ramsay JO, Hooker G, Campbell D, Cao J (2007) Parameter estimation for differential equations: a generalized smoothing approach. J R Stat Soc Ser B (Stat Methodol) 69(5):741–796
- Ranciati S, Viroli C, Wit E (2016) Bayesian smooth-and-match estimation of ordinary differential equations parameters with quantifiable solution uncertainty. arXiv:1604.02318v3 [stat.ME]
- Robinson JC (2004) An introduction to ordinary differential equations. Cambridge University Press, Cambridge
- Su J, Kurtek S, Klassen E, Srivastava A (2014) Statistical analysis of trajectories on Riemannian manifolds: bird migration, hurricane tracking and video surveillance. Ann Appl Stat 8(1):530–552
- Vyshemirsky V, Girolami MA (2008) Bayesian ranking of biochemical system models. Bioinformatics 24(6):833–839
- Wu H, Lu T, Xue H, Liang H (2014) Sparse additive ordinary differential equations for dynamic gene regulatory network modeling. J Am Stat Assoc 109(506):700–716
- Xue H, Miao H, Wu H (2010) Sieve estimation of constant and time-varying coefficients in nonlinear ordinary differential equation models by considering both numerical error and measurement error. Ann Stat 38:2351–2387
- Xun X, Cao J, Mallick B, Carroll RJ, Maity A (2013) Parameter estimation of partial differential equation models. J Am Stat Assoc 108(503):37–41

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.