## Abstract

Hybrid semi-parametric modeling, which combines mechanistic and machine-learning methods, has proven to be a powerful approach for process development. This paper proposes bootstrap aggregation to increase the predictive power of hybrid semi-parametric models when the process data are obtained by statistical design of experiments. A fed-batch *Escherichia coli* optimization problem is addressed, in which three factors (biomass growth setpoint, temperature, and biomass concentration at induction) were designed statistically to identify optimal cell growth and recombinant protein expression conditions. Synthetic data sets were generated by applying three distinct design methods, namely Box–Behnken, central composite, and Doehlert designs. Bootstrap-aggregated hybrid models were developed for the three designs and compared against the respective non-aggregated versions. It is shown that bootstrap aggregation significantly decreases the prediction mean squared error of new batch experiments for all three designs. The number of (best) models to aggregate is a key calibration parameter that needs to be fine-tuned for each problem. The Doehlert design was slightly better than the other designs in identifying the process optimum. Finally, the availability of several predictions allowed error bounds to be computed for the different parts of the model, which provides additional insight into the variation of predictions within the model components.

## References

- 1.
Thompson ML, Kramer MA (1994) Modeling chemical processes using prior knowledge and neural networks. AIChE J 40:1328–1340. https://doi.org/10.1002/aic.690400806

- 2.
Psichogios DC, Ungar LH (1992) A hybrid neural network-first principles approach to process modeling. AIChE J 38:1499–1511. https://doi.org/10.1002/aic.690381003

- 3.
Simutis R, Oliveira R, Manikowski M et al (1997) How to increase the performance of models for process optimization and control. J Biotechnol 59:73–89. https://doi.org/10.1016/S0168-1656(97)00166-1

- 4.
Schubert J, Simutis R, Dors M et al (1994) Bioprocess optimization and control: application of hybrid modelling. J Biotechnol 35:51–68. https://doi.org/10.1016/0168-1656(94)90189-9

- 5.
van Can HJL, te Braake HAB, Bijman A et al (1999) An efficient model development strategy for bioprocesses based on neural networks in macroscopic balances: part II. Biotechnol Bioeng 62:666–680. https://doi.org/10.1002/(SICI)1097-0290(19990320)62:6%3c666:AID-BIT6%3e3.0.CO;2-S

- 6.
von Stosch M, Peres J, de Azevedo SF, Oliveira R (2010) Modelling biochemical networks with intrinsic time delays: a hybrid semi-parametric approach. BMC Syst Biol 4:131. https://doi.org/10.1186/1752-0509-4-131

- 7.
Oliveira R (2003) Combining first principles modelling and artificial neural networks: a general framework. Comput Aided Chem Eng 14:821–826. https://doi.org/10.1016/S1570-7946(03)80218-3

- 8.
von Stosch M, Oliveira R, Peres J, Feyo de Azevedo S (2014) Hybrid semi-parametric modeling in process systems engineering: past, present and future. Comput Chem Eng 60:86–101. https://doi.org/10.1016/J.COMPCHEMENG.2013.08.008

- 9.
von Stosch M, Oliveira R, Peres J, Feyo de Azevedo S (2011) A novel identification method for hybrid (N)PLS dynamical systems with application to bioprocesses. Expert Syst Appl 38:10862–10874. https://doi.org/10.1016/J.ESWA.2011.02.117

- 10.
Wang X, Chen J, Liu C, Pan F (2010) Hybrid modeling of penicillin fermentation process based on least square support vector machine. Chem Eng Res Des 88:415–420. https://doi.org/10.1016/J.CHERD.2009.08.010

- 11.
Portela RMC, von Stosch M, Oliveira R (2018) Hybrid semiparametric systems for quantitative sequence-activity modeling of synthetic biological parts. Synth Biol 3:10

- 12.
Zhang J (1999) Developing robust non-linear models through bootstrap aggregated neural networks. Neurocomputing 25:93–113. https://doi.org/10.1016/S0925-2312(99)00054-5

- 13.
Mevik B-H, Segtnan VH, Næs T (2004) Ensemble methods and partial least squares regression. J Chemom 18:498–507. https://doi.org/10.1002/cem.895

- 14.
Carinhas N, Bernal V, Teixeira AP et al (2011) Hybrid metabolic flux analysis: combining stoichiometric and statistical constraints to model the formation of complex recombinant products. BMC Syst Biol 5:34. https://doi.org/10.1186/1752-0509-5-34

- 15.
Prasad AM, Iverson LR, Liaw A (2006) Newer classification and regression tree techniques: bagging and random forests for ecological prediction. Ecosystems 9:181–199. https://doi.org/10.1007/s10021-005-0054-1

- 16.
Svetnik V, Liaw A, Tong C et al (2003) Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inf Comput Sci 43:1947–1958. https://doi.org/10.1021/ci034160g

- 17.
Breiman L (1996) Bagging predictors. Mach Learn 24:123–140. https://doi.org/10.1007/BF00058655

- 18.
Tian Y, Zhang J, Morris J (2004) Dynamic on-line reoptimization control of a batch MMA polymerization reactor using hybrid neural network models. Chem Eng Technol 27:1030–1038. https://doi.org/10.1002/ceat.200402068

- 19.
Peres J, Oliveira R, Feyo de Azevedo S (2000) Knowledge based modular networks for process modelling and control. Comput Aided Chem Eng 8:247–252. https://doi.org/10.1016/S1570-7946(00)80043-7

- 20.
Kahrs O, Marquardt W (2007) The validity domain of hybrid models and its application in process optimization. Chem Eng Process Process Intensif 46:1054–1066. https://doi.org/10.1016/j.cep.2007.02.031

- 21.
von Stosch M, Willis MJ (2017) Intensified design of experiments for upstream bioreactors. Eng Life Sci 17:1173–1184. https://doi.org/10.1002/elsc.201600037

- 22.
von Stosch M, Hamelink J-M, Oliveira R (2016) Hybrid modeling as a QbD/PAT tool in process development: an industrial *E. coli* case study. Bioprocess Biosyst Eng 39:773–784. https://doi.org/10.1007/s00449-016-1557-1

- 23.
Gnoth S, Simutis R, Lübbert A (2010) Selective expression of the soluble product fraction in *Escherichia coli* cultures employed in recombinant protein production processes. Appl Microbiol Biotechnol 87:2047–2058. https://doi.org/10.1007/s00253-010-2608-1

- 24.
Gnoth S, Jenzsch M, Simutis R, Lübbert A (2008) Product formation kinetics in genetically modified *E. coli* bacteria: inclusion body formation. Bioprocess Biosyst Eng 31:41–46. https://doi.org/10.1007/s00449-007-0161-9

- 25.
Lin Y, Zhang Z, Thibault J (2009) Comparison of experimental designs using neural networks. Can J Chem Eng 87:965–971. https://doi.org/10.1002/cjce.20233

- 26.
Alam FM, McNaught KR, Ringrose TJ (2004) A comparison of experimental designs in the development of a neural network simulation metamodel. In: Simulation modelling practice and theory. pp 559–578

- 27.
Levisauskas D, Galvanauskas V, Henrich S et al (2003) Model-based optimization of viral capsid protein production in fed-batch culture of recombinant *Escherichia coli*. Bioprocess Biosyst Eng 25:255–262. https://doi.org/10.1007/s00449-002-0305-x

## Ethics declarations

### Conflict of interest

The other authors declare that no competing interests exist.

## Appendix

### A: *E. coli* simulation fed-batch model

The model describes the production of viral capsid protein by a recombinant *E. coli* strain in a fed-batch bioreactor. It was proposed in [21] as an adaptation of the model in [27]. The model comprises the material balances for biomass, substrate, and product concentrations as well as the overall mass balance in a stirred tank bioreactor:

with \(\mu\), \(v_{S}\), and \(v_{P}\) the specific rates of biomass growth (1/h), substrate uptake (1/h), and product formation (U/g/h), \(X\), \(S\), and \(P\) the biomass (g/kg), substrate (g/kg), and product concentrations (U/kg), \(D = u_{F} /W\) (1/h) the dilution rate, and \(u_{F}\) the feeding rate (kg/h).
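The balance equations themselves are not reproduced in this text. A standard fed-batch form that is consistent with the symbols and units defined above, offered as an assumption rather than as the authors' exact equations, would read:

\[
\begin{aligned}
\frac{dX}{dt} &= \mu X - D X, \\
\frac{dS}{dt} &= -v_{S} X + D\left( S_{f} - S \right), \\
\frac{dP}{dt} &= v_{P} X - D P, \\
\frac{dW}{dt} &= u_{F},
\end{aligned}
\]

where \(S_{f}\) (g/kg) denotes the substrate concentration in the feeding solution.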

The specific biomass growth rate was modeled using the expression:

with \(\mu_{\text{max}} = 0.737\) (1/h), \(K_{S} = 0.00333\) (g/kg), \(K_{i} = 93.8\) (g/kg), \(\alpha = 0.0495\) (1/°C), \(T_{\text{ref}} = 37\) (°C), and \(T\) (°C) the temperature of the culture broth.
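The expression is missing from this text. Judging only from the listed parameters, one plausible form (an assumption, not confirmed by the source) is Monod-type kinetics with substrate inhibition and an exponential temperature correction:

\[
\mu = \mu_{\text{max}} \, \frac{S}{K_{S} + S + S^{2}/K_{i}} \, \exp\left( \alpha \left( T - T_{\text{ref}} \right) \right).
\]

Under this assumed form, the temperature factor equals one at \(T = T_{\text{ref}}\) and falls below one for cooler cultures.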

The specific substrate uptake rate is modeled via:

with \(Y_{XS} = 0.46\) (g/g) and \(m = 0.0242\) (g/g/h).
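The uptake expression is likewise not shown here. A yield coefficient combined with a maintenance term suggests the usual Pirt-type relation, stated here as an assumption:

\[
v_{S} = \frac{\mu}{Y_{XS}} + m .
\]

The units agree with the definitions above: \(\mu / Y_{XS}\) and \(m\) are both expressed in g substrate per g biomass per hour.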

The specific product formation rate is modeled by

with

with \(A_{\text{eng}} = 62\) (kJ/mol), \(R_{\text{eng}} = 551\) (kJ/mol), \(R = 8.3144 \times 10^{-3}\) (kJ/mol/K), \(T_{PX} = 1.495\) (h), \(p_{X} = 50\) (U/g), \(k_{\mu} = 0.61\) (1/h), \(k_{m} = 751\) (U/g), \(k_{i\mu} = 0.0174\) (1/h), and the induction parameter \(I_{D} = 0\) before induction and \(I_{D} = 1\) afterwards.

For the feeding rate, an exponential profile was adopted to match a desired constant specific biomass growth, \(\mu_{\text{set}}\), that is

where \(X_{0} = X\left( {t_{0} } \right)\) (g/kg) is the initial biomass concentration and \(W_{0} = W\left( {t_{0} } \right)\) (kg) is the initial weight of the culture broth.
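The feeding profile is not reproduced in this text. The sketch below shows one common way to derive an exponential feed for a constant \(\mu_{\text{set}}\). It assumes that the total biomass grows as \(X_{0} W_{0} \exp(\mu_{\text{set}}(t - t_{0}))\) and that the feed supplies exactly the substrate demand \((\mu_{\text{set}}/Y_{XS} + m) X W\). The function name, its signature, and the inclusion of the maintenance term are assumptions, not the authors' formula.

```python
import math


def exponential_feed_rate(t, t0, x0, w0, mu_set, y_xs, s_f, m=0.0):
    """Feed rate (kg/h) intended to sustain a constant specific growth rate mu_set.

    Assumption: the substrate demand (mu_set / y_xs + m) * X * W is met entirely
    by the feed, and total biomass X * W grows exponentially at rate mu_set.
    """
    # Substrate needed per unit biomass and time (g/g/h).
    demand_per_biomass = mu_set / y_xs + m
    # Total biomass in the reactor at time t (g).
    total_biomass = x0 * w0 * math.exp(mu_set * (t - t0))
    # Dividing by the feed concentration (g/kg) gives the feed mass flow (kg/h).
    return demand_per_biomass * total_biomass / s_f


# Hypothetical initial conditions; mu_set, y_xs, m and s_f are taken from the text.
u_start = exponential_feed_rate(0.0, 0.0, x0=5.0, w0=5.0,
                                mu_set=0.3, y_xs=0.46, s_f=300.0, m=0.0242)
u_one_hour = exponential_feed_rate(1.0, 0.0, x0=5.0, w0=5.0,
                                   mu_set=0.3, y_xs=0.46, s_f=300.0, m=0.0242)
```

Whatever the prefactor, the defining property of such a profile is that the feed rate grows by the factor \(\exp(\mu_{\text{set}} \Delta t)\) over any interval \(\Delta t\).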

The process was divided into two phases, a growth phase and a production phase. During the growth phase, \(\mu_{\text{set}} = 0.3\) (1/h) and \(T = 34\) (°C). The duration of the growth phase was adapted to yield the biomass concentration at induction, \(X_{\text{ind}}\), specified by the DoEs. The substrate concentration in the feeding solution was set to \(S_{f} = 300\) (g/kg). Data for online variables were logged every 6 min. The biomass and product concentrations (offline variables) were measured 20 times during each fermentation. The data were corrupted with 5% Gaussian (white) noise.
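As a rough illustration of the sampling and noise settings just described, the sketch below builds an online time grid with a 6 min step and 20 offline sampling times, and corrupts a signal with Gaussian noise. Two points are assumptions because the text does not specify them: the 5% is interpreted as a standard deviation relative to each value, and the offline samples are taken as equally spaced. The function names and the batch duration are hypothetical.

```python
import numpy as np


def sampling_times(duration_h, online_step_h=0.1, n_offline=20):
    """Return the online grid (every 6 min = 0.1 h) and the offline sampling times.

    Assumption: offline samples are equally spaced over the batch.
    """
    n_online = int(round(duration_h / online_step_h)) + 1
    online = np.linspace(0.0, duration_h, n_online)
    offline = np.linspace(0.0, duration_h, n_offline)
    return online, offline


def corrupt_with_noise(values, rel_sd=0.05, rng=None):
    """Add zero-mean Gaussian noise with standard deviation rel_sd * |value|.

    Assumption: '5% noise' is read as noise proportional to the signal.
    """
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values, dtype=float)
    noise = rng.normal(0.0, 1.0, size=values.shape) * rel_sd * np.abs(values)
    return values + noise


# Hypothetical 10 h batch with a made-up biomass trajectory.
online_t, offline_t = sampling_times(10.0)
biomass_true = 5.0 * np.exp(0.3 * offline_t)
biomass_measured = corrupt_with_noise(biomass_true, rng=np.random.default_rng(0))
```

Passing a seeded generator keeps the synthetic data reproducible, which matters when bootstrap resamples are later compared across designs.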

### B: *E. coli* hybrid semi-parametric model

The parametric part of the hybrid model is based on the material balance equations of biomass and product, that is

where \(D\) is the dilution rate, and \(X\) and \(P\) are the biomass and product concentrations with specific reaction rates \(\mu\) and \(v_{p}\), respectively (note that the hybrid model does not consider substrate dynamics). Thus, the volumetric rate Eq. (2) simplifies as follows for the present problem:

The specific rates \(\mu\) and \(v_{p}\) are much more difficult to establish; thus, they were modeled by a simple feedforward neural network with three layers only:

with \(w = \{ w^{1,1}, w^{1,2}, w^{2,1}, w^{2,2} \}\). The network has only three inputs, namely, the biomass, \(X\), the feeding rate, \(F\), and the cultivation temperature, \(T\). Thus, the pre-processing function (Eq. (3)) reduces to the following form:

Preliminary tests have shown that five neurons in the hidden layer are optimal for the present case study data set, which corresponds to \(\dim \left( w \right) = 4 \times 5 + 6 \times 2 = 32\) parameters to be identified in each run. The number of hidden nodes of the neural network was thus set to five in all studies performed.
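The parameter count above can be checked with a small sketch of the hybrid structure: a network with three inputs, five hidden nodes, and two outputs, each layer carrying a bias, feeding the biomass and product balances. The tanh hidden activation, the linear output layer, the balance form \(dX/dt = (\mu - D)X\), \(dP/dt = v_{p}X - DP\), and all function names are assumptions made for illustration; input pre-processing is omitted.

```python
import numpy as np

N_INPUTS, N_HIDDEN, N_OUTPUTS = 3, 5, 2  # inputs (X, F, T); outputs (mu, v_p)


def init_weights(rng):
    """Random weights with bias columns: 4 x 5 hidden and 6 x 2 output parameters."""
    w_hidden = rng.normal(scale=0.1, size=(N_HIDDEN, N_INPUTS + 1))
    w_output = rng.normal(scale=0.1, size=(N_OUTPUTS, N_HIDDEN + 1))
    return w_hidden, w_output


def n_parameters(weights):
    """Total number of network parameters to be identified."""
    return sum(w.size for w in weights)


def specific_rates(inputs, weights):
    """Feedforward pass with a tanh hidden layer and a linear output layer (assumed)."""
    w_hidden, w_output = weights
    hidden = np.tanh(w_hidden @ np.append(inputs, 1.0))  # append 1.0 for the bias
    return w_output @ np.append(hidden, 1.0)


def hybrid_rhs(state, feed_rate, temperature, dilution_rate, weights):
    """Assumed balances: dX/dt = (mu - D) * X and dP/dt = v_p * X - D * P."""
    x, p = state
    mu, v_p = specific_rates(np.array([x, feed_rate, temperature]), weights)
    return np.array([(mu - dilution_rate) * x, v_p * x - dilution_rate * p])


weights = init_weights(np.random.default_rng(0))
total_parameters = n_parameters(weights)  # 4 * 5 + 6 * 2
```

In a bootstrap-aggregated setting, a function such as `init_weights` would be called once per resampled data set, so every ensemble member carries its own 32 parameters.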

## About this article

### Cite this article

Pinto, J., de Azevedo, C.R., Oliveira, R. *et al.* A bootstrap-aggregated hybrid semi-parametric modeling framework for bioprocess development. *Bioprocess Biosyst Eng* **42**, 1853–1865 (2019). https://doi.org/10.1007/s00449-019-02181-y

### Keywords

- Hybrid semi-parametric modeling
- Hybrid modeling
- Bagging
- Design of experiments
- Sampling error
- Data portioning
- Ensemble methods