Introduction

Cellulases are a class of enzymes, found in microbial life forms, that break down cellulose and other polysaccharides into shorter (and sometimes monomeric) sugars. Cellulases play an important role in organisms, as they are a critical part of the metabolic pathways that confer the ability to obtain and use the energy that sustains life. Without cellulases, life on Earth would not have existed.

In addition to their role in cellular processes, cellulases are used in industrial ethanol production for fuel, driven by the search for new non-fossil-based alternative energy resources, because the sugar molecules produced by cellulase action can be chemically converted into ethanol. Therefore, improving our understanding of the enzymatic mechanism and action of these enzymes would not only satisfy our intellectual curiosity but would also help develop alternative energy resources for humanity.

Cellulases and their associated proteins are, by themselves, an exciting and active area of research. For example, the Carbohydrate-Active enZymes (CAZy) database is a specialized database that has curated information for enzymes such as glycosyl hydrolases, glycosyl transferases, polysaccharide lyases, carbohydrate esterases, and auxiliary activity enzymes [1]. The role of cellulases is not limited to the degradation of cellulose; it also extends to plant energy storage, the plant life cycle, and other processes. Industrially, cellulases are widely used in the paper and textile industries for processing recycled paper, enhancing the softness of fabrics, and converting hardwood fibers to softer, malleable fibers for fine paper products [2]. In some detergents, cellulases are used as ingredients to make fabric appear brighter and relatively whiter. For these purposes, organisms such as Trichoderma reesei and Clostridium thermocellum are frequently used as a source for obtaining large quantities of biocatalysts, since their genomes harbor multiple and diverse cellulases, exo-glucanases, endoglucanases, and other genes of commercial importance [3].

In the case of biofuels, a diverse set of enzymes needs to be used during industrial processes; these enzymes act synergistically to break down the polymeric structure of cellulose into simple sugars. Because cellulolytic enzymes have different and unique mechanisms of action, a mix of enzymes is sometimes used industrially for the specific biomass to be degraded. With second-, third-, and fourth-generation biomass for biofuel production, determining which enzymes and compositions will be most effective in converting biomass into simple sugars is a challenge, and computational methods have been used to determine the optimal composition. Some of these computational methods include predicting the tertiary structure of an enzyme and analyzing its structural dynamics to reveal its mechanism of action. Simulations of an enzyme structure, either by itself or in complex with a substrate or another protein, are performed using a fine-grained (i.e., all-atom) model or a coarse-grained model. In either case, the trade-off is between the level of detail and computation time. Each method has distinct advantages and disadvantages, and researchers choose the appropriate level of detail based on the model system and the hypothesis being tested.

Despite the importance of cellulases, however, our understanding of their structure, dynamics, and enzymatic action remains limited. In this review, we will primarily focus on computational studies of this rich enzyme family, which complement experimental investigations and inform on molecular and structural mechanisms of enzymatic action. To accomplish this, we will first provide a brief description of computational methods followed by an introduction to cellulase enzymes, and then we will review insights from computational studies in detail for specific cellulase families. The main objective of this review is to provide the overall landscape of the structural and dynamic studies of biofuel enzymes that use computational methods to understand enzymatic action, and to demonstrate how the use of these methods advances our understanding of specific enzyme families.

Brief Review of Computational Methods Used in Cellulase Studies

Enzymes function optimally through the physical and chemical arrangement of their structures, where the rearrangement can be local, such as the breaking of existing bonds to create new ones, or global, such as a conformational change that produces distinct apoenzyme and holoenzyme structures. The local and global changes in a protein structure are usually dynamically coupled, and the exact coupling between these motions at different physical and temporal scales is an intense area of research. For example, some researchers have argued that couplings between different dynamic modes are related to the allosteric behavior of proteins. Some recent computational studies, based on some experimental evidence, expanded the classical definition of allostery (a massive restructuring of protein shape upon ligand binding) to include coupled but localized motions of proteins, thereby arguing that allostery is not a behavior observed only in some proteins but one that exists to various degrees in all proteins [4,5,6,7,8,9,10]. However, such theoretical interpretations often remain somewhat descriptive and lack physical and chemical specificity. In contrast, molecular dynamics simulations applied with a selected level of molecular detail (fine-grained all-atom models vs. coarse-grained models) generate results that can be validated against experiments, such as single-molecule force spectroscopy [11] and small-angle X-ray scattering (SAXS) [12]. In this review, we will largely focus on computational methods related to molecular dynamics.

Coarse-Grained Vs. Fine-Grained Modeling

Granularity in computations is the extent to which an entity is divided into smaller groups of separable elements to facilitate computations. Two main modeling approaches for biomolecular systems incorporate different levels of atomistic detail: coarse-grained and fine-grained.

In general, coarse-grained methods reduce computation time and can provide insights into molecular mechanisms at longer time scales. However, coarse-grained approaches are limited in the detail they yield for subtle structural changes. For example, coarse-grained approaches can be used for studying the interdependence of cell components and processes, while atomically detailed simulations and molecular quantum mechanics can be applied to a region of a few atoms.

A commonly used coarse-grained approach is the elastic network model (ENM). These models use experimentally determined protein structures at equilibrium as input, and the fluctuations are then calculated by simple mode decomposition using normal mode analysis. Although these models often do not provide atomistic details, their computational cost is usually much lower than that of molecular dynamics simulations [13]. In addition to elastic network models, other types of coarse-graining are applied to molecular dynamics in terms of energetics. For example, the Martini force field, which is based on partitioning free energies between polar and apolar phases of chemical entities, has been applied to studies of the formation and fusion of vesicles, lamellar phase transformations, and membrane protein assemblies [14]; ELNEDIN, a physics-based coarse-grained model, uses a structural “scaffold” of the protein that reduces the degrees of freedom in the calculations [15].
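To make the elastic network idea concrete, the following is a minimal sketch of one common ENM variant, the Gaussian network model. It is an illustration only and is not taken from any of the cited studies; the function name, the 7.0 Å cutoff, and the toy linear chain are assumptions chosen for the example.

```python
import numpy as np

def gnm_fluctuations(coords, cutoff=7.0):
    """Relative mean-square fluctuations from a Gaussian network model.

    coords: (N, 3) array of node positions (e.g., C-alpha atoms), in angstroms.
    cutoff: contact distance in angstroms (an assumed, commonly used value).
    """
    coords = np.asarray(coords, dtype=float)
    # Pairwise distances between all nodes.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Kirchhoff matrix: -1 for each contact, node degree on the diagonal.
    kirchhoff = -(dist < cutoff).astype(float)
    np.fill_diagonal(kirchhoff, 0.0)
    np.fill_diagonal(kirchhoff, -kirchhoff.sum(axis=1))
    # Normal mode decomposition; the zero mode (rigid-body motion) is skipped.
    evals, evecs = np.linalg.eigh(kirchhoff)
    keep = evals > 1e-8
    # Diagonal of the pseudo-inverse, proportional to <dR_i^2>.
    return (evecs[:, keep] ** 2 / evals[keep]).sum(axis=1)

# Toy input: 10 nodes on a straight line, 3.8 angstroms apart (hypothetical).
toy_coords = np.column_stack([3.8 * np.arange(10), np.zeros(10), np.zeros(10)])
fluct = gnm_fluctuations(toy_coords)
print(fluct)
```

In this toy chain the terminal nodes have fewer contacts and therefore fluctuate more than the central ones, which mirrors the qualitative behavior usually seen for chain termini and exposed loops.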

Fine-grained methods include molecular dynamics (MD) simulations, quantum mechanics (QM), and quantum mechanics/molecular mechanics (QM/MM) methods. MD is suitable for understanding the details of the molecular and atomistic interactions that confer specificity to proteins. MD has been applied to simulations of polypeptide folding, biomolecular association, partitioning between solvents, membrane/micelle formation, chemical reactions and enzyme catalysis, photochemical reactions, and electron transfer. MD takes into account the degrees of freedom of motion, boundary conditions such as the system’s temperature and pressure, and a force field, and generates dynamic trajectories by solving Newton’s equations of motion to reveal structural mechanisms of enzymatic function [16].
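As an illustration of how a trajectory is generated by solving Newton's equations of motion, the sketch below integrates a single particle with the velocity Verlet scheme, a standard MD integrator. The one-dimensional harmonic "force field", the unit mass, and the time step are assumptions made for this example and do not correspond to any study cited here.

```python
import numpy as np

def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate Newton's equations for one particle in one dimension."""
    traj = np.empty(n_steps + 1)
    traj[0] = x
    f = force(x)
    for i in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
        traj[i + 1] = x
    return traj, v

# Toy system: harmonic spring with k = 1 and m = 1 (hypothetical units).
k, m = 1.0, 1.0
traj, v_end = velocity_verlet(x=1.0, v=0.0, force=lambda x: -k * x,
                              mass=m, dt=0.01, n_steps=1000)
energy_end = 0.5 * m * v_end**2 + 0.5 * k * traj[-1] ** 2
print(traj[-1], energy_end)
```

The total energy stays close to its initial value of 0.5, which is the kind of conserved quantity often monitored to check that a time step is small enough.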

In some cases, a combination of methods is used concurrently regardless of differences in time and space scales. These multiscale modeling approaches incorporate the application of both coarse-grained and fine-grained methods simultaneously, and use atomistic details of a small region (e.g., active site of a protein) and apply coarse-graining to the rest of the system to obtain the overall dynamics of a large complex to attempt to harness the advantages of both methods.

Constant-pH Molecular Dynamics

The constant-pH molecular dynamics (CpHMD) method involves determination of the protonation states of titratable sites in a protein at the specified pH. Detailed mechanistic studies of pH-dependent conformational processes use CpHMD to understand pH-coupled dynamical phenomena. The CpHMD method is capable of predicting experimental pKa values and the pH-dependent conformational dynamics, but it is limited in modeling water titration and long-range interactions [17].
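For orientation, the equilibrium protonation of a single, isolated titratable site with a fixed pKa follows the Henderson–Hasselbalch relation. The sketch below evaluates that reference curve; it is not a CpHMD implementation, which additionally couples protonation states to the protein conformation, and the pKa values used are illustrative assumptions.

```python
def protonated_fraction(ph, pka):
    """Fraction of an isolated titratable site that is protonated at a given pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))

# Illustrative (assumed) pKa values for two side-chain types.
example_pkas = {"Asp-like site": 4.0, "His-like site": 6.5}
for name, pka in example_pkas.items():
    for ph in (3.0, 5.0, 7.0):
        print(f"{name}: pH {ph:.1f} -> {protonated_fraction(ph, pka):.3f} protonated")
```

A site is half protonated when the pH equals its pKa; shifts of the apparent pKa away from such reference values are what CpHMD simulations aim to capture.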

Thermodynamic Integration

The thermodynamic integration method computes free energy differences and other thermodynamic properties of the system between two distinct states/conformations. To measure the change from one state to another, parameters are changed slowly so that equilibrium is maintained at each stage along the path [18]. The main requirement of the method is that the path should be reversible. Because non-physical but plausible paths can be employed (for example, allowing moves that increase free energy), the method confers great flexibility to molecular simulations. Both Monte Carlo and MD can be used with this method, since only the equilibrium states along the path need to be simulated [18].
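The procedure can be sketched for a toy case in which the answer is known analytically. Here the free energy difference is obtained by integrating the ensemble average of dU/dλ over a coupling parameter λ that switches a harmonic spring constant from k0 to k1. In a real study the average at each λ would be estimated from MD or Monte Carlo sampling; in this sketch the analytic average is used instead, and all names and values are assumptions for illustration.

```python
import numpy as np

def ti_free_energy(k0, k1, kT=1.0, n_windows=21):
    """Thermodynamic integration for U(x; lam) = 0.5 * k(lam) * x**2.

    k(lam) = (1 - lam) * k0 + lam * k1, so dU/dlam = 0.5 * (k1 - k0) * x**2.
    For a harmonic well, <x**2> = kT / k(lam) at equilibrium.
    """
    lambdas = np.linspace(0.0, 1.0, n_windows)
    k_lam = (1.0 - lambdas) * k0 + lambdas * k1
    dudl = 0.5 * (k1 - k0) * kT / k_lam      # <dU/dlam> at each window
    # Trapezoidal integration of <dU/dlam> over lam from 0 to 1.
    return float(np.sum(0.5 * (dudl[1:] + dudl[:-1]) * np.diff(lambdas)))

delta_f = ti_free_energy(k0=1.0, k1=4.0)
exact = 0.5 * np.log(4.0 / 1.0)              # analytic result, 0.5 * kT * ln(k1 / k0)
print(delta_f, exact)
```

The number of λ windows controls the quadrature error, which is one practical reason the path is traversed in many small steps.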

Metadynamics

Metadynamics is a sampling technique that incorporates an additional bias potential acting on a small, selected number of degrees of freedom, termed collective variables. Related methods in this class of biased-sampling techniques include umbrella sampling and steered MD. The approach pushes the system away from local free energy minima and therefore allows the exploration of new reaction pathways. Moreover, no prior knowledge of the energetics of a system is needed to implement metadynamics techniques. One limitation is that, because the estimate oscillates around the free energy rather than converging to it, it is difficult to decide when to stop the simulation. A second limitation is that identifying a set of collective variables for describing complex processes is very difficult [19].
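A minimal one-dimensional sketch of the idea follows: a particle in a double-well potential is propagated with overdamped Langevin dynamics while Gaussian "hills" are periodically deposited at its current position, so that the accumulated bias pushes it out of the starting well. The potential, hill height and width, deposition stride, temperature, and random seed are all assumptions chosen for this toy example.

```python
import numpy as np

def run_metadynamics(n_steps=20000, dt=0.005, kT=0.1, height=0.05,
                     width=0.2, stride=50, seed=0):
    """Toy metadynamics on U(x) = (x**2 - 1)**2 with x as the collective variable."""
    rng = np.random.default_rng(seed)
    centers = []                 # positions of the deposited Gaussian hills
    x = -1.0                     # start in the left well
    traj = np.empty(n_steps)
    for step in range(n_steps):
        # Force from the underlying double-well potential.
        force = -4.0 * x * (x * x - 1.0)
        if centers:
            c = np.asarray(centers)
            hills = height * np.exp(-0.5 * ((x - c) / width) ** 2)
            # Bias force, -dV/dx for V = sum of hills: pushes away from hill centers.
            force += np.sum(hills * (x - c)) / width**2
        # Overdamped Langevin step (unit friction coefficient assumed).
        x += force * dt + np.sqrt(2.0 * kT * dt) * rng.standard_normal()
        if step % stride == 0:
            centers.append(x)    # deposit a new hill at the current position
        traj[step] = x
    return traj, np.asarray(centers)

traj, centers = run_metadynamics()
print("range visited:", traj.min(), traj.max())
```

With the barrier set to ten times kT, an unbiased run of this length would be unlikely to leave the left well, whereas the biased run reaches the right well once enough hills have accumulated.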

Continuum-Molecular Dynamics

The continuum-molecular dynamics method generalizes simulated tempering to a continuous temperature space to provide a smooth transition from microscale to macroscale [20] by evaluating a conserved quantity that can be used to validate a simulation.

Quantum and Molecular Mechanics

QM and MM methods are part of the quantum chemistry toolbox [21]. Quantum mechanical descriptions are used to accurately model electronic rearrangements in those parts of a system that are involved in a chemical reaction, but quantum modeling is computationally expensive. MM, on the other hand, is less accurate but faster and computationally less costly. For simulations that do not involve a chemical reaction, the use of a simple MM force-field model is appropriate, which reduces the simulation time. To overcome the limitations of a full quantum mechanical or a full molecular mechanics description, hybrid QM/MM methods are an option, in which a part of the system is treated at the level of quantum chemistry (QM), while the computationally cheaper force field (MM) is retained for the larger part [22].

Monte Carlo Methods

Monte Carlo methods are useful for studying the thermodynamics of a protein over its conformational space and for searching for low-energy conformations. The major limitation of Monte Carlo is that it is a data-intensive method [23]. Mean field particle methods, a broad class of Monte Carlo algorithms, simulate a sequence of probability distributions for a non-linear evolution equation [24]. In contrast to conventional Monte Carlo, this technique uses sequentially interacting samples, in a way that is similar to statistical Markov processes.
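As a concrete instance of Monte Carlo sampling of a thermodynamic ensemble, the sketch below applies the Metropolis acceptance rule to a single coordinate in a harmonic potential. This is a generic textbook scheme rather than the specific algorithm of any cited work; the potential, step size, and seed are assumptions for the example.

```python
import numpy as np

def metropolis(energy, x0, n_steps, step_size=2.0, kT=1.0, seed=0):
    """Sample x from the Boltzmann distribution exp(-energy(x) / kT)."""
    rng = np.random.default_rng(seed)
    x, e = x0, energy(x0)
    samples = np.empty(n_steps)
    for i in range(n_steps):
        x_new = x + rng.uniform(-step_size, step_size)   # symmetric trial move
        e_new = energy(x_new)
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if e_new <= e or rng.random() < np.exp(-(e_new - e) / kT):
            x, e = x_new, e_new
        samples[i] = x                                    # rejected moves repeat x
    return samples

samples = metropolis(lambda x: 0.5 * x * x, x0=0.0, n_steps=100000)
mean_sq = float(np.mean(samples**2))
print(mean_sq)   # should approach kT / k = 1 for this harmonic potential
```

The large number of samples needed for a converged average, even in this one-dimensional case, illustrates why the method is described as data-intensive.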

Simulated Annealing

Simulated annealing (also known as generalized simulated annealing) is a probabilistic approach for approximating the global optimum of a given function and is often used when a large search space with many local minima needs to be studied. With respect to enzymes, the approach involves heating the system to a high temperature and then gradually cooling it to find the global minimum. When the energy landscape at the needed temperature is smooth, however, this method is limited, because it cannot identify an optimal solution [25].
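The heat-then-cool procedure can be sketched on a one-dimensional test function with several local minima. The function, the geometric cooling schedule, the step size, and the seed below are assumptions chosen for illustration; a real application would anneal the coordinates of a molecular system under a force field.

```python
import math
import random

def simulated_annealing(f, x0, t_start=5.0, t_end=0.01, n_steps=20000,
                        step=0.5, seed=0):
    """Minimize f by Metropolis moves under a geometrically decreasing temperature."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    for i in range(n_steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (i / (n_steps - 1))
        x_new = x + rng.uniform(-step, step)
        f_new = f(x_new)
        # Uphill moves are accepted often while hot and rarely once cold.
        if f_new <= fx or rng.random() < math.exp(-(f_new - fx) / t):
            x, fx = x_new, f_new
            if fx < best_f:
                best_x, best_f = x, fx
    return best_x, best_f

def rugged(x):
    """Test landscape: global minimum at x = 0, local minima near multiples of 2*pi/5."""
    return x * x + 2.0 * (1.0 - math.cos(5.0 * x))

best_x, best_f = simulated_annealing(rugged, x0=4.0)
print(best_x, best_f)
```

Starting from x = 4, a purely downhill search would stop in a nearby local minimum; the high-temperature phase is what lets the walker cross barriers before the cooling phase refines the solution.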

Structural and Functional Description of Cellulase Families and Insights from Computational Studies

We have divided this section into various subsections in terms of the enzymes used for studying the protein dynamics. We have compiled and tabulated the MD parameters used in the studies cited below in Supplementary Table T1 that the reader will find useful.

Family 7 Cellobiohydrolase

Most of these enzymes cleave β-1,4-glycosidic bonds in cellulose and β-1,4-glucans, which are their substrates. According to Koshland [26], the enzymes are classified into two groups based on their mechanism of action, i.e., retaining and inverting; the family 7 cellobiohydrolases are classified as retaining enzymes. The end product of a cellobiohydrolase is cellobiose, a disaccharide, which is one step away from converting the cellulose polymer to glucose units (Figs. 1a, b and 2a). Computational studies on cellobiohydrolases (Cel7A, Cel7B) have focused on the overall structural dynamics of the enzyme, either free or in complex with various substrates. For example, simulations of Cel7B from Melanocarpus albomyces were performed at constant pH to analyze the dynamic fluctuations of the loop regions, because the loop regions of cellulases, in general, play a major role in enzymatic function [27]. CpHMD works by computationally coupling the protonation states of some amino acids within the framework of classical MD and capturing the residue pKa shifts and dynamic charge coupling. In the MD simulations of 70 ns with 2-ps steps across various pH levels, the loops showed differential fluctuations. At the active pH, loop II showed increased flexibility compared to other pH levels. When CpHMD was performed for T. reesei Cel7A, a well-studied enzyme of industrial importance, the loop regions showed flexibility [28] that correlated with neutron scattering experiments. Interestingly, whereas the presence of charged residues (Asp and His) in Melanocarpus albomyces Cel7B is thought to contribute to the elevated pH profile compared to T. reesei Cel7A [28], these residues are not present in the loop region; therefore, the enzymatic function of this protein is an interplay of amino acid specificity and structural dynamics, possibly as a result of the dynamics of loop regions on the protein surface.
Separate comparisons of tunnel entrance at specific sites, loops near a specific subsite, loops near the catalytic center, and comparisons of product binding region with respect to other GH7 cellobiohydrolases revealed significant structural differences relating to differences in processivity, endo-initiation, and product inhibition [67]. Comparing T. reesei’s Cel7A with Cel7A of Trichoderma harzianum suggests that although they share high sequence homology (81% sequence identity), the short side chains of the adjacent residues in the catalytic tunnel create extra gaps at the side face of the catalytic tunnel [40].
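Loop flexibility of the kind discussed above is commonly quantified from a trajectory as a per-atom root-mean-square fluctuation (RMSF). The sketch below shows that calculation on a tiny synthetic trajectory; it is a generic illustration and not the analysis protocol of the cited studies. It assumes the frames have already been superimposed on a reference structure, and the array layout and toy data are hypothetical.

```python
import numpy as np

def rmsf(traj):
    """Per-atom RMSF from a pre-aligned trajectory of shape (frames, atoms, 3)."""
    traj = np.asarray(traj, dtype=float)
    mean_pos = traj.mean(axis=0)                     # average structure
    sq_dev = ((traj - mean_pos) ** 2).sum(axis=-1)   # squared displacement per frame
    return np.sqrt(sq_dev.mean(axis=0))              # time average, then root

# Synthetic data: atom 0 is rigid, atom 1 swings by 1 unit along x about (5, 0, 0).
toy_traj = np.array([
    [[0.0, 0.0, 0.0], [6.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [4.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [6.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [4.0, 0.0, 0.0]],
])
values = rmsf(toy_traj)
print(values)
```

Applied to C-alpha atoms of a real trajectory, peaks in such a profile would mark the flexible loop residues, and profiles computed at different pH values could then be compared residue by residue.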

Fig. 1

Metabolic pathways and breakdown of crystalline cellulose. a The enzymes described in this study are present in KEGG as part of the sucrose and starch metabolic pathway, namely endoglucanase, cellobiohydrolase, and lytic polysaccharide monooxygenase. Also, GH9 is part of the same metabolic pathway but breaks down 1,3-β-glucan. Similarly, Man5B is part of the fructose and mannose metabolic pathway, and GH18 is part of the amino sugar and nucleotide sugar metabolic pathway. b Schematic representation of enzymes involved in the conversion of cellulose to glucose

Fig. 2

Compilation of biofuel enzymes studied using computational methods. Cartoon representation of the structures mentioned in this review (Table 1), where the helices are colored cyan, beta-strands are colored red, and loops are colored magenta. a Family 7 cellobiohydrolase (pdb id: 8cel, 1eg1). The protein corresponding to pdb id: 8cel is prepared from homology modeling, and we have used the same coordinates of the modeled structures as used originally for carrying out the simulations. b Endoglucanase (pdb id: 3qr3 and 3ndy). c Man5B (pdb id: 3w0k). d GH9 (pdb id: 3ez8). e Cel48F (pdb id: 1f9d). f GH18 (4axn). g GH6 (pdb id: 1qk2 and 4avn)

The dynamics of a substrate-bound enzyme are significantly different from those of the free form. The Cel7A from Geotrichum candidum strain 3C (GcaCel7A) was subjected to MD simulation for 100 ns in three different setups: (1) in the free form, (2) in complex with a long substrate (cellononaose), and (3) in complex with a cellulose microfibril [31]. In these three MD simulations, GcaCel7A showed structural and functional characteristics similar to those of the industrially relevant HjeCel7A (from Hypocrea jecorina) (Table 1). On the other hand, ligand-bound MD simulations of T. reesei Cel7A (TrCel7A) showed that competitive binding exists in the presence of lignin, which is a known inhibitor and a by-product of the plant biomass during the pre-processing step. Successful removal of lignin ensured high turnover for the enzyme catalyst. Using a mix of crystalline and non-crystalline fibers and in the presence of 468 lignin molecules, the 1312-ns-long simulation indicated that the hydrophobic surface of the cellulose is the preferred binding site for both lignin and TrCel7A [29]. Lignin was also observed to bind to the hydrophobic patches of the carbohydrate-binding module (CBM) attached to TrCel7A, which amplifies its inhibitory activity. Similarly, MD simulation of PfCBH1 (a cellobiohydrolase belonging to the GH7 family) from Penicillium funiculosum with microcrystalline cellulose revealed that the substrate binding site is relatively more accessible, due to structural flexibility, than that of the T. reesei enzyme, and this might play a role in relatively faster product expulsion and thus a higher tolerance for inhibition by the product, i.e., cellobiose [30]. A study of Cel7A and Cel7B of T. reesei with the help of mutants suggested structural differences between the two enzymes around the catalytic center and at the active-site tunnel entrances and exits, all of which bear on processivity in GH7s [39]. Also, a study of the Cel7B catalytic domain of T. reesei with a cellulose microfibril provided insight into the domain’s complexation with cellulose chains from a crystal surface [33].

Table 1 Enzymes used in biofuel application reviewed in this paper, detailing the MD simulation time

Similar simulations of substrate-bound forms revealed that the cellulose binding site was highly conserved in three enzymes when bound with cellononaose: Cel7A from Heterobasidion irregulare (HirCel7A), Cel7A from H. jecorina (HjeCel7A), and Cel7D from P. chrysosporium (PchCel7D) [36]. In another study of HjeCel7A, a cellodextrin nonamer chain was placed at five different positions around the binding site, and the chain was observed to spontaneously diffuse into the catalytic tunnel by a cellobiose unit [35]. Further, potential of mean force calculations suggested that Cel7A recognizes the reducing end of the free cellulose chain [35]. In one case, it was suggested that the glycosylation reaction serves as the rate-limiting step in cellulose degradation [34]. A two-step simulation protocol, implemented to observe the binding and interaction of Cel7A’s CBM with the cellulose Iβ fiber, showed via a 40-ns-long simulation that the CBM prefers to bind to the hydrophobic faces of the fiber rather than the hydrophilic ones [68]. In a related study, the flexible, glycosylated linkers of the CBMs of T. reesei’s TrCel6A and TrCel7A were shown to bind non-specifically to the cellulose surface [37].

To study the interactions between the important residues of T. reesei Cel7A CBM and cellulose, a thermodynamic integration method was used to calculate the cellulose–Cel7A CBM binding free energy changes caused by Y5A, N29A, Y31A, Y32A, and Q34A mutations (pdb id: 1cbh) to demonstrate that interactions between residues and cellulose are dominated by the electrostatic changes [32]. The Cel7A from T. reesei was studied with MD to understand the structure–function relationships that glycosylation imparts to linkers. The enzyme is an intrinsically disordered protein most likely due to the absence of ordered secondary structure for the linker, as validated via 360-ns-long simulations. It was reported that the Cel7A linker is comparatively more disordered than other linkers in T. reesei cellulases [69].

MD simulations were used to examine the binding of cellobiose to the TrCel7A cellobiohydrolase and the effects of mutations that reduce cellobiose binding without affecting the structural integrity of the enzyme. The results showed that the product binding site has a specific flexibility that can sterically hinder cellobiose release, though many point mutations can still maintain the structural integrity of the enzyme. It was suggested that there is a trade-off between product inhibition and catalytic efficiency [70].

The ability of the unique Cel7B from the marine wood borer Limnoria quadripunctata to operate in saline conditions was investigated with a 250-ns-long MD simulation, which showed high flexibility of the exo-loops at the tunnel. The tolerance to high salt concentration is probably due to the acidic charge distribution on the enzyme’s surface, and the aromatic residues at the entrance of the tunnel may be involved in substrate binding [38].

Endoglucanases

Endoglucanases have been broadly studied as two types: those that contain CBM domains and those that do not. Endoglucanase D (EngD) from Clostridium cellulovorans consists of a catalytic domain linked via a flexible linker to a CBM domain (Figs. 1a, b and 2b). While computational methods were not used to study the enzyme’s structural dynamics, SAXS experiments revealed the flexibility of the linker, which allows an extended conformation of EngD in solution and indicated the importance of the CBM module. The cellotriose-bound EngD structure has an extended active-site cleft, which contains Trp162, a residue that is absent in a few other variants of the enzyme with significantly reduced activity [54]. Endoglucanase 3 of T. reesei (TrEG3) and of T. harzianum (ThEG3), members of the GH12 family, do not contain a CBM domain, and yet they catalyze cellulose hydrolysis [55]. The tertiary structure of ThEG3 (at 2.07 Å resolution) was determined by X-ray crystallography, and MD simulations were then used to investigate enzyme–substrate interactions to understand the role that certain aromatic residues play in recognizing and binding to the substrate. The study showed that, due to the significant spatial distance of this CBM-like cluster of aromatic residues from the catalytic site, productive substrate binding and catalytic efficiency require longer oligosaccharide chains that can bind simultaneously to the catalytic triad and the aromatic CBM-like cluster for efficient hydrolysis. The study thus highlighted why shorter oligosaccharides are hydrolyzed inefficiently by Cel12A [55].

To understand how ionic liquids (ILs) interact with enzymes at the molecular scale, endoglucanase (E1) from Acidothermus cellulolyticus was simulated in aqueous 1-ethyl-3-methylimidazolium chloride ([Emim]Cl) to study potential inactivation mechanisms. The study showed that the utility of ionic liquids is highly restricted by enzyme incompatibility and that the interactions crucial to activation or inactivation are unique. For example, [Emim]Cl interacts with higher specificity with E1’s binding site and disrupts native hydrophobic contacts, leading to inactivation of E1 [53]. A similar study was carried out on GH5 family endoglucanases from Trichoderma viride, Thermotoga maritima, and Pyrococcus horikoshii in the presence of 1-ethyl-3-methyl-imidazolium acetate ([EMIM][OAc]) with water at various temperatures [52]. The results showed that structural changes that happen at a long time scale (500 ns) cause deactivation of the enzymes. For example, in GH5 of T. viride, the deformation of the binding pocket is correlated with deactivation at low concentrations of the IL. Similarly, the deactivation of GH5 of T. maritima is due to changes in secondary structure that correlate with experimental data. However, the GH5 of P. horikoshii did not show any deactivation at low concentrations of IL [52].

A trimodular endoglucanase (CelB), comprising a catalytic domain, a CBM46 domain, and a rigid CBM_X domain sandwiched between them, was identified in Bacillus sp. BG-CS10; the CBM46 domain interacts with both the catalytic domain and the CBM_X domain. The resulting structure is labeled as an L-shaped cellulase. MD simulations for 50 ns indicated that the loop regions of the catalytic domain that contain the aromatic residues involved in substrate binding undergo relatively large structural changes, facilitating product release [61].

The computational studies for the BGLI gene include homology modeling for prediction of its tertiary structure from multiple sequence alignment with three selected templates of β-glucosidase. MD simulations were performed on the docked structure of BGLI with cellulose ligand to identify the ligand-binding domain of the enzyme. Stable conformations that were observed were in agreement with the structural flexibility of the free enzyme [71].

A 3D model for VpEXPA2, an α-expansin involved in the softening of Vasconcellea pubescens fruit, was built by a comparative modeling strategy using the structure of Phlp1, a β-allergen from Timothy grass (Phleum pratense), as the template. Docking studies were performed to predict the putative binding of different octasaccharides to the protein: two different hemicellulose octasaccharides and one cellodextrin 8-mer that resembles a water-soluble cellulose molecule. MD simulations were carried out for each substrate inside VpEXPA2 for 20 ns, which showed a strong interaction with the cellodextrin 8-mer and, in contrast, a weak interaction with the hemicellulose octasaccharides. It is reasonably hypothesized that the function of domain D1 of VpEXPA2 is highly dependent on the binding of the cellulose microfibril to domain D2 [72].

The atomistic simulations based on classical interaction potentials were used to examine the interactions of Cel5A with cellulose fibrils with amorphous-like and non-crystalline regions. The analysis of the catalytic domain suggests that the enzyme actually alters the cellulose structures and the charge around the catalytic cleft in the domain plays a significant role in enzymatic function [56].

Another structural study involved the prediction of the putative hydrogen bonds formed between the enzymes and cellohexaose using homology models of NfEG12A from Neosartorya fischeri P1 and an endoglucanase from Aspergillus niger. MD simulations were carried out to examine the effect of loop 3 on the catalytic efficiency of GH12 endoglucanases. Overall, the analysis by molecular mechanics/Poisson–Boltzmann surface area continuum solvation (MM/PBSA) demonstrated that the hydrogen-bond network between the protein and the substrate is enhanced by loop 3, resulting in an increase in the turnover rate and thereby improving the catalytic efficiency [60].

From a library of consensus mutations created using the sequence alignment of homologs of family 8 glycoside hydrolases, one of the mutants of Cel8A from Clostridium thermocellum showed a higher thermal stability without any loss of catalytic activity, possibly due to an increase in the conformational rigidity of the protein backbone in the unfolded state, as observed in an MD simulation of the mutant enzyme (G283P) [58].

The sequence and structural comparison studies were performed for the CBM of three endoglucanases (EG1, EGO, and EGV) modeled from T. reesei cellobiohydrolase CBHI. All the structures were found to be similar in their cellulose-binding domains, and disulfide bridges seem to stabilize the polypeptide fold [73].

A TrCel5A–cellulose complex was simulated with MD using TrCel5A-catalyzed hydrolysis of phosphoric acid-swollen cellulose (PASC) as a model system. The slowdown of hydrolysis was studied by kinetic measurements, and it was found that the slowdown is correlated with adsorption. The simulations helped significantly in identifying the residues potentially involved in binding. The results from the comparative analysis of the complex with the wild type further showed that catalytic ability affects the slowdown of the endoglucanase [59].

The structure, dynamics, and behavior of the Cel7B from Fusarium oxysporum were analyzed at 80 °C through MD simulations. The dynamical factors for analysis involved hydrogen bonds and fluctuations in the turn regions, which influence the activity and stability of the enzyme [74].

Another study using MD simulations identified the regions in proteins that trigger the partial unfolding for denaturation (also known as “weak spots”). These regions were identified in T. reesei Cel7B by calculating the distances between Cα in contact and their capability of forming disulfide bonds [57].

Lytic Polysaccharide Monooxygenases

Lytic polysaccharide monooxygenases (LPMOs) are recently discovered enzymes (Fig. 1a) with immense potential for industrial application in degrading the crystalline form of cellulose, as they boost the degradation process significantly [75]. Their importance and relevance have been described in detail in several reviews [41, 76, 77]. It is now well established that LPMOs can be classified into three subtypes based on the site of attack, namely (1) LPMO1 when oxidation occurs at the C1 carbon, (2) LPMO2 when oxidation occurs at the C4 carbon, and (3) LPMO3 if either the C1 or the C4 carbon is attacked (Fig. 3). Additionally, LPMOs are classified into four CAZy families of auxiliary activity enzymes (AA9, AA10, AA11, and AA13) on the basis of their potential ability to help the originally classified enzymes, the glycoside hydrolases (GHs), the polysaccharide lyases (PLs), and the carbohydrate esterases (CEs), gain access to the carbohydrates encrusted in the plant cell wall [78].

Fig. 3

Schematic representation of C1 (type 1) and C4 (type 2) mechanism of action identified in lytic polysaccharide monooxygenases (LPMOs). LPMOs oxidize on either C1 or C4 carbon, giving rise to specific products, lactone or ketoaldose, respectively

Molecular insights into LPMO’s mechanism of action were significantly improved by computational studies ranging from QM/MM models to multiscale models (Fig. 3). The QM/MM models that were built using the density functional theory on LPMO belonging to the AA9 family from Thermoascus aurantiacus shed light on the geometry and coordination chemistry of the reactive oxygen with Cu(II) atom. The results indicated that the formation of the complex (copper–oxyl reactive oxygen species) drives the catalytic activity with a rebound step for oxygen to complete the cycle [79].

Another study reported a four-coordinate tetragonal structure for the copper site of the T. aurantiacus enzyme in the oxidized state and a three-coordinate T-shaped structure in the reduced state [80]. The O2 reactivity of the Cu(I) site was evaluated computationally using experimentally calibrated DFT calculations. To determine the number and type of coordinating ligands in Cu-AA9, extended X-ray absorption fine structure (EXAFS) experiments, which provide information about atomic energy-level structures and the coordination of the metal center, were performed on the oxidized and reduced enzyme forms. Together, the results demonstrate that the structure of the enzyme site is suitable for rapid inner-sphere reductive activation of O2 through Cu(II)–superoxide formation [80].

MD simulations of an LPMO from the AA9 family revealed that the loop regions undergo conformational changes that make the enzyme flexible during substrate binding. These findings agree with the QM/MM results, in which the distance between the active-site copper and the C1 carbon is around 5 Å, enough to readily accommodate a superoxide intermediate of the reaction (a product of the reactive oxygen with the Cu(II) atom). The tyrosines (Y28, Y75, and Y198) were computationally observed to form local hydrophobic interactions that stabilize the active site during substrate binding [42].

Capturing enzyme dynamics accurately with MD simulations requires force fields that are accurate for the system at hand. A recent study probed the potential energy landscape of the AA9 family to create a specific set of force-field parameters [44]. Deriving such parameters for metalloproteins involves single-point energy evaluations over a rectangular grid of selected internal coordinates, which yield energy profiles for the bond stretches, angle bends, and torsions and thereby allow more realistic simulations. In recent years, multiscale modeling has been applied to study the large-scale dynamics of proteins and their interactions with substrates. Its advantage is that it gives the big picture of the interactions between the different components of a system. In the case of cellulases, the global dynamics of the enzymes on the surface of cellulose can shed light on how the complex synergistic activities of different enzymes help degrade cellulose. A multiscale model of LPMO with Cel7A (non-reducing end specific exo-cellulase) and Cel7B (reducing end specific exo-cellulase) of T. reesei showed that LPMO decrystallizes the crystalline cellulose surface by forming new chain termini within the fibril, rather than at the ends of the fibril. LPMO also has a higher affinity for the reducing end of the fibril [41]. This multiscale modeling study emphasizes the possible synergistic interaction between LPMO and other enzymes for faster degradation of crystalline cellulose.
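As a rough illustration of how a grid of single-point energies can be turned into a bonded force-field term, the sketch below extracts a harmonic force constant and an equilibrium distance from a one-dimensional bond-stretch scan, using central finite differences around the lowest-energy grid point. It is not the protocol of [44]: the function name, the Cu–N distance grid, the units (kcal/mol and Å), and the energies are hypothetical, and the energies are generated from a known parabola so that the recovered parameters can be checked.

```python
def harmonic_from_scan(distances, energies):
    """Estimate (k, r0) of E = 0.5*k*(r - r0)**2 + E0 from an evenly spaced scan.

    Uses the three grid points centred on the lowest-energy point.
    """
    i = min(range(len(energies)), key=energies.__getitem__)
    if i == 0 or i == len(energies) - 1:
        raise ValueError("energy minimum lies on the edge of the scan")
    h = distances[i + 1] - distances[i]          # grid spacing
    e_minus, e_zero, e_plus = energies[i - 1], energies[i], energies[i + 1]
    second_diff = e_plus - 2.0 * e_zero + e_minus
    k = second_diff / h**2                       # curvature = force constant
    # Vertex of the parabola through the three points.
    r0 = distances[i] - h * (e_plus - e_minus) / (2.0 * second_diff)
    return k, r0


# Hypothetical scan of a Cu-N distance (Angstrom) with synthetic energies
# (kcal/mol) drawn from a parabola with k = 120 and r0 = 2.03.
scan_r = [1.80 + 0.05 * n for n in range(9)]
scan_e = [0.5 * 120.0 * (r - 2.03) ** 2 for r in scan_r]

k_fit, r0_fit = harmonic_from_scan(scan_r, scan_e)
print(f"k = {k_fit:.1f} kcal/mol/A^2, r0 = {r0_fit:.3f} A")
```

Real quantum-chemical profiles are anharmonic, so a least-squares fit over more grid points, and analogous fits for angles and torsions, would normally be preferred over this three-point estimate.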

The reduction of the active site of the T. aurantiacus AA9 LPMO from state 1 (resting state) to state 2 (reduced state), and on to two isomers of state 3 (copper–superoxide intermediate), was recently investigated (Fig. 4a) [81]. The combined QM/MM simulations showed that the computational protocol followed in this study could reproduce the observed decrease in the coordination number when Cu(II) is reduced to Cu(I). A QM treatment was necessary for this system because a purely classical MM description cannot model chemical reactions. Of the two isomers observed for the Cu–superoxide complex, the multiscale modeling revealed that one is energetically preferred over the other. Further work on the enzyme–substrate complex by the same group led to the validation of four enzyme–substrate intermediate models based on bond-dissociation energy (BDE) [82]. Measuring BDEs experimentally is time consuming, so computing them offers a quicker and sensitive alternative. Specifically, in the LPMO studied (pdb id: 2yet [83]), the BDEs of the four intermediates, [Cu–OH]3+, [Cu–OH]2+, [Cu–O]2+, and [Cu–O]+, are comparable; however, the intermediate [Cu–OH]3+ is not favorable compared to the other three. The study also showed that the mechanism does not depend on the identity of the aromatic residue in the active site, as many LPMOs have either a Tyr or a Phe at the same position [82].

Fig. 4

Compilation of biofuel enzymes studied using computational methods. Cartoon representation of the structures mentioned in this review (Table 1), where the helices are colored cyan, beta-strands are colored red, and loops are colored magenta. a LPMO (pdb id: 2yet, 4b5q, 3eii, 4oy6, 4oy7). b Cellulosome (pdb id: 4iu3, 3kcp, 1ohz, 2b59). The protein corresponding to pdb id: 3EII is prepared from homology modeling, and we have used the same coordinates of the modeled structures as used originally for carrying out the simulations

MD simulations were also used to identify regions of the ScLPMO10B and ScLPMO10C LPMOs that deviate from their crystal structures, with the aim of finding surface charge modifications that increase stability in ionic liquids (ILs). The simulations were run for 250 ns in three ILs at 0 wt%, 10 wt%, and 20 wt% in water. The effects of IL exposure on the dynamic fluctuations of specific regions, on the overall structure, and on the structure of the active site were studied comprehensively and comparatively for both LPMOs. The results indicate that the two enzymes are structurally similar and that their fluctuations in IL and in water are nearly the same. Both LPMOs therefore appear to be unaffected by ionic liquids [43].

To study the functional aspects of CBP21, a chitin-active member of a carbohydrate-binding module family, NMR and isothermal titration calorimetry were used to map surface binding as a function of pH. The results showed that CBP21 is a compact and rigid molecule except at its catalytic metal binding site. CBP21 depends on a Cu ion for catalysis, and binding of cyanide to the metal indicates that it is involved in the oxidative cleavage of the substrate. Comparison with a GH61 LPMO further showed that the metal binding sites of the two enzymes differ significantly even though both catalyze the same reaction. An approach that uses the pH dependency of both the chitin–CBP21 interaction and the 1H exchange rate led to the identification of the residues that bind CBP21 to the chitin surface, based on the first NMR structure ever resolved for an LPMO [84].

Cellulosome

Cellulosomes are macromolecular complexes that are specialized in cellulose degradation. The flexible linkers that connect dockerins and cohesins in the cellulosome have gained much attention because of their contribution to the structural dynamics of the complex. Cellulosome dynamics were studied by generalized simulated annealing (GSA) on a fragment of the C. thermocellum CipA scaffoldin (Fig. 4b) in complex with the SdbA type II cohesin module. The study revealed that the CohI9 module (the ninth type I cohesin of CipA) adopts only two conformations (the native form in two thirds of occurrences and an alternate form in one third), even though the linker is highly flexible. Further MD simulation analysis showed that the small difference in average potential energy between the two conformations can be overcome by small changes in thermal energy, so the module can easily switch between them [46].
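To put the two-thirds to one-third split into energetic terms, the occupancies can be converted into an apparent free energy gap through the Boltzmann relation ΔG = −RT ln(p_minor/p_major). The sketch below is a back-of-the-envelope estimate, not a result from [46]: it assumes the two populations are at equilibrium and a temperature of 300 K, and it yields a free energy difference rather than the average potential energy difference discussed in the study.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol*K)


def free_energy_gap(p_major, p_minor, temperature=300.0):
    """Return (delta_G, RT) in kcal/mol for two states with given populations."""
    rt = R_KCAL * temperature
    return -rt * math.log(p_minor / p_major), rt


# Native form: two thirds of occurrences; alternate form: one third.
delta_g, rt = free_energy_gap(2.0 / 3.0, 1.0 / 3.0)
print(f"delta G = {delta_g:.2f} kcal/mol, RT = {rt:.2f} kcal/mol")
```

Under these assumptions the gap equals RT ln 2, which is smaller than RT itself and therefore consistent with the statement that thermal energy suffices to switch between the two conformations.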

The X-module-dockerin:cohesin complex (XMod-Doc:Coh) of Ruminococcus flavefaciens, the ligand–receptor complex responsible for substrate anchoring and inter-domain stabilization, was characterized by examining its mechanical unbinding with single-molecule force spectroscopy and steered molecular dynamics (SMD) simulations. The force spectroscopy experiments showed that the xylanase fusion domain on XMod-Doc and the CBM fusion domain on Coh have identifiable unfolding patterns, which allowed large datasets of force-distance curves to be screened. The XMod-Doc:Coh rupture forces fell in a range from 600 to 750 pN at loading rates ranging from 10 to 100 nN s−1, which were among the highest of their kind ever reported. The SMD results indicated that the force increased continuously with distance until the complex ruptured. Analysis of the interacting residues and the contact surface area suggested a mechanism that confers this remarkable stability while still allowing fast assembly and disassembly of the complex at equilibrium [45].

MD simulations were performed to probe both the type I and type II coh-Xdoc interactions in C. thermocellum. For type I, simulations and free energy calculations of the wild type and the D39N mutant indicated that the mutant is significantly more flexible because of changes in the hydrogen-bonding network of the conserved loop regions. The energy differences demonstrate that, although the dynamic changes are small, the conformational changes persist [47].

In another study of the type II complex, hot spots, i.e., the amino acid residues whose mutation drastically decreases binding affinity, were mutated to examine their effect on binding. The study concluded that the rigid cohesin–dockerin interface is maintained by bulky, hydrophobic residues and their contacts with the protein interface [48].

MD simulations were used to capture the physical characteristics of three cellulosomal enzymes (Cel5B, CelS, and CbhA) and the scaffoldin (CipA) of C. thermocellum. The results showed that shape and modularity dominate the cellulosomal enzyme complex. A comparison of these enzymes indicated that CbhA binds to CipA more frequently than the other two because of its flexible nature and multimodularity [85].

Coarse-grained and MD simulations were performed on many cellulosomal linkers of different lengths and compositions; they indicated that a linker's stiffness depends on its length and not on the specific amino acids. The study showed that short, stiff linkers cause significant rearrangements in the folded domains of the mini-cellulosome composed of endoglucanase Cel8A in complex with scaffoldin ScafT (Cel8A-ScafT) of C. thermocellum, as well as in a two-cohesin system derived from the scaffoldin ScaB of Acetivibrio cellulolyticus [86].

Man5B

In the MD analysis of Man5B (Figs. 1a and 2c), an enzyme from the thermophilic bacterium Caldanaerobius polysaccharolyticus, molecular docking followed by principal component analysis was performed on the catalytic site bound to cellohexaose and mannohexaose to understand the mechanism by which Man5B hydrolyzes cello-oligosaccharide and manno-oligosaccharide substrates. The finding that Man5B binds cellohexaose as tightly as mannohexaose was significant because experimental assays showed that Man5B hydrolyzes manno-oligosaccharides more efficiently than gluco-oligosaccharides [87].

Applying coarse-grained simulation to a protein–oligosaccharide complex, in which each glucose unit is approximated by one bead (similar to the common approximation of representing one amino acid as one bead), Poma et al. [66] constructed coarse-grained models for three different hexaoses and then tested them on a Man5B–hexaose complex. The predicted structural models correlated well with the all-atom models reported earlier for the same system, and the analysis suggested that the interaction of Man5B with hexaose is four times stronger than with the other oligosaccharides.

Another coarse-grained (CG) application used the Martini force field, which maps four heavy atoms to one CG interaction site and was parameterized to reproduce thermodynamic properties. In the ELNEDIN protein model, which is based on the Martini CG force field, unbreakable harmonic bonds hinder the description of unfolding and folding processes. To overcome this barrier, the harmonic bonds were replaced with Lennard–Jones interactions defined on the contact map of the native protein structure, as is done in Gō-like models [88]. This model revealed the structural motion linked to a particular catalytic activity in the Man5B protein, the details of which agreed with those of all-atom simulations. The approach relies on the contact map, which identifies the key pairs of residue contacts required to preserve the native structure of the protein without the need for adjustable parameters [88].
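The idea of a native contact map can be sketched with a simple distance criterion on Cα coordinates. This is only an illustration: the cutoff of 8 Å, the minimum sequence separation of three residues, the function name, and the toy hairpin coordinates are all assumptions, and the contact definition actually used in [88] may differ (for example, it may be based on atomic overlaps rather than Cα distances).

```python
import math


def native_contacts(ca_coords, cutoff=8.0, min_separation=3):
    """List residue pairs (i, j) whose C-alpha atoms lie within `cutoff`.

    Pairs closer than `min_separation` along the sequence are skipped,
    since they are already held together by the chain connectivity.
    """
    contacts = []
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                contacts.append((i, j))
    return contacts


# Toy seven-residue hairpin (coordinates in Angstrom, hypothetical).
hairpin = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
    (9.5, 3.3, 0.0),
    (7.6, 6.6, 0.0), (3.8, 6.6, 0.0), (0.0, 6.6, 0.0),
]
for pair in native_contacts(hairpin):
    print(pair)
```

In a Gō-like model, each pair returned here would receive an attractive Lennard–Jones term whose minimum sits at the native distance, so that contacts can break and re-form during a simulation.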

In another study, two coarse-grained models of three hexaoses were examined. One model was based on centers of mass and C4 atoms. The second was based on Cα atoms and was found to be more appropriate for analyzing protein interactions. The corresponding stiffness constants were calculated from all-atom simulations using two statistical methods (Boltzmann inversion and an energy-based method). The energy-based method agreed better with other theoretical and experimental determinations of non-bonded parameters. The contact energies were then calculated for the hexaose–Man5B complex, and the C4–Cα interactions were found to be stronger than the hydrogen bonds [66].
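For a harmonic degree of freedom, Boltzmann inversion reduces to a simple relation: if a bead–bead distance r sampled in an all-atom simulation follows P(r) ∝ exp(−k(r − r0)²/2k_BT), then k = k_BT/σ², where σ² is the variance of r. The sketch below applies this relation to synthetic samples. The function name, units (kcal/mol, Å), temperature, and input data are hypothetical, and the energy-based method mentioned above is not shown.

```python
import random
import statistics

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def stiffness_from_samples(distances, temperature=300.0):
    """Boltzmann-invert a harmonic coordinate: return (k, r0).

    Assumes the samples are Gaussian around r0, so k = k_B*T / variance.
    """
    r0 = statistics.fmean(distances)
    variance = statistics.pvariance(distances, mu=r0)
    return K_B * temperature / variance, r0


# Synthetic "trajectory" of a bead-bead distance drawn from a harmonic
# well with known stiffness, so the inversion can be checked.
k_true, r0_true, temperature = 50.0, 5.2, 300.0
sigma = (K_B * temperature / k_true) ** 0.5
rng = random.Random(0)
samples = [rng.gauss(r0_true, sigma) for _ in range(20000)]

k_est, r0_est = stiffness_from_samples(samples, temperature)
print(f"k = {k_est:.1f} kcal/mol/A^2 (true {k_true}), r0 = {r0_est:.3f} A")
```

The estimate carries a statistical error of roughly sqrt(2/N) in relative terms for N uncorrelated samples; frames from a real trajectory are correlated, so the effective N is smaller than the number of frames.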

GH9

The cleavage of sugar chains from cellulose at high temperatures by the thermoresistant Cel9A-68 (Figs. 1a and 2d) from Thermobifida fusca is catalyzed by the cooperative action of two important domains of the cellulase: a CBM and a catalytic domain, connected by a Pro/Ser/Thr-rich linker. On this basis, the temperature dependence of the dynamics of Cel9A-68 was analyzed in detail at three temperatures: 300 K, 325 K (the optimal temperature for activity), and 350 K. The conformational space and the collective motions were examined using quasi-harmonic analysis, principal component analysis (PCA), and subsequent essential dynamics (ED) analysis, and the CBM domain was observed to be more flexible than the catalytic domain, as also observed in experiments [63].
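The core of a PCA or essential dynamics analysis is the diagonalization of the covariance matrix of atomic positions. The sketch below shows this step with NumPy on a synthetic trajectory; it assumes that the frames have already been superposed on a reference structure, and the function name and toy data are hypothetical rather than taken from [63].

```python
import numpy as np


def essential_modes(traj):
    """PCA of a trajectory of shape (n_frames, n_atoms, 3).

    Returns eigenvalues (descending) and the matching eigenvectors
    (columns) of the 3N x 3N positional covariance matrix.
    """
    traj = np.asarray(traj, dtype=float)
    n_frames = traj.shape[0]
    flat = traj.reshape(n_frames, -1)
    centered = flat - flat.mean(axis=0)
    cov = centered.T @ centered / (n_frames - 1)
    evals, evecs = np.linalg.eigh(cov)          # ascending order
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order]


# Synthetic trajectory: one dominant collective motion plus small noise.
rng = np.random.default_rng(0)
n_frames, n_atoms = 500, 4
reference = rng.normal(scale=5.0, size=n_atoms * 3)
mode = rng.normal(size=n_atoms * 3)
mode /= np.linalg.norm(mode)
amplitudes = rng.normal(scale=1.0, size=n_frames)
noise = rng.normal(scale=0.05, size=(n_frames, n_atoms * 3))
toy_traj = (reference + amplitudes[:, None] * mode + noise).reshape(n_frames, n_atoms, 3)

eigenvalues, eigenvectors = essential_modes(toy_traj)
explained = eigenvalues[0] / eigenvalues.sum()
overlap = abs(eigenvectors[:, 0] @ mode)
print(f"first mode: {explained:.1%} of variance, overlap with input mode {overlap:.3f}")
```

Comparing the per-domain contribution to the leading eigenvectors is one way to express statements such as one domain being more flexible than another.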

Deleting the Ig-like domain in GH9 causes loss of enzymatic activity, although there is no evidence of any direct relation with the active site. MD simulations were used to investigate the role of the Ig-like domain in Cel9A. The results show that the residues of the domain are dynamically correlated with the residues of the carbohydrate-binding pocket, and a few of them with the important catalytic residues of Cel9A. Further, the catalytic domain was shown to be significantly stabilized by the Ig-like domain, possibly enhancing the thermostability of Cel9A [64].
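Dynamic correlation between residues is commonly quantified with a normalized cross-correlation map, C_ij = ⟨Δr_i·Δr_j⟩ / sqrt(⟨Δr_i²⟩⟨Δr_j²⟩), where Δr_i is the displacement of atom i from its mean position. Whether [64] used exactly this measure is an assumption; the sketch below only illustrates the calculation on a hypothetical four-atom trajectory whose frames are taken to be already superposed.

```python
import numpy as np


def cross_correlation_map(traj):
    """Normalized displacement cross-correlation for a trajectory.

    traj has shape (n_frames, n_atoms, 3); the result is an
    (n_atoms, n_atoms) matrix with values between -1 and 1.
    Atoms that never move would give a division by zero.
    """
    traj = np.asarray(traj, dtype=float)
    disp = traj - traj.mean(axis=0)
    cov = np.einsum("tix,tjx->ij", disp, disp) / traj.shape[0]
    norm = np.sqrt(np.diag(cov))
    return cov / np.outer(norm, norm)


# Toy data: atoms 0 and 1 move together, atom 2 moves in the opposite
# direction, atom 3 moves independently.
rng = np.random.default_rng(1)
n_frames = 200
shared = rng.normal(size=(n_frames, 3))
independent = rng.normal(size=(n_frames, 3))
base = np.array([[0.0, 0.0, 0.0], [4.0, 0.0, 0.0], [8.0, 0.0, 0.0], [12.0, 0.0, 0.0]])
toy_traj = np.stack(
    [base[0] + shared, base[1] + shared, base[2] - shared, base[3] + independent],
    axis=1,
)

dccm = cross_correlation_map(toy_traj)
print(np.round(dccm, 2))
```

Values near +1 indicate concerted motion, values near −1 anticorrelated motion, and values near 0 no linear correlation; averaging blocks of this matrix over the residues of two domains gives a domain-level picture.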

Cel48F

The hydrolysis mechanism of Cel48F, the processive endocellulase of C. cellulolyticum, in complex with sugar chains was studied by MD simulations (Fig. 2e). The structure of Cel48F was examined with metadynamics, which computes free energies efficiently and allows statistically rare events to be studied. Metadynamics simulations usually follow standard MD simulations that stabilize the protein–polysaccharide structures. Here, metadynamics proved useful for investigating how sugar chain I enters and chain II leaves the Cel48F tunnel in a 5-ns-long simulation [50].

To study the water control mechanism in the enzymatic hydrolysis of cellulose, MD simulations were carried out for two conformations of Cel48F, hydrolyzing (H) and sliding (S). The two conformations were compared after repeating the MD simulations three times from the same starting conformations. Hydrolysis appeared to begin when a water molecule is present for every glycosidic linkage, suggesting a water control mechanism for hydrolysis. During the shift from S to H, the water molecule that is initially bound to D230 in S (known as site a) moves to W417 and M414 (known as site b) in H and performs a nucleophilic attack on the anomeric carbon, causing the hydrolysis product to be excluded from the cleft and the water control system to return to site a. The simulations also revealed the roles of certain other key residues through their ability to form hydrogen bonds. These key residues included the most probable candidates for inverting the anomeric carbon, those that could help convert one conformation into the other, and those that could provide a hydrophobic environment preventing water molecules from entering the active site. It was proposed that, in addition to Cel48F, the method can be applied to study the reaction mechanisms of other processive enzymes [49].

MD simulations were carried out for imidazole in an aqueous solution of glucose in order to investigate the interactions that take place between the two co-solutes and also between the neutral imidazole molecules. This study showed the role of histidine side chains in the binding and hydrolysis in cellulases, including Cel48F [89].

The simulations showed the possible catalytic role of an unusual conserved water-filled pore structure in another member of the family, Cel48A from T. fusca, suggesting that the pore provides a pathway to the active site for the water molecules used in processive hydrolysis of the cellulose substrate [90].

The catalytic activity of C. cellulolyticum Cel48F was studied by MD in combination with steered MD and binding free energy calculations, which gave insight into the regions of the enzyme that are involved in hydrolysis and in the release of the leaving group. The residues most likely responsible for hydrolysis, which significantly affect the catalytic activity, were predicted [51].

A study for analyzing the mechanisms of cellulosomes used MD and normal mode analysis to refine the protein complex and investigated the dynamical differences between the domains. After determining the structure experimentally using SAXS, normal mode analysis confirmed that both the free dockerin and the dockerin–cohesin complexes undergo a rigid body motion with respect to the catalytic module [91].

GH18

A study reported the crystal structure and dynamics of the catalytic domain of the GH family 18 non-processive endochitinase ChiC from Serratia marcescens (Fig. 2f) and compared it with the processive enzymes ChiA and ChiB from S. marcescens [62]. The study demonstrated that the dynamics of the processive enzymes are similar to those of a non-processive chitinase from Lactococcus lactis (pdb id: 3IAN; a structural homolog of the catalytic module of ChiC). All four proteins were docked with the chitin substrate to study the processivity of the chitinases. Each simulation was run for 250 ns, for a total simulation time of 1 μs. The overall structure of ChiC2 (the catalytic domain of ChiC) was studied in terms of energy difference, hydrogen bonding, and root-mean-square deviation.

The catalytic residue of ChiC2, E141, was observed in a different orientation in the crystal structure and was not bonded to D139, an interaction that would be crucial for optimal catalytic activity. To understand the interactions of E141 and whether its side chain conformation changes in the active site, MD simulations were performed; they showed that E141 is indeed flexible and that there are three distinct conformational states for E141 and D139. Free energy-based calculations likewise revealed differences in structural dynamics between the processive and non-processive chitinases and confirmed the conformational flexibility of E141, the loop regions, and the active site. These simulations showed that ChiC2 is highly flexible and that the dynamic on and off ligand binding processes associated with non-processive endochitinases correlate well with the experimentally derived processivity data for the S. marcescens chitinases [62].

GH6

Simulations have revealed new structural details of GH6 CBHs (Fig. 2g). Two mutant structures of Thermobifida fusca Cel6B were characterized, which allowed their hydrolysis mechanisms to be analyzed. Three complexes were constructed from the wild-type Cel6B structure: two bound to cellobiose and one bound to cellohexaose. In addition, a complex of Cel6B with crystalline cellulose Iβ was constructed. These complexes were built to study the tunnel-shaped active site. Specifically, product expulsion was attributed to the dynamic action of two loops, the exit loop (residues 185–197) and the bottom loop (residues 501–510). The simulations suggest that, driven by their flexibility, these loops open up to create space to expel the product from the active site. The two loop regions show the largest fluctuations, which are not correlated with the fluctuations of other parts of the protein. Multiple SMD simulations showed the exit loop opening up to 14 Å and the bottom loop up to 16 Å, creating a gap large enough for product transport. The simulations show that the loops are flexible enough to open and release the product with equal probability in solution or when bound to cellulose [65].

Discussion

Enzyme design is one of the most complex areas of chemical engineering. Multiple steps are necessary to modify an enzyme so that it becomes an industrially useful product. In addition, the temperatures and pressures in different processing stages influence each enzyme’s stability, activity, and conversion rate. Each step is complicated enough to require extensive experimental testing, from small-scale experiments on the bench to testing in pilot plants. Success in the lab does not always translate into success in chemical plants, and many initially exciting products fail to reach the production line. In multistep enzyme design and production, candidate enzymes are tested extensively by experiment to demonstrate their superior performance under specific, industrially meaningful conditions.

Given the importance of experiments in providing a realistic view of enzymatic performance, computational methods will not, in the near future, replace the experimental testing that provides dependable measures of enzymatic behavior. Computational methods can, however, be useful in the way that imperfect theoretical models are: they provide practical approximations that guide experimental approaches. Although experiments are the ultimate touchstone for judging the performance of a novel enzyme product, simulations and theoretical methods are important parts of the engineer’s toolset for shaping design approaches and optimizing enzymatic behavior in industrial settings.

As we have discussed in this review, several computational methods are available to understand the molecular basis of enzymatic activity in cellulases, and they have different strengths and weaknesses. Given the computational cost of fine-grained atomistic simulations, coarse-grained approaches are often preferred to speed up the simulations, but the increased speed comes with lower information content, sometimes at the expense of atomistic-level modeling of chemical reactions. Ultimately, before starting a research project, engineers and scientists need to determine the critical parameters to optimize, so that they can choose the most appropriate combination of tools from the available experimental, theoretical, and computational methods to achieve their design goals.

When multiple approaches are used simultaneously, judging the effectiveness of each method individually is not an easy task. Experimental methods provide the most clear-cut product enhancements, especially for the first few generations of enzymes. For example, randomly mutating enzymes and assessing their properties is a well-tested means of product development. For the development of later generations of enzymes, however, more sophisticated approaches are necessary, and a wider range of tools and a deeper understanding of enzymatic mechanisms are desired. Therefore, although computational methods often lack clear-cut success stories, their critical role as part of a larger engineering toolset for creating newer generations of enzymes is generally recognized.

Conclusions

In this review, we surveyed how various computational approaches have been used to understand how the structural dynamics and chemical specificity of cellulases contribute to enzymatic function. Complementary to experimental methods, computational methods such as MD simulations, QM/MM, and others have been used successfully for many cellulases to understand the molecular underpinnings of their functions. By reviewing the recent scientific literature on cellulases, we have presented the latest computational research on the structure and dynamics of these industrially important enzymes.

Numerous studies have demonstrated that computational methods are fast and reliable and provide the level of detail required to understand the enzymatic function of cellulases (Fig. 5). The prerequisite for any such computational study is the availability of high-resolution structures, solved via NMR or X-ray crystallography. We do not perceive this as a limitation, because new protein structures, even low-resolution ones, are being deposited in structural databases at a pace that will increasingly enable better computer simulations and analyses. As more computational studies are performed in the future, a better understanding of the mechanism of enzymatic action of cellulases will develop, enabling scientists and engineers to make more informed design decisions for more efficient use of cellulases in biofuel applications.

Fig. 5

Computational methods to understand dynamics in cellulases. Brief summary of the various fine-grained and coarse-grained methods employed to understand the structural dynamics of various families of cellulases as reviewed in this paper