Introduction

Modern sequencing technologies enable us to know the sequence of any protein-encoding gene in any extant species. Ancestral sequence reconstruction offers the opportunity to infer what that protein sequence was at various ancestral points in a phylogenetic tree, providing a window into molecular functions encoded in ancient genomes and how they might differ from those observed in present day genomes. A recent overview of such insights has been provided by Gumulya and Gillam (2017). The history of ancestral protein sequence reconstruction begins with a seminal paper in 1963 from Journal of Molecular Evolution founding editor Emile Zuckerkandl (Pauling et al. 1963) and continues with co-publication of the first maximum likelihood algorithm for ancestral protein sequence reconstruction (Yang et al. 1995; Koshi and Goldstein 1996). The notion that extant sequences and phylogenies could be used to not only infer the topological history of evolution, but also to make inference about the functional history of proteins continues to be an important concept.

Zuckerkandl and Pauling were among the first to build on the idea that recent genes evolved from previous (homologous) ancestral genes. They noted that using aligned sequences at the tips of a phylogenetic tree, it is possible to determine the amino acid sequence of the ancestral gene and determine where on the tree specific mutations occurred. Not only did their work provide evidence that homologous genes derive from a common gene ancestor, but they also conceptualized a framework that led to the first methods for ancestral sequence reconstruction. Although Zuckerkandl and Pauling noted that the number of mutations between an ancestral gene and a daughter gene is correlated with time, the first widely used method of ancestral sequence reconstruction, parsimony, was notably time-independent (Fitch 1971). Maximum likelihood was introduced as an algorithm a decade later in Journal of Molecular Evolution (Felsenstein 1981) but it would take another 15 years before widely used models of protein evolution in a maximum likelihood framework were developed for ancestral protein sequence reconstruction. This was innovative work published in Journal of Molecular Evolution (Koshi and Goldstein 1996) and contemporaneously by others in Genetics (Yang et al. 1995).

Early Methodological Development

Following prior methods that used the principle of parsimony (Fitch 1971), maximum likelihood was used to infer protein ancestral states in a phylogeny. Maximum likelihood allowed for a more accurate characterization of ancient sequences with an appropriate model of sequence evolution. Just like parsimony, maximum likelihood requires a phylogenetic tree and the sequences at the tips of the tree. Unlike parsimony methods, maximum likelihood now requires a substitution matrix (which can be a mixture of models) and other evolutionary model components for proteins as well as branch lengths under the model to find the most likely ancestral sequence for nodes throughout the tree.

Using marginal likelihoods that integrate over the probabilities of specific amino acids at other nodes in a tree, a probability vector can be generated for all aligned positions at each node in the tree. This is calculated from the equation below, which uses three knowns: a given mutation matrix \(M\), a given (unrooted) evolutionary tree \(T\), and the given amino acids at the tips of the tree \(\{A_{i}\}^{\prime}\). From these three known values, the equation below gives the probability of each candidate \(A_{r}\), the ancestral state at an ancestral node (Koshi and Goldstein 1996).

$$P\left( {\left. {A_{r} } \right|\left\{ {A_{i} } \right\}^{\prime } ,M,T} \right) = \frac{{P\left( {\left. {\left\{ {A_{i} } \right\}^{\prime } } \right|A_{r} ,M,T} \right)\left( {P\left( {A_{r} } \right)} \right)}}{{P\left( {\left. {\left\{ {A_{i} } \right\}^{\prime } } \right|M,T} \right)}}$$

Using this equation, we can calculate the maximum likelihood state at one particular node. To find the maximum likelihood ancestor for an arbitrary node in the tree, we select that node as the root, with application of the pulley principle in likelihood-based phylogenetics for time reversible (equilibrium) models. We then sum over all the possibilities (substitutions) from that particular ancestral node and its descendants and can do so for any internal node (Yang et al. 1995) (Fig. 1).

Fig. 1

The procedure for conducting model-based Ancestral Sequence Reconstruction is depicted. (a) An ancestral node to declare as the “root” of interest is selected. (b) The tree is re-rooted with this node. Based on the pulley principle for a time reversible model, any node can arbitrarily be declared the “root” without changing the likelihood. (c) Inference of the maximum likelihood ancestor is made by summing over all possible amino acid substitutions. This is done by comparing evolutionary trajectories from the ancestral sequence with extant sequences at the tips.
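The marginal reconstruction described above can be sketched in a few lines of Python. This is an illustration only, not the implementation of Koshi and Goldstein (1996) or Yang et al. (1995): it assumes a Poisson (equal-rates) substitution model in place of an empirical matrix \(M\), a uniform prior \(P(A_{r}) = 1/20\), a tree that has already been re-rooted at the node of interest, and hypothetical tip names, branch lengths and residues.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
K = len(AMINO_ACIDS)


def transition_prob(a, b, t):
    """P(b | a, t) under a Poisson (equal-rates) model; t is in expected
    substitutions per site. This stands in for an empirical matrix M."""
    decay = math.exp(-K * t / (K - 1))
    return 1 / K + (K - 1) / K * decay if a == b else (1 - decay) / K


def conditional_likelihoods(node, tip_states):
    """Pruning recursion: returns {x: P(tips below node | node state x)}.

    A tip is a string (its name); an internal node is a list of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return {x: 1.0 if x == tip_states[node] else 0.0 for x in AMINO_ACIDS}
    result = {x: 1.0 for x in AMINO_ACIDS}
    for child, branch_length in node:
        below = conditional_likelihoods(child, tip_states)
        for x in AMINO_ACIDS:
            # Sum over every state y the child could have taken.
            result[x] *= sum(
                transition_prob(x, y, branch_length) * below[y]
                for y in AMINO_ACIDS
            )
    return result


def marginal_posterior(root, tip_states):
    """Bayes' rule at the root: P(A_r | tips, M, T) for one alignment column."""
    likelihood = conditional_likelihoods(root, tip_states)
    prior = 1 / K  # stationary distribution of the Poisson model
    evidence = sum(prior * likelihood[x] for x in AMINO_ACIDS)
    return {x: prior * likelihood[x] / evidence for x in AMINO_ACIDS}


# Hypothetical tree, re-rooted at the ancestor of interest.
tree = [("human", 0.1), ("chimp", 0.1), ([("mouse", 0.3), ("rat", 0.3)], 0.2)]
site = {"human": "L", "chimp": "L", "mouse": "I", "rat": "L"}

posterior = marginal_posterior(tree, site)
best = max(posterior, key=posterior.get)
```

Here `posterior` is the probability vector for one aligned position at the chosen node, and `best` is its most probable residue. Repeating the calculation for every alignment column, and re-rooting for every internal node, yields the full set of marginal reconstructions.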

A subsequent methodological development involved joint reconstruction across nodes instead of the original marginal reconstruction algorithm (Pupko et al. 2000). Here, instead of maximizing the likelihood of states at an individual node while integrating over all others, all nodes are considered together in maximizing the likelihood across the tree. While joint reconstruction has not been widely used, conceptually it provides a maximum likelihood method for providing a complete evolutionary history of each site. In practice, marginal and joint reconstructions at any site give very similar sequences, although differences do occur (Pupko et al. 2007).
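The distinction between joint and marginal reconstruction can be made concrete with a brute-force sketch. It is not the dynamic programming algorithm of Pupko et al. (2000); it simply enumerates every pair of states for a hypothetical tree with two internal nodes (`root` and `inner`), again assuming a Poisson (equal-rates) model and made-up branch lengths and tip residues.

```python
import math
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
K = len(AMINO_ACIDS)


def p(a, b, t):
    """P(b | a, t) under a Poisson (equal-rates) model (an assumption)."""
    decay = math.exp(-K * t / (K - 1))
    return 1 / K + (K - 1) / K * decay if a == b else (1 - decay) / K


def joint_probability(x, y, tips, t):
    """P(root = x, inner = y, tip data) for the tree ((c, d)inner, a, b)root."""
    return (
        (1 / K)
        * p(x, tips["a"], t["a"])
        * p(x, tips["b"], t["b"])
        * p(x, y, t["inner"])
        * p(y, tips["c"], t["c"])
        * p(y, tips["d"], t["d"])
    )


def reconstruct(tips, t):
    """Return (joint, marginal) reconstructions as (root, inner) pairs."""
    table = {
        (x, y): joint_probability(x, y, tips, t)
        for x, y in product(AMINO_ACIDS, repeat=2)
    }
    # Joint: the single best assignment to all internal nodes together.
    joint = max(table, key=table.get)
    # Marginal: best state at each node, summing over the other node.
    root_marginal = {x: sum(table[x, y] for y in AMINO_ACIDS) for x in AMINO_ACIDS}
    inner_marginal = {y: sum(table[x, y] for x in AMINO_ACIDS) for y in AMINO_ACIDS}
    marginal = (
        max(root_marginal, key=root_marginal.get),
        max(inner_marginal, key=inner_marginal.get),
    )
    return joint, marginal


tips = {"a": "L", "b": "L", "c": "I", "d": "I"}
branch_lengths = {"a": 0.1, "b": 0.1, "c": 0.1, "d": 0.1, "inner": 0.1}
joint, marginal = reconstruct(tips, branch_lengths)
```

For this well-resolved column the two answers coincide; they can diverge when the posterior mass at a node is spread across several residues. Enumeration grows exponentially with the number of internal nodes, which is why practical joint reconstruction relies on dynamic programming.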

One early important computational application of ancestral sequence reconstruction was in finding specific episodes of positive or negative selection on various ancestral branches of a given tree. After ancestral sequences at internal nodes were generated, nonsynonymous (dN) and synonymous (dS) nucleotide differences were calculated from the inferred substitutions between ancestral nodes to obtain the ratio of dN/dS (then known as KA/KS). Nodes with dN/dS values less than 1 show evidence for negative selection, while dN/dS values greater than 1 show evidence for positive selection. This application was first performed on lysozymes in primates (Messier and Stewart 1997) and in leptin and the leptin receptor’s extracellular domain across mammals (Benner et al. 1998), and these analyses were able to find specific adaptive and purifying episodes localized to specific nodes on the phylogenetic tree. One such episode demonstrated that leptin had evolved significantly during early primates, following the most recent common ancestor of rodents and primates, and suggested that leptin may not be functionally associated with obesity in humans as it is in mice (Benner et al. 1998). It was also demonstrated that short adaptive episodes can be masked by long-term negative selection, as in lysozyme evolution studied in primates (Messier and Stewart 1997). These methods spurred the subsequent development of maximum likelihood methods that integrated across ancestral sequence probabilities in estimating branch-specific dN/dS values (Yang 1998), methods that are still widely used today based upon the Goldman-Yang model (Goldman and Yang 1994).
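As a rough illustration of the branch-wise dN/dS calculation, the sketch below takes counts of nonsynonymous and synonymous differences between two reconstructed nodes, together with the numbers of nonsynonymous and synonymous sites, and applies a Jukes-Cantor correction for multiple hits. The counting scheme and correction are assumptions chosen for simplicity; the exact procedures of the cited studies are not reproduced here, and all counts are hypothetical.

```python
import math


def jukes_cantor(proportion):
    """Correct a proportion of observed differences for multiple hits."""
    if not 0 <= proportion < 0.75:
        raise ValueError("proportion must be in [0, 0.75) for this correction")
    return -0.75 * math.log(1 - 4 * proportion / 3)


def dn_ds(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Return (dN, dS, dN/dS) for one branch between two reconstructed nodes."""
    dn = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ds = jukes_cantor(syn_diffs / syn_sites)
    if ds == 0:
        raise ValueError("dS is zero; the ratio is undefined on this branch")
    return dn, ds, dn / ds


# Hypothetical counts for two branches of a tree.
_, _, adaptive_branch = dn_ds(nonsyn_diffs=12, nonsyn_sites=300,
                              syn_diffs=3, syn_sites=100)
_, _, constrained_branch = dn_ds(nonsyn_diffs=2, nonsyn_sites=300,
                                 syn_diffs=10, syn_sites=100)
```

With these made-up counts, `adaptive_branch` exceeds 1 and `constrained_branch` falls below 1. On short branches the counts are small, so a ratio on either side of 1 is suggestive rather than conclusive without a statistical test.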

Early Experimental Applications of Computational Ancestral Sequence Reconstruction

In finding bouts of positive selection, ancestral sequence reconstruction generated experimentally testable hypotheses for studying molecular evolutionary history with potential protein functional change. While other recent reviews have examined this direction more systematically (Gumulya and Gillam 2017; Liberles et al. 2020), an overview of key developments from a historical perspective is presented. The first experimental study using ASR involved the replacement of three amino acid positions in a modern lysozyme protein with inferred ancestral residues at these positions (Malcolm et al. 1990), and proceeded to dissect possible intermediate pathways for how these amino acid positions evolved under selective constraints during an episode of functional divergence. It took an additional 5 years before the first full-length ancestral sequence was inferred and generated in the laboratory (DNA synthesis technology had improved enough to allow synthesis of full-length genes). An ASR study generating 13 resurrected ribonucleases revealed episodes of functional divergence during artiodactyl evolution (Jermann et al. 1995).

The above two inaugural experimental ASR studies used parsimony to infer ancestral character states. The advancement of robust statistical approaches during the 1990s (i.e., maximum likelihood) paved the way for more sophisticated experimental studies capable of probing deeper (in time) and more divergent evolutionary questions. The first study to accomplish such a feat involved the resurrection of ancestral rhodopsin proteins from a group of archosaurs that included birds and dinosaurs, and suggested these ancestors were able to best see in dim lighting (Chang et al. 2002). The next study to achieve a similar goal involved the resurrection of proteins used to infer the environmental temperature of the last common ancestor of bacteria, inferred to have lived billions of years ago on early Earth (Gaucher et al. 2003). The third study to achieve this goal involved the resurrection of steroid receptor proteins and demonstrated that the earliest steroid receptors likely bound estrogen (Thornton et al. 2003).

This trifecta of studies opened the door for a diversity of experimental ASR studies that have spanned numerous periods of evolutionary history and have probed a plethora of biomolecular functionalities. This has all been achieved from a seed planted by Pauling and Zuckerkandl in 1963, and we anticipate that a similar level of growth will occur for experimental ASR over the next ~60 years.

Methodological Improvements

One of the criticisms of ancestral sequence reconstruction approaches is concerned with potential bias of the likelihood statistical framework. Maximum parsimony and maximum likelihood approaches pick the most parsimonious/likely ancestor for experimental reconstruction and this has the potential to show functional bias by excluding rare variants that are likely to be present in any individual and/or likely to be slightly deleterious. One way of adding expected rare variants to the ancestors is through Bayesian Sampling, sampling of multiple sequences from the posterior distribution (Williams et al. 2006). This framework was shown computationally to have functional effects on traits like thermostability, although such biases have been experimentally shown to have minimal effect on actual protein thermostability (Gaucher et al. 2008) or actual protein fluorescence (Alieva et al. 2008). Another experimental extension of Bayesian Sampling involved serum paraoxonases (PONs) using a library of alternative PONs. This was created to consider alternative ambiguously predicted ancestors, evaluating the effects of inclusion of this uncertainty. Through the library approach, these authors were able to find certain predictions made by maximum likelihood were very unlikely to reflect the actual ancestral sequences (Bar-Rogovsky et al. 2015).
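The contrast between picking the single most likely ancestor and sampling ancestors from the posterior can be sketched as follows. The sketch assumes that per-site posterior probability vectors are already available (for example from a marginal reconstruction) and that sites are sampled independently; the three-site posterior shown is hypothetical and not taken from any of the cited studies.

```python
import random


def ml_sequence(site_posteriors):
    """Most probable residue at every site (the usual point estimate)."""
    return "".join(max(post, key=post.get) for post in site_posteriors)


def sample_sequences(site_posteriors, n, seed=0):
    """Draw n ancestral sequences, each site sampled from its posterior."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    samples = []
    for _ in range(n):
        residues = [
            rng.choices(list(post), weights=list(post.values()))[0]
            for post in site_posteriors
        ]
        samples.append("".join(residues))
    return samples


# Hypothetical posteriors for a three-residue ancestor.
site_posteriors = [
    {"M": 1.0},
    {"K": 0.7, "R": 0.3},
    {"V": 0.55, "I": 0.4, "L": 0.05},
]

point_estimate = ml_sequence(site_posteriors)
library = sample_sequences(site_posteriors, n=50)
```

The point estimate always carries the top-ranked residue at each site, whereas the sampled library includes lower-probability residues in proportion to their posterior support, which is how rarer variants enter the set of candidates for experimental resurrection.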

A different approach was developed to address the inconsistency of using different models for alignment and phylogenetics. BAli-Phy (Suchard and Redelings 2006) uses the same model to simultaneously infer the tree and the alignment through a Markov chain Monte Carlo method. This method is dependent upon the underlying substitution model and includes a model for insertion and deletion events. This consistency is meant to solve problems caused by the ad hoc nature of models and parameters for multiple sequence alignment (Anisimova et al. 2010).

Another limitation of ancestral sequence reconstruction approaches is the simplicity of the models used for protein evolution. Developments in the protein model also began stepping away from the single 20 × 20 amino acid matrix. This was initially done by using mixtures of substitution models (Koshi and Goldstein 1995) and models that did not assume the same mutational process for all sites in a mutation-selection framework (Halpern and Bruno 1998; Lartillot and Philippe 2004). The CAT models (Lartillot and Philippe 2004) have been extended to include temporal shifts in amino acid fitnesses (CAT-BP) and have spawned work on the related mutation-selection models, including work towards relaxing assumptions of an equilibrium process (Blanquart and Lartillot 2008; Teufel et al. 2018). Variants of the mutation-selection framework remain at the cutting edge of amino acid substitution models, but have not yet been widely used for ancestral sequence reconstruction. The strategy of using mixtures of substitution matrices, including mixtures that explicitly consider an attribute of protein structure (the solvent accessibility of each position), has recently been revisited for ancestral sequence reconstruction with promising results (Moshe and Pupko 2019). Specifically, an improvement in the log-likelihood describing fit to empirical datasets was found together with the observation that the mixture of matrices resulted in major differences in inferred ancestral sequences in those datasets.

Another class of models explicitly considers the protein’s tertiary structure (Robinson et al. 2003; Rodrigue et al. 2006; Kleinman et al. 2010; Grahnen et al. 2011; Arenas et al. 2017; Chi et al. 2018). Such models do not yet describe structural constraints on protein sequences well. Many biological processes affecting protein biochemistry and evolution shape the selective constraints that dictate which amino acids are substituted and which are not, but these are ignored in current ancestral sequence reconstruction methods. There is ripe future ground for further model development in this direction, which has recently been reviewed (Chi and Liberles 2016). This represents a different new direction in modeling.

Overall, the state of the art of protein models has progressed from PAM-style models of increasing sophistication (Dayhoff et al. 1978; Jones et al. 1992; Whelan and Goldman 2001; Le and Gascuel 2008) to CAT models (Lartillot and Philippe 2004) to CAT models with breakpoints (Zhou et al. 2010) to mutation-selection models (Teufel et al. 2018). Breakpoints and covarion-type models enable rates to shift at a site over a tree (Wang et al. 2007). Another direction where important developments are improving our ability to model sequences is with models that combine inter-specific with intra-specific processes (Wilson et al. 2011; Hey et al. 2018). This can provide a formal mechanism to characterize segregating sequence variants that more informally were modeled with Bayesian sampling from an inter-specific model.

This discussion of modeling has mostly focused on sequence substitution. The standard likelihood-based methods did not account for indel positions, and ancestral sequences grew in length as one progressed back the tree (Pupko et al. 2007). GASP (Edwards and Shields 2004) coupled model-based sequence reconstruction with parsimonious reconstruction of indel positions, treating each position independently. POY is a parsimony-based simultaneous alignment and tree method that treats both substitutions and indels using parsimony (Wheeler et al. 2015). As previously mentioned, BAli-Phy has a simple indel model in a likelihood framework to extend these types of models (Suchard and Redelings 2006). Two additional statistical treatments of gaps include a sequence and length-based model that generated a Zipfian distribution and a fuller set of propensities of indel occurrence (Chang and Benner 2004) and an evolutionary HMM for alignment that can be employed for ancestral sequence reconstruction (Rivas and Eddy 2015). As with sequence substitution models, the future is ripe for development of integrated models for insertion and deletion, coupled to substitution, that will improve our ability to reconstruct ancestral sequences. Without integrated models for insertions and deletions together with substitutions, there exists bias in current methods that has been shown to lead to ancestral proteins that are too long (Vialle et al. 2018). To take a different step towards reducing this bias, one method for dealing with alignment error integrates over alignments (Aadland and Kolaczkowski 2020), and this reduces the number of gapped positions. Looking ahead, many of the most sophisticated models do not have software implementations, and filling this gap will also be important.

Extending Ancestral Sequence Reconstruction into Systems Biology

Most ASR studies examine individual proteins. However, differentiating between inter-molecular compensatory processes and directional selection acting on multiple proteins in a pathway requires an extension of these techniques to multiple members of a pathway (Orlenko et al. 2016b; Olson-Manning 2020). To reconstruct a pathway, one could use ancestral sequence reconstruction for every protein in an entire pathway, insert these into an organism, and measure flux. This however would be time consuming, even with newer models and methods. Another approach would be to reconstruct molecular phenotypes in different ways. The simplest phenotypic approach would be to reconstruct pathway flux as a continuous single value. However, assuming the pathway structure is conserved over the tree, individual parameters can be reconstructed independently subject to thermodynamic constraints and fit into differential equations to model pathway fluxes at particular ancestral nodes. While the ancestral reconstruction approaches are futuristic, glycolysis is one pathway where comparative analysis of kinetic models across species has been performed (Orlenko et al. 2016a). This work established trends consistent with mutation-selection-drift balance where mutation occurs at the gene/protein level and selection occurs at the pathway level (Orlenko et al. 2016b) giving rise to gene-specific context-dependent evolution, which is increasingly well understood as a general observation in comparative data (Eguchi et al. 2019).

Extending Ancestral Sequence Reconstruction Further into Molecular Ecology

To the extent that ASR is linked to uncovering environmental adaptation associated with positive directional selection for changes in protein function, a key feature of this is the relationship of ancient organisms to their environment and ecosystem. Through studies including archosaur vision (Chang et al. 2002, discussed above) and bacterial protein thermostability (Risso et al. 2013), insights into ancient environments and how organisms interacted with them have been gained. In the latter case, direct effects from individual proteins (Risso et al. 2013) interplay with proteome-wide effects driven by temperature on substitution (Goldstein 2007; Zeldovich et al. 2007; Gromiha et al. 2013). The next level beyond is extending protein interaction analysis to inter-specific interactions, whether host–pathogen interactions, Dobzhansky–Muller incompatibilities during speciation, or interactions across a food web. In these cases, ancestral sequence reconstruction can be used not only to make protein functional inference, but also to make ecological inference about what species are interacting with what other species in a community. Reconstructing parts of the proteome widely across the tree of life has the potential to identify changes in community structure and species interactions over time. This is a direction of which we have seen only the tip of the iceberg.

Ancestral Sequence Reconstruction and Infectious Disease

Ancestral sequence reconstruction can be used to understand viral evolution and towards therapeutic applications (Arenas 2020). An understanding of the evolutionary histories of these viruses can lead to applications in detecting targeted regions for future therapeutics, and to assist in predicting new viral resistance against current drugs.

Ancestral sequence reconstruction is also of emerging interest for vaccine technologies, especially for the development of vaccines to combat rapidly evolving viruses such as HIV and influenza strains (Gaschen et al. 2002; Ducatez et al. 2011). Using ancestrally derived sequences to create vaccine reagents takes advantage of the evolutionary history of the virus. This strategy contrasts with other methods which construct a consensus sequence from different viral strains, ignoring phylogenetic structure. A vaccine reagent can be based on the last common ancestral sequence of all the strains that are circulating, or from other points in the tree. For example, when the phylogenetic topology is skewed, the “center of tree” method may be implemented. The center of tree method considers the ancestral sequence that minimizes the evolutionary distance between different viral strains of interest (Nickle et al. 2003).
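The idea behind the center of tree can be sketched with a simplified criterion: among the internal nodes of a tree, choose the one with the smallest mean path length to the tips. This is an assumption made for illustration, not the procedure of Nickle et al. (2003); in particular, it only considers existing nodes rather than points along branches, and the node names and branch lengths are hypothetical.

```python
def build_adjacency(edges):
    """Turn (node, node, branch_length) triples into an undirected graph."""
    adjacency = {}
    for u, v, length in edges:
        adjacency.setdefault(u, {})[v] = length
        adjacency.setdefault(v, {})[u] = length
    return adjacency


def path_lengths(adjacency, start):
    """Path length from start to every other node (paths in a tree are unique)."""
    distance = {start: 0.0}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor, length in adjacency[node].items():
            if neighbor not in distance:
                distance[neighbor] = distance[node] + length
                stack.append(neighbor)
    return distance


def center_of_tree(adjacency):
    """Internal node with the smallest mean path length to the tips."""
    tips = [n for n, nbrs in adjacency.items() if len(nbrs) == 1]
    internal = [n for n, nbrs in adjacency.items() if len(nbrs) > 1]

    def mean_tip_distance(node):
        distance = path_lengths(adjacency, node)
        return sum(distance[t] for t in tips) / len(tips)

    return min(internal, key=mean_tip_distance)


# A skewed hypothetical tree: most sampled strains sit far from the root.
edges = [
    ("root", "A", 0.05),
    ("root", "N1", 0.30),
    ("N1", "B", 0.05),
    ("N1", "C", 0.05),
    ("N1", "D", 0.05),
]
center = center_of_tree(build_adjacency(edges))
```

In this skewed example the selected node is `N1` rather than `root`, because the last common ancestor lies far from most circulating strains. The ancestral sequence inferred at the chosen node would then serve as the candidate reagent.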

In the age of SARS-CoV-2, ancestral sequence reconstruction has become of immediate interest to assist in vaccine development (Zhou et al. 2020). Like the rapidly evolving RNA virus influenza and the retrovirus HIV, SARS-CoV-2 is an RNA virus. However, a recent study used ancestral sequence reconstruction to demonstrate that, unlike in other RNA viruses, mutations in SARS-CoV-2 are rare, as the evolution rate is slower than the transmission rate. Because of the slow evolution of SARS-CoV-2, only one vaccine candidate may be necessary to match all currently circulating SARS-CoV-2 variants (Dearlove et al. 2020).

Aside from disease-causing viruses, viruses are also developed to serve as vehicles for gene therapy (Ivics et al. 1997). The Adeno-associated Virus (AAV) has been considered an efficient gene therapy for both inherited and infectious diseases. However, the complex structure and diversity associated with different target receptor binding for AAV make the virus difficult to properly structurally assemble when designed. Using ancestral sequence reconstruction, Zinn et al. (2015) were able to provide a virus with a structure that would remain evolutionarily resilient to future mutations and maintain broad clinical applicability.

Biomedical and Biotechnological Directions for Ancient Proteins

In addition to all the insights ASR reveals about natural evolutionary processes, it turns out that ancient proteins also have applied functions in biotechnology and biomedicine (Randall et al. 2016). Ancestral variants have been used to develop clinical treatments for type 2 diabetes (Skovgaard et al. 2006), gout (Kratzer et al. 2014), hemophilia (Zakas et al. 2017), tyrosinemia (Hendrikse et al. 2020) and others. It is anticipated that this trend in biomedicine will continue as ASR generates proteins having expanded biomolecular functionalities with lower immunogenic responses in human patients compared to their modern protein counterparts. Further, ancestral variants are being used in the biotechnology sector due to their unique and desirable properties. Companies such as nanoGUNE (Manteca et al. 2017), Syngenta, New England Biolabs (Zhou et al. 2012), DuPont (Ladics et al. 2020) and others have developed or integrated ancient proteins into their biotechnology product development pipelines, while some ancient proteins have even been tested for their value in the cosmetic industry (Perez-Jimenez et al. 2011).

The irony of ancient proteins having an applied utility to the development of therapeutics and industrial enzymes is clear. It is reasonable to expect that this utility will expand within the public and private sectors as more examples are discovered in the coming years. Sometimes one must explore the past in order to navigate the future.

Concluding Thoughts

Starting with the vision of Journal of Molecular Evolution founding editor Emile Zuckerkandl together with Linus Pauling, through the maximum likelihood method of Felsenstein to the application of this method to protein ancestral sequences by Koshi and Goldstein, Journal of Molecular Evolution has been an important home for the development of the field. As models and statistical frameworks for characterizing protein evolution over a phylogenetic tree continue to improve, these developments will continue to impact the field of ancestral sequence reconstruction, with downstream applications in fields as disparate as biomedicine and community ecology.