The Standard Genetic Code (SGC) exists in every known organism on Earth. SGC evolution via early unique codon assignment, then later wobble, yields coding resembling the near-universal code. Below, later wobble is shown to also create an optimal route to accurate codon assignment. Time of optimal codon assignment matches the previously defined mean time for ordered coding, exhibiting ≥ 90% of SGC order. Accurate evolution is also accessible, sufficiently frequent to appear in populations of 10³ to 10⁴ codes. SGC-like coding capacity, code order, and accurate assignments therefore arise together, in one attainable evolutionary intermediate. Examples, which plausibly resemble coding at evolutionary domain separation, are characterized.
Early evolution of the Standard Genetic Code (SGC) has been computed (Yarus 2021) by dividing code formation into time slices (passages). During a passage, with specified probability, coding triplets may either be assigned or capture mutationally related triplets for their preexisting assignment (or for related assignments) or can decay, losing assigned meaning. But also, nothing need happen during a passage. This procedure yields normal dynamic phenomena, like first- and second-order rates, as well as standard near-steady states. It is mathematically equivalent to defining typical first- and second-order kinetic constants for initiation, decay, and capture of codon assignments (Yarus 2021).
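The passage scheme can be illustrated with a deliberately simplified sketch. It is not the published calculation (Yarus 2021): the function names, the probabilities `p_init`, `p_decay`, and `p_capture`, and the rule of visiting one random triplet per passage are all assumptions made for illustration. Capture here simply copies an assignment to a free triplet one point mutation away, and ignores related assignments and polar requirement.

```python
import random

BASES = "UCAG"
TRIPLETS = [a + b + c for a in BASES for b in BASES for c in BASES]


def neighbors(triplet):
    """Triplets one point mutation away from `triplet` (9 per triplet)."""
    return [triplet[:i] + b + triplet[i + 1:]
            for i in range(3) for b in BASES if b != triplet[i]]


def evolve(n_passages, p_init=0.5, p_decay=0.05, p_capture=0.05,
           n_functions=22, seed=0):
    """Toy passage model; all probabilities are arbitrary placeholders.

    Each passage visits one random triplet. An unassigned triplet may be
    assigned; an assigned triplet may decay or capture a free neighbor.
    With the remaining probability, nothing happens during the passage.
    """
    rng = random.Random(seed)
    table = {t: None for t in TRIPLETS}  # None marks an unassigned triplet
    for _ in range(n_passages):
        triplet = rng.choice(TRIPLETS)
        r = rng.random()
        if table[triplet] is None:
            if r < p_init:                      # initial assignment
                table[triplet] = rng.randrange(n_functions)
        elif r < p_decay:                       # decay: meaning is lost
            table[triplet] = None
        elif r < p_decay + p_capture:           # capture of a related triplet
            target = rng.choice(neighbors(triplet))
            if table[target] is None:
                table[target] = table[triplet]
    return table


if __name__ == "__main__":
    code = evolve(115)
    assigned = sum(v is not None for v in code.values())
    functions = len({v for v in code.values() if v is not None})
    print(f"assigned triplets: {assigned}, distinct functions: {functions}")
```

Because every event is a per-passage probability, averages over many such runs behave like first- and second-order kinetics, which is the equivalence the text describes.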
SGC-like coding tables arise by combining SGC-specific initial codon assignment (Yarus 2017, ≈ 10% randomness allowed) with coevolutionary capture (Wong 1981) that prefers amino acids with similar polar requirements (Woese et al. 1966; Mathew and Luthey-Schulten 2008). More particularly, in order to fill a coding table, wobble must arise late, for example, appearing after 20 amino acids are assigned. This leaves modern initiation and termination for a later origin, consistent with their unconserved, and thus late-arising, molecular components (Yarus 2021). Below, late wobble not only provides SGC access, but also an accessible, optimal route to SGC-like order with amino acid coding capacity.
Simple Crick Wobble is Used
To suit primordial coding, only natural unmodified nucleotides are assumed. Thus, wobble implies only that U:G and G:U pairs are allowed at third codon nucleotide positions, as Crick first proposed (Crick 1966). Thus, XYU and XYG may be read either by normal base-pairing (A and C, respectively) or by wobble pairings (G and U, respectively).
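The pairing rule just stated can be written out as a small lookup. This is only a sketch of simple Crick wobble as described here; the dictionary and function names are illustrative, not taken from the source.

```python
# Watson-Crick partners for unmodified RNA nucleotides.
WATSON_CRICK = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Simple Crick wobble: the only extra pairs allowed are U:G and G:U,
# and only at the third codon position.
WOBBLE_PARTNER = {"U": "G", "G": "U"}


def anticodon_bases_reading(third_codon_base):
    """Anticodon bases able to read a given third codon nucleotide."""
    readers = [WATSON_CRICK[third_codon_base]]          # normal pairing
    if third_codon_base in WOBBLE_PARTNER:
        readers.append(WOBBLE_PARTNER[third_codon_base])  # wobble pairing
    return readers


def third_bases_read_by(anticodon_base):
    """Third codon nucleotides readable by a given anticodon base."""
    return sorted(b for b in "ACGU"
                  if anticodon_base in anticodon_bases_reading(b))


for base in "UCAG":
    print(f"XY{base} is read by anticodon base(s) {anticodon_bases_reading(base)}")
```

Under this rule an anticodon G reads both XYC and XYU, and an anticodon U reads both XYA and XYG, so one assignment can cover two codons once wobble is adopted.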
Ingredients for an Optimum
In Fig. 1a, late-wobbling coding table evolution is shown in a fashion designed to clarify approach to an SGC-like coding table.
The fraction of coding tables encoding 20 functions or more (≥ 20 fn; on the leftward ordinate) rises with time after a lag to accumulate enough assigned triplets for 20 distinct assignments.
The fraction of coding tables correctly assigned is about an order smaller and descends with time (on the rightward ordinate), after the first near-complete tables appear around 60 passages. Eventually, Fig. 1a assignments that are 10% at random (rather than wholly SGC-like) will be made in error.
Moreover, the SGC has many wobble assignments. But an early non-wobble coding table, evolving at 60 passages, is already 67% full. Thereafter, decreasing unassigned triplets limit added wobble. Thus, as the table is filled before wobble is instituted, wobble declines and resemblance to the SGC ultimately suffers.
Three descending curves show coding accuracy: 2 or fewer assignment errors (≤ 2 mis; top line), 1 or fewer misassignments (≤ 1 mis; middle), or fully SGC-like coding with no assignment error (0 mis; bottom). In the latter case, with complete assignment precision, we touch nearest the SGC itself.
Thus, Fig. 1a shows a crucial conflict. Sufficient capacity for realistic amino acid coding increases with time; accuracy decreases. So, there is an optimum, when accuracy and coding capacity best coexist.
An Optimum for Completion and Accuracy Defined
Figure 1b combines these coding criteria. The rising plot of capacity for ≥ 20 encoded functions is repeated to help time other events.
The other three lines, with maxima, show the fraction of codes that have both capacity to encode ≥ 20 functions and ≤ 2 assignment errors (top), ≥ 20 functions and ≤ 1 error (middle) and ≥ 20 functions with complete SGC-like codon assignment (bottom). Notably, most probable times for both qualities are similar: 100 passages (≤ 2 mis), 100–120 passages (≤ 1 mis), and 120 passages (0 mis). Optimal durations allow 39 to 44 initial triplet assignments, under these conditions (Fig. 1a, legend).
Accessibility of SGC-Like Codes
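Pobs = probability of observation in E independent evolutions, with event abundance/probability = Pevent:
\[ P_{obs} = 1 - (1 - P_{event})^{E} \approx 1 - e^{-E \, P_{event}} \]
where the latter equation is accurate for somewhat rare events, Pevent < ≈ 0.1. For even odds of observation, Pobs = 0.5:
\[ E = \frac{\ln 2}{P_{event}} \approx \frac{0.693}{P_{event}} \]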
Abundance of accurately-formed codes, Pevent (Fig. 1b), implies plausible population size E.
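As a numerical check, the relation between abundance and population size can be sketched as follows. The sketch assumes the standard independent-trials form, Pobs = 1 − (1 − Pevent)^E, with its rare-event approximation 1 − exp(−E·Pevent); the function names are illustrative. The abundance 8 × 10⁻⁵ is the value quoted in the text for error-free codes.

```python
import math


def p_obs(p_event, n_evolutions):
    """Probability of at least one success in independent evolutions."""
    return 1.0 - (1.0 - p_event) ** n_evolutions


def p_obs_rare(p_event, n_evolutions):
    """Rare-event approximation of p_obs (reasonable for p_event < ~0.1)."""
    return 1.0 - math.exp(-n_evolutions * p_event)


def evolutions_for_even_odds(p_event):
    """Population size E at which p_obs is about 0.5 (rare-event form)."""
    return math.log(2) / p_event


if __name__ == "__main__":
    p_event = 8e-5  # abundance of error-free codes quoted in the text
    e = evolutions_for_even_odds(p_event)
    print(f"E for even odds: {e:.0f}")
    print(f"exact P_obs at that E: {p_obs(p_event, round(e)):.4f}")
```

For an abundance of 8 × 10⁻⁵ this gives a population of roughly 8700 independent codes, the figure the text cites for codes with no misassignment.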
In Fig. 1c, fraction of codes specifying ≥ 20 functions again serves as time reference. Importantly, populations of about 300 (≤ 2 mis) codes, 1200 (≤ 1 mis) codes, or 8700 (0 mis) codes would be required, at minima in Fig. 1c, to find codes that closely resemble the SGC, with even odds. SGC-like coding can therefore exist in biologically conceivable evolving populations, despite astronomically vast ensembles of possible coding schemes.
These results are consistent with previous searches (Yarus 2021), in which codes with 1 or 2 misassignments were found by searching among 600 coding tables, evolving under similar late wobble conditions. In prior searches, however, SGC-like code order was sought, rather than explicit assignment accuracy.
A 4-Fold Optimum
Remarkably, there is a time (≈ 115 passages; Fig. 1b, c; Yarus 2021) and a coding state (after 42 initial assignments under these conditions; Fig. 1a) when a nascent coding table, having just adopted wobble, simultaneously possesses near-optimal spacing (identical assignments in related codons), near-optimal chemical order (related triplets associated with similar polar requirements), and SGC-like triplet sequences, the latter extending to total identity: codes with no assigned codons differing from the SGC (Figs. 1b, c, 2). Such coding tables encode 20 functions (Fig. 1a–c) and, for example, might specify all amino acids. Such competent coding precisely overlapping the SGC appears with probability ≈ 8 × 10⁻⁵ (Fig. 1b).
Filling the Coding Table
To entirely resemble the SGC, one also wants full codes; most triplets assigned, but room left for late assignment of modern initiation and termination codons, and perhaps a few late, complex amino acids. In shifting to optimal time (Fig. 1b, c), unassigned triplets increase: at the 177 passage mean time (Yarus 2021), 5.4 triplets were unassigned, and at the present 115 passage optimum, 8.5 triplets are yet to be accounted for, on average. However, free triplets are also distributed. For the optimal time in Fig. 1c, 0.24 of all evolutions have 0 to 4 triplets free (using assignments for modern initiation and termination to define this hypothetical target). Because coding without misassignment has the same free distribution, this implies that codes that also have an appropriate set of unassigned triplets might be 0.24 of total, as shown in Fig. 1b, c. Using the above abundance equation, nearly full, highly ordered, 20 function, accurately assigned codes with appropriate room for known late functions appear, with even odds, among the following number of independent codes:
\[ E = \frac{0.693}{0.24 \times 8 \times 10^{-5}} \approx 3.6 \times 10^{4} \]
If there must also be room for other late functions assigned by different means, the code population required for even odds would contain between 8.7 × 10³ and 3.6 × 10⁴ independent codes. Notably, the individuals required to present codes will be greater than the number of independent codes required. But this is a feasible biological ensemble; in fact, an evolving SGC might appear in a small biological group, if code-bearers are microorganisms.
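The quoted range can be reproduced with a short calculation. This sketch assumes, as the text argues, that the free-triplet distribution is the same for error-free codes as for all codes, so the two fractions simply multiply; the constant and function names are illustrative.

```python
import math

P_EXACT = 8e-5             # abundance of 20-function, error-free codes (from the text)
FRACTION_WITH_ROOM = 0.24  # fraction of evolutions with 0 to 4 free triplets (from the text)


def even_odds_population(p_event):
    """Independent codes needed for even odds of observing one rare event."""
    return math.log(2) / p_event


# Lower bound: accuracy alone. Upper bound: accuracy and the assumed
# independent requirement for an appropriate number of free triplets.
lower = even_odds_population(P_EXACT)
upper = even_odds_population(P_EXACT * FRACTION_WITH_ROOM)

print(f"population for even odds: {lower:.0f} to {upper:.0f} independent codes")
```

Requiring the extra property divides the abundance by about four, which is why the upper bound is about four times the lower one.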
An Example Coding Table
To make numerical results tangible, consider an explicit example. Figure 2 is an evolved coding table from the likely range above. It has been colored to display amino acid chemistry as polar requirement (Woese et al. 1966; Mathew and Luthey-Schulten 2008): blue for the most hydrophobic, through light blue, then gray, beige, orange, and red, the latter for the most hydrophilic amino acids. Parenthetical numbers beside amino acid names are corrected polar requirements (Mathew and Luthey-Schulten 2008). Figure 2’s encoding was the 12,804th in a series of 115 passage evolutions, has 6 unassigned triplets, encodes 20 functions, agrees completely with SGC assignments, and has SGC-like order: compactness (spacing = 0.947; where random coding = 0.0 and SGC = 1.0), SGC-like distance (distance = 0.962), and SGC-like chemical order (dPR = 0.963).
Figure 2 visually confirms that code order and assignment accuracy coexist at 115 passages. In its calculation, 26 such inerrant coding tables were evolved. These were found among ≈ 9000 evolutions with probability 0.5, agreeing with Fig. 1b’s 8700. This implies an abundance of 7.7 × 10⁻⁵, again agreeing with Fig. 1b’s 8 × 10⁻⁵. Figure 2 varies from the SGC with gaps in canonical Pro and Ala boxes, incomplete encoding at one termination (Ter) codon and a complex, possibly late-evolving amino acid (Trp; Grosjean and Westhof 2016). Except for such arguably realistic exceptions, Fig. 2 is SGC-like.
Time for SGC-Like Codes
Starting bloc selection (Yarus 2018) identifies early biological development as the ideal time for selection of a desired improvement. Were resemblance to the SGC selectable, such selection would work best early in coding table evolution (Fig. 1b, c). Accordingly, the SGC may be a new instance of starting bloc selection.
Three Evolutionary Eras
Evolution under these conditions (Fig. 1a legend) can be summarized, spanning a possible starting bloc. Early events fall readily into one of three approximately equal eras. During era 1 (ends at gray arrow 1, Fig. 1c), partial coding tables are filled to produce mature coding capacity (Fig. 1a). Era 1 codes likely compete on the basis of coding capacity. However, at 60 passages and 28 initial assignments, 20 function codes still comprise < 1% of the population.
During the second, optimal era (ends at gray arrow 2, Fig. 1c), passage of about the same amount of additional time and 13 more initial triplet assignments produce an optimum. Twenty-function late wobble coding sharply increases, 20- to 30-fold (Fig. 1b). Such coding can be highly ordered (Fig. 2): with identical assignments grouped, chemically similar amino acids associated with related triplets, and distance to the SGC short. A particularly interesting short distance exists in 7.7 × 10⁻⁵ of coding tables matching SGC capacity, order, and assignments simultaneously (gray arrow 2, Fig. 1c). Very SGC-like codes could be selected among 10⁴ independent codes or a few-fold more. However, if one or two differing codon assignments are tolerable for a posited selection, hundreds or thousands of codes could be sufficient to select an SGC precursor (gray arrow 2, Fig. 1c).
Figure 2 shows an evolved code example with accurate SGC-like assignments, from the end of era 2. The domains of life use similar initiation and termination triplets, but different mechanisms and molecules for interpreting them (Yarus 2021). Thus, coding for translation initiation and termination was defined before mature initiation and termination mechanisms were settled. Figure 2 therefore may parallel the genetic code near domain separation, when bacterial and archaeal domains diverged.
The third era, averaging evolution, ends with 50 mean initial assignments and 20 mean era 2-encoded functions (Yarus 2021), again after passage of another, similar era (at gray arrow 3, Fig. 1c). It seems likely that the SGC was completed in this era, adding definitive 21st and 22nd functions, initiation and termination. But fully competent, fully ordered, fully accurate codes are ≈ 4-fold rarer at gray arrow 3 than at arrow 2, defining a past optimum.
Given a credible duration for any one event, this reasoning will estimate real times on an early Earth. It does not seem overly optimistic to suppose that this will be achieved.
References
Amirnovin R (1997) An analysis of the metabolic theory of the origin of the genetic code. J Mol Evol 44:473–476
Crick FH (1966) Codon-anticodon pairing: the wobble hypothesis. J Mol Biol 19:548–555
Freeland SJ, Hurst LD (1998) The genetic code is one in a million. J Mol Evol 47:238–248
Grosjean H, Westhof E (2016) An integrated, structure- and energy-based view of the genetic code. Nucleic Acids Res 44:8020–8040
Massey SE (2019) Genetic code error minimization as a non-adaptive but beneficial trait. J Mol Evol 87:4–6
Mathew DC, Luthey-Schulten Z (2008) On the physical basis of the amino acid polar requirement. J Mol Evol 66:519–528
Ronneberg TA, Landweber LF, Freeland SJ (2000) Testing a biosynthetic theory of the genetic code: fact or artifact? Proc Natl Acad Sci USA 97:13690–13695
Woese CR, Dugre DH, Saxinger WC, Dugre SA (1966) The molecular basis for the genetic code. Proc Natl Acad Sci USA 55:966–974
Wong JT-F (1981) Coevolution of genetic code and amino acid biosynthesis. Trends Biochem Sci 6:33–36
Yarus M (2017) The genetic code and RNA-amino acid affinities. Life 7:13
Yarus M (2018) Eighty routes to a ribonucleotide world; dispersion and stringency in the decisive selection. RNA 24:1041–1055
Yarus M (2021) Evolution of the standard genetic code. J Mol Evol https://doi.org/10.1007/s00239-020-09983-9
Handling editor: Michelle Meyer.
Yarus, M. Optimal Evolution of the Standard Genetic Code. J Mol Evol 89, 45–49 (2021). https://doi.org/10.1007/s00239-020-09984-8