Key Points

Codon optimization is a method that is commonly used to increase the expression of biotherapeutic recombinant proteins through the use of synonymous codon mutations in messenger RNA (mRNA) coding regions.

A key assumption underlying codon optimization is that protein synthesis is restricted by rare codons; this assumption appears to be poorly supported in mammalian cells, which are frequently used to express recombinant proteins.

An unintended consequence of codon optimization is that it disrupts different types of information that overlap coding regions, which can affect local rates of translation elongation, lead to alterations in protein conformation, and increase immunogenicity.

1 Introduction

Various proteins, including hormones, monoclonal antibodies, enzymes, and blood factors, have great utility as drugs. In some cases, it has been possible to use material purified from natural sources for protein replacement therapy. For instance, diabetes mellitus has been treated using the peptide hormone insulin purified from cows and pigs [1]; similarly, hemophilia A has been treated with clotting factor VIII purified from human blood plasma [2]. However, a major limitation with using proteins purified from animal or human sources is that many proteins with therapeutic potential are expressed at such low levels that it is not realistic to purify them. In these cases, protein therapy can become feasible when recombinant proteins are over-expressed in genetically engineered cells on an industrial scale [3]. Although a variety of cell types, including bacterial, yeast, insect, and mammalian, have proven useful for recombinant protein expression, most approved protein drugs are produced in mammalian cell lines and the most commonly used cell lines are derived from Chinese hamster ovary (CHO) [4, 5]. These cells have numerous features suitable for the production of therapeutic proteins, including the ability to grow in suspension, the ability to grow in chemically defined serum-free media, and the ability to be cultured on a large scale (> 10,000 L) [6]. In addition, CHO cells provide post-translational processes similar to those in human cells, which include co-translational folding, chaperone binding, and glycosylation. Moreover, CHO and other mammalian cell lines are less likely to yield undesired post-translational modifications that can lead to a protein being recognized as foreign by the patient.

1.1 Overview of Recombinant Protein Expression

In general, the process of recombinant protein expression in mammalian cells involves cloning a suitable complementary DNA (cDNA) sequence into an expression vector, such as a DNA plasmid, and then introducing the construct into a host cell line by methods that include transfection, nucleofection, and the use of viral vectors [7,8,9,10]. Plasmids that enter the nucleus of transfected cells are transcribed, and messenger RNA (mRNA) encoding the target protein is translated. This process of transient gene expression is used to generate recombinant protein for several days and can produce milligram to gram amounts of recombinant protein, which is useful for academic studies and preclinical work. However, transient expression is not efficient for generating larger amounts of protein, which are required for clinical studies and commercialization.

The ability to express recombinant proteins on an industrial scale is possible because of tremendous advances in cell culture over the past century, including the development of antibiotics, sterile techniques, and chemically defined culture media [11,12,13,14]. Large-scale production is facilitated by generating stably transfected cell lines, which involves integrating an expression construct into the chromosomal DNA of the host cell line. Integration can occur at random chromosomal sites or in a site-directed manner that targets one or more chromosomal locations that may have been preselected for their abilities to facilitate high levels of expression and stability [15,16,17,18,19]. For generating stably transfected pools and clonal cell lines, a marker gene on the expression construct is typically used to enable screening or selection of cells [20, 21]. For example, the enhanced green fluorescent protein gene can be used as a screening marker to identify cells that express this protein and separate them from non-expressing cells by using fluorescence activated cell sorting (FACS). Stably transfected cells can also be generated by using a selectable marker gene and a method to kill cells that do not express this gene. Selectable markers include antibiotic-resistance genes, such as the neomycin-resistance gene, as well as the dihydrofolate reductase (DHFR) gene and the glutamine synthetase (GS) gene. This process can be illustrated using the GS marker gene: cells are transfected with an expression construct containing the GS gene; following transfection, cells are cultured under conditions that enable stably transfected cells, which express GS, to grow but kill cells that do not express this protein. Selection involves either growing cells without glutamine and in the presence of methionine sulfoximine, which inhibits endogenous GS activity, or using an auxotrophic cell line that lacks GS activity.

Following selection or screening, clonal cell lines can be obtained by limiting dilution or another cloning method, including FACS or growth in a semi-solid matrix. At this stage, individual stably transfected cells express recombinant protein at dramatically different levels. Expression is affected by numerous variables, including the chromosomal insertion site or sites, and the number of integrated plasmids. Even after high expressing clones are identified, their expression levels can still be affected by various factors, including genetic instability of the insertion site and methylation-induced transcriptional silencing [22]. Considerable heterogeneity can also occur within clonal cell lines [23]. In addition, endogenous genes are sometimes disrupted by insertion of the expression construct, which can have unanticipated effects. For these reasons, screening of cell lines generated by random integration is much more extensive than for targeted integration and can involve testing thousands of cell lines. For secreted proteins, a particularly powerful method for identifying high expressing clones involves culturing cells in a methylcellulose semi-solid matrix containing fluorescently labeled antibodies that recognize the recombinant protein [24, 25]. Single cells form colonies in the matrix and fluorescent halos develop around the colonies. The size and intensity of the halos correlate with the amount of secreted recombinant protein. High expressing colonies are picked using an automated picker, e.g., ClonePix (Molecular Devices, Sunnyvale, CA, USA), based on halo size/intensity and other parameters, including the size and shape of the colony, as well as its proximity to other colonies.

Once clonal cell lines with suitable expression, stability, and growth properties are identified, expression can be optimized for maximal production by adjusting culture conditions (see Wurm [26]). The amount of protein produced in stable cell lines can vary dramatically for different proteins, but yields of 1–10 g/L can typically be reached in CHO-based fed-batch cultures [27].

1.2 Use of Recombinant Proteins as Therapeutic Drugs

Over the years, numerous therapeutic proteins have been approved by the US Food and Drug Administration (FDA) [28]. Tissue plasminogen activator (tPA) was the first recombinant protein produced in mammalian cells (CHO) that was approved for clinical use [29]. tPA illustrates the advantage of using genetically engineered cells to overexpress a protein of interest as this protein can be expressed at high concentrations as a recombinant protein (50 pg/cell/day) but is only secreted naturally by mammalian cells at a low concentration [30].

A review of recent approvals of therapeutic recombinant proteins by the FDA for the period between 1 January 2011 and 31 August 2016 identified 62 proteins [5]. The majority are monoclonal antibodies (48%), which includes antibody–drug conjugates as well as antibody Fab fragments. Other major categories of proteins are coagulation factors (19%) and replacement enzymes (11%). The remaining therapeutics (22%) are fusion proteins, hormones, growth factors, and plasma proteins. The primary therapeutic indications for these approved proteins are in oncology (26%) and hematology (29%). Other indications are in cardiology/vascular disease, dermatology, endocrinology, gastroenterology, genetic disease, immunology, infectious diseases, musculoskeletal, nephrology, ophthalmology, pulmonary/respiratory disease, and rheumatology. Of these approved proteins, 50% were granted orphan designation.

More recently, there have been another 27 approvals by the FDA: 24 at the Center for Drug Evaluation and Research (CDER) between 1 September 2016 and 31 December 2017 and three at the Center for Biologics Evaluation and Research (CBER) between 1 September 2016 and 12 December 2017. These approvals included five biosimilars. Compared to the previous set of approvals discussed in Lagasse et al. [5], the percentage of approvals for new monoclonal antibodies was much higher (78 vs. 48%) and included one bispecific antibody and two antibody–drug conjugates. There were also two enzyme replacements, two vaccine antigens, and one Fc-fusion protein. In addition, a recombinant enzyme was approved in combination with a previously approved monoclonal antibody (mAb). The primary therapeutic indications for these proteins are in oncology (30%) and rheumatology (19%). Other indications are in dermatology, infectious diseases, hematology, genetic disease, immunology, musculoskeletal, and pulmonary/respiratory disease.

1.3 Expression Challenges Associated with Recombinant Proteins

As discussed in Sect. 1.1, the process of recombinant protein expression involves numerous steps that can affect expression, protein quality, and cell physiology. For many recombinant proteins, expression levels can determine commercial viability and often present a bottleneck for further development. Fortunately, there are numerous variables that can be considered for enhancing productivity (e.g., see Ayyar et al. [31]). In some cases, protein expression can be improved by using an alternative promoter to drive transcription of the recombinant mRNA, as different natural and synthetic promoters vary in strength and stability [32,33,34]. In addition, improvements in expression and stability can be realized by minimizing negative effects associated with some chromosomal sites, e.g., by including a chromosomal insulator sequence on the expression plasmid [35]. Recombinant mRNA levels can also be increased by purposefully generating cell lines with multiple gene copies, for example by using an expression construct containing the DHFR gene as a selectable marker [36]. For this approach, the DHFR construct is introduced into CHO cells that are deficient for DHFR and stably transfected cells are selected by using increasing concentrations of methotrexate, a drug that inhibits DHFR activity. Cell lines with multiple gene copies can also be generated by using site-directed integration approaches [18]. It is anticipated that the development of new integration strategies will provide even greater control of expression levels.

Increasing recombinant mRNA levels can be useful up to a point, beyond which there is no obvious further benefit, or even a negative effect [26, 37]. Moreover, it should be recognized that high levels of mRNA are not necessarily linked to high levels of protein [38, 39]. For example, cells selected using the DHFR selection method can contain up to thousands of genomic copies of the expression construct, but protein levels are maximally increased by only 10- to 20-fold (e.g., Wurm [26]). Problems associated with large numbers of genomic copies of an expression construct include reduced stability of the transgenes and other effects, including position effects and disruption of endogenous genes [26, 40, 41]. In addition, it is likely that high levels of recombinant mRNAs limit protein production by non-specific effects, e.g., by titrating transcription factors or RNA binding factors. Indeed, our own studies have shown that protein expression from an mRNA optimized for translation efficiency can be dramatically higher when transcription is driven by a weaker promoter than by a stronger promoter (Mauro and Chappell, unpublished observations). In addition, some negative effects associated with overexpression are related to the biological activity or toxicity of the recombinant protein, which may affect cell physiology.

Other features of recombinant genes can be modified to increase expression levels. For instance, protein production is often increased by including one or more introns in the recombinant gene [42]. In addition, the ability of an mRNA to compete with other mRNAs for the translation machinery and its efficiency of translation initiation can be enhanced by modifying the 5′ leader sequence, or by replacing it completely [43, 44]. One approach involves inserting natural or synthetic translation enhancing elements into the 5′ leader. Alternatively, initiation can be enhanced by completely replacing the 5′ leader with the 5′ leader of an efficiently translated mRNA, such as β-globin, or with a synthetic sequence optimized for ribosome recruitment and initiation [43]. Modification of 3′ untranslated region sequences can also yield increased expression by enhancing ribosome recruitment and mRNA stability [45].

Some proteins are inherently difficult to express because of features in the coding regions of the genes. This situation is not unexpected as some proteins, such as enzymes and hormones, are typically required at very low levels, can be harmful at higher levels, and are necessarily expressed poorly in the body. For example, the blood clotting factor VIII is required at low levels in the body and increased levels of this protein are associated with increased risk of thrombosis and stroke [46]. In cultured cells, this protein is notoriously difficult to express, and in the body, the factor VIII gene has evolved numerous features which limit its expression [47]. Unfortunately, some of these same evolved features likely make it difficult to overexpress the recombinant protein in cultured cells.

2 Codon Optimization

Codon optimization refers to approaches used for maximizing protein expression by overcoming expression limitations associated with codon usage. It is routinely used for applications in bioproduction as well as for in vivo nucleic acid therapeutic applications [31, 48]. Codon optimization has been reported to increase protein expression by up to > 1000-fold [49], although most reported increases are much more modest. Interestingly, synonymous codon mutations have also been used to de-optimize expression in order to fine-tune the expression of one of two light chain genes of a bispecific antibody, which resulted in increased expression of this antibody [92]. An overview of the process of mRNA translation is included below to provide appropriate background and context for this approach.

2.1 Messenger RNA Translation

Translation is the process whereby an mRNA template is decoded into a polypeptide sequence. This process consists of three steps: initiation, elongation, and termination [50]. Initiation involves recruitment of the small 40S ribosomal subunit by the mRNA, either at the 5′ m7G cap structure or at an internal site. The 40S subunit then moves to a start site, which is typically an AUG codon that is recognized by the initiator-methionine transfer RNA (tRNA) associated with the small subunit. The large 60S ribosomal subunit subsequently joins to form a ribosomal complex which is capable of peptide synthesis. During the elongation cycle, the ribosome facilitates base pairing interactions between codons in mRNAs and anti-codons in aminoacyl-tRNAs, which are tRNA molecules covalently linked to their cognate amino acids [51]. Figure 1a shows the codon-amino acid associations that comprise the genetic code. In the elongation cycle, the peptidyl transferase activity of the ribosome mediates the transfer of amino acids from tRNAs to a growing polypeptide chain. Polypeptide synthesis stops when the translating ribosome reaches a stop codon, which leads to dissociation of the ribosomal complex and release of the newly synthesized protein.
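The decoding step described above can be illustrated with a minimal Python sketch. It assumes the standard genetic code and a simplified model in which translation begins at the first AUG and ends at the first in-frame stop codon; the function name `translate` and the example sequence are hypothetical and chosen only for illustration. Cap-dependent scanning, start-site context, and internal initiation are not modeled.

```python
from itertools import product

# Standard genetic code, with codons enumerated in UCAG order
# (first base varies slowest). "*" marks the three stop codons.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(mrna: str) -> str:
    """Translate from the first AUG to the first in-frame stop codon.

    Simplifying assumption: initiation always occurs at the first AUG.
    """
    start = mrna.find("AUG")
    if start == -1:
        return ""
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # termination
            break
        peptide.append(amino_acid)  # one elongation cycle
    return "".join(peptide)


# Hypothetical mRNA fragment: 5' leader GG, then AUG ACC GCA UAA, then GG.
print(translate("GGAUGACCGCAUAAGG"))  # MTA
```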

Fig. 1

Degeneracy of the genetic code. a Codon–amino acid associations. For each amino acid, both the three-letter and one-letter abbreviations are indicated. The AUG start codon, which encodes methionine, is indicated in green. This same codon is used to specify methionine residues within coding regions. Three stop codons are indicated in red; they do not specify amino acids but terminate translation. With the exception of methionine and tryptophan, all amino acids are coded by two or more codons. b Degeneracy enables mRNAs containing different synonymous codons to encode the same polypeptide. This example shows how the same peptide sequence can be translated from mRNAs that differ significantly in their primary structure. In this example, the mRNA sequences in the left and right panels encode the same peptide but do not use any of the same codons and are only ≈ 43% identical at the nucleotide level. The nucleotide differences are indicated in red bold type in the right panel. Based on human codon usage [65], codons underlined by white bars can only be translated by the corresponding (cognate) aa-tRNA; codons underlined by red bars can be translated by both cognate and wobble tRNAs, and those underlined by blue bars can only be translated by wobble tRNAs because these codons lack a corresponding tRNA gene. In these illustrations, ribosomal subunits are indicated schematically as peach-colored structures; the smaller structure represents the 40S subunit, and the larger one represents the 60S subunit. The tRNA binding sites are labeled A, P, and E. For simplicity, each ribosome is shown with a tRNA molecule in the P site; the tRNA molecules are represented as cloverleaf structures. The tRNA in the P site is shown with the peptide chain encoded by the mRNA sequence shown. The next elongation cycle would involve recognition of the codon in the A site (ACC in the left panel; ACA in the right panel) by an aminoacyl (charged) Thr-tRNA. The peptidyl transferase activity of the ribosome would transfer the peptide chain from the tRNA in the P site to the threonine on the tRNA in the A site. A one-codon shift of the mRNA through the ribosome in the 3′ direction would then leave an uncharged tRNA in the E site, the tRNA with the growing peptide chain in the P site, and an empty A site, ready for the next aminoacyl tRNA. Abbreviations: A, aminoacyl; aa-tRNA, aminoacyl-tRNA; E, exit; mRNA, messenger RNA; P, peptidyl; tRNA, transfer RNA

2.2 Altering Codon Usage

Codon optimization strategies attempt to increase protein expression by altering the codon usage of the gene. Altering codon usage is possible because 20 amino acids are encoded by 61 codons (Fig. 1a). Although methionine (Met) and tryptophan (Trp) are encoded by a single codon each, all other amino acids are specified by two, three, four, or six codons. Because of this degeneracy in the genetic code, it is possible for mRNA sequences with different synonymous codon compositions to encode the same polypeptide [52] (Fig. 1b). Synonymous codons therefore provide a great deal of flexibility. In fact, for recombinant protein expression, a gene can be synthesized without even knowing the mRNA sequence by reverse translating the amino acid sequence. This process of reverse translation was used to express the first recombinant peptide, somatostatin, without knowing the mRNA sequence [53]. As gene sequences became available and were analyzed, it became evident that synonymous codon usage in nature is not random. Bias in codon usage varies between different organisms, between different tissues of the same organism, and even between different parts of the same gene [54, 55]. Factors affecting codon bias in bacteria, yeast, and Drosophila include correlations between codon bias and translation efficiency [56,57,58,59,60]. Other variables affecting codon bias include the background nucleotide composition of the genome, which can vary significantly even within genomes [61]. In addition, codon bias can be affected by the expression levels of various tRNAs, which can vary between different tissues [62,63,64]. Moreover, even within individual genes, codon bias can be influenced by various constraints, which include splicing motifs, conserved mRNA secondary structures, amino-terminal coding sequences (codon ramp), as well as constraints affecting protein folding [55].
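The degeneracy described above can be made concrete with a short Python sketch. It assumes the standard genetic code; the helper names (`SYNONYMS`, `count_encodings`, `reverse_translate`) and the example peptide are hypothetical. The sketch groups the 61 sense codons by amino acid, counts how many distinct mRNA coding sequences can encode a given peptide, and performs a naive reverse translation that picks one codon per residue.

```python
from collections import defaultdict
from itertools import product
from math import prod

# Standard genetic code, codons enumerated in UCAG order; "*" marks stop codons.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Group the sense codons by the amino acid they encode.
SYNONYMS = defaultdict(list)
for codon, amino_acid in CODON_TABLE.items():
    if amino_acid != "*":
        SYNONYMS[amino_acid].append(codon)


def count_encodings(peptide: str) -> int:
    """Number of distinct coding sequences that encode the peptide."""
    return prod(len(SYNONYMS[aa]) for aa in peptide)


def reverse_translate(peptide: str, choose=lambda codons: codons[0]) -> str:
    """Build one coding sequence; `choose` selects a codon from each synonymous set."""
    return "".join(choose(SYNONYMS[aa]) for aa in peptide)


print(len(SYNONYMS), sum(len(c) for c in SYNONYMS.values()))  # 20 61
print(count_encodings("MTA"))  # 16 (1 Met codon x 4 Thr codons x 4 Ala codons)
print(reverse_translate("MTA"))
```

Because the number of possible encodings is the product of the per-residue degeneracies, it grows exponentially with peptide length, which is why the choice among synonymous sequences is such a large design space.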

Different codon optimization strategies use synonymous codons to alter numerous features of mRNA coding sequences that can inhibit expression, including putative splice donor and acceptor sites. In addition, synonymous codons are used for convenience, e.g., to facilitate gene synthesis and cloning (reviewed in Mauro and Chappell [65]). However, the primary tactic for enhancing protein expression involves increasing the rate of synthesis by eliminating or minimizing occurrences of rare codons. The assumption is that poor expression is caused by poor codon usage. Over the years, codon optimization approaches have ranged from relatively simple approaches that replace all codons with the most frequently used ones [66, 67], to seemingly more sophisticated approaches, such as codon harmonization, which try to maintain regions of slow translation that are thought to be important for protein folding [68]. This approach of maintaining regions of slow translation may be oversimplified as various lines of evidence suggest that protein folding can be affected both by codons that are typically thought to mediate a slow rate of translation—to increase folding—as well as by codons thought to mediate a fast rate, which may be important for reducing the possibility of misfolded intermediates [69].
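The simplest strategy mentioned above, replacing every codon with the most frequently used synonymous codon, can be sketched as follows. This is an illustrative sketch only, not the method of any particular study or commercial tool. The codon usage counts are derived from a made-up toy reference sequence rather than from real genomic data, and all function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in UCAG order; "*" marks stop codons.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def split_codons(cds: str) -> list:
    """Split a coding sequence into complete codons."""
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def protein(cds: str) -> str:
    return "".join(CODON_TABLE[c] for c in split_codons(cds))


def most_frequent_recode(cds: str, usage: Counter) -> str:
    """Replace each codon with the most used synonymous codon in `usage`.

    Ties are resolved in favor of the first codon in UCAG table order.
    """
    best = {}
    for codon, aa in CODON_TABLE.items():
        if aa not in best or usage[codon] > usage[best[aa]]:
            best[aa] = codon
    return "".join(best[CODON_TABLE[c]] for c in split_codons(cds))


def percent_identity(a: str, b: str) -> float:
    """Nucleotide identity between two equal-length sequences."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


# Toy, made-up reference sequence used only to derive illustrative codon counts.
usage = Counter(split_codons("GCCGCCGCAACCACCACAAUGUAA"))

original = "AUGGCAACAUAA"  # hypothetical CDS: Met-Ala-Thr-stop
recoded = most_frequent_recode(original, usage)
print(recoded)                                         # AUGGCCACCUAA
print(protein(original) == protein(recoded))           # True
print(round(percent_identity(original, recoded), 1))   # 83.3
```

The encoded protein is unchanged, but the nucleotide sequence is not, which is the point at which overlapping information in the coding region can be altered.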

Together with my colleague Stephen Chappell, I have previously discussed and critically analyzed various codon optimization approaches for use in in vivo applications [65]. We identified three key assumptions that underlie various codon optimization strategies: (1) rare codons are rate-limiting for protein production; (2) synonymous codons are interchangeable without affecting protein structure and function; and (3) protein production can be increased by replacing rare codons with frequently used ones. A review of the literature indicated that these assumptions were either poorly supported or not generalizable. For example, the notion that rare codons are rate limiting for protein production is based on studies in Escherichia coli and lower eukaryotes and there is little evidence to support this idea in mammalian cells. In addition, there is abundant evidence demonstrating that synonymous codon changes, even individual codon changes, can significantly alter the formation of messenger ribonucleoprotein particles (mRNPs), mRNA secondary structure, mRNA stability, microRNA binding, translation, and protein folding [70,71,72].

2.2.1 Codon Usage in Mammals

One of the reasons codon usage is different in mammals is that a significant amount of variation in synonymous codon usage appears to be correlated with differences in the GC content of chromosomal regions known as isochores [61]. Isochores are large segments of DNA that have a uniform GC composition and encompass both coding and non-coding regions. An analysis of synonymous codon usage of different functional categories of human genes revealed that ≈ 70% of the variation in synonymous codon usage between genes could be explained by the GC content of the chromosomal region, as well as meiotic recombination, which is more common in these regions. Notably, synonymous codon differences caused by large-scale variations in GC content were found to be independent of the functional category of the genes. This observation indicates that different highly expressed genes in the same cell have different patterns of synonymous codon usage. For many of these genes, codon usage does not match tRNA abundance [61, 63, 73].
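Because most synonymous differences fall at the third codon position, one simple way to see how regional GC content and synonymous codon usage are coupled is to compare overall GC content with GC content at third positions (often called GC3). The sketch below is illustrative only; the function names and the two toy sequences, which encode the same peptide, are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C nucleotides in a sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def gc3_content(cds: str) -> float:
    """Fraction of G and C at third codon positions of a coding sequence."""
    return gc_content(cds[2::3])


# Two hypothetical coding sequences for the same peptide (Met-Ala-Thr-stop)
# that differ only at synonymous third positions.
at_rich = "AUGGCAACAUAA"
gc_rich = "AUGGCCACCUAA"

for name, cds in (("AT-rich", at_rich), ("GC-rich", gc_rich)):
    print(name, round(gc_content(cds), 2), round(gc3_content(cds), 2))
```

In this toy case the protein is identical, yet GC3 shifts from 0.25 to 0.75, which is the kind of variation that can track the GC content of the surrounding chromosomal region rather than expression level.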

In non-mammalian organisms, various studies have indicated that highly expressed genes contain more frequently used codons, which in many cases correlate with the expression levels of the corresponding tRNAs. Recent studies in E. coli, fungi, yeast, and Drosophila have demonstrated that frequently used synonymous codons have faster elongation rates than less frequently used codons [56,57,58,59,60]. There is not yet any comparable evidence in mammalian cells, and analyses of mRNA and tRNA populations do not support this idea. Indeed, various studies have reported good correspondence between overall codon usage in cells and corresponding tRNA levels.

In one study, an analysis of different human cell types identified two distinct tRNA pools that are differentially expressed in proliferating or differentiated cells [74]. The authors found that codon usage in the transcriptome was coordinated with the expression of corresponding tRNAs such that there was a balance between the codon populations and the tRNA pools that were required for their translation. Similar results were found in a study that determined the frequency of usage for all codons as well as tRNA expression levels in mouse liver and brain tissues at eight different developmental stages [63]. The results showed that the codon pools from the expressed mRNAs and the anticodon pools were highly correlated in both tissues through development. In addition, it was noted that there did not appear to be differential codon usage between highly expressed and poorly expressed genes. In another study, tRNA pools and codon usage were analyzed in human and mouse liver cancer cell lines (in vitro) and quiescent liver cells (in vivo) [73]. The authors concluded that the tRNA pool of any of these cell types was capable of translating the mRNA transcriptomes of any other cell type with similar efficiency. In addition, no evidence was found to support the notion that highly expressed mRNAs in the different cell types were optimized for translation efficiency. The authors suggested that any variabilities in codon usage between different gene sets were best explained by variations in GC content.

In mammals, lack of evidence for slower elongation rates at rare codons also comes from ribosome profiling studies. Ribosome profiling is a technique that uses deep sequencing to identify segments of mRNAs that are protected by ribosomes in cells. In a study performed using mouse embryonic stem cells, cells were treated with the drug harringtonine to stall new initiation events at the start codon. By using ribosome profiling to monitor run-off elongation, it was possible to determine the kinetics of translation in these cells [75]. This study reported that translation speed was largely independent of codon usage and there was no evidence of ribosomal pausing at rare codons. Although the authors did not rule out the possibility of specific examples, they found no evidence for a large effect of codon usage on the overall rate of elongation.

Another study supporting the notion that rare codons are not limiting for expression comes from an analysis of protein coding sequences in the human genome [76]. This study found that rare codons for alanine, proline, serine, and threonine are used preferentially in the first 50 codons of the coding region. The effect on expression of the rare alanine codon was tested in constructs with multiple alanine codons in the first 50 codons of a synthetic fusion protein. The results showed that expression from constructs containing the rare alanine codon was much higher than from those containing the more frequently used alanine codons.

2.2.2 Wobble Decoding

An important element that can affect the rate of elongation and is disrupted upon codon optimization is the type of tRNA interaction, i.e., whether a codon uses standard (Watson–Crick) or wobble tRNA base pairing interactions. A codon pairs to its cognate tRNA via three Watson–Crick interactions; by contrast, a codon can base pair to a non-cognate tRNA via a wobble interaction that uses standard base pairing for the first two nucleotides and less stringent pairing for the third nucleotide, e.g., G:U base pairing. Ribosome profiling in Caenorhabditis elegans and a human cell line (HeLa) indicated that the rate of elongation is slower at codons decoded by wobble tRNA interactions than at codons decoded by Watson–Crick tRNA interactions [77]. In human cells, there was an ≈ 65 to 300% increase in ribosome occupancy at codon positions for which the third base interaction was a wobble G:U base pair compared to a standard G:C base pair, consistent with a slower rate of elongation at these codons. An in-depth analysis of ribosome profiling data in yeast also demonstrated that recognition of codons by wobble base pairing is slower than for codons translated by Watson–Crick base pairing [78].

In yeast, wobble appears to be associated with another finding, which is that specific pairs of adjacent codons significantly reduce the rate of elongation, independent of any dipeptide effects [79]. In this study, it was observed that for 16 of 17 inhibitory codon pairs, one or both codons were wobble codons. In addition, for 10 of 11 pairs, it was shown that codon order was important, suggesting that the slower translation at some codon pairs was caused by more than just the additive effects of each codon. Moreover, the inhibitory effects could be suppressed more effectively by overexpressing a non-native tRNA with an exact match to the anticodon, than with native (wobble decoding) tRNAs. In another study, these inhibitory codon pairs were shown to be associated with faster mRNA decay [80]. Additional evidence that specific di-codon pairs affect translation in mammalian cells comes from an analysis of 35 synonymous single nucleotide polymorphisms (sSNPs) in 27 different genes for 22 human genetic diseases or traits, which identified disruptions determined by pairs of consecutive codons rather than by individual codon bias [81].

Wobble decoding is associated with significant complexity, which is disrupted by codon optimization. This complexity is illustrated in Fig. 1 of Mauro and Chappell [65]. Additional complexity comes from the fact that wobble itself can vary between organisms that express different subsets of the 61 possible aminoacyl tRNAs. Synonymous codon changes can disrupt the pattern of cognate and wobble tRNA interactions because some codons are decoded by only one cognate tRNA, other codons are decoded by both cognate and wobble tRNAs, and still other codons lack a corresponding tRNA gene and are decoded by only non-cognate tRNAs. In Fig. 1b, notice how the pattern of cognate, cognate/wobble, and wobble codon usage is completely different for the two mRNAs. tRNA wobble is only one variable, but it shows the complexity of trying to understand and recreate the elongation rhythm of an mRNA.
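The three decoding classes described above can be sketched in Python under strong simplifying assumptions. Only G:U wobble pairing at the third codon position is modeled; inosine and other modified nucleosides, which broaden decoding in real cells, are ignored. The set of anticodons is a hypothetical toy repertoire, not the tRNA gene content of any organism, and the function names are illustrative.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Codon third base -> anticodon first base (position 34) for G:U wobble pairs.
# Assumption: this is the only wobble interaction considered in this sketch.
GU_WOBBLE = {"U": "G", "G": "U"}


def cognate_anticodon(codon: str) -> str:
    """Anticodon (5'->3') that pairs with the codon by three Watson-Crick pairs."""
    return "".join(COMPLEMENT[base] for base in reversed(codon))


def wobble_anticodon(codon: str):
    """Anticodon that decodes the codon via a G:U pair at the third position, if any."""
    first = GU_WOBBLE.get(codon[2])
    if first is None:
        return None
    return first + COMPLEMENT[codon[1]] + COMPLEMENT[codon[0]]


def classify(codon: str, anticodons: set) -> str:
    has_cognate = cognate_anticodon(codon) in anticodons
    has_wobble = wobble_anticodon(codon) in anticodons
    if has_cognate and has_wobble:
        return "cognate/wobble"
    if has_cognate:
        return "cognate only"
    if has_wobble:
        return "wobble only"
    return "not decoded in this model"


# Hypothetical toy repertoire of threonine tRNA anticodons (5'->3').
toy_anticodons = {"GGU", "UGU", "CGU"}

for codon in ("ACU", "ACC", "ACA", "ACG"):
    print(codon, classify(codon, toy_anticodons))
```

With this toy repertoire, ACU is wobble only, ACC and ACA are cognate only, and ACG is cognate/wobble, so exchanging one threonine codon for another changes the decoding class even though the amino acid is unchanged.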

2.2.3 Additional Considerations

The goal of maintaining the natural folding pattern of a recombinant protein by preserving the elongation rhythm of the natural mRNA in the body is not trivial. There are numerous differences between the natural cell type in which a protein of interest is expressed, e.g., liver sinusoidal cells in the body, and a production cell line, such as CHO or human embryonic kidney 293 (HEK293), in a bioreactor under production conditions. Differences that could affect elongation include tRNA concentrations, levels of other mRNAs that determine whether translation conditions are competitive or non-competitive, and the codon composition of the transcriptome. tRNA concentrations are determined in part by which tRNA genes are present, the number of genes, and their expression levels. An additional potential consideration for production cell lines involves variations in codon usage that may be influenced by culture conditions, which are likely to affect both tRNA expression and the transcriptome. Moreover, overexpression of recombinant mRNA, either by transcriptional or translational enhancement, may itself disrupt the balance of codon demand and tRNA abundance, causing some tRNAs to become limiting and inadvertently altering elongation rates at specific codons. Even an unmodified natural mRNA coding sequence is likely to be translated differently in a production cell line than in the body. An important question is how these differences affect protein folding.

Another significant consideration associated with the use of codon-optimized constructs for in vivo applications, including gene therapy, RNA therapeutics, and DNA/RNA vaccines, is translation from out-of-frame cryptic translation start sites in coding regions [65]. Many out-of-frame reading frames are altered by codon optimization and encode novel peptides that may have undesirable properties. An example of this type of cryptic initiation was reported by Lorenz et al. [82], who codon optimized a papillomavirus E7 oncoprotein mRNA to isolate E7-specific T cell receptors for T cell receptor gene therapy. The codon-optimized mRNA was expressed from transfected dendritic cells that were incubated with T cells. The results revealed a T cell response with the codon-optimized but not the wild-type sequence. This response was mapped to a cryptic peptide from the +3 alternative reading frame. Granted, expression of novel cryptic peptides from codon-optimized mRNAs is less serious when expressing therapeutic proteins in a bioreactor because the therapeutic proteins are purified. However, it is still a consideration because the novel cryptic peptides may have unexpected biological effects that negatively affect the physiology of the cells or the expression and processing of the therapeutic protein.
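The way synonymous recoding preserves the in-frame protein while rewriting the out-of-frame reading frames can be illustrated with a minimal Python sketch. The two 18-nucleotide sequences below are invented toy examples, not sequences from any study cited here, and the frame offsets 0, 1, and 2 are a convention of this sketch rather than the frame nomenclature used by Lorenz et al. [82]. Translation uses the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq, frame=0):
    """Translate a DNA coding sequence starting at offset `frame` (0, 1, or 2).

    Trailing nucleotides that do not complete a codon are ignored.
    """
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


# Hypothetical toy sequences: same in-frame peptide, different synonymous codons.
NATIVE = "ATGACTCTGAAAGCTGGA"
RECODED = "ATGACCCTCAAGGCCGGC"

for frame in range(3):
    print(frame, translate(NATIVE, frame), translate(RECODED, frame))
```

In this toy case both sequences encode the same in-frame peptide, whereas the peptides read from the two shifted frames differ at most positions, which is the situation in which novel cryptic peptides could arise if translation initiated out of frame.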

The various lines of evidence discussed here indicate that trends regarding codon usage and elongation rates in mammals are much weaker than in other organisms. These lines of evidence include the effects of chromosomal isochores on GC distribution patterns and codon usage, the observed balance in codon and tRNA pools, as well as the effects associated with wobble decoding. However, these findings do not rule out possible effects for some genes or under certain conditions. For example, a study in HEK293 cells suggested that non-optimal codons are critical for promoting the translation of selective mRNAs during amino acid starvation [83].

2.3 Mammalian Codon Optimization: What’s the Harm?

Synonymous codon mutations are known to potentially affect protein expression at various levels and there is mounting evidence indicating that translation itself is affected and can lead to dramatic alterations in the conformation and processing of some proteins. Numerous examples in various reviews document this evidence (see McCarthy et al. [81], Gotea et al. [84], and Hunt et al. [85]).

A critical issue with codon optimization is that while it maintains the amino acid sequence of a protein, it can disrupt multiple other layers of information encoded in mRNA coding sequences [86, 87]. These overlapping functional elements are often difficult to identify. However, some of these elements can affect the rate of elongation locally, alter protein folding, and lead to changes in protein conformation and post-translational modifications. The non-neutral nature of synonymous codon mutations has been exploited in various studies which have screened synonymous mRNA variants to identify conformational variants of the encoded proteins with altered function (e.g., Cheong et al. [88]). The non-interchangeability of synonymous codons is also the basis for large-scale random recoding, which has been used successfully to attenuate more than a dozen viruses [89, 90, 91]. The approach of using synonymous codon mutations to alter protein function is very useful for particular applications, including industrial enzyme optimization. However, the possible effects of synonymous codon mutations on protein conformation are much riskier in the production of therapeutic proteins as they may lead to problems in the patient, including production of anti-drug antibodies that reduce drug efficacy, as well as immunogenic complications [93, 94, 95].

Disruption of overlapping information defining mRNA secondary structures that affect the rate of elongation at specific sites in the coding region was suggested to explain results obtained following codon optimization of a feline endogenous retroviral RD114-TR envelope protein [96]. Although codon optimization resulted in increased protein yield, there were associated glycosylation defects that interfered with correct processing of the envelope protein which led to the production of an inactive protein.

Factors associated with production of recombinant therapeutic proteins in CHO, or other cell lines, can lead to differences with the natural protein that trigger production of anti-drug antibodies in patients. Differences may include glycosylation, factors affecting the integrity of the recombinant protein, and conformational alterations. Recombinant erythropoietin (EPO) illustrates the type of problem that might occur if anti-drug antibodies also recognize the endogenous protein. Some patients treated with recombinant EPO for anemia associated with chronic renal failure developed neutralizing antibodies against EPO [97, 98]. These antibodies inhibited the activities of both the recombinant and endogenous proteins, which stopped red blood cell production and caused patients to develop pure red cell aplasia. In one of these studies, recombinant EPO preparations from different manufacturers were compared and it was found that some formulations were more or less likely to result in the development of anti-EPO antibodies [98]. While it is not known if codon optimization of recombinant EPO constructs contributed to this problem, it illustrates the type of problem that might be expected if a codon-optimized mRNA gives rise to a recombinant protein with an altered conformation.

Synonymous codon mutations are worrying inasmuch as many diseases have been linked to single synonymous codon mutations. A codon-optimized mRNA can be altered by up to 80% from its native form [99]; consequently, the net result is the introduction of a large number of synonymous codon mutations into an mRNA. A recent example illustrating the effects of a single synonymous codon mutation in mammalian cells comes from the analysis of the cystic fibrosis transmembrane conductance regulator (CFTR) gene [64]. This study demonstrated that a synonymous mutation of a threonine codon in this gene (ACT to ACG) affected both the conformation and function of the CFTR protein. Analysis of ribosome-protected fragments in a cystic fibrosis bronchial epithelial cell line revealed that ribosome occupancy of ACG codons was much higher than that of ACT codons; indeed, ACG was amongst the codons with the highest ribosome occupancy, suggesting that the mutated codon is one of the most slowly translated codons in these cells, and that the natural ACT codon is translated much more rapidly. These results were corroborated by data showing that the tRNA levels for these two codons were correlated with the predicted relative translation speeds of these codons. In addition, the authors showed that the structural and functional defects in the mutated CFTR protein could be rescued by increasing levels of the tRNA corresponding to the mutated ACG codon. These results strongly support the notion that a single synonymous codon mutation in the CFTR gene causes both structural and functional deficits in the protein because of slower translation at the mutated codon. Although the effects of a rare codon in this example seem contrary to those reported in many other studies in mammalian cells, it provides an example of the complexity of codon usage because the tRNA corresponding to the mutated ACG codon was not found to be rare in other human tissues, suggesting that the effects observed in the epithelial cell line are tissue specific.
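The extent to which a recoded sequence departs from its native counterpart can be expressed per nucleotide or per codon. The following Python sketch computes both for two invented toy sequences; the sequences and function names are assumptions made for illustration, and the sketch does not reproduce how the figure cited from [99] was calculated, nor does it check that the changes are synonymous.

```python
def split_codons(seq):
    """Split a coding sequence into consecutive in-frame codons."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def fraction_changed(native, recoded):
    """Return (fraction of nucleotides changed, fraction of codons changed)."""
    if len(native) != len(recoded):
        raise ValueError("sequences must have the same length")
    nt_changed = sum(a != b for a, b in zip(native, recoded))
    native_codons = split_codons(native)
    recoded_codons = split_codons(recoded)
    codons_changed = sum(a != b for a, b in zip(native_codons, recoded_codons))
    return nt_changed / len(native), codons_changed / len(native_codons)


# Hypothetical toy sequences encoding the same six-residue peptide.
NATIVE = "ATGACTCTGAAAGCTGGA"
RECODED = "ATGACCCTCAAGGCCGGC"

nt_fraction, codon_fraction = fraction_changed(NATIVE, RECODED)
print(f"nucleotides changed: {nt_fraction:.2f}, codons changed: {codon_fraction:.2f}")
```

The two measures can diverge considerably: here only the third position of each recoded codon changes, so a minority of nucleotides differ even though nearly every codon does.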

Codon optimization should be considered one of various possible factors that may contribute to the immunogenicity of a recombinant protein. In addition, not all biologicals are equivalent in terms of potential safety issues that may arise. For example, recombinant monoclonal antibodies that function by targeting other molecules may be inherently safer than recombinant versions of natural proteins, which can have dramatic consequences if anti-drug antibodies against the recombinant protein recognize the endogenous protein. In any case, an additional goal of codon optimization, beyond increased expression, should be increased safety.

2.4 Lost Opportunities?

A potential problem associated with codon optimization is that it is routinely used to try to increase protein yields when a protein moves from academic and preclinical studies to clinical trials. However, it is likely that many academic and preclinical studies are performed using gene constructs based on natural mRNA sequences. Codon-optimized variants may behave differently and underlie instances in which a protein generated very promising data in preclinical studies but failed to perform as expected after being scaled up under GMP conditions. The concern is that highly effective protein drugs may be negatively affected or even fall by the wayside when a codon-optimized version of a protein is used.

For instance, molecules that are potentially very useful for vaccine development include broadly neutralizing antibodies, e.g., from rare HIV-infected patients. The development of these antibodies in patients can take many years, often involving multiple rounds of extensive somatic mutation [100, 101]. Subtle changes such as those that arise from synonymous mutations associated with codon optimization may affect the binding activities of these broadly neutralizing antibodies and prevent them from functioning identically to those on which they were based, reducing or perhaps even eliminating their usefulness. This example is provided to illustrate the type of problems that may occur, and is not limited to broadly neutralizing antibodies from rare HIV-infected patients.

We do not know the extent to which codon optimization of recombinant proteins has resulted in reduced efficacy or increased immunogenicity. However, it is likely that in some cases these proteins represent lost opportunities and may be worth revisiting with non-codon-optimized mRNAs. This is particularly true for any proteins or antibodies that did not behave as expected after codon optimization, e.g., after being scaled up for clinical trials.

2.5 Why Does Codon Optimization Sometimes Increase Expression?

If codon optimization does indeed increase protein yields in mammalian cells because of enhanced codon usage and more efficient elongation, then this effect should be robust and reproducible. Although there is some evidence that the translation rates of some codon-optimized mRNAs are faster than those of non-optimized mRNAs (reviewed in Hanson and Coller [72]), increased expression does not seem to be a general finding, and numerous studies report little or no effect [65, 102]. In unpublished studies, we ordered codon-optimized light and heavy chain genes for a mAb. Three light chain genes and three heavy chain genes were ordered from the same commercial provider. Comparison of the codon-optimized nucleotide sequences revealed that they were all different, i.e., different synonymous codon mutations were used for each gene. Strikingly, when combinations of these light and heavy chain genes (n = 9) were expressed in transiently transfected CHO cells and mAb expression levels were compared, the results showed that expression varied by > 5-fold. The magnitude of the difference in mAb expression between different light and heavy chain combinations is difficult to reconcile with the proposed mechanism of increased elongation rates and does not inspire confidence regarding the expected expression properties of codon-optimized genes.

It seems likely that the increased expression of some codon-optimized mRNAs in mammalian cells is not due to increased elongation rates but is instead the result of an inadvertent event. For example, increased expression may result from elevated recombinant mRNA levels, which can occur by various mechanisms, including disruption of a miRNA seed sequence, decreased degradation of the mRNA, or increased transcription. In yeast, it was observed that codon usage bias was positively correlated with mRNA levels, which was at least partially due to effects on mRNA stability [103]. In addition, several recent studies have indicated that the effects of codon optimization occur at the level of transcription. In Neurospora it was shown that increased mRNA and protein levels obtained from codon-optimized mRNAs were not due to increased mRNA stability or translation but increased transcription [104]. This study suggested that some genes with non-optimal codons undergo transcriptional silencing at the chromatin level. A similar conclusion was reached in studies performed in mammalian cells which analyzed two Toll-like receptors (TLRs) [105]. This study showed that codon optimization of TLR7 increased its expression by 40-fold, whereas codon optimization of a closely related protein (TLR9) had no effect. Ribosome profiling studies indicated that the translation efficiency of codon-optimized TLR7 was only modestly increased and that the effect on expression was caused primarily by increased mRNA levels that resulted from increased transcription. The authors suggested that the effect on transcription was caused by an increase in GC content following codon optimization.
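Because the transcriptional effect in the TLR study was attributed to increased GC content, it can be useful to quantify how far a recoded sequence shifts both overall GC content and GC content at third codon positions (GC3), where most synonymous changes fall. The following Python sketch does this for two invented toy sequences that encode the same peptide; the sequences and function names are illustrative assumptions, not data or tools from the cited work.

```python
def gc_fraction(seq):
    """Fraction of nucleotides in `seq` that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def gc3_fraction(seq):
    """Fraction of third codon positions (in frame) that are G or C."""
    third_positions = seq[2::3]
    return gc_fraction(third_positions)


# Hypothetical toy sequences: same in-frame peptide, different synonymous codons.
NATIVE = "ATGACTCTGAAAGCTGGA"
RECODED = "ATGACCCTCAAGGCCGGC"

for name, seq in (("native", NATIVE), ("recoded", RECODED)):
    print(f"{name}: GC = {gc_fraction(seq):.2f}, GC3 = {gc3_fraction(seq):.2f}")
```

Comparing these two values for a native and a codon-optimized construct gives a quick indication of whether a change in expression could plausibly be related to altered nucleotide composition rather than to codon usage as such.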

2.6 Suggested New Goals for Codon Optimization

In many cases, codon optimization enhances protein expression, and it is expected that these methods will continue to improve as algorithms incorporate empirical observations based on codon usage and patterns that are correlated with high protein expression [106]. This is acceptable if the goal is increased expression, and it is appropriate for some applications, e.g., for protein evolution and increasing the expression and/or activity of industrial enzymes. However, for recombinant expression of natural therapeutic proteins in targeted cells, an additional goal should be to maintain the conformation and processing of the natural protein sequences. As suggested earlier, the best approach for increasing protein production is to increase the rate of translation initiation, directly or through factors affecting this process, for example, by incorporating translation enhancer elements, increasing mRNA levels, or using introns. Because of the potential problems associated with synonymous codon mutations, it is suggested that they should be used sparingly if at all, and, if so, with scientific justification. For the production of therapeutic proteins, it seems difficult to justify the large number of synonymous mutations associated with codon optimization.

In light of the possibility that codon optimization can lead to alterations in protein conformation, it has been suggested that it is crucial to assess the consequences of codon optimization before using a recombinant protein drug in patients [70]. The increased use of high-resolution methods for comparing conformational differences between proteins derived from natural and codon-optimized mRNAs is useful in identifying protein variants that may be potentially harmful [107]. It is expected that the development of new methods for rapidly and more easily probing protein conformation will enable screening of large numbers of protein variants at an early stage of development. Additional research in this area is still needed.

3 Summary and Conclusions

Numerous studies indicate that the scientific bases for codon optimization in mammals are poorly supported; because of this, it is difficult to justify the use of codon optimization as a tool for bioproduction of therapeutic proteins. The question, therefore, is why codon optimization is still commonly used. One possible reason is that in some cases, higher levels of protein expression are required for clinical trials and commercialization, and these expression levels can sometimes be obtained by using codon-optimized mRNAs, regardless of the underlying mechanism. Unfortunately, some of the potential problems associated with codon optimization, which can affect protein function and increase immunogenicity, may not be seen until the drug is in late-stage clinical trials, or after the drug is on the market [99].

It is surprising that biotherapeutic approvals by the FDA do not yet require disclosure of gene sequences [5], as knowledge of gene structure—native or codon optimized—would be useful in determining whether particular problems affecting drug safety are associated with codon optimization. Gene sequence information should be an important component in the FDA’s quality by design considerations. Thankfully, the effects of synonymous codon usage and potential problems associated with codon optimization have been recognized and are actively being studied by scientists at the FDA [5, 102]. Hopefully the FDA will soon take steps to address this situation. It should be noted that the absence of nucleic acid information also significantly impacts the generation of biosimilars, for which similarity is hard to achieve without knowing the gene sequence of the innovator drug. However, because biosimilars are replacing proteins developed using older technologies, it has been suggested that it may actually be better if biosimilars are not identical to the reference protein [5]. Moving towards the use of more natural mRNA sequences would be a step in the right direction.