# Algorithms for optimizing cross-overs in DNA shuffling

## Abstract

### Background

DNA shuffling generates combinatorial libraries of chimeric genes by stochastically recombining parent genes. The resulting libraries are subjected to large-scale genetic selection or screening to identify those chimeras with favorable properties (e.g., enhanced stability or enzymatic activity). While DNA shuffling has been applied quite successfully, it is limited by its homology-dependent, stochastic nature. Consequently, it is used only with parents of sufficient overall sequence identity, and provides no control over the resulting chimeric library.

### Results

This paper presents efficient methods to extend the scope of DNA shuffling to handle significantly more diverse parents and to generate more predictable, optimized libraries. Our CODNS (cross-over optimization for DNA shuffling) approach employs polynomial-time dynamic programming algorithms to select codons for the parental amino acids, allowing for zero or a fixed number of conservative substitutions. We first present efficient algorithms to optimize the local sequence identity or the nearest-neighbor approximation of the change in free energy upon annealing, objectives that were previously optimized by computationally-expensive integer programming methods. We then present efficient algorithms for more powerful objectives that seek to localize and enhance the frequency of recombination by producing "runs" of common nucleotides either overall or according to the sequence diversity of the resulting chimeras. We demonstrate the effectiveness of CODNS in choosing codons and allocating substitutions to promote recombination between parents targeted in earlier studies: two GAR transformylases (41% amino acid sequence identity), two very distantly related DNA polymerases, Pol X and *β* (15%), and beta-lactamases of varying identity (26-47%).

### Conclusions

Our methods provide the protein engineer with a new approach to DNA shuffling that supports substantially more diverse parents, is more deterministic, and generates more predictable and more diverse chimeric libraries.

### Keywords

Codon, Free Energy, Dynamic Programming Algorithm, Diversity Variance, African Swine Fever Virus

## Background

The harnessing of DNA recombination *in vitro* has transformed protein engineering by enabling engineers, like nature, to sample sequence space more broadly than is allowed by point mutagenesis at individual residues. Recombination produces chimeras comprised of sequential fragments from parent genes, thereby bringing together sets of sequences that were previously active in the parental background, and are thus likely to be less disruptive than random ones. Chimeragenesis typically produces combinatorial libraries, and those chimeras with beneficial properties can be identified by large-scale genetic screening and selection.

DNA shuffling is both *homology-dependent* (recombination can occur only in runs of similar DNA sequence), and *stochastic* (the engineer does not control the recombination sites). Due to dependence on sequence similarity, DNA shuffling may fail to generate desirable chimeras (or any chimeras at all) for diverse parents, as they have only a few, small regions of DNA similarity, insufficient to generate many cross-overs. Homology-independent stochastic methods (e.g., ITCHY [8] and SHIPREC [9]) mitigate the need for such parental sequence similarity, but at the cost of generating many more non-viable chimeras.

In contrast with stochastic methods, site-directed methods enable the engineer to explicitly choose break-point locations so as to optimize expected library quality (e.g., by employing structural information [10], by minimizing predicted disruption [11, 12], by optimizing library diversity [13], or by considering both factors [14]). We have developed a site-directed method employing planned ligation of parental fragments by short overhangs [15]. We have coupled this approach to robotic implementation in order to generate specific chimeras in defined experimental vessels [16]. Such highly-directed methods of chimera generation are most useful when screening represents a significant effort. In those situations where screening or genetic selection is readily available, stochastic approaches, with less overall cost, might prove preferable.

We demonstrate the effectiveness of CODNS in several case studies. We first optimize the GAR transformylases previously optimized by eCodonOpt [17]. We then show that CODNS can optimize two DNA polymerases (Pol X and Pol *β*) that are sufficiently diverse (15% amino acid sequence identity) to previously require the development and application of the SCOPE method [10], instead of direct application of DNA shuffling. Finally, we study the impact of parental sequence identity by considering pairs of beta-lactamase parents of differing diversity levels.

## Methods

We take as input the amino acid sequences of the *parent* proteins to be shuffled, aligned to a length of *n* (amino acids and gaps) based on sequence and/or structure. For simplicity of exposition, we present our methods for the most common case of shuffling two parents, *a*_{1} and *a*_{2}. Our methods readily extend to creating equivalent sites for recombination in multiple parents, and it remains interesting future work to allow for non-uniform shuffling (i.e., where different cross-overs are possible between different pairs of parents).

To optimize the shuffling experiment, we select a codon for each amino acid for each parent, yielding DNA sequences *d*_{1} and *d*_{2} of length 3*n* (maintaining gaps for those in the amino acid sequences). To expand the pool of codons being considered at a particular position, we may choose to make an amino acid substitution. Thus we take as additional input a specification of the *allowed substitutions* for each residue position for each parent, along with a number *m* of them to make. The allowed substitution specification may be derived from sequence and/or structural analysis of the parents, including general amino acid substitution matrices [18], position-specific amino acid statistics from related proteins [19], and $\mathrm{\Delta}\mathrm{\Delta}{\mathsf{\text{G}}}_{\mathsf{\text{fold}}}^{\circ}$ predictions for possible substitutions [20]. The results presented below determine allowed substitutions under the BLOSUM62 substitution matrix, considering only "conservative" substitutions which score no more than 4 worse than wild-type [15].

In describing the algorithms, we use *possible codon sets* representing the codons allowed at each position in the wild-type and under the allowed substitutions. For position *i*, set *C*_{1}[*i*] contains the possible codons for *a*_{1}[*i*], pairing each with an indication of whether or not it requires a substitution, e.g., {(TTT, 0), (TTC, 0), (TGG, 1)} for an F that could potentially be mutated to W. Set *C*_{2}[*i*] is defined similarly for the second parent. We note that these may readily be used to restrict where to employ mutations (e.g., masking based on structural analysis, as discussed by Moore and Maranas [17]), by allowing only wild-type codons (or amino acids) in some positions.
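As a rough illustration of the possible codon sets just described, the following sketch builds the set for one residue position. The function name `possible_codon_set`, the `allowed_subs` parameter, and the deliberately partial codon table are assumptions made for this example; gaps are not handled.

```python
# Partial standard genetic code: only the amino acids used in this example.
CODONS = {
    "F": ["TTT", "TTC"],
    "W": ["TGG"],
    "Y": ["TAT", "TAC"],
}

def possible_codon_set(wild_type, allowed_subs=()):
    """Return [(codon, t)], where t is 1 if the codon requires a substitution."""
    codon_set = [(c, 0) for c in CODONS[wild_type]]
    for aa in allowed_subs:
        if aa != wild_type:
            codon_set.extend((c, 1) for c in CODONS[aa])
    return codon_set

# An F that could potentially be mutated to W, as in the example in the text.
C_i = possible_codon_set("F", allowed_subs=["W"])

# A masked position: with no allowed substitutions, only wild-type codons remain.
C_masked = possible_codon_set("F")
```

Passing an empty `allowed_subs` is one simple way to express the masking of positions mentioned above.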

We consider four types of objective function, targeting *common nucleotides* (at aligned positions), *nearest-neighbor approximation to change in free energy of annealing* (from dinucleotide pairs), *common nucleotide runs* (in contiguous strings), or *library diversity* (among resulting chimeras). We develop increasingly more complex dynamic programming algorithms to optimize these objectives.

### Common nucleotide optimization

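The simplest objective is to maximize the number of common nucleotides at aligned positions of the two DNA sequences:

$$\sum_{i=1}^{3n} I\left\{d_{1}[i] = d_{2}[i]\right\}$$

where *I* is the indicator function (1 for true, 0 for false).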

With no substitutions allowed, each residue position is independent of the others. Thus we simply select for each position a pair of codons (one for each parent) with a maximal number of common nucleotides. When substitutions are allowed, we need to allocate them for optimal impact. While several approaches are possible, we develop here one based on dynamic programming, to serve as the basis for the more complex objective functions we pursue in subsequent subsections.

Let *N*[*i*, *s*] denote the number of common nucleotides within the first *i* residues, using exactly *s* substitutions. The value of *N*[*i*, *s*] extends the value of *N*[*i* - 1, *s* - (*t*_{1} + *t*_{2})] with the additional number of common nucleotides obtained by selecting a pair of codons for position *i* while making *t*_{1} + *t*_{2} additional substitutions (0 or 1 for each parent). Optimal substructure holds, since the optimal value of *N*[*i*, *s*] depends on the optimal value of *N*[*i* - 1, *s* - (*t*_{1} + *t*_{2})]. The recurrence is

$$N[i, s] = \max_{(c_{1}, t_{1}) \in C_{1}[i],\; (c_{2}, t_{2}) \in C_{2}[i]} \left\{ N[i-1,\, s - (t_{1} + t_{2})] + g(c_{1}, c_{2}) \right\}$$

where *g* gives the number (0-3) of common nucleotides for a pair of codons.

After filling in the dynamic programming table, we trace back from *N*[*n*, *m*] to generate an optimal pair of DNA sequences. The matrix is of size *n* × *m* and each cell takes constant time to compute.
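The table fill and traceback described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `common` and `optimize_common_nucleotides` are hypothetical, codon sets are lists of `(codon, t)` pairs, gaps are ignored, and ties are broken by keeping the first codon pair found.

```python
NEG_INF = float("-inf")

def common(c1, c2):
    """g(c1, c2): number (0-3) of positions at which two codons agree."""
    return sum(x == y for x, y in zip(c1, c2))

def optimize_common_nucleotides(C1, C2, m):
    """C1[i], C2[i]: lists of (codon, t) pairs; m: exact number of substitutions.

    Returns (best score, d1, d2), or None if exactly m substitutions are impossible.
    """
    n = len(C1)
    # N[i][s]: best score for the first i residues using exactly s substitutions.
    N = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    N[0][0] = 0
    for i in range(1, n + 1):
        for s in range(m + 1):
            for c1, t1 in C1[i - 1]:
                for c2, t2 in C2[i - 1]:
                    prev = s - (t1 + t2)
                    if prev < 0 or N[i - 1][prev] == NEG_INF:
                        continue
                    score = N[i - 1][prev] + common(c1, c2)
                    if score > N[i][s]:
                        N[i][s] = score
                        back[i][s] = (c1, c2, prev)
    if N[n][m] == NEG_INF:
        return None
    # Trace back from N[n][m] to recover one optimal pair of DNA sequences.
    d1, d2, s = [], [], m
    for i in range(n, 0, -1):
        c1, c2, s = back[i][s]
        d1.append(c1)
        d2.append(c2)
    return N[n][m], "".join(reversed(d1)), "".join(reversed(d2))

# Toy codon sets for two residue positions (illustrative, not from the case studies).
C1 = [[("TTT", 0), ("TTC", 0), ("TGG", 1)], [("AAA", 0), ("AAG", 0)]]
C2 = [[("TGG", 0)], [("AAG", 0), ("AGA", 1)]]
best_with_one = optimize_common_nucleotides(C1, C2, 1)
```

In the toy input, spending the single substitution at the first position makes both codons there identical, which is the kind of allocation the recurrence is meant to find.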

### $\mathrm{\Delta}{\mathsf{\text{G}}}_{\mathsf{\text{nn}}}^{\circ}$ optimization

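We estimate the change in free energy upon annealing by the *nearest-neighbor* approximation [24]:

$$\sum_{i=1}^{3n-1} \mathrm{\Delta}{\mathsf{\text{G}}}_{\mathsf{\text{nn}}}^{\circ}\left({d}_{1}\left[i\right]\cdot {d}_{1}\left[i+1\right],\; {d}_{2}\left[i\right]\cdot {d}_{2}\left[i+1\right]\right)$$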

where $\mathrm{\Delta}{\mathsf{\text{G}}}_{\mathsf{\text{nn}}}^{\circ}\left({d}_{1}\left[i\right]\cdot {d}_{1}\left[i+1\right],\phantom{\rule{2.77695pt}{0ex}}{d}_{2}\left[i\right]\cdot {d}_{2}\left[i+1\right]\right)$ is the free energy change associated with annealing dinucleotide *d*_{1}[*i*]·*d*_{1}[*i* + 1] with dinucleotide *d*_{2}[*i*]·*d*_{2}[*i* + 1]. These values can be computed from enthalpic Δ*H* (kcal/mol) and entropic Δ*S* (cal/mol·K) nearest-neighbor parameters compiled at 37°C and [Na^{+}] = 1.0 M [24], including both pairs of complementary strands. To actually estimate the change in free energy, there are additional constant terms such as the average initiation energy contribution; we omit them as they do not affect the optimization. While the underlying $\mathrm{\Delta}{\mathsf{\text{G}}}_{\mathsf{\text{nn}}}^{\circ}$ parameters are defined on pairs of dinucleotides, we abuse notation a bit in our formulation below and use $\mathrm{\Delta}{\mathsf{\text{G}}}_{\mathsf{\text{nn}}}^{\circ}$ for 4-mers to mean the sum over the constituent dinucleotides.
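To make the units concrete, here is a small sketch of how a free energy term can be obtained from enthalpy and entropy parameters via the standard relation ΔG = ΔH - TΔS, and how a 4-mer score can be summed over its constituent dinucleotides. The function names are hypothetical, the numeric inputs in the usage lines are placeholders rather than values from [24], and the treatment of both pairs of complementary strands is left out.

```python
def delta_g(delta_h_kcal, delta_s_cal, temp_celsius=37.0):
    """Free energy change (kcal/mol) from enthalpy (kcal/mol) and entropy (cal/(mol*K))."""
    temp_kelvin = temp_celsius + 273.15
    # Entropy is tabulated in cal, so convert to kcal before combining.
    return delta_h_kcal - temp_kelvin * delta_s_cal / 1000.0

def four_mer_score(dinuc_energy, x, y):
    """Sum a dinucleotide-pair energy over the three dinucleotides of two aligned 4-mers."""
    return sum(dinuc_energy(x[k:k + 2], y[k:k + 2]) for k in range(3))

# Placeholder parameters, chosen only to show the arithmetic.
example = delta_g(-8.0, -20.0)

# Placeholder energy function: -1 for identical dinucleotides, 0 otherwise.
toy = lambda a, b: -1.0 if a == b else 0.0
example_4mer = four_mer_score(toy, "ACGT", "ACGA")
```

With the placeholder inputs, `example` works out to -8.0 + 310.15 × 0.020 = -1.797 kcal/mol.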

In the recurrence for this objective, *b*·*c* indicates the concatenation of base *b* onto codon *c*, and $\mathrm{\Delta}{\mathsf{\text{G}}}_{\mathsf{\text{nn}}}^{\circ}$ estimates the change in free energy, as described above. Cell *A*[*i*, *s*, *b*_{1}, *b*_{2}] holds the best score for the first *i* positions, using exactly *s* substitutions, with third nucleotides *b*_{1} (first parent) and *b*_{2} (second parent) for position *i*. As with common nucleotide optimization, if a codon pair makes *t*_{1} + *t*_{2} substitutions at position *i*, then *A*[*i*, *s*, *b*_{1}, *b*_{2}] extends the solution to a cell for position *i* - 1 with *s* - (*t*_{1} + *t*_{2}) substitutions, considering any of the third nucleotides ${b}_{1}^{\prime}$ and ${b}_{2}^{\prime}$ at position *i* - 1.

The table is of size *n* × *m* × 5^{2} for 2 parents, since there are only (4 + 1)^{2} combinations of single nucleotide pairs for two parents (four nucleotides and a gap each). Each cell can be computed in constant time. In practice, we construct a 2D table (over *i* and *s*), with each cell maintaining a list of scores for the (*b*_{1}, *b*_{2}) pairs that actually occur.
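The idea of keeping, for each *i* and *s*, only the (*b*_{1}, *b*_{2}) pairs that actually occur can be sketched with a dictionary per position. This is an illustrative simplification under several assumptions: `toy_energy` is a hypothetical stand-in for the nearest-neighbor parameters (lower is better), the objective is minimized, gaps are ignored, and each cell stores its partial sequences directly instead of using a separate traceback.

```python
def toy_energy(x, y):
    """Hypothetical stand-in for the nearest-neighbor term: -1 per identical dinucleotide."""
    return -1.0 if x == y else 0.0

def span_energy(x, y, energy):
    """Sum the dinucleotide energy over two aligned strings of equal length."""
    return sum(energy(x[k:k + 2], y[k:k + 2]) for k in range(len(x) - 1))

def optimize_energy(C1, C2, m, energy=toy_energy):
    """Minimize total energy with exactly m substitutions; returns (score, d1, d2) or None."""
    # layer maps (s, b1, b2) -> (best score, d1 so far, d2 so far),
    # where b1 and b2 are the third nucleotides chosen at the current position.
    layer = {(0, "", ""): (0.0, "", "")}
    for set1, set2 in zip(C1, C2):
        nxt = {}
        for (s, b1, b2), (score, d1, d2) in layer.items():
            for c1, t1 in set1:
                for c2, t2 in set2:
                    s_new = s + t1 + t2
                    if s_new > m:
                        continue
                    # b.c: the previous third base concatenated onto the new codon.
                    cand = score + span_energy(b1 + c1, b2 + c2, energy)
                    key = (s_new, c1[2], c2[2])
                    if key not in nxt or cand < nxt[key][0]:
                        nxt[key] = (cand, d1 + c1, d2 + c2)
        layer = nxt
    finals = [v for (s, _, _), v in layer.items() if s == m]
    return min(finals, key=lambda v: v[0]) if finals else None

# Toy codon sets for two residue positions (illustrative only).
C1 = [[("TTT", 0), ("TTC", 0), ("TGG", 1)], [("AAA", 0), ("AAG", 0)]]
C2 = [[("TGG", 0)], [("AAG", 0), ("AGA", 1)]]
plan = optimize_energy(C1, C2, 1)
```

Real nearest-neighbor parameters could be plugged in through the `energy` argument without changing the table structure.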

### Run optimization

Moore and Maranas argued that the nearest-neighbor approximation to change in annealing free energy is a better objective for shuffling optimization than the number of common nucleotides [17]. Intuitively, since the nearest-neighbor approximation considers adjacent nucleotides together rather than treating them independently, it is more likely to yield sufficient complementarity between fragments and thereby promote recombination. Here we go even further and explicitly optimize for contiguous complementary regions, since annealing is driven by sufficiently long (anecdotally 6 nt or more) such regions.

We define a *common nucleotide run* as a maximal-length substring appearing at aligned positions in the DNA sequences *d*_{1} and *d*_{2}, and use as our objective function:

$$\sum_{R} f\left(|R|\right)$$

where *f*, which must be non-decreasing, indicates the value for DNA shuffling of a run of length |*R*|, and the sum is taken over all runs *R*. We have implemented and tested several different scoring functions; the results use the following two functions, *f*_{1} and *f*_{2}.

In *f*_{1}, we count the total number of nucleotides in a run, but only if the run reaches a given threshold length *θ* (we empirically evaluated several thresholds). This assumes that cross-overs are impossible for runs with fewer than *θ* common nucleotides, and become increasingly likely with additional nucleotides beyond *θ*. In *f*_{2}, we consider cross-overs impossible for fewer than 6 nucleotides and very likely for 9 nucleotides or more (scoring the total number of nucleotides as in *f*_{1}), and we ramp up from the impossible score of 0 at 5 nt to the likely score of 9 at 9 nt, thereby counting the partial benefit that may be provided by runs between 6 and 9 nucleotides.
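The two scoring functions can be written down directly from this description. The sketch below makes two assumptions that the text does not fix: the default threshold `theta=6` is only an example value, and the ramp in `f2` between 5 nt and 9 nt is taken to be linear.

```python
def f1(r, theta=6):
    """Count every nucleotide in a run, but only once the run reaches length theta."""
    return r if r >= theta else 0

def f2(r):
    """0 up to 5 nt, the full length from 9 nt on, and an assumed linear ramp in between."""
    if r <= 5:
        return 0.0
    if r >= 9:
        return float(r)
    # Linear interpolation from a score of 0 at 5 nt to a score of 9 at 9 nt.
    return 9.0 * (r - 5) / 4.0

# Scores for run lengths 5 through 10 under each function.
f1_scores = [f1(r) for r in range(5, 11)]
f2_scores = [f2(r) for r in range(5, 11)]
```

Both functions are non-decreasing in the run length, as the objective requires.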

Cell *R*[*i*, *s*, *r*] holds the best score for the first *i* positions, using exactly *s* substitutions, such that the final nucleotide in the codons chosen for position *i* is the *r*th in a run (0 if mismatch). Again, if we make *t*_{1} + *t*_{2} substitutions at position *i*, then *R*[*i*, *s*, *r*] extends the solution to a cell for position *i* - 1 with *s* - (*t*_{1} + *t*_{2}) substitutions. Now we must also account for the preceding run length; there are several cases (Figure 3, right): the codons chosen for the current amino acid position may continue a run from the previous position, may end that run, and may start a new run. In any case, the current *r* and possible codon pair determine the preceding *r*' at which to look, and optimal substructure still holds. The recurrence is thus

where *a*(*c*_{1}, *c*_{2}) and *z*(*c*_{1}, *c*_{2}) give the lengths of the longest common prefix and suffix, respectively, of a pair of codons. The first case handles a common codon, while the second case handles an unequal codon pair, which may end and/or begin a run. The score depends on that from the related cell, with an increment in *f*(·) accounting for any extension in run length and initiation of a new run. (See again Figure 3, right.) When there is a tie, we prefer the codon pair with the most common nucleotides, even if that has no impact on run score. This choice increases overall sequence identity, to promote better annealing of strands from different parents.
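The helper quantities *a* and *z*, and the run objective itself, are simple to compute for a given pair of sequences. The sketch below is a brute-force evaluator rather than the dynamic program; the function names are hypothetical, and it assumes gap-free, equal-length DNA strings.

```python
def common_prefix_len(c1, c2):
    """a(c1, c2): length of the longest common prefix of two codons."""
    k = 0
    while k < len(c1) and c1[k] == c2[k]:
        k += 1
    return k

def common_suffix_len(c1, c2):
    """z(c1, c2): length of the longest common suffix of two codons."""
    return common_prefix_len(c1[::-1], c2[::-1])

def run_lengths(d1, d2):
    """Lengths of the maximal runs of identical nucleotides at aligned positions."""
    runs, current = [], 0
    for x, y in zip(d1, d2):
        if x == y:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs

def run_score(d1, d2, f):
    """Sum f(|R|) over all common nucleotide runs R."""
    return sum(f(r) for r in run_lengths(d1, d2))

# Example with a threshold-style score: count a run's nucleotides if it has 6 or more.
threshold_score = lambda r: r if r >= 6 else 0
score = run_score("TGGAAGTTT", "TGGAAGTAT", threshold_score)
```

Such an evaluator is mainly useful for checking the score of a plan after the fact, for instance when comparing plans optimized under different objectives.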

The matrix is of size *n* × *m* × (3*n* + 1), since the run length potentially ranges from 0 to the entire DNA sequence length (3*n*). However, in practical cases, most run lengths are not attainable. Furthermore, for *r*_{1} < *r*_{2}, if *R*[*i*, *s*, *r*_{1}] + *f*(*r*_{2}) - *f*(*r*_{1}) < *R*[*i*, *s*, *r*_{2}], then the *r*_{2} cell "dominates" the *r*_{1} one; that is, the *r*_{1} one cannot be part of the optimal solution. Thus we modify the usual dynamic programming algorithm slightly, to avoid filling in cells with unattainable or dominated run lengths. We perform the standard nested loop over *i* (residue position) and *s* (number of substitutions). Then for each *i* and *s* we determine which run lengths are attainable and undominated and fill in only those entries. Rather than keeping a 3D table, we keep a 2D table in which each cell has a list of run lengths and their scores. Note from the structure of Eq. 14 that we can determine the run lengths for *i*, *s* from the possible codons at *i* and the run lengths that were attained and undominated for *i* - 1 and *s*, *s* - 1, and *s* - 2 (depending on the numbers of substitutions required for the codons).
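The dominance test can be applied to one (*i*, *s*) cell at a time. In the sketch below, a cell is assumed to be a dictionary from attained run length to score; the name `prune_dominated` and this representation are illustrative choices, not taken from the paper.

```python
def prune_dominated(cell, f):
    """cell maps run length r -> score R[i, s, r]; return only the undominated entries."""
    kept = {}
    for r1, score1 in cell.items():
        # r1 is dominated if some longer run r2 still scores strictly better
        # after r1 is credited with the difference f(r2) - f(r1).
        dominated = any(
            r1 < r2 and score1 + f(r2) - f(r1) < score2
            for r2, score2 in cell.items()
        )
        if not dominated:
            kept[r1] = score1
    return kept

# Threshold-style run score used only for this example.
f = lambda r: r if r >= 6 else 0
pruned = prune_dominated({0: 10, 3: 9, 7: 20}, f)
```

The quadratic scan is adequate here because, as noted above, only a few run lengths are attainable in practice.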

### Diversity optimization

We define the *diversity variance* over a library as:

where *λ* is the number of fragments, *m*(*H*_{ i }, *H*_{ j }) is the *mutation level* (number of amino acid differences) between a pair of chimeras *H*_{ i } and *H*_{ j }, and $\overline{m}$ is the average of *m* over the library. (We drop a constant factor of 2, which doesn't affect the optimization.) To mitigate the effect of neutral mutations, rather than using literal equality we measure *m* using one of the standard sets of amino acid classes. The goal is to minimize the variance, seeking to sample sequence space as uniformly as possible.

The objective function is defined in terms of the chimeras in the library. In the context of DNA shuffling, we assume that a sufficiently large run of common nucleotides (with respect to a threshold *θ* as in Eq. 10) results in a breakpoint, and thus that the (full-length) chimeras are well-defined as all combinations of fragments between the breakpoints. Breakpoints resulting from smaller runs only add to the diversity of the resulting library.
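For intuition, the objective can be evaluated by brute force on a tiny library, which is exactly the exponential enumeration that an efficient algorithm must avoid. The sketch below rests on several assumptions: the variance is taken over unordered pairs of distinct chimeras with a plain 1/(number of pairs) normalization, which may differ from the paper's constant factors; the mutation level uses literal equality rather than amino acid classes; and the toy parents and breakpoint positions are invented for illustration.

```python
from itertools import combinations, product

def chimeras(p_a, p_b, breakpoints):
    """All chimeras that take each fragment between breakpoints from either parent."""
    cuts = [0] + sorted(breakpoints) + [len(p_a)]
    fragments = [(p_a[i:j], p_b[i:j]) for i, j in zip(cuts, cuts[1:])]
    return ["".join(choice) for choice in product(*fragments)]

def mutation_level(h_i, h_j):
    """m(H_i, H_j): number of residue differences between two chimeras."""
    return sum(x != y for x, y in zip(h_i, h_j))

def diversity_variance(library):
    """Variance of the mutation level over all unordered pairs of distinct chimeras."""
    levels = [mutation_level(a, b) for a, b in combinations(library, 2)]
    mean = sum(levels) / len(levels)
    return sum((x - mean) ** 2 for x in levels) / len(levels)

# Toy parents that differ at every residue; one breakpoint, placed evenly or unevenly.
even = diversity_variance(chimeras("AAAA", "BBBB", [2]))
uneven = diversity_variance(chimeras("AAAA", "BBBB", [1]))
```

In this toy case the evenly placed breakpoint gives the lower variance, consistent with the goal of sampling sequence space as uniformly as possible.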

For an efficient algorithm, we must be able to compute the objective function during the optimization, without enumerating the exponential number of chimeras. In our previous site-directed work [13], we developed a recursive formulation relating the diversity variance for a library to that of a sub-library with one fewer breakpoint. That formulation took as given the total number of breakpoints, which isn't available in the DNA shuffling context. However, similar algebraic manipulations (omitted due to lack of space) yield a related formula without requiring pre-knowledge of the number of breakpoints.

**Claim 1**

*The diversity variance d*(*l*, *k*) *of a library from parent sequences P*_{ a } *and P*_{ b } *whose kth breakpoint is at residue l can be computed from the diversity variance d*(*l*', *k* - 1) *for a library with* (*k* - 1)*st breakpoint at residue l*' < *l by the following formula:*

*where*

*and we use notation P*[*i*, *j*] *to indicate the substring from position i to j, inclusive*.

We add two more dimensions, to keep track of *k*, how many runs of length *θ* we have seen (i.e., confidently yielding breakpoints), and *l*, where the last one was, as in the claim. Intuitively these two additional dimensions are necessary since the number of breakpoints affects the size of the library and thus the diversity variance, and since the additional diversity induced by a run depends on the nucleotides between the previous breakpoint and the new one. Note that in Eq. 16, *k* is the number of breakpoints, with the last breakpoint always at the end of the current position *l*; however, in Eq. 19, *k* is the number of previous runs, or *k* + 1 when substituted into Eq. 16. As with run optimization, our implementation avoids filling in the table for run lengths that are unattainable (though the notion of dominated entries does not carry over).

### Codon usage

In order to promote better protein expression, we follow the GeneDesigner protocol [25] in employing organism-specific codon usage tables. A codon usage table for an organism [26] encodes the frequency with which each codon has been observed in a sequence database; different organisms display different "preferences" [27]. In a preprocessing step, we disallow rare codons that make up less than 10% of the occurrences for their amino acid. Then when computing one of the recurrences, we use the codon usage table to resolve cases where multiple possible codons give the same score (i.e., they have the same implications for continuing, ending, and beginning runs). In such cases, we select among the possible codons with probability according to their usage frequency.
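The two uses of the codon usage table, filtering and tie-breaking, might look like the following sketch. The function names are hypothetical and the usage counts are invented for illustration; real counts would come from a codon usage table such as [26].

```python
import random

def drop_rare_codons(usage, min_fraction=0.10):
    """usage maps codon -> observed count for one amino acid; drop codons below the cutoff."""
    total = sum(usage.values())
    return {c: n for c, n in usage.items() if n / total >= min_fraction}

def break_tie(tied_codons, usage, rng=random):
    """Pick one of several equally scoring codons with probability proportional to usage."""
    weights = [usage[c] for c in tied_codons]
    return rng.choices(tied_codons, weights=weights, k=1)[0]

# Invented counts for four synonymous codons of one amino acid.
usage = {"CTG": 50, "CTC": 20, "TTA": 25, "CTA": 5}
allowed = drop_rare_codons(usage)

# A seeded generator makes the random tie-break reproducible.
choice = break_tie(["CTG", "CTC"], allowed, rng=random.Random(0))
```

Passing a seeded `random.Random` instance is a convenient way to keep optimized plans reproducible from run to run.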

## Results and discussion

We use three case studies to demonstrate the effectiveness of CODNS in optimizing DNA shuffling experiments. The first two case studies are a pair of glycinamide ribonucleotide (GAR) transformylases (previously optimized by eCodonOpt [17]) and a pair of distantly related DNA polymerases (previously recombined by SCOPE [10]). We optimize shuffling plans using from 0 to 10 mutations under each of the objective functions, abbreviated in the figures as *cn* (common nucleotides), Δ*G* (nearest-neighbor approximation to change in free energy of annealing), *f*_{1} (runs under *f*_{1} scoring), *f*_{2} (runs under *f*_{2} scoring), and *dv* (library diversity). We examine particular plans optimized under different objectives, in order to see how they differ in allocating mutations and producing homologous runs suitable for cross-overs. We then study the overall trends in optimizing the objectives and in producing runs. We also consider the diversity of the chimeras that would result by recombination under different run-optimal plans. Comparisons with what would result from eCodonOpt [17] can be made by noting that it optimizes *cn* and Δ*G* (though we use an efficient dynamic programming algorithm to do so). In a third case study, we evaluate the effects of wild-type sequence identity on the optimization, using different pairs of beta-lactamases.

### GAR transformylases

The parents for our first case study are a GAR transformylase from *E. coli* and one from humans. Previous work showed that DNA shuffling crossovers are extremely rare without codon optimization [17]. We obtained the (gapless) alignment from the supplementary material of [17], and transcribed it to 201 amino acids with 82 (40.8%) in common. The wild-type DNA sequences had 47% nucleotides in common [17], with only two runs of length 7 and no runs longer than 7 nt.

Diversity optimization places the runs more evenly throughout the entire sequence than *f*_{1} optimization does, so that crossing over at those sites would yield chimeras comprised of more uniformly-sized fragments, better sampling the sequence space spanned by the parents. ("Size" in diversity optimization refers to residues at which the parents differ, not just the total number of amino acids [13].)

Since the *f*_{2} metric gives "partial credit" for run lengths of 6, 7, and 8, we break out those contributions to its score. We see most of the optimization still focuses on full 9 nt and larger runs, which is natural given the reduced score contribution for shorter runs (since they are believed to be less productive in promoting recombination). The trends for diversity optimization are not shown here since scores are not directly comparable for libraries of different sizes (resulting from different numbers of runs yielded by different patterns of codons and mutations).

*f*_{2}, as it also optimizes for "partial credit" runs (of lengths 6, 7, and 8).

### DNA polymerases

Our second case study involves two distantly-related members from the X-family of DNA polymerases: African swine fever virus DNA polymerase X (Pol X) and *Rattus norvegicus* DNA polymerase beta (Pol *β*). While these two proteins share a similar fold, they have very low sequence identity. The site-directed SCOPE method [10] was developed due to the difficulty in producing viable Pol X -- Pol *β* chimeras by other methods. We obtained the published structure-based sequence alignment of the two parents, in which the full Pol X and the palm and finger domains of Pol *β* were aligned to a length of 214 residues and gaps, with only 32 residues (15%) in common. The wild-type DNA sequences had only 158/642 (24%) nucleotides in common, with no common nucleotide runs of length greater than 5. Thus standard DNA shuffling techniques are unlikely to produce any cross-overs.

*f*_{2} forms some potentially productive shorter runs. The difference increases with more mutations, as run optimization directly allocates them so as to produce more runs, while, due to the parental diversity, the indirect choices made to optimize common nucleotides and $\mathrm{\Delta}{\mathsf{\text{G}}}_{\mathsf{\text{nn}}}^{\circ}$ are unlikely to lead to runs. With less freedom, it is harder to optimize diversity. We do see that while the *f*_{1} plan is diversity-optimal for 0 and 4 mutations, it is not for 8 mutations, and the diversity-optimal plan spreads mutations out more. We also observe that the N-terminal region is so diverse that no run is produced there even with 8 mutations. Diversity optimization thus tends to create runs that are more evenly distributed in the large C-terminal region.

### Beta-lactamases

Our third case study examines the effect of wild-type sequence identity. Beta-lactamases, which hydrolyze the beta-lactam found in certain antibiotics (e.g., penicillin), have been the object of much chimeragenesis work, including DNA shuffling [1] and site-directed methods [11]. We previously developed a multiple sequence alignment (272 residues and gaps) of diverse beta-lactamases [15]. For the present study, we considered (a) the common beta-lactamase targets TEM-1 (*E. coli*) and PSE-4 (*Pseudomonas aeruginosa*) (42% amino acid identity); (b) the even more diverse pair from *P. aeruginosa* and *Bacillus licheniformis* (26% id); (c) the more similar pair from *E. coli* and *Proteus mirabilis* (47% id).

*f*_{1} metric (Figure 12). We again see the linear trend for both objectives with increasing mutations from 0 to 10. The actual energy score and run score both depend on parental sequence identity, with the same ranking on both metrics.

## Conclusion

DNA shuffling is a staple of protein engineering, and we have demonstrated that our new algorithms can substantially improve the expected productivity of an experiment. Even without performing any mutations, we are able to allocate codons to better form runs. By performing a small number of conservative substitutions, not expected to significantly affect stability or activity, we generally are able to increase the number of runs and the number of nucleotides in runs, linearly with the number of substitutions. Finally, since we are establishing runs whose lengths are sufficient to promote regular recombination, we can enhance our optimization to account for properties of the resulting chimeric library. Future directions include extending run optimization to incorporate the type of potential underlying $\mathrm{\Delta}{\mathsf{\text{G}}}_{\mathsf{\text{nn}}}^{\circ}$ (i.e., accounting for differences in nucleotide content), to optimize multiple parents simultaneously, and to integrate CODNS within our Pareto-optimization framework [29] in order to optimize productivity of shuffling in concert with other properties. While these extensions will increase the computational expense, the resulting gain in experimental efficiency could be well worth it. In summary, our methods yield a new approach to DNA shuffling that supports substantially more diverse parents, is more deterministic, and generates more predictable and more diverse chimeric libraries.

## Notes

### Acknowledgements

This work is supported in part by US NSF grant IIS-1017231.

### References

1. Stemmer WPC: Rapid evolution of a protein *in vitro* by DNA shuffling. Nature. 1994, 370: 389-391. 10.1038/370389a0.
2. Stemmer WPC: DNA shuffling by random fragmentation and reassembly: in vitro recombination for molecular evolution. Proc Natl Acad Sci USA. 1994, 91: 10747-10751. 10.1073/pnas.91.22.10747.
3. Littlehales C: Profile: Willem 'Pim' Stemmer. Nat Biotechnol. 2009, 27: 220. 10.1038/nbt0309-220.
4. Crameri A, Raillard SA, Bermudez E, Stemmer W: DNA shuffling of a family of genes from diverse species accelerates directed evolution. Nature. 1998, 391: 288-291. 10.1038/34663.
5. Chang C, Chen T, Cox B, Dawes G, Stemmer W, Punnonen J, Patten P: Evolution of a cytokine using DNA family shuffling. Nat Biotechnol. 1999, 17: 793-797. 10.1038/11737.
6. Ness J, Welch M, Giver L, Bueno M, Cherry J, Borchert T, Stemmer W, Minshull J: DNA shuffling of subgenomic sequences of subtilisin. Nat Biotechnol. 1999, 17: 893-896. 10.1038/12884.
7. Christians F, Scapozza L, Crameri A, Folkers G, Stemmer W: Directed evolution of a thymidine kinase for AZT phosphorylation using DNA family shuffling. Nat Biotechnol. 1999, 17: 259-264. 10.1038/7003.
8. Ostermeier M, Shim JH, Benkovic SJ: A combinatorial approach to hybrid enzymes independent of DNA homology. Nat Biotechnol. 1999, 17: 1205-1209. 10.1038/70754.
9. Sieber V, Martinez CA, Arnold FH: Libraries of hybrid proteins from distantly related sequences. Nat Biotechnol. 2001, 19: 456-460. 10.1038/88129.
10. O'Maille PE, Bakhtina M, Tsai MD: Structure-based combinatorial protein engineering (SCOPE). J Mol Biol. 2002, 321: 677-691. 10.1016/S0022-2836(02)00675-7.
11. Meyer MM, Silberg JJ, Voigt CA, Endelman JB, Mayo SL, Wang ZG, Arnold FH: Library analysis of SCHEMA-guided protein recombination. Protein Sci. 2003, 12: 1686-1693. 10.1110/ps.0306603.
12. Ye X, Friedman A, Bailey-Kellogg C: Hypergraph model of multi-residue interactions in proteins: sequentially-constrained partitioning algorithms for optimization of site-directed protein recombination. J Comput Biol. 2007, 14: 777-790. 10.1089/cmb.2007.R016. Conference version: Proc. RECOMB, 2006, pp. 15-29.
13. Zheng W, Ye X, Friedman AM, Bailey-Kellogg C: Algorithms for selecting breakpoint locations to optimize diversity in protein engineering by site-directed protein recombination. Comput Syst Bioinformatics Conf. 2007, 6: 31-40.
14. Zheng W, Friedman AM, Bailey-Kellogg C: Algorithms for joint optimization of stability and diversity planning combinatorial libraries of chimeric proteins. J Comput Biol. 2009, 16: 1151-1168. 10.1089/cmb.2009.0090. Conference version: Proc. RECOMB, 2008, pp. 300-314.
15. Saftalov L, Smith P, Friedman A, Bailey-Kellogg C: Site-directed combinatorial construction of chimaeric genes: general method for optimizing assembly of gene fragments. Proteins. 2006, 64 (3): 629-642. 10.1002/prot.20984.
16. Avramova L, Desai J, Weaver S, Friedman A, Bailey-Kellogg C: Robotic hierarchical mixing for the production of combinatorial libraries of proteins and small molecules. J Comb Chem. 2008, 10: 63-68. 10.1021/cc700106e.
17. Moore G, Maranas C: eCodonOpt: a systematic computational framework for optimizing codon usage in directed evolution experiments. Nucleic Acids Res. 2002, 30: 2407-2416. 10.1093/nar/30.11.2407.
18. Henikoff S, Henikoff JG: Amino acid substitutions from protein blocks. Proc Natl Acad Sci USA. 1992, 89: 10915-10919. 10.1073/pnas.89.22.10915.
19. Durbin R, Eddy S, Krogh A, Mitchison G: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. 1998, Cambridge University Press.
20. Guerois R, Nielsen JE, Serrano L: Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J Mol Biol. 2002, 320: 369-387. 10.1016/S0022-2836(02)00442-4.
21. Sun F: Modeling DNA shuffling. J Comput Biol. 1999, 6: 77-90. 10.1089/cmb.1999.6.77.
22. Moore GL, Maranas CD, Lutz S, Benkovic SJ: Predicting crossover generation in DNA shuffling. Proc Natl Acad Sci USA. 2001, 98: 3226-3231. 10.1073/pnas.051631498.
23. Maheshri N, Schaffer D: Computational and experimental analysis of DNA shuffling. Proc Natl Acad Sci USA. 2003, 100: 3071-3076. 10.1073/pnas.0537968100.
24. SantaLucia J: A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc Natl Acad Sci USA. 1998, 95: 1460-1465. 10.1073/pnas.95.4.1460.
25. Villalobos A, Ness J, Gustafsson C, Minshull J, Govindarajan S: Gene Designer: a synthetic biology tool for constructing artificial DNA segments. BMC Bioinformatics. 2006, 7: 285. 10.1186/1471-2105-7-285.
26. Nakamura Y, Gojobori T, Ikemura T: Codon usage tabulated from the international DNA sequence databases: status for the year 2000. Nucleic Acids Res. 2000, 28: 292. 10.1093/nar/28.1.292. [http://www.kazusa.or.jp/codon/]
27. Gouy M, Gautier C: Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Res. 1982, 10: 7055-7074. 10.1093/nar/10.22.7055.
28. Waterman MS, Byers TH: A dynamic programming algorithm to find all solutions in a neighborhood of the optimum. Math Biosci. 1985, 77: 179-188. 10.1016/0025-5564(85)90096-3.
29. He L, Friedman AM, Bailey-Kellogg C: A divide and conquer approach to determine the Pareto frontier for optimization of protein engineering experiments. Proteins. 2012, 80 (3): 790-806. 10.1002/prot.23237.
30. Joern J: Directed Evolution Library Creation: Methods and Protocols. 2003, Humana Press, 85-89. DNA shuffling, Methods Mol Biol, vol 1.

## Copyright information

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.