Systematic analysis of short internal indels and their impact on protein folding
Protein sequence insertions/deletions (indels) can be introduced during evolution or through alternative splicing (AS). Alternative splicing is an important biological phenomenon and is considered as the major means of expanding structural and functional diversity in eukaryotes. Knowledge of the structural changes due to indels is critical to our understanding of the evolution of protein structure and function. In addition, it can help us probe the evolution of alternative splicing and the diversity of functional isoforms. However, little is known about the effects of indels, in particular the ones involving core secondary structures, on the folding of protein structures. The long term goal of our study is to accurately predict the protein AS isoform structures. As a first step towards this goal, we performed a systematic analysis on the structural changes caused by short internal indels through mining highly homologous proteins in Protein Data Bank (PDB).
We compiled a non-redundant dataset of short internal indels (2-40 amino acids) from highly homologous protein pairs and analyzed the sequence and structural features of the indels. We found that about one third of indel residues are in disordered state and majority of the residues are exposed to solvent, suggesting that these indels are generally located on the surface of proteins. Though naturally occurring indels are fewer than engineered ones in the dataset, there are no statistically significant differences in terms of amino acid frequencies and secondary structure types between the "Natural" indels and "All" indels in the dataset. Structural comparisons show that all the protein pairs with short internal indels in the dataset preserve the structural folds and about 85% of protein pairs have global RMSDs (root mean square deviations) of 2Å or less, suggesting that protein structures tend to be conserved and can tolerate short insertions and deletions. A few pairs with high RMSDs are results of relative domain positions of the proteins, probably due to the intrinsically dynamic nature of the proteins.
The analysis demonstrated that protein structures have the "plasticity" to tolerate short indels. This study can provide valuable guides in modeling protein AS isoform structures and homologous proteins with indels through placing the indels at the right locations since the accuracy of sequence alignments dictate model qualities in homology modeling.
KeywordsProtein Data Bank Protein Pair Amino Acid Frequency Protein Data Bank File Relative Solvent Accessibility
List of abbreviations used
Protein Data Bank
Root Mean Square Deviation
Sequence insertions/deletions (indels) occur during evolution and alternative splicing (AS) process in eukaryotes. The generation of various protein isoforms through alternative splicing has been considered as one of the major evolutionary mechanisms for increasing the proteome size and functional diversity [1, 2]. Recent high-throughput analysis based on mRNA-SEQ data from diverse human tissue and cell lines suggested that alternative splicing is almost universal (up to 94%) in human multi-exon genes . While there are several types of splicing events that result in different splice isoforms when compared to the primary sequences, such as truncation, substitution, insertion and deletion, the internal insertion/deletion cases are the dominant form of alternative splicing variants and are of great interest due to its potential impact on the folding and stability of isoform structures [3, 4]. In addition, genes containing "switch-like" exons are more likely to have isoforms with indels . It is critical to our understanding of the function of alternatively spliced protein isoforms if we know how sequence changes, especially sequence insertions and deletions, affect the structure of the splice variants as structures hold key information for the function of proteins.
Our current knowledge about how alternative splicing affects protein structures is very limited. While there are about 28,000 annotated protein isoforms from recent UniProt release 15.11 (November 24, 2009)  and over 60,000 protein structures deposited in Protein Data Bank (PDB) , fewer than 10 pairs of alternatively spliced isoforms have documented structures . Prediction of isoform structures generally falls into the category of homology modeling. However, homology modeling of proteins with indels is not a trivial task. The key to the success in homology modeling with indels is alignment accuracy, especially the positioning of the insertion or deletion sequences. For example, several groups at CASP8 (the 8th Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction) used the same protein 2G39 as the template to model target protein T0438, but only three of nine models placed the insertion sequence (12 amino acids) in the right place . Another infamous/famous example in indel positioning is the modeling of the long AS isoform of Piccolo C2A domain that has a nine-residue insertion in a loop. Instead of folding as part of the loop, the nine-residue insert displaces a β-strand that is pushed into the calcium-binding region through local rearrangement, leading to a dramatic change in calcium binding affinity .
While it is generally believed that insertions and deletions are well tolerated in loops [10, 11], insertions and deletions within secondary structures (α-helices and β-sheets) may have a dramatic effect on the overall structure and are considered deleterious and unfavorable during evolution [4, 12]. Tress et al. argued that AS isoform is probably an unlikely route to increase functional diversity due to probably large structural impact induced by indels . Yet in a number of studies with genetically engineered insertions and deletions on T4 lysozyme, Matthews' group showed that the protein has structural plasticity to tolerate indels within secondary structures [13, 14, 15]. Three recent large scale analyses also offered a similar view that protein structures have some degree of "plasticity" to tolerate insertions and deletions through maintaining the same structural fold [16, 17, 18].
The major goal of this paper is to investigate the impact of short internal indels (less than 40 amino acids) on protein structures, especially for indels within secondary structures. Large indels may fold as an individual domain or the protein pairs may adopt different folds due to the large differences between two sequences . Terminal indels are not considered in this study as terminal fragments are relatively flexible and terminal deletion/truncation have become a standard protocol in recombinant protein expression and protein crystallization [19, 20, 21]. In addition, the widely-used cloning vectors with His-tags introduce sequence artifacts that are included in the PDB SEQRES records and determining the exact tag sequences is problematic [22, 23]. Though there are several similar surveys about indel statistics since 1992 [10, 24, 25, 26, 27], our approach is different as our goal is to study the impact of indels on protein structure and to provide guidance for isoform structure prediction. Therefore it is critical that the locations of the indels are unique and unambiguous. Otherwise the structural changes would be less well defined. For example, it is not uncommon that two proteins with high global sequence identity have a low local sequence similarity . To address this issue, we take local sequence similarity into account and only consider protein pairs with both high global and local (sequences flanking the indels) sequence similarity (>75%) to ensure the uniqueness of indel sequences and positions. In addition, we include the "disordered conformation" in our structural analysis. It has been demonstrated that intrinsically disordered or unstructured regions are responsible for many important cellular functions and a link between alternative splicing and protein intrinsic disorder has been recently reported [17, 29, 30].
In this paper, we report a systematic analysis of a large non-redundant indel dataset with highly homologous protein pairs. Previously we found that the immunoglobulin (Ig) family, rich in certain amino acids including tyrosine, glycine, and serine in the third complementarity-determining region of the Ig heavy chain (CDR-H3), was overrepresented in an indel dataset [28, 31, 32]. Therefore those Ig-related indel sequences are not considered in our current analysis. Our results show that internal indels tend to have less regular secondary structures (α-helices and β-strands), but are rich in "disordered conformation", which is in line with the work by Romero et al . Our data also show that proteins with short indels, including the ones within regular secondary structures, generally preserve the structural fold with some local structure rearrangement and refolding presumably for structural stability and functionality. The source of the indel, either naturally occurring or experimentally engineered, are described and the statistical significance of the features from natural indels is discussed. A webserver SCINDEL http://bioinfozen.uncc.edu/scindel was developed for convenient visualization of indel induced structural changes.
Generation of a non-redundant dataset of short internal indels
The second step of the procedure is to ensure the uniqueness of indel sequences and locations by checking the local sequence similarity. Two proteins with high global sequence identity may have regions that show low local sequence similarity . If an indel happens to be in the low sequence similarity area, the placement of this indel may change dramatically with minor changes of alignment parameters, resulting in different indel sequences and locations. In our approach, similarities of the sequences flanking the indels (20 amino acids on each side) were calculated. We only consider indel sequences from protein pairs with both high global sequence similarity and highly similar flanking regions (above a cutoff value of 75%).
Lastly, indel sequences were further processed to generate a non-redundant dataset of indels ranging from 2 to 40 amino acids. Two indel sequences are considered redundant if two protein pairs are from the same family and have the same indel sequences with very similar secondary structures at approximately the same residue positions.
To check if an internal indel is a result from engineered mutants or from natural variants, we combined the information from the SEQADV record of PDB files with manual inspection of related publications. The PDB SEQADV record describes conflicts between residue sequences in the ATOM/HETATM records and those in sequence databases . Since there are several possible reasons for these conflicts, including engineered mutants, natural variants, disordered fragments, or cloning artifact, careful manual inspection is needed.
Secondary structure types and relative solvent accessibility of indel residues
Each indel residue was assigned to one of four secondary structure states, helix, strand, coil and disordered. DSSP program was used to assign the first three secondary structure states: helix, strand and coil . Following the widely used convention, H (α-helix), G (310-helix) and I (π-helix) from DSSP are classified as helix type while E (extended strand) and B (residue in isolated β-bridge) states are classified as strand type. All the other states from DSSP are considered as coil. The disordered residues are defined by comparing the "ATOM" and "SEQRES" records in PDB file. If a residue or a fragment appears in "SEQRES", but is missing from the "ATOM" record in a PDB file, this residue or fragment is considered as disordered or unstructured . The relative solvent accessibility was calculated by dividing the absolute value from DSSP by the maximum accessibility of each residue . We employ a three-state classification for relative solvent accessibility: buried (≤7%), intermediate (>7% and ≤37%), and exposed (>37%) . The disordered/unstructured residues were considered as exposed in solvent accessibility analysis.
For comparison purpose, a non-redundant data set with 4731 protein chains, in which no pair of protein chains has more than 25% sequence identity, each structure has a resolution of better than 2.5 Å, and the size is in the 50-1000 amino acids range, was used as background analysis for amino acid composition, secondary structure types, and residue solvent accessibility.
Protein sequence and structure comparisons
Sequence alignment is done with the Needle program with default parameters [35, 36]. Two different structure alignment programs, FAST  and CE , were used for global and local structure alignment respectively. The structural difference/similarity is measured by the Cα RMSD of aligned residues between two structures. The structural changes induced by indels were evaluated by comparing the structure and sequence alignments of each pair. A webserver SCINDEL http://bioinfozen.uncc.edu/scindel was developed for convenient visualization of both the sequence and structure alignments and structural changes caused by indels.
Results and discussion
A non-redundant dataset of short internal indels from highly homologous protein pairs
The protein chains were first clustered into 11,541 groups using BLASTClust as described in the Methods section. The first filtering step of removing redundant protein chains in each cluster resulted in 1,932 clusters that have two or more members. A total of 1,237,062 internal indels were identified from 445,552 distinct protein pairs. A dataset of 1,114 non-redundant indels were generated after removing indels that are false, redundant, or the results of low local sequence similarity. As described earlier , the dataset is rich in indels (931 of 1114) derived from one specific family member, immunoglobulin variable domain (b.1.1.1) based on the latest SCOP release 1.75 . These indels are generally located in the CDR-H3 loops that play crucial roles in antigen recognition and binding specificity [32, 45]. The CDR-H3 loops are dominated by residues tyrosine, glycine, and serine, but have fewer lysine, glutamine, and glutamic acid [28, 31, 32]. Due to the over-representation of indels from the immunoglobulin family, these indels were removed and the final non-redundant indel dataset includes 183 indels. The detailed information for each indel, including length, amino acid sequence, host proteins, start and end positions of the indel, and SCOP classification is available at http://bioinfozen.uncc.edu/scindel/nonredundant_indels.html.
Statistical analysis of the indel sequences and structures
We also checked the conformation of five residues flanking the indel sequences on each side and found that more residues are in β-sheet or α-helix states when they are further way from the indel sites (Additional file 1, Figure S2).
Source of the non-redundant indel sequences
A number of studies have been devoted to study the structural and functional consequences of insertions and deletions with artificially engineered constructs [12, 13, 14, 15]. For example, Matthews' group has used T4 lysozyme as a model system to investigate the effect of insertions on protein structures, including an insertion in an α-helix [13, 14]. The very question about the indel dataset in this study is how many of the indel sequences are from naturally occurring proteins rather than "man-made". Based on the SEQADV records of PDB files and manual inspection, we found that about 70% of the indel sequences in our dataset are the results of engineered insertion/deletion mutants. Besides the T4 lysozyme structural studies, the majority of the engineered insertion/deletion mutants were constructed to investigate the functional importance of a particular fragment. Deletion mutants have also served as one of the rational engineering approaches to better crystallization in protein structure determination . Indels from protein pair 2J4OA-2POPA (19 AAs) and 2RHSB-2RHQB (4 AAs) are such examples. The indel statistics from naturally occurring indels in terms of secondary structure types and solvent accessibilities are significantly different from those of the reference dataset and are highly similar to those derived from all indels (Figures 4B and 4C). Though there are variations in amino acid frequencies between all indels and naturally occurring ones, the general trend is surprisingly similar (Figure 4A). Due to the small size of the naturally occurring indel dataset (55 indels, 263 residues), the significances of differences between the observed numbers of amino acids, secondary structure types, and relative solvent accessibility and the expected numbers (using either "Background" or "All" indels as references) were calculated with χ2 test. Not surprisingly, there are significant differences between the observed numbers in the naturally occurring indel dataset and the expected numbers (based on background distributions) with p-values of 5.1e-4, 2.6e-68, and 9.32e-41 for amino acid, secondary structure types, and relative solvent accessibility, respectively (Additional file 1, Table S2). More importantly, there are no statistically significant differences between the naturally occurring indels and the all indel dataset in terms of amino acid frequencies (p = 0.46) and secondary structure types (p = 0.91) (Additional file 1, Table S2), suggesting that the features from our all indel dataset may represent the properties of real, non-engineered indel sequences. Statistical analysis also showed that the levels of solvent accessibility are different at a significance level of 0.01 between all indels and naturally occurring indels (p = 0.0094), largely due to the differences in the intermediate and exposed categories (Figure 4C). Unlike secondary structure types, the classification of relative solvent accessibility into three categories is rather broad and arbitrary; so the differences may become minor if different bins or classifications are used.
Impacts on structural changes by indels
We shall point out that even though all the short indels (engineered or natural) in our dataset do not show big impact on protein structures, it does not necessary mean that short internal indels have no deleterious effect. First of all, all the proteins with natural indels in our dataset are probably the ones that survived from evolution events while the others with dramatic structural effect might have disappeared during evolution. Secondly, even in the cases where indels do not induce structural change, a disastrous loss of function may occur. Nevertheless, these data (natural and engineered indels) strongly suggest the inherent structural plasticity of protein structures [16, 17, 18].
We performed a systematic study to investigate the impact of short internal indels on protein structures through mining the highly homologous protein pairs in PDB. In addition to protein evolution, indels can be the results of alternative splicing. We found that short internal indels tend to occur between secondary structure elements and a significant number of indels are disordered, which is in agreement with the earlier studies that demonstrated the associations among indels, structural disorder, and functional diversity . The rationale of choosing highly homologous protein pairs (with high sequence identity for both overall and indel flanking sequences) is two-fold: 1) to avoid "random" positioning of indel(s) in a protein pair due to low local sequence similarity even though overall sequence similarity is high; and 2) to provide a better approximation to the AS isoforms with internal gaps (100% identical in indel flanking regions). These steps ensure unambiguous indel sequences and their unique positions, reducing the possibility of including false indels due to sequence alignment errors.
One important observation from this study is that most of the indels in the dataset are not derived from evolution events. Indels have been engineered into proteins for various purposes, including structural and functional studies of short peptides and better protein crystallization. Our statistical analysis showed that there are significant differences between naturally occurring indels and the control dataset. On the other hand, there are no statistically significant differences between naturally occurring indels and all indels in terms of amino acid frequencies and secondary structure types. These data suggest that the indel properties derived from our all indel dataset are very useful.
The very question about modeling isoform structures or structural changes due to indels is how to improve the sequence alignment for comparative modeling since the performance of current comparative modeling techniques rely heavily on accurate alignments . Very rarely, sequence alignment errors can be recovered by current comparative modeling programs. We believe this systematic analysis, along with earlier reports on the case studies with individual or a small number of indel pairs, will help us in this regard as well as in our understanding of the structural plasticity of proteins.
This research was supported by the NSF CAREER grant (DBI#0844749) and UNC Charlotte startup fund to JTG. The authors would like to thank Mr. Jon McCafferty for his help with programming.
- 4.Tress ML, Martelli PL, Frankish A, Reeves GA, Wesselink JJ, Yeats C, Olason PI, Albrecht M, Hegyi H, Giorgetti A, Raimondo D, Lagarde J, Laskowski RA, Lopez G, Sadowski MI, Watson JD, Fariselli P, Rossi I, Nagy A, Kai W, Storling Z, Orsini M, Assenov Y, Blankenburg H, Huthmacher C, Ramirez F, Schlicker A, Denoeud F, Jones P, Kerrien S, et al.: The implications of alternative splicing in the ENCODE protein complement. Proceedings of the National Academy of Sciences of the United States of America 2007, 104(13):5495–5500. 10.1073/pnas.0700800104PubMedCentralCrossRefPubMedGoogle Scholar
- 5.UniProt-Consortium: The Universal Protein Resource (UniProt) 2009. Nucleic acids research 2009, (37 Database):D169–174.Google Scholar
- 8.Keedy DA, Williams CJ, Headd JJ, Arendall WB, Chen VB, Kapral GJ, Gillespie RA, Block JN, Zemla A, Richardson DC, Richardson JS: The other 90% of the protein: assessment beyond the Calphas for CASP8 template-based and high-accuracy models. Proteins 2009, 77(Suppl 9):29–49. 10.1002/prot.22551PubMedCentralCrossRefPubMedGoogle Scholar
- 16.Wang P, Yan B, Guo JT, Hicks C, Xu Y: Structural genomics analysis of alternative splicing and application to isoform structure modeling. Proceedings of the National Academy of Sciences of the United States of America 2005, 102(52):18920–18925. 10.1073/pnas.0506770102PubMedCentralCrossRefPubMedGoogle Scholar
- 17.Romero PR, Zaidi S, Fang YY, Uversky VN, Radivojac P, Oldfield CJ, Cortese MS, Sickmeier M, LeGall T, Obradovic Z, Dunker AK: Alternative splicing in concert with protein intrinsic disorder enables increased functional diversity in multicellular organisms. Proceedings of the National Academy of Sciences of the United States of America 2006, 103(22):8390–8395. 10.1073/pnas.0507916103PubMedCentralCrossRefPubMedGoogle Scholar
- 28.Kim RM, J Guo J-T: A systematic study of homologous protein structures with insertions/deletions. In Computational Systems Bioinformatics: 2009. Volume 000. Stanford, CA, USA: Life Sciences Society; 2009:103–113.Google Scholar
- 29.Dunker AK, Lawson JD, Brown CJ, Williams RM, Romero P, Oh JS, Oldfield CJ, Campen AM, Ratliff CM, Hipps KW, Ausio J, Nissen MS, Reeves R, Kang C, Kissinger CR, Bailey RW, Griswold MD, Chiu W, Garner EC, Obradovic Z: Intrinsically disordered protein. J Mol Graph Model 2001, 19(1):26–59. 10.1016/S1093-3263(00)00138-8CrossRefPubMedGoogle Scholar
- 32.Zemlin M, Klinger M, Link J, Zemlin C, Bauer K, Engler JA, Schroeder HW Jr, Kirkham PM: Expressed murine and human CDR-H3 intervals of equal length exhibit distinct repertoires that differ in their amino acid composition and predicted range of structures. Journal of molecular biology 2003, 334(4):733–749. 10.1016/j.jmb.2003.10.007CrossRefPubMedGoogle Scholar
- 44.Andreeva A, Howorth D, Brenner SE, Hubbard TJ, Chothia C, Murzin AG: SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic acids research 2004, (32 Database):D226–229. 10.1093/nar/gkh039Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.