Interrogation of alternative splicing events in duplicated genes during evolution
Gene duplication provides resources for developing novel genes and new functions while retaining the original functions. In addition, alternative splicing can increase the complexity of expression at the transcriptome and proteome levels without increasing gene copy number in the genome. Duplication and alternative splicing are thought to work together to provide diverse functions or expression patterns in eukaryotes. Previously, it was believed that duplication and alternative splicing were negatively correlated and probably interchangeable.
We examined the relationship between the occurrence of alternative splicing and duplication at different times after duplication events. We found that duplication and alternative splicing were indeed inversely correlated if only recently duplicated genes were considered, but they became positively correlated when ancient duplications were taken into account. Specifically, for slightly or moderately duplicated genes, in gene families containing 2 - 7 paralogs, genes were more likely to evolve alternative splicing and had on average a greater number of alternative splicing isoforms after long-term evolution compared to singleton genes. On the other hand, large gene families (containing at least 8 paralogs) had a lower proportion of alternative splicing and fewer alternative splicing isoforms on average, even when ancient duplicated genes were taken into consideration. We also found that duplicated genes with alternative splicing were under tighter evolutionary constraints than those without alternative splicing, and were enriched for genes that participate in molecular transducer activities.
We studied the association between the occurrence of alternative splicing and gene duplication. Our results imply that there are key differences in function and evolutionary constraint among singleton genes and duplicated genes with or without alternative splicing. This suggests that gene duplication and alternative splicing may have different functional significance in the evolution of speciation diversity.
Keywords: Alternative Splice, Gene Duplication, Duplicate Gene, Large Gene Family, Identity Criterion
Gene duplication is one way to increase genome complexity, and may provide a source of genetic novelty. After duplication, genes are temporarily released from evolutionary constraint, and may be subject to pseudogenization, subfunctionalization or neofunctionalization. Previous results have demonstrated that duplicates have a higher evolutionary rate after duplication, due to relaxation of purifying selection [2, 3, 4, 5]. This acceleration of evolutionary rate may create divergent function(s) for one or both of the duplicates. Gene duplication provides raw material for generating new protein functions; however, not all duplicates are preserved and eventually fixed in the population. It has been shown in yeast genomes that complex genes with longer protein sequences, more protein domains and more cis-regulatory regions are more likely to remain as duplicated genes after a whole genome duplication event [6, 7].
In addition to duplication events, some genes in eukaryotes can undergo alternative splicing to increase their expressional flexibility, and increase complexity at the transcriptome or proteome level [8, 9]. The relationship between alternative splicing (AS) and gene duplication is an interesting question in the evaluation of gene function and evolution. Previous studies have shown a negative correlation between gene duplication events and AS, especially in newly duplicated genes, in human [6, 10] and C. elegans [11, 12]. These results implied that duplication and AS may share a similar function and can therefore be interchangeable in evolution [10, 13]. However, a recent study also reported that duplicated genes actually had a higher proportion of AS and a larger number of AS isoforms per duplicated locus in human. Whether duplicated genes tend to gain more AS than single copy genes (singletons), and how gene duplication and AS are related in evolution, remain unsolved questions.
We make two advances to help answer this question. First, it has been found that fewer newly duplicated genes possess AS, and it has also been suggested that the acquisition of AS after gene duplication is one factor in avoiding pseudogenization or gene loss in plants and Drosophila [15, 16]. The age of duplication may therefore play an important role in acquiring AS. We took this into consideration in this study by defining gene duplicates with different identity criteria. Second, aided by improved sequencing technology and increasing EST and cDNA experimental data, many more alternative splice isoforms have been identified [17, 18]. Most human genes are expected to have alternative splicing isoforms. The relationship between alternative splicing and gene duplication is therefore worth re-examining with the newly updated data.
In addition to the relationship between duplication and AS, since these are two different means of achieving diversity, we think that genes which employ both, either, or neither of these two strategies might be distinct from each other in terms of evolutionary rate and in other characteristics such as function. Therefore, we studied the length of the protein product, number of domains, evolutionary rate, and gene functions for these four groups of genes (duplicates with/without AS and singletons with/without AS).
Duplicates acquired more AS than singletons
The proportion of these four groups varied at different identity criteria. Under a high identity criterion, many genes are classified as singletons, and the AS proportion for singletons is much higher than that observed for duplicates. The proportion of AS in duplicates is low under a high identity criterion and increases as the identity criterion is lowered. Taking identity criteria of >90 and >20 in human as an example, 49% and 67% of duplicated genes have AS, respectively. In contrast, 68% and 65% of singleton genes have AS for identity >90 and >20, respectively. Under an identity criterion >90, there is a significant enrichment of AS in singleton genes (Pearson's Chi-squared test, p-value < 2.2e-16). Under an identity criterion >20, the trend is reversed, and there is a significant enrichment of AS in the duplicated genes (Pearson's Chi-squared test, p-value = 0.01). The gene families identified at a lower identity criterion include those identified at a higher identity criterion, and the AS proportion still increases as the criterion is lowered. We can therefore infer that anciently duplicated genes overall have a much higher proportion of AS than recently duplicated genes and singletons. Similar reasoning applies to the subsequent analyses.
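The comparison of AS proportions between duplicates and singletons amounts to a test on a 2x2 contingency table. The sketch below is a minimal illustration of a Pearson chi-squared test without continuity correction; whether the original analysis applied a correction is not stated, so that choice is an assumption. The gene counts are hypothetical and were chosen only to mirror the 49% and 68% proportions quoted above; the function name `chi2_2x2` is also illustrative.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for the table
    [[a, b], [c, d]]. Returns (statistic, p_value) with 1 degree of freedom."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a nonzero total")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-squared distribution with 1 d.o.f.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts (NOT from the paper): 1000 duplicates with 49% AS,
# 1000 singletons with 68% AS.
dup_as, dup_no_as = 490, 510
sing_as, sing_no_as = 680, 320
stat, p_value = chi2_2x2(dup_as, dup_no_as, sing_as, sing_no_as)
print(f"chi2 = {stat:.2f}, p = {p_value:.3g}")
```

With real data, the four cells would be the numbers of duplicated and singleton genes with and without AS at a given identity criterion.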
Smaller gene families have higher AS proportion and more AS isoforms contrary to large gene families
Fig. 2 shows a trend of increasing AS proportion in duplicated genes as the identity criterion is lowered. The AS proportions of duplicated genes become larger than that of singletons at the lower identity criteria. If we take singletons (gene family size = 1) into consideration, the AS proportion of singletons is larger than that of gene families of all sizes at higher identity criteria (>70 or >90), but becomes smaller than that of small and moderate gene families (gene family sizes of 2 - 4 or even 5 - 7) at lower identity criteria (>50, >30 or >10). Even though the AS proportions for large gene families (size greater than or equal to 8) remain lower than that of singletons across all identity criteria, the AS proportions for the other gene families (sizes from 2 to 7) are significantly higher than that of singletons at lower identity criteria, especially for identity >10. This result shows that the higher AS proportion we found for duplicated genes was mainly contributed by the slightly or moderately duplicated gene families.
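The grouping by gene family size used in this comparison can be sketched as follows. This is a minimal illustration, assuming each gene is represented by a hypothetical `(family_size, has_as)` tuple; the bin boundaries (1, 2 - 4, 5 - 7, and 8 or more) follow the text, while the function names and toy data are not from the paper.

```python
from collections import defaultdict

def size_bin(family_size):
    """Map a gene family size to the size classes used in the text."""
    if family_size < 1:
        raise ValueError("family size must be at least 1")
    if family_size == 1:
        return "1"      # singleton
    if family_size <= 4:
        return "2-4"    # small family
    if family_size <= 7:
        return "5-7"    # moderate family
    return ">=8"        # large family

def as_proportion_by_bin(genes):
    """genes: iterable of (family_size, has_as) tuples, one per gene.
    Returns {bin label: fraction of genes in that bin that have AS}."""
    counts = defaultdict(lambda: [0, 0])  # bin -> [genes with AS, all genes]
    for family_size, has_as in genes:
        entry = counts[size_bin(family_size)]
        entry[0] += int(has_as)
        entry[1] += 1
    return {label: with_as / total for label, (with_as, total) in counts.items()}

# Toy data for illustration only.
toy_genes = [(1, True), (1, False), (3, True), (3, True), (3, False), (9, False)]
proportions = as_proportion_by_bin(toy_genes)
```

Repeating this for each identity criterion would give one AS proportion per size class and criterion, which is the quantity compared across groups above.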
Features comparison among AS/no AS duplicates/singletons
The positive correlation between duplication and AS found at lower identity criteria indicates that some genes tend to both duplicate and undergo AS, while other genes do neither over the long term. This suggests that there may be two extreme types of genes. One type is more likely to become diverse, whether through gene duplication or AS. The other type does not need to become diverse, tends to remain as a single copy gene in the genome, and may also express only one transcript. Between these two extremes, there are also genes that gain either AS or duplication. We are interested in the differences in features among these four groups of genes, which have not been examined in previous studies. To investigate these questions, we compared the following four groups: gene families in which all of the members have AS (A_F), gene families in which none of the members has AS (N_F), AS singletons (A_S) and no AS singletons (N_S). We investigated the differences in protein length, domain number, evolutionary rate, and GO distribution among these four groups and found significant and evident differences.
Length of protein product
Distribution of number of domains
Previous studies have suggested that duplication and AS may be interchangeable and are therefore inversely correlated [10, 13]. On the other hand, a recent study suggested a positive correlation. The apparent discrepancy may be due to the availability of AS data at the times of these studies, or to how the duplicates were defined. Previous studies identified duplicated genes by either InParanoid or BLASTP, which are both based on sequence similarity [10, 13, 14]. In this study, we used duplicates identified by EnsemblCompara GeneTrees, which also takes the phylogenetic tree structure into consideration and may provide higher coverage and a better ability to handle large gene families. The discrepancy is most probably due to the similarity cutoffs used to determine duplicates, which reflect the age of the duplications. Therefore, in this paper we investigated the relationship between duplication and AS across different duplication times (e.g. duplicates were identified according to identity >10, >50 or >90). We found that in human and mouse the duplicated genes overall had a higher proportion of alternative splicing after long-term evolution, even though recently duplicated genes tended not to have AS. This finding is consistent with a recent publication which concluded that the gain of AS in duplicated genes occurs in an age-dependent manner.
The evolutionary rate of genes after gene duplication was thought to be increased due to the relaxation of selection. Previously, Jin et al. proposed an “Accelerated AS model”, which suggested that the relaxation of constraint on duplicates after gene duplication may accelerate the rate of AS acquisition. Our results neither directly support nor reject this hypothesis; we could not conclude whether the acceleration exists, since the time needed for the acquisition of alternative splicing is unclear. However, we made an interesting observation: although duplicated genes did have a higher evolutionary rate (Ka/Ks), the proportion of AS among recently duplicated genes (identity >90) was lower than that of singletons. It was only after the inclusion of ancient duplicates that the proportion of AS became higher than that of the singletons.
[Table: Proportion of alternative splicing (AS) for singletons and duplicated genes of four mammalian genomes. Columns: AS proportion for singletons; AS proportion for duplicates.]
On the other hand, we did not observe this kind of enrichment of AS for duplicated genes in zebrafish, fly, and worm. This may result from different evolutionary pressures between mammals and other eukaryotes, so that the preferences for alternative splicing and gene duplication in the genome differ. It is also possible that this trend does exist in other eukaryotes but is not observed only because of the lack of alternative splicing information for zebrafish, fly, and worm. We did find a tendency toward an increased proportion of alternative splicing in duplicated genes in both zebrafish and fly, even though it was not statistically significant. A recent study in Drosophila also showed that newly duplicated genes usually did not have alternative splicing isoforms and were expressed at a lower level, while those genes may have gained their diverged functions or expression patterns after they developed alternative splicing activity. As NGS techniques improve and data accumulate, further investigation can answer this question in the future.
Genes categorized as having no AS may really have none, but it is also possible that their AS isoforms are not detected due to their relatively low level of expression. To examine this possibility in our dataset, we tested whether those genes are expressed at a relatively lower level by exploring current human EST datasets. The EST hit distribution is shown in Additional file 3. We found that there was indeed a high correlation between EST counts and AS detection. The absence of alternative splicing annotation for a gene may be an artifact of its low expression level, and its AS isoforms may simply not have been found yet. Therefore, we could not rule out the possibility that a portion of the no AS gene families may actually have AS. However, given the apparent lack of AS, we assigned these genes to the no AS categories.
As previous studies on gene retention after whole genome duplication have suggested, subsets of genes were more likely to remain as duplicates and may be more likely to evolve diverse functions [28, 29]. Among the genes apt to become diverse, some develop either duplication or AS, and some gain both. From our results, we suggest that longer genes and genes which contain more domains are more likely to increase their diversity at the genomic level via gene duplication, and to increase their complexity at the transcriptome level by acquiring alternative splicing. Alternatively, genes which undergo duplication and alternative splicing may become longer and may increase their number of domains. In addition, we found different evolutionary rates and different preferences for molecular function or biological process between no AS gene families and AS gene families. Together, these findings imply that duplication and AS are highly dependent on each other and that the selection between them is not random. Further investigation of the factors that affect gene duplication and AS may reveal the evolutionary relationship between these two mechanisms. In conclusion, our analysis revealed a sophisticated relationship between duplication and AS that depends on the age of the duplicates. Compared to singletons, a higher proportion of ancient duplicates have AS isoforms, while fewer new duplicates have alternative splicing isoforms. In addition, among duplicated genes, those with longer sequences and more domains are more likely to develop AS. As for the evolutionary rate, our results suggest that over long-term evolution duplicated genes may be under higher evolutionary constraints than singleton genes, and that among duplicated genes, those having AS are under higher constraints than those having no AS.
The protein sequences and paralog information of seven species, human (Homo sapiens), mouse (Mus musculus), rat (Rattus norvegicus), chimp (Pan troglodytes), zebrafish (Danio rerio), worm (Caenorhabditis elegans) and fly (Drosophila melanogaster), were downloaded from Ensembl release 60. For most analyses, only the peptides labeled as “known” were retained. The Ka, Ks, cDNA sequences and protein GO categories of human and mouse were also downloaded from Ensembl.
Duplication and AS identification
The paralog pairs were generated by grouping paralog information derived from EnsemblCompara GeneTrees. Different sets of paralog pairs were identified at different identity criteria. Two genes qualified as a paralog pair if at least one gene aligned to the other with a protein sequence identity larger than the chosen identity criterion. Genes were classified as duplicated genes if they had any paralog(s). Gene pairs were clustered together into a gene family if they had at least one gene in common. Gene families were further classified into AS gene families, no AS gene families, and others. AS gene families were defined as gene families within which all paralogs have AS, and no AS gene families were defined as gene families within which none of the paralogs has AS. Genes were classified as having alternative splicing isoforms if they have more than one protein product recorded as a known peptide in Ensembl.
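The pair filtering, family clustering and family classification described above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: it assumes hypothetical input records of the form `(gene_a, gene_b, identity_a_to_b, identity_b_to_a)` and a hypothetical `isoform_counts` mapping from gene to number of known protein products. Clustering pairs that share a gene is implemented here with a union-find structure, which is one common way to compute such connected components.

```python
def filter_pairs(records, min_identity):
    """Keep a pair if at least one of its two identity values exceeds the criterion."""
    return [(a, b) for a, b, id_ab, id_ba in records
            if max(id_ab, id_ba) > min_identity]

def build_families(paralog_pairs):
    """Cluster pairs that share at least one gene into families (union-find)."""
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for a, b in paralog_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    families = {}
    for gene in list(parent):
        families.setdefault(find(gene), set()).add(gene)
    return list(families.values())

def classify_family(family, isoform_counts):
    """'AS' if every member has >1 protein product, 'no AS' if none does, else 'other'."""
    flags = [isoform_counts[gene] > 1 for gene in family]
    if all(flags):
        return "AS"
    if not any(flags):
        return "no AS"
    return "other"

# Toy records for illustration only (identities in percent).
records = [("g1", "g2", 95, 40), ("g2", "g3", 30, 25), ("g4", "g5", 60, 55)]
isoform_counts = {"g1": 2, "g2": 3, "g3": 1, "g4": 1, "g5": 1}
families_50 = build_families(filter_pairs(records, 50))
families_20 = build_families(filter_pairs(records, 20))
```

In the toy data, lowering the criterion from 50 to 20 pulls `g3` into the family of `g1` and `g2`, which illustrates why families found at a high identity criterion are contained in those found at a lower one.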
Features of genes analysis
The gene information downloaded from Ensembl was used to analyze protein function, domain number, gene length, and Ka/Ks in human (and, for some analyses, mouse). Domains in protein sequences were identified by RPS-BLAST using Pfam release 24.0 as the database with default parameters, except that the expect value was required to be smaller than 0.01.
Human EST data were downloaded from the NCBI ftp server: http://www.ncbi.nlm.nih.gov/genbank/. Not all EST data were suitable for expression analysis, since some libraries may have been subject to expression count distortions such as amplification or normalization. Therefore, EST libraries were manually reviewed based on extended annotations of clone libraries obtained from the Cancer Genome Anatomy Project (CGAP) http://cgap.nci.nih.gov/. ESTs from unsuitable libraries were eliminated. Representative transcript isoforms were also selected, and repeat-masking was applied with RepeatMasker. The remaining ESTs were aligned against the set of transcript isoforms with BLAT. Based on the BLAT result, each EST was assigned to its best-matching isoform, provided that the EST sequence was at least 100 bp long and the aligned region had 95% identity or higher. If more than one best match existed, an arbitrary assignment was made. An expression profile of the representative isoforms was generated based on the EST assignments.
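The EST assignment step can be sketched as follows. This is a minimal illustration that assumes the BLAT output has already been parsed into hypothetical tuples `(est_id, isoform_id, est_length, identity_pct, score)`; the `score` field used to rank matches is an assumption, since the text does not say how the best match was ranked. Ties are resolved by keeping the first hit encountered, which is one possible form of arbitrary assignment.

```python
def assign_ests(hits, min_length=100, min_identity=95.0):
    """Assign each EST to its best-scoring isoform and count ESTs per isoform.

    hits: iterable of (est_id, isoform_id, est_length, identity_pct, score).
    ESTs shorter than min_length (bp) or hits below min_identity (%) are skipped.
    """
    best = {}  # est_id -> (score, isoform_id)
    for est_id, isoform_id, est_length, identity, score in hits:
        if est_length < min_length or identity < min_identity:
            continue
        # Strict '>' keeps the first hit seen when scores tie.
        if est_id not in best or score > best[est_id][0]:
            best[est_id] = (score, isoform_id)

    counts = {}
    for _score, isoform_id in best.values():
        counts[isoform_id] = counts.get(isoform_id, 0) + 1
    return counts

# Toy hits for illustration only.
toy_hits = [
    ("e1", "iso1", 150, 98.0, 140),
    ("e1", "iso2", 150, 96.0, 120),   # lower score than the iso1 hit
    ("e2", "iso2", 80, 99.0, 79),     # EST shorter than 100 bp
    ("e3", "iso2", 200, 94.0, 180),   # identity below 95%
    ("e4", "iso2", 120, 95.0, 110),   # exactly at both thresholds, kept
]
est_counts = assign_ests(toy_hits)
```

The resulting per-isoform counts play the role of the expression profile described above.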
This work was supported in part by grants from Academia Sinica and the National Science Council.
This article has been published as part of BMC Genomics Volume 12 Supplement 3, 2011: Tenth International Conference on Bioinformatics – First ISCB Asia Joint Conference 2011 (InCoB/ISCB-Asia 2011): Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/12?issue=S3.
- 30. Flicek P, Amode MR, Barrell D, Beal K, Brent S, Chen Y, Clapham P, Coates G, Fairley S, Fitzgerald S, et al: Ensembl 2011. Nucleic Acids Res. 2011
- 32. Smit A, Hubley R, Green P: RepeatMasker Open-3.0. 2010
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.