Background

Protein diversity in eukaryotic genomes is mainly attributed to alternative splicing (AS), a fundamental mechanism by which a single pre-mRNA can produce more than one transcript. It is also widely regarded as an important mechanism for controlling gene expression [1]. The introns in the pre-mRNA are spliced out and the exons are joined in different combinations, leading to changes in transcript structure. These changes can alter the encoded protein, disrupting its structure and function, and such disruptions are frequently associated with disease [2]. Previous studies indicate that more than 60% of human genes are alternatively spliced [3-9].

The association of AS with many diseases, including cardiovascular disease, cancer and neurodegenerative disorders, makes an in-depth study of AS crucial [10]. Analyses have also shown that 15% of the point mutations that cause genetic disease affect pre-mRNA splicing [10], linking AS events to inherited genetic diseases.

Large-scale sequencing of eukaryotic genomes, together with the recognition of AS as an important player in gene regulation, has prompted several efforts [3-9] to create bioinformatics resources on alternative transcripts and protein isoforms [11]. Previous analyses comparing the rate of alternative splicing between organisms, including genome-wide computational analyses based on AS databases, have produced conflicting results. In a large-scale expressed sequence tag (EST) analysis across distinct eukaryotes, Brett and coworkers [12] found that all vertebrates and invertebrates showed a similar rate of alternative splicing, with respect to both the number of genes affected and the number of variants per gene. In contrast, Lee and co-workers [5] reported considerable variation in the rates of alternative splicing across organisms. Understanding AS remains difficult because the existing databases do not provide sufficient information for multi-gene comparison across species. ASAP II [5] concentrates mainly on comparative and evolutionary studies. ECGene [9] provides functional annotation for AS genes in various genomes. The Alternative Splicing Transcript Database (ASTD) [3, 4] provides an exhaustive analysis of AS events in three species, namely human, mouse and rat.

As the number of transcripts per gene increases, representing the transcripts and their relation to each other has become extremely complicated. This has led to the application of graph theory to the representation of gene transcripts. Among the many approaches used to express transcripts and capture their relationships, graph theory is prominent, and its language offers a mathematical abstraction for the description of biological relationships [13]. Modrek and Lee used directed acyclic graphs for EST analysis, with the genomic DNA sequence as reference [14]. Pevzner and coworkers [15] were the first to use de Bruijn graphs to depict the transcripts alone, without referring to the genomic DNA sequence; the maximum common sub-sequences between transcripts were condensed into nodes and the variable regions were connected by edges. The Alternative Splicing Gallery (ASG) resource uses such an approach [7].

Our group has used directed acyclic splicing graphs, without a genomic DNA sequence as reference, with exons as nodes interconnected by introns as edges, where the paths through the splicing graph represent the transcripts. This scheme was applied to the genome-wide analysis of Drosophila melanogaster [6], leading to the DEDB data resource. There, the first transcript served as a reference sequence to generate splicing graphs, with automatic rule-based classification of splicing events. The methodology was further enhanced to reduce the uncertainty in selecting the primary transcript: the most conserved exons in all transcripts of a given gene were chosen as distinct reference exons and all others were considered variant exons. We then developed the Alternative Splicing Graph Server (ASGS) [8] to generate a splicing graph from a set of transcripts for a given gene.

As part of the bovine genome annotation team, we have used comparative genomics to relate alternative splicing patterns in human and mouse to those in cow [16]. Comparative genomics studies the correlation between genome structures and functions across different biological species, and aims at understanding many aspects of the evolution of modern species.

Human and bovine are separated by an intermediate evolutionary distance of 70-100 Myr [17]. The bovine model has been found to be relevant to human health research priorities such as obesity, female health and communicable diseases. The cow provides a valuable biological model in these areas because of the vast amount of research that has been conducted on the genetic and environmental interactions associated with complex, multi-genic physiological traits [18]. The Cetartiodactyl order of mammals, to which cattle and all other ruminants belong, is phylogenetically distant from the primates, and thus contains invaluable information for understanding human genome evolution [19].

In this study, we have analysed the transcripts for each gene in the bovine genome. Since the bovine genome is not yet completely annotated, we minimized any gene structure bias in the input data by carrying out comparative genome analysis on the orthologous subset of AS genes for the three species. We present here a comprehensive analysis of all bovine, human and mouse transcripts based on splicing graphs. AS events in these three genomes and their functional significance in terms of gene ontology (GO) [20] classifications were also identified. The two main AS events (cassette exons and intron retention) in the 94 human disease genes from the NCBI Genes and Disease database [21] were mapped onto their respective bovine orthologous genes. A protein domain analysis of 12 human disease genes that are known to occur in cow provided significant insights into the effects of AS on protein structure and function.

Materials and methods

Data

For AS analysis, the GTF files for Bos taurus, Homo sapiens and Mus musculus were extracted from Ensembl ver. 54 [22]. Each line in a Gene Transfer Format (GTF) [23] file describes one feature of a transcript: an exon, a coding sequence segment, a start codon or a stop codon. For our analysis, we extracted only the protein coding genes and eliminated the pseudogenes and mitochondrial genes. The unspliced transcript sequences for cow were also obtained from Ensembl to analyse the splice site motifs.
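As an illustration of the filtering step described above, the following Python sketch collects exon coordinates per gene and transcript from GTF lines while skipping non-protein-coding and mitochondrial entries. It is not the pipeline used in this study: the function name `parse_gtf_exons`, the mitochondrial sequence name `MT`, and the choice of reading the biotype from a `gene_biotype` attribute (falling back to the second GTF column) are assumptions made for the example.

```python
import re
from collections import defaultdict

# Matches attribute pairs such as: gene_id "ENSBTAG00000000001";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_gtf_exons(lines, mito_names=("MT",)):
    """Return {gene_id: {transcript_id: [(start, end, strand), ...]}}.

    Assumptions (not taken from the paper): the biotype is read from a
    gene_biotype attribute when present, otherwise from column 2, and
    mitochondrial sequences are named as listed in mito_names.
    """
    genes = defaultdict(lambda: defaultdict(list))
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a well-formed GTF record
        seqname, source, feature, start, end, _, strand, _, attrs = fields
        if feature != "exon" or seqname in mito_names:
            continue  # keep exon records from nuclear sequences only
        attr = dict(ATTR_RE.findall(attrs))
        if attr.get("gene_biotype", source) != "protein_coding":
            continue  # drop pseudogenes and other non-coding biotypes
        exon = (int(start), int(end), strand)
        genes[attr["gene_id"]][attr["transcript_id"]].append(exon)
    return genes


def multi_transcript_genes(genes):
    """Keep only genes with more than one transcript (candidate AS genes)."""
    return {g: t for g, t in genes.items() if len(t) > 1}
```

With a file on disk, the parser could be called as `parse_gtf_exons(open(path))`, where `path` is a placeholder for the downloaded GTF file.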

Splicing graphs

The procedure used in ASGS [8] was adopted for compiling the graphs. The transcript information, including the start and stop of each exon, is compiled from the GTF file for each of the three genomes to generate the splicing graph. All transcripts are converted to the leading strand for consistency. Exons are divided into two main groups: distinct and variant. The exon that occurs in the majority of transcripts is retained as the distinct exon, with the rest classified as variant. When exons overlap, the exon with well-determined borders that occurs in most of the transcripts is considered distinct. If an exon is completely contained in another, larger exon, the two are not merged but retained as individual exons, considered variant, and entered into a list that maintains the mapping of variant exons to distinct exons [24]. Splicing graphs are then generated from these distinct and variant exons. The first line of the resultant splicing graph is composed entirely of distinct exons, and subsequent lines show the locations of variant exons. The exons are connected by edges representing the introns in the set of transcripts provided. Splicing graphs were compiled for every alternatively spliced gene in the three genomes and then analysed further to identify the splicing events and patterns for orthologous genes.
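The distinct/variant labelling described above can be sketched in Python as follows. This is a simplified illustration rather than the ASGS implementation: exons are grouped by coordinate overlap, the most frequent exon in each group is a candidate distinct exon, and it is labelled distinct only if it occurs in a majority of transcripts. The function name, the `majority` cut-off and the tie-breaking rule are assumptions made for the example.

```python
from collections import Counter


def classify_exons(transcripts, majority=0.5):
    """Label each exon 'D' (distinct) or 'V' (variant).

    transcripts: dict mapping transcript id -> list of (start, end) exons,
    all on the same strand, with inclusive coordinates.
    majority: hypothetical cut-off; an exon must occur in more than this
    fraction of transcripts to be labelled distinct.
    """
    # Number of transcripts in which each exact exon occurs.
    counts = Counter(e for exons in transcripts.values() for e in set(exons))
    n_transcripts = len(transcripts)

    # Group overlapping exons into clusters by sweeping along the gene.
    clusters, cluster_end = [], None
    for exon in sorted(counts):
        if clusters and exon[0] <= cluster_end:
            clusters[-1].append(exon)
            cluster_end = max(cluster_end, exon[1])
        else:
            clusters.append([exon])
            cluster_end = exon[1]

    labels = {}
    for cluster in clusters:
        # Most frequent exon in the cluster; ties go to the longer exon
        # (an arbitrary choice made for this sketch).
        best = max(cluster, key=lambda e: (counts[e], e[1] - e[0]))
        for exon in cluster:
            is_distinct = (exon == best
                           and counts[exon] / n_transcripts > majority)
            labels[exon] = "D" if is_distinct else "V"
    return labels
```

Under this sketch, an exon contained in a larger, more frequent exon falls into the same cluster and is labelled variant, in line with the rule stated above.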

Detection and classification of alternative splicing events and patterns

We have analysed nine alternative splicing events, namely cassette exons, intron retention, alternative donor sites, alternative acceptor sites, alternative transcriptional start and termination sites, alternative initiation and termination exons, and mutually exclusive exons. Figure 1 defines the rules used to locate each of the nine events; these rules were applied to the generated splicing graphs. This classification schema has been previously described in DEDB [6] and incorporated into ASGS [8] for the identification of splicing events. The 5' and 3' ends of transcripts are usually difficult to determine experimentally because of sequencing errors, which could cause anomalies in the analysis of alternative transcriptional start and termination sites [6]. The internal AS events, however, are not affected by these sequencing errors. Two types of analysis were carried out: gene level and event level. The gene level analysis calculates the percentage of all AS genes and of orthologous AS genes showing each of the events in the three genomes. The event level analysis gives the percentage of the total events in each genome, for the orthologous genes, that is contributed by each event type.
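The difference between the two analyses can be made concrete with a short Python sketch. It assumes that event detection has already produced, for each AS gene, a list of event labels; the function names and the label strings are hypothetical and are not taken from ASGS.

```python
from collections import Counter


def gene_level_percent(events_by_gene):
    """Percentage of AS genes that show each event at least once."""
    n_genes = len(events_by_gene)
    genes_with_event = Counter()
    for events in events_by_gene.values():
        genes_with_event.update(set(events))  # count each gene once per event
    return {ev: 100.0 * n / n_genes for ev, n in genes_with_event.items()}


def event_level_percent(events_by_gene):
    """Percentage of all observed AS events contributed by each event type."""
    totals = Counter()
    for events in events_by_gene.values():
        totals.update(events)  # count every occurrence
    n_events = sum(totals.values())
    return {ev: 100.0 * n / n_events for ev, n in totals.items()}
```

A gene with three cassette exons counts once in the gene level analysis but three times in the event level analysis, which is why the two sets of percentages can differ markedly.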

Figure 1

Generation of alternative splicing (AS) events using splicing patterns. Rules were derived to detect nine alternative splicing events. Distinct exons are shown in black, while variant exons are shown in blue.

Splicing graphs have been made more informative, to help identify distinct and associated variant exons, by representing distinct (D) exons in black and variant (V) exons in blue. AS events can therefore be depicted using a minimum of four sub-graph components called splicing patterns. Figure 2 depicts the four unique sub-graphs: Class I (D-D), Class II (D-V), Class III (V-D) and Class IV (V-V). A detailed analysis of the relationship of each exon to its successor, designated as a splicing pattern, provides the fundamental definition of transcript diversity.

Figure 2

Classification of inter-exonic connections as splicing patterns. Four component splicing patterns have been defined, depending on connections between distinct exons (black) and variant exons (blue). Class I refers to connections between two successive distinct exons while Class IV refers to connections between two successive variant exons. Classes II and III depict connections between a distinct exon and a variant exon and vice-versa.
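The four splicing pattern classes can be counted with a few lines of Python once each exon carries a distinct (D) or variant (V) label. The sketch below is illustrative only: it counts each unique exon-to-exon connection of the splicing graph once, which is an assumption, since the text does not state whether connections are counted per graph or per transcript. The names `PATTERN_CLASS` and `count_splicing_patterns` are hypothetical.

```python
from collections import Counter

# Class of a connection, given the labels of an exon and its successor.
PATTERN_CLASS = {
    ("D", "D"): "I",
    ("D", "V"): "II",
    ("V", "D"): "III",
    ("V", "V"): "IV",
}


def count_splicing_patterns(transcripts, labels):
    """Count Class I-IV connections in the splicing graph of one gene.

    transcripts: dict transcript id -> list of (start, end) exons.
    labels: dict exon -> 'D' or 'V'.
    """
    edges = set()
    for exons in transcripts.values():
        ordered = sorted(exons)
        # Each pair of successive exons in a transcript is one connection.
        edges.update(zip(ordered, ordered[1:]))
    return Counter(PATTERN_CLASS[(labels[a], labels[b])] for a, b in edges)
```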

Qualitative and quantitative analysis of exons and introns

Basic statistical measures like the mean, median and standard deviation were calculated for all three genomes in order to analyse the exon and intron size conservation across the three genomes for the complete and orthologous AS gene sets. The number of exons per transcript for the three genomes was also calculated.
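A minimal Python sketch of these measurements is given below. It assumes inclusive, 1-based exon coordinates as used in GTF files and uses the sample standard deviation; the text does not specify which form of the standard deviation was used, and the function names are hypothetical.

```python
import statistics


def exon_intron_lengths(transcripts):
    """Return (exon_lengths, intron_lengths) pooled over all transcripts.

    transcripts: dict transcript id -> list of (start, end) exons with
    inclusive coordinates on the same strand.
    """
    exon_lengths, intron_lengths = [], []
    for exons in transcripts.values():
        ordered = sorted(exons)
        exon_lengths += [end - start + 1 for start, end in ordered]
        # An intron spans the bases strictly between two successive exons.
        intron_lengths += [nxt[0] - prev[1] - 1
                           for prev, nxt in zip(ordered, ordered[1:])]
    return exon_lengths, intron_lengths


def summarise(values):
    """Mean, median and sample standard deviation of a list of lengths."""
    sd = statistics.stdev(values) if len(values) > 1 else 0.0
    return {"mean": statistics.mean(values),
            "median": statistics.median(values),
            "sd": sd}


def mean_exons_per_transcript(transcripts):
    """Average number of exons per transcript."""
    return statistics.mean(len(exons) for exons in transcripts.values())
```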

Splice site motif analysis

Splice site mutations are believed to cause several genetic diseases. It is therefore very important to identify variations in the splice site. The frequencies of GT-AG, GC-AG, AT-AC splice site motifs were computed for bovine and analysed and compared to the splice site information for human and mouse obtained from ASTD.
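The motif frequencies can be computed by reading the first two and last two bases of each intron from the unspliced transcript sequence. The Python sketch below assumes exon coordinates that are 0-based, end-exclusive and relative to the unspliced sequence, with the sequence given 5' to 3'; these conventions and the function names are assumptions made for the example.

```python
from collections import Counter


def splice_site_motifs(unspliced, exons):
    """Count donor-acceptor dinucleotide pairs (e.g. 'GT-AG') in a transcript.

    unspliced: pre-mRNA sequence, 5' to 3'.
    exons: list of (start, end) positions within unspliced, 0-based and
    end-exclusive (an assumption of this sketch).
    """
    seq = unspliced.upper()
    ordered = sorted(exons)
    counts = Counter()
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        intron = seq[prev_end:next_start]
        if len(intron) >= 4:  # need two bases at each end
            counts[intron[:2] + "-" + intron[-2:]] += 1
    return counts


def motif_percentages(counts):
    """Convert motif counts into percentages of all introns examined."""
    total = sum(counts.values())
    return {motif: 100.0 * n / total for motif, n in counts.items()}
```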

GO annotation

Analysis of the GO annotations was conducted for two sets of data. In the first set, the transcript sequences of orthologous bovine AS genes obtained from Ensembl were processed using ESTScan, as it can detect and extract coding regions from low quality sequences with high selectivity and sensitivity and can also accurately correct frame shift errors [25]. To obtain comparable datasets, the human and mouse transcript sequences were also processed using ESTScan. The output was then processed using another bioinformatics tool, Blast2GO [26], which we have successfully used in the annotation of expressed sequence tag sequences [27]. The BLAST results from this program were then mapped to GO terms to obtain the GO annotation. The annotation output file was then processed using a plotting tool, WEGO [28], to compile the GO annotation results into category-based lists.

The second dataset was a text file comprising GO annotations for bovine AS genes orthologous to human and mouse AS genes, obtained from Ensembl using the BioMart [29] tool. The second dataset was reformatted and put through the WEGO tool to compile the GO annotation results for plotting.

Mapping of human disease genes to bovine orthologous genes

A well-annotated set of all available (94) human disease genes was extracted from the NCBI Genes and Disease database [21], with a view to analysing which of these genes were alternatively spliced in the human and bovine genomes. Of these 94 genes, AS analysis was conducted on the 66 spliced genes (those with more than one transcript). The two most important events, cassette exon and intron retention, were examined in detail in these 66 genes. The exons involved were then mapped onto the orthologous exons in bovine using the CLUSTALX [30] multiple sequence alignment tool to assess the conservation of these exons, and of the splicing event, across the two species. A pair of exons with a good percentage of alignment is considered conserved irrespective of the position of the exons in the different transcripts, so that exon pairs remain conserved even in the event of exon shuffling.
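One simple way to quantify the "percentage of alignment" between a human exon and its bovine counterpart is the percent identity of the two aligned sequences. The Python sketch below is an assumption about how such a score could be computed from a gapped pairwise alignment; it is not the CLUSTALX scoring scheme, and the `threshold` argument is left to the caller because the text does not state the cut-off used.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity of two aligned sequences ('-' marks a gap).

    Columns in which both sequences have a gap are ignored; a column with
    a gap in only one sequence counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b)
               for a, b in zip(aligned_a.upper(), aligned_b.upper())
               if not (a == "-" and b == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def is_conserved(aligned_a, aligned_b, threshold):
    """True if the exon pair reaches the caller-supplied identity threshold."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```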

Protein domain analysis of the orthologous disease gene set

We identified eight human disease genes that have bovine orthologues. The protein sequences encoded by the transcripts of these human and bovine genes were analysed using the Pfam [31] domain search tool to identify the effects of alternative splicing on the functional protein domains.

Results and discussion

Comparison of 4567 bovine AS genes with 16715 human AS genes and 16491 mouse AS genes showed that only 21% of bovine genes were alternatively spliced, as opposed to 68% of genes in human and 57% of genes in mouse. The statistics provided by the ASAP II database (26%, 53%, 53% for cow, mouse and human respectively) [5] compare well with these estimates of the number of AS genes in cow, mouse and human, although they are almost twice as high as those reported by Nagasaki and group [32] (32.1% and 23% for human and mouse genomes, respectively). All AS genes in cow that have alternatively spliced orthologues in both human and mouse were extracted, to minimize any gene structure bias and to obtain the best-annotated genes in cow for analysis. Such an approach was adopted by Chen et al. [33]. One-to-one, many-to-many, one-to-many and apparent mappings were used to compile the orthologous gene subset. We found that 3504 genes in cow have alternatively spliced orthologues in human and mouse, amounting to 3835 and 3774 genes respectively. This dataset amounted to 16% of bovine alternatively spliced genes, compared to 16% in human and 13% in mouse. Our values are consistent with those (10%) observed by Brett et al. [12] for AS between human and other species, including mouse and cow, reinforcing the credibility of our approach of using orthologous AS gene subsets for multi-species comparisons and for estimating the extent of AS in cow.

Qualitative and quantitative analysis of exons and introns

Our results indicate that bovine AS genes are represented by 2.3 transcripts per gene on average, compared to 8.0 and 6.5 transcripts per gene in human and mouse, respectively. Bovine AS genes thus show less transcript diversity than human and mouse AS genes, and the numbers for the orthologous gene set are quite similar. General statistical characteristics of the intron-exon structure of eukaryotic genomes are invaluable for understanding the structure and evolution of genes and genomes. Using available gene structure information on ten model organisms, Deutsch and Long [34] estimated that each gene comprises 5.0 exons of mean length 51 nt separated by introns of mean length 3413 nt in human, and 4.4 exons of mean length 52 nt separated by introns of mean length 1321 nt in mouse. We found that each bovine transcript comprises close to 13 exons of mean length 181 nt, separated by introns of mean length 5215 nt, while human and mouse transcripts comprise close to 8 and 7 exons, of mean length 178 and 160 nt, separated by introns averaging 5314 and 4311 nt, respectively (Table 1). While all three transcriptomes are composed of exons and introns of similar size, bovine AS genes are more fragmented than human and mouse AS genes. These numbers are again similar to those obtained for the orthologous AS gene set.

Table 1 Comparison of alternative splicing in bovine, human and mouse genomes

Splicing graphs

We generated a total of 4567 bovine, 16715 human and 16491 mouse splicing graphs. The transcript structure of each multi-transcript gene in all three genomes was compiled using the splicing graph approach [8]. The splicing graphs were further decomposed into component splicing patterns (as described in Materials and methods). We noted that 2485 bovine genes are single-exon genes. All the splicing events can be verified from the splicing graphs, suggesting that they could be utilised as an excellent visual analysis tool. One such splicing graph, for Myc, the gene implicated in Burkitt lymphoma, is shown in Figure 3. It can easily be seen from Figure 3 that the gene has two different transcripts.

Figure 3

Splicing graph for the human disease gene Myc (Burkitt Lymphoma). The splicing graph represents the gene in a very simple and easily understandable format.

Alternative splicing events and patterns

The nine AS events discussed above were identified in the orthologous set for the bovine genome and compared to those in human and mouse. For the gene level analysis, Equation 1 was used to calculate the percentage of genes showing each AS event in each of the three genomes.

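\[
\%\ \text{of genes showing an AS event} = \frac{\text{number of AS genes showing the event}}{\text{total number of AS genes in the dataset}} \times 100 \qquad (1)
\]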

The first four AS event categories in Figure 4 refer to splicing events at the ends of a gene, while the remaining five represent internal events. The results of our gene level analysis highlight that most of the genes showed external events. As suggested earlier, the high percentage of transcriptional start and termination site events could be the result of sequencing errors. We observed that the majority of the genes in cow (59%-64%) have cassette exons, with 19%-20% of the genes having intron retention. Very few genes exhibited mutually exclusive exons (3%-4%). Figure 4 clearly shows that fewer bovine genes than human and mouse genes exhibit AS events. The values for both datasets of all three genomes are tabulated in Table 2.

Figure 4

Distribution of AS events - gene level analysis for bovine, human and mouse orthologous AS genes. Nine events, described in Figure 1, were used to classify the observed AS phenomena based on the number of genes displaying these events, as shown in Table 2.

Table 2 Statistics of alternative splicing events for all AS genes and the orthologous AS gene subset (gene level analysis)

It should be noted that each AS gene contains several events. The event level analysis, shown in Equation 2, gives the distribution of each event relative to the total number of AS events observed in the orthologous set of each of the three genomes (Table 3, Figure 5).

Figure 5

Distribution of alternative splicing events: event level analysis for bovine, human and mouse orthologous AS genes. The event level analysis covers each of the nine events described in Figure 1 and is based on the data in Table 3.

Table 3 Statistics of alternative splicing events for the orthologous gene subset (event level analysis)
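\[
\%\ \text{of an AS event} = \frac{\text{number of occurrences of the event}}{\text{total number of AS events in the dataset}} \times 100 \qquad (2)
\]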

Considerable conservation was observed in each of the nine AS events for the three species. Our analysis shows that exon skipping, or cassette exon, is the most prevalent internal AS event in the orthologous genes of all three species, comprising 28%, 26% and 16% of all AS events in bovine, human and mouse, respectively. On the other hand, intron retention and mutually exclusive exons were the least favoured AS events. Intron retention accounted for only 3% of bovine AS events, compared to 3% in human and 2% in mouse. Haussler and co-workers [35] estimated 38% exon skipping and 3% intron retention in human, which are comparable to our values. ASD [3, 4] reports 52% cassette exons and 17% intron retention, which differ considerably from our calculations. This could, however, be because ASD used the entire human genome for its calculations, whereas we have only utilized orthologous AS genes for our analysis.

Overall, the two sets of analyses show that fewer bovine genes than human and mouse genes exhibit each AS event, yet the percentages of AS events are equivalent. This implies that these orthologous AS genes in cow show high variation in transcript structure, despite having fewer distinct transcripts than human and mouse genes.

The splicing pattern analysis was done for the orthologous AS genes by calculating the percentage of the four classes in the splicing pattern to determine the exact nature of the transcript diversity. Among all the patterns described above we observe that Class I (Distinct-Distinct) patterns have the highest occurrence (70%) (Table 4 and Figure 6). Class IV (Variable-Variable) is over-represented (13%) in bovine genes compared to human (5%) and mouse (6%). The diversity in bovine AS genes is thus predominantly composed of edges linking two variable exons, as opposed to human and mouse AS genes, where the transcript diversity is predominantly composed of edges linking a distinct exon with a variable one or vice versa.

Figure 6

Splicing pattern distribution in the orthologous bovine, human and mouse alternatively spliced genes. Statistics on four component splicing patterns have been compiled, with the transcript diversity index defined as the fraction of all patterns involving variant exons.
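Read literally, the transcript diversity index in the caption above is the share of Class II, III and IV patterns among all patterns, since Class I is the only class that involves no variant exon. A minimal Python sketch under that reading follows; the function name is hypothetical and the counts in the usage note are toy values, not results from this study.

```python
def transcript_diversity_index(pattern_counts):
    """Fraction of splicing patterns that involve at least one variant exon.

    pattern_counts: dict with keys 'I', 'II', 'III', 'IV' mapping to the
    number of connections observed in each class (missing keys count as 0).
    """
    total = sum(pattern_counts.values())
    if total == 0:
        return 0.0
    with_variant = sum(pattern_counts.get(c, 0) for c in ("II", "III", "IV"))
    return with_variant / total
```

For toy counts of 6, 2, 1 and 1 connections in Classes I to IV, the index is 4 out of 10 connections.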

Table 4 Alternative splicing class distribution based on splicing patterns for orthologous bovine, human and mouse AS genes

Splice site motif analysis

The splice site motif analysis yielded consistent values in the three genomes. 99% of the splice site motifs in bovine AS genes were found to be GT-AG (Table 5). The data for the orthologous AS gene set is very similar (data not shown).

Table 5 Splice site motif analysis for bovine, human and mouse AS genes

GO analysis of orthologous gene sets

Gene Ontology (GO) analysis was carried out for all three organisms on the orthologous AS gene set, with the GO categories selected on the basis of the work of Chen et al. [33]. The transcript sequences for the orthologous AS genes of human, mouse and bovine were analysed. The overall GO categories for all three species were very similar (Table 6 and Figure 7). In the area of molecular function, "binding" was the most frequent category in all three species. In terms of biological process, "cellular processes" was the most frequent category, while for cellular component it was "cell part". This high similarity in functionality could reflect the common lineage of bovine, human and mouse as mammals.

Figure 7

Occurrence of gene ontology (GO) terms in bovine, human and mouse for the orthologous AS gene subset. GO terms have been categorized on the basis of A. molecular function, B. biological process and C. cellular component.

Table 6 Gene ontology (GO) annotation summary for the orthologous AS gene set.

However, a similar plot was also created for the bovine genome using a different set of annotations, in which all GO details were obtained from Ensembl using the BioMart tool [29]. This analysis showed considerably lower percentages for bovine than the previous plot, with a considerable drop across all three GO areas (Table 6 and Figure 7). We believe this reflects the low level of annotation currently available for bovine genes. We were therefore able to identify 38% more functional information, in terms of GO annotations, than is currently available in Ensembl for bovine genes.

Mapping of human disease genes to bovine orthologous genes

The use of farm animals such as cattle, pigs, sheep, goats, horses and chickens as research models has won many Nobel Prizes for researchers worldwide [36]. The application of tools for genetic manipulation and genomic sequencing in farm animals has created new opportunities in biomedical research [16], providing valuable insights into gene function and into genetic and environmental influences on animal production and human diseases [36]. Because of their size and relatively long intervals between generations, domestic species are widely used to unravel the mechanisms involved in programming the development of an embryo and fetus that result in adult onset of diseases [37, 38]. Rogers et al. [39] have shown that a CFTR gene knockout model in pig mimics human pathology better than mouse models, which fail to develop the hallmark pancreatic, lung and intestinal obstructions that occur in humans. Reynolds et al. [40] note that surgery, blood sampling, tissue recovery, serial biopsies, instrumentation, whole organ manipulation and many other biomedical applications are more easily achieved in animals larger than a mouse, suggesting that size does matter when it comes to animal models. Hence, mapping human disease genes to bovine orthologous genes is an excellent approach for carrying out analytical work and verifying the suitability of cow as a model organism.

Out of the 94 human disease genes that were collected, we observed splicing in 66 cases (70.21%). Mapping these 66 spliced human genes onto orthologous bovine genes suggested that only 17 of the orthologous bovine genes were spliced (18.09%). Cassette exons occur in 38 of the human disease genes (120 cassette exons, Table 7) and in 14 orthologous bovine genes. At the exon level, we observed that 97 of the 120 human exons (Table 7) were conserved in bovine, indicating a high level of conservation in this dataset across the two species. Previously, for a larger dataset [16], it was reported that the majority of genes with cassette exons in human were present and regulated in cattle. However, at the gene level, for the current dataset, we observed that only 3 genes with cassette exons in human (Table 8) were present and regulated in bovine.

Table 7 Human disease genes: Conservation of cassette exons in bovine orthologous genes.
Table 8 Human disease genes: Cassette exons present and regulated in bovine orthologous genes.

We also carried out a detailed survey of the 94 human disease genes to identify intron retention (IR) events. We noted intron retention in nine human genes; in five of these, IR was present and constitutive in bovine (> 50%; Table 9). It has been indicated before that the expression of intron-containing sequences occurs in a variety of diseases [41].

Table 9 Human disease genes: Intron retention present and constitutive in bovine orthologous genes.

Protein domain analysis of the orthologous disease gene set

For the eight human disease genes that have orthologous genes in the bovine genome (three genes with CE and five genes with IR), protein domain analysis revealed that AS affects the structure and function of the proteins encoded by the various transcripts from these genes. It was evident that, due to AS, the majority of the transcripts either lacked the complete functional domain or lacked an essential component or segment of the functional domain. This suggests that AS is a major mechanism that could render these proteins non-functional, besides perturbing the structure or fold of the protein.

For the set of the bovine orthologous genes, only two of eight genes appear to be spliced, resulting in probable structure and function disruption. These genes are responsible for spinal muscular atrophy and colon cancer, with the former noted as a disease caused by AS [1]. Further investigation revealed that four of these eight genes had all the domains from their human counterparts conserved. This implies that 4/8 orthologous bovine genes (including the two AS genes) had essential segments or complete functional domains missing, due to AS.

Wilson's disease is another disease that has been characterised in cow (OMIA). We observe that the human gene known to be responsible for this disease has a retained intron in one of its transcripts, which is orthologous to the only transcript available in the corresponding bovine gene. Thus, the cow would be most suitable as a model organism for this human disease.

Conclusion

This is the first comprehensive study of the bovine transcriptome, with 21% of bovine genes exhibiting alternative splicing, compared to 68% and 57% in human and mouse, respectively. Our analyses show that bovine AS genes are composed of fewer transcripts but many more exons than human and mouse AS genes, although their exons and introns are of comparable size. Nine different splicing events were compared among the cow, human and mouse genomes. Compared to their human and mouse counterparts, many more bovine AS genes show intron retention. The most common AS event was found to be exon skipping, and the least common events were intron retention and mutually exclusive exons. Fewer bovine genes are alternatively spliced, but those that are show high transcript variability, composed predominantly of introns linking two variable exons, in contrast to human and mouse genes.

Our approach to collating the GO annotations for bovine AS genes identified 38% more functional information than is currently available in Ensembl. The GO annotations suggest that the orthologous bovine AS genes are functionally very similar to human and mouse genes.

From the results of our protein domain analysis, it is evident that AS plays a major role in disease in both human and cow, and that the cow is suitable as a model for investigating spinal muscular atrophy, colon cancer, Tangier disease, glaucoma, spinocerebellar ataxia, polycystic kidney disease, autoimmune polyglandular syndrome and Wilson's disease. Our results provide a window of opportunity for more in-depth analysis over a larger dataset, where the cow can serve as a model organism for many more human diseases.

Note

Other papers from the meeting have been published as part of BMC Bioinformatics Volume 10 Supplement 15, 2009: Eighth International Conference on Bioinformatics (InCoB2009): Bioinformatics, available online at http://www.biomedcentral.com/1471-2105/10?issue=S15.