Origin and evolution of a placental-specific microRNA family in the human genome
MicroRNAs (miRNAs) are a class of short regulatory RNAs encoded in the genome of DNA viruses, some single cell organisms, plants and animals. With the rapid development of technology, more and more miRNAs are being discovered. However, the origin and evolution of most miRNAs remain obscure. Here we report the origin and evolution dynamics of a human miRNA family.
We have shown that all members of the miR-1302 family are derived from MER53 elements. Although the conservation scores of the MER53-derived pre-miRNA sequences are low, we have identified 36 potential paralogs of MER53-derived miR-1302 genes in the human genome and 58 potential orthologs of the human miR-1302 family in placental mammals. We suggest that in placental species, this miRNA family has evolved following the birth-and-death model of evolution. Three possible mechanisms that can mediate miRNA duplication in evolutionary history have been proposed: the transposition of the MER53 element, segmental duplications and Alu-mediated recombination. Finally, we have found that the target genes of miR-1302 are over-represented in transportation, localization, and system development processes and in the positive regulation of cellular processes. Many of them are predicted to function in binding and transcription regulation.
The members of the miR-1302 family that are derived from MER53 elements are placental-specific miRNAs. They emerged early in the 180 million years since eutherian mammals diverged from marsupials. Under the birth-and-death model, the miR-1302 genes have experienced a complex expansion, with some members evolving by segmental duplications and some by Alu-mediated recombination events.
Keywords: Segmental Duplication; miRNA Family; Segmental Duplication Event; MER53 Element; Potential Paralogs
Abbreviations: KEGG, Kyoto Encyclopedia of Genes and Genomes
MiRNAs are endogenously expressed, single-stranded RNAs ~22 nucleotides (nt) in length. In animals, miRNAs are transcribed as long primary miRNA (pri-miRNA) sequences that are processed in the nucleus to give precursor miRNA (pre-miRNA) sequences. The pre-miRNA sequences are exported to the cytoplasm, where they are cleaved to produce mature miRNAs. The miRNAs are then incorporated into RNA-induced silencing complexes, where they function either to inhibit translation or to mediate the degradation of their target mRNAs, commonly by binding to complementary regions in the 3' untranslated regions (UTRs) [2, 3, 4]. MiRNAs play a pivotal role in many cellular functions by regulating normal developmental and physiological processes [5, 6, 7], and are involved in disease development [8, 9].
Repetitive elements (repeats) include tandem repeats and interspersed repeats (DNA transposons and retrotransposons). Interspersed repeats are responsible for gene (or exon) shuffling and duplication [10, 11] as well as for regulatory changes [12, 13]. Gene (or exon) shuffling and duplication lead to the de novo creation of protein domains [12, 14] or new protein sequences [15, 16]. Recently, many miRNAs derived from repetitive elements have been identified in mammals and plants [17, 18, 19, 20, 21]. Some transposable elements become integrated at multiple loci in the genome and evolve into different members of a miRNA family. An example of this is the hsa-mir-548 family, the members of which are derived from Made1 transposable elements. Some transposable elements surrounding miRNAs have also been found to facilitate the expansion of miRNA clusters [18, 22]. As the evolution of many miRNAs remains obscure, the analysis of miRNAs derived from repetitive elements may facilitate the understanding of miRNA evolution.
Here we report our study of the miR-1302 gene family, which has been experimentally verified in the human genome. In miRBase (Release 16.0, Sept 2010), this family has 11 members distributed in the human genome. Members of this family have recently been identified using computational methods in the chimpanzee and horse genomes [24, 25]. We have found that all members of this family are derived from one transposon, the MER53 element. The MER53 element is a type of DNA transposable element with a 193-bp (base pair) consensus sequence that exists in eutherian species. MER53 elements are characterized by the presence of terminal inverted repeats and TA target site duplications that can form palindromic structures. If integrated into the genome and transcribed, they may be processed into miRNAs by the miRNA processing machinery. In this paper, we focus on MER53-derived miRNAs in the human genome. First, we identified MER53-derived miRNAs among known miRNAs and scanned the human genome for their paralogs. Next, the phylogenetic distribution and evolutionary dynamics of the miR-1302 family were analyzed. Finally, we investigated the functions of the predicted target genes and analyzed the over-representation of Gene Ontology terms and KEGG pathways for this miRNA family.
Conservation Evaluation of MER53-derived miRNAs in the Human Genome
Percentage and Conservation Scores for MER53-derived hsa-mir-1302 Members Embedded in Repeats
The average phastCons conservation scores of pre-miRNA sequences have been used to determine the conservation of pre-miRNA. In our study, the average phastCons conservation scores of hsa-mir-1302 members are much lower than the thresholds used in previous studies (Table 1) [21, 29].
Potential Paralogs of the miR-1302 Family in the Human Genome
Of the 5,839 MER53 elements annotated in the hg18 genome assembly, we identified 44 MER53 elements that may encode miRNAs (Additional File 2). Eight of the 44 potential sequences overlap with experimentally verified miR-1302 precursor sequences that are distributed on different chromosomes (Additional File 2). Hsa-mir-1302-5 and hsa-mir-1302-7 were not identified by our method because their multi-branched loops were filtered out by the MiPred program, and hsa-mir-1302-8 was not identified because the selected region did not meet our criterion for a pre-miRNA. Thus, we have identified 36 MER53 elements that may encode miRNAs belonging to the miR-1302 family and that have not yet been reported in the human genome.
Phylogenetic Distribution of Orthologs of the Human miR-1302 Family
MER53 elements are found only in eutherian species (placental mammals). Because eutherian mammals diverged from marsupials and monotremes 180 and 210 million years ago, respectively, the homologs of the hsa-mir-1302 family in placental mammals may all be derived from MER53 elements, which would explain why homologs of the mir-1302 family are not found in opossum and platypus (Figure 2). From this, we can infer that MER53 elements and MER53-derived miRNA genes emerged early in the 180 million years since eutherian mammals diverged from marsupials.
During evolution, many miR-1302 genes have been gained and lost (Figure 2). To estimate the gain and loss of miR-1302 genes during evolution, we used the previously reported parsimony method to infer that miR-1302-1, miR-1302-2, miR-1302-4, miR-1302-5, miR-1302-6, miR-1302-7, and miR-1302-8 evolved after the MER53 elements had been inserted and fixed. For example, the miR-1302-2, miR-1302-4 and miR-1302-5 genes are present in human and marmoset but absent in other species such as tarsier, suggesting that these genes were generated in the ancestor of human and marmoset and lost in tarsier (Figure 2). In addition, in the tarsier genome the orthologs of miR-1302-1 and miR-1302-6 are more diverged than the other miRNA genes (Figure 2). In humans, three additional hsa-mir-1302-2 genes and one hsa-mir-1302-3 gene have been duplicated from the original hsa-mir-1302-2 (discussed in detail in the next section). The miRNA gene family has experienced repeated gene duplication; while some of the duplicated genes have diverged functionally, others have become pseudogenes or have been deleted from the genomes of these species. This pattern is clearly shown in Figure 2.
Previous workers have pointed out that if the gene members of a family evolve in a birth-and-death manner, the genes will cluster by type and not by species, whereas under the concerted model they will cluster by species [33, 34]. Except for hsa-mir-1302-2 and hsa-mir-1302-3, which were produced by recent segmental duplication events and cluster together, the miR-1302 genes do not show a within-species clustering pattern (Figure S2 in Additional File 1). Combining the information on evolutionary divergence between sequences (Additional file 4) with the results of the preceding analysis, we can infer that the miR-1302 genes evolved in a birth-and-death manner.
Segmental Duplication and Alu Repeats Mediate the Expansion of the miR-1302 Gene in the Human Genome
[Table: Precursors of Human miR-1302 in Segmental Duplications. Columns: Segmental Duplication Pair; miRNA Genes in the Duplication; Coordinate of Duplication (SD1); Corresponding Duplication (SD2); miRNAs in SD1; miRNAs in SD2.]
There is only a two-nucleotide difference between hsa-mir-1302-2 and hsa-mir-1302-3 (see Figure S3A in Additional File 1), the same as the difference in the alignment of their corresponding MER53 sequences (see Figure S3B in Additional File 1). This indicates that, owing to DNA segmental duplication, the hsa-mir-1302-2 sequence was also duplicated. When the hsa-mir-1302-2 gene underwent duplication, as part of the evolution process, it is possible that mutations occurred resulting in a new gene, hsa-mir-1302-3.
Because they have not been found in regions of segmental duplication, the other hsa-mir-1302 genes may be products of MER53 transposition events alone. To determine if they were also formed by segmental duplication events, we analyzed the 36 paralogs of the hsa-mir-1302 family that we identified in the human genome (Additional File 2). We found that, like most of the members of the hsa-mir-1302 family, none of them were located in regions of segmental duplications.
Target Prediction and Functional Analysis
The function of hsa-miR-1302 is still unknown. We have analyzed the targets of the mature human miR-1302 in an attempt to explore its potential function. In the human genome, 1,835 target genes are predicted by both PITA (17,844 predicted target genes, leaving 39,154 target sites after duplicate coordinates are removed; see Additional file 6) and TargetScan [42, 43] (2,055 target genes with 94 conserved sites and 2,334 poorly conserved sites; see Additional file 7). We used the 1,835 target genes at the intersection of the two predicted gene sets as valid targets (Additional file 8). Ninety-one of the TargetScan predicted conserved sites (Additional file 9) and 1,744 of the TargetScan predicted poorly conserved sites were identified in the data set of 1,835 valid target genes. To determine the functions and pathways that may involve hsa-miR-1302, all the valid targets of hsa-mir-1302 were annotated using WebGestalt and KEGG. The 1,835 target genes are widely distributed across all the chromosomes (see Figure S5A in Additional File 1) and are expressed in most tissues of the body. The highest expression levels are in the nervous system, and the lowest are in the soft tissue and in the adrenal medulla (see Figure S5B in Additional File 1). In the nervous tissue, the genes are enriched in intracellular membrane-bounded organelles and in the synapse (see Figure S5C in Additional File 1). Functionally, the target genes are over-represented in transportation, localization, system development processes, and in the positive regulation of cellular processes. They may also play a role in binding and transcription regulation (see Figure S5D, Figure S5E in Additional File 1 and Additional file 10). Overall, these predicted target genes are implicated in cell proliferation and cell division, metabolism, development and in the immune response.
In the pancreas, some of the genes have roles in the insulin-related signal pathway and in pancreas pathology, while in the nervous system some are involved in learning, memory, and signal transduction and are implicated in neural disease development (see Figure S5 D-E in Additional File 1 and Additional file 10). The KEGG annotation indicates that some of the target genes are involved in diseases and are enriched in pathways that lead to glioma, chronic myeloid leukemia, colorectal cancer, pancreatic cancer, type II diabetes mellitus, neurodegenerative disorders such as Alzheimer's disease, and pathogenic Escherichia coli infection (Additional file 10). As miR-1302 was first identified in pluripotent human embryonic stem cells and embryoid bodies, and its predicted targets are enriched in cancer pathways, it may influence the biological processes taking place in stem cells, in tumor cells and in the early embryo.
Recently, miRNAs that originate from repetitive elements have been identified in mammals and plants [17, 18, 19, 20, 21]. Here we report a microRNA family, the miR-1302 family, which originates from the DNA transposable element MER53. MER53 is a medium reiteration frequency, non-autonomous DNA transposon related to the mariner family. MER53 elements can form palindromic stem-loop structures (see Figure 1B and Figure 1C). Once an MER53 element is inserted into an actively transcribed region of the genome and is fixed by natural selection, it may be transcribed and processed into miRNAs by the miRNA processing machinery. Previous studies have shown that transposable-element-derived miRNAs are less conserved than non-transposable-element-derived miRNAs. Consistent with this, we find that the average conservation scores for the miR-1302 genes are very low (Table 1). Because quite a few orthologs of the precursors of human miR-1302 are found in eutherian mammals (see Additional File 3), we suggest that the MER53 elements and MER53-derived miRNA genes may have evolved after eutherian mammals diverged from marsupials and monotremes, as recently as 180 million years ago. In placental species, many miR-1302 genes have been gained and lost, indicating that they may have evolved following the birth-and-death model of evolution. It should be noted that the parsimony reconstruction of gain and loss of the miR-1302 genes is influenced by the fact that the miR-1302 genes were first identified in the human genome and the orthologs in other species were identified using computational approaches. Nevertheless, the theoretical approach is a good starting point for deducing the mode of evolution of miRNA families.
MiRNAs, like protein-coding genes, form gene families and, as with MIR166 in plants, many of the pre-miRNAs produce either similar or identical mature miRNAs [27, 46]. These pre-miRNAs are classified as one family. The following questions arise: How do miRNA genes evolve to become miRNA families? What are the evolutionary dynamics of these miRNA genes? And are their evolutionary patterns the same as those for protein-coding gene families? Previous studies have shown that the expansion of miRNA families usually occurs through tandem or segmental duplications [22, 47, 48, 49, 50]. In the present study, we have focused on the precursors of human miR-1302. Because genes of the miR-1302 family are found on several chromosomes, we investigated the possibility that they too evolved through segmental duplication events. We suggest that, besides the transposition of the MER53 elements, the four copies of hsa-mir-1302-2 and the one copy of hsa-mir-1302-3 were produced through segmental duplication events. When Alu sequences are near or at the boundaries of the duplication units, they are known to mediate the expansion of segmental duplications through recombination. Our results indicate that the hsa-mir-1302-2 and hsa-mir-1302-3 genes may have evolved because of Alu-mediated recombination events. However, this mechanism apparently does not apply to other members of the human miR-1302 family or to potential paralogs in the human genome. They may have evolved by the transposition of MER53 elements alone.
Small RNAs regulate gene expression in many ways: they mediate antiviral responses; they play a role in the organization of chromosomal domains; and they restrain the spread of selfish genetic elements. Small RNAs guide transcriptional and post-transcriptional silencing machinery to specific target sequences that include genes and transposable elements. The target genes of miR-1302 are over-represented in functions that require the binding of metal ions and binding to DNA. They are mainly involved in metabolism, regulation of cellular physiological processes, signal transduction and transport. The MER53-derived miRNAs may, therefore, play important roles in cell proliferation and cell division, metabolism, development, pancreas physiology and pathology, nervous system physiology, diseases and in the immune response. Because the functions of miR-1302 are only predicted, and target predictions are known to produce a large number of false positives, a better way to assign a function to miR-1302 would be to combine the expression profiles of the miRNAs and the target genes. This is a study that we would like to do in the future.
In this study, we report the origin and evolution of the miR-1302 family in the human genome. Overall, we have identified 36 novel potential paralogs of miR-1302 genes in the human genome and 58 orthologs of the human miR-1302 genes in 21 placental species. Our data show that all members of the miR-1302 family are derived from MER53 elements, and we have proposed that they emerged early in the 180 million years since eutherian mammals diverged from marsupials. Segmental duplication events have facilitated the expansion of the miR-1302 family, while the expansion of these segmental duplications may also have been facilitated by Alu-mediated recombination events. Because many miR-1302 genes have been gained and lost in placental species, we have proposed that their development proceeded according to the birth-and-death model of evolution. Furthermore, we have found that the predicted target genes of miR-1302 are over-represented in transportation, localization, and in system development processes as well as in the positive regulation of cellular processes. Many of the potential target genes are predicted to function in binding and transcription regulation.
Finding MER53-derived miRNAs Within Known Human miRNAs
The genomic locations and sequences of repetitive elements (including MER53) in the human genome were taken from the University of California Santa Cruz (UCSC) Genome Browser using the Table Browser and analyzed using RepeatMasker 3.27. The sequences and coordinates of the human pre-miRNAs and mature miRNAs were downloaded from miRBase v13.0 and mapped to the human genome (hg18). The nomenclature for some members of the hsa-mir-1302 family differs between miRBase v13.0 and later releases. However, the sequences are the same and the new nomenclature is explained in the Results section. The genomic locations of the miRNAs and the repeats in the whole genome sequence were compared using the UCSC Table Browser and Galaxy. If the overlap of coordinates (equivalent in this case to percentage identity) between a repetitive element and a pre-miRNA sequence was at least 50% for the pre-miRNA sequence or 100% for the mature miRNA sequence, then the miRNA was considered to be a repeat-derived miRNA. To retrieve information and for further analysis, the data and results were processed using Linux shell and Perl scripts. Unless otherwise specified, all the data in the present work were analyzed using these tools.
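The overlap criterion described above can be illustrated with a short Python sketch. This is not the authors' Perl or shell code; the function names, the 0-based half-open coordinate convention, and the example coordinates are assumptions made only for illustration.

```python
def overlap_fraction(feature, repeat):
    """Fraction of `feature` covered by `repeat`.

    Both arguments are (start, end) tuples on the same chromosome,
    assumed here to be 0-based and half-open.
    """
    f_start, f_end = feature
    r_start, r_end = repeat
    overlap = max(0, min(f_end, r_end) - max(f_start, r_start))
    return overlap / (f_end - f_start)


def is_repeat_derived(pre_mirna, mature, repeat, pre_threshold=0.5):
    """Apply the rule from the text: at least 50% of the pre-miRNA,
    or 100% of the mature miRNA, lies inside the repeat."""
    return (overlap_fraction(pre_mirna, repeat) >= pre_threshold
            or overlap_fraction(mature, repeat) == 1.0)


# Hypothetical coordinates: a 100-nt pre-miRNA and its 22-nt mature miRNA.
print(is_repeat_derived((100, 200), (110, 132), (150, 400)))
print(is_repeat_derived((100, 200), (110, 132), (180, 400)))
```

In the first call half of the pre-miRNA lies inside the repeat, so the locus is accepted; in the second, neither condition is met.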
Genome-wide Identification of Potential Paralogs of MER53-derived miR-1302 Genes
To identify potential paralogs of MER53-derived miR-1302 genes, we developed a three-step operational scheme. We first searched the human genome for candidate MER53 elements using the BLAST program. Only hits with exact matches in the "seed region" (nucleotides 2-8) of miR-1302 were selected, and two potential miR-1302 precursor sequences of length 110 nt harboring the "seed region" were excised from the hit sequences using a method similar to that described earlier. An excised sequence refers to a mature miRNA that is processed from the left or right arm of a potential precursor sequence. Using MiPred, excised sequences that were predicted to be miRNA precursors were selected for further analysis. Finally, if the candidate sequence could be transcribed and if it was not the exon of a protein-coding gene, then it was assumed to be a miRNA precursor. To determine if the sequence was transcribed and was not an exon of a protein-coding gene, we compared the coordinates of the potential precursor sequences with the coordinates of ESTs and the exons of protein-coding genes in the hg18 genome assembly.
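The seed-matching and excision step can be sketched roughly as follows. The text does not give the exact window offsets, so the 10-nt flank is an assumption, as are the function names and the toy mature sequence (which is not the real miR-1302 sequence). The BLAST search, MiPred classification and EST/exon filtering are not reproduced here.

```python
def seed_of(mature, start=2, end=8):
    """Return the seed (1-based nucleotides start..end) as DNA."""
    return mature.upper().replace("U", "T")[start - 1:end]


def excise_candidates(hit_seq, mature, length=110, flank=10):
    """Yield (arm, window_start, window) for each exact seed match.

    Two windows of `length` nt are cut per match: one placing the mature
    miRNA on the left (5') arm and one placing it on the right (3') arm.
    `flank` (nt kept outside the mature miRNA) is an assumed parameter.
    """
    hit_seq = hit_seq.upper()
    seed = seed_of(mature)
    pos = hit_seq.find(seed)
    while pos != -1:
        mature_start = pos - 1  # the seed begins at nucleotide 2
        left = mature_start - flank
        right = mature_start + len(mature) + flank - length
        for arm, start in (("5p", left), ("3p", right)):
            # Skip windows that would run off either end of the hit.
            if start >= 0 and start + length <= len(hit_seq):
                yield arm, start, hit_seq[start:start + length]
        pos = hit_seq.find(seed, pos + 1)


# Toy example with a hypothetical 22-nt mature sequence.
toy_mature = "TGCATGCCGTACGGTTCAGCTA"
toy_hit = "A" * 150 + toy_mature + "A" * 150
for arm, start, window in excise_candidates(toy_hit, toy_mature):
    print(arm, start, len(window))
```

Each excised window would then be passed to a precursor classifier such as MiPred, which is outside the scope of this sketch.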
Phylogenetic Distribution of the Orthologs of MER53-derived miR-1302 Genes
The orthologs of the eight members of the hsa-mir-1302 family in different organisms were retrieved from the Multiz alignments of 44 vertebrate species. When there were gaps or short inserts (1-10 bp) in the selected alignment region, the corresponding sequences were checked using the Ensembl Genome Browser. cja-mir-1302-5, ggo-mir-1302-6 and ggo-mir-1302-7 were detected by the BLAT program with best reciprocal hits. Because a number of workers have used the liftOver program provided by the UCSC Genome Bioinformatics Group to determine orthologs [57, 58, 59, 60, 61], we checked our previously determined ortholog sequences by applying liftOver to over.chain files (downloaded from ftp://hgdownload.cse.ucsc.edu/) to further authenticate them (see details in Additional File 3). The MiPred classifier was used to validate these sequences as potential pre-miRNAs.
The UCSC phastCons program was used to calculate the conservation score of each nucleotide of every member of the hsa-mir-1302 family from the 17-species Multiz alignment available from the UCSC Genome Browser. The 17 species are human, chimp, rhesus, mouse, rat, rabbit, dog, cow, armadillo, elephant, tenrec, opossum, chicken, frog, zebrafish, tetraodon, and pufferfish (fugu). The Multiz data contain a measure of evolutionary conservation for each nucleotide in the human genome against the other 16 genomes. To estimate the probability that each nucleotide belongs to a conserved element in the multiply aligned sequences, phastCons computes conservation scores based on a phylogenetic hidden Markov model. The phastCons scores range from 0 to 1 and estimate the probability that a nucleotide or region is under purifying selection. The average score of each pre-miRNA was calculated from the per-site conservation scores.
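As a minimal sketch of the averaging step, the following Python function computes the mean per-site score over a pre-miRNA interval. The dictionary representation of the scores, the half-open coordinates, and the choice to skip positions without a score are assumptions for illustration; the text does not say how unscored positions were handled.

```python
def mean_phastcons(scores, start, end):
    """Average per-site conservation score over [start, end).

    `scores` maps a genomic position to its phastCons score (0 to 1).
    Positions with no score are skipped (an assumption); returns None
    when no position in the interval is scored.
    """
    values = [scores[pos] for pos in range(start, end) if pos in scores]
    if not values:
        return None
    return sum(values) / len(values)


# Hypothetical per-site scores for a three-nucleotide stretch.
toy_scores = {100: 0.0, 101: 0.5, 102: 1.0}
print(mean_phastcons(toy_scores, 100, 103))
```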
Sequences were aligned using the R-Coffee program. Molecular evolutionary analyses were performed using MEGA 4. The extent of nucleotide sequence divergence was estimated using uncorrected p-distances, and evolutionary distances were calculated with the pairwise deletion option. Phylogenetic trees were reconstructed using the neighbor-joining method, and the statistical reliabilities of the internal branches were assessed using 1,000 bootstrap replicates.
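For readers unfamiliar with the uncorrected p-distance under pairwise deletion, a small Python sketch follows. It is an illustration of the general definition rather than a reproduction of MEGA 4; the function name and the set of characters treated as gaps or missing data are assumptions.

```python
def p_distance(seq_a, seq_b, skip_chars="-N?"):
    """Uncorrected p-distance between two aligned sequences.

    Pairwise deletion: a column is ignored only if one of these two
    sequences has a gap or ambiguous character at that column.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in skip_chars or b in skip_chars:
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


# One mismatch over five comparable columns (the gapped column is skipped).
print(p_distance("ACGT-A", "ACCTTA"))
```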
Two independent approaches have been developed to detect segmental duplications (SDs) [66, 67]. In one of the approaches, SDs (≥1 kb and ≥90% identity) were discovered by identifying high-copy repeats, removing the repeats from the whole genomic sequence and then searching for similar sequences in the genome. The repeat sequences were reinserted into the pairwise alignments, the ends of alignments were trimmed, and global alignments were generated (see reference for details). To test our hypothesis that, in addition to the effect of MER53 transposition, SD events may contribute to the expansion of the hsa-mir-1302 family, we analyzed the segmental duplication data that were pre-computed by Bailey and his colleagues and that we downloaded from the UCSC Genome Browser. By comparing the coordinates, we determined whether human miR-1302 genes were located in regions of segmental duplications. We used Circos v0.52 to show the relationships between segmental duplications harboring miR-1302 genes.
To find out if the segmental duplications that harbor miR-1302 genes were produced by Alu-mediated recombination events, we examined sequence features at the junctions of duplications. We defined junction sequences as the sequences at the terminal ends of segmental duplications spanning a ±5 bp interval. We compared the divergence of Alu elements located at the junctions with the divergence of the corresponding internal Alus in the pairing duplications. Sequence divergence was estimated using Kimura's two-parameter model for genetic distance in MEGA 4. The data were analyzed using version 2.9.0 of the R programming language and environment for statistical computing and graphics.
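Kimura's two-parameter distance is d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of sites showing transitions and transversions, respectively. The Python sketch below implements this standard formula for two aligned sequences; it is not MEGA 4 code, and the function name and the skipping of non-ACGT columns are assumptions.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    sites = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    # The logarithm is undefined for saturated sequences; math.log will
    # raise ValueError in that case.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# One transition in ten sites: slightly larger than the p-distance of 0.1.
print(round(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"), 4))
```

The corrected distance exceeds the raw proportion of differences because the model accounts for multiple substitutions at the same site.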
Target Prediction and Regulatory Function Analysis
We collected the 3'UTR sequences of human coding refGenes from the UCSC Genome Browser. Two programs, PITA and TargetScan [42, 43], were used to identify potential target sites of hsa-miR-1302. The intersection of the two independently computed sets of target genes was used to list the valid potential targets. The number of possible target sites for human miR-1302 in all human coding genes was calculated after removing the coordinates of overlapping genes. To determine the functional categories to which the target genes belong, we used the WebGestalt program to display the GO categories of the target genes. We also mapped the genes to KEGG pathways.
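The intersection and site de-duplication steps can be sketched as below. The record layout, function name and gene labels are hypothetical and do not reflect the actual PITA or TargetScan output formats.

```python
def consensus_targets(pita_sites, targetscan_genes):
    """Return (shared_genes, n_unique_sites).

    `pita_sites` is an iterable of (gene, chrom, start, end) records;
    identical site coordinates reported for overlapping genes are
    counted once. `targetscan_genes` is an iterable of gene names.
    """
    pita_sites = list(pita_sites)
    unique_sites = {(chrom, start, end) for _, chrom, start, end in pita_sites}
    pita_genes = {gene for gene, _, _, _ in pita_sites}
    shared = pita_genes & set(targetscan_genes)
    return shared, len(unique_sites)


# Hypothetical records: GENE_A and GENE_B overlap and share one site.
toy_pita = [
    ("GENE_A", "chr1", 1000, 1022),
    ("GENE_B", "chr1", 1000, 1022),
    ("GENE_C", "chr2", 500, 522),
]
toy_targetscan = ["GENE_A", "GENE_C", "GENE_D"]
shared, n_sites = consensus_targets(toy_pita, toy_targetscan)
print(sorted(shared), n_sites)
```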
This work is supported by a grant from the National Natural Science Foundation of China (Project No. 61073141). We would like to thank the two unidentified reviewers for their constructive comments on an earlier version of the manuscript.
23. Morin RD, O'Connor MD, Griffith M, Kuchenbauer F, Delaney A, Prabhu A-L, Zhao Y, McDonald H, Zeng T, Hirst M, et al: Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells. Genome Res. 2008, 18 (4): 610-621. 10.1101/gr.7179508.
24. Zhou M, Wang Q, Sun J, Li X, Xu L, Yang H, Shi H, Ning S, Chen L, Li Y, et al: In silico detection and characteristics of novel microRNA genes in the Equus caballus genome using an integrated ab initio and comparative genomic approach. Genomics. 2009, 94 (2): 125-131. 10.1016/j.ygeno.2009.04.006.
32. Nozawa M, Miura S, Nei M: Origins and evolution of microRNA genes in Drosophila species. Genome Biology and Evolution. 2010, 180-189. 10.1093/gbe/evq009.
57. Lee AS, Gutierrez-Arcelus M, Perry GH, Vallender EJ, Johnson WE, Miller GM, Korbel JO, Lee C: Analysis of copy number variation in the rhesus macaque genome identifies candidate loci for evolutionary and human disease studies. Hum Mol Genet. 2008, 17 (8): 1127-1136. 10.1093/hmg/ddn002.
62. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, et al: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005, 15 (8): 1034-1050. 10.1101/gr.3715005.
69. R Development Core Team: R: A Language and Environment for Statistical Computing. 2009, Vienna, Austria: R Foundation for Statistical Computing.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.