Can-SINE dynamics in the giant panda and three other Caniformia genomes
Although repeat sequences constitute about 37% of carnivore genomes, the characteristics and distribution of repeat sequences among carnivore genomes have not been fully investigated. Based on the updated Repbase library, we re-annotated transposable elements (TEs) in four Caniformia genomes (giant panda, polar bear, domestic dog, and domestic ferret) and performed a systematic, genome-wide comparison focusing on the Carnivora-specific SINE family, Can-SINEs.
We found the majority of young recently integrated transposable elements are LINEs and SINEs in carnivore genomes. In particular, SINEC1_AMe, SINEC1B_AMe and SINEC_C1 are the top three most abundant Can-SINE subfamilies in the panda and polar bear genomes. Transposition in transposition analysis indicates that SINEC1_AMe and SINEC1B_AMe are the most active subfamilies in the panda and the polar bear genomes. SINEC2A1_CF and SINEC1A_CF subfamilies show a higher retrotransposition activity in the dog genome, and MVB2 subfamily is the most active Can-SINE in the ferret genome. As the giant panda is an endangered icon species, we then focused on the identification of panda specific Can-SINEs. With the panda-associated two-way genome alignments, we identified 250 putative panda-specific (PPS) elements (139 SINEC1_AMes and 111 SINEC1B_AMes) that inserted in the panda genome but were absent at the orthologous regions of the other three genomes. Further investigation of these PPS elements allowed us to identify a new Can-SINE subfamily, the SINEC1_AMe2, which was distinguishable from the current SINEC1_AMe consensus by four non-CpG sites. SINEC1_AMe2 has a high copy number (> 100,000) in the panda and polar bear genomes and the vast majority (> 96%) of the SINEC1_AMe2 elements have divergence rates less than 10% in both genomes.
Our results suggest that Can-SINEs show lineage-specific retransposition activity in the four genomes and have an important impact on the genomic landscape of different Caniformia lineages. Combining these observations with results from the COSEG, Network, and target site duplication analysis, we suggest that SINEC1_AMe2 is a young mobile element subfamily and currently active in both the panda and polar bear genomes.
KeywordsCaniformia species Can-SINEs Transposable element activity Putative panda-specific elements
Long interspersed element
Long terminal repeat
Million years ago
Short interspersed element
Transposition in transposition
Time of most recent common ancestor
Target site duplication
Transposable elements (TEs) are mobile DNA sequences that occupy a large proportion of genomes of carnivores [1, 2]. For example, TEs account for about 36.1% of the dog genome , 37% of the giant panda genome , and 38.1% of the polar bear genome . TEs have a major impact on genomic architecture and can affect genes in many ways, such as gene mutation, gene activation/silencing, mRNA alternative splicing, X-chromosome inactivation, and promoting additional forms of structural genetic variation [6, 7, 8, 9, 10]. TEs can generally be divided into class I and class II elements [11, 12]. Class I elements, also called retrotransposons, amplify using a copy-and-paste mechanism via an RNA intermediate [11, 13]. Short interspersed elements (SINEs), long interspersed elements (LINEs) and long terminal repeat (LTR) are different types of retrotransposons. SINE elements have no protein-coding sequences and depend on the enzymatic machinery of LINEs to proliferate . Class II elements, also called DNA transposons, can relocate in the genome using a cut-and-paste mechanism .
During evolution, active retrotransposons can proliferate in the genome and accumulate diagnostic nucleotide variations (e.g., substitutions, insertions, and deletions) over time. Based on the diagnostic variations that are shared by all members, retrotransposons have diversified into a variety of subfamilies, each with its own set of diagnostic sequence characteristics and period of activity . In some cases, the relationships of retrotransposon subfamilies indicate hierarchical characteristics, with the youngest subfamilies containing the most diagnostic mutations and oldest subfamilies the least . Understanding proliferation patterns of young active retrotransposons is crucial because the accumulation of new copies and lineage-specific subfamilies in different taxa can contribute to the genome variations between species. For example, a genome-wide comparison of the human and chimpanzee genomes indicated a large number of species-specific L1, Alu and SINE-VNTR- Alu (SVA) elements that have been inserted since their divergence ~ 6 Million year ago (Mya) [17, 18].
Caniformia (dog-like carnivores) is a suborder in Carnivora. Species in Caniformia are characterized by significant morphological, ecological, and behavioral variation . In particular, the giant panda is globally iconic due to its physically attractive attributes, almost exclusive bamboo based diet, unique morphological and physiological adaptations, and its characteristic black and white pelage . Analysis of the draft genome of the giant panda published in 2010 indicated at least 70 Mb of TE sequences (3% of genome) have divergence rates ≤10%, and these young elements could contribute to the genomic and functional variations of the species . However, little is known about the transposition activities of young TEs in different Caniformia species and their contributions towards the evolution of Caniformia genomes. In this study, we reannotated TEs and characterized Can-SINEs in the giant panda genome and three other Caniformia species, the domestic dog (C. lupus familiaris), polar bear (Ursus maritimus), and ferret (Mustela putorius furo) using an updated Repbase library. Then, the two-way genome alignments were employed to identify lineage-specific Can-SINE elements, and the putative panda-specific (PPS) elements were identified using the dog, ferret, and polar bear as outgroups (Fig. 1c). These elements integrated into the panda genome recently and were potentially from new Can-SINE subfamilies that are still active in the panda genome. Our study provides new insight into the contributions of Can-SINEs to the genomic diversity in Caniformia species and the important roles of young and active Can-SINEs in shaping the panda genome.
TE content in the four Caniformia genomes
Major types of TEs in the panda, polar bear, dog, and ferret genome
Can-SINE landscape within the four Caniformia genomes
The average divergence rates of SINE subfamilies indicate that most subfamilies are relatively old (Fig. 2b). The MamSINE1 shows the highest divergence rate (~ 30%) in all four genomes. Only six subfamilies (SINEC_Cf, SINEC_Cf2, SINEC_Cf3, SINEC1A_CF, SINEC1B2_CF, and SINEC2A1_CF) in the dog genome, one subfamily (SINEC_Mv) in the ferret genome, one (SINEC1_AMe) in the polar bear genome, and two (SINEC1_Mv and SINEC1_AMe) in the panda genome have divergence rates of ≤10% (Fig. 2b). SINEC1B_AMe subfamily has the third lowest divergence rate in the panda and polar bear genomes and it has accumulated high copy number in both genomes, suggesting elements in SINEC1B_AMe might have integrated into these genomes recently (Fig. 2b).
TinT patterns of Can-SINEs in the four genomes
PPS_SINEC1_AMe and PPS_SINEC1B_AMe elements
To determine if these PPS elements were new insertion events, we examined the TSD sites and nucleotide divergence of these elements. Elements from an active subfamily should have intact TSDs at the insertion sites. Evident TSD sequences were observed in 78 of 139 PPS_SINEC1_AMe elements (56.1%), but only eight of 111 PPS_SINEC1B_AMe elements (7.0%) (Additional file 2).
Compared to Repbase consensus sequences, the average divergence rates were 7.38 ± 1.37% for PPS_SINEC1_AMe elements, 11.86 ± 2.20% for PPS_SINEC1B_AMe1, and 12.60 ± 2.03% for PPS_SINEC1B_AMe2 (Additional file 2). These rates were smaller than the average divergences of original subfamilies (SINEC1_AMe, 8.82% and SINEC1B_AMe, 13.09%) indicating that the identified PPS elements are young and newly integrated elements in the panda genome. However, compared to the COSEG generated PPS consensus sequences, the divergence rates were 6.39 ± 3.27% for PPS_SINEC1_AMe elements, 14.30 ± 3.61% for PPS_SINEC1B_AMe1 and 16.53 ± 3.85% for PPS_SINEC1B_AMe2(Additional file 2). The average age of PPS_SINEC1_AMe was estimated to be ~ 15.20 Mya (based on BEAST) or ~ 17.42 Mya (based on CpG and non-CpG mutations), which was about the time the panda and polar bear diverged from each other. Additionally, estimated average age of PPS_SINEC1B_AMe1 was ~ 21.54 Mya (based on BEAST) and~ 31.83 Mya (based on CpG and non-CpG mutations). The average age of PPS_SINEC1B_AMe2 was estimated to be ~ 21.45 Mya and ~ 59.13 Mya (based on CpG and non-CpG mutations). It seems that these two SINEC1B_AMe subfamilies are relatively old.
To determine the possible functional impact of PPS elements, we compared their genomic locations to known genes in the panda genome. A total of 45 PPS_SINEC1_AMe, 29 PPS_SINEC1B_AMe1 and 13 PPS_SINEC1B_AMe2 elements inserted into genic regions. Only one PPS_SINEC1_AMe element inserted into a sequence of a non-coding RNA gene (LOC105241841), and all others inserted into intronic regions (Additional file 2).
Potential new Can-SINE subfamily identification
Because SINEC1_AMe and SINEC1B_AMe are large Can-SINE subfamilies, they could contain smaller unique subfamilies, similar to Alu subfamilies in primate genomes . Using COSEG analysis, we further examined the evolutionary dynamics of SINEC1_AMe and SINEC1B_AMe subfamilies to determine whether the three PPS consensuses were new clades within these subfamilies. COSEG analysis on all SINEC1_AMe elements in the panda genome divided SINEC1_AMe into 17 subfamilies, and SINEC1B_AMe elements into 28 subfamilies. These subfamilies were selected for Network analysis with three PPS consensuses and Repbase consensus. After excluding CpG sites, PPS_SINEC1_AMe was grouped with five SINEC1_AMe COSEG consensuses and formed a major group containing 82,157 elements (Fig. 6b). This group is the largest group and 4 substitutions away from the Repbase SINEC1_AMe consensus. By contrast, both PPS_SINEC1B_AMe1 and PPS_SINEC1B_AMe2 were grouped with the Repbase SINEC1B_AMe consensus and three other COSEG consensuses to form the largest group in the network, containing 17,246 elements (Fig. 6c).
Overall, PPS_SINEC1_AMe represents a major group of SINEC1_AMe element, the group has a relatively low divergence rate and young age, and its consensus contains several distinct diagnostic mutations compared to the SINEC1_AMe consensus. Therefore, we conclude that PPS_SINEC1_AMe elements represent a novel Can-SINE subfamily, that we named SINEC1_AMe2, a subfamily separating from the current SINEC1_AMe, following the TE subfamily naming convention .While, evidence from the two PPS_SINEC1B_AMe subfamilies was not sufficiently robust to be new Can-SINE subfamilies.
Characteristics of the new Can-SINE subfamily
We then repeated RepeatMasker analysis against the panda and polar bear genome including the SINEC1_AMe2 consensus to characterize this new Can-SINE subfamily. A total of 143,166 and 106,889 SINEC1_AMe2 elements were found in the panda and the polar bear genomes, respectively. At least 96.8% of them (124,118 elements in the panda and 104,892 in the polar bear) have divergence rate of ≤10%, indicating they are young in both genomes. The SINEC1_AMe2 elements account for ~ 57% of SINE elements with divergence rates of ≤10% in the panda genome (Additional file 1: Table S3). In addition, we identified 19 SINEC1_AMe2 elements with a zero divergence rate in the panda genome, with all being panda specific loci and are absent at the orthologous regions of the polar bear genome. Similarly, we identified 49 SINEC1_AMe2 elements with a zero divergence rate in the polar bear genome that are polar bear specific loci. These results indicate that SINEC1_AMe2 is one of the youngest and most active SINE subfamily in the panda and polar bear genomes.
Characteristics of Can-SINEs in the four genomes
We characterize TEs in the genomes of four Caniformia species, focusing on Can-SINEs. Despite similarities in the four major TE types, obvious differences exist among the four Caniformia genomes particularly in the young and active Can-SINEs. SINEC1_AMe and SINEC1B_AMe are the most abundant Can-SINE subfamilies in the panda and polar bear genomes, with relatively low divergence rates. In contrast, the number of elements in these two subfamilies is much lower in the dog genome. Instead, several other Can-SINE subfamilies (e.g., SINEC1C1_CF, SINEC2A1_CF) are the most abundant subfamilies in the dog. By contrast, the ferret genome accumulates the highest numbers of MVB2 in the four genomes. This extensive genomic variability will be a useful resource for the study of ancestral relationships among different Caniformia lineages. It is noted that some of the differences between genomes might be underestimated because the draft genomes of panda, polar bear, and ferret were assembled by short read sequences, which is known to lead the underestimation of some longer repeat sequence copy numbers [4, 29].
The transposition activity of Can-SINEs
Previous studies have suggested that a retrotransposon subfamily accumulate its respective copies for a certain period of time and then become quiescent. Other newer subfamilies subsequently become active, and the pattern repeats itself [16, 30]. This pattern is well illustrated by the Alu family of SINEs in primates. During the early stages of primate evolution, AluJ subfamilies were active. The activity of these subfamilies was later reduced, and the AluS subfamilies became active. The AluY subfamilies were even more taxonomically specific in that they began their expansion in primates more recently [16, 30]. The propagation of retrotransposon subfamilies in different lineages varies greatly over the evolution of Carnivora. Walters-Conte et al.  identified two Can-SINE subfamilies within the Feliformia (cat-like carnivores) suborder. These two subfamilies shared close relationships with the Repbase SINEC_Fc1 and SINEC_Fc2 subfamilies, and one of the two families (SINEC_Fc1) arose recently and showed evidence for active proliferation. Within the Caniformia, Wei and Kirkness  suggested that the SINEC_Cf repeats comprised a major Can-SINEs subfamily that had undergone recent expansion and identified at least 10,000 polymorphic SINEC_Cf loci in different dog breeds. These studies demonstrated that the transposition activity of Can-SINE subfamilies vary widely over evolutionary time, with periods of low and high activity. Moreover, the transposition rate varies greatly among different Caniformia lineages. Only a few Can-SINE subfamilies in each species demonstrate average divergence rates of ≤10% (Fig. 2b), indicating the young and active subfamilies are few in Caniformia as is found in other mammals .
TinT analysis indicated that SINEC1_AMe and SINEC1B_AMe subfamilies are the two most active subfamilies in the panda and polar bear genomes and have maintained high transposition activities for a long time (Fig. 3), which is further supported by their accumulated high copy numbers and relatively low divergence rates (Fig. 2). It is also worth noting that the transposition activity estimated by TinT might result from the lack of annotation of Can-SINE subfamilies in the panda and polar bear. It is possible that mutliple recent Can-SINE subfamilies are lumped under one big subfamily (such as SINEC1_AMe and SINEC1B_AMe), leading to an extended transposition activity period.
Unlike the panda and polar bear genomes, the two most active subfamilies in the dog genome are SINEC2A1_CF and SINEC1C1_CF, and these subfamilies have maintained transposition activity to the present in the TinT analysis. However, previous dog genome research  demonstrated SINEC_Cf was the dominant young subfamily probably because these two subfamilies have not been recognized until recently. In particular, the SINEC2A1_CF subfamily accumulates high copies (~ 180,000 elements) in the dog genome compared to the other three Caniformia species (only 11–103 elements) (Fig. 2a). It also has the lowest average divergence rate (5.77%, Fig. 2b) and the longest period of transposition activity (Fig. 3). All the results suggest that the SINEC2A1_CF subfamily has actively propagated specifically in the dog lineage and may have played an important role in shaping the architecture of the dog genome. We also found that MVB2 is the dominant Can-SINE subfamily in the ferret genome with the highest number, relatively low divergence rate and the longest period of transposition activity. Overall, it is suggested that different Can-SINE subfamilies actively propagated during the divergence and radiation of Caniformia lineages, represent a great source of genomic diversity among Caniformia species.
The differences in Can-SINE amplification between the four genomes might due to the characteristics of SINE amplification. During evolution, only a few SINE insertions function as source of novel SINE transcripts during amplification at any given time . With the accumulation of mutations, transcription of a given master copy will eventually be inhibited and be replaced by an alternate copy. As results, SINE subfamilies possess the diagnostic nucleotide sequence to be classified into phylogenetic lineages . Our results of Can-SINE amplification are in line with the evolutionary relationships between the four Caniformia species. The dog is an outlier group from the other genomes, and has the most varied Can-SINE subfamilies, in particular the Can-SINE “CF” subfamilies. Similarly, the ferret genome accumulated the highest numbers of MVB2 probably due to the closer relationship of ferret to the America mink than other species.
Considering a large number (10,000) of polymorphic SINE_Cf in dogs , the differences of Can-SINE transposition activity in the four genomes might be underestimated because these polymorphic loci could not be detected in one reference genome and thus this was not investigated in the present study. Besides, polymorphic loci in the four Caniformia species might lead to ambiguous results for identification of species-specific Can-SINE elements. Further studies with genomes from more individuals will assist in identifying the effects of polymorphic elements.
The characteristics of putative panda-specific Can-SINEs and the new Can-SINE subfamily
After the panda-associated genome alignments, we identified 250 PPS elements in the panda genomes that are absent from the orthologous regions in the other three genomes. Further investigation of these PPS insertions allowed us to identify a potentially new Can-SINE subfamily, SINEC1_AMe2, which is distinguished from the current SINEC1_AMe subfamily. SINEC1_AMe2 has similar structures to other Can-SINEs with a conservative tRNA-related region . However, its consensus sequence is distinguishable from the SINEC1_AMe consensus by eight mutations with half of the mutations outside of the hypervariable CpG sites. In the Network analysis of COSEG consensuses within SINEC1_AMe, SINEC1_AMe2 formed the group with the highest copy number. Additionally, several lines of evidence suggest that it is generated from a source element different from SINEC1_AMe. First, the SINEC1_AMe2 consensus was constructed from the panda-specific elements that are young and intact elements with apparent TSD sequences and relatively long poly-A/T tail. Second, SINEC1_AMe2 subfamily had more than 100,000 copies in the panda and polar bear genomes, suggesting it is a major subfamily with high activity. Third, the majority of SINEC1_AMe2 had divergence rates of ≤10%, suggesting it is a younger subfamily than SINEC1_AMe and it represents most of the young elements within the current SINEC1_AMe subfamily (Additional file 1: Table S3). In particular, we identified 19 and 49 SINEC1_AMe2 elements in the panda and polar bear genome with a zero divergence rate. In contrast, no such elements were found in SINEC1_AMe elements, indicating SINEC1_AMe2 is different from SINEC1_AMe. All of these 68 elements are panda or polar bear specific loci, which provides a strong support that the SINEC1_AMe2 subfamily is still active in the panda and polar bear genomes. Considering all of the evidence, we conclude that SINEC1_AMe2 is a new subfamily separate from SINEC1_AMe, and it started integrating into the genomes of the panda and polar bear 15–17 Mya. Since then it has been active and successfully propagated as the dominant active subfamily in both genomes.
By contrast, the data does not support the PPS_SINEC1B_AMe elements form a new Can-SINE subfamily. Although two consensuses were generated from young PPS_SINEC1B_AMe elements, they are shorter than SINEC1B_AMe consensus, and have most of mutations at CpG sites or (CT)n sites which might not be fixed diagnostic mutations. Furthermore, elements in these subfamilies lack TSD sequences, exhibit relatively high divergence rates, and the estimated insertion age is old. The two PPS_SINEC1B_AMe consensuses were grouped with SINEC1B_AMe consensus in the network analysis (Fig. 6c), which indicates they are similar to SINEC1B_AMe consensus. In combination, the evidence does not support that PPS_SINEC1B_AMe elements are from a new subfamily within SINEC1B_AMe. It is worth noting that not all new Can-SINE subfamilies have been surveyed in the present study, some clades of subfamilies in the SINEC1_AMe and SINEC1B_AMe Network might contain smaller new subfamilies that are worthy of more comprehensive investigations in the future.
In the present study, we reannotated TEs based on updated Repbase library and conducted a systematical comparison of Can-SINEs compositions, subfamilies and transposition activities between the genomes of four Caniformia species: panda, polar bear, dog, and ferret. There are significant differences in the copy numbers, average divergence rates and transposition activities among Can-SINE subfamilies in the four genomes. We identified 139 PPS_SINEC1_AMe elements and 111 PPS_SINEC1B_AMe elements that inserted in the panda genome but were absent at the orthologous regions of the other three genomes. Further investigations of these PPS insertions allowed us to identify a new Can-SINE subfamily, SINEC1_AMe2, which is distinguishable from the current SINEC1_AMe subfamily. Combing evidences from different analysis, we conclude that SINEC1_AMe2 is a young subfamily and has been active and successfully propagated to be the dominant active subfamily in the panda and polar bear genome.
TE annotation of the four Caniformia genomes
The genome assemblies of four Caniformia species (AilMel_1.0, CanFam3.1, MusPutFur1.0, and UrsMar_1.0) were downloaded from NCBI . The scaffold N50 for each genome assembly varied from 1,281,781-15,940,661 bp (Additional file 4). The genome assemblies were analyzed for TE composition using a local installation of the RepeatMasker program . The standard Repbase consensus library (version 2015.8.7) was used for the analysis . Cross_Match v0.990329  was used for sequence search in the RepeatMasker program. Perl-scripts calcDivergence-fromAlign.pl and RepeatLandscape.pl in the RepeatMasker package were used to create the repeat landscape.
Using RepeatMasker annotation, the frequencies, average divergence rates, proportion of genome, and subfamily distributions of all TEs in the four genomes were extracted and calculated using a custom Perl script and R statistic software (version 3.2.3).
Transposition in transposition analysis
TinT analysis can be used to determine the relative age of transposon families [27, 28], and has been successfully applied in primates, sharks, and amniotes transposons studies [41, 42, 43, 44]. TinT analyses were performed for Can-SINEs in panda, polar bear, ferret and dog genomes with default parameters, except for Minimum Repeat Extension (4).
Panda-associated two-way genome alignments
We performed in silico screening for putative panda-specific elements based on the two-way alignments described in Kent et al.  and Doronina et al. . All elements belonging to the two most active SINE subfamilies in the panda genome (SINEC1_AMe and SINEC1B_AMe) were selected. Elements in these subfamilies were filtered for the length (SINEC1_AMes ≥150 bp, SINEC1B_AMes ≥170 bp) and the average divergence rate (SINEC1_AMes ≤10%, SINEC1B_AMes ≤15%) to obtain potential young insertions. The selected elements were extracted with 500 bp of flanking sequence on each side for comparison of orthologous loci among the four Caniformia species. A local installed Blast+ 2.2.28 tool was used to compare the flanking sequences to other three genomes (panda vs polar bear/dog/ferret) . The orthologous loci of these elements in the four species were then obtained. Loci with alignment length covering ≥90% of the query sequence and having identity ≥80% were selected. A locus that had ≥80% identity in the flanking sequence across all species was considered an orthologous locus and is referred as “orthologous locus” in the text. A custom Perl script (parse_blast6.out.pl) was then used to convert the results to the gff format. Bedtools  was used to determine the overlapped region between the Blast+ and RepeatMasker outputs. Elements that were present in the panda genome and absent in the other three genomes at the orthologous loci were considered PPS elements. To investigate distributions of these PPS elements, the annotation file for the panda genome was downloaded from NCBI  and Bedtools was used to obtain the overlapping regions between PPS elements and genes.
Analyses of PPS Can-SINE elements
The COSEG program  has been designed to identify repeat subfamilies based on significant co-segregating mutations. During analysis, COSEG ignores non-diagnostic mutations and potentially misleading mutational events to give an accurate representation of relationships between subfamilies of elements. We applied COSEG program to PPS_SINEC1_AMe and PPS_SINEC1B_AMe elements to generate the consensus sequences for subfamilies. The minimum subfamily size of COSEG was set at 10 for PPS elements. The Repbase SINEC1_AMe and SINEC1B_AMe consensus sequences  were selected as original template for PPS_SINEC1_AMe and PPS_SINEC1B_AMe respectively. To better understand the relationships between the PPS subfamilies and other Can-SINE subfamilies, Network analysis was then performed combining consensus of PPS_SINEC1_AMe, and PPS_SINEC1B_AMe with consensus sequences of all Can-SINE subfamilies downloaded from Repbase [39, 51]. All consensus sequences were aligned using Mega5 software  and converted to the .rdf file format by the DNAsp program . The .rdf files were then imported to the Network program, and the median-joining analyses were run.
We also performed COSEG analysis on the all SINEC1_AMes elements and SINEC1B_AMes elements in the panda genome separately to investigate potential subfamilies within them. The minimum subfamily size of COSEG was set at 1623 for SINEC1_AMes and 803 for SINEC1B_AMes, and the original template was their corresponding Repbase consensus sequence (SINEC1_AMe and SINEC1B_AMe, respectively). Then the Network analysis was conducted on all COSEG SINEC1_AMe consensuses plus PPS_SINEC1_AMe, and all SINEC1B_AMe consensus plus PPS_SINEC1B_AMe with Repbase consensus, to investigate relationships of the PPS subfamilies to other SINEC1_AMe and SINEC1B_AMe elements.
We manually checked the target site duplication sequences for all identified PPS elements. To do this, 30 bp flanking sequence including TSDs was extracted by an in-house Perl script. The 6–15 bp identical flanking sequences or flanking sequences with one different nucleotide on both sides of an element were regarded as the TSDs.
Age estimation of PPS_SINEC1_AMe subfamily
Two methods were applied to estimate the age of PPS SINEC1 subfamilies. For the first method, the consensus sequences of PPS_SINEC1_AMe and PPS_SINEC1B_AMe were aligned with their members using Mega5 software after removing insertions and poly (A/T) tails. Elements that contained deletions larger than 50 bp were excluded from the analysis. Due to the low-quality alignments in the (CT)n regions, the region was also excluded from the age estimates of the PPS_SINEC1B_AMe elements. Mutations of both subfamilies were divided to CpG and non-CpG mutations. An age estimate for each subfamily was determined by an average of results based on the two types of mutations (CpG and non-CpG mutations) with a custom Perl script . For the second method, the BEAST program, applied to estimate dates of divergence using transposon data previously published [55, 56], was used to estimate the age of the PPS_SINEC1_AMes and PPS_SINEC1B_AMes elements with the following parameters: Site Heterogeneity = ‘gamma’; Clock = ‘strict clock’; Species Tree Prior = ‘Yule Process’; Prior for tmrca = ‘Normal distribution’ with 2.0 standard deviation; Prior for clock.rate = ‘Uniform’ with initial rate set at 0.0013 (neutral mutation rate) and upper rate set at 0.0104 (CpG mutation rate) to include the major mutation rates found in the panda genome. All other parameters were set at default settings. All chosen parameters are similar to the previous study . The estimated age was determined by tmrca with a calibration time of 15.5 Mya based on the divergence time between the panda and polar bear reported in previous studies (12~ 19 Mya) [4, 21, 23, 58, 59].
PCR validation and DNA sequencing of PPS can-SINEs
To validate the PPS elements, PCR was performed on a four-species Caniformia DNA panel including the giant panda, polar bear, dog and mountain weasel (Mustela altaica) (a species closely related to the ferret) (Additional file 5). Twenty-three SINEC1_AMes and one SINEC1B_AMe were randomly chosen for validation. Primers were designed based on the 300 bp flanking sequences using Primer 3 [60, 61] and were listed in Additional file 3. PCR amplification on each locus was performed in 20 μL reactions with 10–50 ng genomic DNA, 60 nM of each oligonucleotide primer, 10 μL Master Mix. PCR reaction conditions were as follows: an initial denaturation at 94 °C for 5 min, followed by 35 cycles of denaturation at 94 °C, annealing at the previously determined optimal annealing temperature (Additional file 3), and extension at 72 °C for 1 min each, followed by a final extension of 72 °C for 5 min. PCR products were analyzed on 2% agarose gels stained with 0.5 μL GoldView (TIANGEN) and visualized with UV fluorescence. Additionally, all PCR products were Sanger sequenced to verify bp composition .
Characteristics of the potential new subfamily
We rerun the RepeatMasker program against the giant panda and polar bear genomes by combining the consensus of the potential new SINEC1_AMe2 subfamily with other Repbase Can-SINE consensuses. Elements belonging to the potential new subfamily were described, from where elements with zero divergence rate to the consensus were identified. The 500 bp flanking sequences on both sides of these young elements were further extracted to investigate the presence/absence pattern in the panda and polar bear genomes.
We thank Guo Cai and Xiuyue Zhang for providing animal samples. We thank Tianlin Sun for the useful advice of BEAST analysis. We thank Megan Price for langue editing.
This work was supported by the Chengdu Giant Panda Breeding Research Foundation Project [CPF2014–13].
JL and JX initiated and designed the study. CP, LN, JD, JY, XZ and CZ conducted this work, and CP wrote the manuscript. JL and JX edited the manuscript. All authors read and approved the final manuscript.
This study complies with all applicable sections of the final rules of the Animal Welfare Act (Code of Federal Regulations, Title 9), The Public Health Service Policy on Humane Care and Use of Laboratory Animal Welfare, and The Guide for the Care and Use of Laboratory Animals from the National Research Council.
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 25.Gittleman JL. Carnivore behavior, ecology, and evolution. Q Rev Biol. 1996;48(4925):1515.Google Scholar
- 37.National Center for Biotechnology Information, U.S. National Library of Medicine: (2015) http://www.ncbi.nlm.nih.gov/genome/. Accessed 13 Feb 2016.
- 38.RepeatMasker: (2016) http://www.repeatmasker.org. Accessed 16 Dec 2015.
- 39.Repbase. Genetic Information Research Institute. 2001. http://www.girinst.org/repbase/update/. Accessed 13 Feb 2017.
- 40.Cross_match: (2015) http://www.phrap.org/phredphrapconsed.html. Accessed 28 Apr 2016.
- 47.Blast: (2015) https://blast.ncbi.nlm.nih.gov/Blast.cgi. Accessed 23 Apr 2016.
- 48.Bedtools: (2010) http://bedtools.readthedocs.io/en/latest/. Accessed 5 Apr 2016.
- 49.National Center for Biotechnology Information, U.S. National Library of medicine: (2017) ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/004/335/GCF_000004335.2_AilMel_1.0/GCF_000004335.2_AAilMe_1.0_genomic.gff.gz . Accessed 5 Apr 2017.
- 50.RepeatMasker: (2015) http://www.repeatmasker.org/COSEGDownload.html. Accessed 16 June 2016.
- 51.Network 5: (2016) http://www.fluxus-engineering.com/sharenet.htm. Accessed 24 June 2016.
- 63.Timetree: (2018) http://www.timetree.org/. Accessed 24 Sept 2018.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.