Background

Understanding the molecular mechanisms underlying the adaptation of woody plants to local environmental conditions remains a big challenge in biology due to their long and perennial life history. However, woody plants represent a large proportion of biodiversity on the earth and harbor many different phenological traits that herbaceous plants do not feature (https://www.worldwildlife.org). One such trait is the continuous flowering behavior of rose, an important crop of high importance in human society. Guaranteeing a constant supply of raw materials for cut flowers and related products, continuous flowering becomes one of the most important biological and economical traits for roses [1]. Therefore, the genetic control and related gene-regulatory-network for continuous flowering regulation attracts efforts for many years not only from scientists but also from breeders [1, 2]. The number of QTL regulating continuous flowering remains disputable [1,2,3]. RoKSN, a homolog of Arabidopsis TFL1-like gene in roses, is the only known gene responsible for continuous flowering [4, 5].

Domestication of cultivated roses mainly involves hybridization among more than a dozen species [2, 6,7,8]. Frequent inter-species crossing/backcrossing and polyploidization of roses has made the classification of roses very difficult [9,10,11,12]. A set of high-quality and well-characterized genomic tools/resources are necessary for understanding the biology and domestication of modern roses that encompass more than 30,000 cultivars [13]. Recently, several genetic mapping populations have been developed (see reviews [1, 2]) and determination of rose genomes is carried out [14] with the sequence released very recently for a doubled-haploid of Rosa chinensis ‘Old Blush’ [15]. However, due to high-level heterozygosity caused very likely by inter-species crossing and polyploidization, achieving an accurate and complete rose genome seems not so easy. Alternatively, a comprehensive gene expression atlas can be constructed with multiple tissues in different species.

The first sets of gene expression atlas were constructed using microarrays containing about 350 (tetraploid R. hybrida) [16] and later with about 4800 selected ESTs (R. chinensis, R. wichuriana, and R. hybrida) [17]. A more comprehensive database containing about 80,714 transcript clusters for R. chinensis ‘Old Blush’ was constructed from 13 tissues/organs at different developmental stages or under different abiotic and biotic stresses with Illumina and 454 sequencing platforms [18]. Several recent studies have also been done with various Rosa species for different purposes [19,20,21,22] even for a single-nucleotide-polymorphism (SNP) array [23, 24]. Though all these studies have promoted significantly the understanding of the rose biology, the quality of these transcriptomes is normally relatively poor with low N50 value, poor completeness, and short average length. Very recently the publishing of genome sequences of a doubled-haploid for R. chinensis ‘Old Blush’ set a milestone for rose research [15, 25]. However, missing half genome information of roses, most of which often feature high heterozygosity due to frequent intra−/inter-species hybridization and polyploidization, might cause lacking of power in understanding the genetic bases of roses traits associated with recessive markers [3].

To identify the molecular components underpinning rose specialty and to find molecular markers for clarifying phylogenetic relationship of Rosaceae, we generated a set of high quality reference transcriptomes for two rose species, which featured at least six pair of contrast traits [3], by sequencing three tissues at different developmental stages and by integrating published datasets. We identified about 4447 transcripts conserved in Rosaceae plants, among which 405 were under significant selection pressure, and 164 transcripts present only in roses.

Results

Transcriptome sequencing and assembling

RNA samples from shoot materials and young leaves were profiled with high-throughput sequencing (Figs. 1 and 2). This resulted in 421.8 million and 427.5 million clean reads for Rosa chinensis ‘Old Blush’ (OB) and R. wichuriana ‘Basye’s Thornless’ (BT), respectively (Table 1). The raw sequence files have been uploaded to the National Center for Biotechnology Information Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra/) under the accession numbers SAMN07808857–07808870.

Fig. 1
figure 1

Leaf (a) and shoot (b) materials used for RNA-seq in this study. For each panel, left for Rosa wichuriana ‘Basyes’ Thornless’ (BT), and right for R. chinensis ‘Old Blush’ (OB). Bars = 1 cm

Fig. 2
figure 2

Working flow for assembling of reference trancriptomes and identification of Rosacaeae-common and Rosa-specific transcripts. Main steps were shown in boxes with key transcripts numbers given. Major tools used in these analysis were marked in blue. Dashed arrows and boxes indicated the data generated from this study could be explored in these applications

Table 1 Summary of sequencing strategies and sequences obtained

Next we generated the assemblies for BT (68,612 transcripts) and OB (74,975 transcripts) with transcripts N50 about 2099/1732 bp (BT/OB) and average length about 1338/1170 bp (BT/OB). This was much better than previously published data for these two species (Table 2, Additional file 1: Table S1, Additional file 2: Table S2; Fig. 3a) [18, 20, 26, 27] and other species/materials in Rosa [19, 21, 22, 28]. A further assembly with all available data for OB produced even better transcript N50 (2092 bp) and average length (1359 bp). Comparing the assembled OB transcripts with the genome assembly (v2.0_a1) that containing 45,469 coding and 4918 non-coding genes [15] revealed that almost 100% of these transcripts could be mapped. On the other hand, by looking for the genes included in the genome assembly in the assembled transcripts we identified about 97% of these genes in our OB assemblies. Mean GC content of all assemblies (44.2–46.4%) was comparable to that of published (45.8–46.5%) roses (Table 2). A BUSCO analysis revealed a high proportion of complete (C) and single copy (S) from 54.4 to 68.8%, and complete (C) and duplicated (D) from 24.5 to 28.2%. Fragmented (F) and missing (M) BUSCO items occupied about 5.6–17.4% (Fig. 3b; Additional file 1: Table S1). The prediction of high number of transcripts could be correlated with high frequency of intra−/inter-species hybridization and polyploidization in roses. Taken together, these results suggest the high quality, completeness, and coverage of the assembled BT and OB transcriptomes.

Table 2 Statistics of final assemblies for this study and published data
Fig. 3
figure 3

The assembly of high quality transcriptomes for roses. a Length distribution in proportion of assembled unigenes for the two species, Rosa chinensis ‘Old Blush’ (OB, bars filled in grey color), and R. wichuriana ‘Basyes’ Thornless’ (BT, open bars). Bars filled in black color mark the length distribution of shared transcripts between the two species (coreset1; see below and main text). b BUSCO analysis shows the completeness of assemblies and coreset1. c Annotation results of the assembled unigenes and core-sets for Rosa. The coreset1 is between the two species, while coreset2 is for the unigenes shared among Rosa (see Fig. 2) based on published and newly collected data from this study. For each category (Nr_plants, GO, Uniprot, Swissprot and COG databases), total unigene counts annotated in different databases besides the proportion (in brackets) are given. Shared and total unigenes annotated by all databases are also given. d Venn diagram shows the results of coreset1 identification. About 10,773 transcripts were identified at the 95% sequence identity level between the two species

Functional annotation

Five databases, including NR, GO, Uniport, Swissprot and COG databases, were explored to annotate the function of these assemblies. About 72.7% (BT) to 74.4% (OB) transcripts could be annotated with either database, while only about 17.6% (OB) to 18.6% (BT) transcripts with annotation shared for all databases (Table 2). Interestingly, about 67.4% (BT) to 71.8% (OB) transcripts features GO annotation, making the GO enrichment analysis feasible in following steps. Detailed annotation information was included in Fig. 3c as well as Additional file 2: Table S2 and Additional file 3: Figure S1. For both species, categories related to Fragaria presented the biggest proportion of transcripts (Additional file 2: Table S2).

Identification of the conserved orthologous transcript elements sets between OB and BT (coreset1), and for Rosa (coreset2)

With an aim at identifying the transcripts shared for all Rosaceae plants, we first identified the transcripts shared between the two species. With a relatively stringent threshold (identity = 90%, minimum length coverage =0.7, minimum alignment length = 200 bp) we identified 10,073 unique transcripts shared by the two species (coreset1; Fig. 3d and Table 2). Interestingly, coreset1 showed more than 80% of the transcripts with annotation (Fig. 3c and Additional file 4: Table S3).

We next screened for the transcripts common for all the Rosa plants with sequence identity set at 80%. With this threshold, about 5959 transcripts (N50 at 2326 bp and mean length at 2161 bp) were shared in all Rosa species (coreset2) with most of them well annotated (98.7%; Table 2; Fig. 3c).

Identification of the Rosaceae-common and Rosa-specific transcripts

One of the purposes of this research was to find transcripts/markers that could be potentially used for phylogenetic reconstruction of Rosaceae plants. We then compared the coreset2 transcripts with genes from M. domestica, R. occidentalis, P. avium, and Fragaria veseca, the closest relatives of Rosa (Fig. 4a). About 4447 transcripts were found to present in all five genus (Rosaceae-common; 74.6% of coreset2 transcripts), while only 164 transcripts were specific for Rosa plants (Rosa-specific; 2.75% of coreset2). Most of the Rosaceae-common transcripts were annotated (4228 or about 96%; Table 2 and Additional file 5: Table S4) and could encode for proteins (4341; 97.6%). About 504 Rosaceae-common transcripts belong to the 1440 BUSCO single-copy genes (35%; Additional file 5: Table S4, Additional file 6: Table S5). GO enrichment assay revealed that most of these transcripts were involved in very important and basic function (Fig. 4b, Additional file 5: Table S4). Phylogenetic analysis showed that most of Rosaceae-common transcripts supported the clustering of P. avium with M. domestica (2812 or 65%; Fig. 4c upper panel and Fig. 4d black dots) or R. chinensis with F. veseca (1436 or 33%; Fig. 4c lower panel and Fig. 4d blue dots) (Additional file 7: Figure S2).

Fig. 4
figure 4

Identification and characterization of Rosaceae-common potential coding gene. a Venn diagram shows the Rosaceae-common and Rosa-specific transcripts. Note that, except Rosa, transcripts specific for other genera were not identified (marked with na). For that we are not interested in other share sets. b GO enrichment analysis of the 4447 Rosaceae-common transcripts (http://bioinfo.cau.edu.cn/agriGO). X-axis shows the enrichment fold of specific GO terms in comparison with the background. BP, CC and MF mean biological process, cellular component and molecular function separately. The area indicates gene counts. c Representive phylogentic topologies based on 4447 Rosaceae-common genes. Upper panel indicates about 65% topologies (2812) supporting the clustering of Prunus with Malus, while the topology in lower panel is supported by 33% genes (1436). Numbers on branches indicate distance. d Three-dimensional plots for the genetic distances of the 4447 transcripts between Rosa and Fragaria (X-axis) /Malus (Y-axis) /Prunus (Z-axis). Black and blue dots mark the genes supporting the topologies in C (Black for upper panel and blue for lower panel), while gray dots show genes supporting other topologies. e Distribution and GO enrichment analysis of the 409 selected Rosaceae-common transcripts. Y-axis shows the enrichment fold of specific GO terms in comparison with the background. Only four GO items are significantly enriched (marked in orange color). f Clustered heat map comparing scaled expression values for the 409 selected Rosaceae-common transcripts. Yellow indicates higher while purple marks lower expression. Blue and red bars indicate membership in the identified transcription clusters

Features of the Rosaceae-common transcripts

Among the 4447 Rosaceae-common transcripts, 4341 (97.6%) had the coding potential and were used for selection pressure analysis with the ratio of dN (non-synonymous) vs. dS (synonymous) changes. Malus/Prunus/Fragaria orthologous sequences were taken as background (Additional file 8: Table S6). This analysis identified 409 transcripts significantly selected with P < 0.05 (after Bonferroni correction). Surprisingly, about 42% these genes (173) displayed a higher expression in shoot apical meristems (SAM; Fig. 4f, blue bar), while the expression of the others being high in leaf of both species or lower expressed in both SAM and leaf materials (Fig. 4f, red bar). These significantly selected genes showed an enrichment GO items for DNA repair (GO:0006281), responses to DNA damages stimulation (GO:0006974), cellular response to stress (GO:0033554) and cellular response to stimulus (GO:0051716) (Fig. 4e, Additional file 9: Table S7). Interestingly, all four GO items featured the same set of nine genes (Additional file 10: Table S8) [29,30,31]. Though absolute expression level differed, eight of the nine genes showed an increased expression in materials of shoot apical meristem (Additional file 11: Figure S3), suggesting that these genes might play important roles during the development or environmental adaptation of roses.

Characteristics of the Rosa-specific transcripts

Among the 164 Rosa-specific molecules, 147 had protein-encoding potential with 136 of them having annotation (65 were uncharacterized previously; Additional file 12: Table S9). About 72 of these Rosa-specific genes were expressed at relatively lower level in both leaf and SAM materials (Fig. 5a, green and black bars), while around 92 of them (~ 56.1%) were expressed higher in SAMs than leaf materials (Fig. 5a, blue and yellow bars). No GO item was significant enriched. However, transcripts encoding for putative F-box family members were strongly enriched (24x; Fig. 5b; Additional file 12: Table S9). F-box proteins represent one of the largest super-families in eukaryotic organisms, and contain F-box motif to recognize substrate proteins in ubiquitin-proteasome pathway [32, 33]. F-box proteins play a pivotal role in many physiological activities such as cell-cycle progression, transcriptional regulation, hormone signaling, programmed cell death and cell signaling transduction. F-box proteins can be distinguished to three classifications: F-box/kelch-repeat type, F-box/LRR-repeat proteins, and other F-box proteins. At least seven Rosa-specific transcripts belonged to F-box/LRR-repeat type, which might play essentials roles in adaptation of biotic stresses (Additional file 12: Table S9). Coincidently, four TMV resistant genes, which are essential for defenses against various plant viruses [34, 35], were also significantly enriched in these Rosa-specific genes (Fig. 5b). These indicate that roses might have evolved new defense-related proteins to against its specific biotic stresses. Further experiments will be required to address the biological roles of these Rosa-specific transcripts.

Fig. 5
figure 5

Identification and characterization Rosa-specific transcripts. a Heat map comparing scaled expression values for the 164 Rosa-specific transcripts. Yellow indicates higher while purple marks lower expression. Blue, green, yellow, and black bars indicate membership in the identified transcription clusters. b F-box and TMV resistance protein were significantly enriched in Rosa-specific transcripts. X2 tests were performed online (http://www.quantpsy.org/chisq/chisq.htm) by comparing the Rosa-specific transcripts number with those from Rosaceae-common, coreset1 and coreset2 genes. P values were corrected with Bonferroni correction

Discussion

As one of the most important horticultural plants, rose has its special biology. Continuous flowering, fragrance, flower shape, thorn and many traits not presenting in Populus and other woody model plants could be found in roses, hence roses are now becoming a model woody species for understanding the molecular mechanisms regulating these traits. Interestingly, the breeding of modern roses often involves frequent hybridization and polyploidization among species, which often feature stronger diseases resistance and cold resistance, better fragrance and lack of prickles [36,37,38]. On the other hand, tracing the processes and history of modern roses domestication and breeding remains a challenge [2].

In this report, we produced high quality transcriptome assemblies for R. chinensis ‘Old Blush’ (OB) and R. wichuriana ‘Bayes’ Thornless’ (BT) with transcripts N50 above 2 kb and mean length about 1.3 kb. By incorporating published data for OB, we generated an improved assembly with mean transcript length longer than 1.3 kb. We identified 10,073 highly conserved transcripts (coreset1) between OB and BT with BUSCO assay confirming the high level of completeness with these assemblies. As coreset1 transcripts were based on a relatively stringent level of sequence identity (90%), they could directly be used to evaluate the differential expression of orthologous genes between species.

These assemblies were then explored to identify about 5959 transcripts shared by Rosa sp. Finally we detected 4447 transcripts shared among Rosaceae, and about 164 transcripts only present in the Rosa transcriptomes. Since comparative selection pressure analysis provides an efficient way for understanding the molecular signatures underlying morphological trait variation and domestication [39,40,41], we next identified the genes under positive selection in roses among these Rosaceae-common transcripts. This analysis detected 409 rose genes significantly selected with around 40% of them being highly expressed in SAM tissues, suggesting that they might play important roles in the development of young organs/tissues of roses. It’s noteworthy that nine of these positively selected genes belong to GO items related to DNA repair and stress adaptation (Fig. 4) [29,30,31, 42]. This strong enrichment seems be associated with a strong requirement in the biological processes upon genome damage/confliction due to switch of environmental conditions and/or frequent intra- and inter-species hybridization. The Rosaceae-common dataset contains about 504 single-copy transcripts overlapped with the BUSCO set of genes. These single-copy transcripts could be directly used to clarify the Rosaceae phylogenetic relationship (Fig. 4), a challenge likely caused by frequent hybridization, rapid radiation, polyploidization and domestication [43,44,45,46,47].

In contrast to the Rosaceae-common transcripts, the identification of Rosa-specific transcripts seemed very interesting. Although the domestication processes had been documented [2], the evolutionary history and molecular mechanisms controlling traits special for roses are still not clear [1]. About half of the 164 transcripts are uncharacterized or without known GO annotation. These Rosa-specific transcripts might be related to the phenotypes that have not been characterized in other species. It’s possible that sequences of these transcripts have significantly diverged but remain similar functions like the homologs in Arabidopsis and roses. Interestingly, we observed a strong enrichment of F-box and TMV resistant proteins. These proteins might play a role in processes involved in biotic stress adaptation of roses.

Conclusions

In this study, we provided better quality transcriptome assemblies for roses, and pinned out the genes might make rose special. In addition, we identified transcripts common in Rosaceae, and these should help us to clarify the phylogenetic relationships of Rosaceae plants.

Methods

Plant materials and data generation

R. chinensis ‘Old Blush’ (OB) and R. wichuriana ‘Basye’s Thornless’ (BT) plants were grown in the glasshouses at the Flower Research Institute of Yunnan Academy of Agricultural Sciences (Kunming, Yunnan, China). Leaf materials of about 4 cm length (3.5–4.5 cm from base of pedicel to leaf tip with all the leaflets just became flatten; leaves at this stage were supposed to become completely functional for photosynthesis; Fig. 1) were collected in autumn (November 21st, 2015) from blooming OB plants and non-blooming BT, in spring (March 21st, 2016) when both species were setting flower buds (Fig. 1). Shoot tip materials with most leaf materials removed were sampled on 21st March, 2016. At least six biological replicates, composed each of at least 3 individual plants, were performed for each developmental stage and for each species.

Total RNA was isolated using the RNAprep Pure Plant Kit (Tiangen, Beijing) and mRNA was purified with poly-T oligo-attached magnetic beads (see Fig. 2 for workflow). Around 1 μg mRNA with RNA Integrity number (RIN) score greater than 8 was used for library construction. Fragmentation was carried out using divalent cations under elevated temperature in an Illumina proprietary fragmentation buffer. Sequencing libraries were generated using the TruSeq RNA Sample Preparation Kit (Illumina, San Diego, CA, USA). First strand cDNA was synthesized using random oligonucleotides and SuperScript II. Second strand cDNA synthesis was subsequently performed using DNA Polymerase I and RNase H. Remaining overhangs were converted into blunt ends via exonuclease/polymerase activities and the enzymes were removed. After adenylation of the 3′ ends of the DNA fragments, Illumina PE adapter oligonucleotides were ligated to prepare for hybridization. To select cDNA fragments of the preferred 380 bp in length, the library fragments were purified using the AMPure XP system (Beckman Coulter, Beverly, CA, USA). DNA fragments with ligated adaptor molecules on both ends were selectively enriched using Illumina PCR Primer Cocktail in a 15 cycle PCR reaction. Products were purified (AMPure XP system) and quantified using the Agilent high sensitivity DNA assay on a Bioanalyzer 2100 system (Agilent). Sequencing was carried out on either Illumina NextSeq 500 or Hiseq2000 platform.

Data filtering

Approximately 105Gb pair-end data was generated for all samples (about 52.3Gb for BT and 52.8Gb for OB; Table 1). The final data volume for OB was about 116.1Gb including the published data. Data information for other species/materials was listed in Table 2. The quality of raw reads was assessed and filtered with a custom pipeline using FastQC (V0.10.1; http://www.bioinformatics.babraham.ac.uk/projects/fastqc) and Trimmomatic (V0.36; ILLUMINACLIP:TruSeq3-PE.fa:2:45:10/ LEADING:10/ TRAILING:10/SLIDINGWINDOW:4:25/MINLEN:48) [48]. Adaptor sequences, reads PHRED quality below 92%, and PCR duplicates were all removed with custom perl scripts (https://github.com/ckenkel/annotatingTranscriptomes). Short read archive (SRA) accessions for all data are found in Table 1.

Assembly and functional annotation of transcriptomes for OB and BT

Prior to assembly, data for each species was concatenated (SAM, leaf_Nov and leaf_Mar), and read abundance was normalized to 50x coverage using the in silico normalization tool in Trinity [49] to spare assembly time and minimize memory requirements. Assembly for BT was constructed with data generated from this study, while two assemblies were built up first with the newly produced data and later combining with published data (see Tables 1 and 2). After filtering and normalization the data was about 157 Gb, comprising approximately 1.3 billion normalized read pairs, which were then assembled using optimized parameters (Kmer = 2, min_glue = 5, SS_lib RF) in Trinity (r2014_07–17) [49]. Trinity assembly was clustered by TGICL toolkit to remove identified duplicates [50]. Open reading frames (ORFs) were predicted using Transdecoder (https://github.com/TransDecoder/TransDecoder/wiki). Hmmer3 was used to identify additional ORFs matching Pfam-A domains [51].

The completeness of these assemblies was further evaluated with Benchmarking Universal Single-Copy Orthologs (BUSCO) strategy with RNA mode using 1440 near-universal single-copy orthologs [52]. Functional annotation was performed for each of the transcriptomes at the peptide level using a custom pipeline that defines protein products and assigns transcript names against multiple databases (Fig. 2). Predicted proteins/peptides were analyzed using InterProScan5 [53], which searched all available databases including Gene Ontology (GO:201605). BLASTx analysis was performed with the NCBI non-redundant protein sequences (NR) database, eukaryotic ortholog groups (KOG) database, KEGG ortholog (KO) database, Swiss-Prot protein database, Gene Ontology (GO) database, and protein family (PFAM) database. The resulting .gff3 and .tbl files were further annotated with functional descriptors in Transvestigator (https://doi.org/10.5281/zenodo.10471).

Calling of the conserved orthologous transcript elements set between OB and BT (coreset1), and for Rosa (coreset2)

To identify the transcripts shared between the two Rosa species, we identified coreset1 between OB and BT using orthoMCL [54] and an optimized reciprocal blast method [55] with a sequence identity at 90%, coverage above 0.7 and more than 200 bp length.

To screen for shared transcripts within Rosa (coreset2), we compared coreset1 to other published RNA-seq data for R. multiflora, R. roxburghii, etc. (see Table 2 for the data used in this analysis). Sequence identity was set at 80%.

Identification of Rosaceae-common and Rosa-specific transcripts

We further compared the coreset2 transcripts to known CDS for Malus domestica (v3.0.a1), Prunus avium (v1.0.a1), Rubus occidentalis (v1.0.a1), Fragaria veseca (v4.0) in order to find out the transcripts present in all Rosaceae and only in Rosa plants. Reciprocal blast analysis was carried out at sequence identity of 70% with other parameters the same like for coreset1. As the coreset2 transcripts were used to blast against the CDS from other plants, this analysis did not detect other genera-specific transcripts/genes. Both sets of transcripts were later subjected to GO and KEGG Orthology (KO) enrichment analysis on the basis of a hypergeometric test (FDR < 0.05).

Selection analysis of Rosaceae-common transcripts

Selection pressure analysis was performed for the 4341 transcripts with coding potential in Rosaceae-common set of molecules as previously described [39,40,41]. Putative orthologous genes from M. domestica, P. avium and F. veseca were treated as background, while Rosa transcripts were considered as foreground for the selection analysis. Amino acid alignments were carried out with ClustalO (with default parameters) [56]. Phylogenetic analysis was done with FastTree (−gtr –gamma; http://tree.bio.ed.ac.uk/ software/figtree/). The codeml program in PAML package was used to estimate the synonymous (dS) and non-synonymous (dN) rates in these transcripts according to Yang [57]. Fisher’s exact test with Bonferroni correction was used to compare the significance of selection pressure between H0 and H1 models.