Single-molecule real-time sequencing facilitates the analysis of transcripts and splice isoforms of anthers in Chinese cabbage (Brassica rapa L. ssp. pekinensis)
Anther development has been extensively studied at the transcriptional level, but a systematic analysis of full-length transcripts on a genome-wide scale has not yet been published. Here, the Pacific Biosciences (PacBio) Sequel platform and next-generation sequencing (NGS) technology were combined to generate full-length sequences and complete structures of transcripts in anthers of Chinese cabbage.
Using single-molecule real-time (SMRT) sequencing, a total of 1,098,119 circular consensus sequences (CCSs) were generated with a mean length of 2664 bp. More than 75% of the CCSs were considered full-length non-chimeric (FLNC) reads. After error correction, 725,731 high-quality FLNC reads were estimated to carry 51,501 isoforms from 19,503 loci, including 38,992 novel isoforms from known genes and 3691 novel isoforms from novel genes. Among the novel isoforms, we identified 407 long non-coding RNAs (lncRNAs) and 37,549 open reading frames (ORFs). Furthermore, a total of 453,270 alternative splicing (AS) events were identified, and the majority of AS models in anthers were determined to be approximate exon skipping (XSKIP) events. Among the key genes involved in anther development, AS events were mainly identified in the genes SERK1, CALS5, NEF1, and CESA1/3. Additionally, we identified 104 fusion transcripts and 5806 genes that had alternative polyadenylation (APA).
Our work demonstrated the transcriptome diversity and complexity of anther development in Chinese cabbage. The findings provide a basis for further genome annotation and transcriptome research in Chinese cabbage.
Keywords: Chinese cabbage; Anther; Full-length transcript; Alternative splicing; Fusion transcript
Abbreviations
CCS: circular consensus sequence
lncRNA: long non-coding RNA
ORF: open reading frame
SMRT: single-molecule real-time sequencing
Gene sequencing emerged as a revolutionary technology in the field of biological research. The first of these technologies was Sanger sequencing; however, due to low throughput and poor automation, Sanger sequencing was severely limited in its application in genome and transcriptome analysis. The advent of NGS technologies, such as ABI SOLiD, Illumina Solexa, and Roche 454 systems, stimulated structural and functional genomics studies for diverse plant species. Among these technologies, Illumina sequencing has the advantages of high accuracy, high throughput, high sensitivity, and low cost, and is now the most widely used platform in genome sequencing. C. sativus was the first vegetable crop whose genome was sequenced de novo by NGS. Subsequently, the genomes of major crops, including S. tuberosum, T. aestivum, B. napus, and G. raimondii, were sequenced. Short-read RNA-Seq by NGS is frequently applied for transcriptome analysis. Using short-read RNA-Seq, researchers can obtain profiles for genome-wide expressed genes, including low-abundance genes, as well as new genes and SNPs. Research on gene expression profiling of pollen and anther development in the genus Brassica has accumulated in recent years [4, 5, 6, 7, 8, 9, 10, 11]. However, although NGS technologies are effective, they still have several drawbacks, including the generation of relatively short reads, which may lead to misassembly and gaps. Moreover, short reads are not well suited to accurately detecting structural variations (SVs) and transcript isoforms generated by AS events [13, 14]. Because of this limitation of NGS methods, short RNA-Seq reads must be assembled into longer contigs, a process that is susceptible to misassembly of short reads transcribed from highly repetitive regions or from similar members of multigene families.
This problem may become even more severe for polyploid plants, which often harbor high sequence similarity between coexisting subgenomes and are therefore prone to annotation errors. Moreover, short-read RNA-Seq cannot distinguish between alternatively spliced forms of individual transcripts, which can make up a large proportion of transcripts. For instance, approximately 83.4% of multiple-exon genes are subject to AS in A. thaliana, which contributes to organismal protein diversity without massively increasing the number of genes.
Third-generation sequencing (TGS) technologies, which are characterized by single-molecule sequencing in real time, have recently been developed. The first TGS platform was delivered by Helicos Biosciences, but it proved commercially unviable because it was relatively slow, expensive, and generated short reads (~ 32 bp). Soon after, single-molecule real-time (SMRT) sequencing by PacBio emerged as a unique opportunity for constructing full-length transcripts. The distinguishing feature of SMRT technology is the production of long reads. Initially, the average length of reads generated by SMRT technology was just ~ 1.5 kb, but it is now 10-15 kb. Therefore, SMRT can improve the accuracy of gene models, as it generates reads that cover full-length transcripts. However, SMRT sequencing still has major technical limitations, namely its relatively high cost, lower throughput, and high error rate. Therefore, at present, a combination of NGS technologies and SMRT sequencing is preferable: consensus sequence reads are constructed from raw PacBio subreads and aligned with the reads generated from appropriate NGS platforms. Using this approach, multiple complex genomes have been successfully de novo assembled or improved [22, 23, 24, 25, 26, 27, 28, 29, 30].
SMRT sequencing has already been applied effectively to transcriptome analysis. Well-characterized full-length transcripts are not only beneficial for the analysis of gene structure and alternative splicing, but also greatly improve functional studies of important loci. Early applications of SMRT sequencing to the transcriptome were relatively narrow, and most focused on model organisms such as humans and yeast. Since 2015, SMRT technology has been widely applied to characterize full-length genome and transcript sequences in diverse species. SMRT has facilitated structural genomics and grain transcriptome research in common hexaploid wheat. In danshen, the application of SMRT sequencing to different root tissues revealed that about 40% of the detected gene loci had undergone alternative splicing (AS) events. In maize B73, over 111,000 transcripts from six tissues were identified by SMRT sequencing, unveiling the complexity of the transcriptome. PacBio SMRT was employed for the sorghum transcriptome, and over 11,000 novel splice isoforms, alternative polyadenylation (APA) of ~ 11,000 expressed genes, and more than 2100 novel genes were uncovered at an unprecedented scale. The A. thaliana transcriptome was analyzed by SMRT, enhancing the understanding of differentially expressed AS isoforms under normal conditions and in response to ABA treatment. In moso bamboo, over 42,280 distinct splicing isoforms and 25,069 polyadenylation sites were found. In Dendrobium officinale, the full-length cDNA transcripts of stems and leaves uncovered multiple genes involved in polysaccharide synthesis. The red clover transcriptome was analyzed by SMRT sequencing, and the results uncovered about 29,730 novel isoforms from known genes and 2194 novel isoforms from novel genes, in addition to over 5000 AS events, over 4300 long non-coding RNAs (lncRNAs), and 3700 fusion transcripts.
Using SMRT technology, a total of 113,321 transcripts were obtained from alfalfa leaves at three different developmental stages; the sequencing data uncovered about 7568 AS events and 17,740 lncRNAs. These studies have been crucial for deepening the understanding of the genomes and transcriptomes of these species.
The genus Brassica comprises multiple economically important vegetable and oil crops. The ‘triangle of U’ is well established and refers to three diploid species, B. rapa (A genome, 2n = 20), B. nigra (B genome, 2n = 16), and B. oleracea (C genome, 2n = 18), as well as three amphidiploid species, B. napus (AC genome, 2n = 38), B. juncea (AB genome, 2n = 36), and B. carinata (BC genome, 2n = 34). In 2011, the first draft genome in the genus Brassica, B. rapa genome v1.5, was published. The 283.8 Mb genome was generated using NGS technology with a contig N50 size of 46 kb, and greatly facilitated genomics and molecular biology research, as well as the genetic breeding of B. rapa and other Brassica species. The second version (v2.0) was assembled in 2017. Further improving the scaffold order, the upgraded B. rapa genome v2.5 was 389.2 Mb with a contig N50 size of 53 kb. However, restricted by the read length of NGS technology, the above genome versions had the disadvantages of poor continuity, assembly errors, and a low assembly rate of repetitive sequences. A more recent release, B. rapa genome v3.0, was de novo assembled and re-annotated using single-molecule sequencing (PacBio), optical mapping (BioNano), and chromosome conformation capture (Hi-C) technologies. The total length of B. rapa genome v3.0 was 353.14 Mb, with a contig N50 size of 1.45 Mb and a scaffold N50 size of 4.45 Mb, including 1301 scaffolds and 389 gaps. This high-quality reference genome lays a solid foundation for the development of genetics and functional genomics of B. rapa, especially the cloning of genes regulating important agronomic traits and the analysis of genetic background.
Only by analyzing the molecular mechanisms of trait formation at the genetic level can we carry out targeted genetic breeding, molecular marker-assisted breeding, and even molecular design breeding, thereby greatly improving breeding efficiency and accelerating the development of superior new varieties.
Owing to broad adaptability and numerous end-uses, Chinese cabbage is the most widely cultivated and consumed vegetable within B. rapa. Although the reference genome has been improved using the PacBio Sequel platform, sequence and structural data for tissue-specific mRNAs remain scarce in Chinese cabbage. The main objective of this study was to characterize full-length transcripts in Chinese cabbage anthers using the emerging SMRT sequencing technology to unveil the transcriptome complexity of anther development. SMRT sequencing data, corrected with short-read NGS data, were used to analyze full-length transcripts in anthers to further reveal AS events, lncRNAs, and fusion isoforms in Chinese cabbage. This study provides a valuable resource for further genome re-annotation and increases our understanding of the anther transcriptome.
Transcriptome sequencing and error correction
[Table: Summary of ROI from the PacBio Sequel platform. Libraries: Ant 1-2 K, Ant 2-3 K, Ant > 3 K; columns include mean FLNC length (bp).]
[Table: Classification of reference genome comparison results. Categories include low PID map and high quality map.]
Loci and isoform detection and characterization
[Table: Gene structure annotation. Loci length classes: < 1 K, 1-2 K, 2-3 K, ≥ 3 K; PacBio data set.]
Owing to the characteristics of library construction, we could not guarantee the structural integrity of the 5′ end of the transcripts. Therefore, the full-length status of the FLNC reads and isoforms produced by the PacBio Sequel platform was evaluated only at the 5′ end. With the multiple-exon transcripts of the genome annotation as a reference, isoforms from the PacBio data set with the same orientation and an overlap greater than 20% were screened. If the first splice donor site at the 5′ end of an annotated transcript coincided with the first splice donor site of an isoform from the PacBio data set, the isoform was considered a full-length isoform, and the corresponding FLNC read was considered a full-length FLNC read. Our data indicated that approximately 76.66% of multiple-exon isoforms and 88.22% of multiple-exon FLNC reads contained the same 5′ splice donor site as the reference annotation and were regarded as full length, implying relatively high structural integrity (Additional file 2: Table S5).
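The 5′-end criterion described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the `Transcript` class, the function names, and the choice to measure overlap as a fraction of the annotated transcript's genomic span are all assumptions made for illustration, since the text does not specify the denominator of the 20% overlap.

```python
from dataclasses import dataclass
from typing import List, Tuple

Exon = Tuple[int, int]  # 1-based inclusive (start, end), sorted by genomic start


@dataclass
class Transcript:
    strand: str        # "+" or "-"
    exons: List[Exon]  # hypothetical representation, not a real file format

    @property
    def span(self) -> Tuple[int, int]:
        return self.exons[0][0], self.exons[-1][1]


def first_donor_site(t: Transcript) -> int:
    """Genomic position of the 5'-most splice donor (end of the first transcribed exon)."""
    if len(t.exons) < 2:
        raise ValueError("single-exon transcripts have no splice donor")
    # On the minus strand the first transcribed exon is the right-most one,
    # so its donor lies at that exon's genomic start.
    return t.exons[0][1] if t.strand == "+" else t.exons[-1][0]


def overlap_fraction(isoform: Transcript, reference: Transcript) -> float:
    """Shared genomic span as a fraction of the reference span (assumed denominator)."""
    (a0, a1), (b0, b1) = isoform.span, reference.span
    shared = min(a1, b1) - max(a0, b0) + 1
    return max(shared, 0) / (b1 - b0 + 1)


def is_full_length(isoform: Transcript, reference: Transcript,
                   min_overlap: float = 0.20) -> bool:
    """Same strand, overlap > 20%, and identical first splice donor site."""
    if isoform.strand != reference.strand:
        return False
    if len(isoform.exons) < 2 or len(reference.exons) < 2:
        return False  # the evaluation only covers multiple-exon transcripts
    if overlap_fraction(isoform, reference) <= min_overlap:
        return False
    return first_donor_site(isoform) == first_donor_site(reference)


reference = Transcript("+", [(100, 200), (300, 400), (500, 600)])
intact = Transcript("+", [(120, 200), (300, 400), (500, 580)])
truncated = Transcript("+", [(320, 400), (500, 600)])
print(is_full_length(intact, reference), is_full_length(truncated, reference))
```

In this toy example, the isoform that starts inside the first annotated exon keeps the annotated first donor site and is called full length, whereas the isoform that begins in the second exon is not.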
Functional annotation of novel isoforms
LncRNA and ORF prediction of novel isoforms
Various models of AS
Fusion transcript and APA identification
A fusion transcript, also known as a chimeric transcript, is a transcript formed by joining sequences from two or more separate genes. The mechanisms leading to the generation of fusion transcripts include genomic structural variation, transposition, and trans-splicing after transcription. In this study, we identified 104 fusion transcripts, involving 187 annotated genes (Additional file 2: Table S11). Fusion transcripts were most frequently distributed on chromosome A03, followed by chromosomes A09 and A01. According to the chromosomal distribution, we detected 101 inter-chromosome and 3 intra-chromosome fusion transcripts (Fig. 6f). This result was consistent with those of other species such as maize and red clover. Previous studies have indicated that most fusion transcripts are composed of two genes. Consistent with these studies, all 104 fusion transcripts in our data were composed of two genes. In addition, three fusion transcripts detected by SMRT were randomly selected and experimentally validated in anthers and four other floral organs. The experimental results confirmed the authenticity of these chimeric RNAs (Additional file 1: Figure S1).
At present, the reference genome of Chinese cabbage has been updated to version 3.0 using single-molecule sequencing. However, full-length transcripts, alternatively spliced transcripts, fusion genes, and APA sites of Chinese cabbage have not been well explored at the transcriptional level. Anthers are the male reproductive organs of plants and produce pollen grains. The regulation of anther development is extremely complex and involves a range of biological events. In Arabidopsis, anther development has been divided into 14 stages, which make up two phases: microsporogenesis and microgametogenesis. Briefly, anther development originates from stamen primordium formation, and microspore mother cells undergo meiosis to form haploid microspore tetrads. The microspores are surrounded by callose, and the release of individual microspores from the tetrads requires the action of callase, a callose-degrading enzyme secreted by the tapetum. Then, the microspore wall is synthesized, followed by tapetum degradation, pollen mitotic divisions, septum cell degeneration, stomium differentiation, and finally anther dehiscence, which releases mature pollen grains. These events are relatively independent, yet coordinated in time and space. An abnormality of gene structure or expression in any one of these events may cause loss of pollen function, which can generate male sterile lines. One crucial application of plant male sterility is hybrid seed production, and the advantage of hybrids is that they can increase seed yield and improve stress resistance [50, 51]. Therefore, it is necessary to investigate full-length mRNA information to provide a comprehensive view of splice isoforms in anther development.
PacBio sequencing is an effective platform for sequencing full-length transcripts because it generates long reads, which have an average length of 12 kb. This long read length is why the PacBio sequencing platform can comprehensively analyze the splice isoforms of each gene without assembly. In our work, we analyzed the full-length transcriptome of Chinese cabbage anthers using the PacBio Sequel platform and obtained a total of 1,098,119 CCSs. Of these, 827,322 were identified as FLNC reads, and the length of each sequencing library was consistent with the library standard (Fig. 3; Table 1). Single-molecule sequencing has a high base error rate of about 13%, mainly due to the insertion of extra bases, especially in homopolymers. However, as such errors occur randomly, there is no error bias, unlike that observed with NGS technology. Currently, the most common and effective method to correct PacBio reads is to use high-accuracy data from an Illumina platform. After error correction with short-read RNA-Seq, 725,731 high-quality FLNC reads were obtained, yielding 51,501 isoforms, including 38,992 novel isoforms from 11,398 known genes and 3691 novel isoforms from 2682 novel genes (Additional file 2: Table S6). These results demonstrate that PacBio transcriptome sequencing improves the capacity to obtain full-length transcripts and to discover novel or uncharacterized isoforms and genes. Of the novel isoforms obtained, 407 high-confidence lncRNAs and 37,549 novel isoforms with predicted ORFs were identified (Additional file 2: Table S8 and Table S9). In a previous study of pollen development and fertilization in B. rapa, a total of 12,501 putative lncRNAs were detected with an average length of 373 bp. In our data, the mean length of predicted lncRNAs from novel isoforms was 1127 bp (Additional file 2: Table S8). Previously, the B. rapa genome was annotated using only ORFs, and thus no 5′ and 3′ UTRs were defined. In 2013, Tong et al. provided a global transcriptional landscape in B. rapa accession Chiifu-401-42 and defined the 5′ and 3′ UTRs. The mean lengths of the 5′ and 3′ UTRs were 139 bp and 184 bp, respectively. In Arabidopsis, the mean lengths of the 5′ and 3′ UTRs were 88 bp and 184 bp, respectively. In our PacBio sequencing data, the mean lengths of the 5′ and 3′ UTRs from novel isoforms with predicted ORFs were 788 bp and 641 bp, respectively (Fig. 8c, d).
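As a quick consistency check on the read and isoform counts quoted in this paragraph, the arithmetic can be written out in a few lines of Python. The variable names are illustrative; the count of isoforms not classified as novel is derived here by subtraction and is not a figure reported in the text.

```python
# Counts quoted in the text
ccs_total = 1_098_119
flnc_total = 827_322
isoforms_total = 51_501
novel_from_known_genes = 38_992
novel_from_novel_genes = 3_691

# Fraction of CCSs classified as FLNC (the text states "more than 75%")
flnc_fraction = flnc_total / ccs_total

# Novel isoforms, and the remainder implied by subtraction (derived, not reported)
novel_total = novel_from_known_genes + novel_from_novel_genes
not_novel = isoforms_total - novel_total

print(f"FLNC fraction of CCSs: {flnc_fraction:.1%}")
print(f"Novel isoforms: {novel_total}; remaining isoforms: {not_novel}")
```

The FLNC fraction comes out at about 75.3%, in line with the statement that more than 75% of CCSs were FLNC reads, and the two novel categories together account for 42,683 of the 51,501 isoforms.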
In addition to capturing full-length transcripts, another advantage of PacBio sequencing is the ability to detect AS events, which play a crucial role in regulating cellular molecules, cellular physiology, and developmental pathways [45, 55, 56]. The proportion of AS genes in rice, maize, B. rapa, and A. thaliana is 33, 37, 42, and 61%, respectively [57, 58, 59]. Limited by short reads, previous transcriptome studies using NGS technology have only been able to provide individual splice junctions, whereas PacBio sequencing can resolve the alternatively spliced forms of each mRNA. Intron retention (IR) is the most common AS event in various genomes, which supports an intron-definition mechanism for pre-mRNA splicing. In our study, we collected anthers from all developmental stages to capture a relatively comprehensive set of spliced isoforms. In contrast to this general pattern, the majority of the 453,270 AS events we detected were XSKIP (Fig. 9a). Previous studies indicated that alternatively spliced transcripts have tissue-specific expression in various plants [61, 62, 63, 64]. For novel splice junctions in B. rapa, 34.4% of alternatively spliced transcripts were detected in only one tissue. Therefore, differences in the prevalence of AS event types may be related to tissue specificity. These findings illustrate the complexity of the anther-specific transcriptome. A limitation of this study is that the expression levels of transcripts detected by PacBio sequencing were not analyzed, so the expression patterns of different isoforms arising from AS of a single gene could not be compared.
In the model plant Arabidopsis, the key regulatory genes of anther development have been extensively reported and are mainly involved in microsporogenesis, tapetum layer formation, callose layer development, pollen wall formation, and anther dehiscence. Chinese cabbage and Arabidopsis both belong to the Brassicaceae family, and so are closely related and have high sequence similarity. Therefore, we compiled 34 genes that have been confirmed to be involved in anther development in Arabidopsis (Additional file 2: Table S13). In addition to the three whole genome duplications (WGDs) that occurred in Brassicaceae, the Brassica genome has undergone an additional ancient triplication, accompanied by gene fractionation. Thus, based on the best BLASTX search in the Brassica database, we obtained 53 annotated genes from the PacBio annotation data (Additional file 2: Table S13). Of these genes, AGAMOUS (AG), SPOROCYTELESS/NOZZLE (SPL/NZZ), BARELY ANY MERISTEM1/2 (BAM1/2), Extra sporogenous cells/Excess microsporocytes 1 (EMS1/EXS), SOMATIC EMBRYOGENESIS RECEPTOR-LIKE KINASE 1 (SERK1), and TAPETUM DETERMINANT 1 (TPD1) were annotated for microsporogenesis in the early stages of anther development. For tapetal development and programmed cell death (PCD), the key genes detected were Arabidopsis thaliana MYB DOMAIN PROTEIN 80/103 (AtMYB80/AtMYB103), Dysfunctional Tapetum 1 (DYT1), Tapetal Development and Functional 1 (TDF1), Aborted microspores (AMS), and Male sterility (MS1). For pollen exine formation, Callose synthase 5 (CALS5), Cyclin-dependent kinase G1 (CDKG1), AUXIN RESPONSE FACTOR17 (ARF17), No exine formation 1 (NEF1), Ruptured pollen grains 1 (RPG1), Defective in exine formation 1 (DEX1), No primexine and plasma membrane undulation (NPU), CYP703A2, Acetyl-coenzyme A synthetase 5 (ACOS5), MALE STERILITY 2 (MS2), Less adherent pollen5 (LAP5), and ATP-binding cassette G26 (ABCG26/WBC27) were detected.
For pollen intine formation, Cellulose synthase 1/3 (CESA1/3), ARABINOGALACTAN PROTEIN 6/11 (AGP6/11), and Fasciclin-like arabinogalactan protein 3 (FLA3) were identified. For anther dehiscence, MYB DOMAIN PROTEIN 26 (MYB26) and NAC SECONDARY WALL THICKENING PROMOTING FACTOR 1 (NST1) were identified. Moreover, some of the loci were found to contain different alternatively spliced isoforms in our PacBio data set. For example, two loci (BraA07g036270.3C and BraA07g029410.3C) were annotated as SERK1, which is important for anther cell specification, but only BraA07g036270.3C expressed two alternatively spliced isoforms. As early as the meiosis phase, the callose layer begins to be deposited outside the plasma membrane of microspore mother cells, which marks the initiation of pollen wall development. In Arabidopsis, 12 CALS genes were identified, of which CALS5 plays an important role in callose synthesis during the tetrad period. In the cals5 mutant, callose is insufficiently produced around microspores, resulting in defects in primexine formation and subsequently affecting the deposition of sporopollenin in the pollen exine. In our data, CALS5 (BraA09g010050.3C) had ~ 1065 spliced variants, and XMSKIP predominated among the AS models. For primexine formation, two loci (BraA10g025410.3C and BraA02g004840.3C) were annotated as NEF1. Two AS events, IR and XAE, were detected in BraA10g025410.3C, and XSKIP was found in BraA02g004840.3C. In Arabidopsis, multiple CESA genes encoding cellulose synthases associated with pollen intine formation were cloned; the knockout mutants of cesa1 and cesa3 exhibited a gametophytic sterility phenotype with an abnormal pollen wall. Both the annotated CESA1 and CESA3 in Chinese cabbage contained two loci each. BraA01g005650.3C was one of the CESA1 loci, for which 40 alternatively spliced isoforms were detected, including twelve IR, six XIR, twelve XMSKIP, and ten XAE.
The other CESA1 locus (BraA03g057280.3C) had twelve alternatively spliced isoforms, consisting of eight IR, two XIR, and two XAE. Similarly, two loci were annotated as CESA3 (BraA03g002020.3C and BraA02g001600.3C). Eight AS events (six IR and two XAE) were detected in BraA03g002020.3C, and two IR events were detected in BraA02g001600.3C. Our research thus identified AS events in key genes active during anther development in Chinese cabbage.
Full-length transcriptome technology was used to explore the transcripts and splice isoforms present during anther development in Chinese cabbage. A total of 51,501 isoforms were identified using the PacBio Sequel platform. Meanwhile, 453,270 AS events were detected, and XSKIP events were found to have occurred extensively in anther. A total of 53 key genes active during anther development were detected in our PacBio sequencing, of which eight annotated loci had alternatively spliced isoforms. Additionally, 104 fusion transcripts and 24,816 poly-A sites were also predicted in this study. These new findings provide a valuable resource for complete characterization of anther-specific transcriptome data and improved Chinese cabbage genome annotation.
The excellent Chinese cabbage doubled haploid (DH) line ‘FT’ was independently created by our laboratory (Liaoning Key Lab of Genetics and Breeding for Cruciferous Vegetable Crops) using isolated microspore culture technology. The DH line ‘FT’ is characterized by extremely early maturity, heat resistance, an ovoid leaf head, and white petals (Fig. 1a). In August 2018, seeds of the DH line ‘FT’ were placed in a 4 °C refrigerator for vernalization, and then sown in a greenhouse at Shenyang Agricultural University. At the full-bloom stage, three plants with consistent growth were randomly selected, and all the buds of a complete inflorescence from each plant were individually collected in pieces of aluminum foil (Fig. 1b). Then, the anthers from each bud were detached, frozen in liquid nitrogen, and stored at − 80 °C prior to SMRT sequencing (Fig. 1c).
PacBio library construction and sequencing
Total RNA from the three samples was extracted using Trizol reagent (Invitrogen, CA, USA). RNA purity and integrity were monitored using a NanoPhotometer® spectrophotometer (IMPLEN, CA, USA) and a Bioanalyzer 2100 system (Agilent Technologies, CA, USA). RNA contamination was assessed on a 1% agarose gel. RNA concentration was measured using a Qubit® 2.0 Fluorometer (Life Technologies, CA, USA). Equimolar ratios of the total RNA from each sample were mixed together. The full-length cDNA was prepared using a SMARTer™ PCR cDNA Synthesis Kit (Takara Biotechnology, Dalian, China). Size fractionation (1–2, 2–3, and > 3 kb) of full-length cDNA was achieved using the BluePippin™ Size Selection System (Sage Science, Beverly, MA). The size-selected full-length cDNAs were subjected to re-amplification, end repair, SMRT adapter ligation, and exonuclease digestion. After secondary screening by BluePippin™, three SMRTbell libraries were constructed with the Pacific Biosciences DNA Template Prep Kit 2.0. Library quantity and size were measured using a Qubit® 2.0 Fluorometer (Life Technologies, CA, USA) and a Bioanalyzer 2100 system (Agilent Technologies, CA, USA). Subsequently, SMRT sequencing was performed on a PacBio Sequel platform by Frasergen Bioinformatics Co., Ltd. (Wuhan, China).
Illumina RNA-Seq library construction and sequencing
In parallel, the quantity and purity of the equally mixed RNA were analyzed using a Bioanalyzer 2100 and an RNA 6000 Nano LabChip Kit (Agilent, CA, USA). Poly (A) mRNA was isolated using poly-T oligo-attached magnetic beads (Invitrogen). Following fragmentation, the cleaved RNA fragments were reverse-transcribed into a cDNA library using the mRNA-Seq Sample Preparation Kit (Illumina, San Diego, USA). After assessing the library quality, we performed PE300 sequencing on an Illumina Hiseq 2500 at LC Sciences (Hangzhou, China) following the vendor’s recommended protocol.
Quality filtering and error correction
PacBio raw reads were preprocessed and filtered using SMRT Link v5.0. Briefly, CCSs were generated from the subread SAM file with the following parameters: minimum subread length = 50; minimum number of passes = 1, minimum predicted accuracy = 0.8, and minimal read score = 0.65. Then, CCSs were classified into either full length or non-full length reads, by assessing the presence of the 5′ and 3′ adapters and poly (A) tail. FLNC reads were full length CCSs containing all three elements, with no additional copies of the adapter sequence within the DNA fragment.
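The read classification described above can be sketched in Python as follows. This is an illustration only and not the SMRT Link implementation: the `CCSRead` fields, the function names, and the label for full-length reads that carry additional adapter copies are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class CCSRead:
    # Hypothetical per-read summary; SMRT Link reports these differently.
    passes: int
    predicted_accuracy: float
    has_5p_adapter: bool
    has_3p_adapter: bool
    has_polya_tail: bool
    internal_adapter_copies: int


def passes_ccs_filter(read: CCSRead, min_passes: int = 1,
                      min_accuracy: float = 0.8) -> bool:
    """Thresholds taken from the text: >= 1 pass and predicted accuracy >= 0.8."""
    return read.passes >= min_passes and read.predicted_accuracy >= min_accuracy


def classify(read: CCSRead) -> str:
    """Full length needs both adapters and a poly(A) tail; FLNC also has no extra adapters."""
    full_length = (read.has_5p_adapter and read.has_3p_adapter
                   and read.has_polya_tail)
    if not full_length:
        return "non-full-length"
    if read.internal_adapter_copies > 0:
        return "full-length, extra adapter"  # assumed label for excluded reads
    return "FLNC"


reads = [
    CCSRead(3, 0.95, True, True, True, 0),
    CCSRead(2, 0.90, True, False, True, 0),
    CCSRead(5, 0.99, True, True, True, 1),
]
print([classify(r) for r in reads if passes_ccs_filter(r)])
```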
The high-quality Illumina short reads were used for error correction of the FLNC reads. The proovread software v2.12, which corrects long reads by iterative short-read consensus, was applied to the FLNC sequences. Using GMAP2 software, the FLNC sequences before and after error correction were aligned to the B. rapa v3.0 reference genome [69, 70] with the options “--no-chimeras” and “-n 100” to calculate the PID values, including global PID and local PID (Fig. 2b). The higher the PID values, the more consistent the sequencing data were with the reference genome. PID values were calculated separately for the alignments before and after error correction, and for each read the higher values were retained. Then, the uniquely mapped FLNC sequences with high PID (global PID > 95% and local PID > 97%) were used to annotate loci and isoforms.
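The selection step described above can be sketched as follows, assuming that the global and local PID values and the number of genomic placements have already been parsed from the aligner output. The `Alignment` class and both function names are hypothetical; breaking ties on global PID first and local PID second is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    read_id: str
    global_pid: float  # percent identity over the whole read
    local_pid: float   # percent identity over the aligned part
    n_paths: int       # number of genomic placements reported for the read


def best_of(raw: Alignment, corrected: Alignment) -> Alignment:
    """Keep whichever alignment (before or after correction) has the higher PID."""
    return max(raw, corrected, key=lambda a: (a.global_pid, a.local_pid))


def is_high_quality(aln: Alignment, min_global: float = 95.0,
                    min_local: float = 97.0) -> bool:
    """Uniquely mapped, global PID > 95% and local PID > 97% (thresholds from the text)."""
    return (aln.n_paths == 1
            and aln.global_pid > min_global
            and aln.local_pid > min_local)


raw = Alignment("read_1", 93.0, 96.0, 1)
corrected = Alignment("read_1", 99.2, 99.5, 1)
chosen = best_of(raw, corrected)
print(chosen.global_pid, is_high_quality(chosen))
```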
Gene loci and isoform finding
Gene loci and isoforms were identified based on the alignment positions of the corrected FLNC reads. For loci, two transcripts that overlapped by at least 20% at their initiation sites on the same strand, and that had at least one exon overlapping by more than 20%, were considered to be transcripts of the same locus. These same-locus transcripts were further analyzed for isoform identification. The process mainly included the removal of redundant transcripts and the filtering of low-reliability transcripts. Redundant transcripts were removed as follows: first, if all the splicing sites of same-locus transcripts were identical, they were considered one isoform; second, if one isoform was degraded at the 5′ terminal region but its remaining region was consistent with another isoform, it was filtered out. To filter false positives, when the global PID was < 99%, each isoform structural model had to be supported by at least two FLNC reads; otherwise, an isoform supported by only one read was retained only if all of its junction sites were fully supported by the genome annotation or the Illumina RNA-Seq data.
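The redundancy removal and support filter described above can be sketched in Python. This is a simplified illustration under stated assumptions, not the software used in the study: each read is reduced to its chain of splice junctions listed in 5′ to 3′ order, all reads are assumed to belong to one locus, and the PID-dependent part of the support rule is omitted, so an isoform is kept if it has at least two supporting reads or if all of its junctions are already known.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Junction = Tuple[int, int]    # (donor position, acceptor position)
Chain = Tuple[Junction, ...]  # junctions in 5'->3' transcription order


def collapse_by_junctions(reads: Dict[str, Chain]) -> Dict[Chain, List[str]]:
    """Reads with identical splice-junction chains are treated as one isoform."""
    groups: Dict[Chain, List[str]] = defaultdict(list)
    for read_id, chain in reads.items():
        groups[chain].append(read_id)
    return dict(groups)


def is_5p_truncation(chain: Chain, other: Chain) -> bool:
    """True if `chain` matches the 3' end of a longer chain (a proper suffix)."""
    return 0 < len(chain) < len(other) and other[-len(chain):] == chain


def filter_isoforms(groups: Dict[Chain, List[str]],
                    known_junctions: Set[Junction],
                    min_reads: int = 2) -> Dict[Chain, List[str]]:
    kept = {}
    for chain, read_ids in groups.items():
        # Drop 5'-degraded copies of a longer isoform.
        if any(is_5p_truncation(chain, other) for other in groups):
            continue
        well_supported = len(read_ids) >= min_reads
        all_known = all(j in known_junctions for j in chain)
        if well_supported or all_known:
            kept[chain] = read_ids
    return kept


reads = {
    "r1": ((200, 300), (400, 500)),
    "r2": ((200, 300), (400, 500)),
    "r3": ((400, 500),),             # 5'-truncated copy of r1/r2
    "r4": ((200, 350), (400, 500)),  # single read, junctions annotated
    "r5": ((200, 300), (420, 500)),  # single read, one unsupported junction
}
known = {(200, 350), (400, 500)}
isoforms = filter_isoforms(collapse_by_junctions(reads), known)
print(sorted(isoforms.values()))
```

In this toy locus, the two identical reads collapse into one isoform, the truncated read is discarded, and of the two single-read models only the one whose junctions are all known survives.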
Novel gene and isoform identification
The above gene loci and isoforms were compared with the reference annotation to identify known genes and isoforms, as well as novel genes and isoforms. A sequenced gene was determined to be a novel gene if it satisfied either of the following criteria: (i) it had no overlap, or an overlap of less than 20%, with annotated genes; or (ii) its overlap with an annotated gene was greater than 20%, but its orientation was inconsistent with that of the annotated gene. In addition, if the sequenced isoform contained one or more new splicing sites, or if the sequenced isoform and annotated isoform were not both single-exon, it was considered to be a novel isoform.
The novel isoforms were annotated against the NR, KOG, KO, and Swiss-Prot databases using Diamond [71, 72]. KEGG pathways were searched using KOBAS v2.0. GO annotations were performed using BLASTX v2.2.26 and BLAST2GO v2.3.5 software.
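The two novel-gene criteria can be expressed as a short Python sketch. The `Gene` tuple and function names are hypothetical, overlap is measured here as a fraction of the annotated gene's length (an assumption based on the wording of criterion (i)), and an overlap of exactly 20%, which the text leaves unspecified, is treated as sufficient.

```python
from typing import Iterable, NamedTuple


class Gene(NamedTuple):
    chrom: str
    start: int   # 1-based inclusive
    end: int
    strand: str  # "+" or "-"


def overlap_fraction(locus: Gene, gene: Gene) -> float:
    """Shared bases as a fraction of the annotated gene's length (assumed denominator)."""
    if locus.chrom != gene.chrom:
        return 0.0
    shared = min(locus.end, gene.end) - max(locus.start, gene.start) + 1
    return max(shared, 0) / (gene.end - gene.start + 1)


def is_novel_gene(locus: Gene, annotated: Iterable[Gene],
                  min_overlap: float = 0.20) -> bool:
    """Known only if some annotated gene overlaps enough AND lies on the same strand."""
    for gene in annotated:
        if (overlap_fraction(locus, gene) >= min_overlap
                and gene.strand == locus.strand):
            return False
    return True


annotation = [Gene("A01", 1000, 2000, "+")]
print(is_novel_gene(Gene("A01", 1500, 2500, "+"), annotation),  # same strand, large overlap
      is_novel_gene(Gene("A01", 1500, 2500, "-"), annotation),  # criterion (ii)
      is_novel_gene(Gene("A01", 1900, 3000, "+"), annotation))  # criterion (i)
```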
LncRNA and ORF identification
To identify lncRNAs, novel isoforms from known genes and novel isoforms from novel genes obtained from the PacBio data were first searched against the NR, KOG, KO, and Swiss-Prot databases with default parameters. Isoforms with BLAST hits at an E-value cutoff of 1E-5 were filtered out, and the remaining isoforms were further evaluated for protein-coding capacity using CPAT v1.2.2 (http://lilab.research.bcm.edu/cpat/).
To predict ORFs, TransDecoder software was used to identify potential coding sequences (http://transdecoder.sf.net). By default, the ORFs predicted by TransDecoder.LongOrfs were at least 100 amino acids long. To improve the sensitivity of ORF prediction, the proteins translated from candidate ORFs were aligned to the Swiss-Prot database by BlastP for homologous protein identification. Simultaneously, protein domains were identified from the Pfam database using hmmscan [75, 76]. Subsequently, TransDecoder.Predict was used to filter all predicted ORFs based on the above results, retaining ORFs that had homology to proteins in the Swiss-Prot database or contained a Pfam domain.
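The two-stage lncRNA screen can be sketched as below, assuming the best E-value per isoform and a coding probability per isoform have already been computed by the external tools. The function name and input dictionaries are hypothetical. The coding-probability cutoff is deliberately left as a required argument because the text does not report the value used.

```python
from typing import Dict, Iterable, List


def candidate_lncrnas(isoform_ids: Iterable[str],
                      best_evalue: Dict[str, float],
                      coding_probability: Dict[str, float],
                      prob_cutoff: float,
                      evalue_cutoff: float = 1e-5) -> List[str]:
    """Keep isoforms with no database hit at the E-value cutoff and low coding probability.

    best_evalue holds the smallest E-value across all searched databases;
    isoforms absent from it are treated as having no hit at all.
    """
    kept = []
    for iso in isoform_ids:
        evalue = best_evalue.get(iso)
        if evalue is not None and evalue <= evalue_cutoff:
            continue  # similar to a known protein: treated as coding
        if coding_probability[iso] < prob_cutoff:
            kept.append(iso)
    return kept


ids = ["iso1", "iso2", "iso3"]
evalues = {"iso1": 1e-30, "iso3": 0.5}            # iso2 has no hit
probs = {"iso1": 0.99, "iso2": 0.05, "iso3": 0.80}
# 0.4 is a placeholder cutoff for this toy example only.
print(candidate_lncrnas(ids, evalues, probs, prob_cutoff=0.4))
```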
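To make the 100-amino-acid length criterion concrete, the following is a minimal forward-strand ORF scanner in Python. It is a simplified stand-in written for illustration and does not reproduce TransDecoder: it only considers ATG-to-stop ORFs in the three forward frames, ignores partial ORFs and the reverse strand, and applies no homology or domain evidence.

```python
from typing import Optional, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str, min_aa: int = 100) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the longest ATG..stop ORF, 0-based half-open.

    The ORF must encode at least `min_aa` amino acids (stop codon excluded).
    Returns None if no ORF meets the threshold.
    """
    seq = seq.upper()
    best = None
    best_aa = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_aa = (i - start) // 3
                if n_aa >= min_aa and n_aa > best_aa:
                    best, best_aa = (start, i + 3), n_aa
                start = None
    return best


# Toy transcript: 2-nt 5' UTR, ATG + 120 alanine codons + stop, 2-nt 3' UTR
toy = "CC" + "ATG" + "GCT" * 120 + "TAA" + "GG"
print(longest_orf(toy))
print(longest_orf("ATGGCTTAA"))  # too short for the 100-aa default
```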
AS, fusion transcript, and APA identification
Alternative splicing (AS) events were ascertained using ASprofile software . The splice types, (M) SKIP, (M) IR, AE, X (M) KIP, X (M) IR, and XAE were classified and characterized by comparing different isoforms at the same gene loci using ASprofile (Fig. 2c). Fusion transcripts were those where the 5′ and 3′ sequences mapped to two or more gene loci in the reference genome, corresponding to the 5′ partner and 3′ partner genes. The iso-seq fusion transcripts detection software, self-developed by Frasergen Inc. (Wuhan, China), was used for fusion gene detection. A schematic diagram of the software is shown in Fig. 2d. Poly-A site was an important post-transcriptional modification of RNA. The reliable APA sites were obtained by Tapis software .
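The fusion definition given above, together with the inter- and intra-chromosome distinction used in the results, can be sketched as follows. This is not the Frasergen software, whose internals are not described in the text; the `Segment` tuple, the function name, and the gene identifiers in the example are invented for illustration.

```python
from typing import List, NamedTuple, Optional


class Segment(NamedTuple):
    gene_id: str  # annotated locus that this aligned piece of the read falls in
    chrom: str


def classify_fusion(segments: List[Segment]) -> Optional[str]:
    """Classify one transcript from its aligned pieces, listed in 5'->3' order.

    Returns None when every piece maps to the same locus (not a fusion),
    otherwise "intra-chromosome" or "inter-chromosome".
    """
    partners: List[Segment] = []
    for seg in segments:
        if all(seg.gene_id != p.gene_id for p in partners):
            partners.append(seg)
    if len(partners) < 2:
        return None
    chroms = {p.chrom for p in partners}
    return "intra-chromosome" if len(chroms) == 1 else "inter-chromosome"


# Hypothetical gene identifiers, for illustration only
print(classify_fusion([Segment("geneA", "A03"), Segment("geneB", "A09")]))
print(classify_fusion([Segment("geneA", "A03"), Segment("geneC", "A03")]))
print(classify_fusion([Segment("geneA", "A03"), Segment("geneA", "A03")]))
```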
Total RNA from floral organs of the DH line ‘FT’, including anthers, sepals, filaments, petals, and pistils, was extracted and mixed as described above. Reverse transcription was conducted using the FastQuant RT Super Mix (TIANGEN, China). RT-PCR was performed in 10 μl volumes containing 50 ng DNA, 1.0 μl of 10× Taq Reaction Buffer (containing Mg2+), 0.8 μl of 2.5 mM dNTPs, 1 μl each of 0.5 μM forward and reverse primers, and 1 U of Taq DNA polymerase (TIANGEN, China). The amplification was performed on an iCycler thermocycler (Bio-Rad, USA) with the following cycling parameters: initial denaturation at 95 °C for 5 min; 35 cycles of 95 °C for 30 s, 56 °C for 30 s, and 72 °C for 30 s; and a final extension at 72 °C for 10 min. Gene-specific primers were designed with Primer Premier 5.0 by GENEWIZ (Suzhou, China). PCR products were analyzed on 2% agarose gels, followed by Sanger sequencing. All primers are listed in Additional file 2: Table S14.
We acknowledge Yanbo Feng from Frasergen Bioinformatics Co., Ltd. (Wuhan, China) for providing relevant literature regarding the PacBio Sequel platform and for actively coordinating communication with technical staff to facilitate the completion of this manuscript. We also thank Editage (www.editage.cn) for English language editing.
ZL and FH conceived and designed this study. CT analyzed the data and wrote the manuscript. CT and HL performed the verification experiments. CT, JR, and XY performed data analysis. All the authors read and approved the final manuscript.
This work was supported by the National Key Research and Development Program of China (No. 2016YFD0101701) and the National Natural Science Foundation of China (No. 31772298 and No. 31201625). Each of the funding bodies granted the funds based on a research proposal. They had no influence over the design of the study; the collection, analysis, and interpretation of data; or the writing of the manuscript.
Ethics approval and consent to participate
Consent for publication
All the authors declare that they have no competing interests.
- 10. Yan X, Dong C, Yu J, Liu W, Jiang C, Liu J, Hu Q, Fang X, Wei W. Transcriptome profile analysis of young floral buds of fertile and sterile plants from the self-pollinated offspring of the hybrid between novel restorer line NR1 and Nsa CMS line in Brassica napus. BMC Biochem. 2013;14(3):1–16.
- 17. Zhu FY, Chen MX, Ye NH, Shi L, Ma KL, Yang JF, Cao YY, Zhang YJ, Yoshida T, Fernie AR, Fan GY, Wen B, Zhou R, Liu TY, Fan T, Gao B, Zhang D, Hao GF, Xiao S, Liu YG, Zhang J. Proteogenomic analysis reveals alternative splicing and translation as part of the abscisic acid response in Arabidopsis seedlings. Plant J. 2017;91(3):518–33.
- 22. Allen SL, Delaney EK, Kopp A, Chenoweth SF. Single-Molecule Sequencing of the Drosophila serrata Genome. G3 (Bethesda). 2017;7(3):781–788.
- 23. Clavijo BJ, Venturini L, Schudoma C, Accinelli GG, Kaithakottil G, Wright J, Borrill P, Kettleborough G, Heavens D, Chapman H, Lipscombe J, Barker T, Lu FH, McKenzie N, Raats D, Ramirez-Gonzalez RH, Coince A, Peel N, Percival-Alwyn L, Duncan O, Trösch J, Yu G, Bolser DM, Namaati G, Kerhornou A, Spannagl M, Gundlach H, Haberer G, Davey RP, Fosker C, Palma FD, Phillips AL, Millar AH, Kersey PJ, Uauy C, Krasileva KV, Swarbreck D, Bevan MW, Clark MD. An improved assembly and annotation of the allohexaploid wheat genome identifies complete families of agronomic genes and provides genomic evidence for chromosomal translocations. Genome Res. 2017;27(5):885–96.
- 25. Edger PP, VanBuren R, Colle M, Poorten TJ, Wai CM, Niederhuth CE, Alger EI, Ou S, Acharya CB, Wang J, Callow P, McKain MR, Shi J, Collier C, Xiong Z, Mower JP, Slovin JP, Hytönen T, Jiang N, Childs KL, Knapp SJ. Single-molecule sequencing and optical mapping yields an improved genome of woodland strawberry (Fragaria vesca) with chromosome-scale contiguity. Gigascience. 2018;7(2):1–7.
- 28. Prakash G, Kumar A, Sheoran N, Aggarwal R, Satyavathi CT, Chikara SK, Ghosh A, Jain RK. First Draft Genome Sequence of a Pearl Millet Blast Pathogen, Magnaporthe grisea Strain PMg_Dl, Obtained Using PacBio Single-Molecule Real-Time and Illumina NextSeq 500 Sequencing. Microbiol Resour Announc. 2019;8(20). pii: e01499–18.
- 29. Zhang L, Hu J, Han X, Li J, Gao Y, Richards CM, Zhang C, Tian Y, Liu G, Gul H, Wang D, Tian Y, Yang C, Meng M, Yuan G, Kang G, Wu Y, Wang K, Zhang H, Wang D, Cong P. A high-quality apple genome assembly reveals the association of a retrotransposon and red fruit colour. Nat Commun. 2019;10(1):1494.
- 32. Xu Z, Peters RJ, Weirather J, Luo H, Liao B, Zhang X, Zhu Y, Ji A, Zhang B, Hu S, Au KF, Song J, Chen S. Full-length transcriptome sequences and splice variants obtained by a combination of sequencing platforms applied to different root tissues of Salvia miltiorrhiza and tanshinone biosynthesis. Plant J. 2015;82(6):951–61.
- 38. Wang X, Wang H, Wang J, Sun R, Wu J, Liu S, Bai Y, Mun JH, Bancroft I, Cheng F, Huang S, Li X, Hua W, Wang J, Wang X, Freeling M, Pires JC, Paterson AH, Chalhoub B, Wang B, Hayward A, Sharpe AG, Park BS, Weisshaar B, Liu B, Li B, Liu B, Tong C, Song C, Duran C, Peng C, Geng C, Koh C, Lin C, Edwards D, Mu D, Shen D, Soumpourou E, Li F, Fraser F, Conant G, Lassalle G, King GJ, Bonnema G, Tang H, Wang H, Belcram H, Zhou H, Hirakawa H, Abe H, Guo H, Wang H, Jin H, Parkin IA, Batley J, Kim JS, Just J, Li J, Xu J, Deng J, Kim JA, Li J, Yu J, Meng J, Wang J, Min J, Poulain J, Wang J, Hatakeyama K, Wu K, Wang L, Fang L, Trick M, Links MG, Zhao M, Jin M, Ramchiary N, Drou N, Berkman PJ, Cai Q, Huang Q, Li R, Tabata S, Cheng S, Zhang S, Zhang S, Huang S, Sato S, Sun S, Kwon SJ, Choi SR, Lee TH, Fan W, Zhao X, Tan X, Xu X, Wang Y, Qiu Y, Yin Y, Li Y, Du Y, Liao Y, Lim Y, Narusaka Y, Wang Y, Wang Z, Li Z, Wang Z, Xiong Z, Zhang Z; Brassica rapa Genome Sequencing Project Consortium. The genome of the mesopolyploid crop species Brassica rapa. Nat Genet. 2011;43(10):1035–9.
- 40. Zhang L, Cai X, Wu J, Liu M, Grob S, Cheng F, Liang J, Cai C, Liu Z, Liu B, Wang F, Li S, Liu F, Li X, Cheng L, Yang W, Li MH, Grossniklaus U, Zheng H, Wang X. Improved Brassica rapa reference genome by single-molecule sequencing and chromosome conformation capture technologies. Hortic Res. 2018;5:50.
- 46. Weirather JL, Afshar PT, Clark TA, Tseng E, Powers LS, Underwood JG, Zabner J, Korlach J, Wong WH, Au KF. Characterization of fusion genes and the significantly expressed fusion isoforms in breast cancer by hybrid sequencing. Nucleic Acids Res. 2015;43(18):e116.
- 49. Sander.
- 52. Vembar SS, Seetin M, Lambert C, Nattestad M, Schatz MC, Baybayan P, Scherf A, Smith ML. Complete telomere-to-telomere de novo assembly of the Plasmodium falciparum genome through long-read (>11 kb), single molecule, real-time sequencing. DNA Res. 2016;23(4):339–51.
- 59. Zhang G, Guo G, Hu X, Zhang Y, Li Q, Li R, Zhuang R, Lu Z, He Z, Fang X, Chen L, Tian W, Tao Y, Kristiansen K, Zhang X, Li S, Yang H, Wang J, Wang J. Deep RNA sequencing at single base-pair resolution reveals high complexity of the rice transcriptome. Genome Res. 2010;20(5):646–54.
- 70. Yuan D, Tang Z, Wang M, Gao W, Tu L, Jin X, Chen L, He Y, Zhang L, Zhu L, Li Y, Liang Q, Lin Z, Yang X, Liu N, Jin S, Lei Y, Ding Y, Li G, Ruan X, Ruan Y, Zhang X. The genome sequence of Sea-Island cotton (Gossypium barbadense) provides insights into the allopolyploidization and development of superior spinnable fibres. Sci Rep. 2015;5:17662.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.