Revisiting avian ‘missing’ genes from de novo assembled transcripts
Whether birds have lost genes relative to mammals and non-avian vertebrates during speciation remains under debate. High-quality reference gene sets are necessary for precisely evaluating gene gain and loss. It is therefore essential to explore new reference transcripts from large-scale de novo assembled transcriptomes to recover potentially hidden genes in avian genomes.
We explored 196 high-quality transcriptomic datasets from five bird species to reconstruct transcripts and discover potentially hidden genes in avian genomes. From this large body of avian transcriptomic data we constructed a relatively complete, high-quality bird transcript database (1,623,045 transcripts after quality control in five birds), and found that most of the presumed missing genes (83.2%) could be recovered in at least one bird species. Most of these genes have been identified for the first time in birds. Our results show that 67.94% of the genes have a GC content over 50%, while 2.91% are AT-rich (AT% > 60%). In chicken, 239 (53.59%) genes had a tissue-specific expression index of more than 0.9. The missing genes also have lower Ka/Ks values than average (genome-wide: Ka/Ks = 0.99; missing genes: Ka/Ks = 0.90; t-test, P = 1.25E-14). Among all presumed missing genes, there were 135 for which we did not find any meaningful orthologues in any of the five species studied.
Insufficient reference genome quality is the major reason for wrongly inferring missing genes in birds. Those presumably missing genes often have a very strong tissue-specific expression pattern. We show multi-tissue transcriptomic data from various species are necessary for inferring gene family evolution for species with only draft reference genomes.
Keywords: Missing gene; Avian genome; de novo assembly; Evolution
RPKM: Reads Per Kilobase of exon model per Million mapped reads
TPM: Transcripts Per Kilobase of exon model per Million mapped reads
TSI: Tissue-specific expression index
Gene gain and loss are common events during speciation. However, high-quality genomes are an essential prerequisite for inferring gene gain and loss at the genome-wide scale. There has long been debate as to whether birds have fewer genes than mammals. Many genes were not found in the first avian reference genome (chicken, Gallus gallus), and a hypothesis of gene loss and/or accelerated gene evolution in the avian lineage was proposed. When more avian genomes became available, Zhang et al. and Lovell et al., using multiple genome comparisons, proposed that 640 and 274 protein-coding genes, respectively, were lost in the avian lineage. The two studies drew similar conclusions: these gene losses are due to fragmentation or deletion of syntenic blocks during evolution [3, 4]. However, several recent genome-wide and/or case studies recovered some genes initially presumed lost in bird genomes [5, 6, 7]. Both the GC composition and the GC repeats within these missing genes were thought to be significantly higher than those of other genes, and the genes were also thought to cluster in GC-rich regions. Because PCR amplification is sensitive to extreme GC-content variation, classical Illumina libraries represent the genome unevenly, and large genomes are generally assembled inefficiently, particularly those created following standard protocols. Searches for several genes that are important in mammals but were considered lost in the chicken have in fact discovered full-length cDNAs for these genes [6, 9, 10]. At the time, the newly released chicken genome, Galgal5, included around 1900 protein-coding genes not present in Galgal4, annotating some of the genes previously thought to be missing. These recent advances suggest that a considerable number of the presumed ‘missing genes’ are not really missing from the avian genome.
As more genes have been recovered, a recent study concluded that avian genomes contain similar numbers of genes to mammals and non-avian reptiles. To address these conflicts directly, we need strong evidence for these missing genes in multiple bird species. Different studies have shown that recovering genes through transcriptome assembly is an effective approach that can compensate for poor genome quality.
This study used multiple transcriptomic data sets from five bird species (chicken, Gallus gallus; duck, Anas platyrhynchos; pigeon, Columba livia; goose, Anser cygnoides; zebra finch, Taeniopygia guttata) to search exhaustively for the missing genes in birds, and to elucidate the effects of GC content, expression pattern, and assembled genome quality on gene loss studies. We demonstrate that de novo assembly of multiple transcriptomes from various tissues can rescue most missing genes in the absence of complete reference genomes, and that most presumed missing genes have a strong tissue-specific expression pattern.
Animal tissues and RNA-Seq
Chicken RNA-seq data encompassing 26 tissues were downloaded from GenBank. From the public dataset, we kept only paired-end reads of at least 70 bp in length for use in the de novo assembly. Duck samples (both adult and embryos) were obtained from Pekin Gold Duck Inc. Pigeon samples were obtained from Beijing Sunyi pigeon farm. Four tissues from geese were obtained from Zhejiang Goose farm, while other tissues were downloaded from GenBank. Zebra finch samples were obtained from the Beijing Zoo (Additional file 1: Table S1). Tissue samples were snap-frozen in liquid nitrogen and then stored at − 80 °C until RNA extraction. RNA was extracted by homogenization at low temperature and preservation in Trizol reagent (Invitrogen, USA). Approximately 10 μg of sheared cDNA was prepared for Illumina sequencing according to the manufacturer’s protocols. Libraries were prepared from a 200–230 bp size-selected fraction following adapter ligation and agarose gel separation. The libraries were sequenced using a multiplexed paired-end protocol with 150 bp of data collected per run on the Illumina Hiseq 2500/4000. Base calling was performed by the Illumina instrument software. The FASTX Toolkit (v0.0.14) (http://hannonlab.cshl.edu/fastx_toolkit/) was used to filter the obtained data. Reads shorter than 70 bp were removed, as were reads with > 5% low-quality bases (< Q30).
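The two read-level thresholds stated above (minimum length 70 bp; at most 5% of bases below Q30) can be illustrated with a short sketch. This is not the FASTX Toolkit invocation used in the study; the function names are hypothetical, Phred+33 quality encoding is assumed, and mate-pair bookkeeping for paired-end data is omitted.

```python
def low_quality_fraction(quality: str, min_q: int = 30, offset: int = 33) -> float:
    """Fraction of bases with Phred score below min_q (Phred+33 assumed)."""
    if not quality:
        return 1.0
    low = sum(1 for ch in quality if ord(ch) - offset < min_q)
    return low / len(quality)


def keep_read(seq: str, quality: str,
              min_len: int = 70, max_low_frac: float = 0.05) -> bool:
    """Apply the two thresholds given in the text to a single read."""
    return len(seq) >= min_len and low_quality_fraction(quality) <= max_low_frac


def filter_fastq(lines):
    """Yield (header, sequence, quality) for reads that pass the filter.

    `lines` is any iterable of FASTQ lines in strict 4-line records.
    """
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)  # skip the '+' separator line
        qual = next(it).strip()
        if keep_read(seq, qual):
            yield header.strip(), seq, qual
```

For paired-end libraries, a real pipeline would also need to drop or rescue the mate of every discarded read so that the two files stay synchronized.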
De novo transcriptome assembly and quality evaluation
To ensure the accuracy of the downstream analysis, we performed a quality assessment of the assembled transcripts. We used the ortholog hit ratio (OHR) to evaluate the integrity and richness of the transcripts. The constructed sequences were compared with known sequences in a related-species database, and the OHR was defined as the ratio of the best alignment result to the reference sequence. The closer the OHR is to 1.0, the more complete the constructed transcript is. The chicken has a large number of well-annotated gene sequences, so we selected the protein-coding genes in chicken as reference sequences (Ensembl, V92). The OHR for each of the five species was calculated as the ratio of the length of the best CDS sequence to that of the known gene. The OHR distribution was plotted in R (//www.R-project.org/).
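As a rough illustration of the OHR definition above, the sketch below takes alignment records of the form (reference gene, aligned length on the reference) and keeps the best ratio per gene. The input layout and the function name are assumptions made for this example; the study's own scripts are not described in the text.

```python
def best_ohr_per_gene(hits, ref_lengths):
    """Return {reference_gene: best ortholog hit ratio}.

    hits        -- iterable of (ref_gene_id, aligned_length_on_reference)
    ref_lengths -- dict mapping ref_gene_id to reference CDS length
    Both lengths must be in the same unit (e.g. nucleotides).
    """
    best = {}
    for gene, aligned in hits:
        ref_len = ref_lengths[gene]
        if ref_len <= 0:
            raise ValueError(f"non-positive reference length for {gene}")
        ohr = aligned / ref_len
        if ohr > best.get(gene, 0.0):
            best[gene] = ohr
    return best
```

A value near 1.0 indicates that the best assembled transcript covers almost the whole reference CDS; genes with no hit are simply absent from the returned dictionary.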
Comparative genomic analysis
The candidate missing gene lists previously published by Lovell et al. (Additional file 1: Table S1) and Zhang et al. (Additional file 1: Table S10) were used as targets to test whether these presumably missing genes are really lost in birds. The Lovell study reported 274 missing genes in birds and the Zhang study 640. We combined the two lists to obtain 806 candidate missing genes in birds. All subsequent comparative genomics and expression analyses were based on this combined list. We obtained the peptide sequences of the human orthologues of these missing genes and used them as queries to search for homologous bird genes among our assembled transcripts in the five bird species. The BLASTP program (identity > 40%; E-value = 1e-10) was used to search the bird sequences. Only the contig with the highest alignment score was selected as the best candidate missing sequence in each bird. After obtaining the best sequence for each missing gene in birds, basic information such as length and GC content was calculated.
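The best-hit selection described above can be sketched as a small parser for BLAST tabular output. The sketch assumes the standard 12-column tabular layout (query, subject, percent identity, ..., E-value, bit score), treats the E-value in the text as an upper bound, and uses bit score as the "alignment score"; none of these details is confirmed by the text, and the function name is hypothetical.

```python
def best_hits(blast_lines, min_identity=40.0, max_evalue=1e-10):
    """Return {query_id: subject_id} keeping the top-scoring hit per query.

    Hits must have percent identity > min_identity and
    E-value <= max_evalue (both thresholds taken from the text).
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        pident = float(fields[2])
        evalue = float(fields[10])
        bitscore = float(fields[11])
        if pident <= min_identity or evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}
```

Combining the two published lists is a plain set union of gene identifiers, which is why the combined total (806) is smaller than 274 + 640: genes named in both studies are counted once.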
Human (Homo sapiens, GRCh38.p12), mouse (Mus musculus, GRCm38.p6) and anole lizard (Anolis carolinensis, AnoCar2.0) gene annotations (Ensembl V92) were used as references against which to compare the distribution of GC content within protein-coding genes from the five birds used in this study. Co-linear analysis of chromosome fragments among human, chicken and lizard was done using LASTZ (v1.04; --step 10, --gapped). The map of the co-linear regions was drawn in R. BLASTX was used to compare recovered bird missing gene transcripts with Swiss-Prot protein sequences. tBLASTn was used to compare human homologous protein sequences with the chicken (Galgal5), duck (BGI_1.0), goose (AnsCyg_PRJNA183603_v1.0), pigeon (Cliv_1.0) and zebra finch (Taeniopygia_guttata-3.2.4) genomes.
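For the GC-content comparisons, a per-sequence GC percentage is all that is needed. The sketch below is illustrative only: it ignores ambiguous bases such as N when computing the percentage (an assumption, since the text does not say how they were handled), and the class labels reuse the two cut-offs quoted in the abstract (GC > 50%; AT > 60%).

```python
def gc_content(seq: str) -> float:
    """GC percentage over unambiguous bases (A, C, G, T) only."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / total


def gc_class(seq: str) -> str:
    """Label a sequence using the cut-offs quoted in the abstract."""
    gc = gc_content(seq)
    if gc > 50.0:
        return "GC-rich"
    if 100.0 - gc > 60.0:  # AT% > 60%
        return "AT-rich"
    return "intermediate"
```

Applying `gc_content` to every coding sequence of each species gives the distributions that are compared between birds, human, mouse and lizard.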
To compare the Ka/Ks values of missing genes with those of all annotated protein-coding genes in the chicken genome, we used chicken-human orthologues as references. Chicken-human single-copy orthologues (Ensembl V92) were extracted using Ensembl Biomart for Ka/Ks analysis. First, the cDNA sequences were translated into amino-acid sequences and aligned with MUSCLE; the aligned amino-acid sequences were then converted to cDNA alignments according to the original cDNA sequences. Ka/Ks values were calculated for each orthologous group using KaKs calculator (version 2.0) with default parameters (-c = standard code, -m = MA).
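The step that converts an amino-acid alignment back into a cDNA (codon) alignment can be sketched as follows. This is a generic back-threading routine written for illustration, not the tool used in the study; the function name is hypothetical, and it assumes the cDNA is in frame, starts at the first codon, and may carry one trailing stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def protein_to_codon_alignment(aligned_protein: str, cdna: str) -> str:
    """Thread the original codons onto a gapped protein alignment.

    Each residue is replaced by its codon and each gap '-' by '---',
    so the codon alignment keeps the columns of the protein alignment.
    """
    cdna = cdna.upper()
    usable = len(cdna) - len(cdna) % 3
    codons = [cdna[i:i + 3] for i in range(0, usable, 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    # Tolerate a single trailing stop codon that has no residue in the protein.
    if len(codons) == n_residues + 1 and codons[-1] in STOP_CODONS:
        codons = codons[:-1]
    if len(codons) != n_residues:
        raise ValueError("cDNA length does not match the number of residues")
    codon_iter = iter(codons)
    return "".join("---" if aa == "-" else next(codon_iter)
                   for aa in aligned_protein)
```

Running this on both members of an orthologous pair yields two equal-length codon alignments, which is the input form that Ka/Ks estimation programs expect.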
Salmon was used with default parameters to obtain quantitative information for each transcript sequence, including the normalized TPM and the number of reads mapped on each transcription group. The RPKM of each transcription group sequence was then calculated and used to compute the tissue-specific expression index. The tissue-specific expression index (TSI) was proposed by Yanai, and can accurately measure the specific expression of a gene. For the TSI to be computationally meaningful, the number of tissues included should be > 10. Only 8 goose tissues were available, so goose was excluded from this analysis, and we calculated the TSI of high-confidence genes in the remaining four species (chicken, duck, pigeon and zebra finch), for which data from more tissues were available. Genes were defined as highly expressed in a tissue if their expression was 3-fold higher than the average expression across all tissues.
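The expression measures in this paragraph can be made concrete with a short sketch. The text cites Yanai for the TSI but does not write the formula out; the code below assumes the commonly used tau form, tau = sum(1 - x_i / x_max) / (N - 1), and applies it to raw RPKM values without any log transform. Both choices, and all function names, are assumptions for illustration.

```python
def rpkm(read_count: float, length_bp: float, total_mapped_reads: float) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (length_bp * total_mapped_reads)


def tau_index(expression) -> float:
    """Tissue specificity in [0, 1]; 1 means expressed in a single tissue.

    `expression` holds one non-negative value per tissue.
    Assumed form: sum(1 - x_i / x_max) / (N - 1).
    """
    values = list(expression)
    if len(values) < 2:
        raise ValueError("need at least two tissues")
    if min(values) < 0:
        raise ValueError("expression values must be non-negative")
    x_max = max(values)
    if x_max == 0:
        raise ValueError("gene is not expressed in any tissue")
    return sum(1 - v / x_max for v in values) / (len(values) - 1)


def highly_expressed_tissues(expr_by_tissue: dict, fold: float = 3.0) -> list:
    """Tissues whose expression exceeds `fold` times the all-tissue mean."""
    mean = sum(expr_by_tissue.values()) / len(expr_by_tissue)
    return [t for t, v in expr_by_tissue.items() if v > fold * mean]
```

With N tissues, a gene expressed in only one tissue reaches at most N times the mean, so the 3-fold rule can only be satisfied when more than three tissues are profiled; using many tissues, as the text requires, also makes the index itself more stable.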
RT-PCR for candidate genes
To confirm the de novo assembled cDNA for some very important ‘missing genes’, we used RT-PCR and Sanger sequencing to obtain the candidate missing gene cDNA sequences. We searched the literature using the names of genes on the high-confidence list. Based on the PubMed search results, we chose genes that have been studied in depth in human but for which there are no related studies in birds. For RT-PCR analysis, we selected chicken tissues according to the expression pattern of each gene.
Total RNA was extracted from the corresponding tissue using Trizol reagent (Invitrogen, USA). First-strand cDNA was generated from 1 μg of RNA using PrimeScript™ RT reagent Kit with gDNA Eraser (Takara, Japan) following the manufacturer’s instructions. Gene-specific primers were designed using Primer 5 software, and the corresponding fragment was amplified in a 30 μl PCR reaction containing 1 μl cDNA, 2 mM MgCl2, 0.5 mM of each primer and 0.5 X super fidelity PCR mix (NEB, England). Temperature cycles were as follows: initial denaturation at 95 °C for 3 min; 30 cycles of denaturation at 95 °C for 1 min, annealing at 60 °C for 20 s and extension at 72 °C for 1 min; and a final extension at 72 °C for 10 min. The annealing temperature and extension times varied depending on the primer Tm and the length of the fragment being amplified. Specificity of the amplification products was verified by electrophoresis on a 0.8% agarose gel and by Sanger sequencing.
Table: Summary of RNA-Seq samples and de novo assembly statistics (total tissue numbers; total clean reads (M); assembled transcript numbers; assembled transcript N50 (bp))
We mapped the high-confidence chicken, duck, goose, pigeon and zebra finch transcripts to their corresponding reference assemblies. This comparison showed that all these genes could only be very partially mapped onto the reference genomes (Additional file 1: Table S4). We also compared the human homologous protein sequences from the missing gene list with the five bird genomes using tblastn (E-value = 1e-10). This yielded 556 (chicken, Galgal5), 513 (duck), 506 (goose), 529 (pigeon), and 495 (zebra finch) matches (Additional file 1: Table S5). The alignment quality obtained with the de novo assembled transcripts is much better than that obtained with human protein sequences (Additional file 1: Table S4, S5). All these results confirm the wide existence of presumed missing genes in the five birds studied.
Due to the presence of microchromosomes in birds, the avian genome is seen to represent a highly stable karyotype [2, 7]. Having recovered those ‘missing’ genes in our five studied birds, we can re-analyze their chromosomal locations to investigate whether syntenic blocks have indeed been lost. Among the 419 high-confidence transcript sequences mapped to the Galgal5 chicken assembly, 322 (76.85%) aligned to known chromosomes and 91 (21.72%) mapped to unplaced scaffolds (Additional file 1: Table S4; Additional file 2: Figure S2). We performed a co-linear analysis of the corresponding human, chicken, and lizard chromosomal segments for the four syntenic blocks (Additional file 2: Figure S3) that harbor the relatively closely linked missing genes, and found that these regions were partially homozygous. The other mapped genes were distributed across different chromosomes or unplaced contigs, with no obvious clustering (Additional file 1: Table S7).
In this study, we found that a small portion of the missing genes have no genomic or transcript evidence in either the current reference assemblies or the de novo assembled transcripts. After exhaustively searching the de novo assembled transcripts and the current reference genomes of all five birds, we could not find orthologues for 135 genes in any assembled transcriptome, nor any meaningful orthologues in any of the five bird reference genomes. These results suggest that these 135 genes are most probably lost in birds (Additional file 1: Table S3B). All the missing genes described by Bornelov et al., who reconstructed chicken transcripts from transcriptome data from three tissues, were also found in our assembled chicken transcripts, and 34 of them were found in all five birds (Additional file 1: Table S3A). Determining whether these 135 genes are really missing in birds will require further studies. Furthermore, precise inference of these missing genes also depends on multiple finished bird genomes.
Recent studies combined mapping-based annotation and de novo assembly methods to predict chicken transcripts, and obtained 20% more transcripts than the ENSEMBL annotation pipeline. By comparing the newly constructed chicken high-confidence transcript sequences with two different chicken reference genomes (Galgal5, Galgal4), we obtained 419 and 382 alignments, respectively (Additional file 1: Table S4). Thirty-four genes missing in Galgal4 were found annotated in the improved Galgal5 assembly (Additional file 1: Table S10). Our study also found that all high-confidence genes were present in the corresponding bird reference assemblies (Additional file 1: Table S4). This result helps explain why current genome annotation does not include these genes. Both genome assembly and annotation have a major impact on the inference of missing genes. As the quality of the genome assemblies improves, the number of genes identified in birds will increase.
Based on the current results, high GC content was only one cause of missing genes in general. The GC content of these missing genes increases slightly from lizard through to human. The evolutionary significance of this change in genic GC content should be revisited. Previous studies have shown that microchromosomes harbour higher gene density, GC content and recombination rates than macrochromosomes. Recombination is tightly related to the phenomenon of GC-biased gene conversion. These high GC-content missing genes are actually present in the avian genome, although they had been hypothesized to be part of missing blocks of genes. The majority of the missing genes were recovered in the microchromosomes and unplaced scaffolds. The process of GC-biased gene conversion (gBGC) has a major impact on recombination rate across the genome [29, 30]. gBGC may play a major role in the high recombination rates seen in avian microchromosomes.
Our results also show that the current missing genes are highly enriched in the tissue-specifically expressed group. Unique tissue specificity and low expression are among the reasons that hinder the construction of high-quality transcripts from RNA-Seq data. In this study, more than 55% (high-confidence) or 88% (low-confidence) of the proposed missing genes were obtained through assembly of 196 transcriptomic data sets, indicating that multi-tissue transcriptome assembly can largely solve the missing gene problems caused by poor genome quality. This is a good complementary strategy for assessing gene loss in the absence of very high-quality genomic and annotation data.
We constructed a relatively complete and high-quality bird transcript database from a large amount of avian transcriptomic data, and recovered most of the genes previously presumed to be missing in birds. Most of these genes have been identified for the first time in birds, and some incorrectly annotated genes were also corrected. Our comprehensive analysis demonstrates that, until a ‘finished’ genome is available, detailed transcriptomic data from various tissues and organs are an essential complement for inferring gene gain and loss. Based on the current study, we conclude that most of the presumed missing genes are in fact present in the bird genomes, but not in the current reference assemblies. High GC content is one reason for wrongly inferring missing genes in birds, but some of these genes (about 40%) have similar, or lower, GC content compared with the genome background. Those presumably missing genes often have a very strong tissue-specific expression pattern. This study demonstrates that high-quality genome data and annotation are necessary for investigating true gene loss.
The authors thank Mrs. Jun-yin Li for help collecting and maintaining the chickens.
The work was supported by the earmarked fund for National Scientific Supporting Projects of China (2015BAD03B06), Modern-industry Technology Research System (CARS-42-9), National Natural Science Foundation of China (31572388) to ZCH and the Program for Changjiang Scholars and Innovative Research Team in University (IRT_15R62) to NY. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Availability of data and materials
Most of the RNA-seq datasets were sequenced in this study and part of the data were downloaded from Sequence Read Archive. For details see Additional file 1: Table S1. Our sequenced RNA-seq data were deposited in Sequence Read Archive (SRA) database under the accession numbers (SRP141084) (https://www.ncbi.nlm.nih.gov/sra/SRP141084).
ZTY, FZ and ZCH carried out the data analysis experiments, FBL, TJ, GSL, DTS, CLZ and ZW carried out the sampling, data interpretation and molecular experiments. ZTY, ZCH, NY and JS drafted and edited the manuscript. ZCH conceived and supervised the project. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Animal experiments were approved by the Animal Care and Use Committee of China Agricultural University. All experiments were performed according to regulations and guidelines established by this committee. Animals used in this study were owned by China Agricultural Poultry Resources Station, who consented to the use of these animals in this study.
Consent for publication
The authors declare that they have no competing interests. Dr. Jacqueline Smith is a member of the editorial board (Section/Associate Editors) of this journal.
- 1. Blomme T, Vandepoele K, De Bodt S, Simillion C, Maere S, Van de Peer Y. The gain and loss of genes during 600 million years of vertebrate evolution. Genome Biol. 2006;7(5):R43.
- 4. Lovell PV, Wirthlin M, Wilhelm L, Minx P, Lazar NH, Carbone L, Warren WC, Mello CV. Conserved syntenic clusters of protein coding genes are missing in birds. Genome Biol. 2014;15(12).
- 6. Bornelov S, Seroussi E, Yosefi S, Pendavis K, Burgess SC, Grabherr M, Friedman-Einat M, Andersson L. Correspondence on Lovell et al.: identification of chicken genes previously assumed to be evolutionarily lost. Genome Biol. 2017;18.
- 9. Zhang Q, Liu L, Zhu F, Ning ZH, Hincke MT, Yang N, Hou ZC. Integrating De novo transcriptome assembly and cloning to obtain chicken Ovocleidin-17 full-length cDNA. PLoS One. 2014;9(3).
- 10. Seroussi E, Cinnamon Y, Yosefi S, Genin O, Smith JG, Rafati N, Bornelov S, Andersson L, Friedman-Einat M. Identification of the long-sought leptin in chicken and duck: expression pattern of the highly GC-rich avian leptin fits an autocrine/paracrine rather than endocrine function. Endocrinology. 2016;157(2):737–51.
- 11. Warren WC, Hillier LW, Tomlinson C, Minx P, Kremitzki M, Graves T, Markovic C, Bouk N, Pruitt KD, Thibaud-Nissen F, et al. A new chicken genome assembly provides insight into avian genome structure. G3-Genes Genom Genet. 2017;7(1):109–17.
- 15. O'Neil ST, Emrich SJ. Assessing De novo transcriptome assembly metrics for consistency and utility. BMC Genomics. 2013;14.
- 26. Orgeur M, Martens M, Borno ST, Timmermann B, Duprez D, Stricker S. A dual transcript-discovery approach to improve the delimitation of gene features from RNA-seq data in the chicken model. Biol Open. 2018;7(1).
- 28. Glemin S, Arndt PF, Messer PW, Petrov D, Galtier N, Duret L. Quantification of GC-biased gene conversion in the human genome. Genome Res. 2015;25(8):1215–28.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.