A unified gene catalog for the laboratory mouse reference genome
- 1.5k Downloads
We report here a semi-automated process by which mouse genome feature predictions and curated annotations (i.e., genes, pseudogenes, functional RNAs, etc.) from Ensembl, NCBI and Vertebrate Genome Annotation database (Vega) are reconciled with the genome features in the Mouse Genome Informatics (MGI) database (http://www.informatics.jax.org) into a comprehensive and non-redundant catalog. Our gene unification method employs an algorithm (fjoin—feature join) for efficient detection of genome coordinate overlaps among features represented in two annotation data sets. Following the analysis with fjoin, genome features are binned into six possible categories (1:1, 1:0, 0:1, 1:n, n:1, n:m) based on coordinate overlaps. These categories are subsequently prioritized for assessment of annotation equivalencies and differences. The version of the unified catalog reported here contains more than 59,000 entries, including 22,599 protein-coding coding genes, 12,455 pseudogenes, and 24,007 other feature types (e.g., microRNAs, lincRNAs, etc.). More than 23,000 of the entries in the MGI gene catalog have equivalent gene models in the annotation files obtained from NCBI, Vega, and Ensembl. 12,719 of the features are unique to NCBI relative to Ensembl/Vega; 11,957 are unique to Ensembl/Vega relative to NCBI, and 3095 are unique to MGI. More than 4000 genome features fall into categories that require manual inspection to resolve structural differences in the gene models from different annotation sources. Using the MGI unified gene catalog, researchers can easily generate a comprehensive report of mouse genome features from a single source and compare the details of gene and transcript structure using MGI’s mouse genome browser.
KeywordsMouse Genome Genome Annotation Genome Feature Annotation File Mouse Genome Informatics
Generating lists of genes and other genome features in specific chromosomal regions of the reference mouse genome is a common task among biomedical researchers. Although conceptually simple, generating a complete and non-redundant list of genome features can be challenging because there are multiple major independent genome annotation providers that use different methods for predicting genes. Each of these genome annotation processes generates a set of gene models in which some predictions are unique to a particular pipeline. Even when genes are predicted in common, there are often differences in exon structure and inconsistencies in nomenclature. The accession identifiers associated with the predictions also differ among the various providers. As a consequence, a list of genes downloaded from one source does not always match a gene list obtained from a different source. Further, there are a number of annotation projects that specialize in specific types of genome features such as regulatory regions (Yue et al. 2014) and functional RNAs (Chan and Lowe 2009; Kozomara and Griffiths-Jones 2014). These features are often not included in the predictions of the major annotation providers or are represented incompletely.
We describe here the methods we use to combine annotations from multiple sources into a single “unified gene catalog” for the laboratory mouse reference genome. The Mouse Genome Informatics (MGI) unified gene catalog process does not simply append the different sources of mouse genome feature predictions/annotations together; rather, equivalent genome features from different sources are mapped to a single, unique accession identifier and assigned official standardized nomenclature. Genome features from specialty annotation databases such as miRBase for miRNAs (Kozomara and Griffiths-Jones 2014), Rfam for rRNAs (Burge et al. 2013), and gtRNAdb for tRNAs (Chan and Lowe 2009) are also integrated into in the MGI gene catalog. In this report, we focus on the integration of gene models and curated annotations from the three major genome annotation providers: NCBI, Ensembl, and Vertebrate Genome Annotation database (Vega).
The primary inputs for the MGI unified gene catalog are the genome feature predictions and annotations for the laboratory mouse reference genome generated by three major genome annotation providers: Ensembl, NCBI, and Vega. Each of these annotation providers employs different computational and manual methods that result in similar, but not identical, sets of gene models. For mouse, NCBI’s Eukaryotic Genome Annotation Pipeline starts with the alignment of transcripts and proteins, especially sequences that included the curated RefSeq resource (Pruitt et al. 2014). Splign is used for transcript alignment (Kapustin et al. 2008) and ProSplign for protein alignment). The pipeline also incorporates results from an HMM-based gene prediction program (Gnomon) (Pruitt et al. 2014). The Vega database represents clone-by-clone manual curation of finished genomic sequences by the Human and Vertebrate Analysis and Annotation (HAVANA) group at the Sanger Institute (Wilming et al. 2008). Ensembl’s automatic gene annotation system relies on the alignments of mRNAs and protein sequences to the assembly (Flicek et al. 2013). In addition, the Ensembl genome annotation incorporates all of the genes manually annotated by HAVANA group. GU accepts as input annotation files in General Feature Format (GFF or GFF3; http://www.sequenceontology.org/gff3.shtml). Genome features are usually genes, but any entity with genome coordinates can be used as input. GU is highly configurable and the amount of coordinate overlap required to call two features equivalent can be adjusted; the types of genome features to be included or excluded in the analysis can be configured, and a requirement for features to be on the same strand can be turned on or off.
Materials and methods
Genome feature counts from the annotations of the reference mouse genome by NCBI, Ensembl, and Vega and counts of feature types following the integration of the three annotation sources into the MGI unified gene catalog
Other genome features (non-coding RNAs, etc.)
MGI unified gene catalog
Genome feature predictions from NCBI v104 were based on the reference genome assembly GRCm38.p2 for the laboratory mouse (C57BL/6J). Predictions and annotations from Ensembl v78 and Vega v58 were based on assembly GRCm38.p3. Details regarding the differences in the assembly versions are available from the Genome Reference Consortium (GRC) web site (http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/) (Church et al. 2015). Annotation files from Ensembl and Vega were converted from GTF format to GFF3 prior to the GU analysis.
Gene unification (GU) using the fjoin algorithm
Annotation files in GFF3 format from two genome annotation sources were used as input to the GU process using the fjoin algorithm (Richardson 2006). Pairwise comparison of the annotations from NCBI, Ensembl, and Vega to each other and to the current MGI gene catalog was performed. The fjoin program was configured to consider a genome coordinate overlap of a single-nucleotide position on the same strand as sufficient for establishing the initial assertion of equivalency of two genome features. For protein-coding genes, the fjoin analysis was further constrained to consider coordinate overlaps between exons. We informally evaluated the results obtained from fjoin using different overlap values (e.g., 1, 10, 20, 50, 70, and 100 nt). Only minor differences were noted in the numbers of features in each of the fjoin categories. For example, small RNAs were often included in 0:1 or 1:0 categories when overlaps of more than 20 nt were required leading to false negative equivalency assertions. Changing overlap parameters did not significantly reduce the time needed for manual review of genome features in complex overlap categories (e.g., 1:n, n:1, n:m).
Processing fjoin categories
Genome features that were deemed to be equivalent to existing MGI genome features (1:1 category) were loaded into MGI without manual review. For genome features novel to MGI (0:1 category), new gene records were created and assigned official nomenclature according to the guidelines of the International Committee on Standardized Nomenclature for Mice.
The genome features that fell into the n:1, 1:n, and n:m categories were reviewed manually. When the manual review identified genome features needing changes to annotations in external resources, the cases were shared with curators at HAVANA and NCBI using a private, on-line Mouse Genome Annotation (MGA) issue tracking system. The MGA allows for coordinated analysis and resolution of annotation discrepancies thereby supporting ongoing refinement and improvement of mouse genome annotations. The MGA resource was implemented using JIRA (https://www.atlassian.com/) and is hosted by NCBI.
In addition to genome features and their coordinates, the MGI gene catalog also includes the biotype annotations associated with the features in each of the annotation sources. Some of the prediction/annotation sources include biotype annotations for both genome features and the transcripts of those features. We used only genome feature level annotations to identify contradictory biotypes among features our unification process identified as being equivalent; transcript-level biotypes were not considered in the biotype conflict evaluation. As there is no single biotype vocabulary that all annotation groups share in common, we maintain a mapping among the various terminologies in a biotype thesaurus. For example, NCBI annotates all pseudogenes using the biotype term “pseudo.” In contrast, pseudogenes from Vega are classified into multiple subcategories: unitary_pseudogene, processed_pseudogene, translated_processed_pseudogene, transcribed_processed_pseudogene, unprocessed_pseudogene, translated_unprocessed_pseudogene, and transcribed_unprocessed_pseudogene. All of these terms are considered equivalent to the biotype of “pseudogene” in MGI. Similarly, Vega’s IG_pseudogene and TR_pseudogene biotypes and Ensembl’s IG_D_pseudogene, IG_C_pseudogene, IG_V_pseudogene TR_J_pseudogene, and TR_V_pseudogene are all considered equivalent to the term “pseudogenic gene segment” in MGI.
The MGI biotype thesaurus is updated as new biotype terms appear in the annotation files from the three major annotation providers. Within MGI, genome features are given biotype labels based on terms in the Sequence Ontology (Mungall et al. 2011).
Updating the unified gene catalog
The unified gene catalog is updated whenever a new annotation version is released by NCBI, Ensembl, or Vega. The updates include the addition of new genome features as described above as well as the discontinuation of genes that no longer have evidence to support them. When a new version of the reference mouse genome assembly is released, the coordinates in the unified catalog are converted to the new assembly coordinates using NCBI Remap (http://www.ncbi.nlm.nih.gov/genome/tools/remap).
Pairwise comparison of annotations using fjoin
Summary of gene unification results for NCBI (version 104), Ensembl (version 78), and Vega (version 58)
Genome features equivalent across all sources
Genome features unique to MGI
Genome features unique to NCBI
Genome features unique to Ensembl
Genome features unique to Vegaa
Genome features requiring manual assessment (1:n, n:1, n:m categories)
The 3095 genome features that are unique to MGI mostly consist of genes created from full-length cDNAs sequenced as part of the functional annotation of the mouse (FANTOM) initiative (Okazaki et al. 2002). The 12,719 genome features that are unique to NCBI are mostly long non-coding RNA genes and pseudogenes. These distinct features are likely linked to NCBI’s genome analysis for version 104 which incorporated RNA Seq alignments from projects represented in the Sequence Read Archive (Shumway et al. 2010) to assist in gene structure prediction. Similar to NCBI, the 11,957 Ensembl genome annotations not in NCBI are mostly long non-coding RNA genes and pseudogenes. Over 900 of the unique genes in the Ensembl/Vega annotations are located on chromosome Y, which was not well annotated in previous releases. The reconciliation of genome features from Ensembl, Vega, and NCBI with the previous version of the MGI unified gene catalog resulted in 8896 new genome features in MGI. New records in MGI were created for these features, and they were reviewed and assigned official gene nomenclature and biotype annotations.
Accessing the MGI unified gene catalog
Researchers who wish to report an issue or suggested correction for specific mouse genome annotations can submit a report using the public MGA web site (http://www.ncbi.nlm.nih.gov/genome/guide/mouse/MGAReport.shtml). Submissions from the web site results in a “ticket” in the tracking system shared by the mouse annotation curation groups at The Jackson Laboratory, The Sanger Institute, and NCBI.
In this report, we describe the method by which we combine annotation outputs from multiple, independent genome analysis pipelines into a unified gene catalog for the mouse reference genome. As annotated genome assemblies for other mouse strains are generated, we will use the methods described in this report to generate additional strain-specific gene catalogs. The algorithm that drives the GU process, fjoin, is organism-agnostic and could be used to support similar annotation integration efforts for any organism for which there are multiple sources of genome feature predictions.
The MGI unified gene catalog effort has similarities to the Consensus CDS (CCDS) project (to which MGI is a contributing partner) at NCBI (Farrell et al. 2014). Similar to the CCDS initiative, the primary inputs for MGI’s gene catalog are genome annotations from Ensembl, NCBI, and Vega. The CCDS focuses on those annotations/predictions with consistently annotated full-length coding regions (i.e., those with an ATG and valid stop-codon) that can be translated using consensus splice sites without frameshifts. Where the goal of the CCDS is to identify the highest confidence protein-coding gene annotations only, MGI’s gene catalog includes all genome annotations, regardless of biotype. Gene models categorized as equivalent by our unification process are likely to be representations of the same gene or transcription unit. Equivalent gene models, however, are not necessarily identical in gene structure and our pipeline does not evaluate which gene model is likely to be the “best” representation.
The unified gene catalog serves as the foundation for the annotation of biological attributes (i.e., phenotype, function, expression, and pathway membership) of mouse genes by expert curators and bio-data analysts at the MGI (http://www.informatics.jax.org) database (Eppig et al. 2015). The MGI gene catalog also serves as the basis for mouse genome features represented at NCBI’s Gene resource (Brown et al. 2015; Sayers et al. 2012) and was a primary source of genes for the first phase of The International Knockout Mouse Project (KOMP) (Bradley et al. 2012; International Mouse Knockout et al. 2007).
The output from MGI’s unified gene catalog process systematically identifies gene models that are potentially problematic in their structural details as well as those that appear to be equivalent across different sources but have contradictory biotype annotations. The ongoing assessment of genome annotation issues at MGI in cooperation with the international mouse genome annotation community ensures that the biomedical community gains maximum benefit from the reference mouse genome.
The development of the fjoin algorithm and its application to building and maintaining a unified catalog of mouse genes was supported by NIH HG00330-P1. The authors gratefully acknowledge the ongoing contribution of all of the members of the MGI consortium at The Jackson Laboratory to maintaining and functionally annotating the reference mouse genome feature catalog. The authors thank all of the curators at NCBI and HAVANA for their collaborative efforts to improve annotations for the reference mouse genome. The authors thank Dr. Joel Graber for his critical review of an early version of this manuscript.
- Bradley A, Anastassiadis K, Ayadi A, Battey JF, Bell C, Birling MC, Bottomley J, Brown SD, Burger A, Bult CJ, Bushell W, Collins FS, Desaintes C, Doe B, Economides A, Eppig JT, Finnell RH, Fletcher C, Fray M, Frendewey D, Friedel RH, Grosveld FG, Hansen J, Herault Y, Hicks G, Horlein A, Houghton R, Hrabe de Angelis M, Huylebroeck D, Iyer V, de Jong PJ, Kadin JA, Kaloff C, Kennedy K, Koutsourakis M, Lloyd KC, Marschall S, Mason J, McKerlie C, McLeod MP, von Melchner H, Moore M, Mujica AO, Nagy A, Nefedov M, Nutter LM, Pavlovic G, Peterson JL, Pollock J, Ramirez-Solis R, Rancourt DE, Raspa M, Remacle JE, Ringwald M, Rosen B, Rosenthal N, Rossant J, Ruiz Noppinger P, Ryder E, Schick JZ, Schnutgen F, Schofield P, Seisenberger C, Selloum M, Simpson EM, Skarnes WC, Smedley D, Stanford WL, Stewart AF, Stone K, Swan K, Tadepally H, Teboul L, Tocchini-Valentini GP, Valenzuela D, West AP, Yamamura K, Yoshinaga Y, Wurst W (2012) The mammalian gene function resource: the International Knockout Mouse Consortium. Mamm Genome 23:580–586PubMedCentralCrossRefPubMedGoogle Scholar
- Farrell CM, O’Leary NA, Harte RA, Loveland JE, Wilming LG, Wallin C, Diekhans M, Barrell D, Searle SM, Aken B, Hiatt SM, Frankish A, Suner MM, Rajput B, Steward CA, Brown GR, Bennett R, Murphy M, Wu W, Kay MP, Hart J, Rajan J, Weber J, Snow C, Riddick LD, Hunt T, Webb D, Thomas M, Tamez P, Rangwala SH, McGarvey KM, Pujar S, Shkeda A, Mudge JM, Gonzalez JM, Gilbert JG, Trevanion SJ, Baertsch R, Harrow JL, Hubbard T, Ostell JM, Haussler D, Pruitt KD (2014) Current status and new features of the Consensus Coding Sequence database. Nucleic Acids Res 42:D865–D872PubMedCentralCrossRefPubMedGoogle Scholar
- Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley S, Fitzgerald S, Gil L, Garcia-Giron C, Gordon L, Hourlier T, Hunt S, Juettemann T, Kahari AK, Keenan S, Komorowska M, Kulesha E, Longden I, Maurel T, McLaren WM, Muffato M, Nag R, Overduin B, Pignatelli M, Pritchard B, Pritchard E, Riat HS, Ritchie GR, Ruffier M, Schuster M, Sheppard D, Sobral D, Taylor K, Thormann A, Trevanion S, White S, Wilder SP, Aken BL, Birney E, Cunningham F, Dunham I, Harrow J, Herrero J, Hubbard TJ, Johnson N, Kinsella R, Parker A, Spudich G, Yates A, Zadissa A, Searle SM (2013) Ensembl 2013. Nucleic Acids Res 41:D48–D55PubMedCentralCrossRefPubMedGoogle Scholar
- Okazaki Y, Furuno M, Kasukawa T, Adachi J, Bono H, Kondo S, Nikaido I, Osato N, Saito R, Suzuki H, Yamanaka I, Kiyosawa H, Yagi K, Tomaru Y, Hasegawa Y, Nogami A, Schonbach C, Gojobori T, Baldarelli R, Hill DP, Bult C, Hume DA, Quackenbush J, Schriml LM, Kanapin A, Matsuda H, Batalov S, Beisel KW, Blake JA, Bradt D, Brusic V, Chothia C, LE Corbani, Cousins S, Dalla E, Dragani TA, Fletcher CF, Forrest A, Frazer KS, Gaasterland T, Gariboldi M, Gissi C, Godzik A, Gough J, Grimmond S, Gustincich S, Hirokawa N, Jackson IJ, Jarvis ED, Kanai A, Kawaji H, Kawasawa Y, Kedzierski RM, King BL, Konagaya A, Kurochkin IV, Lee Y, Lenhard B, Lyons PA, Maglott, Maltais L, Marchionni L, McKenzie L, Miki H, Nagashima T, Numata K, Okido T, Pavan WJ, Pertea G, Pesole G, Petrovsky N, Pillai R, Pontius JU, Qi D, Ramachandran S, Ravasi T, Reed JC, Reed DJ, Reid J, Ring BZ, Ringwald M, Sandelin A, Schneider C, Semple CA, Setou M, Shimada K, Sultana R, Takenaka Y, Taylor, Teasdale RD, Tomita M, Verardo R, Wagner L, Wahlestedt C, Wang Y, Watanabe Y, Wells C, Wilming LG, Wynshaw-Boris A, Yanagisawa M, Yang I, Yang L, Yuan Z, Zavolan M, Zhu Y, Zimmer A, Carninci P, Hayatsu N, Hirozane-Kishikawa T, Konno H, Nakamura M, Sakazume N, Sato K, Shiraki T, Shiraki T, Waki K, Kawai J, Aizawa K, Arakawa T, Fukuda S, Hara A, Hashizume W, Imotani K, Ishii Y, Itoh M, Kagawa I, Miyazaki A, Sakai K, Sasaki D, Shibata K, Shinagawa A, Yasunishi A, Yoshino M, Waterston R, Lander ES, Rogers J, Birney E, Hayashizaki Y, Consortium F, I RGERGP, Team II (2002) Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature 420:563–573CrossRefPubMedGoogle Scholar
- Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O’Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, DiCuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM (2014) RefSeq: an update on mammalian reference sequences. Nucleic Acids Res 42:D756–D763PubMedCentralCrossRefPubMedGoogle Scholar
- Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Federhen S, Feolo M, Fingerman IM, Geer LY, Helmberg W, Kapustin Y, Krasnov S, Landsman D, Lipman DJ, Lu Z, Madden TL, Madej T, Maglott DR, Marchler-Bauer A, Miller V, Karsch-Mizrachi I, Ostell J, Panchenko A, Phan L, Pruitt KD, Schuler GD, Sequeira E, Sherry ST, Shumway M, Sirotkin K, Slotta D, Souvorov A, Starchenko G, Tatusova TA, Wagner L, Wang Y, Wilbur WJ, Yaschenko E, Ye J (2012) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 40:D13–D25PubMedCentralCrossRefPubMedGoogle Scholar
- Yue F, Cheng Y, Breschi A, Vierstra J, Wu W, Ryba T, Sandstrom R, Ma Z, Davis C, Pope BD, Shen Y, Pervouchine DD, Djebali S, Thurman RE, Kaul R, Rynes E, Kirilusha A, Marinov GK, Williams BA, Trout D, Amrhein H, Fisher-Aylor K, Antoshechkin I, DeSalvo G, See LH, Fastuca M, Drenkow J, Zaleski C, Dobin A, Prieto P, Lagarde J, Bussotti G, Tanzer A, Denas O, Li K, Bender MA, Zhang M, Byron R, Groudine MT, McCleary D, Pham L, Ye Z, Kuan S, Edsall L, Wu YC, Rasmussen MD, Bansal MS, Kellis M, Keller CA, Morrissey CS, Mishra T, Jain D, Dogan N, Harris RS, Cayting P, Kawli T, Boyle AP, Euskirchen G, Kundaje A, Lin S, Lin Y, Jansen C, Malladi VS, Cline MS, Erickson DT, Kirkup VM, Learned K, Sloan CA, Rosenbloom KR, Lacerda de Sousa B, Beal K, Pignatelli M, Flicek P, Lian J, Kahveci T, Lee D, Kent WJ, Ramalho Santos M, Herrero J, Notredame C, Johnson A, Vong S, Lee K, Bates D, Neri F, Diegel M, Canfield T, Sabo PJ, Wilken MS, Reh TA, Giste E, Shafer A, Kutyavin T, Haugen E, Dunn D, Reynolds AP, Neph S, Humbert R, Hansen RS, De Bruijn M, Selleri L, Rudensky A, Josefowicz S, Samstein R, Eichler EE, Orkin SH, Levasseur D, Papayannopoulou T, Chang KH, Skoultchi A, Gosh S, Disteche C, Treuting P, Wang Y, Weiss MJ, Blobel GA, Cao X, Zhong S, Wang T, Good PJ, Lowdon RF, Adams LB, Zhou XQ, Pazin MJ, Feingold EA, Wold B, Taylor J, Mortazavi A, Weissman SM, Stamatoyannopoulos JA, Snyder MP, Guigo R, Gingeras TR, Gilbert DM, Hardison RC, Beer MA, Ren B, Mouse EC (2014) A comparative encyclopedia of DNA elements in the mouse genome. Nature 515:355–364PubMedCentralCrossRefPubMedGoogle Scholar
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.