GXD: a community resource of mouse Gene Expression Data
The Gene Expression Database (GXD) is an extensive, easily searchable, and freely available database of mouse gene expression information (www.informatics.jax.org/expression.shtml). GXD was developed to foster progress toward understanding the molecular basis of human development and disease. GXD contains information about when and where genes are expressed in different tissues in the mouse, especially during the embryonic period. GXD collects different types of expression data from wild-type and mutant mice, including RNA in situ hybridization, immunohistochemistry, RT-PCR, and northern and western blot results. The GXD curators read the scientific literature and enter the expression data from those papers into the database. GXD also acquires expression data directly from researchers, including groups doing large-scale expression studies. GXD currently contains nearly 1.5 million expression results for over 13,900 genes. In addition, it has over 265,000 images of expression data, allowing users to retrieve the primary data and interpret them themselves. Because GXD is an integral part of the larger Mouse Genome Informatics (MGI) resource, its expression data are combined with other genetic, functional, phenotypic, and disease-oriented data. This allows GXD to provide tools for researchers to evaluate expression data in the larger context, search by a wide variety of biologically and biomedically relevant parameters, and discover new data connections to help in the design of new experiments. Thus, GXD can provide researchers with critical insights into the functions of genes and the molecular mechanisms of development, differentiation, and disease.
Keywords: Expression Data, Mouse Genome Informatics, Data Submission, Gene Expression Database, Gene Expression Information
Recent technological advances have made it possible to rapidly determine the sequences of individual human genomes and to correlate genetic mutations with human diseases. Evolutionarily closely related to humans, the mouse is a pivotal model system for determining the molecular mechanisms that lead from specific mutations to developmental defects and disease phenotypes. In mouse, specific constitutive and conditional mutants can be easily generated, and tissues from many different strains and mutants, as well as all developmental stages, can be obtained for gene expression analyses. These expression data can then be correlated with phenotypic and disease data to gain insights into the function of genes and the molecular mechanisms that underlie human development, differentiation, and disease.
The objective of the Gene Expression Database (GXD) is to support and facilitate studies of the molecular mechanisms that underlie developmental and disease processes. GXD systematically collects and integrates different types of expression data from wild-type and mutant mice through curation of the published literature and by collaboration with large-scale projects, and makes them available to researchers in an extensive and easily searchable database (Finger et al. 2011; Smith et al. 2014a). Further, as an integral component of the larger Mouse Genome Informatics (MGI) resource, GXD combines its data with all the other genetic, genomic, functional, phenotypic, and disease-related information in MGI, thus placing these expression data in context and making them readily accessible to many types of biologically and biomedically relevant database searches (Eppig et al. 2015; Smith et al. 2014b).
The importance of recording and integrating mouse expression data and placing them in a larger biological context cannot be overstated. It is impossible for any single individual to keep abreast of all the biomedical research data that are generated yearly, let alone to memorize all these data and their connections. The ability to find results of previous experiments quickly can save investigators months of research time, both in the library and in the laboratory. In addition, GXD and MGI enable researchers to discover new data connections, thus allowing them to develop scientific hypotheses and to design new experiments.
In the following paragraphs, we will discuss: the contents of GXD; how and why expression data are recorded in standardized ways; the integration of expression data with other data in MGI; and the tools provided by GXD to explore these data.
GXD collects endogenous gene expression information derived from wild-type and mutant mice. It includes data from all stages of development, including postnatal development, although the main emphasis is gene expression during the embryonic period. GXD provides researchers a comprehensive survey of the embryonic expression literature, detailed expression data, and tools to examine these data. Because different types of expression assays yield different information about gene products at the RNA and protein level, GXD has been designed as a system that can integrate multiple types of expression data (Ringwald et al. 1994). GXD’s emphasis has been, and continues to be, on data from RNA in situ, immunohistochemistry, knock-in reporter, RT-PCR, northern blot, and western blot experiments. Links to array and high-throughput sequencing expression data at NCBI GEO (Barrett et al. 2013) and the Expression Atlas at EMBL-EBI (Petryszak et al. 2014) are provided as well, and closer integration of these data within GXD is planned for the future.
GXD’s data content and acquisition efforts are unique, integrating heterogeneous expression data from disparate sources. GXD is the only database that systematically curates mouse developmental expression data from the literature. The GXD curators have read and entered the results from thousands of published papers into GXD. Additional data are acquired via electronic data submissions and through collaborations with large-scale data providers. The large-scale projects whose data are in GXD include GenePaint (Visel et al. 2004), Eurexpress (Diez-Roux et al. 2011), the Brain Gene Expression Map (BGEM; Magdaleno et al. 2006), and the GenitoUrinary Development Molecular Anatomy Project (GUDMAP; Harding et al. 2011). Thus, the data in GXD represent the results of research performed by small- and large-scale laboratories worldwide. GXD currently contains nearly 1.5 million detailed expression results from almost 70,000 experiments, covering approximately 13,900 genes. This includes data from more than 2100 mouse mutants. In addition, it has over 265,000 images of the original data, allowing researchers to view and interpret the experiments themselves. Eighty-two percent of the data are from RNA in situ hybridization studies and 10% from RT-PCR experiments, reflecting the detailed spatial resolution and sensitivity required in developmental expression studies.
Comprehensive survey of the embryonic expression literature
GXD provides researchers with an effective way to search the mouse embryonic expression literature. Curators survey journals to find all publications that contain studies of embryonic gene expression using the assay types that GXD collects. They then index these publications with regard to the genes that have been studied, the expression assay types used, and the ages analyzed. Significantly, the curators review the entire article, including supplemental data, and use standard nomenclature for the genes. These annotations are then combined with bibliographic information from PubMed to generate the Gene Expression Literature index.
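The indexing scheme described above (each publication tagged by gene, assay type, and age) can be pictured with a small sketch. This is only an illustration under stated assumptions: the `IndexEntry` record, the `search_index` function, and all entries are hypothetical and do not reflect GXD's actual schema or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One hypothetical literature-index record (not GXD's real schema)."""
    reference: str   # illustrative citation key, not a real identifier
    gene: str        # official gene symbol
    assay_type: str  # e.g. "RNA in situ", "RT-PCR"
    age: str         # e.g. "E10.5" (embryonic day 10.5)


def search_index(entries, gene=None, assay_type=None, age=None):
    """Return the sorted references having an entry that matches every criterion given."""
    hits = set()
    for entry in entries:
        if gene is not None and entry.gene != gene:
            continue
        if assay_type is not None and entry.assay_type != assay_type:
            continue
        if age is not None and entry.age != age:
            continue
        hits.add(entry.reference)
    return sorted(hits)


# Toy entries invented for illustration only.
entries = [
    IndexEntry("paper_A", "Pax6", "RNA in situ", "E10.5"),
    IndexEntry("paper_A", "Pax6", "RT-PCR", "E12.5"),
    IndexEntry("paper_B", "Shh", "RNA in situ", "E10.5"),
    IndexEntry("paper_C", "Pax6", "Immunohistochemistry", "E10.5"),
]

pax6_e105 = search_index(entries, gene="Pax6", age="E10.5")
in_situ_papers = search_index(entries, assay_type="RNA in situ")
```

Because every entry carries a standardized gene symbol, one query retrieves all indexed papers for a gene regardless of the name the authors used.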
Standardized, detailed expression data
The GXD curators make extensive use of controlled vocabularies and ontologies when creating detailed expression records. These vocabularies can be simple lists, such as adjectives describing the pattern and level of gene expression, or large and highly structured, such as our anatomy ontology. Adding data to a database such as GXD involves more than collecting them from different sources. The data must be reviewed and standardized so that they can be linked to and integrated with other data in the database. As will be discussed below, genes, alleles, and anatomical structures are among the key points of data integration in MGI, enabling the search capabilities that GXD provides.
To enable proper data integration and searching, it is important that GXD curators accurately identify the gene whose expression was analyzed, whether the data were obtained from the literature or from a large-scale download. Identifying the genes studied in a publication can be difficult because authors do not always use official nomenclature. Thus, a gene symbol given in a paper may refer to one of several possible genes. When entering data from the literature, curators use context, references cited in the paper, and probe or antibody information to accurately identify the genes examined. When reviewing data obtained from large-scale projects, probe information is essential for correct gene identification. The large-scale efforts usually generate probes for thousands of genes at the beginning of the project and then take several years to complete. During this time new genome assemblies may be released and new gene models created. When these data are submitted to GXD, the curators reanalyze all probe-to-gene associations to ensure that they are still correct. Because the MGI gene catalog is reviewed with each new genome assembly, once data are entered into GXD the correct probe-to-gene associations will be maintained and the expression information will remain associated with the correct genes.
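The disambiguation problem described above can be sketched in a few lines. This is a simplified illustration, not the curators' actual procedure, which relies on human judgment: the gene symbols, synonyms, probe identifier, and the functions `candidate_genes` and `resolve` are all hypothetical.

```python
# Hypothetical official symbols mapped to lower-cased synonyms seen in papers.
synonyms = {
    "GeneX1": {"abc1", "p45"},
    "GeneY2": {"abc1"},
    "GeneZ3": {"xyz"},
}

# Hypothetical probe-to-gene associations.
probe_to_gene = {"probe_17": "GeneY2"}


def candidate_genes(symbol, synonyms):
    """List every official symbol that an author-supplied symbol could refer to."""
    key = symbol.lower()
    return sorted(
        official
        for official, names in synonyms.items()
        if official.lower() == key or key in names
    )


def resolve(symbol, synonyms, probe=None, probe_to_gene=None):
    """Return one official symbol, or None when the evidence is insufficient."""
    candidates = candidate_genes(symbol, synonyms)
    if len(candidates) == 1:
        return candidates[0]
    if probe is not None and probe_to_gene:
        gene = probe_to_gene.get(probe)
        if gene in candidates:
            return gene  # probe information breaks the tie
    return None  # ambiguous or unknown: left for manual review


ambiguous = candidate_genes("Abc1", synonyms)
resolved = resolve("Abc1", synonyms, probe="probe_17", probe_to_gene=probe_to_gene)
```

In this toy case the symbol alone matches two genes, and only the probe association identifies the gene that was assayed.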
GXD curators also ensure expression data from mutant mice are associated with the correct allele. Different alleles of the same gene may have quite different phenotypes. Therefore, in order to integrate expression and phenotype information, curators must accurately identify which mutant was used.
GXD annotates expression results using the Mouse Developmental Anatomy Ontology, developed in collaboration with the Edinburgh Mouse Atlas Project (Hayamizu et al. 2013). The ontology is extensive, containing more than 28,000 stage-specific anatomical terms hierarchically organized from tissue to tissue substructure. The detailed ontology and its hierarchical structure allow for the integration of data of varying levels of granularity and enable searches that include anatomical structures and their substructures. Within MGI this ontology is also being used to annotate sites of expression and activity data for recombinase alleles (the “Cre portal”; http://www.creportal.org), and its terms are used in the naming of Gene Ontology and Mammalian Phenotype ontology terms, laying the groundwork for closer integration with these data in the future.
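To illustrate why a hierarchical ontology lets a search on a structure also return results annotated to its substructures, consider the following minimal sketch. The four-term hierarchy, the gene names, and the function names are invented for illustration; the real ontology is stage-specific and far larger.

```python
# Toy parent -> children hierarchy; terms and relations are illustrative only.
children = {
    "kidney": ["renal cortex", "renal medulla"],
    "renal cortex": ["glomerulus"],
    "renal medulla": [],
    "glomerulus": [],
}

# Toy annotations: (gene, structure in which expression was detected).
annotations = [
    ("geneA", "kidney"),
    ("geneB", "glomerulus"),
    ("geneC", "renal medulla"),
    ("geneD", "heart"),
]


def with_substructures(term, children):
    """Return the term together with all of its descendants (depth-first walk)."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(children.get(current, []))
    return seen


def genes_expressed_in(term, annotations, children):
    """Genes annotated to the term or to any of its substructures."""
    scope = with_substructures(term, children)
    return sorted({gene for gene, structure in annotations if structure in scope})


kidney_genes = genes_expressed_in("kidney", annotations, children)
cortex_genes = genes_expressed_in("renal cortex", annotations, children)
```

A query for "kidney" returns geneA, geneB, and geneC even though only geneA was annotated at that level, which is how annotations of different granularity can be combined in one search.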
It is important to note that it is our curatorial policy to annotate expression patterns strictly based on what the authors say in the text of the manuscript and the figure legends. GXD curators do not interpret the published images. To do otherwise would be an error-prone process because the authors are experts in their fields and base their interpretations on more data than is provided in the publication. GXD annotations are associated with the images so that users can see the primary data and interpret it themselves. GXD images, together with annotations, are also being made available to EMAGE (Richardson et al. 2014) for spatial mapping, and GXD provides links from image entries to corresponding entries in EMAGE. (Due to the inherent variation of embryos, spatial mapping is only applicable to a subset of in situ data from wild-type mice.)
Easy-to-use search tools
Search returns that allow for data refinement and exploration
The Assay Results tab (Fig. 5) provides the most detailed summary. It lists the structures assayed and whether or not expression was detected, along with the gene analyzed, the assay type used, information about mutant genotypes, reference information, links to associated images, and links to the detailed pages discussed above. It also has export features, allowing results to be easily downloaded in text or spreadsheet format.
The Images tab (Fig. 5) allows for review of expression images that match the search criteria. This search return is made possible by the GXD curators’ annotation of the metadata associated with these images. Images are displayed in GXD when we have permission from the publisher or when contributed by a data provider.
The Genes tab (Fig. 5) provides a list of the genes that match the search criteria. The list can be downloaded in text or spreadsheet format or forwarded to either the MGI batch query (Bult et al. 2010) or MouseMine (http://www.mousemine.org). This allows searching for additional information associated with the genes of interest, such as function, phenotype, and/or disease annotations.
There are two matrix view summaries: Tissue × Stage and Tissue × Gene. By default these matrices provide a high-level overview. However, toggles can be used to expand (and collapse) the matrices along the tissue axis to reveal greater levels of detail in the areas of particular interest. The number of returned results can be refined using filters.
The Tissue × Stage Matrix provides a summary of the temporal and spatial expression patterns of genes. This is a good way to get an overview of the expression data for a single gene, as illustrated in Fig. 6. Alternatively, using the Gene Expression Data Matrix link on the GXD home page (http://www.informatics.jax.org/expression.shtml) generates a Tissue × Stage Matrix that displays all of GXD’s data. Then, using the toggles and filters, users can view and iteratively select expression data for the specific tissues and/or developmental stages of interest.
The Tissue × Gene Matrix is useful when comparing expression patterns between genes. This can be especially helpful when using the Differential Expression search (Fig. 4) to find genes expressed in one structure and not another. Figure 7 shows a matrix generated by such a search. Each column corresponds to a gene whose expression pattern matched the search criteria. The row for the structure whose expression was searched for, renal medulla in this case, is filled with blue cells; the intensity of the color indicates the number of annotations in the database. The row for the other structure (renal cortex) is filled with red or blank cells. Red cells indicate expression of that gene was shown to be absent in the structure; blank cells indicate the expression of that gene has either not been analyzed or has not been recorded in GXD for the tissue.
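The logic of such a differential search and the resulting matrix cells can be sketched as follows. This is an illustration under assumptions, not GXD's implementation: the result tuples and gene names are invented, and the rule that a detected result takes precedence over an absent one within a cell is our assumption, since conflicting results are not discussed here.

```python
from collections import Counter

# Toy expression results: (gene, structure, expression detected?).
results = [
    ("geneA", "renal medulla", True),
    ("geneA", "renal medulla", True),
    ("geneA", "renal cortex", False),
    ("geneB", "renal medulla", True),
    ("geneB", "renal cortex", True),
    ("geneC", "renal medulla", True),
    ("geneD", "renal cortex", False),
]


def differential_genes(results, expressed_in, not_in):
    """Genes detected in one structure with no detected result in the other."""
    detected_first = {g for g, s, d in results if s == expressed_in and d}
    detected_second = {g for g, s, d in results if s == not_in and d}
    return sorted(detected_first - detected_second)


def matrix_cell(results, structure, gene):
    """Return (color, annotation count) for one cell of a Tissue x Gene matrix."""
    counts = Counter(d for g, s, d in results if g == gene and s == structure)
    if counts[True]:
        return ("blue", counts[True])   # detected; the count drives color intensity
    if counts[False]:
        return ("red", counts[False])   # analyzed and reported absent
    return ("blank", 0)                 # not analyzed or not recorded


genes = differential_genes(results, "renal medulla", "renal cortex")
matrix = {
    structure: [matrix_cell(results, structure, gene) for gene in genes]
    for structure in ("renal medulla", "renal cortex")
}
```

With these toy data the search returns geneA and geneC: the renal cortex cell is red for geneA (reported absent) and blank for geneC (no recorded result), matching the two cases distinguished in the text.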
Access to GXD
GXD is available through the MGI web site, http://www.informatics.jax.org. The GXD home page can be accessed directly at http://www.informatics.jax.org/expression.shtml, from the MGI home page by following the topic “Gene Expression Database (GXD)”, or from any other page within MGI by clicking the “Expression” tab of the navigation bar. This page gives access to GXD’s search forms and provides more information about GXD, including FAQs, help documents, news announcements, and links to other mouse expression resources.
GXD welcomes direct data submission. The creation of detailed expression records is time consuming. Hence, not every paper in the literature index has detailed expression records. Data submission is the most effective way to ensure that your data will be included in GXD. Inclusion of your data in GXD allows for its integration with the other data in GXD and the larger MGI resource. This will make your data available for database searching, thus adding to its utility and increasing its exposure to the scientific community. Furthermore, GXD ensures that data and data connections are maintained and kept up-to-date even if the gene models or the gene names used in the literature have changed. You can find instructions for data submission on the “Send us your data” tab at the bottom of the GXD Home Page (http://www.informatics.jax.org/mgihome/GXD/GEN/gxd_submission_guidelines.shtml).
GXD and MGI have dedicated User Support personnel. They can be contacted by emailing email@example.com or via the “Contact Us” link in the navigation bar of our web pages. User Support and GXD curators actively seek to provide presentations, demonstrations, and training sessions on GXD and the larger MGI resource at many meetings. User Support can also provide on-site training workshops, as well as remote interactive sessions, upon request. GXD is continually looking for ways to improve the user experience. Direct user interactions, surveys, and collaborations provide valuable feedback on the project from those who use our data. Your feedback is used to prioritize database improvements.
The first full version of the GXD database went online in July 1998 with 3600 publications in the literature index and 32,000 detailed expression results analyzing 517 genes (Ringwald et al. 1999). Since then the usefulness of GXD has grown tremendously as the data content has increased and more powerful web tools have been developed. Going forward we will continue these efforts to integrate gene expression data in the proper biological and analytical context by adding new types of expression data and developing even more advanced search and display tools. Thus, GXD will continue to be an essential resource for researchers who study the molecular basis of development, differentiation, and disease.
GXD is supported by the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) of the National Institutes of Health (NIH); Grant number: HD062499. The GXD resource has benefited from the work of our former colleagues, as well as from our current colleagues from other MGI projects. In particular, we would like to thank the MGI software development team for their work to improve our web interface and its underlying infrastructure: Richard Baldarelli, Jon Beal, Olin Blodgett, Jeff Campbell, Lori Corbani, Sharon Giannatto, Kim Forthofer, Pete Frost, Lucie Hutchins, Jill Lewis, Dave Miers, and Kevin Stone. We would also like to acknowledge the efforts of our dedicated user support personnel, Joanne Berghout and David Shaw.
- Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, Yefanov A, Lee H, Zhang N, Robertson CL, Serova N, Davis S, Soboleva A (2013) NCBI GEO: archive for functional genomics data sets–update. Nucleic Acids Res 41:D991–D995
- Diez-Roux G, Banfi S, Sultan M, Geffers L, Anand S, Rozado D, Magen A, Canidio E, Pagani M, Peluso I, Lin-Marq N, Koch M, Bilio M, Cantiello I, Verde R, De Masi C, Bianchi SA, Cicchini J, Perroud E, Mehmeti S, Dagand E, Schrinner S, Nürnberger A, Schmidt K, Metz K, Zwingmann C, Brieske N, Springer C, Hernandez AM, Herzog S, Grabbe F, Sieverding C, Fischer B, Schrader K, Brockmeyer M, Dettmer S, Helbig C, Alunni V, Battaini MA, Mura C, Henrichsen CN, Garcia-Lopez R, Echevarria D, Puelles E, Garcia-Calero E, Kruse S, Uhr M, Kauck C, Feng G, Milyaev N, Ong CK, Kumar L, Lam M, Semple CA, Gyenesei A, Mundlos S, Radelof U, Lehrach H, Sarmientos P, Reymond A, Davidson DR, Dollé P, Antonarakis SE, Yaspo ML, Martinez S, Baldock RA, Eichele G, Ballabio A (2011) A high-resolution anatomical atlas of the transcriptome in the mouse embryo. PLoS Biol 9:e1000582
- Harding SD, Armit C, Armstrong J, Brennan J, Cheng Y, Haggarty B, Houghton D, Lloyd-MacGilp S, Pi X, Roochun Y, Sharghi M, Tindal C, McMahon AP, Gottesman B, Little MH, Georgas K, Aronow BJ, Potter SS, Brunskill EW, Southard-Smith EM, Mendelsohn C, Baldock RA, Davies JA, Davidson D (2011) The GUDMAP database–an online resource for genitourinary research. Development 138:2845–2853
- Magdaleno S, Jensen P, Brumwell CL, Seal A, Lehman K, Asbury A, Cheung T, Cornelius T, Batten DM, Eden C, Norland SM, Rice DS, Dosooye N, Shakya S, Mehta P, Curran T (2006) BGEM: an in situ hybridization database of gene expression in the embryonic and adult mouse nervous system. PLoS Biol 4:e86
- Petryszak R, Burdett T, Fiorelli B, Fonseca NA, Gonzalez-Porta M, Hastings E, Huber W, Jupp S, Keays M, Kryvych N, McMurry J, Marioni JC, Malone J, Megy K, Rustici G, Tang AY, Taubert J, Williams E, Mannion O, Parkinson HE, Brazma A (2014) Expression Atlas update–a database of gene and transcript expression from microarray- and sequencing-based functional genomics experiments. Nucleic Acids Res 42:D926–D932
- Smith CM, Finger JH, Hayamizu TF, McCright IJ, Xu J, Berghout J, Campbell J, Corbani LE, Forthofer KL, Frost PJ, Miers D, Shaw DR, Stone KR, Eppig JT, Kadin JA, Richardson JE, Ringwald M (2014a) The mouse Gene Expression Database (GXD): 2014 update. Nucleic Acids Res 42:D818–D824
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.