Principles and design
The goal of the IMSR is to provide an online, searchable, web-based catalog of mouse resources available globally, including inbred, mutant, and genetically engineered mice, cryopreserved embryos and gametes, and ES cell lines. For each strain or cell line, the IMSR website provides links for ordering, links to the repository’s strain description, and links to phenotype and disease model data. Mouse repositories of any size and in any location are welcome to contribute data about their mouse resource holdings, provided that those holdings are available to investigators who request access. This does not mean that the resources are free of charge, only that they are available to researchers; most repositories charge customers to recover the costs of operating the repository and of maintaining and shipping mouse resources. In addition, IMSR expects that repositories will update their holdings on a regular basis. Many active repositories provide new data files weekly. Individual investigators are also welcome to contribute to IMSR as a small repository if they are distributing their resources and will ship their unique mouse mutants without special restrictions.
Repositories contributing to IMSR
There are currently 20 repositories and repository consortia (representing 46 individual repository sites) listing mouse resource holdings in IMSR (Table 1). As of May 15, 2015, these collectively hold 32,396 mouse strains (as live stocks, cryopreserved embryos, and cryopreserved gametes) and 209,328 mutant ES cell lines (Table 1). Of these, approximately 1300 strains exist both as ES cell lines and as mouse strains (largely as cryopreserved sperm or embryos). There is virtually no duplication in strain holdings between repositories. However, at any given time, a repository may have multiple forms of a given strain available (e.g., frozen embryos or live mice). This can arise through the dynamic cycle of cryopreservation, recovery, breeding, and re-cryopreservation involved in providing or restoring a repository’s strain holdings, or from a repository policy of storing strains in multiple states (e.g., as cryopreserved embryos and sperm).
Table 1 IMSR (www.findmice.org) repositories’ holdings (data as of May 15, 2015)
Large-scale mutagenesis and analysis projects have generated, or are generating, a significant number of new genetically defined mice that are actively being archived in existing mouse repositories. These projects include the International Mouse Phenotyping Consortium (IMPC), with mice recovered from ES cell line knockout mutations and mice generated in subsequent crosses to cre-deleter mice for removal of neo cassettes (Brown and Moore 2012; Mallon et al. 2012), and several N-ethyl-N-nitrosourea (ENU) targeted phenotype screens (Li et al. 2015; Arnold et al. 2012; Goldowitz et al. 2004; Nolan et al. 2000; Hrabé de Angelis et al. 2000; Justice et al. 1999). In addition, the adoption of gene-editing technologies using ZFNs (zinc-finger nucleases), TALENs (transcription activator-like effector nucleases), and CRISPR/Cas9 (clustered regularly interspaced short palindromic repeats/CRISPR-associated system) (Aida et al. 2015; Singh et al. 2015; Sung et al. 2012) by the IMPC and the larger mouse genomics community will further expand repository inventories.
Informatics infrastructure
An overview of the IMSR system is shown in Fig. 1. Data providers deposit files in a defined format on a private FTP site. The format is a simple tab-delimited text file whose fields are specified on an IMSR help page (http://www.findmice.org/participate). An automated process checks the deposit area hourly. New submissions are scanned for formatting and content errors (e.g., missing values or IDs that do not designate valid genes). Any errors are communicated to the providers via email, along with instructions for the needed fixes and contact information for obtaining assistance. Submissions that pass the error-checking process are archived, then processed against MGD data to improve their content, indexed by Lucene (http://lucene.apache.org), and made available via our Solr instance (http://lucene.apache.org/solr/). IMSR is a NoSQL system; all queries and page generation are supported by Solr/Lucene, which provides very fast response times. IMSR currently uses Solr 5.2 running in a WildFly 8.2 container. A parallel system is used for development and testing.
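The automated scan of a new submission can be illustrated with a minimal Python sketch. This is not the IMSR implementation: the column names (`strain_id`, `strain_name`, `state`, `strain_type`, `gene_id`), the function name `validate_submission`, and the set of valid gene IDs are all hypothetical placeholders, since the actual field list is defined on the IMSR help page. The sketch only shows the two kinds of checks mentioned above, missing values and IDs that do not designate known genes.

```python
import csv
import io

# Hypothetical column names; the real field list is on the IMSR help page.
REQUIRED = ("strain_id", "strain_name", "state", "strain_type")


def validate_submission(text, valid_gene_ids):
    """Return a list of (line_number, message) errors for a tab-delimited file.

    `valid_gene_ids` stands in for a lookup against MGD gene IDs (assumption).
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    missing_cols = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing_cols:
        return [(1, "missing column(s): " + ", ".join(missing_cols))]

    errors = []
    # The header is line 1, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        for col in REQUIRED:
            if not (row.get(col) or "").strip():
                errors.append((line_no, f"missing value for {col}"))
        gene_id = (row.get("gene_id") or "").strip()
        if gene_id and gene_id not in valid_gene_ids:
            errors.append((line_no, f"unknown gene ID {gene_id}"))
    return errors
```

The returned list of line numbers and messages is the kind of information that could be placed in the email sent back to a provider; an empty list would mean the file proceeds to archiving and indexing.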
Concurrently, after a newly submitted data file passes the automated checks, comparison with MGD data reveals any data inconsistencies (e.g., incorrect strain names, mismatched IDs, or gene or allele nomenclature in need of updating). These are curated, and corrections are returned to the repository provider so that they can update their records. In this manner, repositories’ nomenclature becomes more accurate and current and their data improve iteratively, while users benefit because future loads of corrected data are more readily searchable using standard allele, gene, and strain designations.
User interface: the IMSR website
Users access IMSR primarily via a web-based interface (www.findmice.org). Searches can be performed using one or many parameters, including strain parameters, genetic parameters, and repository name/location. Strain parameters include the strain/stock designation, the strain ID, the state in which the strain/resource is maintained (live, cryopreserved embryo, cryopreserved ovary, cryopreserved or freeze-dried sperm, or ES cell line), and the strain type. Genetic parameters include the symbol or name of the phenotypic allele or gene of interest carried in the strain, the relevant allele or gene accession ID, the type (origin) of the mutation, and its chromosomal location. Repository parameters include the name of one or more specific repositories, or the selection of all repositories in a geographic region (Fig. 2). The results of a search are returned in tabular format, with each row in the table representing one unique genetic strain from a given repository. (Note, therefore, that if a repository holds a strain in multiple states, the strain is listed only once, with each available state shown.) Search results can be exported in text or Excel format. Figure 3 shows 10 of the 29 rows returned when searching for strains carrying mutations in the Stat3 gene (as of May 15, 2015).
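The row structure of the results table (one row per strain per repository, with all states listed on that row) can be sketched in Python as follows. This is an illustration only: the function name `collapse_states`, the tuple layout, and the repository and strain names in any usage are assumptions, and IMSR itself produces its result pages from Solr/Lucene rather than from code like this.

```python
def collapse_states(holdings):
    """Collapse (repository, strain, state) tuples into one row per
    (repository, strain), keeping every distinct state in first-seen order.
    """
    rows = {}  # dicts preserve insertion order in Python 3.7+
    for repository, strain, state in holdings:
        states = rows.setdefault((repository, strain), [])
        if state not in states:
            states.append(state)
    return [
        {"repository": repository, "strain": strain, "states": states}
        for (repository, strain), states in rows.items()
    ]
```

Under this scheme, a strain held by one repository as both live mice and cryopreserved embryos yields a single row with two states, whereas the same strain name held by a second repository yields a separate row.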
But how does an investigator who does not know which strain or mutant is needed approach the IMSR? The key is found in the reciprocal links and complementary information contained in MGD (phenotype and disease model data) and the IMSR (strain listings). Figure 4 illustrates the interplay between these sites conceptually. A user directs questions that are oriented toward phenotypes or disease models to MGD, where the specifics of a mutant phenotype can be viewed and the human disease(s) that the mutant is used to model can be identified. Each MGD page detailing the phenotype and disease models for a given mutant links directly to IMSR, so that users can physically locate strains or ES cell line resources containing the mutant in question. Similarly, a user of IMSR viewing a set of strains and ES cell lines that carry mutations in a particular gene can, for any of those strains, link directly to the MGD detail page describing the phenotype and disease models.
Data standards and challenges in maintaining and updating IMSR
The largest challenge in maintaining IMSR is the variability in quality and completeness among the data files submitted by different repositories. Although the data fields and required format are documented, the data received may be incomplete, particularly from smaller repositories with less sophisticated informatics infrastructure, or the repositories may lack critical information from the original source that generated the mice. However, any strain holding in a repository file that includes the minimum data fields (strain ID, strain name, state, strain type) can be displayed on the IMSR website, although only limited links to other information resources will be possible.
A second challenge for IMSR is the use of non-standard nomenclature in the gene, mutant allele, and strain designations provided to IMSR. When processing incoming files, IMSR runs scripts that attempt some automatic completion of data fields (e.g., if the repository provided a nomenclature-correct mutant allele but left the gene field blank, the correct gene can be inferred). Other scripts compare incoming data with MGD data so that withdrawn nomenclature or nomenclature synonyms can be replaced with the correct symbols and names on the IMSR website. These automatic corrections allow links to gene and allele data that would otherwise not be possible.
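The two kinds of automatic correction described here can be sketched in Python. The sketch rests on stated assumptions and is not the IMSR scripts themselves: it assumes allele symbols are supplied in the plain-text form of a gene symbol followed by the allele designation in angle brackets, and it uses a hypothetical dictionary, `current_symbol`, as a stand-in for the comparison with MGD data. The function name `complete_record` and the record keys are likewise illustrative.

```python
import re

# Assumed plain-text allele form: GeneSymbol<alleleDesignation>
ALLELE_PATTERN = re.compile(r"^(?P<gene>[^<>]+)<(?P<allele>[^<>]+)>$")


def complete_record(record, current_symbol):
    """Return a copy of `record` with the gene field completed and updated.

    `record` is a dict with 'gene' and 'allele' keys.
    `current_symbol` maps withdrawn symbols or synonyms to current symbols
    (a hypothetical stand-in for an MGD lookup).
    """
    fixed = dict(record)
    gene = (fixed.get("gene") or "").strip()
    allele = (fixed.get("allele") or "").strip()

    # Step 1: infer a blank gene field from a well-formed allele symbol.
    if not gene and allele:
        match = ALLELE_PATTERN.match(allele)
        if match:
            gene = match.group("gene")

    # Step 2: replace a withdrawn symbol or synonym with the current symbol.
    fixed["gene"] = current_symbol.get(gene, gene)
    return fixed
```

A record whose allele does not match the expected pattern is returned with its gene field unchanged, which corresponds to the case where the value cannot be interpreted automatically and is left for curator review.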
Many nomenclature errors cannot be easily interpreted in the ways described above. These are displayed on the IMSR website ‘as is.’ A curator reviews logs of data errors from incoming repository files and returns corrections to the repository, where possible. It remains incumbent upon the repository to update its own holdings database with the corrected nomenclature and IDs. This feedback is intended to improve the repository’s own site, as well as to ensure that the next data file provided to IMSR is correct and that the same errors do not reappear in the error log.