Streamlined computational pipeline for genetic background characterization of genetically engineered mice based on next generation sequencing data
Genetically engineered mice (GEM) are essential tools for understanding gene function and disease modeling. Historically, gene targeting was first done in embryonic stem cells (ESCs) derived from the 129 family of inbred strains, leading to a mixed background or congenic mice when crossed with C57BL/6 mice. Depending on the number of backcrosses and breeding strategies, genomic segments from 129-derived ESCs can be introgressed into the C57BL/6 genome, establishing a unique genetic makeup that needs characterization in order to obtain valid conclusions from experiments using GEM lines. Currently, SNP genotyping is used to detect the extent of 129-derived ESC genome introgression into C57BL/6 recipients; however, it fails to detect novel/rare variants.
Here, we present a computational pipeline implemented in the Galaxy platform and in BASH/R script to determine genetic introgression of GEM using next generation sequencing (NGS) data, such as whole genome sequencing (WGS), whole exome sequencing (WES) and RNA-Seq. The pipeline includes strategies to uncover variants linked to a targeted locus, genome-wide variant visualization, and the identification of potential modifier genes. Although these methods apply to congenic mice, they can also be used to describe variants fixed by genetic drift. As a proof of principle, we analyzed publicly available RNA-Seq data from five congenic knockout (KO) lines and our own RNA-Seq data from the Sall2 KO line. Additionally, we performed target validation using several genetic approaches.
We revealed the impact of the 129-derived ESC genome introgression on gene expression, predicted potential modifier genes, and identified potential phenotypic interference in KO lines. Our results demonstrate that our new approach is an effective method to determine genetic introgression of GEM.
Keywords: Sequencing, Congenic mouse, Knockout mouse, Genomic variation, Genetic interactions, Modifier genes, Genetic background, RNA-Seq variant calling, qPCR validation, Ang, Cdkn1a, Sall2
129P2: 129P2OlaHsd mouse strain
CRISPR: Clustered regularly interspaced short palindromic repeats
ESC: Embryonic stem cell
GEM: Genetically engineered mice
GEO: Gene Expression Omnibus
MEFs: Mouse embryonic fibroblasts
MGP: Mouse Genomes Project
NGS: Next Generation Sequencing
PCR: Polymerase chain reaction
qPCR: Quantitative real-time PCR
QTL: Quantitative trait loci
shRNA: Short hairpin ribonucleic acid
SNP: Single nucleotide polymorphism
SRA: Sequence Read Archive
Trp53: Tumor Protein p53 (mouse)
UCSC: University of California Santa Cruz
VCF: Variant Call Format
VEP: Variant effect predictor
WES: Whole Exome Sequencing
WGS: Whole Genome Sequencing
The use of mouse models has resulted in a wealth of knowledge regarding gene function in animal and human diseases, including complex traits. The modern laboratory mouse is the result of careful breeding and trait selection that began in the early twentieth century [1, 2, 3]. Inbred mice, produced by brother-sister mating, are isogenic and homozygous, making it possible to know the genetic profile of the strain by typing an individual. Some inbred strains have features that are valuable for transgenic and embryonic stem cell (ESC) technology. The 129-derived ESCs are particularly successful in germline transmission and have been extensively used in the creation of over 5000 knockout (KO) lines [6, 7, 8]. However, many ESC lines have now been derived from other strains. For example, ESCs from C57BL/6N are used in large consortium projects (e.g., EUCOMM). After screening for an ESC clone harboring the targeted allele (e.g., KO and knockin [KI]), ESCs are typically injected into blastocysts (from a strain that differs in coat color) in order to obtain chimeras showing a mixture of black and agouti (or albino) spots, suitable to estimate the degree of chimerism. These chimeras need to be crossed with wild-type (WT) mice to test for germline transmission. The heterozygous carriers of targeted alleles are then either intercrossed, obtaining a line with mixed background, or backcrossed (typically to recipient C57BL/6), obtaining a congenic line by further backcrossing [4, 9]. However, this strategy has disadvantages: the resulting mice will contain mixed backgrounds, and the development of a full congenic line could take up to 5 years given that 10 generations of backcrosses with the recipient strain are needed. Although this timeframe can be reduced when using marker-assisted backcrossing (speed congenics), it could still take at least 2.5 years.
An important consideration is the complex phenotypic evaluation that could result from targeted gene analysis in mixed background lines. Each individual KO or KI mouse (and the wild-type [WT] littermates) will have a different genetic background composition, due to differences in the segregating background genes from the two parental strains [12, 13]. Thus, the different genetic backgrounds of KO/KI models could influence the resulting targeted-gene phenotype [14, 15, 16, 17, 18], particularly affecting the reproducibility of translational studies when mixed and/or uncharacterized backgrounds are used [19, 20, 21]. Additionally, the presence of a segment of the ESC-derived chromosome flanking the targeted gene, also known as the “congenic footprint”, can confound analysis of phenotypes associated with the targeted gene. The congenic footprint and its pattern of expression could lead to an inaccurate comparison between WT and KO/KI mice due to the linkage of genes at the targeted locus. In line with this, several reports have shown evidence of dramatic changes in gene expression associated with flanking genes, closely related to the genetic background [22, 24, 25, 26]. These interactions could introduce bias when dissecting KO/KI-dependent transcriptomes, leading to erroneous phenotype assignments [23, 27, 28, 29]. The incorporation of new nuclease-dependent genome editing techniques is certainly addressing this problem, allowing the generation of GEM on any inbred strain without using ESCs or chimeras. Still, novel variants could be fixed in these lines due to off-target effects during Cas9 model generation and/or genetic drift over time, justifying the need for accurate genetic background characterization in every GEM line used. Although background characterization can be performed using SNP genotyping on different platforms, these methods test a limited number of loci, not always related to protein coding genes, and do not detect novel variants.
Next generation sequencing (NGS) enables high throughput sequencing of genes and genomes at relatively low cost. However, resulting NGS data is very complex, and additional computational methods should be available for the scientific community to characterize the genetic background of GEM lines. Here, we present a computational pipeline that uses NGS data from whole genome shotgun sequencing (WGS), whole exome sequencing (WES) and/or RNA-Seq to detect the nature, ploidy and amount of introgressed variants in GEM lines. This pipeline can generate genome-wide plots of variants per genotype, detect congenic footprints and identify potential modifier genes, which will enable a better understanding of the phenotypic outcomes in studies using partially congenic or mixed background GEM lines, as well as to unravel novel genetic interactions in these models.
Isolation of primary mouse embryonic fibroblasts (MEFs) and cell cultures
We obtained Sall2 KO mice from Dr. Ryuichi Nishinakamura (Kumamoto University, Kumamoto, Japan) by a material transfer agreement (MTA, 2010). Genotyping of these mice was as previously described, and their housing was performed according to the Animal Ethics Committee of Chile’s National Commission for Scientific and Technological Research (CONICYT, Protocol FONDECYT project 1151031). At 13.5 days post coitum, female mice were euthanized by CO2 inhalation, and MEFs from Sall2 WT and KO embryos were isolated as described previously. Mice were routinely genotyped by isolating tail DNA as previously reported. In brief, 1 μL of genomic DNA was used for PCR analysis using the following oligonucleotides: forward, 5′-CACATTTCGTGGGCTACAAG-3′; reverse, 5′-CTCAGAGCTGTTTTCCTGGG-3′; and Neo, 5′-GCGTTGGCTACCCGTGATAT-3′. The sizes of the PCR products were 188 bp for the WT and 380 bp for the KO.
Sall2+/+, Sall2+/−, and Sall2−/− primary and immortalized MEFs were cultured in DMEM supplemented with 10% heat inactivated fetal bovine serum (FBS, GE Healthcare HyClone), 1% glutamine (Invitrogen), and 0.5% penicillin/streptomycin (Invitrogen). Experiments with primary Sall2+/+ and Sall2−/− MEFs were performed with early passages (passages 3–4). Immortalized Sall2+/+ and Sall2−/− MEFs were obtained using SV40 large T antigen based on a modified protocol from Zhu et al. For transfection of primary MEFs, we used Lipofectamine 2000 (Invitrogen) and 2 μg of SV40 large T antigen expression vector (Addgene Plasmid #9053). After transfection, cells were selected at low density. To complete the immortalization process, 5–6 post-transfection passages were carried out. Human embryonic kidney epithelial cells (HEK293; American Type Culture Collection CRL-1573™) were cultured in DMEM supplemented with 10% FBS, 1% glutamine, and 0.5% penicillin/streptomycin.
RNA-Seq analysis for the detection of differentially expressed genes (DEGs)
We purified RNA (Qiagen) from Sall2+/+, Sall2+/− and Sall2−/− MEFs treated or not with 1 μM doxorubicin (Sigma Aldrich) for 16 h. RNA-Seq libraries were prepared at the University of Cambridge sequencing facility (UK). Sequencing on a NextSeq 500 instrument yielded an output of 400 gigabases and four FASTQ files per sample. We merged the FASTQ files matching each sample and aligned the reads against the mouse genome assembly (mm10 build) using the HISAT2 aligner (v220.127.116.11, default settings). We sorted the BAM files using the SortSam.jar script from Picard tools and used HTSeq (union mode) to quantify the number of reads per gene in each BAM file. The GTF file (genes.gtf) used in HTSeq was from the iGenomes repository (mm10, Illumina). Prior to testing for differential expression, we normalized the count table with the RUVSeq package available in Bioconductor (R, Bioconductor: https://www.bioconductor.org/packages/release/bioc/html/RUVSeq.html) with in-silico empirical negative controls and RUVg normalization. The edgeRun code (exact test, y = 50,000) was used to perform differential expression analysis between WT and KO samples. We further selected DEGs with an FDR < 0.001. Gene ontology analysis was performed using the InnateDB database (https://www.innatedb.com).
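The final selection step can be illustrated with a short sketch. It assumes that the reported FDR is a Benjamini-Hochberg adjusted p-value, which the text does not state explicitly; the function names and the input format (a mapping from gene name to raw p-value) are hypothetical and are not taken from the published scripts.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_degs(gene_pvals, fdr_cutoff=0.001):
    """Keep genes whose adjusted p-value is below the cutoff (assumed FDR < 0.001)."""
    genes = list(gene_pvals)
    adjusted = benjamini_hochberg([gene_pvals[g] for g in genes])
    return sorted(g for g, q in zip(genes, adjusted) if q < fdr_cutoff)
```

In practice the adjusted values would come directly from the differential expression package; the sketch only makes the thresholding step explicit.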
Computational pipeline for variant calling and characterization from the NGS data
Galaxy platform
We uploaded individual BAM files from the RNA-Seq data to the main Galaxy platform (https://usegalaxy.org/). After sorting, genome-wide simple diploid calling was applied using Freebayes (https://github.com/ekg/freebayes). We filtered variants from the resulting raw VCF (Variant Call Format) files using the VCFlib program (https://github.com/vcflib/vcflib) with the following criteria: -f “DP > 10” (Depth over 10 reads) and -f “QUAL > 30” (minimum Phred-scaled probability of error over 30). Chromosomal histograms were plotted using an “in-house” R script (see “script outline” in https://github.com/cfarkas/Genotype-variants). For identification of common variants in KO animals not present in their WT counterparts, we used several tools from the VCFlib toolkit available in Galaxy. We started intersecting KO VCF files using the VCF-VCF intersect program (reference genome mm10) and annotated genotypes (VCF annotate genotypes) using calls from the WT file. We filtered the resulting annotated VCF file by selecting lines that did not match those of the WT (Filter and Sort). An output file with the KO-linked variants was obtained.
Four BASH scripts were used sequentially to 1) sort BAM files with SAMtools (sort_bam.sh), 2) perform variant calling with Freebayes (variant_collection.sh, parameters described above), 3) filter variants in each VCF file with VCFlib/Bcftools dependencies (filtering_combined_mouse.sh, parameters for VCFlib described above) and 4) dissect KO/KI-linked variants and visualize common variants for each genotype with R (genotype_variants_mouse.sh, see https://github.com/cfarkas/Genotype-variants).
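The filtering criteria (DP > 10, QUAL > 30) and the subtraction of WT calls from the variants shared by KO samples can be sketched in a few lines of Python. This is an illustration of the logic only, not the VCFlib or Galaxy implementation: the function names are hypothetical, variants are matched on chromosome, position, reference and alternate allele, and it is an assumption that the same filters are applied to the WT calls before subtraction.

```python
def parse_vcf_line(line):
    """Extract the variant key, QUAL and INFO/DP from one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    info = dict(item.split("=", 1) for item in fields[7].split(";") if "=" in item)
    qual = float(fields[5]) if fields[5] != "." else 0.0
    depth = int(info.get("DP", 0))
    key = (fields[0], int(fields[1]), fields[3], fields[4])  # chrom, pos, ref, alt
    return key, qual, depth


def filtered_variants(vcf_lines, min_depth=10, min_qual=30.0):
    """Return the set of variant keys with DP > min_depth and QUAL > min_qual."""
    kept = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):  # skip blank and header lines
            continue
        key, qual, depth = parse_vcf_line(line)
        if depth > min_depth and qual > min_qual:
            kept.add(key)
    return kept


def ko_linked_variants(ko_vcfs, wt_vcfs):
    """Variants common to every KO sample and absent from all WT samples."""
    common_ko = set.intersection(*(filtered_variants(v) for v in ko_vcfs))
    wt_union = set().union(*(filtered_variants(v) for v in wt_vcfs))
    return sorted(common_ko - wt_union)
```

Each element of `ko_vcfs` and `wt_vcfs` is an iterable of VCF text lines, for example an open file handle; at least one KO sample is required.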
Visualization of variants in R
We developed a script written in R (genotype_variants.R) for proper visualization of variants across mouse chromosomes. The script takes the intersected VCF files from WT and KO mice in VCF format as inputs and produces an output of variant frequency per chromosome. The script also includes statistical detection of chromosomes with KO-linked variants in the experiments. We tested the frequency distribution of variants with the Cochran-Armitage test for trend distribution, available in the DescTools package implemented in the R statistical program (https://cran.r-project.org/web/packages/DescTools/index.html). Detected variants were binned every 10 million base pairs according to their chromosomal coordinates, ordered in a contingency table and plotted. After this, a Cochran-Armitage test for trend distribution was implemented to identify chromosomes containing KO-linked variants, based on the frequency distribution of WT and KO genotypes. Graphics were done with the ggplot2 package, implemented in R (https://cran.r-project.org/web/packages/ggplot2/index.html).
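The binning and trend test described above can be illustrated outside R. The following Python sketch bins variant positions every 10 Mb and computes a two-sided Cochran-Armitage trend statistic for a 2 × k table (KO versus WT counts per bin) using the textbook normal approximation with equally spaced scores. It is not the DescTools implementation, and its p-values may differ slightly from those of the published script; the function names are hypothetical.

```python
import math
from collections import Counter

BIN_SIZE = 10_000_000  # 10 million base pairs, as in the text


def bin_counts(positions, n_bins):
    """Count variants per 10 Mb bin; positions beyond n_bins are ignored."""
    counts = Counter(pos // BIN_SIZE for pos in positions)
    return [counts.get(i, 0) for i in range(n_bins)]


def cochran_armitage(ko_counts, wt_counts):
    """Two-sided Cochran-Armitage trend test on a 2 x k table.

    Scores are the bin indices 0..k-1. Returns (z, p_value).
    """
    scores = range(len(ko_counts))
    r1, r2 = sum(ko_counts), sum(wt_counts)
    n = r1 + r2
    if n == 0:
        return 0.0, 1.0
    cols = [a + b for a, b in zip(ko_counts, wt_counts)]
    t_stat = sum(t * (a * r2 - b * r1)
                 for t, a, b in zip(scores, ko_counts, wt_counts))
    s1 = sum(t * c for t, c in zip(scores, cols))
    s2 = sum(t * t * c for t, c in zip(scores, cols))
    variance = (r1 * r2 / n) * (n * s2 - s1 * s1)
    if variance <= 0:
        return 0.0, 1.0  # degenerate table: no trend can be assessed
    z = t_stat / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value
```

Applied per chromosome, a small p-value indicates that KO and WT variants are distributed differently along the chromosome, which is the signal used here to flag candidate footprints.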
We isolated RNA from cells using TRIzol (ThermoFisher Scientific, Inc.) followed by chloroform and isopropanol extraction. The RNA samples were treated with Turbo DNA-free Kit (Invitrogen) to eliminate any residual DNA from the preparation. Total RNA (2 μg) was reverse transcribed using the M-MLV reverse transcriptase (PROMEGA) and 0.25 μg of Anchored Oligo(dT)20 Primer (Invitrogen; 12577-011). We performed qPCR reactions in triplicate using KAPA SYBR FAST qPCR Master Mix (2X) Kit (Kapa Biosciences) and primer concentrations of 0.4 μM (Additional file 10: Table S1). Cycling conditions were as follows: initial denaturation at 95 °C for 3 min, then 40 cycles with 95 °C for 5 s (denaturation) and 60 °C for 20 s (annealing/extension). To control specificity of the amplified product, a melting-curve analysis was carried out. No amplification of unspecific product was observed. Expression of each gene was normalized to the Polr2a gene (RNA pol II) and plotted as fold change compared to control in each case.
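A common way to express qPCR data relative to a reference gene and a control condition is the 2^(-ΔΔCt) calculation. The sketch below assumes this method and an amplification efficiency of 100%, neither of which is stated in the text; the function name and the Ct values used to exercise it are hypothetical.

```python
def fold_change(ct_target_sample, ct_ref_sample, ct_target_control, ct_ref_control):
    """Relative expression by 2^(-ddCt), averaging technical replicates first.

    Each argument is a list of Ct values (e.g. triplicates). The reference
    gene would be Polr2a in the setting described above.
    """
    def mean(values):
        return sum(values) / len(values)

    delta_sample = mean(ct_target_sample) - mean(ct_ref_sample)
    delta_control = mean(ct_target_control) - mean(ct_ref_control)
    return 2 ** -(delta_sample - delta_control)
```

A value below 1 indicates lower expression in the sample than in the control, and a value above 1 indicates higher expression.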
Western blot analysis
Proteins from cell lysates (50–80 μg of total protein) were fractionated by SDS-PAGE and transferred for 1 h at 200 mA to PVDF membranes (Immobilon; Millipore) using a wet transfer system. The PVDF membranes were blocked for 1 h at room temperature in 5% nonfat milk in TBS-T (TBS with 0.1% Tween), and incubated with primary antibody at an appropriate dilution at 4 °C overnight in blocking buffer. After washing, the membranes were incubated with horseradish peroxidase-conjugated secondary antibodies diluted in TBS-T buffer for 1 h at room temperature. Immunolabeled proteins were visualized by ECL (General Electric Healthcare, Amersham, UK). Antibodies used for Western blotting were as follows: anti-angiogenin (1:500, ab10600; Abcam), anti-p53 (1:500, PAb240; Abcam), anti-p21 (1:500, sc-6246; Santa Cruz Biotechnology), anti-β-actin (1:10000, C4; Santa Cruz Biotechnology), and anti-SALL2 (1:1000, HPA004162; SIGMA).
Transient transfections and viral infection
For transient transfection, 1.5 × 10^6 immortalized MEFs (iMEFs) from Sall2+/+ mice were electroporated using 30 μg of plasmids at 1150 V for 30 milliseconds (NEON Transfection System, Thermo Fisher Scientific). For transduction of Sall2 shRNA into iMEFs, lentiviral particles were packaged in HEK293 cells by co-transfecting pCMV-dR8.2 dvpr (Addgene Plasmid #8455), pCMV-VSVG (Addgene plasmid #8454) and pLKO.1 (Addgene Plasmid #8453) containing the 5′-CCGGAAGTCATGGATACAGAAGCACACTCGAGTGTGCTCTGTATCCATGACTTTTTTTG-3′ (loop & stop in bold) sequence, which targets exon 2 of Sall2. The medium was changed every 24 h with 9 μg/mL of polybrene and 24, 48 and 72-h supernatants were filtered through a 0.45 μm filter, collected and added to WT iMEF cells in each case. iMEF cells were selected with 5 μg/mL of puromycin and further recovered with fresh DMEM medium.
CRISPR-Cas9 KO generation
WT iMEFs were electroporated as described above, with vectors encoding CRISPR-Cas9 in frame with PaprikaRFP (ATUM, DNA TWOPOINTO INC) using the following guide RNA sequences: GGTGAGCGAGGAATTCGGTC and TAGTCTAGGTGCTCCGGTAC targeting the largest exon of the mouse Sall2 gene (exon 2). These two proteins can be efficiently produced from one coded peptide that relies on the self-cleaving 2A peptide to allow translational skipping. At 16 h following electroporation, the top 2% of the brightest cells were sorted with a BD FACSAria III cell sorter (BD Biosciences-US), and pools of 100 cells were plated. The pools were grown for two weeks, and Western blotting against SALL2 was performed to identify silenced cells. Genomic PCR and further sequence analysis were used to confirm CRISPR-Cas9-mediated editing of the Sall2 locus.
Genome-wide detection and distribution of variants from GEM lines
Since the 129P2 inbred strain (used for Sall2 gene targeting) was already characterized in the Mouse Genomes Project (Wellcome Sanger Institute, UK) [46, 47], we next applied the pipeline to identify 129-derived variants from the Sall2 KO sequencing experiment. We plotted variants from each genotype according to genomic coordinates using our script written in R (genotype_variants.R, Fig. 2d). Variants were binned every 10 million base pairs (Mb) from each genotype and plotted by chromosome. In the case of Sall2 KO, the distribution of KO common variants was similar to the distribution of WT variants, with the exception of Chr 14, where the Sall2 gene targeting was done (located at 52.3 Mb) (Fig. 2d). We also investigated the distribution of all variants (subtracting C57BL/6J variants) in each KO line analyzed and applied the Cochran-Armitage test for trend distribution to find chromosomes presenting differential distribution of variants. According to the analysis, the Gtf2ird1 KO line displayed extensive backcrossing with C57BL/6J and showed a congenic footprint on Chr 5 where the Gtf2ird1 gene is located (P < 0.0001, Cochran-Armitage test for trend distribution) (Additional file 2). The Mecp2 KO also presented extensive backcrossing with C57BL/6J mice, but not an obvious footprint on Chr X where the Mecp2 gene is located (P = 0.4508) (Additional file 2). Still, variants linked to the targeted gene were expected due to the congenic nature of this KO line.
Similar to the Gtf2ird1 KO, the Stc1 KO line presented extensive backcrossing with C57BL/6J and a clear footprint on Chr 14 where Stc1 is located (P < 0.0001) (Additional file 2). The Itch KO also presented extensive backcrossing with C57BL/6J mice; however, four chromosomes display obvious targeted locus-linked variants (Chr 2, Chr 9, Chr 10 and Chr 16 with P < 0.0001 for the first three and P < 0.02 for the last) (see Additional file 2).
The Sall2 KO presented a very similar distribution to that shown in Fig. 2d, suggesting that most of the variants in this line come from 129P2-derived ESCs (Additional file 2). Thus, the mixed background with the ESCs was obvious in this KO due to the amount of 129P2 introgressed variants along ten chromosomes, including Chr 14 where Sall2 and the footprint are located. Five chromosomes presented differential distribution of variants, with Chr 14 showing the lowest p-value (Additional file 4: Table S1). Similar to the Sall2 KO, the Hnrnpd KO displayed a mixed background, but the average distribution of the variants greatly differed between genotypes (Additional file 2). Although a footprint was present on Chr 5 where Hnrnpd is located, the variant distribution was significantly different in 12 other chromosomes (Additional file 4: Table S1), likely due to a low number of backcrosses with C57BL/6J. Thus, we expected potentially disturbing passenger mutations from 129S6-derived ESCs (W4) in the Hnrnpd KO line. We also reviewed Casp4 variants on Chr 9, a gene naturally inactivated (5 base pair deletion) in several 129 strains (S1, S2, S6, P2, X1). Variant calling from every biological replicate of this study revealed the genotype of 129 congenic Casp4 across samples, evidencing ploidy of Casp4 129-derived variants in one WT and in two Hnrnpd-KO samples (Additional file 4: Table S2). We confirmed this observation by the lack of expression of Casp4 exon 7, as described for several 129 strains (Fig. 2e). Thus, besides variants that are linked to the targeted locus, mixed backgrounds in KO lines could have a deep influence on gene expression or phenotypes, as reviewed previously [10, 51, 52].
In addition to the RNA-Seq data, we also tested our pipeline using WES data from the GEO dataset GSE115017 and single cell WGS data from the ArrayExpress archive (E-MTAB-4183). We successfully detected the introgressed variants from DBA/2 mice in the C57BL/6J-DBA/2 sample from the GSE115017 study, and we identified the mixed background samples from the E-MTAB-4183 study together with the number of chromosomes showing ESC introgression (Additional file 3). Taken together, our procedures offer a reliable way to detect genetic variation from NGS data, effectively identifying genetic introgression.
Dissection of variants linked to targeted genes: The congenic footprint
Ploidy of congenic footprint
In the case of the Stc1 KO, nearly half of the variants were heterozygous; thus, this footprint contains both heterozygous and homozygous portions (Fig. 4f). Reviewing the distribution of homozygous and heterozygous variants for every littermate showed that the KO1 embryo displayed homozygous variants in both the homozygous and heterozygous portions of the footprint, while the KO2 and KO3 embryos only displayed these variants in the homozygous portion (Fig. 4g). Conversely, the KO2 and KO3 embryos displayed heterozygous variants, while KO1 had very few variants of this type (Fig. 4h). Thus, the KO1 embryo is homozygous for both portions of the footprint while KO2 and KO3 are not. Figure 4i shows a summary of the ploidy in every littermate for the Stc1 KO line, evidencing ploidy variability in the footprint region. All these analyses suggest that the inheritance of the congenic footprint is complex and cannot be assumed to be homozygous in every case.
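The per-littermate comparison of homozygous and heterozygous calls can be sketched as a simple tally of VCF genotype (GT) strings inside a region of interest. This is an illustrative sketch only: the function names, the input format (position and GT pairs for one sample on the targeted chromosome) and the region boundaries are hypothetical, and diploid calls are assumed.

```python
from collections import Counter


def zygosity(gt):
    """Classify a diploid VCF GT string such as '0/1', '1/1' or '1|1'."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        return "missing"
    if alleles[0] == alleles[1]:
        return "hom_ref" if alleles[0] == "0" else "hom_alt"
    return "het"


def footprint_ploidy(calls, start, end):
    """Tally zygosity classes for one sample within [start, end].

    Returns the counts and the fraction of heterozygous calls among
    non-reference calls (None when there are no such calls).
    """
    counts = Counter(zygosity(gt) for pos, gt in calls if start <= pos <= end)
    non_ref = counts["hom_alt"] + counts["het"]
    het_fraction = counts["het"] / non_ref if non_ref else None
    return counts, het_fraction
```

Comparing the heterozygous fraction between littermates over the same interval is one simple way to summarize the kind of variability reported for the Stc1 KO line.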
The congenic footprint influences gene expression of Sall2-KO MEFs
To confirm Sall2-dependent DEGs in another genetic background, we also used data from a microarray study of transcription factor (TF)-inducible mouse ESCs in which a single TF (such as Sall2) is induced in a doxycycline-controllable manner, which allowed cross-validation of 37 other DEGs from the RNA-Seq experiment (Additional file 7: Table S3). From this comparison, 15 DEGs presented similar fold changes between studies (Fig. 5f). We evaluated two of these DEGs by qPCR, confirming trends from the RNA-Seq and the microarray studies (Fig. 5g). These 15 DEGs partly confirmed the initial gene ontology terms (Additional file 7: Table S4). Additionally, we cross-validated the Sall2-dependent downregulation of Ang, Pnp, and Rpph1 using a CRISPR model of SALL2 in HEK293 cells, lacking the highest expressed isoform of Sall2 (Fig. 5h). Our study confirmed that the congenic footprint and its interaction with the genetic background influence transcriptome analysis from KO lines. Thus, additional experimental approaches and cross validation are required to determine gene-dependent targets.
Screening of expression quantitative trait loci (eQTL) in the Sall2 KO congenic region
We also analyzed gene expression using doxorubicin as an environmental perturbation, since this drug increases nucleosome turnover around the promoters of active genes. We tested 16 congenic DEGs ranked by fold change for genotype dependency in the control condition, of which eight displayed linear genetic dependency (Fig. 6b, left). Global perturbation with doxorubicin altered fold changes of these genes and the DEGs with genetic dependence (Fig. 6b, right). Four of these genes displayed genetic dependence in both control and doxorubicin-treated conditions (Ang, Tmem260, 4930579G18Rik and Osgep, see red dots in Fig. 6b), and Ang was one of the most differentially expressed genes in both cases (Additional file 7: Table S1 and Additional file 8: Table S2, respectively). Sall2-KO Ang displayed low expression levels both in control and doxorubicin-treated MEFs compared to WT Ang expression. However, the fold change in Ang expression induced by doxorubicin was similar between genotypes (Fig. 6c). These results suggest that the congenic (129P2) Ang promoter, controlling both Ang and Rnase4 genes, is functional, but Ang transcription is low in the 129P2 strain. In agreement with our data, RNA-Seq data from the striatum of the eight Collaborative Cross founder strains (SRA project ID: PRJNA228935) showed that Ang expression is remarkably low in six out of the eight strains (except C57BL/6J and CAST/EiJ), values corresponding to outliers in comparison to the group. We did not see this effect in the expression of Rnase4 (Fig. 6d). Moreover, strains with low levels of Ang in the striatum presented several variants in the Ang/Rnase4 gene, which were absent in the C57BL/6J and CAST/EiJ strains (Additional file 9A). These variants are also present in Sall2-KO MEFs, congenic from 129P2, but absent in the WT counterpart (Additional file 9B), suggesting an association of these variants with the low expression of congenic Ang.
In line with this, Sashimi plots from the RNA-Seq data across the founder strains supported by-pass of Ang transcription linked to the genomic variants (Fig. 6e and f, respectively and see Additional file 9C). Furthermore, an independent RNA-Seq study from the hippocampus of 129S1/SvImJ mice (GEO DataSets accession GSE76567) showed strong downregulation of Ang transcripts compared to the C57BL/6J mice (Additional file 9D), a trend that we also experimentally confirmed in the cortex of the Sall2-KO mice by qPCR (Additional file 9E). By Western blot analysis, we confirmed strong downregulation of ANG protein levels in Sall2-KO MEFs (Fig. 6g), in agreement with the low Ang levels detected earlier by qPCR (see Ang in Fig. 5b). In contrast, mild downregulation of ANG protein levels was detected in Sall2-silenced cells (Fig. 6h) along with mild downregulation of Rnase4 (Fig. 6i). Similarly, CRISPR-Cas9-mediated Sall2 KO in WT MEFs showed mild downregulation of Ang (Fig. 6j, see model validation in Additional file 11). These results suggest that SALL2 transcriptionally regulates Ang/Rnase4, but Ang expression is additionally affected by congenic variants present in the Sall2 KO line. Consistent with transcriptional regulation by Sall2, the Ang/Rnase4 promoter contains a cluster of three SALL2 binding sites around the transcription start site (data not shown). An Ang/Rnase4 promoter of 1231 base pairs displayed less activation in Sall2-KO versus WT cells, consistent with the mild downregulation of Ang and ANG protein levels in Sall2-silenced cells (Fig. 6k). Taken together, congenic Ang is transcribed at low levels due to genetic determinants inherited from 129P2, partly masking Sall2-dependent transcriptional regulation. Thus, Ang could be classified as a potential modifier gene in Sall2-KO MEFs.
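The genotype dependency screen in this section can be illustrated with a simple least-squares fit of expression against allele dosage. The sketch assumes that "linear genetic dependency" means a linear relationship between expression and the number of KO alleles (0 for Sall2+/+, 1 for Sall2+/−, 2 for Sall2−/−); the exact criterion used in the study is not given in the text, and the function name and example values are hypothetical.

```python
import math


def genotype_dependency(dosage, expression):
    """Least-squares fit of expression on allele dosage.

    Returns (slope, intercept, pearson_r). Raises ValueError when all
    samples share the same dosage, since no slope can be estimated.
    """
    n = len(dosage)
    mean_x = sum(dosage) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosage)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosage, expression))
    if sxx == 0:
        raise ValueError("dosage has no variation")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    pearson_r = sxy / math.sqrt(sxx * syy) if syy > 0 else 0.0
    return slope, intercept, pearson_r
```

A gene whose expression falls or rises steadily from WT through heterozygous to KO samples gives a correlation close to -1 or 1, whereas a gene affected only in the homozygous KO gives a weaker linear fit.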
Genetic interference of Cdkn1a, a canonical target of Sall2
The origin of the ESCs used in gene targeting, the number of backcrosses and the consecutive breeding used for the maintenance of GEM (KO/KI) lines (including potential genetic drift) can all have a profound impact on the genetic make-up of these models. These genetic variations within mice from the same KO or KI line will influence gene expression and phenotypes, potentially jeopardizing experimental conclusions. Thus, the genetic background of GEM lines imposes biases that need to be addressed before drawing conclusions, to ensure reproducibility of gene expression and of the phenotypes associated with a targeted gene.
We designed an automated pipeline implemented in both the Galaxy platform and in a BASH/R script to perform genetic background characterization of GEM lines. Using NGS data, our pipeline can 1) identify introgression of ESC-derived variants in the C57BL/6 background and other recipient genomes, including genome-wide variant visualization; 2) define partial congenic, fully congenic, or mixed backgrounds and 3) detect and characterize the ploidy of the congenic footprint. After applying the pipeline, the Ensembl Variant Effect Predictor (VEP) algorithm can be used to classify variants as novel or existent. However, a potential limitation of our pipeline in Galaxy when using WGS data (at high depth) is the amount of computational time employed in the variant calling, making the use of public servers impractical and restricting the calculations to a cluster. To circumvent this problem, we implemented the pipeline purely in BASH, raising the open file limit for such analysis (see Github: https://github.com/cfarkas/Genotype-variants). Thus, our pipeline is flexible in the use of both RNA and DNA sequencing data. Large-scale genomic sequencing data is superior for measuring introgression of genes or genomic segments from one strain to another, as well as for identifying sequence differences in non-transcribed DNA. However, using RNA-Seq data, it is possible to assess influences on gene expression caused by the congenic footprints and to identify putative modifier genes with an eQTL strategy. Of relevance, our approach provided the opportunity to uncover genetic contamination along with novel variants fixed by genetic drift.
Advantages and Disadvantages for the use of NGS data in genetic background characterization
Genotype Variants Pipeline
NGS data: WGS/WES/RNA-seq
High for WES/WGS
Rare Variant Detection
Novel variant discovery
Yes, Using RNA-Seq (eQTLs)
Yes, using genetic linkage studies.
Increasing power with more depth
Genome-wide plots in R (free)
Programmatically or included in commercial programs
To explore the introgression of gene variants in GEM lines, we applied the pipeline using publicly available high throughput data, in addition to our experimental data from Sall2 KO mice. As a proof of concept, we were able to identify the ploidy of 129-derived variants that lead to a Casp4 null mutation (reported in several 129 strains) in the background of Hnrnpd KO mice. We also found that the number of novel variants is highly variable between KO lines, even exceeding the number of ESC introgressed variants. This observation represents a bias since novel and missense variants correlate in number, imposing novel backgrounds for the KO lines and the need for proper characterization of these variants.
Our studies indicate that the number of congenic genes varies between KO lines, and in one case introgressed genes are outside the targeted chromosome (e.g., for the Itch KO). The latter example implies that both genotypes (WT/KO) were independently maintained. Alternatively, we may have detected a partially (incomplete) congenic strain with residual segments outside the targeted chromosome. After obtaining linked variants by the WT subtraction, we suggest DNA sequencing of cells or tissues from heterozygous littermates, as it will further confirm the extent of the footprint. Since most of the variants near the target gene are homozygous, calls from a heterozygous genotype can discriminate these variants assuming Mendelian inheritance. This method was successful in the Sall2 KO, as evidenced by the > 60 Mb footprint. Nevertheless, a more complex scenario of ploidy can be found, as is the case for the Stc1 KO, where nearly half of the footprint is heterozygous and introgressed with different ploidy among KO littermates. We recognize that this issue is concerning in terms of reproducibility across biological replicates in KO studies.
Using the Sall2 KO as a model, it was possible to assess the influence of the congenic footprint on gene expression and to identify putative modifier genes (eQTLs) using RNA-Seq data. By silencing Sall2 (using shRNA or CRISPR-Cas9) within cells of the same genetic background (WT littermate), we also demonstrated the importance of validating target-dependent genes initially identified using the Sall2 WT/KO MEFs. Likely because of the influence of the introgressed 129P2 genome in Chr 14 of Sall2 KO cells, several DEGs found in the WT/KO MEF comparison could not be confirmed by Sall2 shRNA experiments. Interestingly, Pnp, a gene within the congenic region of Chr 14, was identified as a DEG in the Sall2 shRNA studies, but not in the analysis of the RNA-Seq data from Sall2 WT/KO lines. Further analysis uncovered a bias caused by genetic variation in the KO model due to mismatches in the PCR primer hybridization region (Fig. 5e). The congenic nature of Pnp likely explains the failure to detect it as a DEG in the Sall2 KO MEFs. In fact, polymorphisms in gene regulatory regions can modify their transcriptional output by creating or ablating transcription factor binding sites or other transcriptional regulatory elements.
Using several experimental approaches, we found that the low transcription of Ang in Sall2 KO MEFs is likely caused by genetic components inherited from the targeted ESCs, but also by the absence of a functional SALL2 transcription factor. Our experimental data also suggest that congenic Ang is a modifier gene that affects genes related to the targeted gene, specifically the expression of the SALL2 target Cdkn1a (p21). However, we cannot discount the idea that the levels of Cdkn1a in the Sall2 KO could be the consequence of a polygenic effect and not only of low levels of ANG.
In summary, due to the constraints mentioned above in the use of KO/KI congenic mice, conclusions related to gene expression and phenotypes could be misleading. Selection of an appropriate strain and characterization of the genetic background are critical aspects of any experiment using GEM lines. Even for purely technical reasons, polymorphisms in coding genes should be detected to allow adequate primer design if qPCR validation is intended. In silico characterization of variants coming from the genetic background, including the dissection of congenic variants, can improve our understanding of phenotypic outcomes in GEM lines. However, validation of data using alternative approaches (e.g., shRNA, siRNA, and CRISPR-Cas9 targeting) is also required for specific target-dependent conclusions. We suggest generating KOs by genome editing technologies, such as CRISPR-Cas9, in order to attribute gene expression changes and phenotypes solely to the targeted gene. Nevertheless, genetic characterization is still needed due to the occurrence of off-target mutations or genetic drift. Our strategy can refine the use of KO lines and open opportunities to uncover new genetic interactions, such as the Ang/Cdkn1a axis described here.
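The primer design point can be made concrete with a short Python sketch that flags qPCR primers whose hybridization region contains a called variant. This is only an illustration under stated assumptions: the `Primer` and `Snv` types, the primer names and all coordinates are hypothetical, positions are taken to be 1-based and inclusive, and only single-nucleotide variants are considered.

```python
from typing import Dict, List, NamedTuple


class Primer(NamedTuple):
    name: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


class Snv(NamedTuple):
    chrom: str
    pos: int    # 1-based


def primers_with_variants(primers: List[Primer], variants: List[Snv]) -> Dict[str, List[int]]:
    """Map each primer name to the sorted variant positions inside its binding site."""
    hits: Dict[str, List[int]] = {}
    for p in primers:
        overlapping = [
            v.pos for v in variants
            if v.chrom == p.chrom and p.start <= v.pos <= p.end
        ]
        if overlapping:
            hits[p.name] = sorted(overlapping)
    return hits


# Hypothetical primer pair and variant calls (toy coordinates, not real data).
primers = [
    Primer("geneX_F", "chr14", 100, 120),
    Primer("geneX_R", "chr14", 200, 220),
]
variants = [Snv("chr14", 110), Snv("chr14", 150), Snv("chr2", 205)]

flagged = primers_with_variants(primers, variants)
```

Here only the forward primer is flagged, because the other two variants fall either between the primers or on a different chromosome. A flagged primer would be a candidate for redesign before comparing WT and KO samples by qPCR.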
We present a computational pipeline implemented in the Galaxy platform and in BASH/R script to determine genetic introgression of GEM models using NGS data. The pipeline allows identification of congenic strains, determination of the ploidy of variants, and estimation of the backcrossing state of the models in use, as well as visual assessment of congenic regions in the mouse genome. In addition, it allows identification of putative modifier genes. We suggest that our strategy, together with target validation experiments, refines the use of KO/KI lines and opens opportunities to uncover new genetic interactions that could impact phenotypic outcomes.
We thank Anthony Doran from the Keane team in the EMBL-EBI for helpful discussions and scientific advice on mouse genetics. We thank Dr. Paul Anderson from Harvard Medical School, Boston, MA for gifting the mature ANG-mCherry plasmid. We thank Dr. Kateryna Makova, Dr. Anton Nekrutenko, Rahul Vegesna and the whole Galaxy Team Project (Penn State University) for computational support and genomic analysis advice. We thank Dr. Teresa Caprile and Dr. Matias Hepp from Universidad de Concepción for helpful discussions. Finally, we thank Marjet Heitzer for editing the paper.
This work was supported by a Regular Fondecyt Grant (#1151031) to RP, Regular Fondecyt Grant (#1160731) to AC, Fondecyt PhD Scholarship 2013–2017 to CF, and P30 CA016672 DHHS/NCI Cancer Center Support Grant to MD Anderson Cancer Center to FB.
Availability of data and materials
The Genotype-variants pipeline is available on GitHub at https://github.com/cfarkas/Genotype-variants. Sall2 RNA-Seq data are deposited in GEO DataSets (accession number GSE123168).
RP conceived the project and helped to design the experiments, analyze the data and write the manuscript. AC interpreted the data and helped to organize, analyze and write the manuscript. FB helped to interpret the data and write the manuscript. BR helped in the computational data analysis and drafted the manuscript. FF helped to perform several experiments and to interpret the data. CF conducted the sequencing and validation experiments, analyzed the data, created the software used in this work and wrote the manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Mouse genotyping and housing were performed according to the Animal Ethics Committee of Chile’s National Commission for Scientific and Technological Research (CONICYT, Protocol FONDECYT project 1151031).
Consent for publication
Competing interests
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.