1 Introduction

There are many perspectives on what defines an “animal model,” but at the most fundamental level, the term refers to an animal with a disease or condition that has face or construct validity for the condition observed in humans. Spontaneous animal models represent the truest form of the definition and are best exemplified by cross-species diseases such as cancer and diabetes, where a particular species naturally develops the condition as observed in humans. However, even in these models with high face validity and seemingly high construct validity, care must be taken when extrapolating from the animal phenotype to the human disease, as the underlying mechanisms driving the disease may not be identical across species.

Animal models serve two primary purposes. The first is to elucidate biological mechanisms and processes. A key assumption in this approach is that the physiology of the animal species being examined is comparable enough to reasonably allow extrapolation to human biology and disease states. The second, an extension of the first, is to estimate the efficacy and safety of new therapeutic treatments for alleviating human disorders. In both of these uses, the fidelity of the animal model is critically dependent upon the homology of the physiology between the animal model and humans. The best model for human is human, and greater divergence from humans across the phylogenetic scale (e.g., nonhuman primates > rodents > zebrafish > Drosophila) introduces increasingly larger gaps in genetic and physiological homology. For complex human-specific disorders such as schizophrenia or Alzheimer’s disease, our confidence in findings from animal models must be guarded, as there is no spontaneous animal model of these human conditions. For instance, no animal other than humans spontaneously exhibits the Aβ plaques and neurofibrillary tangles that define the pathology of Alzheimer’s disease. Moreover, the complex spectrum of cognitive dysfunction and neuropsychiatric comorbidities that these diseases produce cannot be fully recapitulated or assessed (e.g., language impairment) in lower animal species. In such cases, animal models are relegated to simulating specific symptoms of the disorder (e.g., increasing striatal dopamine in rodents to model the striatal hyperdopaminergia observed in schizophrenia patients and thought to underlie the emergence of positive symptoms) or to modeling specific pathological processes observed in the human disease (e.g., generation of amyloid precursor protein overexpressing mice to model the Aβ deposition seen in Alzheimer’s disease patients).
In this latter example, it is important to note that Aβ deposition in transgenic mice, and the mechanisms that reduce its accumulation, have translated well to human AD patients; however, because this is an incomplete representation of the disease, agents that reduce Aβ deposition in both animals and human AD patients have yet to prove successful in delaying disease progression.

Reproducibility and generalizability are two aspects of preclinical research that have come under much scrutiny over the last several years. Examples of failures to reproduce research findings, even in high-impact journals, are numerous and well described in the literature (Jarvis and Williams 2016). Perhaps the most obvious factor impacting across-lab reproducibility is the failure to report important methodological variables of the study. As we discuss later in this chapter, it is surprising how often key experimental variables such as the specific strain or sex of the animals used are omitted from the methods section. In a direct attempt to improve scientific reporting practices, initiatives such as the ARRIVE guidelines (Kilkenny et al. 2010) have been instituted across the majority of scientific journals. Such factors can also affect intra-lab reproducibility, for instance, when the student who ran the initial study has left the lab, or when the lab itself has relocated to another institution and the principal investigator reports challenges in reestablishing the model.

Challenges in generalizability of research findings are best exemplified by noted failures in the realm of drug development in which a novel compound exhibits robust efficacy in putative animal models of a human condition but fails to demonstrate therapeutic benefit in subsequent clinical trials. Indeed, medication development for Alzheimer’s disease has a remarkable failure rate of over 99% with a large percentage of drug development terminations attributed to lack of efficacy in Phase II or Phase III clinical trials (Cummings et al. 2014).

It is interesting to speculate that improvements in reproducibility of preclinical research may not necessarily translate into improved generalizability to the human condition (see Würbel 2002). For instance, close adherence to the same age, substrain of mouse, husbandry conditions, and experimental details should improve the likelihood of reproducing another lab’s findings. However, it also follows that if a reported research finding is highly dependent upon a specific experimental configuration and is lost with subtle procedural variations, then how likely is the finding to translate to the human condition? Humans are highly heterogeneous in their genetic composition and environmental determinants, often resulting in subpopulations of patients that are responsive to certain treatments and others that are described as treatment resistant. In preclinical research, the best balance between improving reproducibility and improving generalizability is to include both sexes and to incorporate another strain or species. This approach will most certainly reduce the number of positive findings across these additional variables, but those findings that are consistent and robust will likely be more reproducible across labs and more likely to translate into clinical benefit. In the sections that follow, we highlight the importance of genetic background and sex in conducting preclinical research.
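
The idea of keeping only those findings that hold across sex and strain can be illustrated with a small sketch. The data, the strain labels, and the function names below are hypothetical, and the check is purely descriptive (a comparison of group means), not a substitute for a properly powered factorial analysis.

```python
from statistics import mean

# Hypothetical outcome values, keyed by (sex, strain) and then by study arm.
hypothetical_data = {
    ("F", "strain_A"): {"control": [10.0, 12.0], "treated": [15.0, 17.0]},
    ("M", "strain_A"): {"control": [11.0, 13.0], "treated": [16.0, 18.0]},
    ("F", "strain_B"): {"control": [9.0, 11.0], "treated": [12.0, 14.0]},
    ("M", "strain_B"): {"control": [10.0, 12.0], "treated": [10.0, 11.0]},
}


def stratum_effects(data):
    """Return the treated-minus-control difference in means for each stratum."""
    return {
        stratum: mean(arms["treated"]) - mean(arms["control"])
        for stratum, arms in data.items()
    }


def same_direction(effects):
    """True only if every stratum shows a nonzero effect with the same sign."""
    values = list(effects.values())
    if not values:
        return False
    return all(v > 0 for v in values) or all(v < 0 for v in values)


effects = stratum_effects(hypothetical_data)
for stratum, effect in sorted(effects.items()):
    print(stratum, effect)
print("consistent across sex and strain:", same_direction(effects))
```

In this made-up data set the effect is present in three of the four strata and absent in males of the second strain, which is the kind of result that would be hidden if only one sex of one strain had been tested.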

2 Genetic Background: The Importance of Strain and Substrain

Dating back to the early 1900s, researchers have recognized the value of the genetic uniformity and stability of inbred strains, which have provided such benefits as reducing study variability and required sample sizes and improving the reliability of data. To date, more than 20 Nobel Prizes have resulted from work in inbred strains, and this knowledge has provided significant medical and health benefits (Festing 2014). Certainly, it continues to be an acceptable strategy to conduct research on a single inbred strain of mice, provided that the results are reported in context and do not suggest that the data are generalizable to other strains and species (e.g., humans). A single inbred strain represents a single genome and is not representative of genetically diverse patient populations. Moreover, even different substrains of a common strain of mice (e.g., C57BL/6J, C57BL/6N, C57BL/6NTac) exhibit unique genetic dispositions resulting in surprisingly divergent phenotypes (reviewed in Casellas 2011). Therefore, a major constraint in translational research has been the common practice of limiting preclinical pharmacology studies to a single strain of mice.

Within the context of rodent studies, one example where lack of generalizability across strain, substrain, and sex has been well documented is the rodent experimental autoimmune encephalomyelitis (EAE) model of multiple sclerosis (MS). MS is an autoimmune disease characterized by demyelination in the CNS, which results in a spectrum of clinical presentations accompanied by progressive neuromuscular disorders and paralysis (reviewed in Summers deLuca et al. 2010). In mice, immunization with myelin/oligodendrocyte glycoprotein peptide induces EAE; however, the variability of disease presentation across mouse models has been a major hindrance to drug development. In line with the genetic contributions to MS in human patients, mouse strains and substrains are genetically and phenotypically divergent, which introduces heterogeneous loading of risk alleles and variations in phenotypes that contribute to the variability in disease onset and severity (Guapp et al. 2003). The MS field is not unique in facing experimental variability resulting from the choice of genetic background in its rodent models; such variability has been documented in most fields of study (Grarup et al. 2014; Jackson et al. 2015; Loscher et al. 2017; Nilson et al. 2000). Although it has been known for decades that mouse substrains are genetically and phenotypically distinct from each other, many in the research community are still not aware of this important caveat and its implications for experimental findings.

Case in point, the C57BL/6 mouse strain is one of the most common and widely used inbred strains, with many substrains derived from the original lineage and now maintained as separate substrain colonies. The C57BL/6J line was established at the Jackson Laboratory by C.C. Little in the 1920s, and in the 1950s a cohort of mice was shipped to the National Institutes of Health, where a colony was established and aptly named C57BL/6N (the suffix “N” refers to the NIH colony, while the “J” suffix refers to the Jackson Laboratory colony) (reviewed in Kiselycznyk and Holmes 2011). At some point spontaneous mutations (i.e., genetic drift) occurred in each of these colonies, resulting in these two substrains becoming genetically distinct from each other, with recent reports citing >10,000 putative and 279 confirmed variant differences as well as several phenotypic differences between C57BL/6 substrains (Keane et al. 2011; Simon et al. 2013). These genetic and phenotypic differences between substrains are not unique to C57BL/6, as 129 substrains, among others, have similar genetic diversity issues that must be considered when reporting and extrapolating research (Kiselycznyk and Holmes 2011). It is important to note that substrain nomenclature is not the only information that identifies genetic and phenotypic diversity. Individual or private colonies maintained for >20 generations, either at a commercial vendor or an academic institution, are considered substrains and hence must adhere to the guidelines for nomenclature of mouse and rat strains as established by the International Committee on Standardized Genetic Nomenclature for Mice (reviewed in Sundberg and Schofield 2010).
The laboratory code that follows the substrain notation identifies the source of the strain or substrain, including commercial vendors (e.g., C57BL/6NHsd and C57BL/6NTac for Harlan and Taconic, respectively), and is a critical piece of information alerting researchers that a substrain, as in the case of C57BL/6N, may carry genetic variation beyond that of the original NIH colony. The implications for research findings of failing to understand the role of substrain differences, and of failing to prevent inadvertent backcrossing of substrains, have been highlighted recently (Mahajan et al. 2016; Bourdi et al. 2011; McCracken et al. 2017). In one example, Bourdi and colleagues reported that JNK2−/− knockout mice were more susceptible than their WT controls to acetaminophen-induced liver injury, in contrast to findings from other laboratories demonstrating that JNK2−/− mice and inhibitors of JNK were protective against acetaminophen-induced liver injury (Bourdi et al. 2011). Through careful retrospective analysis, the researchers were able to determine that backcrossing onto two different background substrains conferred either toxic or protective effects (Bourdi et al. 2011).
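
The way a designation such as C57BL/6NTac stacks a substrain symbol and a laboratory code can be made concrete with a short sketch. This is a deliberately simplified parser, not an implementation of the full nomenclature guidelines: it assumes that each code after the final slash starts with an uppercase letter followed by lowercase letters, and its lookup table contains only the designators mentioned in this chapter. The function names are hypothetical.

```python
import re

# Designators mentioned in this chapter only; an illustrative table,
# not a complete registry of laboratory codes.
KNOWN_CODES = {
    "J": "Jackson Laboratory",
    "N": "National Institutes of Health",
    "Hsd": "Harlan",
    "Tac": "Taconic",
    "Crl": "Charles River",
    "Mol": "Møllegaard Breeding Centre",
}


def split_designation(name):
    """Split e.g. 'C57BL/6NTac' into ('C57BL/6', ['N', 'Tac']).

    Simplifying assumption: everything after the last '/' is an optional
    run of digits followed by one or more codes of the form [A-Z][a-z]*.
    """
    head, sep, tail = name.rpartition("/")
    match = re.fullmatch(r"(\d*)((?:[A-Z][a-z]*)+)", tail)
    if not sep or match is None:
        raise ValueError(f"unrecognized designation: {name!r}")
    digits, suffix = match.groups()
    base = head + ("/" + digits if digits else "")
    return base, re.findall(r"[A-Z][a-z]*", suffix)


def describe(name):
    """Return the base strain and each code paired with its source, if known."""
    base, codes = split_designation(name)
    return base, [(code, KNOWN_CODES.get(code, "unknown")) for code in codes]


for designation in ("C57BL/6J", "C57BL/6NHsd", "C57BL/6NTac", "SHR/NCrl"):
    print(designation, "->", describe(designation))
```

Two designations that share a base but differ in any trailing code should be treated as distinct substrains when planning, reporting, or comparing studies.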

This issue of genetic drift is not unique to mice. For instance, in the study of hypertension and attention-deficit/hyperactivity disorder (ADHD), among the most studied rat models are the spontaneously hypertensive (SHR) and Wistar Kyoto (WKY) rat lines. In terms of ADHD, SHR rats display symptoms of inattention, hyperactivity, and impulsiveness in various behavioral paradigms (Sagvolden et al. 2009). However, as with the C57BL/6 substrains, numerous SHR and WKY substrains have been generated over the years. The SHR rat line was derived originally from a WKY male with marked hypertension and a female with moderate blood pressure elevations. Brother-sister matings continued with selection pressure for spontaneous hypertension. The SHR line arrived at the National Institutes of Health (NIH) in 1966 from the Kyoto School of Medicine. From the NIH colony (SHR/N), SHR lines were derived by Charles River, Germany (SHR/NCrl), and the Møllegaard Breeding Centre, Denmark (SHR/NMol), as well as other institutions over the years. The SHR rat strains exhibit an ADHD-like phenotype, whereas the WKY line serves as a normative control. A problem exists in that, while the WKY strain was established from the same parental Wistar stock as the SHR line, there is considerable genetic variability among WKY strains because the WKY breeding stock was not fully inbred prior to being distributed to different institutions for breeding, which resulted in accelerated genetic drift. A further issue with using the WKY strain as a genetic and behavioral control for the SHR strain is that inbreeding of the WKY strain was initiated over 10 years later than that of the SHR strain, which calls into question the validity of WKY rats as a proper control for findings in SHR rats (Louis and Howes 1990).
As one might expect from such genetic diversity in SHR and WKY lines, findings from both cardiovascular blood pressure and ADHD phenotypes have at times been contradictory, and much commentary has been made about the appropriate selection of controls when studying phenotypes associated with these strains of rats (St. Lezin et al. 1992).

3 Importance of Including Sex as a Variable

The X and Y chromosomes are not the only difference that separates a female from a male. In preclinical studies there has been a pervasive, flawed assumption that male and female rodents have similar phenotypes. Publications that include such general statements as “data were combined for sex since no sex effect was observed” without the inclusion of the analysis, or that simply report “data not shown” for the evaluation of effects of sex, are unacceptable. From basic physiological phenotypes (e.g., body weight, lean and fat mass) to any number of neuroendocrine, immune, and behavioral phenotypes beyond reproductive behaviors, males and females differ (reviewed in Hughes 2007; Karp et al. 2017). Furthermore, many human diseases affect males and females differently, as sex can influence disease susceptibility, symptom presentation and progression, and treatment outcomes. Well-documented sex differences exist for cardiovascular disease, autoimmune diseases, chronic pain, and neuropsychiatric disorders, with females generally having greater incidences than males (reviewed in Regitz-Zagrosek 2012; IOM 2011). Therefore, studies that ignore sex-specific effects in study design, phenotypes, pharmacokinetics, or pharmacodynamic measures, or that interpret data without sex as a covariate, fail to report the data accurately. To this end, in 2014 the NIH issued a directive to ensure that both male and female subjects are represented in preclinical studies, an extension of the 1993 initiative to include women as participants in clinical trials receiving NIH funding (Clayton and Collins 2014).

With respect to animal models used in pharmacology experiments, sex differences in disease presentation and progression have also been reported. For example, while women have a higher prevalence of chronic pain and related disorders, preclinical studies have largely focused on male subjects. Problematically, only after hundreds of studies had employed male mice to study nociceptive responses mediated by toll-like receptor 4 (TLR4), and subsequent pharmacology studies had targeted TLR4 for analgesia, was it discovered that the involvement of TLR4 in pain behaviors in male mice was dependent on testosterone (Sorge et al. 2011). Therefore, these results, and any potential therapeutics for the treatment of pain with a mechanism of action targeting TLR4, could not be generalized to both sexes (Sorge et al. 2011). In another example, the NOD mouse model of Type 1 diabetes has a higher incidence and an earlier onset of diabetes symptoms in females than in males (Leiter 1997). Consequently, female NOD mice are much more widely used than males, although the incidence in the clinic is nearly 1:1 for males/females, which may present a conundrum when potential novel treatments are studied in only a single sex, as in the TLR4 experiments highlighted above. Furthermore, although neuropsychiatric disorders such as major depressive disorder have a higher incidence in females than in males, preclinical studies have largely used only males for testing, even though sex differences in rodent emotional behavior exist (Dalla et al. 2010; Kreiner et al. 2013; Kokras et al. 2015; Laman-Maharg et al. 2018). One of the more common arguments made for not including female subjects in preclinical studies is that they have larger variability, likely contributed to by the estrus cycle.
However, a meta-analysis of 293 publications revealed that variability in endpoints measured in female mice was not greater than that in males, even when variation across the estrus cycle was included as a source of variability in the females (Becker et al. 2016; Prendergast et al. 2014; Mogil and Chanda 2005). There are, however, baseline differences between males and females across behavioral phenotypes that further highlight the need to study both sexes and to analyze data within sex when drug treatment is evaluated in both sexes.
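
The distinction between a baseline difference and a difference in variability can be shown with a minimal sketch. The coefficient of variation (standard deviation divided by the mean) is one common way to compare variability between groups with different means; it is used here as an assumption for illustration and may not match the exact procedure of the cited meta-analysis. All values are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unitless)."""
    return stdev(values) / mean(values)


def cv_by_sex(measurements):
    """Map each sex label to the coefficient of variation of its values."""
    return {sex: coefficient_of_variation(vals) for sex, vals in measurements.items()}


# Hypothetical endpoint values: males have twice the baseline of females.
hypothetical = {
    "female": [8.0, 10.0, 12.0],
    "male": [16.0, 20.0, 24.0],
}

for sex, cv in cv_by_sex(hypothetical).items():
    print(f"{sex}: mean={mean(hypothetical[sex]):.1f}, CV={cv:.2f}")
```

In this made-up example the two sexes differ twofold in baseline but have the same relative variability, which is why a baseline sex difference argues for analyzing data within sex rather than for excluding one sex.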

4 Pharmacokinetic and Pharmacodynamic Differences Attributable to Sex

In addition to the observation of sex differences across disease and behavioral phenotypes, sex differences are also commonly observed in pharmacokinetic (PK) and drug efficacy studies; yet for many years test subjects in both clinical and preclinical studies have most commonly been male. A survey of the pain and neuroscience field in the early 1990s revealed that only 12% of published papers had used both male and female subjects and 45% failed to reveal the sex of the subjects included in the studies (Berkley 1992). A later study building on this revealed that between 1996 and 2005 although researchers now reliably reported the sex of their preclinical subjects (97%), most studies (79%) were still performed on male animals (Mogil and Chanda 2005). Although the translatability of preclinical sex differences to human may not always be clear-cut, assessments of these parameters in both sexes can provide additional information during phenotyping and genetic studies, as well as the drug discovery and development process.

In the drug discovery field, there are multiple examples in the clinical literature of sex differences in both measured exposure and pharmacological effect in response to novel drugs. A meta-analysis of 300 new drug applications (NDAs) reviewed by the FDA between 1995 and 2000 showed that 163 of these included a PK analysis by sex. Of these 163, 11 showed a greater than 40% difference in PK parameters between males and females (Anderson 2005). Sex differences in exposure levels have important implications. For example, zolpidem (Ambien®) results in exposure levels 40–50% higher in females when administered sublingually (Greenblatt et al. 2014). These sex differences in exposure levels for zolpidem were also observed in rats, with maximal concentration (Cmax) and area under the curve (AUC) both significantly higher in females relative to males (Peer et al. 2016). While Ambien was approved in 1992, in 2013 the FDA recommended decreasing the dose by half for females due to reports of greater adverse events, including daytime drowsiness, observed in female patients (United States Food and Drug Agency 2018).
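
For readers less familiar with the exposure parameters mentioned above, the following sketch computes Cmax and AUC from a concentration-time profile and expresses the female-to-male difference as a percentage. The linear trapezoidal rule is a standard noncompartmental approach, but the concentration values, time points, and function names are hypothetical, and the 40% flag simply reuses the threshold quoted from Anderson (2005).

```python
def auc_trapezoid(times, concs):
    """Area under the concentration-time curve by the linear trapezoidal rule."""
    if len(times) != len(concs) or len(times) < 2:
        raise ValueError("need at least two matched time/concentration points")
    points = list(zip(times, concs))
    return sum(
        (t1 - t0) * (c0 + c1) / 2.0
        for (t0, c0), (t1, c1) in zip(points, points[1:])
    )


def percent_difference(female, male):
    """Female value relative to male value, as a percentage of the male value."""
    return 100.0 * (female - male) / male


# Hypothetical mean plasma concentrations (arbitrary units) at matched times (h).
times = [0.0, 1.0, 2.0, 4.0]
male_conc = [0.0, 10.0, 8.0, 4.0]
female_conc = [0.0, 15.0, 12.0, 6.0]

auc_diff = percent_difference(auc_trapezoid(times, female_conc),
                              auc_trapezoid(times, male_conc))
cmax_diff = percent_difference(max(female_conc), max(male_conc))

print(f"AUC difference (F vs M): {auc_diff:.0f}%")
print(f"Cmax difference (F vs M): {cmax_diff:.0f}%")
print("exceeds 40% threshold:", abs(auc_diff) > 40.0)
```

Running the same calculation separately for each sex, rather than on pooled animals, is what allows a difference of this size to be detected before pharmacodynamic data are interpreted.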

While any aspect of a drug’s pharmacokinetic properties could potentially lead to sex differences in measured drug exposure, sexually divergent differences in metabolism appear to be the most concerning (Waxman and Holloway 2009). In multiple species, enzymes responsible for drug metabolism show sexually dimorphic expression patterns that affect the rate of metabolism of different drugs. In humans, females show higher cytochrome P450 (CYP) 3A4 levels in the liver as measured by both mRNA and protein (Wolbold et al. 2003). Studies have also observed higher activity of this enzyme in females (Hunt et al. 1992). In rodents, both the mouse (Clodfelter et al. 2006, 2007; Yang et al. 2006) and rat (Wautheir and Waxman 2008) liver show a large degree of sexually dimorphic gene expression. For instance, rats exhibit a male-specific CYP2C11 expression pattern, whereas CYP2C12 shows a female-specific one (Shapiro et al. 1995). While rodent sex differences may not necessarily translate into similar patterns in humans, the complexity of metabolic pathways underscores the importance of understanding drug exposure in each sex, at the relevant time point, and in the relevant tissue when making pharmacodynamic measurements.

With respect to pharmacodynamics, sex differences exist in functional outcome measures, both with respect to baseline activity in the absence of drug, and in response to treatments. As critically highlighted in the field of preclinical pain research, a meta-analysis reported sex differences in sensitivity to painful stimuli in acute thermal pain and in chemically induced inflammatory pain (Mogil and Chanda 2005). For example, in a study by Kest and colleagues, baseline nociceptive responses and sensitivity to thermal stimuli were examined across males and females of 11 inbred mouse strains (Kest et al. 1999). Results of this study not only revealed divergent phenotypic responses across genotypes for pain sensitivity but also sex by genotype interactions. Moreover, when morphine was administered directly into the CNS, the analgesic effects varied across both strain and sex, further highlighting the importance of including both sexes in pharmacodynamic studies, as well as considering subject populations beyond a single inbred strain. These sex differences are not specific to morphine as they have also been demonstrated in rats and mice for sensitivity to the effects of other mu opioid receptor agonists (Dahan et al. 2008). Importantly, both sexually dimorphic circuitry and differences in receptor expression levels mediating pain perception and pharmacological responses, likely driven by genetics, are suggested to contribute to these differences (Mogil and Bailey 2010).

In clinical pain research, sex differences in pharmacodynamic responses have been highlighted by reports from clinical trials with MorphiDex, a potential medication for the treatment of pain that combined an NMDA antagonist with morphine (Galer et al. 2005). While many preclinical studies demonstrated robust and reliable efficacy, these reports were almost exclusively conducted in male subjects. During clinical trials where both men and women were included, the drug failed to produce any clinical benefit over standard pain medications (Galer et al. 2005). Intriguingly, it was later determined that while the drug was efficacious in men, it was ineffective in women with retrospective experiments in female mice corroborating these data (discussed in IOM 2011; Grisel et al. 2005). Overall, while we may not fully understand the biological underpinnings of sex differences in responses to pharmacology, profiling both sexes in preclinical pharmacology studies should provide insight into the differences and potentially enable better clinical trial design.

5 Improving Reproducibility Through Heterogeneity

While the major attention on the “reproducibility crisis” in biomedical research has generally been focused on the lack of translation related to issues with experimental design and publication bias, recent literature has raised the possibility that researchers might be practicing “overstandardization” in the name of good research practice. For example, controlling as much as possible within an experiment (i.e., sex, strain, vendor, housing conditions, etc.) and across experiments within a given laboratory in order to enable replication (i.e., same day of week, same technician, same procedure room) has not previously been considered a contributor to the lack of reproducibility. However, as recently highlighted by several publications, this “standardization fallacy” suggests that the more control and homogeneity applied to an experiment within a laboratory, the less able others may be to reproduce the findings, given the inherent differences in environment that cannot be standardized across laboratories (Würbel 2000; Voelkl et al. 2018; Kafkafi et al. 2018). In this respect, there is indeed value in applying various levels of systematic variation to address a research question, through both intra- and interlaboratory experiments. One approach to increasing heterogeneity, beyond including both sexes within an experiment and extending experimental findings to multiple laboratories (interlaboratory reproducibility), is to introduce genetic diversity. While it may be cost prohibitive to engineer genetic mutations across multiple mouse strains in a given study, one could alternatively employ strategically developed recombinant mouse populations such as the Collaborative Cross (CC) (Churchill et al. 2004). The CC is a panel of recombinant inbred mouse strains created by crossbreeding eight different common inbred strains, resulting in increased genetic and phenotypic diversity.
CC lines include contributions from the common inbred C57BL/6J strain as well as two inbred strains with high susceptibility for Type 1 and Type 2 diabetes, two inbred strains with high susceptibility for developing cancers (129S1/SvImJ and A/J), and three wild-derived strains (Srivastava et al. 2017). Nachshon et al. (2016) highlighted the value of using a CC population for studying the impact of genetic variation on drug metabolism, while Mosedale et al. (2017) demonstrated the utility of CC lines for studying how genetic variation influences the potential toxicological effects of drugs in kidney disease.
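
One practical way to introduce systematic variation without confounding it with treatment is to randomize animals to treatment arms separately within each sex-by-strain block. The sketch below is a minimal illustration under stated assumptions: the animal identifiers, line labels, arm names, and the `allocate` function are hypothetical, and a fixed seed is used only so that the allocation can be regenerated and audited.

```python
import random
from collections import Counter


def allocate(animals, arms=("vehicle", "drug"), seed=0):
    """Randomize animals to arms separately within each (sex, strain) block."""
    rng = random.Random(seed)
    blocks = {}
    for animal in animals:
        blocks.setdefault((animal["sex"], animal["strain"]), []).append(animal["id"])
    assignment = {}
    for key in sorted(blocks):
        ids = sorted(blocks[key])
        rng.shuffle(ids)  # random order within the block
        for position, animal_id in enumerate(ids):
            # Alternating through the shuffled order keeps arms balanced per block.
            assignment[animal_id] = arms[position % len(arms)]
    return assignment


# Hypothetical cohort: two sexes x two genetically distinct lines x four animals.
animals = [
    {"id": f"{sex}-{strain}-{n}", "sex": sex, "strain": strain}
    for sex in ("F", "M")
    for strain in ("line_1", "line_2")
    for n in range(4)
]

assignment = allocate(animals)
counts = Counter(
    (a["sex"], a["strain"], assignment[a["id"]]) for a in animals
)
for cell, n in sorted(counts.items()):
    print(cell, n)
```

Because every block contributes equally to each arm, a treatment effect can later be examined within sex and within line as well as across the whole cohort.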

6 Good Research Practices in Pharmacology Include Considerations for Sex, Strain, and Age: Advantages and Limitations

Improving translation from mouse to man requires selection of the appropriate animal model, age, and disease-relevant state. Behavioral pharmacology studies with functional outcome measures that are intended to enable translational efficacy studies should include pharmacokinetics and PK/PD modeling in the animal model at the pathologically relevant age. It should not be expected that PK data in young, healthy subjects will generalize to aged, diseased subjects or across both sexes. Similarly, pharmacodynamic measures, including behavioral, neuroendocrine, immune, metabolic, cardiovascular, and physiological endpoints, may not generalize across age, sex, or disease state. Figure 1 depicts a sample flow diagram of experimental design parameters that require deliberation, where species, strain, substrain, age, sex, and disease state are crucial considerations.

Fig. 1 Example flow diagram for preclinical study design
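
The design parameters named for Fig. 1 can also be captured as a simple record that is checked for omissions before a study is run or reported. The sketch below is only an illustration: the `StudyDesign` class, its field names, and the `unreported` helper are hypothetical and are not taken from the figure or from any reporting guideline.

```python
from dataclasses import dataclass, fields
from typing import Optional, Tuple


@dataclass
class StudyDesign:
    """Hypothetical record of the design parameters named for Fig. 1."""
    species: Optional[str] = None
    strain: Optional[str] = None
    substrain: Optional[str] = None
    age_weeks: Optional[float] = None
    sexes: Tuple[str, ...] = ()
    disease_state: Optional[str] = None


def unreported(design):
    """Return the names of parameters that were left empty."""
    return [
        f.name
        for f in fields(design)
        if getattr(design, f.name) in (None, "", ())
    ]


# A design description that omits substrain, age, and disease state.
draft = StudyDesign(species="mouse", strain="C57BL/6", sexes=("F", "M"))
print("still to be specified:", unreported(draft))
```

A check of this kind does not judge whether the choices are appropriate; it only makes omissions visible at the point where they are easiest to correct.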

7 Conclusions and Recommendations

Drug discovery in both preclinical studies and in the clinic has only begun to harness the power of genetic diversity. Large-scale clinical trials have focused on enrollment metrics, recruiting patients on the basis of “all comers” symptom presentation. It is tempting to believe that at least some of the high clinical attrition of new therapeutic agents can be attributed to a failure to consider patient heterogeneity. It is a common adage that a rule of thirds exists in patient treatment response to a medication: a third of patients show robust efficacy, a third exhibit partial benefit from the agent, and a third are termed “treatment resistant.” One reason that much of the pharmaceutical industry has moved away from developing antidepressant medications is that established antidepressant medications, such as SSRIs, when used as a positive control, do not separate from placebo in 30–50% of the trials, resulting in a “busted” clinical trial (reviewed in Mora et al. 2011). Importantly, the preclinical studies that have enabled these trials have largely used male subjects, frequently otherwise healthy mice of a single inbred strain such as C57BL/6J (reviewed in Caldarone et al. 2015; reviewed in Belzung 2014). It is possible that preclinical studies focused on treatment response in both sexes and in genetically divergent populations with face and construct validity would be in a better position to translate to a heterogeneous, treatment-resistant clinical population.

Within the last decade, however, as the genetic contributions to diseases have become better known, precision medicine approaches have emerged that recruit patients with specific genetic factors (e.g., ApoE4 carriers at risk for Alzheimer’s disease) to test specific mechanisms of action, and these will continue to evolve in preference to recruitment of “all comers” patients with a diagnosis of Alzheimer’s disease (Watson et al. 2014). In this respect, it is critical that animal studies testing a similar hypothesis incorporate analogous genetic factors (e.g., a mouse model homozygous for the Apoe4 allele) at an analogous mouse-to-human age.

As stated above, the best model for human is human. In drug discovery, before the FDA permits clinical trials in humans, the best approach to translation is the design and rigorous execution of preclinical pharmacology studies that best mirror the intended patient population. In this respect, for pharmacokinetic and pharmacodynamic studies, careful consideration should be taken to ensure that the animal model used has face and construct validity, that both sexes are included at an age relevant to the disease trajectory, and that studies consider gene-by-environment interactions as ways to improve reliability, reproducibility, and translation from the bench to the clinic.