Introduction

Forward genetic approaches for relating genomic variability to phenotypic variability can be grouped as either linkage or association mapping. Because it is easy to create and maximize linkage disequilibrium (LD) in plant species, the former set of methods was initially referred to as Quantitative Trait Locus (QTL) mapping, although it is now clear that association mapping can also be applied to quantitative traits. Linkage mapping is powerful but of low resolution: it identifies genomic regions of about 10 cM, which in most plant species often correspond to tens of millions of bases. With the advent of high-throughput technologies for resequencing and genotyping, association mapping has emerged for species where it is not easy to create LD. This approach exploits historical linkage and recombination accumulated over a large number of generations (Andersson and Georges 2004). Thus, it can provide high-resolution information that can be used to identify the causative nucleotides underlying phenotypic variability. Depending upon the amount of LD across the genome in the breeding population, association mapping can require genotyping with very high densities of molecular markers (Yu et al. 2008) and extremely large samples to achieve reasonable power (Hirschhorn and Daly 2005; Kingsmore et al. 2008).

A third approach is to combine the power of linkage mapping with the resolution of association mapping. This third approach can be thought of as an extension of the multiple family QTL approach (Jansen et al. 2003; Blanc et al. 2006), but is distinctive in that parental inbred lines are resequenced or array genotyped and this information is coupled with low-cost genotyping of their segregating progenies. The approach is conceptually equivalent to the human quantitative transmission disequilibrium test (QTDT) (Abecasis et al. 2000) combined with imputation of genotypes of relatives (Burdick et al. 2006). For the special case where the mapping population consists of multiple families of segregating progeny, usually Recombinant Inbred Lines (RILs), derived from inbred lines crossed to a single reference inbred line, the method has been called Nested Association Mapping (NAM) (Yu et al. 2008; Nordborg and Weigel 2008).

To map functional markers in NAM populations, parental genotypes at a large number of SNP loci need to be projected onto their segregating progeny. For example, approximately 0.5 million SNPs have been genotyped in the 26 parental lines of the publicly available maize NAM population, whereas only 1,106 SNP loci have been genotyped in both the parents and their 5,000 progeny. The challenge is to estimate both the genotype and the genetic location of the parental genotypes in the segregating progeny. Three approaches might be considered (Yi and Shriner 2007): (1) estimate all missing genotypes by their expected values conditional on observed flanking markers (Haley and Knott 1992), (2) treat genotypes as unknowns to be predicted using an MCMC update procedure, and (3) sample genotypes multiple times from a conditional probability distribution for each unknown locus (Sen and Churchill 2001). Given the large number of SNP loci and the large number of families and progeny in NAM populations, the latter two approaches could be computationally challenging, depending upon the quality of the physical map. The first approach, however, may be accurate while remaining computationally feasible.

Herein, we (1) develop a method for imputing genotypes using an expectation approach and (2) illustrate its use by applying it to the maize NAM population. In human family-based association mapping (Burdick et al. 2006), parental SNPs are projected onto progeny in intervals with no recombinants. Here, the method is extended to intervals with known recombination events.

Data and methods

Data

The following data sets were obtained from public information resources: (1) Genotypes of 5,000 RILs representing 25 segregating families of the maize NAM mapping population (McMullen et al. 2009). These data are represented as NAM_SNP_genos_raw_20080703 at http://www.panzea.org/. (2) A composite linkage map created by McMullen et al. (2009) using the maize NAM genotypic data (http://www.panzea.org/). (3) The maize Accessioned Gold Path (AGP v1) (Wei et al. 2009), consisting of 10 chromosome pseudo-assemblies guided by the physical map, obtained from the Arizona Genomics Institute (http://www2.genome.arizona.edu/genomes/maize). (4) The maize HapMap for the 26 founder lines of the maize NAM population. These data comprise nearly half a million SNP genotypes and can be obtained from http://www.maizegenetics.net/maize-hap-map. Note that the maize HapMap data continue to be updated with new releases, so the version used herein will likely be outdated before publication of this manuscript.

Estimation of linkage map positions

To detect associations between genotypes and complex quantitative traits, it is necessary to know the linkage map positions of the polymorphic loci and to trace their inheritance using flanking markers. The linkage map positions are unknown for the majority of the 0.5 million SNPs that are genotyped in the parental lines of the maize NAM families. Their linkage map positions were assigned through linear interpolation between the maize AGP v1 (Wei et al. 2009) and the maize NAM linkage map (McMullen et al. 2009), as described by Kong et al. (2002). SNP loci occurring on the same BAC were assigned the same position, because the number of recombination events within BACs for 200 RILs per family is expected to be negligible (Fig. 1).

Fig. 1

Mapped positions of physical and linkage maps obtained through linear interpolation. The dark black dots are plotted positions of BAC accessions relative to the maize NAM linkage map. The light-colored curves are in fact individual light-colored dots representing high-density segregating SNPs. Locations of SNP loci were obtained through linear interpolation. AC185213 and AC197480 on chromosome 3 and AC187287 on chromosome 8, shown as dark black dots that deviate from the curves, were not used in linear interpolations. A break in the curve on chromosome 5 occurs because a genetic distance on the linkage map corresponds to a small physical distance on the AGP map.
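The interpolation of linkage map positions described above can be illustrated with a minimal sketch. It assumes a hypothetical list of anchor points for one chromosome, each pairing a physical coordinate with a linkage position in cM; the function name `interpolate_cm` and the example coordinates are illustrative and are not taken from the AGP or NAM data. The sketch interpolates per physical coordinate, whereas the analysis reported here assigned one position per BAC. Positions outside the anchored range are left unassigned rather than extrapolated.

```python
from bisect import bisect_right

def interpolate_cm(anchors, bp):
    """Linearly interpolate a linkage position (cM) for a physical position.

    anchors: list of (physical_bp, genetic_cM) pairs for one chromosome,
             sorted by physical_bp with strictly increasing physical positions.
    Returns None when bp lies outside the anchored range (no extrapolation).
    """
    xs = [p for p, _ in anchors]
    if not anchors or bp < xs[0] or bp > xs[-1]:
        return None
    i = bisect_right(xs, bp)      # index of the first anchor to the right of bp
    if i == len(xs):              # bp coincides with the last anchor
        return anchors[-1][1]
    (x0, y0), (x1, y1) = anchors[i - 1], anchors[i]
    return y0 + (y1 - y0) * (bp - x0) / (x1 - x0)

# Hypothetical anchors: (physical position in bp, linkage position in cM)
anchors = [(1_000_000, 0.0), (3_000_000, 4.0), (9_000_000, 5.0)]
inside = interpolate_cm(anchors, 2_000_000)     # halfway between the first two anchors
outside = interpolate_cm(anchors, 10_000_000)   # beyond the last anchor, left unassigned
```

In this toy example the second interval spans three times the physical distance of the first but only one quarter of the genetic distance, which mimics a region of low recombination.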

Imputation of parental SNPs onto segregating progeny

SNPs with known physical locations were imputed in each RIL by computing the expectation of the genotypic score given the flanking marker genotypic scores, as described by Haley and Knott (1992). The maize NAM population consists of RILs that were produced by self-pollinating the lines for five generations after the initial cross of the parental inbred lines. Thus, not all loci are homozygous in the segregating progeny. Genotypes homozygous for the B73 allele were coded as −1, genotypes homozygous for the alternative allele as 1, and heterozygous genotypes as 0.

Assume that one SNP locus Q is genotyped in the parental lines but not in their progeny, and that this locus is flanked by two SNP loci, A and B, which are genotyped in both the parental lines and their progeny within a family. The expectation of the genotypic score is then based on the following: (1) The transition probabilities from one genotype at one locus to one genotype at another locus, P(Q = q|A = a) and P(B = b|Q = q), are obtained following Jiang and Zeng (1997). These transition probabilities are functions of the frequency of recombinants between the two flanking loci and the number of selfing generations. (2) The conditional probability of the genotype at SNP Q given the flanking SNP loci A and B is computed as:

P(Q = q|A = a, B = b) = P(Q = q|A = a)P(B = b|Q = q) / ∑_q P(Q = q|A = a)P(B = b|Q = q) (Jiang and Zeng 1997). (3) The expectation of the genotypic score at SNP Q is computed as (1)P(Q = 1|A = a, B = b) + (0)P(Q = 0|A = a, B = b) + (−1)P(Q = −1|A = a, B = b) = P(Q = 1|A = a, B = b) − P(Q = −1|A = a, B = b). Where computation is needed at the terminal ends of a linkage group, SNP locus Q has only one adjacent polymorphic SNP locus. In that case the conditional probability reduces to the transition probability P(Q = q|A = a), normalized over q as P(Q = q|A = a) / ∑_q P(Q = q|A = a), and the expectation of the genotypic score is computed as (1)P(Q = 1|A = a) + (0)P(Q = 0|A = a) + (−1)P(Q = −1|A = a) = P(Q = 1|A = a) − P(Q = −1|A = a).
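The three steps above can be sketched in a few lines of Python. The sketch takes the transition probabilities as input and does not reproduce the formulas of Jiang and Zeng (1997). The helper `toy_transition`, its parameters `R` and `h`, and the values used are hypothetical placeholders whose only purpose is to supply rows that sum to one; in practice they would be replaced by transition probabilities derived from the recombination frequency and the number of selfing generations.

```python
GENOTYPES = (-1, 0, 1)  # B73 homozygote, heterozygote, alternative homozygote

def toy_transition(R, h=0.03):
    """Placeholder matrix T[a][q] standing in for P(Q = q | A = a).

    R: assumed probability that two loci carry different homozygous genotypes.
    h: assumed residual heterozygote frequency.
    NOT the Jiang and Zeng (1997) derivation; it only yields valid rows.
    """
    T = {}
    for a in GENOTYPES:
        if a == 0:
            T[a] = {-1: (1 - h) / 2, 0: h, 1: (1 - h) / 2}
        else:
            T[a] = {a: (1 - R) * (1 - h), 0: h, -a: R * (1 - h)}
    return T

def expected_score(a, T_aq, b=None, T_qb=None):
    """Expected genotypic score at Q given flanking genotype a (and b, if any).

    T_aq[a][q] stands for P(Q = q | A = a); T_qb[q][b] for P(B = b | Q = q).
    With b=None, only the single adjacent marker A is used (end of linkage group).
    """
    weights = {}
    for q in GENOTYPES:
        w = T_aq[a][q]
        if b is not None:
            w *= T_qb[q][b]
        weights[q] = w
    total = sum(weights.values())               # normalizing sum over q
    return (weights[1] - weights[-1]) / total   # P(Q = 1 | .) - P(Q = -1 | .)

T = toy_transition(R=0.05)
same_flanks = expected_score(-1, T, -1, T)   # both flanking markers carry B73 alleles
recombinant = expected_score(-1, T, 1, T)    # flanking markers disagree
terminal = expected_score(1, T)              # only one adjacent marker
```

With identical placeholder matrices on both sides of Q, an interval whose flanking markers disagree yields an expected score of zero, while concordant flanking markers yield a score close to −1 or 1.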

Results and discussion

Estimation of linkage map positions

About 90% (444,615 of 495,091) of the genotyped SNPs from the maize HapMap project were assigned linkage map positions through linear interpolation between the maize AGP and NAM linkage maps (Table 1). The mapped positions of individual SNPs are available through the GFS Sprague Population Genetics website (Table S1, http://www.agron.iastate.edu/GFSPopGen/resources.html). Approximately 10% of the SNPs were not assigned linkage map positions because they were located in: (1) BACs that were assigned to known chromosomes but appear to be genetically located beyond the ends of the linkage group; (2) BACs that have not been mapped consistently to the same chromosomes by the maize AGP and NAM projects (Table 2); (3) BACs that are unassigned to chromosomes; and (4) three BACs whose physical and linkage locations were not consistent within chromosomes 3 and 8 (Fig. 1). With removal of the three inconsistent BACs in the last group, all relationships between physical and linkage maps show similar smooth curves, with large numbers of BACs associated with little recombination in heterochromatic regions of the genome. The continuous nature of the curves indicates that gaps in the physical map are so small that they do not seriously affect the estimation of linkage map positions of SNPs by linear interpolation. If there had been large discontinuities and changes in direction of the curves, such interpolation for placement of SNP loci would not be justified.

Table 1 Summary of estimated genetic locations of SNP loci in NAM parental lines obtained through linear interpolation of information from verified physical (AGP: http://www2.genome.arizona.edu/genomes/maize) and linkage (NAM: http://www.panzea.org/) maps
Table 2 Inconsistent relationships between maize physical map and NAM linkage maps

Imputation of SNP genotypes from parents to segregating progeny

A total of 444,615 SNP genotypes in the parental lines were projected onto RILs of the maize NAM population and are available for subsequent analyses at the GFS Sprague Population Genetics website (Table S2 at http://www.agron.iastate.edu/GFSPopGen/resources.html). In some families, SNP genotypes were considered missing if: (1) the genotype of either parent was missing, or (2) the genotypic score provided by the HapMap project was not equal to 0 or 1. The missing genotypes account for approximately 27% of the projected genotypes. About 5% of the projected genotypes have absolute genetic score values between 0.1 and 0.9. The remaining 68% have absolute genetic score values between 0.9 and 1.0 (Table 3).

Table 3 Summaries of absolute expected genotypic scores in segregating progeny of the maize NAM population
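A summary of this kind can be sketched as a simple tabulation of absolute expected scores. The function `summarize_scores`, the bin labels, and the choice of half-open bin boundaries are assumptions made for illustration; the text does not state how scores exactly equal to 0.1 or 0.9 were classified, and the example scores are invented.

```python
def summarize_scores(scores):
    """Return the fraction of projected genotypes in each absolute-score class.

    scores: iterable of expected genotypic scores in [-1, 1], with None for
            genotypes treated as missing.
    """
    counts = {"missing": 0, "below 0.1": 0, "0.1 to 0.9": 0, "0.9 to 1.0": 0}
    n = 0
    for s in scores:
        n += 1
        if s is None:
            counts["missing"] += 1
        elif abs(s) < 0.1:
            counts["below 0.1"] += 1
        elif abs(s) < 0.9:
            counts["0.1 to 0.9"] += 1
        else:
            counts["0.9 to 1.0"] += 1
    if n == 0:
        return {label: 0.0 for label in counts}
    return {label: c / n for label, c in counts.items()}

# Invented example: one missing genotype and three projected scores
fractions = summarize_scores([None, 0.0, -0.5, 1.0])
```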

Discussion

Plant species and model organisms (e.g., mouse: Churchill et al. 2004) exhibit characteristics that favor development of NAM populations. Pure inbred lines and large segregating families are relatively easy to develop or already available, whereas large samples (minimum of 2,000 cases and controls: Hirschhorn and Daly 2005; Kingsmore et al. 2008) of unrelated, yet adapted, accessions required for association mapping are not available in most crop species. Consequently, NAM populations are being developed for Arabidopsis (Buckler and Gore 2007) as well as soybean, barley and sorghum (personal communications). Alternatively, a large number of QTL mapping studies have been completed in various crops. If the inbred parental lines, stored in germplasm repositories, are resequenced or array-genotyped, already available phenotypic data can be exploited using a multiple family QTL analysis (Jansen et al. 2003; Jannink and Wu 2003).

As shown herein, the computational challenges of imputing parental genotypes onto segregating progeny can be handled simply through linear interpolation of genetic location and subsequent calculation of expected genotypes. Such information has been shown to provide powerful, precise and accurate identification of functional markers responsible for a variety of simulated genetic architectures (Guo et al. 2010). Importantly, forward genetic approaches that require large samples for quantitative traits are enabled by sequencing or array-genotyping of parental lines coupled with sparse genotyping of segregating progeny. This significantly reduces costs and enables genome-wide mapping through resequencing or array-genotyping of dozens of lines rather than thousands (Yu et al. 2008; Nordborg and Weigel 2008).