Single feature polymorphism discovery
The Affymetrix Soybean Genome Arrays were hybridized with cRNA from six pigeonpea genotypes, and the differential expression data were analyzed using the RPP method to identify SFPs in pigeonpea (Das et al. 2008). On these arrays, only 37,376 transcripts are represented by probe sets consisting of 11 PM probes. Therefore, the microarray data obtained for these 37,376 transcripts were used to generate “present”, “marginal”, and “absent” calls in each genotype for SFP analysis. Scatter plots of the 411,136 PM probes for all pair-wise combinations revealed much less variation between the two biological replicates of each accession (except the replicates of ICP 28) than between any two accessions, suggesting that detecting SFP probes between accessions was feasible. The number of “present” calls ranged from 4,882 to 5,810 probes across the 15 pair-wise comparisons. Notably, the number of probes showing differential hybridization, and thus qualifying as potential SFPs, was lower in the combination ICPL 8755 and ICPL 227 (4,989) than in the ICP 28 and ICPW 94 (5,405) or ICPL 151 and ICPL 87 (5,455) pairs (Table 1).
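The probe and comparison counts quoted above follow from simple arithmetic, which the short Python sketch below reproduces. The variable names are illustrative and not part of the original analysis; the genotype list is assembled from the accessions named in this section.

```python
from itertools import combinations

# Accessions named in this section (six genotypes in total).
genotypes = ["ICP 28", "ICPW 94", "ICPL 151", "ICPL 87", "ICPL 8755", "ICPL 227"]

TRANSCRIPTS = 37_376    # transcripts represented by full probe sets
PM_PER_PROBE_SET = 11   # PM probes per probe set

# Total PM probes available for the scatter plots.
total_pm_probes = TRANSCRIPTS * PM_PER_PROBE_SET

# Every unordered pair of genotypes gives one pair-wise comparison.
pairs = list(combinations(genotypes, 2))

print(total_pm_probes)  # 411136
print(len(pairs))       # 15
```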
Table 1 Summary on identification and validation of SFP in pigeonpea
Using stringent criteria and RPP analysis, a total of 5,692 potential SFPs were discovered across the six genotypes (Table 1). As an example, Fig. 1 shows detection of an SFP between the ICPL 8755 and ICPL 227 genotypes for probe #5 of the SFP probe set GmaAffx.5953.1.A1_at. As the six genotypes analyzed in this study represent the parents of three mapping populations, an effort was made to identify the SFPs for the different parental combinations. In this context, 850 SFPs were identified from 5,405 “present” probe sets in ICP 28 × ICPW 94 (cross 1), 854 SFPs from 5,455 “present” probe sets in ICPL 151 × ICPL 87 (cross 2), and 780 SFPs from 4,989 “present” probe sets in ICPL 8755 × ICPL 227 (cross 3). However, many of the detected SFPs were unique to one parental combination: 17 SFPs were common to crosses 1 and 2, 14 to crosses 1 and 3, and 19 to crosses 2 and 3, while 10 SFPs were common to all three crosses. These results reconfirm earlier observations of a low level of genetic diversity based on other marker systems such as amplified fragment length polymorphisms (Panguluri et al. 2006), Diversity Array Technology (Yang et al. 2011), and SSRs (Saxena et al. 2010a). Nevertheless, the present study adds a set of about 1,000 novel markers (SFPs) for genetics and breeding analysis in pigeonpea.
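To put the per-cross figures on a common scale, the sketch below expresses each SFP count as a percentage of the “present” probe sets for that cross. The counts are taken from the text; the dictionary layout and the `sfp_rate` helper are illustrative assumptions, not part of the original analysis.

```python
# (SFPs identified, "present" probe sets) per cross, as reported in the text.
crosses = {
    "cross 1": (850, 5405),  # ICP 28 x ICPW 94
    "cross 2": (854, 5455),  # ICPL 151 x ICPL 87
    "cross 3": (780, 4989),  # ICPL 8755 x ICPL 227
}

def sfp_rate(sfps: int, present: int) -> float:
    """Percentage of 'present' probe sets that yielded an SFP."""
    return round(100 * sfps / present, 2)

rates = {name: sfp_rate(*counts) for name, counts in crosses.items()}
for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

Under these counts the three crosses give very similar rates (roughly 15.6–15.7%).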
Validation of SFPs
To validate the SFPs at the sequence level, a subset of SFPs was selected for allele-specific resequencing. Homologous sequences in pigeonpea for the soybean probes detecting SFPs were identified by comparing the corresponding soybean genes with the transcriptome assembly of pigeonpea (Dubey et al. 2011), which comprises 127,754 TUSs defined by cluster analysis of 454/FLX transcript reads and Sanger ESTs of pigeonpea (Raju et al. 2010). Applying an e-value ≤ 10^-5 and sequence similarity ≥ 80%, 2,745 (48.2%) of the 5,692 SFP-containing probes identified homologues in pigeonpea. These pigeonpea sequences were further examined for the presence of the interrogation position of the corresponding soybean SFP probe. As a result, 1,815 pigeonpea TUSs were found positive and used for primer design with an expected amplicon size of 200–500 bp. Primer pairs could be designed for 1,131 TUSs containing SFP probe target regions (Electronic Supplementary Material Table 1). In the remaining cases, primers could not be designed, mainly because the probe target region was very near (≤20 bp) the end of the sequence or because the sequence did not fit the default parameters of the Primer3 software (Nayak et al. 2010).
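The two filtering ideas in this paragraph (the homology thresholds, and the observation that probe targets within 20 bp of a sequence end defeated primer design) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the `Hit` record layout, the helper names, and the toy hits are all hypothetical, and only the three threshold values come from the text.

```python
from dataclasses import dataclass

MAX_EVALUE = 1e-5      # e-value threshold stated in the text
MIN_SIMILARITY = 80.0  # percent similarity threshold stated in the text
END_MARGIN_BP = 20     # targets within 20 bp of a sequence end were problematic

@dataclass
class Hit:
    """Hypothetical record for one soybean-probe-to-pigeonpea-TUS match."""
    probe_id: str
    tus_id: str
    evalue: float
    similarity: float   # percent
    target_start: int   # 0-based start of the probe target on the TUS
    target_end: int     # end of the probe target (exclusive)
    tus_length: int

def is_homologue(hit: Hit) -> bool:
    """Apply both homology thresholds."""
    return hit.evalue <= MAX_EVALUE and hit.similarity >= MIN_SIMILARITY

def target_clear_of_ends(hit: Hit) -> bool:
    """True when the probe target is more than END_MARGIN_BP from both ends."""
    left = hit.target_start
    right = hit.tus_length - hit.target_end
    return left > END_MARGIN_BP and right > END_MARGIN_BP

# Toy hits, invented purely to exercise the filters.
toy_hits = [
    Hit("probe_a", "tus_1", 1e-30, 92.0, 150, 175, 600),  # passes both filters
    Hit("probe_b", "tus_2", 1e-3, 95.0, 150, 175, 600),   # e-value too high
    Hit("probe_c", "tus_3", 1e-12, 74.0, 150, 175, 600),  # similarity too low
    Hit("probe_d", "tus_4", 1e-20, 88.0, 5, 30, 600),     # target too near the start
]

homologues = [h for h in toy_hits if is_homologue(h)]
primer_candidates = [h.probe_id for h in homologues if target_clear_of_ends(h)]

print([h.probe_id for h in homologues])  # ['probe_a', 'probe_d']
print(primer_candidates)                 # ['probe_a']
```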
To investigate whether the identified SFPs were related to sequence variations, 179 SFPs were randomly selected for validation. PCR with these 179 primer pairs on the same set of six genotypes used for SFP discovery produced strong and prominent amplicons in 102 (56.98%) cases. In the remaining cases, either no amplification or nonspecific amplification was observed. Amplicons generated with the 102 primer pairs were sequenced using Sanger sequencing. Good quality sequence data were obtained for 99 primer pairs and were further analyzed to identify SNPs and INDELs. Most sequences were 250–650 bp, but some were as short as 207 bp or as long as 1,336 bp. Analysis of the sequence data with the “Divest” tool (Jayashree et al. 2009) showed a total of 7,535 sequence polymorphisms (including SNPs and INDELs) for 75 (75.8%) of the 99 primer pairs (Electronic Supplementary Material Table 2). Among the sequence variations identified within the 99 probe regions, 363 were SNPs and 44 were INDELs. A representative alignment of genomic amplicon sequences examined for a putative SFP between the ICP 28 and ICPW 94 genotypes for probe #3 of probe set Gma.12798.1.S1_at revealed four SNPs between the genotypes (Electronic Supplementary Material Figure 1). In addition, compared with the pigeonpea TUS (TUS ID127906_2368_0221), there was a single-base deletion in the two genotypes investigated. Compared with the soybean gene (Gma.12798.1.S1_at), on the other hand, an insertion of 2 bp was observed in both pigeonpea genotypes.
Based on the SFP predictions for the 99 genes with sequence data, the parental genotypes of cross 1 were expected to show sequence polymorphisms for 38 genes, those of cross 2 for 58 genes, and those of cross 3 for 43 genes. Sequence analysis of the 75 polymorphic genes, however, confirmed SNPs or INDELs for 31 (81.58%), 20 (34.48%), and 18 (41.86%) genes for the parental genotypes of crosses 1, 2, and 3, respectively (Table 1). Across all six genotypes, the accuracy of the array in predicting the presence of a sequence variant based on SFP was 52.6%.
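The per-cross confirmation percentages can be reproduced from the predicted and confirmed gene counts, as in the sketch below. The counts come from the text; the dictionary names are illustrative. The 52.6% figure across all six genotypes is reported in the text and is not recomputed here, since the per-genotype counts behind it are not given in this section.

```python
# Genes predicted (by SFP) and confirmed (by sequencing) to be polymorphic.
predicted = {"cross 1": 38, "cross 2": 58, "cross 3": 43}
confirmed = {"cross 1": 31, "cross 2": 20, "cross 3": 18}

# Confirmation rate per cross, as a percentage rounded to two decimals.
confirmation_rate = {
    cross: round(100 * confirmed[cross] / predicted[cross], 2)
    for cross in predicted
}

for cross, rate in confirmation_rate.items():
    print(f"{cross}: {confirmed[cross]}/{predicted[cross]} = {rate}%")
```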
In summary, 52.6% of the predicted SFPs were found to be true, while the remaining 47.4% were false. The false discovery rate (FDR) of the SFPs predicted in this study is higher than the FDRs reported in other studies using cRNA for SFP detection (Rostoks et al. 2005; Cui et al. 2005; Das et al. 2008; Yang et al. 2009). For instance, when barley Affymetrix genome arrays were used to detect SFPs in barley genotypes, the FDR was 10–20% (Cui et al. 2005) and 40% (Rostoks et al. 2005). It should be noted, however, that the present study deployed soybean genome arrays for SFP discovery in pigeonpea genotypes. Soybean genome arrays were previously used to detect SFPs in cowpea (Das et al. 2008), and barrel medic (Medicago truncatula) genome arrays were used for SFP discovery in alfalfa (Medicago sativa) (Yang et al. 2009). In these species, however, a lower FDR (32% in cowpea and 17% in alfalfa) was observed. The higher FDR (47.4%) observed in the present study can be attributed to several factors. First, the phylogenetic distance between soybean and pigeonpea is greater than that between soybean and cowpea or between barrel medic and alfalfa. Owing to the relatively low sequence similarity between probes and transcripts in cross-species hybridization, more probes may cross-hybridize than in same-species hybridization (Bar-or et al. 2006). Second, some false positives may arise when paralogs that do not exhibit polymorphism are sequenced in the tested genotypes. Third, variation in post-transcriptional modification, such as alternative splicing in some genes, may also have contributed to the observed FDR (Rostoks et al. 2005; Das et al. 2008).
In addition to checking the SFPs specific to a given cross, the SFPs common to the parents of crosses 1 and 2 (17), crosses 1 and 3 (14), crosses 2 and 3 (19), and crosses 1, 2, and 3 (10) were also validated. The highest proportion of SFPs was confirmed for the parents of crosses 1 and 2 (14, 82.35%), and the lowest for the parents of crosses 2 and 3 (9, 47.37%). For the parents of crosses 1 and 3, 9 SFPs (64.29%) were confirmed, while 5 (50%) of the SFPs common to all the parents were confirmed at the sequence level. Such polymorphisms generally represent haplotypes, which will be very useful for mapping these genes, especially for constructing consensus genetic maps. Moreover, these sequence variations lie in the expressed portion of the genome and hence can be directly associated with important phenotypic traits.
Gene ontology descriptions
Functional analysis was carried out for the 922 pigeonpea TUSs corresponding to the soybean transcripts that detected SFPs between the parents of the mapping populations segregating for drought tolerance (ICPL 8755 × ICPL 227 and ICPL 151 × ICPL 87). Analysis of these sequences against the UniProt database (Uniref50) showed that 752 TUSs (81.56%) had similarity to proteins in the database at a stringent criterion of e-value ≤ 10^-5. Subsequently, these TUSs were analyzed for gene ontology descriptions based on their BLASTX functions. As a result, 724 (78.52%) TUSs could be assigned to three principal categories: molecular function (599), biological process (559), and cellular component (570). When these TUSs were distributed into the subcategories of the three main categories, the highest number of TUSs fell into cell part (555), followed by cellular process (453), nucleotide binding (426), and metabolic process (409) (Electronic Supplementary Material Figure 2). Gene ontology analysis was also extended to identify genes related to stress responses. A total of 139 TUSs were found under the “response to stimulus” subcategory, which includes both abiotic and biotic responses. This provides an indication of the involvement of these genes in drought tolerance. Functional genomics approaches such as qRT-PCR may validate the function of these genes (Hu et al. 2009). Furthermore, linkage mapping of these SFPs using the aforementioned mapping populations may reveal associations of these genes with QTLs for drought tolerance, which could then be used as “functional markers” in marker-assisted selection approaches in pigeonpea breeding.
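The three principal category counts (599, 559, and 570) sum to more than the 724 annotated TUSs because a single TUS can carry terms from more than one category. The sketch below illustrates this counting with a toy annotation table and also reproduces the two percentages quoted above. The TUS identifiers and the annotation mapping are invented for illustration; only the totals 922, 752, and 724 and the category counts come from the text.

```python
from collections import Counter

# Toy mapping of TUS -> GO principal categories (invented for illustration).
toy_annotations = {
    "TUS_A": {"molecular_function", "biological_process", "cellular_component"},
    "TUS_B": {"molecular_function"},
    "TUS_C": {"biological_process", "cellular_component"},
}

# Count each TUS once per category it belongs to.
per_category = Counter(
    category for categories in toy_annotations.values() for category in categories
)
print(dict(per_category))
print(sum(per_category.values()), ">", len(toy_annotations))  # 6 > 3

# Figures reported in the text.
TOTAL_TUS = 922
with_uniref_hit = 752
with_go_assignment = 724
reported_categories = {
    "molecular_function": 599,
    "biological_process": 559,
    "cellular_component": 570,
}

pct_uniref = round(100 * with_uniref_hit / TOTAL_TUS, 2)
pct_go = round(100 * with_go_assignment / TOTAL_TUS, 2)
print(pct_uniref, pct_go)  # 81.56 78.52

# The category counts overlap, so their sum exceeds the annotated total.
print(sum(reported_categories.values()) > with_go_assignment)  # True
```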
In summary, the present study identified 5,692 unique candidate SFPs, extending the marker repertoire in pigeonpea with a functional marker system. Allele-specific sequencing of a selected set of genes detecting SFPs showed that 52.6% of the SFPs analyzed were associated with actual sequence polymorphisms in the probe sets. Gene ontology analysis of the genes detecting SFPs provided a set of candidate genes that may be associated with drought tolerance. These candidate genes are a useful resource for gene expression analysis as well as for the development of functional markers for both basic and applied research, especially for drought tolerance in pigeonpea improvement.