Background

The global diversity of human genomes is the outcome of a series of demographic and evolutionary events including migration, bottlenecks, admixture, population isolation, natural selection and genetic drift which occurred in different parts of the world at various time points in history [13]. Genomic signatures of many of these events have been preserved in the genomes of different populations and play a pivotal role in uncovering demographic histories in addition to understanding health and disease [4, 5]. In the last decade, two major large consortium based efforts; the HapMap project, the Human Genome Diversity Project (HGDP), as well as several other studies, based on genotyping of single nucleotide changes, have attempted to catalogue the genetic variations that exist between individuals of a population as well as within different populations across continents [611].

Data from these studies on genetic diversity have been instrumental in estimating the origin and history of different contemporary populations as well as shedding light on the evolutionary relationship between them [12]. Moreover, the genotype data from these studies have been subjected to various computational techniques to derive estimates of population sizes and divergence times for the major demographic events in human history, which in many cases have been found to be in agreement with evidence from existing historical accounts and archaeological records [13, 14]. However, these studies were based on a fixed number of single nucleotide polymorphisms (SNPs) which had clear ascertainment bias (the SNPs included in the genotyping platforms were selected on the basis of their occurrence and frequencies primarily in European populations), therefore it was difficult to reliably assess the nature and extent of genomic diversity that exists among different populations from these studies [15].

The next major wave of information about genetic and genomic diversity in human populations came from studies based on exome and whole genome sequencing [1619]. The 1000 Genomes Project, for example, in addition to identifying millions of novel SNPs and more than a million short structural variants (SSVs), showed that rare variants account for a large majority of the existing genetic diversity between individuals as well as within populations [17, 18]. Moreover, it was suggested that there is an excess of rare and deleterious mutations in human genomes, probably resulting from exponential population growth and weak purifying selection [17, 18]. Studies based on deep sequencing of selected regions from thousands of individuals further show that the majority of rare coding variants, with allele frequencies lower than 0.0005, are also population-specific and potentially deleterious [19]. In addition to thousands of contemporary human genomes, sequencing of many archaic genomes has also been performed recently which has provided evidence for archaic admixture in non-African genomes [2022]. Such admixture might also be present in at least some of the African populations [23, 24]. These studies taken together have not only resulted in a paradigm shift in our understanding of various aspects of human genomic diversity but also provided necessary data for addressing numerous other questions related to human genome evolution.

SNPs and structural variants are broadly classified into common and rare based on minor allele frequencies (MAF). A widely used cut-off for defining rare SNPs being a MAF of less than 0.05 [17]. However, this cut-off is pragmatic in nature and does not have any special biological relevance. Although differences in SNP allele frequencies might be influenced by various demographic factors like selection and population size, time is the major determinant in the rise or fall of allele frequencies. Mathematical estimates suggest most of the common SNPs to have originated thousands of years ago and therefore to have a wider geographic distribution in contrast to rare variants which are mostly more recent and geographically restricted [25]. The rare and common variants therefore allow us to investigate events at different time scales of demographic histories. The relative phenotypic importance of common and rare SNPs is highly debated [26]. Nevertheless, while most of the Mendelian traits and deleterious mutations have been shown to be rare; several studies suggest some continuous traits like height might well be explained in terms of common SNPs [27, 28].

SNPs and structural variants are often classified into ‘private’ and ‘shared’ based on their distribution in a single population or a range of populations. The term private however might imply different things based on the context, for example, a SNP might be private to an individual or a family, or to a population (monomorphic in all but one population; also referred to as ‘population private’) or to an ancestral group. Therefore, we will use the term ‘population-specific’ for the SNPs that have been found to occur only in a single population. Although private SNPs have not been shown to be involved in major phenotypic traits or common diseases, population-specific SNPs might well be important in ascribing characteristic phenotypes and disease susceptibility/protection to a population [29, 30].

Population specificity of genetic variants, if the population-specific allele is the derived allele, might originate from two different scenarios: in the first scenario, a variant allele originates in a single population and remains restricted to the population of its origin. The second scenario is that the variant originated before differentiation of populations, survives in only a single population, and gets eliminated from other populations. In cases where the population-specific allele is the ancestral allele, both the alleles are estimated to have evolved far back in evolutionary history and the derived allele replaces the ancestral allele in all but one of the populations, probably through selective sweeps. Alternatively, in some cases, the assignment of ancestral state may be incorrect. The other possible scenario by which population-specific SNPs might originate is by admixture with populations which are not included in the study or even populations which are no longer extant. Therefore, in addition to the functional role of these SNPs, the population-specific SNPs might also play an important role in characterizing ancestry and understanding demographic histories [31, 32]. For example, on a genome wide scale the number of population-specific SNPs in a population would be expected to be related to the age of the population and also to reflect demographic events like bottlenecks, geographical isolation and admixtures.

Despite their potential significance, population-specific SNPs have not been studied extensively. Previous HapMap data based studies on population-specific SNPs have been able to identify only a small number of population-specific SNPs due to ascertainment bias of the genotyping platform [6, 3335]. The availability of unbiased whole genome sequence data from sources like the 1000 Genomes project, however, has now made the identification and characterization of population-specific SNPs on a genome wide scale possible. Moreover, sequencing-based studies have shown population-specific SNPs to be one of the major components of genetic diversity within populations [1719, 36]. A deeper understanding of population-specific variations, their genomic distribution and potential functional relevance is important.

We have used 1000 Genomes sequence data (release October 2012), including more than one thousand individuals from 14 populations spanning Europe, Asia, Africa and America, to identify SNPs and structural variants that are private or specific to each population and to study their genomic distribution and potential functional relevance [17, 18]. However, as the population sample sizes are relatively small (<100) and the sequencing is low coverage (4X-6X) for most of the 1000 Genomes data, low frequency alleles are harder to accurately identify and may be incorrectly identified as population-specific [17, 18]. We have therefore focused our study on common population-specific (CPS) SNPs as higher MAF population-specific SNPs are expected to be more informative and less likely to be incorrectly annotated as population-specific in this dataset. We evaluated the frequency distribution of population-specific SNPs identified in our study in the context of the generally accepted model of population migration and differentiation. We analysed the genomic distribution of these SNPs using fixed length and fixed bin window scan based approaches to identify potential biases in genomic distribution of CPS SNPs. The CPS SNP-enriched genomic regions in different populations were then compared to test whether their preferential localization has overlaps across different populations. Analyses of signatures of selection and the distribution of recombination hotspots were performed in the CPS SNP-enriched genomic regions to determine the extent of involvement of these processes in generating CPS SNP-enriched genomic regions in different populations. Functional enrichment analysis of genes containing the CPS SNP was performed and the enriched functional classes for different populations were compared to identify possible functional trajectories in population-specific SNP evolution.

Results and discussion

Identifying SNPs unique to each population

One of the major achievements of the 1000 Genomes project has been the identification of numerous novel SNPs across different populations [17, 18]. The sequence-based approach employed in the 1000 Genomes project in contrast to the previous genotyping-based approaches like HGDP and HapMap, provides an unbiased estimate of human genetic variation across many populations globally [6, 7, 17, 18]. We have used the most recent version (October 2012) of the 1000 Genomes data to identify SNPs which are observed to be unique to each of the individual study populations [18]. These SNPs were categorized into CPS SNPs and rare population-specific (RPS) SNPs based on a MAF cut-off of 0.05. SNPs with MAF >0.05 were considered as CPS SNPs while SNPs with lower MAFs were considered as RPS SNPs. Although more than 99% of population specific SNPs in the 1000 Genomes data are RPS SNPs, we have focused our present study on CPS SNPs because the sample sizes (around 90–100 individuals for each population) and low coverage sequencing (around 4X for most of the genomic regions) used for generating the data make it difficult to reliably ascertain the population specificity of low allele frequency SNPs. Moreover, as these SNPs have a MAF of at least 0.05 they are less likely to be personal SNPs or the result of recent demographic events.

The present 1000 Genomes data contain two African (YRI (Yoruba in Ibadan, Nigeria), LWK (Luhya in Webuye, Kenya)), three Asian (JPT (Japanese in Tokyo, Japan), CHB (Han Chinese in Beijing, China) and CHS (Han Chinese South)), three American (MXL (Mexican Ancestry in Los Angeles, CA, USA), PUR (Puerto Ricans in Puerto Rico) and CLM (Colombians in Medellín, Colombia)), 5 European (IBS (Iberian Populations in Spain), GBR (British from England and Scotland), CEU (Utah residents with ancestry from northern and western Europe), FIN (Finnish in Finland) and TSI (Toscani in Italia)) and one admixed African (ASW (African Ancestry in SW USA)) population. The frequencies of common and rare population-specific SNPs in these populations have been summarized in Figure 1A and Figure 1B, respectively. Although the numbers of common and rare SNP differ by many folds, there are some broad similarities in the distribution patterns of the CPS SNPs and RPS SNPs.

Figure 1
figure 1

Population-specific SNPs in the 1000 genomes data. The number of population-specific SNPs for each of the 14 populations for common (A) and rare (B) SNPs are shown in (A) and (B). As the dataset includes admixed and related populations we removed the four known admixed populations (ASW, CLM, PUR, and MXL) and merged the two Chinese populations CHS and CHB into a single CHINESE population. The number of common (C) and rare (D) population-specific SNPs in the remaining 9 populations were retained for further analysis. The European populations are shown in orange, Asian populations in purple and the African populations in light green. The American and the admixed African populations are shown in blue.

For example, the highest number for both CPS SNPs and the RPS SNPs was observed in the LWK population followed by the Japanese (JPT) population. Interestingly, in contrast to the large number of RPS SNPs observed, just a few CPS SNPs were found to occur in the Chinese populations (CHB and CHS). This observation is consistent with the fact that these populations have a similar geographic origin, and the differentiation between them probably started little more than a thousand years ago with the Southward migration of the Northern Han population [3739]. In spite of the pronounced divergence of these populations, reflected in the high frequency of RPS SNPs and has also been observed in many previous studies, the relatively recent divergence has not allowed many of the population-specific alleles to reach frequencies of 0.05 [3739]. As our aim was to identify common SNPs which are unique in different populations, and we know that these populations have a common recent origin, we merged the two Chinese populations CHB and CHS into a single population (named CHINESE for this study). We recognise that this approach would not be suitable for a similar analysis with rare SNPs due to the extent of divergence that these populations have undergone recently.

One of the concerns with using all the current populations of the 1000 Genomes data for identifying population-specific SNPs is the inclusion of populations with known recent admixture, such as ASW and MXL (Supplementary Figures S4 and S9 from reference 18). The inclusion of these admixed populations may mask the true population specificity of SNPs. In order to identify SNPs which are truly unique to populations, ASW and the three American populations (MXL, PML and PUR), which are known to have undergone a significant amount of admixture in the recent past, were removed from the dataset [18]. It is worth noting, however, that the MXL, CLM and PUR populations contain a few hundred common SNPs which were not observed in any other continent/population. As indicated by previous population structure analyses, these populations harbour a significant Native American genetic component; and the order of Native American admixture in these three populations is approximated by the total number of population-specific SNPs in these populations (highest in MXL followed by CLM and then PUR) [18]. It would be an interesting follow-up study to isolate the population-specific SNPs of Native American origin and to functionally assess their significance in these populations.

The trimming and rearrangement of the population datasets resulted in 9 potentially independent and essentially non-admixed populations for further investigation in the current study. The distribution of the CPS SNPs and RPS SNPs for each population was recalculated considering these 9 populations only, and has been summarized in Figure 1C and D. The list of SNPs which were observed to be unique to each population along with their frequencies in the 14 study populations has been provided in Additional file 1. Although the removal of the admixed populations significantly increased the count of CPS SNPs for all the populations, the detected trends, for example the highest number of SNPs in LWK, followed by JPT, IBS and FIN, are similar in both sets (Figure 1A, C, B and D). An interesting exception is the YRI population, where the number of YRI specific CPS SNPs goes up by folds with the removal of the admixed African American population. This result concurs with the known history of recent migration and admixture of the Western African populations in North America [40]. However, in spite of this increase in the number of the CPS SNPs in the YRI, after removal of admixed populations, they still have only about half the number of CPS SNPs observed in LWK. This difference is, however, not surprising in view of the fact that a number of different populations, which most probably include the LWK along with other Bantu-speaking populations, have migrated to East Africa at different time points in history [4144]. The migration of several different populations along with the presence of indigenous East-African Khoesan-speaking populations in this region, which has been suggested to have contributed to the population differentiation in East-Africa, might also explain the high frequency of CPS SNPs and RPS SNPs observed in the LWK [41, 42].

The relatively high frequency of CPS SNPs as well as RPS SNPs in the Japanese population is notable. It is well known that the modern Japanese population contains admixtures of at least two distinct genetic components; the old migrants who migrated to the Japanese Archipelago approximately 30,000 years ago and the new migrants that reached Japan only about a couple of thousand years ago [4547]. It would be interesting to study how far the unique components of both these, and perhaps other migrating populations, add up to generate the high RPS SNPs and CPS SNPs observed in the JPT population.In addition to population histories, the sample size is also a strong determinant of how many variants and unique variants are observed in a population. For example, the huge increase in the frequency of RPS SNPs in the Chinese populations after the merger (Figure 1D) is also an outcome of the increase in sample size due to merging of the populations. As the sample size for the population has doubled the frequency of detection of RPS SNPs has increased proportionately and similar changes can be expected to be observed in other populations in the future as more samples from these populations are sequenced. Similarly, the lack of RPS SNPs in the IBS population in comparison to other populations can be ascribed to the inclusion of only 14 IBS samples in the current 1000 Genomes data set. It can be expected that as more samples are sequenced the fraction of RPS SNPs in this population will be in line with other populations.

We found that three of the European populations (CEU, GBR and TSI) have only a handful of common SNPs unique to them in contrast to a few hundred thousand rare SNPs. While this makes sense in terms of demographics [48, 49] and probable admixtures, it might also be a result of treating these related or partially admixed populations separately. Approaches that group these populations together, based on population histories, might lead to the identification of some CPS SNPs in these groups too. While the high frequency of CPS SNPs in the Finnish population (FIN) can be interpreted in terms of multiple genetic components and demographic factors like isolation, migration and admixture, which is reflected in their distinctive distribution in the European principal component analysis (PCA) plots in other studies [18, 50, 51], the high frequency of CPS SNPs in the Spanish (IBS) population needs to be treated with greater caution as the number of individuals sequenced for this population is only 14. Many of the SNPs which seem to be common (MAF > 0.05) in the IBS in the present data might turn out to be rare once other samples from this population are sequenced.Although our analysis is focused on SNPs, we studied the distribution of population-specific short structural variants (SSVs) to see whether their distribution in different populations concurs with that of the SNPs. Figure 2 shows the distribution of the common population specific structural variants (CPS SSVs) and rare population-specific structural variants (RPS SSVs). Interestingly, the relative prevalence of the SSVs across populations shows high concordance with that of SNPs. However, the numbers observed for rare and common SSVs are similar in contrast to the few fold difference observed in the number of common and rare SNPs.

Figure 2
figure 2

Population-specific common (MAF > 0.05) and rare (MAF ≤ 0.05) short structural variants (SSVs).

To classify the CPS SNP variant alleles into ancestral and derived (based on multi-species alignment) the ancestral/derived information for alleles in the 1000 Genomes vcf file was used [18]. As expected, more than 80% of the population-specific alleles were found to be the derived allele (Figure 3) indicating that most of these alleles likely arose in the individual populations after their divergence from other populations.

Figure 3
figure 3

Classification of population-specific SNP alleles into ancestral and derived. The SNPs for which no ancestral state information could be detected are shown as “Undefined” whereas the SNPs for which the ancestral state could not be detected with confidence are shown as “Not Sure”.

The relative prevalence of the CPS SNPs (as well as RPS SNPs and SSVs) across populations, therefore, shows high concordance with what can be expected on the basis of the generally accepted model of population divergence and the relationships between populations. However, as has been demonstrated, the number of population-specific SNPs observed in any population, in addition to population histories, is also influenced by factors like sample size and number of related and/or admixed populations included in the study. The removal of the admixed African and American populations almost doubled the number of common SNPs which were detected to be population-specific in the other 9 populations, indicating how important the detection and control of admixture is for identifying what is truly population-specific. While the lack of CPS SNPs in most European populations is not very surprising considering their population histories, as well as the number of populations (5 European populations in contrast to only 2 African and 3 Asian population) included in the dataset, it would be interesting to see how strongly the inclusion of other populations from Asia and Africa change the number of population-specific SNPs as new data pour in.

Genomic distribution of CPS SNPs

The distribution of SNPs has for long been known to be non-random across the genome [5255]. Recent studies have further suggested that the rates of mutations in a genomic region in addition to the genomic context might also depend on the presence of repeat sequences and even existing SNPs in the region [56, 57]. Moreover, genomic regions where genomes from different global populations differ very strongly from each other have also been observed [58]. Given this background it was interesting to investigate whether the CPS SNPs, as delineated in our study, also show clustered occurrences across the genome. To identify possible biases in the distribution of CPS SNPs in each population and test whether the enriched regions are similar in different populations we used a sliding window based scan. Although sliding window based approaches have been widely used to identify clusters within genomic regions [59, 60], this approach has been shown to find some false positive clusters in some cases [61]. Therefore, to minimize such false positive results we have used two different sliding windows based approaches and used a conservative p-value cut-off for delineating clusters of CPS SNPs in each population.

50-SNP windows

In the first approach, a window was defined as a set of 50 contiguous SNPs and each chromosome was scanned along the 50-SNP windows (with a slide of 50 SNPs per step) separately for each population. In each step the fraction of CPS SNPs in each window was recorded and compared to an expected value, based on the occurrence of CPS SNPs on the corresponding chromosome for the particular population. The statistical significance of the observations was estimated using cumulative hyper geometric p-values calculated for each window. The results clearly identified specific regions of the genome to be enriched with CPS SNPs in each population. We detected 655 CPS SNP-enriched windows/regions in the 6 populations (Table 1, Additional file 2). The populations CEU, TSI and GBR were not analysed due to a paucity of CPS SNPs. As for the number of CPS SNPs in the population, most CPS SNP-enriched windows were observed in the LWK, followed by YRI and JPT. It is interesting to note that, although both FIN and IBS contain a much greater number of CPS SNPs in comparison to the CHINESE population, which contains 24 enriched windows, only three CPS SNP-enriched windows were detected in the IBS population and a single such window was detected in the FIN population. The two highest-scoring windows detected for each population using this scan are shown in Table 2. In the highest-scoring windows for both LWK and YRI more than 50% of the SNPs were found to be CPS SNPs.

Table 1 Genomic regions enriched in common population-specific (CPS) SNPs identified using 50-SNP and 5-kb window approaches
Table 2 Best common population-specific (CPS) SNP-enriched windows for each population

5-kb windows

The second approach was to use a sliding window of 5 kilobases (kb). This approach, in addition to identifying CPS SNP-enriched regions, provides a more direct way to identify possible overlap within CPS SNP-enriched windows across populations. Using this scan, 565 5-kb regions were found to be significantly enriched for CPS SNPs in the 6 populations (Table 1). For each of the populations there was a very significant amount of overlap between the regions identified by the two sliding window based approaches (Table 1). The comparison of enriched windows identified using both the sliding window approaches shows that there is almost no overlap within the CPS SNP-enriched regions in these six populations (Figure 4). The second interesting aspect revealed by both the 50-SNP windows and 5-kb windows based approaches is that for many genomic regions the run of enrichment extends far beyond a single or couple of windows. The regions containing the longest stretches of enriched 50-SNP windows have been summarized in Table 3.

Figure 4
figure 4

Genomic distribution of common population-specific (CPS) SNP-enriched 5-kb windows. The windows show very little overlap between populations and there are many blocks within populations containing contiguous windows of CPS SNP enrichment.

Table 3 Longest CPS SNP-enriched 50-SNP window stretch for each population

Interestingly, the longest blocks and the highest scoring windows show significant overlap in some populations (Tables 2 and 3). For example, one of the longest blocks as well as one of the most CPS SNP dense windows was detected near the solute carrier organic anion transporter family, member 1B1 (SLCO1B1) gene in the YRI population. Sequence variants identified in the SLCO1B1 gene have been associated with altered transport activity and it has been shown that genetic polymorphisms in the gene have an impact on the inter-individual variability of the pharmacokinetics and pharmacodynamics of specific drugs [62, 63]. Previous studies have also observed unique genetic diversity in the SLCO1B1 gene between populations with the greatest diversity among African populations [62, 63]. Similar overlap was also observed in the RAP1 interacting factor homolog (RIF1) gene in the CHINESE population. Additional files 2 and 3 contain the full list of windows identified using these approaches, and the SNPs included in them. Interestingly, despite fewer CPS SNPs and the presence of only a few enriched windows, two significantly long stretches of enrichment are observed in the CHINESE population. Similarly, although the number of enriched windows in Japanese is less than one third of that of the YRI, the Japanese population seem to harbour much longer enriched window stretches in comparison to the YRI population, and this enrichment cannot be explained solely on the basis of increased LD in the Japanese compared to the YRI These observations taken together indicate that the bias in distribution of CPS SNPs is largely independent of the size of the datasets and the enriched windows or window blocks may represent genomic regions significant in terms of function or population histories.

Possible origin of CPS SNP-enriched genomic regions

Clusters of SNPs with highly differentiated allele frequencies, within and between species, have been observed in numerous previous studies [6466]. The origin of such clusters has been ascribed to various demographic factors like genetic drift and gene flow as well as forces like selection and local adaptations [6769]. The CPS SNP clusters observed in our study are somewhat similar to the clusters which show high allele frequency differentiation within populations as they represent genomic regions which vary widely across populations. However, there is an inherent difference in that in these regions both the SNP composition and SNP density is different in a single population compared to others. Considering this background it was important to investigate if the factors, which are assumed to generate clusters of SNPs with highly differentiated allele frequencies across populations, are also responsible for generating clusters of CPS SNPs. We used different computational approaches to test possible involvement of selection or increased recombination rates in the origin of these clusters.

Role of selective sweeps

To determine whether the genomic regions enriched in CPS SNPs have an association with selective sweeps, we used two different approaches to search for possible signatures of selection in these regions. The first approach was based on the iHS (integrated Haplotype Homozygosity Score) statistic, which in principle involves the detection of unusually long haplotypes of low diversity as signatures of selection [68]. iHS scores for each SNP in the 50-SNP windows which were found to be enriched with CPS SNP were computed using the program iHS_calc [70]. For each 50 SNP window we calculated the proportion of SNPs with |iHS| > 2 which we will call iES (iHS enrichment score). The background iHS and iES score distributions were estimated on the basis of the iHS score calculated from 10,000 random contiguous 50-SNP windows or blocks for each population. Based on the background distribution, we then estimated the number of 50-SNP windows which can be expected to correspond to the top 1%, 5% and 10% of iES scores for each population. The observed number for CPS SNP-enriched windows for each population which correspond to the top 1%, 5%, and 10% iES scores were compared with the number of expected windows and the corresponding p-value for each observation was then estimated using bootstrap resampling. The results show that although some of the CPS SNP-enriched windows show significant iHS score enrichment, the overall distribution does not indicate any significant association of selection with these windows (Figure 5A).

Figure 5
figure 5

Analysis of potential signatures of selection in the common population-specific (CPS) SNP-enriched windows. (A) Expected and observed number of iES (iHS enrichment score) enriched windows (see Methods for details) in YRI, LWK, JPT and CHINESE populations. The number which has been appended to the population code indicates the top nth percentage of iHS score considered (1 = top 1%; 5 = top 5% and 10 = top 10%). The corresponding p-values for enrichment are shown on the right axis. (B) Expected and observed occurrences of top 1%, 5% and 10% population branch statistic (PBS) scores amongst CPS SNP-enriched windows for YRI, LWK, JPT and CHINESE populations. A three letter population combination code (say YLJ) has been used to describe the 3 population set used for calculating the PBS score. The first letter (Y) indicates the population being analysed (YRI in this case). The CPS SNP-enriched windows are analysed for this population. The second letter (L) indicates the population to which it was compared (LWK here) and the third letter (J) indicates the outlier (JPT in this case). The number, appended with an underscore to each three letter dataset name indicates the top nth percentage of PBS score cut-off used for analysis (1 = top 1%, 5 = top 5% and 10 = top 10%).

One of the concerns about using a centi-morgan (cM) based physical map, such as the one used in this study, is that the signals for signatures of selection might get underestimated as the threshold of iHS > 2 used by Voight and colleagues [68] might be too stringent for a cM map based analysis. Therefore, we ran two independent sets of analysis in which the iES scores were defined on the basis of lowered thresholds of iHS > 1.75 and iHS > 1.5, respectively. However, no distinct enrichment of iHS scores was observed even in the lower threshold sets. Results from the analysis of the 5-kb windows were also found to be very similar to that obtained with the 50-SNP windows. It should, however, be kept in mind that iHS in itself might not be a very good metric for testing selective sweeps in a dataset which is known to contain many CPS SNPs of moderate allele frequencies because, unless on a single haplotype, these SNPs will have a tendency to disrupt long haplotype blocks. The results for the iHS scan, nevertheless, confirm that the CPS SNPs in CPS SNP-enriched windows show a complex distribution of SNPs which result in complex haplotype architectures, and not a single long haplotype.

To test for selective sweeps on the basis of allele frequency differentiation rather than haplotype lengths we used the population branch statistic (PBS); which has been found to be very useful in detecting high altitude adaptation-related SNPs in Tibetans relative to Han Chinese and Danish populations, as an alternative approach for detecting signatures of selective sweeps in CPS SNP-enriched windows [71]. PBS can be thought of as an estimate of the allele frequency change at a given locus in the history of a population since its divergence from another population. The idea behind this analysis is that if we consider two related populations and an outlier population, the allele frequency changes at any locus in these two populations should be equidistant (or have similar branch length) from the outlier. Therefore loci which show high allele frequency differentiation in only one of the related populations, reflected by high population branch length (and PBS score), may be potential candidates for selective sweeps.

For each population, the PBS statistic for each CPS SNP-enriched 50-SNP window was calculated using the method used by Yi et al. [69]. For the Asian populations (JPT and CHINESE) and European populations (IBS and FIN), YRI was used as the outlier population. Similarly, for the African populations YRI and LWK, the JPT population was used as the outlier. Although the choice of outlier for the populations might be questionable from a population history perspective, the distances within these populations suggest that this set can still provide reasonable estimates of branch lengths. For each 3-population set (e.g. YRI-LWK-JPT or JPT-CHB-YRI), we estimated the background distribution of the PBS scores, using 10,000 randomly-selected 50-SNP windows. We then identified score cut-offs based on the top 1%, 5% and 10% of the background distribution and estimated the number of 50-SNP windows which can be expected to be in the top 1%, 5% and 10% PBS score range for a population. The number of observed windows in the 1%, 5% and 10% range was compared to the expected number and the corresponding P-values were estimated using a bootstrap analysis. Figure 5B summarizes the PBS score distribution for the Asian and African populations. None of the windows which were found to be enriched with CPS SNPs in FIN and IBS were found to be in the top 1%, 5% or 10% range for the respective populations and hence were not retained for further analysis. It can be seen that, although some of the populations have some enrichment of high PBS scores in the CPS SNP-enriched windows, their lack of statistical significance as well as the overall distribution of PBS scores do not suggest that selection is common in these regions (Figure 5B). Although there are quite a few other tests for detecting selective sweeps [72, 73] which could have been employed for this dataset and might have identified a few more CPS SNP-enriched windows to be under selective sweeps, it is unlikely that they would change the landscape fundamentally and it can be safely concluded that selection is not the major factor causing CPS SNP enrichment in certain genomic regions. However, the efficiency of existing methodologies for detecting signatures of selection in datasets like the current 1000 Genomes dataset (which contain a large proportion low frequency SNPs, sequenced on a low coverage platform) is an important concern as genome wide variation in error rates might easily mask true signals and generate false positive signals of signature of selection. Development of parameters and efficient quality control measures well suited for identifying signatures of selection in such a dataset will significantly contribute to future work in this direction.

Role of recombination rate

Regions of high recombination have been shown to be related to higher SNP densities [74, 75]. As the SNP densities in the CPS SNP-enriched windows are higher in a single population compared to others, we considered whether there was any relationship between CPS SNP-enriched windows and higher recombination rates. To test the association of CPS SNP-enriched genomic regions with meiotic recombination rates, we obtained recombination hotspots based on the recombination maps generated by deCODE [74]. The distribution of recombination hotspots from the deCODE recombination map using a SRR (sex-standardized recombination rate) cut-off of 10 found only a handful of recombination hotspots within the CPS SNP-enriched regions in all populations taken together [76]. However, recombination hotspots have been found to vary significantly among populations [77, 78] and as a population-specific perspective of recombination was key for this study, in addition to the generalized deCODE recombination map, the linkage disequilibrium (LD) based HapMap YRI map (hapMapRelease24YRIRecombMap) was used to identify recombination hotspots and coldspots for the YRI population [6, 33, 34]. Similarly the combined HapMap recombination map (hapMapRelease24CombinedRecombMap) was used to identify recombination hotspots and coldspots for all other populations [6, 33, 34].We studied the genomic distribution of the recombination rates from the YRI-specific map and the genomic regions corresponding to the top 1% recombination rates were defined as recombination hotspots for YRI. A second set of hotspots, likewise, were defined on the basis of the top 5% recombination rates. Similarly, two sets of coldspots were defined by the lowest 1% and 5% recombination rates. Based on the genomic distribution of recombination rates in YRI we estimated the number of hotspot sites expected to occur in CPS SNP-enriched windows for the YRI population. The observed rates were compared with the expected rates and the statistical significance of enrichment of recombination hotspots were estimated at both 1% and 5% levels. The CPS SNP-enriched regions defined on the basis of both length (5-kb) and 50-SNP windows were analysed separately. The frequency of sites with the top 1% and 5% recombination rates in both sets of YRI-specific CPS SNP-enriched regions in comparison to the respective background distributions of genomics regions with the top 1% and 5% recombination rates has been summarized in Figure 6A. It is clear that for both kinds of windows and at both levels (top 1% and 5%) the recombination hotspots were highly enriched in the population specific SNP-enriched genomic regions. The analysis of coldspots at both 1% and 5% levels, on the other hand, show that these sites are highly under-represented in the CPS SNP-enriched regions. A similar analysis for other populations using the combined map (hapMapRelease24CombinedRecombMap) shows that the trend of very significant enrichment of these hotspots and significant depletion of the recombination coldspots is consistently seen in all populations (Figure 6B). A combined analysis of CPS SNP-enriched windows from all the populations taken together also shows the same trend (Figure 6B).

Figure 6
figure 6

Recombination rates in common population-specific (CPS) SNP-enriched regions. A. The expected and observed number of hotspots (HS), defined on the basis of top 1% and 5% recombination rates) and coldspots (CS) (defined on the basis of lowest 1% and 5% recombination rates) in CPS SNP-enriched regions. (A) Recombination rates for the YRI was estimated on the basis of the HapMap24 YRI specific map downloaded using the UCSC table browser. The distribution of hotspots in regions detected by length based (5-kb) and window based (50-SNP) approaches using the top 1% (indicated with _1) and 5% (shown by _5) recombination rate is shown (B) The combined recombination map was used to identify whether the observed pattern of distribution of hotspots and cold spots in YRI also hold for JPT, LWK and CHINESE population specific windows (based on top 5% recombination rates). In addition to individual populations, the CPS SNP-enriched windows for all four populations taken together (ALL_HS and ALL_CS) are also shown.

Although this analysis shows a very clear trend, as the maps used in this study are LD based, further evidence in terms of experimentally derived data for at least some of these regions will be required to reliably establish the relationship between recombination hotspots and CPS SNP-enriched windows. Nevertheless, the observed enrichment of recombination hotspots in CPS SNP-enriched genomic regions hints that high recombination might be one of the factors contributing to the generation of CPS SNP clusters. The presence of recombination hotspot(s) in a short genomic region (5 kb or 50 SNP), especially in case of a genotype based recombination map like the one used here, clearly indicates the LD architecture to be complex and the LD blocks to be short within that particular region. Moreover, as the width of a recombination hotspot (1–2 kb) is significant with respect to length of the sliding windows (5 kb or 50 SNP) used in the analysis, the presence of even a single hotspot can lower the LD of the region covered within the window considerably. The enrichment of recombination hotspots, therefore suggests that LD blocks are probably shorter and that LD is probably lower in the CPS SNP-enriched regions compared to average genomic regions. Moreover, in addition to recombination rate associated SNP density variations, the high recombination rates also suggest that the effects of population admixtures will be more prominent in these regions, which might also be an important source of the observed CPS SNP clusters. Furthermore, as recombination hotspots have been found to vary significantly among populations [77, 78]. Therefore, if recombination hotspots play a role in generating CPS SNP clusters the occurrence of these regions at different genomic positions in different populations becomes explainable.

Functional categories and pathway distribution of CPS SNPs

To study the functional relevance of the CPS SNPs we analysed their localization with respect to known genes. As seen in the case of most novel variants identified by the 1000 Genomes project [17], as well as what can be expected on the basis of the background distribution of SNPs, most of the CPS SNPs were found to be either intergenic or intronic (Figure 7). Despite certain minor variations, for example in FIN and JPT, the overall distribution of the CPS SNPs in different major genomic regions was observed to be similar in all the populations. Interestingly however, the number of coding non-synonymous CPS SNPs in these populations (Table 4) were found be independent of the total number of CPS SNPs in them. These coding non-synonymous CPS SNPs were found to occur in roughly equal numbers in YRI, LWK and JPT, only a single CPS SNP was detected in the IBS, and were missing in the FIN and CHINESE populations. The functional impact of these non-synonymous coding CPS SNPs was assessed using a combination of four different SNP function prediction tools (SIFT, Polyphen 2, LRT, Mutation taster) which predicted most of these SNPs to have a potential functional impact [7982]. The list of coding non-synonymous SNPs along with their predicted functional significance is summarized in Table 4.

Figure 7
figure 7

Localization of common population-specific (CPS) SNPs in genomic regions defined on the basis of gene architecture. The majority of the CPS SNPs were found to be intergenic and intronic. The category ncRNA includes various types of non-coding RNAs and the category “other” includes upstream, downstream and UTR SNPs. The expected distribution based on overall occurrence of SNPs in human genome is shown as “Background”.

Table 4 Coding non-synonymous common population-specific SNPs and potential functional impact

Eleven coding non-synonymous CPS SNPs were observed in the YRI mapping to10 different genes, 8 of them were predicted to be functional by at least one of the tools. Four of the 10 CPS SNPs containing genes were detected to have known association with a disease, includin [79, 80]g HLCS (holocarboxylase synthetase deficiency), TGM1 (congenital ichthyosis), DIAPh1 (deafness), and PAWR (which induces apoptosis in certain cancer cells). Moreover, a functional SNP was detected in TRIM5 which is a capsid-specific restriction factor involved in blocking viral replication early in the life cycle. Additionally, two coding non-synonymous SNPs were detected in the UPK3B gene which plays an important role in AUM-cytoskeleton interaction in terminally differentiated urothelial cells.

In the LWK population 12 coding non-synonymous CPS SNPs in 11 genes were observed, 5 of which are linked to disease phenotypes. These include ABCA4, linked to Stargardt disease 1, hereditary macular degeneration and retinitis pigmentosa; ATP8B1, associated with various forms of cholestasis, GHR, which is linked to Laron syndrome, resulting in growth impairment; MCCC1, involved in methylcrotonoyl-CoA carboxylase 1 deficiency, and two SNPs in NLRP12 gene, which is associated with familial cold autoinflammatory syndrome. In the JPT population 8 non-synonymous CPS SNPs, all of which were predicted to be functional, were observed in 8 genes. Some of these genes were found to be involved in melatonin activity, melanogenesis, olfaction and hair formation. Only a single non-synonymous CPS SNP was detected in the MRP35 gene in the IBS population, whereas none was found to occur in the CHINESE and the FIN populations.

Additionally, a total of 520 CPS SNPs with probable consequences for gene regulation, all from RegulomeDB category 2, which demonstrates direct evidence of a binding through ChIP-seq and DNase data with either a matched position weight matrix to the ChIP-seq factor or a DNase footprint, were identified (Additional file 4) [83]. Of the putative regulatory variants identified, the majority are intergenic (234) and intronic (224). Approximately 3 times as many upstream (24) compared to downstream (7) variants were identified, while 3′-UTR variants were approximately double the number in the 5′-UTR. The occurrence of these potential regulatory SNPs, in addition to the potentially functional coding non-synonymous CPS SNPs indicate that in spite of occurring in a single population, at least some of the CPS SNPs might play a significant functional role in some of these populations.

To identify possible functional preference in the distribution of CPS SNPs in different populations we used the Ingenuity Pathway Analysis tool (IPA) [84] and DAVID [85] to identify functional classes, metabolic pathways and regulatory networks enriched in CPS SNPs in each population. The populations CEU, GBR and TSI, were excluded from this analysis as they contain too few CPS SNPs for generating statistically and biologically meaningful results. The top 5 canonical pathways found to be overrepresented in the CPS SNPs for each population using IPA are shown in Figure 8. We also prepared an extended gene list for each population which, in addition to genes for coding and intronic SNPs included nearby genes for the intergenic SNPs. This set was created to provide a more inclusive view of the functional preference as intergenic SNPs which form large proportion of CPS SNPs, are completely excluded from the pathway analysis. The top 5 CPS SNP-enriched canonical pathways for each population derived using the extended gene set are tabulated in Additional file 5. As expected, the pathways that were found using both the approaches show a significant overlap. Interestingly, there was a very significant overlap in pathways that were detected to be enriched in CPS SNPs between different populations. We also performed an analysis for enrichment of regulatory networks in the CPS SNPs and their corresponding genes. Regulatory networks overrepresented in (a) CPS SNP containing genes and (b) extended gene list (list of all genes containing variants, as well as nearest neighbour genes for intergenic variants), for each population are summarized in Additional file 6 which also exhibited significant overlap between different populations.

Figure 8
figure 8

Ingenuity canonical pathways enriched with common population-specific (CPS) SNPs. The 5 most overrepresented pathways for each population identified using IPA are shown. NCPS denotes the number of CPS SNP containing genes in the pathway and NTOT denotes the total number of genes in the pathway. Each pathway which was found to occur in two or more populations is shown in bold and a distinct colour.

Using DAVID, we identified a number of CPS SNP-enriched disease, pathway, and gene ontology (GO) classes for each population. As observed for the pathways detected using IPA, the CPS SNP-enriched disease, pathway and GO classes identified using DAVID overlapped between the different populations (Additional file 7). Moreover, the pathways identified using DAVID in many cases supported the pathways identified using the IPA tool. One of the major functional classes/pathways, which were observed to show significant CPS SNP enrichment in most of the populations and in multiple analyses, was the axon guidance signalling or axonogenesis pathway. This observation also supports previous work where genetic variations in genes involved in axon guidance signalling have been found to show significantly high levels of population differentiation [86, 87]. Moreover, a recent study aimed at identifying loci under parallel divergence (loci that have undergone moderate allele frequency changes in multiple independent human lineages) found most parallel divergent genes to occur in this pathway [88]. This may explain our observation for CPS SNP enrichment in the corresponding genomic regions in multiple populations. It is also interesting to note that several recent studies have shown this pathway to be one of the major mutational targets in pancreatic and other cancers [8991]. It would be an interesting follow up study to probe whether evolutionary forces, like mutation rate, might contribute to the observed SNP accumulation in regions where genes for these pathways occur and whether this enrichment has any adaptive relevance. Similar overlap was observed in many other CPS SNP-enriched pathways including protein kinase A signalling and CREB Signaling in Neurons (Figure 8), which points to underlying functional similarities in the distribution of CPS SNPs in different populations.

Current functional and pathway analysis is clearly limited by the state of current knowledge about gene interactions and functions. Well studied genes and pathways tend to contain more complete, validated interaction and functional data in contrast to less studied genes and pathways are. As the information around functional gene networks and regulatory pathways increases, we can anticipate that there may be additional gene functions and networks that are identified as being differentially regulated between populations; so these results can only represent our findings with respect to the current state of knowledge

Conclusions

In this study we have highlighted some interesting observations with regard to population-specific genetic variation, using an unbiased data set generated by whole genome sequencing. Firstly, we showed that CPS SNPs are abundant but are not randomly distributed and can cluster into regions that can span up to several kilobases. Secondly we have illustrated that at least some of the CPS SNPs are likely to have a phenotypic or functional impact. Thirdly, in terms of mechanism, we were unable to detect any evidence for selection in the regions of high CPS SNP density but interestingly, these regions more often associate with regions of high recombination. The enrichment of recombination hotspots in a way also indicates that the LD in the CPS SNP-enriched region is lower than that in the average genome and rules out any possible role of LD in generating CPS enriched regions. Finally, functional enrichment analysis of the CPS SNPs and their associated genes has highlighted some interesting pathways and functions over represented in several populations. Particularly, it highlighted possible hyper mutability of genes involved in axonal guidance signalling perhaps suggesting some evolutionary plasticity in this pathway.

Avenues for future exploration have been highlighted. However, there are several caveats. Firstly, the number of individuals per population for whom we have full genome sequences is presently low (N < 100). Secondly, the definition of a population in terms of origin and admixture is at times vague and increased mobility worldwide leads to elevated levels of admixture. Moreover, the numbers of variants analysed is only a small subset (<1%) of all population-specific variants since rare variants (MAF < 0.05) have not been included. Genome sequencing of global populations is providing data which will assist in teasing out ancestral populations and will shed further light on population differentiation and adaptation. The availability of more extensive data along with an increased depth of sequencing, which permits the reliable study of rare genetic variants and structural variants, is therefore required for a better understanding of the relationship between unique genotypic variations and their geographical contexts.

Methods

Data retrieval and processing

The recent version (Phase1, version 3, October 2012) of the 1000 Genomes vcf files containing phased genotypes for 36.7 million autosomal SNPs and 1.38 Million autosomal SSVs were downloaded from 1000 Genomes Project ftp server [92]. The ancestral allele information for SNPs on the basis of multi species alignments, for all variants was also downloaded from the 1000 Genomes ftp site. The conversion of the 1000 Genomes data to PLINK format was performed using the VCF tools [93, 94]. Frequency calculations and many other data manipulation operations were performed using PLINK [94]. The admixed populations (ASW, CLM, MXL and PUR) were excluded and the Chinese populations (CHB and CHS) were merged into a single population using PLINK which we refer to as “CHINESE”. The SNPs were classified as common in a population if the MAF was observed to be greater than 0.05 in that population. SNPs with lower MAF were treated as rare.

Genomic distribution and regional enrichment analysis

Identification of enrichment of CPS SNPs in genomic regions was performed using custom Perl scripts. We used two sliding window based approaches. In the first approach, each chromosome was scanned using sliding and non-overlapping 50-SNP windows and the frequency of CPS SNPs in each window was computed. Based on the overall occurrence of CPS SNPs in the entire chromosome the cumulative hypergeometric p-value for enrichment of CPS SNPs in each window was estimated. To correct for multiple hypothesis testing we used a conservative p-value cut-off of <5 × 10 -8 for the identification of windows enriched with CPS SNPs. In the second approach we employed a similar scan using 5-kb non-overlapping windows.

Selection scan

Signatures of selection were evaluated using two different approaches. The haplotype homozygosity based iHS score was calculated using the WHAMM package [95]. As calculation of iHS requires physical positions to be specified, we downloaded the combined linkage physical map for human genome build GrCh37 from Rutgers Map [96] and incorporated the physical positions into the existing data. For each population, iHS scores for SNPs occurring in the 50-SNP windows which were found to be enriched in CPS SNPs in that population were calculated using the iHS_calc script from the WHAMM package. To estimate the background iHS distribution for each population, we randomly sampled 10,000 50-SNP blocks and calculated the iHS scores for the SNPs occurring in these blocks. Based on allele frequency bins derived from the background, the iHS scores were then standardized. As an extension of the iHS scores we also defined iHS enrichment scores (iES) scores which is the proportion of SNPs in each 50-SNP window which has |iHS| > 2. Windows showing the top 1%, 5% and 10% iES scores were respectively selected as three levels for the analysis. For each level the expected iES distribution in all CPS SNP windows of a population was estimated and compared to the actual distribution. Statistical significance of overrepresentation of iES scores in CPS SNP-enriched windows of a population was estimated using a p-value calculated by a bootstrap resampling analysis. A similar analysis was also performed for CPS SNP-enriched 5-kb windows in each population. In addition, a separate set of analyses were performed for both 50-SNP and 5-kb windows, considering only SNPs with a minimum MAF of 0.05.

The calculation of PBS was carried out following the methods proposed by Yi and colleagues [71]. For calculating PBS scores for the African populations (YRI and LWK), JPT was used as an outlier. For the Asian populations CHINESE and JPT, YRI was used as an outlier. Similarly for the European populations (FIN and IBS) YRI was used as the outlier. For each three population set (like YRI-LWK-JPT or JPT-CHB-YRI) we estimated the background distribution of the PBS scores, using 10,000, randomly selected 50-SNP windows. We then identified score cut-offs based on the top 1%, 5% and 10% of the background distribution and estimated the number of 50-SNP and 5-kb windows which can be expected to be in the top 1% ,5% and 10% PBS score range for a population. The number of observed windows in the 1%, 5% and 10% range was compared to the expected number and the corresponding p-values were estimated using a bootstrap analysis.

Recombination rate

We retrieved the deCODE recombination map and the HapMap related recombination maps (hapMapRelease24YRIRecombMap and hapMapRelease24CombinedRecombMap) using the UCSC table browser [97]. The distribution of recombination hotspots from the deCODE recombination map using a SRR (sex-standardized recombination rate) cut-off of 10 found only a few hotspots in the gene set and were not analysed further.

The HapMap YRI recombination map (hapMapRelease24YRIRecombMap) was used to identify recombination hotspots and coldspots in YRI and the combined dataset. The distribution of recombination rates was studied to select genomic regions showing the top 1% recombination rate scores and these regions were designated as recombination hotspots. We also used the top 5% recombination rate scores to select a second set of hotspots. Similarly, the two sets of coldspots likewise were defined by the lowest 1% and 5% recombination rates. Based on the genomic distribution of recombination rates in YRI (hapMapRelease24YRIRecombMap) we estimated the number of hotspot sites expected to occur in CPS SNP-enriched windows for YRI. The expected value was compared to the observed value and a cumulative hypergeometric p-value was used to estimate the statistical significance of the over and underrepresentation for recombination hotspots and coldspots in the CPS SNP-enriched 50-SNP windows and the CPS SNP enriched 5 kb-windows in YRI. Similar analyses were conducted for all other populations, individually as well as combined together, using the HapMap combined recombination map (hapMapRelease24CombinedRecombMap).

SNP function assessment

The genomic contexts of all CPS SNPs were determined using ANNOVAR [98], which was also used to annotate potentially functional non-synonymous variants based on their predicted functional impact at the protein level. ANNOVAR derives pre-computed functional impact scores for SIFT [80], POLYPHEN2 [79], LRT [82] and Mutation Taster [81]. Non-synonymous variants were considered to have a functional impact if the recommended score criteria for any one of the algorithms were met, SIFT: ≥ 0.95, POLYPHEN2: ≥ 0.85, LRT ≥ 0.5, Mutation Taster ≥ 0.50.

In order to identify non-coding CPS SNPs that may have an effect on the binding of regulatory factors, intronic variants and those flanking genes were searched against the RegulomeDB database [83], which employs a heuristic scoring system based on the confidence that the variant lies in a regulatory element and whether it has known or possible functional consequences such as alteration of Transcript Factor (TF) binding and changes in expression patterns of the associated gene(s). dbSNP [99] variants are classified into 6 categories, with category 1 having highest confidence due to associated eQTL data, and category 6 the lowest. Only CPS SNPs belonging to categories 1 and 2 were considered to be regulation-modifying, since they are the most likely to result in a functional consequence.

IPA analysis

For each population, two gene lists were generated from the CPS SNP set. The first contained only genes that included selected variants, identified by rs IDs [99]. The second contained all genes that contained the identified SNPs, as well as nearest neighbour genes for the SNPs that were intergenic. By definition, the second list contained more genes than the first. Ingenuity Pathway Analysis (IPA) software was used to analyse gene interaction networks in the gene lists, as well as enriched ‘canonical’ pathways describing well characterised and validated regulatory pathways [84].

DAVID analysis

The Database for Annotation, Visualisation and Integrated Discovery (DAVID) [85] is an online tool that accepts a list of genes as input and performs functional analysis on them. It provides a list of functions enriched in the gene list, and clusters these functions according to their similarity. Functions include gene ontology (GO) and Swiss-Prot annotation, InterPro matches, OMIM [100] and other disease links, as well as KEGG [101, 102] and other pathway database links. The gene-enrichment analysis is based on the Fisher’s Exact test, which determines whether or not a given list of genes is enriched for a certain function label or if this function occurs in the list by chance. A p-value shows the significance and adjusted p-values are also provided, after correction for multiple testing. The gene lists for each population that contained CPS SNPs were run through DAVID to identify overrepresented pathways and other functional labels.

Function and disease association of CPS-SNP containing genes

Potential functions of CPS-SNP containing genes and their role in various diseases were inferred from the GeneCards database [103].