The study of human Y chromosome variation through ancient DNA
- Cite this article as:
- Kivisild, T. Hum Genet (2017) 136: 529. doi:10.1007/s00439-017-1773-z
High throughput sequencing methods have completely transformed the study of human Y chromosome variation by offering a genome-scale view on genetic variation retrieved from ancient human remains in the context of a growing number of high coverage whole Y chromosome sequence data from living populations from across the world. The ancient Y chromosome sequences are providing us with the first exciting glimpses into the past variation of the male-specific compartment of the genome and the opportunity to evaluate models based on previously made inferences from patterns of genetic variation in living populations. Analyses of the ancient Y chromosome sequences are challenging not only because of issues generally related to ancient DNA work, such as DNA damage-induced mutations and low content of endogenous DNA in most human remains, but also because of specific properties of the Y chromosome, such as its highly repetitive nature and high homology with the X chromosome. Shotgun sequencing of uniquely mapping regions of the Y chromosomes to sufficiently high coverage is still challenging and costly in poorly preserved samples. To increase the coverage of specific target SNPs, capture-based methods have been developed and used in recent years to generate Y chromosome sequence data from hundreds of prehistoric skeletal remains. Besides the prospect of directly testing how much genetic change in a given time period has accompanied changes in material culture, the sequencing of ancient Y chromosomes also allows us to better understand the rate at which mutations accumulate and get fixed over time. This review considers genome-scale evidence on ancient Y chromosome diversity that has recently started to accumulate in geographic areas favourable to DNA preservation. More specifically, the review focuses on examples of regional continuity and change of the Y chromosome haplogroups in North Eurasia and in the New World.
Until recently, ancient DNA studies of human remains focused primarily on variation embedded in mitochondrial DNA (mtDNA). For decades, mtDNA had been a target of choice in population genetic studies because of its high mutation rate and high density of polymorphic markers (Wilson et al. 1985). Also, because for any unique sequence in an autosomal, X or Y chromosome locus there are many hundreds or even thousands of copies of mtDNA, this maternally inherited locus was more likely to work in cases where only a very small number of molecules had survived (Krings et al. 1997). High throughput sequencing (HTS) technologies have significantly increased the rate at which sequence data can be generated and have made accessible surviving chunks of DNA that are shorter than the size of a PCR (polymerase chain reaction) amplicon (Orlando et al. 2015). They have reduced the costs of sequencing and opened the prospects of assessing the variation of human populations, both modern and ancient, at the scale of entire genomes (Genomes Project et al. 2012; Gilbert et al. 2008; Green et al. 2010; Rasmussen et al. 2010). These technological advances (Orlando et al. 2015) have also made it possible to study ancient Y chromosome (aY) variation in human populations at the scale of the entire accessible length of the male-specific and non-recombining regions of the human Y chromosome.
Ancient DNA presents us the opportunity to directly examine which Y chromosome single nucleotide polymorphisms (SNPs) and haplotypes were present at different time periods in regions that support long-term survival of ancient DNA. It is perhaps not surprising that archaeological sites from high latitude areas (Hofreiter et al. 2015) have yielded the largest number of successfully sequenced samples in recent genome-scale studies that have reported on aY. These studies, as reviewed in further detail below, have provided us the first glimpses of the dynamics of aY haplogroup composition and frequency changes in transects of time and allow us to test hypotheses based on earlier phylogeographic inferences made from the Y chromosome data of presently living populations. In this review, ‘haplogroups’ and ‘clades’ are terms that are interchangeably used to refer to groups of closely-related Y chromosome sequences that share a common ancestor. While there is a general agreement in the definition of the basic haplogroups (A, B, C, etc.) multiple parallel nomenclatures are in use for sub-clades (e.g. see http://www.phylotree.org/Y/, http://isogg.org/tree/, https://www.yfull.com/tree/). For clarity, the names of sub-clades used here are suffixed by the defining SNP-marker name, as suggested previously (Karmin et al. 2015; van Oven et al. 2014). Similarly to the extent to which our earlier views on peopling of continental regions such as Europe based on inferences made from extant mtDNA variation have changed in the light of new ancient mtDNA evidence (Bollongino et al. 2013; Bramanti et al. 2009; Brandt et al. 2013; Haak et al. 2015; Posth et al. 2016; Thomas et al. 2013), it can be seen that models linking the spread of specific Y chromosome haplogroups with the spread of material culture may require substantial revision in the light of new aY evidence.
In this review, methodological aspects of ancient Y chromosome work will be first discussed with a focus on recent HTS research based on shotgun and capture approaches. Challenges common to all ancient DNA studies include those related to calling human polymorphic variants from short and damaged sequence reads that are derived from a mix of different organisms. The highly repetitive nature of the Y chromosome combined with its high sequence homology with the X chromosome (Skaletsky et al. 2003) further complicates variant calling from short ancient DNA sequence reads. These issues, together with the paternal inheritance of the Y chromosome, its haploid nature and high linkage between physically distant SNPs, affect the choice of bioinformatics methods for downstream data analysis. They define the restrictions on how the aY sequence data should be processed and the limitations on the interpretations that can be made from such data. Finally, the demographic histories of European and North American populations will be briefly reviewed in the light of recently emerged Y chromosome evidence from ancient DNA studies.
Methodological issues in dealing with aY
A number of Y chromosome sequences covered in this review have been generated with a shotgun sequencing approach. These include the four oldest genomes of individuals dated to the late Pleistocene as well as a number of remains from Europe and the Americas dated to the Holocene period. A substantial proportion of the European and Middle Eastern Neolithic and Bronze Age remains have been sequenced, however, with a hybridization-based capture technique. Capture-based methods focus on specific targets in the genome with the aim of raising their relative coverage in the resulting data. The Human Origins (HO) genome-wide SNP array (Patterson et al. 2012) has been used to genotype more than 600 K SNPs in more than two thousand living individuals from 203 populations (Lazaridis et al. 2014). A subset of 390 K SNPs from this array, including 2258 Y chromosome markers, was targeted later (Haak et al. 2015) in a hybridization-based capture design to assess variation in 69 Europeans living 8–3 thousand years ago. The extended panel of the HO hybridization design now targets 1240 K SNPs genome-wide including 32,681 from the Y chromosome (Mathieson et al. 2015). This approach has been used to assess at a high molecular resolution the phylogenetic affiliation of 110 ancient Y chromosomes from Europe and the Near East (Lazaridis et al. 2016).
Because most personal genomes that have been annotated by ISOGG come from individuals of European or North American descent, the capture enriched for ISOGG SNPs is best suited for the study of European Y chromosome diversity while being less efficient for the study of other regions. But also in Europe it should be noted that clades that have become infrequent or extinct over time due to extensive admixture or population replacement would have less chance to be recognised with the SNP-targeting capture approach. If the parental clade of an extinct branch has extant sister-clades that have been characterised in the annotated databases, the aY sequence from that extinct clade will be characterised at the level of the parental clade (Fig. 2). Because the most common Y chromosome haplogroups in Europe are represented in the HO-1240K SNP array by many phylogenetically equivalent SNPs, they can be robustly recognised and called even in samples with a large amount of missing data. The haplogroups of other continental regions are less completely characterised in SNP arrays and therefore the shotgun approach may be preferable for the ancient DNA (aDNA) study of Y chromosome variation in those areas. As the accessible regions of the Y chromosome that are commonly used in genetic studies are haploid and do not undergo recombination, imputation is more efficient in Y chromosome branches represented by multiple equivalent SNPs than it is in autosomal loci, and therefore shotgun sequence data with fairly low coverage can be used for unbiased assessment of phylogenetic affinities of aY lineages. However, when using imputation, particularly in the more terminal branches of the tree, one should be cautioned that some of the ancient samples may in fact derive from clades that share only some, and not all, of the SNPs defining the extant branch of the Y tree.
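The logic of placing a low coverage sample on a tree whose branches are defined by several equivalent SNPs can be illustrated with a minimal Python sketch. The toy tree, the marker names (s1 to s8) and the functions branch_support and place_sample are all hypothetical and introduced here only for illustration; they do not reproduce the pipeline of any study cited in this review. The sketch descends the tree only while exactly one child branch has derived calls and no conflicting ancestral calls, so that a sample carrying only part of the SNPs of a branch is reported at the level of the parental clade.

```python
# Hypothetical toy tree: each branch records its parent and the equivalent
# SNPs that define it. Marker names are placeholders, not real Y markers.
TREE = {
    "root": {"parent": None, "snps": []},
    "A": {"parent": "root", "snps": ["s1", "s2", "s3"]},
    "A1": {"parent": "A", "snps": ["s4", "s5"]},
    "A2": {"parent": "A", "snps": ["s6", "s7", "s8"]},
}


def branch_support(calls):
    """Return {branch: (n_derived, n_ancestral)}; uncalled SNPs are ignored."""
    support = {}
    for branch, node in TREE.items():
        derived = sum(1 for s in node["snps"] if calls.get(s) == "derived")
        ancestral = sum(1 for s in node["snps"] if calls.get(s) == "ancestral")
        support[branch] = (derived, ancestral)
    return support


def place_sample(calls):
    """Descend from the root while exactly one child branch is supported.

    A child is 'supported' if at least one of its SNPs is called derived and
    none is called ancestral. A branch with mixed calls stops the descent,
    leaving the sample at the parental clade.
    """
    support = branch_support(calls)
    current = "root"
    while True:
        children = [b for b, n in TREE.items() if n["parent"] == current]
        supported = [c for c in children
                     if support[c][0] > 0 and support[c][1] == 0]
        if len(supported) != 1:
            return current
        current = supported[0]


# One derived call out of three equivalent SNPs is enough to reach branch A;
# ancestral calls on both daughter branches keep the sample at A.
print(place_sample({"s2": "derived", "s4": "ancestral", "s6": "ancestral"}))
# Mixed calls on A2 (derived at s6, ancestral at s7) also stop at A.
print(place_sample({"s1": "derived", "s6": "derived", "s7": "ancestral"}))
```

In this sketch missing data are tolerated because any one of the equivalent SNPs on a branch can carry the assignment, which is the property that makes well-characterised haplogroups robust to low coverage.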
Another complicating factor for determining Y chromosome haplogroups from ancient DNA data is that it can be challenging to distinguish true mutations from those induced by damage, particularly in case of C to T and G to A substitutions (Gilbert et al. 2003; Hofreiter et al. 2001). A commonly used strategy for dealing with post-mortem damage in ancient DNA studies is the removal of transitions which are the main targets of deamination-induced miscalls via uracil DNA glycosylase treatment or data filtering (Orlando et al. 2015). As transitions occur naturally more frequently than transversions their total removal can lead, however, to dramatic losses of data and thereby loss of phylogenetic resolution, particularly in cases where the sub-clade-defining branches are short. Therefore, aY sequences with low coverage (<0.1×) may not contain enough informative transversions to allow for robust haplogroup assignment at a resolution that would be useful for testing cases of genetic continuity or admixture. Capture approach may yield better coverage at targeted sites but it should be also cautioned that some SNP-targeting capture designs can be enriched for transitions due to the removal of strand ambiguous G<->C and A<->T transversions from the design of SNP-chips.
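The substitution classes discussed above can be made explicit with a short Python sketch. The function names (classify, is_strand_ambiguous, keep_transversions) and the example markers m1 to m4 are assumptions made for illustration only. The filter shown is the simplest data-filtering variant, in which every transition site is dropped; it is not meant to represent the exact procedure of any particular study.

```python
# Substitution classes as (allele1, allele2) pairs.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
# G<->C and A<->T pairs look identical on both strands and are commonly
# excluded from SNP-chip designs.
STRAND_AMBIGUOUS = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}
BASES = {"A", "C", "G", "T"}


def classify(allele1, allele2):
    """Return 'transition' or 'transversion' for a biallelic SNP."""
    pair = (allele1.upper(), allele2.upper())
    if pair[0] not in BASES or pair[1] not in BASES or pair[0] == pair[1]:
        raise ValueError(f"not a valid biallelic SNP: {allele1}/{allele2}")
    return "transition" if pair in TRANSITIONS else "transversion"


def is_strand_ambiguous(allele1, allele2):
    """True for G<->C and A<->T transversions."""
    return (allele1.upper(), allele2.upper()) in STRAND_AMBIGUOUS


def keep_transversions(snps):
    """Drop all transition sites, since any of them can be affected by
    deamination-induced C to T or G to A miscalls."""
    return [s for s in snps if classify(s[1], s[2]) == "transversion"]


# Hypothetical markers: (name, allele1, allele2)
snps = [("m1", "C", "T"), ("m2", "A", "C"), ("m3", "G", "C"), ("m4", "G", "A")]
kept = keep_transversions(snps)
print([name for name, _, _ in kept])
print([name for name, a, b in kept if not is_strand_ambiguous(a, b)])
```

Of the four toy markers only two survive the transversion filter, and only one of these would remain in a design that also excludes strand-ambiguous sites, which illustrates how quickly phylogenetic resolution can be lost on short branches.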
High coverage Y chromosome sequences, whether from modern or ancient samples, can be used to draw trees with informative branch lengths. With low coverage data generated from ancient human remains, it is typically not possible to make a clear distinction between damage-affected sites and true mutations (Fig. 1). This ambiguity makes it either highly problematic or impossible for the user to determine the lengths of the private branches of the ancient samples. Capture-based methods that are designed to target sequences surrounding a restricted number of known SNPs have a number of advantages, as reviewed earlier (Pickrell and Reich 2014), while at the same time being limited to detecting only variants that are included in the capture design. This means that these methods allow neither the discovery of new variants nor the determination of the length of the private branches of the Y chromosome tree. Capture designs targeting large regions of DNA that uniquely map to the Y chromosome, such as the BigY design of FamilyTree (https://www.familytreedna.com/learn/wp-content/uploads/2014/08/BIG_Y_WhitePager.pdf) which has 67,000 probes that enable the generation of 4.36 Mbp sequence data from “globally covered” regions of the Y chromosome, or the smaller scale capture design of 500 kb non-recombining regions of Lippold et al. (2014), have been successfully applied to modern samples. These capture approaches are able to detect new, previously uncharacterized variants within the captured regions but they are yet to be shown to work efficiently on ancient DNA. A hybridization enrichment approach targeting chromosome 21 has been successfully applied to the ~40 thousand year-old (KYA) human remains from Tianyuan Cave outside Beijing, China (Fu et al. 2013), suggesting that similar approaches can be applied in the future also to accessible (non-ampliconic) regions of the Y chromosome.
Although HTS methods enable us to generate sequence data from short ancient DNA molecules, downstream analysis of extremely short Y chromosome mapping reads is limited by the complexity of the Y chromosome, which includes a large number of repeats and regions that map to other chromosomal regions. This means that female individuals may yield calls for Y chromosome SNPs surrounded by sequence highly homologous to the X chromosome. Inclusion of SNPs that map outside unique single-copy regions of the Y chromosome confounds phylogenetic mapping of ancient DNA samples and may also bias the analyses based on branch lengths and quantitative estimates of sequence divergence between samples. The approach most typically used in HTS studies of Y chromosome diversity is to filter out SNPs from high homology regions, typically converging on the X-degenerate regions of a total size <10 Mbp (Francalacci et al. 2015; Karmin et al. 2015; Poznik et al. 2013; Wei et al. 2013). Several attempts to infer the Y chromosome mutation rate from these uniquely mapping regions using ancient DNA data have been made, as reviewed in the paper by Oleg Balanovsky in this volume.
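Restricting the analysis to uniquely mapping regions amounts to an interval filter on SNP positions, which can be sketched in a few lines of Python. The coordinates in UNIQUE_REGIONS are made-up placeholders and not real Y chromosome coordinates, and the function names are hypothetical; in practice the interval list would come from the mask published with the respective study. Intervals are assumed to be sorted, non-overlapping and half-open, as in the BED convention.

```python
from bisect import bisect_right

# Hypothetical accessible regions as half-open [start, end) intervals,
# sorted and non-overlapping. These are placeholder coordinates.
UNIQUE_REGIONS = [(100, 200), (500, 900), (1500, 1600)]


def in_unique_region(pos, regions=UNIQUE_REGIONS):
    """Return True if pos falls inside one of the accepted intervals."""
    starts = [start for start, _ in regions]
    # Index of the last interval whose start is <= pos (or -1 if none).
    i = bisect_right(starts, pos) - 1
    return i >= 0 and regions[i][0] <= pos < regions[i][1]


def filter_snps(positions, regions=UNIQUE_REGIONS):
    """Keep only SNP positions located in uniquely mapping regions."""
    return [p for p in positions if in_unique_region(p, regions)]


print(filter_snps([50, 100, 199, 200, 700, 1599, 1600]))
```

Applying the same mask to all samples, ancient and modern, keeps branch length comparisons on an equal footing, because every sample is then evaluated over the same set of callable sites.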
Genetic history of Eurasian Y chromosomes
Two of the oldest human Y chromosomes sequenced so far, Ust’Ishim Man (Fu et al. 2014) and the Oase Man (Fu et al. 2015), are both placed near the root of haplogroup K, a sub-clade of F, which is globally the most frequent Y chromosome lineage alive today. K is an ancestral group that unites a number of regional haplogroups that are found widely spread today in Europe, East Asia, Oceania, and the Americas. Notably, however, the Y chromosomes of these two ancient Eurasian colonists are not exactly equidistant to all living descendants of haplogroup K that are found today in the world. Both Ust’Ishim and Oase men share the derived allele of a marker M2308 (Poznik et al. 2016) which defines the basal root of two common haplogroups N and O (Fig. 3). These M2308-derived haplogroups are widely spread today in Eurasia, extending from Finno-Ugric populations of Northeast Europe to Tibeto-Burman, Austroasiatic and Austronesian speaking groups of South and East Asia (Ilumae et al. 2016; Poznik et al. 2016). Although the Ust’Ishim and Oase men were separated from each other by about 5 thousand years in time and about 5 thousand kilometres in space, these two sampling points we have from the earliest period of peopling of Eurasia suggest some level of continuity in certain Y chromosome lineages that exist in present-day populations of Eurasia. In contrast, the analyses of autosomal genomes have shown that both Ust’Ishim and Oase men were equidistant to East and West Eurasian populations and therefore unlikely to have contributed substantially to later humans in the geographic regions where they were living (Fu et al. 2014, 2015). The phylogenetic affiliation of these early Eurasian men with an NO-related clade is notable because phylogeographic inferences made from present day Y chromosome variation have highlighted other haplogroups, I and R1-M173, as signatures of the genetic legacy of Palaeolithic humans in West Eurasia (Semino et al. 2000) while the origin of haplogroups N and O (Rootsi et al. 2007), as well as other sub-clades of haplogroup K (Karafet et al. 2015), was deemed most likely to be, on the grounds of the highest genetic diversity, in Southeast Asia.
In contrast to haplogroup G, the geographic distribution of haplogroup H is presently almost entirely restricted to South Asia, while one of its sub-clades, H4-L285 (Fig. 5), can be detected as an extremely rare lineage in some European populations. H4-L285 has also been found in the aY sequences of the Anatolian and Levantine farmers as well as in Iberian Chalcolithic samples (Gunther et al. 2015; Lazaridis et al. 2016). Overall, the comparisons of early and middle Holocene versus present-day distributions of haplogroup G and H suggest that as characteristic markers of the early farmer populations of Middle East they were introduced to Europe by the expanding Anatolian farming populations. Their frequency has remained high in some geographically isolated areas such as the Caucasus, Sardinia, Corsica, whereas their frequency in main parts of Europe dropped later due to inflow of other Y chromosome lineages.
Haplogroup J, which, on the basis of its present-day clinal frequency distribution, has been associated with the early spread of farming to Europe (Rosser et al. 2000; Semino et al. 2000), has not been detected so far in a European Neolithic context. Instead, this haplogroup is found in hunter-gatherers from geographically distant areas, from the Caucasus and Karelia (Fig. 6), as well as in two early farmers from Iran and one from Anatolia (Jones et al. 2015; Lazaridis et al. 2016; Mathieson et al. 2015). In Central and Western Europe, J lineages start to emerge in the Bronze Age, likely being part of the demographic processes and population movements initiated from the North Caucasus area during that period. It is possible that these recent processes also introduced to Europe sub-clades of haplogroup E (Cinnioglu et al. 2004; Cruciani et al. 2007; Trombetta et al. 2015) which according to recent ancient DNA evidence was a characteristic haplogroup of the Natufians, pre-pottery farmers of the Levant and Ethiopians (Gallego Llorente et al. 2015; Lazaridis et al. 2016).
In contrast to the preceding Early and Middle Neolithic sections of time, a large proportion of the Y chromosomes recovered from Bronze Age remains of Central Europe, Northern Caucasus and the Steppe belt of Russia belong to a couple of sub-clades of haplogroups R1a-M420 and R1b-M343 (Fig. 7). Late Neolithic, Early Bronze Age and Iron Age samples from Central and Western Europe typically have the R1b-L11, R1a1-Z283 and R1a-M417 (xZ645) affiliation while the samples from the Yamnaya and Samara neighbourhood are different and belong to sub-clades R1b11-Z2105 and R1a2-Z93 (Allentoft et al. 2015; Cassidy et al. 2016; Haak et al. 2015; Mathieson et al. 2015; Schiffels et al. 2016). The R1b11-Z2105 lineage is today common in the Caucasus and Volga-Uralic region while being virtually absent in Central and Western Europe (Broushaki et al. 2016). Interestingly, the earliest offshoot of extant haplogroup R1b-M343 variation, the V88 sub-clade, which is currently most common in Fulani speaking populations in Africa (Cruciani et al. 2010), has distant relatives in Early Neolithic samples from across a wide geographic area, from Iberia and Germany to Samara (Fig. 7). In a similar way, early offshoots of the R1b and R1a phylogenies, including R1b lineages derived at P297 and ancestral at M269, and R1a lineages which are derived at M459 while ancestral at the M198 and M417 markers, have been found in mid-Holocene hunter-gatherer samples in a wide area in Eastern Europe, from Karelia, Latvia and the Samara region (Haak et al. 2015; Jones et al. 2017; Mathieson et al. 2015). Extremely rare extant sub-clades of R1a, such as R1a4-YP5061, R1a5-YP1272, and R1a6-YP4141 (Fig. 7), may bear witness to a long-term continuity of such old genetic lineages while the majority of present-day R1a and R1b lineages in West Eurasia derives from just a handful of Late Neolithic/Early Bronze Age male founders.
Ancient Y chromosomes of the Native Americans
The present-day pool of Native American Y chromosomes is a mixture of haplogroups that derive from pre-Columbian dispersals from Siberia and more recent gene flow from Europe and Africa (Grugni et al. 2015; Kimura et al. 2016; Roewer et al. 2013; Zegura et al. 2004). The diversity derived from the first dispersals is restricted to just two founding lineages within haplogroup Q and one or two in haplogroup C3-M217 (Fig. 3). The lineages specific to Native Americans within these two ancient haplogroups Q and C that have also branches that are commonly found in different parts of Eurasia have been suggested to have reached America by multiple independent dispersal events from Siberia (Lell et al. 2002). Lell et al. considered haplogroup Q, which is the most common clade in both North and South American Native populations, to derive from a migration from ‘Middle’ Siberia where it is highly frequent today. C3-M217, which in Americas is restricted, as a minor haplogroup, to a small number of populations, is the most common Y chromosome haplogroup in Northeast Siberians. Lell et al. (Lell et al. 2002) argued that the different spread patterns of these two haplogroups in both America and in Siberia are the outcome of dual origins of Native Americans implying two early dispersal events with two distinct source populations. Similar levels of STR diversities, however, that have been observed in Native American C3-P39 and Q1a-M3 lineages have been interpreted in favour of both of these haplogroups being part of the initial pool of the first expansion of Native Americans some 10–17 thousand years ago and thus can be viewed as being in line with what is called the single wave model (Zegura et al. 2004). The finding of a rare group of lineages of C3-M217 that share the ancestral allele for the P39 marker in Ecuador has recently reignited the debate about dual origins of Native American haplogroup C and Q lineages (Roewer et al. 2013). 
The C3-M217 Y chromosomes without the P39 marker were found in association with STR diversity coalescing to 6000 years and Roewer et al. (2013) suggested that their presence could be explained by a secondary wave from East Asia, and possibly, more specifically from Japan, considering that certain similarities in material culture exist between the areas where the C3-M217 lineages are found. However, analyses of autosomal SNPs have not supported the model of additional gene flow from East Asia to Ecuador suggesting that instead the presence of the rare C3-M217 lineages without the P39 marker represents a case of a rare founding lineage that has been lost elsewhere by drift (Mezzavilla et al. 2015).
Ancient DNA studies have cast new light on the debate of Native American origins and shown that the loss of rare lineages in post-contact Native Americans is not unusual and possibly part of the extensive lineage extinction process that has been observed in mitochondrial lineages (Llamas et al. 2016). The analyses of the genome of 24 KYA human remains, recovered from the Mal’ta site near Lake Baikal and shotgun-sequenced to an average depth of 1× (Raghavan et al. 2014b), have shown that Native Americans do have ‘dual ancestry’ but not in the sense of the dual ancestry model of Lell et al. (2002). The autosomal genome of the Mal’ta Boy has close affinity to modern European and Native American populations while being more distant to East Asians. Considering that overall Native Americans are more closely related to East Asians than to Europeans, this suggests that the Native Americans derive approximately one-third of their genetic ancestry from a population to which the Mal’ta Boy was related, while two-thirds of their ancestry derives from a different source, closely related to modern East Asian populations. The Y chromosome of the Mal’ta Boy (Fig. 3) is more closely affiliated to West Eurasian R lineages than to East Asian D, C or O lineages. It represents an extinct lineage that derives from the base of haplogroup R closely after the split of the ancestors of haplogroups Q and R. Because most Native American Y chromosomes belong to haplogroup Q, they are more closely related to European Y chromosomes while their maternal lineages (with the exception of rare haplogroup X2a) are all nested within East Asian variation. It should be noted, though, that since both mtDNA and Y are effectively single loci, no firm conclusions about the sex-specific nature of the admixture process that lies at the foundation of Native American ancestry can be made from these observations.
The third ancient Y chromosome sequence from the Americas, or in fact, technically, from Greenland, comes from the Saqqaq site and is dated to 4 KYA. The Saqqaq Man’s mtDNA (Gilbert et al. 2008) and his whole genome, shotgun sequenced to an average depth of 20× (Rasmussen et al. 2010), provided the first direct evidence of a separate Palaeo-Eskimo dispersal event into Arctic North America. The Saqqaq Man’s Y chromosome belongs to Q2b-B143, which is a sub-clade of haplogroup Q that is only distantly related, at time depth >25KYA, to the Q1a-M3 and Q1b-M971 lineages (Fig. 8). The Saqqaq’s Q2b-B143 lineage was not found to be present in a survey of 1863 haplogroup Q lineages from South America, making it unlikely to have been among the initial founding pool of Beringian Y chromosomes (Jota et al. 2016). From further ancient DNA analyses, we know that the Palaeo-Eskimos had extremely low genetic diversity, with only a single characteristic mtDNA lineage of haplogroup D2a1 being found in a wide range of sites from Northeast Canada and Greenland dated between 5000 and 700 years before present (Raghavan et al. 2014a). Analyses of autosomal genes and mtDNA of more recent remains associated with the Thule Culture suggest that the Palaeo-Eskimo population was completely replaced by the Neo-Eskimos within the last thousand years. Interestingly, though, the Saqqaq Man’s Y chromosome lineage may have some continuity in the present day descendants of the Neo-Eskimo dispersal. Greenland Inuits have been shown to carry, at frequencies up to 54% in East Sermersooq, a lineage which is characterised by the NWT01 mutation (Olofsson et al. 2015), a SNP which separates the Y chromosomes of Inuits, who have it, from Athabascans, who do not (Dulik et al. 2012), and which is equivalent to the F1202 SNP that defines the clade that unites the Q2b-B143 and Q2c-M120 sub-clades (Fig. 8). 
The Y chromosome of the Saqqaq Man has been shown to share a number of SNPs equivalent to B143 with a group of Koryaks from Northeast Siberia (Karmin et al. 2015). It is yet to be revealed, however, whether the Neo-Eskimo Y chromosomes are derived at the SNPs defining the Q2b-B143 branch or whether they represent a yet another sub-clade of Q2-F1202. Further analyses of aY variation across the Americas would help to broaden our understanding of the past dynamics of the male effective population size and show to what extent the lineages present in the past have survived up to the present.
In sum, a number of aDNA studies have already started to reveal the potential of human Y chromosome to inform us about the demographic past complementing the study of autosomal genome and the X chromosome. When compared against the results derived from the analyses of autosomes the unique inheritance patterns of Y, as well as the X chromosome and mtDNA will provide us the opportunity to explore sex-specific dispersal and admixture processes in the future when further sampling will provide larger sample sizes and better coverage of the same geographic areas in time. Further breakthroughs of ancient DNA success in regions like Africa, Southeast Asia and Oceania will be most desirable to tackle broader range of questions about the continuity and nature of sex-specific dispersals and admixture in human evolutionary history. Recent cases of successful retrieval of ancient DNA from Ethiopia (Gallego Llorente et al. 2015), Vanuatu and Tonga (Skoglund et al. 2016) provide some optimism for the retrieval of aY sequence data from warmer climate regions in the future.
I would like to thank Freddi Scheib, Luca Pagani and Eppie Jones for their help in accessing Y chromosome data and two anonymous reviewers for their invaluable comments.
Compliance with ethical standards
Conflict of interest
I declare no conflict of interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.