A general framework for moment-based analysis of genetic data
In population genetics, the Dirichlet (also called the Balding–Nichols) model has for 20 years been considered the key model for approximating the distribution of allele fractions within populations in a multi-allelic setting. It has often been noted that the Dirichlet assumption is only approximate, because the Dirichlet model cannot accommodate positive correlations among alleles. However, the validity of the Dirichlet distribution has never been systematically investigated in a general framework. This paper addresses this problem by providing a general overview of how allele fraction data should be modeled under the most common multi-allelic mutational structures. The Dirichlet and alternative models are investigated by simulating allele fractions from a diffusion approximation of the multi-allelic Wright–Fisher process with mutation and applying a moment-based analysis method. The study shows that the optimal modeling strategy for the distribution of allele fractions depends on the specific mutation process. The Dirichlet model is an exceptionally good approximation only for the pure drift, Jukes–Cantor and parent-independent mutation processes with small mutation rates. Alternative models are required, and proposed, for the other mutation processes: a Beta–Dirichlet model for the infinite alleles mutation process, and a Hierarchical Beta model for the Kimura, Hasegawa–Kishino–Yano and Tamura–Nei processes. Finally, a novel Hierarchical Beta approximation, the Pyramidal Hierarchical Beta model, is developed for the generalized time-reversible and single-step mutation processes.
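The remark that the Dirichlet model cannot accommodate positive correlations follows directly from its second moments: for parameters α with α₀ = Σαᵢ and means mᵢ = αᵢ/α₀, the covariance between two distinct allele fractions is −mᵢmⱼ/(α₀ + 1), which is always negative. The Python sketch below illustrates this and shows a simple moment-matching fit. It is a generic illustration and not the moment-based method of the paper; the function names `dirichlet_covariance` and `fit_dirichlet_by_moments`, the example parameters, and the pooled variance estimator are all assumptions made for the example.

```python
import numpy as np

def dirichlet_covariance(alpha):
    """Analytic covariance matrix of a Dirichlet(alpha) vector."""
    alpha = np.asarray(alpha, dtype=float)
    a0 = alpha.sum()
    m = alpha / a0  # mean allele fractions
    # Cov(X) = (diag(m) - m m^T) / (a0 + 1); off-diagonals are -m_i m_j / (a0 + 1)
    return (np.diag(m) - np.outer(m, m)) / (a0 + 1.0)

def fit_dirichlet_by_moments(samples):
    """Moment-matching estimate of alpha from rows of allele fractions.

    Uses Var(X_i) = m_i (1 - m_i) / (a0 + 1), pooled over alleles
    (one of several possible pooling choices; an assumption of this sketch).
    """
    x = np.asarray(samples, dtype=float)
    m = x.mean(axis=0)
    v = x.var(axis=0)
    a0 = np.sum(m * (1.0 - m)) / np.sum(v) - 1.0
    return a0 * m

# Hypothetical four-allele example (e.g. A, C, G, T); values are illustrative only.
rng = np.random.default_rng(0)
true_alpha = np.array([2.0, 3.0, 5.0, 1.0])
draws = rng.dirichlet(true_alpha, size=50_000)

cov = dirichlet_covariance(true_alpha)
off_diag = cov[~np.eye(len(true_alpha), dtype=bool)]  # all pairwise covariances
alpha_hat = fit_dirichlet_by_moments(draws)
```

Every entry of `off_diag` is negative regardless of the choice of `true_alpha`, so simulated allele fractions that show a positive empirical covariance between two alleles cannot be matched by any Dirichlet fit. In the Balding–Nichols parametrization, α₀ = (1 − F)/F, so the same variance identity reads Var(Xᵢ) = F mᵢ(1 − mᵢ).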
Keywords: Allele fraction; Beta–Dirichlet; Diffusion; Dirichlet; Distribution of allele fractions; Evolutionary history; Hierarchical Beta; Moments; Multi-allelic Wright–Fisher; Mutation processes; Pyramid
Mathematics Subject Classification: 60J25, 60J60, 62E17, 62M05, 92D25
We are grateful to the associate editor and two anonymous reviewers for helpful comments and suggestions. This work is funded through a grant from the Danish Research Council (DFF 4002-00382) awarded to Asger Hobolth.
- Felsenstein J (2004) Inferring phylogenies. Sinauer Associates, Sunderland
- Jukes TH, Cantor CR (1969) Evolution of protein molecules. Academic Press, New York, pp 21–132
- Ongora A, Migliorati S, Monti GS (2008) A new distribution on the simplex containing the Dirichlet family. In: Proceedings of the 3rd compositional data analysis workshop, 27–30 May. University of Girona
- Swofford DL, Olsen GJ, Waddell PJ, Hillis DM (1996) Phylogenetic inference. Sinauer Associates, Sunderland
- Tamura K, Nei M (1993) Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol 10(3):512–526
- Tataru P, Simonsen M, Bataillon T, Hobolth A (2016) Statistical inference in the Wright–Fisher model using allele frequency data. Syst Biol 66:e30–e46