Chapter Summary

Neutral evolution is the default process of genome change. This is because our world is finite, and randomness is important when we consider the history of a finite world. The random nature of DNA propagation is discussed using the branching process, coalescent process, Markov process, and diffusion process. Expected evolutionary patterns under neutrality are then discussed in terms of fixation probability, rate of evolution, and amount of DNA variation kept in a population. We then discuss various features of neutral evolution, starting from evolutionary rates, synonymous and nonsynonymous substitutions, junk DNA, and pseudogenes.

1 Neutral Evolution as Default Process of the Genome Changes

It is now established that the majority of mutations fixed during evolution are selectively neutral, as amply demonstrated by Kimura (1983; [1]) and by Nei (1987; [2]). Reports of genome sequencing projects in the twenty-first century routinely mention neutral evolution, e.g., the mouse genome paper in 2002 ([3]) and the chicken genome paper in 2004 ([4]). We thus discuss neutral evolution as one of the basic processes of genome evolution in this chapter.

Neutral evolution is characterized by the egalitarian nature of the propagation of selectively neutral mutants. For example, let us consider a bacterial plaque that is clonally formed. All cells in one plaque are homogeneous, that is, they have identical genome sequences, if there are no mutations during the formation of that plaque. Because the genome sequences are identical, the genetic structure of this plaque does not change. Let us assume that three cells at time 0 in Fig. 3.2 are in this clonal plaque. Their descendant cells at time 4 also have the same genome sequences, though the number of offspring cells at that time varies from 0 to 4. This variation is attributed to nongenetic factors, such as heterogeneous distribution of nutrients. However, the most significant and fundamental factor is randomness, as we will see in Sect. 4.2.2 on the branching process.

Mutation is the ultimate source of diversity of organisms. If a mutation occurring in some gene modifies gene function, there is a possibility of heterogeneity in terms of number of offspring. This is the start of natural selection, which will be discussed in Chap. 5. However, some mutations may not change gene function, and although they are somewhat different from parental type DNA sequences, mutants and parental or wild types are equal in terms of offspring propagation. Here we meet the egalitarian characteristic of selectively neutral mutants. If all members of evolutionary units, such as DNA molecules, cells, individuals, or populations, are equal, the frequency change of these types is dominated by random events. It is therefore logical that randomness is the most important factor in neutral evolution.

1.1 Our World Is Finite

Randomness also comes in when abiotic phenomena are involved in organismal evolution. Earthquakes, volcanic eruptions, continental drifts, meteorite hits, and many other geological and astronomical events are not the outcome of biotic evolution, and they can be considered to be stochastic from an organismal point of view.

Before the proposal of the neutral theory of evolution in 1968 by Kimura ([5]), randomness was not considered as a basic process of evolution. Systematic pressure, particularly natural selection, was believed to play the major role in evolution. This view is applicable if the population size, or the number of individuals in one population, is effectively infinite. However, the earth is finite, and the number of individuals is always finite. Even this whole universe is finite. This finiteness is the basis of the random nature of neutral evolution, as we will see in later sections of this chapter.

1.2 Unit of Evolution

Genetic information resides in nucleotide sequences, and one gene is often treated as a unit of evolution in many molecular evolutionary studies. A cell is the basic building block for all organisms except for viruses. It is thus natural to consider the cell as a unit of evolution. One cell is equivalent to one individual in single-cell organisms. In multicellular organisms, by definition, one individual is composed of many cells, and a single cell is no longer a unit of evolution. However, if we consider only germ-line cells and ignore somatic cells, we can still discuss cell lineages as the mainstream of multicellular organisms, as in the case of single-cell organisms. Alternatively, clonal cells of one single-cell organism can be considered to be one individual. Cellular slime mold cells form a single body with many cells, or each cell may stay independent, depending on environmental conditions [6]. We therefore should be careful in defining a cell or an individual.

Organisms usually live together, and multiple individuals form one “population.” We humans reproduce sexually, and it seems obvious for us to consider one mating group. In classic population genetics theory, this reproduction unit is called a Mendelian population, after Gregor Johann Mendel, the father of genetics. From an individual's point of view, the largest Mendelian population is its species. Asexually reproducing organisms do not necessarily form a population, and multiple individuals observed in proximity, which are often recognized as one population, may be just an outcome of the past life history of the organism, and each individual may reproduce independently. Gene exchanges also occur in asexually reproducing organisms, including bacteria. Therefore, by extending the species concept, bacterial cells with similar phylogenetic relationships are called species. Population or species is also defined for viruses, where each virus particle is regarded as one “individual.”

However, we have to be careful in defining individuals and populations. One tree, such as a cherry tree, is usually considered to be one individual, for it starts from one seed. Unlike most animals, trees and many other plant species can use part of their body to start a new “individual.” This asexual reproduction prompted plant population biologist John L. Harper to coin the terms genet (genetic individual) and ramet (physiological individual) [7]. We should thus be careful about the number of “individuals,” especially for asexually reproducing organisms.

2 How to Describe the Random Nature of DNA Propagation

We discuss four major processes that mathematically describe the random characteristics of DNA transmission. The first two, the branching process and the coalescent process, consider the genealogical relationship of gene copies, while the latter two, the Markov process and the diffusion process, treat temporal changes of allele frequencies.

2.1 Gene Genealogy Versus Allele Frequency Change

For organisms to evolve and diverge, we need changes, or mutations. Supply of mutations to the continuous flow of self-replication of genetic materials (DNA or RNA) is fundamental for organismal evolution. This process is most faithfully described by the phylogenetic relationship of genes. Because every organism is a product of eons of evolution, we are unable to grasp the full characteristics of living beings without understanding the evolutionary history of genes and organisms. It is thus clear that reconstruction of the phylogeny of genes is essential not only for the study of evolution but also for biology in general. In other words, gene genealogy is the basic descriptor of evolution.

It should be emphasized that the genealogical relationship of genes is independent of the mutation process when mutations are selectively neutral. A gene genealogy is the direct product of DNA replication and always exists, while mutations may or may not happen within a certain time period in some specific DNA region. Therefore, even if many nucleotide sequences happen to be identical, there must be a genealogical relationship for those sequences. However, it is impossible to reconstruct the genealogical relationship without mutational events. In this respect, the search for mutational events in genes and their products is also important for reconstructing phylogenetic trees. Advancement of molecular biotechnology has made it possible to routinely produce gene genealogies from many nucleotide sequences.

Figure 4.1 shows a schematic gene genealogy for 10 genes. There are two types of genes that have a small difference in their nucleotide sequences, depicted by open and full circles. Both types are located at the same location in one particular chromosome of this organism. This location is called a “locus” (plural form is “loci”), after a Latin word meaning place, and one type of nucleotide sequence is called an “allele,” using a Greek word αλλο meaning different. The open circle allele, called allele A, is the ancestral type, and the full circle allele, called allele M, emerged by a mutation shown as a star mark. The numbers of gene copies are 8 and 2 for alleles A and M. We thus define allele frequencies of these two alleles as 0.8 (=8/10) and 0.2 (=2/10). Allele frequency is sometimes called gene frequency. It should be noted that these frequencies are exact values if there are only 10 genes in the population in question. If these 10 genes were sampled from a population with many more genes, these two values are sample allele frequencies.

Fig. 4.1

Schematic gene genealogy for some locus of a population. Open circles and full circles designate two different alleles, and the star is a mutation. Time scale is in terms of generations, where N is the number of individuals. An autosomal locus of a diploid organism is assumed

Because all these 10 genes are homologous at the same locus, they have a common ancestral gene. Put differently, only descendants of that common ancestral gene are considered in the gene genealogy of Fig. 4.1. There are, however, many genes which did not contribute to the 10 genes at the present time. If we also consider these genes that once existed, the population history may look like Fig. 4.2. In this figure, the gene genealogy starting from the full circle gene at generation 1 is embedded among other genes that coexisted at each generation but became extinct. If we consider the whole population, it is clear that allele frequency changes temporally, and many genes shown as open circles did not contribute to the current generation.

Fig. 4.2

Relationship between gene genealogy and allele frequency change (From Saitou 2007; [55])

How can this allele frequency change occur? Natural selection does influence this change (see Chap. 5), but the more fundamental process is random genetic drift. This occurs because a finite number of genes are more or less randomly sampled from the parental generation to produce the offspring generation. This simple stochastic process is the source of random fluctuation of allele frequencies through generations.

Random genetic drift can be described as follows. Let us focus on one particular diploid population with N[t] individuals at generation t. We consider a certain autosomal locus A, and the total number of genes at that locus at generation t is 2N[t]. There are many alleles at locus A, but let us consider one particular allele Ai with ni gene copies. By definition, the allele frequency pi for allele Ai is ni/2N[t]. When one sperm or egg is formed via meiosis, one gene copy from locus A is included in that gamete. If males and females are assumed to have more or less the same allele frequency, the probability of having allele Ai in that gamete is pi. This procedure is a Bernoulli trial, and the offspring generation at time t+1 will be formed with 2N[t+1] Bernoulli trials. Because all these trials are expected to be independent, we have the following binomial distribution to give the probability Prob[k] of having k gene copies of allele Ai among 2N[t+1] genes in the offspring generation:

$$ \mathrm{Prob}\left[\mathrm{k}\right]={}_{2\mathrm{N}\left[\mathrm{t}+1\right]}{\mathrm{C}}_{\mathrm{k}}\,{\mathrm{p}}_{\mathrm{i}}^{\mathrm{k}}{\left(1-{\mathrm{p}}_{\mathrm{i}}\right)}^{2\mathrm{N}\left[\mathrm{t}+1\right]-\mathrm{k}} $$
(4.1)

where xCy is the number of possible combinations to choose y out of x. If we repeat this binomial sampling for many generations, random genetic drift occurs. When the number of individuals in that population, or population size, is quite large, this fluctuation is small because of the “law of large numbers” in probability theory, yet the effect of random genetic drift will never disappear under finite population size. Random genetic drift was extensively studied by Sewall Wright and was sometimes called the Wright effect.

Figure 4.3 shows examples of computer simulations of random genetic drift under a set of very simple conditions: discrete generations, haploid, constant population size, no population structure, and no recombination. The Perl script for simulating random genetic drift is available at this book website. Population size (the total number of individuals or genes in one population) differs between Fig. 4.3a (1,000) and Fig. 4.3b (10,000). The initial allele frequency was set to 0.2, and the temporal changes of up to 1,000 generations are shown. In each case, 5 replications are shown. Clearly, as population size increases, fluctuation of allele frequencies decreases. This simplified situation is often called the Wright–Fisher model, honoring Sewall Wright and Ronald A. Fisher ([8]).
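The sampling scheme described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Perl script mentioned in the text: the function name `wright_fisher_trajectory`, the seed, and the population sizes and run length in the demonstration are assumptions chosen to keep the run short.

```python
import random


def wright_fisher_trajectory(pop_size, init_freq, generations, rng):
    """Return the allele frequency at generations 0..generations.

    Each generation, every one of the pop_size gene copies is drawn
    independently from the parental generation (a Bernoulli trial whose
    success probability is the current allele frequency).
    """
    freq = init_freq
    trajectory = [freq]
    for _ in range(generations):
        # Binomial sampling realized as pop_size independent Bernoulli trials.
        copies = sum(1 for _ in range(pop_size) if rng.random() < freq)
        freq = copies / pop_size
        trajectory.append(freq)
    return trajectory


if __name__ == "__main__":
    rng = random.Random(1)  # arbitrary seed, used only for reproducibility
    for pop_size in (100, 1000):  # hypothetical sizes, smaller than in Fig. 4.3
        finals = [
            wright_fisher_trajectory(pop_size, 0.2, 100, rng)[-1]
            for _ in range(5)
        ]
        print(pop_size, [round(f, 3) for f in finals])
```

Frequencies 0 and 1 are absorbing in this sketch, because no mutation is included: once an allele is lost or fixed, every later Bernoulli trial gives the same outcome.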

Fig. 4.3

Computer simulation of random genetic drift (From Saitou 2007; [55])

2.2 Branching Process

Francis Galton, a half cousin of Charles Darwin, was interested in the probability of surname extinction. He could not reach an appropriate answer himself, so he asked some mathematicians. Eventually he was satisfied with a solution given by H. W. Watson, who used a generating function, and they published a joint paper in 1874 [9]. Because of this history, the mathematical model considered by them is sometimes called the “Galton–Watson process,” but usually it is called the “branching process” (see [10] for a detailed description of this process). It may be noted that surnames have been studied in human genetics (e.g., [11]) and in anthropology (e.g., [12]), for their transmissions often coincide with Y chromosome transmissions.

Fisher (1930; [13]) applied this process to obtain the probability of mutants to be ultimately fixed or become extinct. Later in 1940s, when physicists in the USA developed the atomic bomb, the branching process was used to analyze behavior of neutron number changes (see [14]).

The distribution of transmission probability of gene copies from parents to offspring is the basis of the branching process. The number of individuals in the population is usually not considered, for this process is mainly applied to the shallow genealogy of mutant gene copies within a large population. In a sense, the branching process is a finite small world in an infinite world.

A Poisson distribution is the default probability distribution for gene copy transmission under random mating. Let us explain why the Poisson distribution comes in. We assume a simple reproduction process where one haploid individual has n opportunities to produce one offspring during its life span, and the probability, p, of reproduction is the same at each opportunity (see Fig. 4.4). The probability Prob[k] of having k offspring during the n opportunities is given by the following binomial distribution:

Fig. 4.4

From binomial distribution to Poisson distribution

$$ \mathrm{Prob}\left[\mathrm{k}\right]={}_{\mathrm{n}}{\mathrm{C}}_{\mathrm{k}}\,{\mathrm{p}}^{\mathrm{k}}{\left(1-\mathrm{p}\right)}^{\mathrm{n}-\mathrm{k}} $$
(4.2)

Equation 4.2 is equivalent to Eq. 4.1, though the meanings of parameters are somewhat different. The mean, m, of this binomial distribution is

$$ \mathrm{m}=\mathrm{np} $$
(4.3)

Let us increase n and decrease p while keeping m constant. The limit, n → ∞, gives

$$ \mathrm{Prob}\left[\mathrm{k}\right]=\frac{{\mathrm{m}}^{\mathrm{k}}{\mathrm{e}}^{-\mathrm{m}}}{\mathrm{k}!} $$
(4.4)

where e (= 2.718281828459…) is the base of the natural logarithm. Equation 4.4 is called the Poisson distribution, after the French mathematician Siméon Denis Poisson. When m = 1, the mutant gene is expected to keep its copy number, while m>1 or m<1 corresponds to positive or negative selection situations (see Chap. 5). Table 4.1 shows Prob[k] values for various m values. It should be noted that Prob[0], or the probability of leaving no offspring, is quite high. Even for m = 2, where the expected number of offspring is two and the gene copy number is expected to grow explosively, Prob[0] is ~0.135.
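The passage from the binomial distribution (Eq. 4.2) to the Poisson distribution (Eq. 4.4) can be checked numerically. The following is a minimal sketch; the function names and the particular values of n and m in the demonstration are illustrative assumptions, not values taken from Table 4.1.

```python
import math


def binomial_pmf(k, n, p):
    """Eq. 4.2: probability of k offspring in n reproduction opportunities."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def poisson_pmf(k, m):
    """Eq. 4.4: Poisson probability of k offspring when the mean is m."""
    return m**k * math.exp(-m) / math.factorial(k)


if __name__ == "__main__":
    m = 1.0
    # Increase n and decrease p = m / n while keeping the mean m = np fixed.
    for n in (10, 100, 10000):
        row = [round(binomial_pmf(k, n, m / n), 4) for k in range(4)]
        print("binomial, n =", n, row)
    print("Poisson         ", [round(poisson_pmf(k, m), 4) for k in range(4)])
    # Probability of leaving no offspring for a few assumed values of m.
    for m in (0.5, 1.0, 2.0):
        print("m =", m, "Prob[0] =", round(poisson_pmf(0, m), 3))
```

As n grows, the binomial rows approach the Poisson row, and Prob[0] = e^(−m) stays sizable even when m exceeds 1.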

Table 4.1 Prob[k] values for various m values

Fisher [13] showed that the mutant is destined to become extinct for m≤1. When m = 1, one may expect that this is a stable situation and that the mutant will continue to survive in the population. The population size is assumed to be infinite in the usual branching process, and this causes the mutant gene copy with m = 1 to become extinct. However, we live in a finite environment, and the branching process under infinite population size is not appropriate when we consider long-term evolution. When m>1, the mutant is advantageous, and the probability of survival becomes positive, as we will see in Chap. 5. Readers interested in application of the branching process to the fates of mutant genes should refer to Crow and Kimura (1970; [15]).

Although the Poisson distribution is usually assumed in a random mating population, the real probability distribution of gene copy number may be different. In human studies, pedigree data are used to estimate the gene transmission probability. A Kalahari San population (!Kung bushman) was reported to have a bimodal distribution of gene transmission, where the variance is larger than the mean (Howell, 1979; [16]). Interestingly, a Philippine Negrito population was shown to have an approximate Poisson distribution with mean 1.05 (Saitou et al. 1988; [17]).

Figure 4.5 shows an example of the branching process with m = 1. A Monte Carlo method was used to generate this genealogy. The perl script for simulating the branching process is available at this book website.
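A Monte Carlo simulation of this kind can be sketched as follows. This is an assumed, minimal Python version and not the Perl script referred to above; the Poisson sampler uses Knuth's multiplication method so that only the standard library is needed, and the function names, seed, and run lengths are illustrative choices.

```python
import math
import random


def poisson_sample(mean, rng):
    """Draw one Poisson variate by Knuth's multiplication method."""
    threshold = math.exp(-mean)
    k = 0
    product = rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def branching_process(mean, generations, rng):
    """Copy numbers of one mutant lineage, starting from a single copy."""
    counts = [1]
    for _ in range(generations):
        # Every current copy leaves a Poisson number of offspring copies.
        offspring = sum(poisson_sample(mean, rng) for _ in range(counts[-1]))
        counts.append(offspring)
    return counts


if __name__ == "__main__":
    rng = random.Random(2)  # arbitrary seed
    runs = [branching_process(1.0, 20, rng) for _ in range(1000)]
    extinct = sum(1 for counts in runs if counts[-1] == 0)
    print("lineages extinct within 20 generations:", extinct, "of 1000")
```

Once a lineage reaches zero copies it stays at zero, which is how extinction appears in this sketch.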

Fig. 4.5

Examples of branching process when m = 1

2.3 Coalescent Process

Mutant gene transmission follows the arrow of time in the branching process. In other words, it is a forward process. However, as we saw, most gene lineages become extinct, and it is not easy to track the lineage which will eventually propagate in the population. Now let us consider a genealogy only for sampled genes. It is natural to look for their ancestral genes, finally going back to the single common ancestral gene. This is viewing a gene genealogy as a backward process. When two gene lineages are joined at their common ancestor, this event is called “coalescence” after Kingman (1982; [18]). It should be noted that Hudson [19] and Tajima [20] independently invented essentially the same theory in 1983.

Let us consider Fig. 4.1 again. The two leftmost gene copies coalesce first, followed by coalescence of the two mutant genes shown as full circles. At this moment, there are eight lineages left, and one of them experienced a mutation, shown as a star. After six more coalescent events, at around 2N generations ago, there are only two lineages. Then it took another ~2N generations to reach the final, 9th coalescence. If there is no population structure in this organism, called the “panmictic” situation, and if there is no change in population size (N), the time to reach the last common ancestral gene, or coalescent time, is expected to be 4N generations ago, according to the coalescent theory of an autosomal locus for diploid organisms.

The simplest coalescent process is pure neutral evolution. Even if mutations accumulate, they do not affect survival of their offspring lineages. Because of this nature, gene genealogy and mutation accumulation can be considered separately. If natural selection, either negative (purifying) or positive, comes in for some mutant lineages, this independence between generation of gene genealogy and mutation accumulation no longer holds.

Another important assumption for the simplest coalescent process is the constant population size, N. In a diploid organism, the number of gene copies for an autosomal locus is 2N, while the number of gene copies for a locus of a haploid organism is N. The former situation is assumed explicitly or implicitly in much of the literature. However, the original lifestyle of organisms is haploid, and many organisms today are haploids. Therefore, we consider the situation in haploid organisms first. It should be noted that the constant population size is more or less expected if we consider long-term evolution. Otherwise, the species will become extinct or will have exponential growth. Though we, Homo sapiens, are in fact experiencing a population explosion, this is a rather rare situation among species. In short-term evolution, population size is expected to fluctuate for any organism. Therefore, the assumption of constant population size is not realistic and is made only for mathematical simplicity. We have to be careful about overly simplistic assumptions of this sort, which are inherent in many evolutionary theories. There are some more simplifications in the original coalescent theory: discrete generations and random mating. Random mating means that any gene copy is equal in terms of gene transmission to the next generation, and there is no subpopulation structure within the population of N individuals in question. These assumptions were also used for the Wright–Fisher model.

Let us first consider the coalescence of only two gene copies. What is the probability, Prob[2→1, 1], for 2 genes to coalesce in one generation? If we pick one of these two gene copies arbitrarily, this gene, say, G1, should have its parental gene, PG1, in the previous generation. The other gene, G2, also has its parental gene PG2. Because all genes are equal in terms of gene transmission probability under our assumption, any of the N genes, including PG1, can be PG2. We should remember Fig. 4.5, where multiple offspring may be produced from one individual during one generation. Therefore, having one offspring G1 does not affect the probability of having another offspring, for these reproductions are independent. It is then obvious that

$$ \mathrm{Prob}\left[2\to 1,1\right]=\frac{1}{\mathrm{N}} $$
(4.5)

The probability of the complementary event, i.e., no coalescence, can be written as Prob[2→2, 1] and

$$ \mathrm{Prob}\left[2\to 2,1\right]=1-\left(\frac{1}{\mathrm{N}}\right) $$
(4.6)

We now move to a slightly more complicated situation. What is the probability, Prob[2→1, t], for 2 genes to coalesce exactly t generations ago? The coalescent event must occur after (t−1) generations without coalescence. Thus,

$$ \mathrm{Prob}\left[2\to 1,\mathrm{t}\right]={\left[1-\left(\frac{1}{\mathrm{N}}\right)\right]}^{\mathrm{t}-1}\cdot \left[\frac{1}{\mathrm{N}}\right] $$
(4.7)

When N is large, [1 − (1/N)]^(t−1) can be approximated as e^(−t/N). Then

$$ \mathrm{Prob}\left[2\to 1,\mathrm{t}\right]\sim \left[\frac{1}{\mathrm{N}}\right]\;{\mathrm{e}}^{-\frac{\mathrm{t}}{\mathrm{N}}} $$
(4.8)

We can obtain the mean, Mean[2→1, t], and the variance, Var[2→1, t], of the time, t, for coalescence, using this geometric distribution:

$$ \mathrm{Mean}\left[2\to 1,\mathrm{t}\right]={\varSigma}_{\mathrm{t}=1,\infty\;}\mathrm{t}\cdot \left[\frac{1}{\mathrm{N}}\right]\cdot {\left[1-\left(\frac{1}{\mathrm{N}}\right)\right]}^{\mathrm{t}-1} $$
(4.9)

After some transformations,

$$ \mathrm{Mean}\left[2\to 1,\mathrm{t}\right]=\mathrm{N} $$
(4.10)
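The transformations between Eq. 4.9 and Eq. 4.10 can be sketched as follows, writing p = 1/N and q = 1 − p (a shorthand introduced here only for this derivation) and differentiating the geometric series term by term:

$$ \mathrm{Mean}\left[2\to 1,\mathrm{t}\right]=\mathrm{p}\sum_{\mathrm{t}=1}^{\infty}\mathrm{t}\,{\mathrm{q}}^{\mathrm{t}-1}=\mathrm{p}\,\frac{\mathrm{d}}{\mathrm{d}\mathrm{q}}\sum_{\mathrm{t}=0}^{\infty}{\mathrm{q}}^{\mathrm{t}}=\mathrm{p}\,\frac{\mathrm{d}}{\mathrm{d}\mathrm{q}}\left(\frac{1}{1-\mathrm{q}}\right)=\frac{\mathrm{p}}{{\left(1-\mathrm{q}\right)}^2}=\frac{1}{\mathrm{p}}=\mathrm{N} $$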

The variance of this geometric distribution is

$$ \mathrm{Var}\left[2\to 1,\mathrm{t}\right]={\varSigma}_{\mathrm{t}=1,\infty }{\left(\mathrm{t}-\mathrm{N}\right)}^2\cdot \left[\frac{1}{\mathrm{N}}\right]\cdot {\left[1-\left(\frac{1}{\mathrm{N}}\right)\right]}^{\mathrm{t}-1} $$
(4.11)

It can be shown that

$$ \mathrm{Var}\left[2\to 1,\mathrm{t}\right]=\mathrm{N}\left(\mathrm{N}-1\right) $$
(4.12)

When N >> 1, Var[2→1, t] ~ N^2. Therefore, the standard deviation of t is ~N generations, the same as its mean. When a diploid autosomal locus is assumed, the mean and variance are 2N and approximately (2N)^2, respectively.
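The mean and variance of the coalescence time for two genes can be checked by simulation. The sketch below assumes a haploid population and traces two lineages backward one generation at a time; the function names, the population size of 100, the seed, and the number of replicates are illustrative assumptions.

```python
import math
import random


def prob_coalesce_at(t, pop_size):
    """Eq. 4.7: two genes coalesce exactly t generations ago (haploid)."""
    return (1.0 - 1.0 / pop_size) ** (t - 1) * (1.0 / pop_size)


def simulate_pair_coalescence(pop_size, rng):
    """Trace two lineages backward until they share a parental gene."""
    t = 1
    # Each generation back, the second lineage picks the parent of the
    # first lineage with probability 1 / pop_size.
    while rng.randrange(pop_size) != 0:
        t += 1
    return t


if __name__ == "__main__":
    pop_size = 100
    rng = random.Random(3)  # arbitrary seed
    times = [simulate_pair_coalescence(pop_size, rng) for _ in range(5000)]
    mean = sum(times) / len(times)
    var = sum((t - mean) ** 2 for t in times) / len(times)
    print("simulated mean and variance:", round(mean, 1), round(var, 1))
    print("Eq. 4.10 and Eq. 4.12:      ", pop_size, pop_size * (pop_size - 1))
    print("Eq. 4.7 and Eq. 4.8 at t = 50:",
          prob_coalesce_at(50, pop_size), math.exp(-50 / pop_size) / pop_size)
```

Because the standard deviation is about as large as the mean, individual simulated coalescence times scatter widely around N.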

Let us now consider the coalescent process for n genes sampled from the population of N individuals. We assume n << N. The first step is the probability for two of the n gene copies to coalesce during t generations. The probability for three gene copies to coalesce in one generation is (1/N)^2. If N is large, (1/N)^2 ~ 0, and we can ignore coalescence of more than 2 genes in one generation and focus on coalescence of a single pair of genes. Because there are nC2 [= n(n−1)/2] possible combinations to choose two out of n genes,

$$ \mathrm{Prob}\left[\mathrm{n}\to \mathrm{n}-1,1\right]={}_{\mathrm{n}}{\mathrm{C}}_2\cdot \left[\frac{1}{\mathrm{N}}\right]. $$
(4.13)

We can thus generalize Eq. 4.7 to obtain the probability that two of the n sampled genes coalesce exactly t generations ago as

$$ \mathrm{Prob}\left[\mathrm{n}\to \mathrm{n}-1,\mathrm{t}\right]={\left[1-\left(\frac{{}_{\mathrm{n}}{\mathrm{C}}_2}{\mathrm{N}}\right)\right]}^{\mathrm{t}-1}\cdot \left[\frac{{}_{\mathrm{n}}{\mathrm{C}}_2}{\mathrm{N}}\right] $$
(4.14)

The mean of t under this distribution is

$$ \mathrm{Mean}\left[\mathrm{n}\to \mathrm{n}-1,\mathrm{t}\right]=\frac{\mathrm{N}}{{}_{\mathrm{n}}{\mathrm{C}}_2}=\frac{2\mathrm{N}}{\mathrm{n}\left(\mathrm{n}-1\right)}. $$
(4.15)

We can then obtain the mean or expected time of coalescence from the current generation of n genes to single common ancestral gene by summing the means above:

$$ \mathrm{Mean}\left[\mathrm{n}\to 1,\mathrm{t}\right]={\varSigma}_{\mathrm{i}=2,\mathrm{n}}\frac{2\mathrm{N}}{\mathrm{i}\left(\mathrm{i}-1\right)} $$
(4.16)
$$ =2\mathrm{N}\left[1-\left(\frac{1}{\mathrm{n}}\right)\right] $$
(4.17)

If n is large,

$$ \mathrm{Mean}\left[\mathrm{n}\to 1,\mathrm{t}\right]\sim 2\mathrm{N} $$
(4.18)
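The sum in Eq. 4.16 and its closed form in Eq. 4.17 can be verified numerically, and the expected coalescent time can also be approximated by simulation. The sketch below is an assumption-laden illustration for a haploid population: waiting times are drawn from exponential distributions, the continuous counterpart of the geometric distribution in Eq. 4.14, and the function names, seed, and parameter values are arbitrary.

```python
import random


def expected_tmrca(sample_size, pop_size):
    """Eq. 4.16: sum of the mean waiting times 2N / (i(i-1)), i = 2..n."""
    return sum(2.0 * pop_size / (i * (i - 1)) for i in range(2, sample_size + 1))


def simulate_tmrca(sample_size, pop_size, rng):
    """One draw of the coalescent time under the large-N approximation.

    With i lineages, the waiting time to the next coalescence is drawn from
    an exponential distribution with mean 2N / (i(i-1)).
    """
    total = 0.0
    for i in range(sample_size, 1, -1):
        rate = i * (i - 1) / (2.0 * pop_size)
        total += rng.expovariate(rate)
    return total


if __name__ == "__main__":
    pop_size = 1000
    rng = random.Random(4)  # arbitrary seed
    for n in (2, 10, 100):
        sims = [simulate_tmrca(n, pop_size, rng) for _ in range(2000)]
        print(n, expected_tmrca(n, pop_size),
              2 * pop_size * (1 - 1 / n), round(sum(sims) / len(sims), 1))
```

The first two printed columns agree for every n, and both approach 2N as n increases, because the last coalescence (from two lineages to one) alone accounts for N of the expected 2N(1 − 1/n) generations.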

When diploid autosomal genes are considered, this approximate mean becomes 4N, and the variance of the coalescent time, when n is large, is given by Tajima [20]:

$$ \mathrm{Var}\left[\mathrm{n}\to 1,\mathrm{t}\right]\sim 16{\mathrm{N}}^2\left(\frac{\pi^2}{3}-3\right) $$
(4.19)

If n is not much different from N, that is, if almost exhaustive sampling was conducted, the possibility of three or more gene copies coalescing into one gene copy within one generation is no longer negligible, and Eq. 4.13 and the equations that follow it no longer hold. We need to consider “exact” coalescence. The following explanation is after Fu (2006; [21]). If we consider a randomly mating population with constant size N, each gene copy in the present population was sampled from the N gene copies of the previous generation with replacement. Therefore, if we choose one particular gene copy, say, copy ID 1, from the present population, the probability of its transmission from a specific gene copy of the previous generation is 1/N. Then the probability of gene copy ID 2 from the present population not sharing the same parental copy with copy ID 1 is 1 − [1/N]. We then go to the next situation, in which gene copy ID 3 from the present population shares the parental gene copy with neither ID 1 nor ID 2. Its probability becomes 1 − [2/N]. Applying a similar argument for IDs 4 to n (n≤N), the probability, Prob[n→n, 1], that no two of the n gene copies at the present generation share a parental gene copy in the previous generation becomes

$$ \mathrm{Prob}\left[\mathrm{n}\to \mathrm{n},1\right]={\varPi}_{\mathrm{k}=1,\mathrm{n}-1}\left(1-\left[\frac{\mathrm{k}}{\mathrm{N}}\right]\right) $$
(4.20)
$$ =\frac{{\mathrm{N}}_{\left[\mathrm{n}\right]}}{{\mathrm{N}}^{\mathrm{n}}} $$
(4.21)

where

$$ {\mathrm{N}}_{\left[\mathrm{n}\right]}=\mathrm{N}\left(\mathrm{N}-1\right)\left(\mathrm{N}-2\right)\dots \left(\mathrm{N}-\mathrm{n}+1\right) $$
(4.22)
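The exact probability of no coalescence in one generation, N_[n]/N^n, can be compared with the pairwise approximation that underlies Eq. 4.13. The sketch below is illustrative only; the function names and the sample sizes in the demonstration are assumptions.

```python
import math


def prob_no_coalescence_exact(sample_size, pop_size):
    """Eq. 4.21: N_[n] / N^n, all n copies have distinct parental copies."""
    prob = 1.0
    for k in range(1, sample_size):
        prob *= 1.0 - k / pop_size
    return prob


def prob_no_coalescence_pairwise(sample_size, pop_size):
    """1 - nC2 / N, the complement of Eq. 4.13 (meaningful only when n << N)."""
    return 1.0 - math.comb(sample_size, 2) / pop_size


if __name__ == "__main__":
    pop_size = 1000
    for n in (2, 10, 50, 500):
        print(n, prob_no_coalescence_exact(n, pop_size),
              prob_no_coalescence_pairwise(n, pop_size))
```

For n = 2 the two values coincide, while for n = 50 and N = 1,000 the pairwise expression is already negative (nC2 = 1,225 exceeds N), which shows why the approximation cannot be used once n is no longer much smaller than N.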

Therefore, the probability corresponding to Eq. 4.14 under the exact coalescent, in which n gene copies at the present generation first coalesce to m (<n) ancestral gene copies t generations ago, becomes

$$ \mathrm{Prob}\left[\mathrm{n}\to \mathrm{m},\mathrm{t}\right]=\left[1-\frac{{\mathrm{N}}_{\left[\mathrm{n}\right]}}{{\mathrm{N}}^{\mathrm{n}}}\right]\cdot {\left[\frac{{\mathrm{N}}_{\left[\mathrm{n}\right]}}{{\mathrm{N}}^{\mathrm{n}}}\right]}^{\mathrm{t}-1} $$
(4.23)

Generally speaking, the coalescent time for the exact process is shorter than that for the approximation, the Kingman coalescent, first given by Kingman (1982). Figure 4.6 shows examples of gene genealogies of the same sample size under the exact coalescent and the Kingman coalescent (reproduced from Fu 2006).

Fig. 4.6

Comparison of exact and Kingman coalescence (From [21])

Unlike the treatment of allele frequency changes to be discussed in later sections, the coalescent time is given in terms of the population size in the coalescent theory. Because of this, we can check the implicit assumption of constant population size. For example, the total human population as of 2011 A.D. is over 7 billion. If we apply the coalescent theory under the constant population model, the expected number of generations for coalescence of an autosomal gene, 4N, is about 28 billion generations. If one generation is 20 years, the expected coalescent time becomes about 560 billion years! This value is far greater than the age of this universe, which started with the Big Bang approximately 14 billion years ago. This seemingly paradoxical situation simply comes from the population explosion, which violates the assumption of constant population size. To overcome this problem, the “effective population size” is often used. Modern humans are estimated to have an effective population size of ca. 10,000 (e.g., Takahata 1993; [59]).
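The arithmetic behind this comparison can be written out as a short sketch. The function name is hypothetical, the census size is taken as exactly 7 billion for simplicity, and the generation time of 20 years and the effective size of 10,000 are the values quoted in the text.

```python
def expected_coalescence_years(pop_size, years_per_generation=20):
    """Expected coalescent time, 4N generations, converted to years."""
    return 4 * pop_size * years_per_generation


if __name__ == "__main__":
    census = 7_000_000_000  # world population, rounded to exactly 7 billion
    effective = 10_000      # effective population size quoted in the text
    print("census size:   ", expected_coalescence_years(census), "years")
    print("effective size:", expected_coalescence_years(effective), "years")
```

With the census size, 4N × 20 years is 5.6 × 10^11 years, whereas with the effective size of 10,000 it is 800,000 years, a value that no longer conflicts with the age of the universe.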

Recent developments of the coalescent theory are discussed in [22] and [23]. We will discuss various applications of the coalescent theory in Chap. 17.

2.4 Markov Process

We now move to the treatment of allele frequency changes. For simplicity, a constant population size (N) is assumed. We also consider haploid organism as before. Let us consider one particular allele Ai, and the number of gene copies at generation t is denoted as i. Allele frequency for this allele at generation t is i/N. Then the probability of having j gene copies among N genes in the next generation (t+1) becomes

$$ \mathrm{Prob}\left[\mathrm{i}\to \mathrm{j}\right]={}_{\mathrm{N}}{\mathrm{C}}_{\mathrm{j}}{\left[\frac{\mathrm{i}}{\mathrm{N}}\right]}^{\mathrm{j}}{\left(1-\left[\frac{\mathrm{i}}{\mathrm{N}}\right]\right)}^{\mathrm{N}-\mathrm{j}} $$
(4.24)

This is the transition probability from i to j gene copies from generation t to t+1. For simplicity, let us denote Prob[i→j] as P_{i,j} (0≤i, j≤N). Then we can write the transition probability matrix P as

$$ P=\left|\begin{array}{ccccccccc}{\mathrm{P}}_{0,0}& {\mathrm{P}}_{1,0}& {\mathrm{P}}_{2,0}& {\mathrm{P}}_{3,0}& {\mathrm{P}}_{4,0}& \dots & {\mathrm{P}}_{\mathrm{N}-2,0}& {\mathrm{P}}_{\mathrm{N}-1,0}& {\mathrm{P}}_{\mathrm{N},0}\\ {}{\mathrm{P}}_{0,1}& {\mathrm{P}}_{1,1}& {\mathrm{P}}_{2,0}& {\mathrm{P}}_{3,0}& {\mathrm{P}}_{4,0}& \dots & {\mathrm{P}}_{\mathrm{N}-2,0}& {\mathrm{P}}_{\mathrm{N}-1,0}& {\mathrm{P}}_{\mathrm{N},1}\\ {}{\mathrm{P}}_{0,2}& {\mathrm{P}}_{1,2}& {\mathrm{P}}_{2,2}& {\mathrm{P}}_{3,2}& {\mathrm{P}}_{4,2}& \dots & {\mathrm{P}}_{\mathrm{N}-2,2}& {\mathrm{P}}_{\mathrm{N}-1,2}& {\mathrm{P}}_{\mathrm{N},2}\\ {}\cdot & \cdot & \cdot & \cdot & \cdot & \dots & \cdot & \cdot & \cdot \\ {}{\mathrm{P}}_{0,\mathrm{N}-2}& {\mathrm{P}}_{1,\mathrm{N}-2}& {\mathrm{P}}_{2,\mathrm{N}-2}& {\mathrm{P}}_{3,\mathrm{N}-2}& {\mathrm{P}}_{4,\mathrm{N}-2}& \dots & {\mathrm{P}}_{\mathrm{N}-2,\mathrm{N}-2}& {\mathrm{P}}_{\mathrm{N}-1,0}\;& {\mathrm{P}}_{\mathrm{N},\mathrm{N}-2}\\ {}{\mathrm{P}}_{0,\mathrm{N}-1}& {\mathrm{P}}_{1,\mathrm{N}-1}& {\mathrm{P}}_{2,\mathrm{N}-1}& {\mathrm{P}}_{3,\mathrm{N}-1}& {\mathrm{P}}_{4,\mathrm{N}-1}& \dots & {\mathrm{P}}_{\mathrm{N}-2,\mathrm{N}-1}& {\mathrm{P}}_{\mathrm{N}-1,\mathrm{N}-1}& {\mathrm{P}}_{\mathrm{N},\mathrm{N}-1}\\ {}{\mathrm{P}}_{0,\mathrm{N}\operatorname{}}& {\mathrm{P}}_{1,\mathrm{N}\kern0.24em }& {\mathrm{P}}_{2,\mathrm{N}\kern0.24em }& {\mathrm{P}}_{3,\mathrm{N}}\;& {\mathrm{P}}_{4,\mathrm{N}\operatorname{}}& \dots & {\mathrm{P}}_{\mathrm{N}-2,\mathrm{N}\operatorname{}}& {\mathrm{P}}_{\mathrm{N}-1,\mathrm{N}}& {\mathrm{P}}_{\mathrm{N},\mathrm{N}}\end{array}\;\right| $$
(4.25)

We can derive the probability, Prob[i/N, t+1], of having allele frequency i/N at generation t+1, using this transition probability matrix and the probability at generation t as follows. At the initial generation (t=0), let us assume that there are k (1≤k≤N−1) gene copies in the population. Then Prob[k/N,0] = 1 and Prob[i/N,0]=0 (0≤i≤N, i≠k):

$$ \mathrm{Prob}\left[\frac{\mathrm{i}}{\mathrm{N}},\mathrm{t}+1\right]=\sum_{\mathrm{j}=0}^{\mathrm{N}}\mathrm{Prob}\left[\frac{\mathrm{j}}{\mathrm{N}},\mathrm{t}\right]\cdot {\mathrm{P}}_{\mathrm{j},\mathrm{i}} $$
(4.26)

Figure 4.7 shows some examples of the Markov process when N = 1,000 and the initial frequencies are 0.5 (case a) and 0.1 (case b). The Perl script for computing the Markov process is available at this book website. In the past, the Markov process was not extensively used, for it requires a large number of computations. Thanks to the great advancement of computational power, we can now obtain the allele frequency spectrum for a relatively large number of genes. It may be interesting to apply this exact Markov process to various realistic situations in the future.
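The iteration of Eq. 4.26 can be sketched in a few lines of Python. This is only an illustrative sketch and not the Perl script mentioned above; the function names (`transition_matrix`, `step`, `evolve`) and the small population size N = 20 are assumptions chosen to keep the example fast.

```python
from math import comb


def transition_matrix(n):
    """Return P with P[i][j] = Prob[i -> j], the binomial probability of Eq. 4.24."""
    matrix = []
    for i in range(n + 1):
        p = i / n
        row = [comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(n + 1)]
        matrix.append(row)
    return matrix


def step(dist, matrix):
    """One generation of Eq. 4.26: new[i] = sum over j of dist[j] * P[j][i]."""
    size = len(dist)
    return [sum(dist[j] * matrix[j][i] for j in range(size)) for i in range(size)]


def evolve(n, k, generations):
    """Distribution of the gene copy number after some generations, starting from k copies."""
    matrix = transition_matrix(n)
    dist = [0.0] * (n + 1)
    dist[k] = 1.0  # Prob[k/N, 0] = 1
    for _ in range(generations):
        dist = step(dist, matrix)
    return dist


# Illustrative values only (much smaller than N = 1,000 used for Fig. 4.7).
N, K = 20, 10
for g in (1, 10, 100):
    d = evolve(N, K, g)
    print(g, round(d[0], 4), round(d[N], 4))  # probabilities of loss and fixation
```

Because states 0 and N are absorbing, the probability mass accumulates at these two states as generations proceed, and the mass at N approaches k/N.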

Fig. 4.7

Example of Markov process. (a) N = 1,000, initial frequency = 0.5. (b) N = 1,000, initial frequency = 0.1. G is generation

2.5 Diffusion Process

There are various mathematical models which can describe the evolutionary changes of allele frequency. The diffusion equation is the most widely used method. It can easily combine stochastic effects such as random genetic drift and deterministic effects such as selection. Random genetic drift alone is discussed in this section, and natural selection will be discussed in Chap. 5.

The starting point is the binomial distribution, the basic process for the random genetic drift (see Sect. 4.2.1). We assume that the population size is constant, and haploid organism is considered. The binomial distribution in Eq. 4.1 can be written as

$$ \mathrm{Prob}\left[\mathrm{k}\right]{=}_{\mathrm{N}}{\mathrm{C}}_{\mathrm{k}}{{\mathrm{p}}_{\mathrm{i}}}^{\mathrm{k}}{\left(1-{\mathrm{p}}_{\mathrm{i}}\right)}^{\mathrm{N}-\mathrm{k}} $$
(4.27)

Let us note that the mean (m) and variance (v) of this distribution are

$$ \mathrm{m}=\mathrm{Np} $$
(4.28)
$$ \mathrm{v}=\mathrm{Np}\left(1-\mathrm{p}\right) $$
(4.29)

This process can be approximated by a differential equation, a Fokker–Planck equation for the random genetic drift:

$$ \frac{\partial \varphi \left(p\to x;t\right)}{\partial t}=\frac{1}{4N}\frac{\partial^2}{\partial {x}^2}\left\{x\left(1-x\right)\varphi \left(p\to x;t\right)\right\} $$
(4.30)

Figure 4.8 explains the basic concept of Eq. 4.30 on the change of allele frequency class, based on Kimura (1955; [24]). Let us divide the allele frequency axis into very small ranges of length h, so that a histogram of many rectangles approximates the probability density function φ(p→x;t). Each rectangle has the width h and the height given by the value of φ(p→x;t) at allele frequency x, at the middle of the rectangle. We also consider a very short time ∆t, so that the change of allele frequency during that time period is restricted to at most the adjacent range, either left or right. If we take the limits (h→0 and ∆t→0), differential equation (4.30) is obtained.

Fig. 4.8

Explanation of diffusion model (From Kimura 1964; [25])

The exact solution for this equation, for probability density distribution of allele frequency x at generation time t, starting from initial frequency p, is

$$ \begin{array}{c}\varphi \left(p\to x;t\right)=\sum_{i=1}^{\infty }p\left(1-p\right)i\left(i+1\right)\left(2i+1\right)F\left(1-i,i+2,2;p\right)\\ {}\kern3.1em \cdot F\left(1-i,i+2,2;x\right)\exp \left[\frac{-i\left(i+1\right)t}{4N}\right]\end{array} $$
(4.31)

F(a,b,c;z) in Eq. 4.31 is a hypergeometric function, in which a[n] = a(a+1)⋯(a+n−1) denotes the rising factorial, with a[0] = 1:

$$ \mathrm{F}\left(\mathrm{a},\mathrm{b},\mathrm{c};\mathrm{z}\right)=\sum_{\mathrm{n}=0}^{\infty}\frac{{\mathrm{a}}_{\left[\mathrm{n}\right]}\cdot {\mathrm{b}}_{\left[\mathrm{n}\right]}\cdot {\mathrm{z}}^{\mathrm{n}}}{{\mathrm{c}}_{\left[\mathrm{n}\right]}\cdot \mathrm{n}!} $$
(4.32)

This solution was given by Kimura in 1955 [24]. Interested readers should refer to [15] and [25] for detailed explanation of the diffusion process.

Figure 4.9 shows the probability density changes for various generations when the initial allele frequency is 0.5 and 0.1. The Perl script for computing the diffusion process is available at this book website. Initially, at time zero, all probability density is concentrated at the initial allele frequency; this is Dirac’s delta function. As random genetic drift starts to operate, the allele frequency starts to diffuse. After a long time, the probability density becomes flat and low, and most of the probability resides at allele frequency 0 or 1.
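The series solution of Kimura (1955) can be evaluated numerically with a short Python sketch. It is not the Perl script mentioned above. It assumes the standard Gauss hypergeometric series (rising factorials of a and b in the numerator, of c in the denominator) and the exponent −i(i+1)t/4N. The function names and the truncation at 12 terms are assumptions; such a short truncation is adequate only when t is not very small relative to N.

```python
from math import exp


def hyp2f1_terminating(a, b, c, z):
    """Gauss hypergeometric series F(a, b, c; z) for a = 0, -1, -2, ...

    With a non-positive integer a, the series stops after -a + 1 terms.
    """
    total, term = 0.0, 1.0
    for n in range(-a + 1):
        total += term
        # Ratio of consecutive terms: (a+n)(b+n) z / ((c+n)(n+1)).
        term *= (a + n) * (b + n) * z / ((c + n) * (n + 1))
    return total


def phi(p, x, t, n, terms=12):
    """Truncated series for the density of unfixed classes at allele frequency x."""
    total = 0.0
    for i in range(1, terms + 1):
        coeff = p * (1 - p) * i * (i + 1) * (2 * i + 1)
        total += (
            coeff
            * hyp2f1_terminating(1 - i, i + 2, 2, p)
            * hyp2f1_terminating(1 - i, i + 2, 2, x)
            * exp(-i * (i + 1) * t / (4 * n))
        )
    return total


# Illustrative values: N = 1,000, initial frequency 0.5, generations 1,000 and 4,000.
for t in (1000, 4000):
    print(t, [round(phi(0.5, x, t, 1000), 4) for x in (0.1, 0.3, 0.5, 0.7, 0.9)])
```

For large t only the first term of the series matters, so the density approaches the flat value 6p(1−p)exp[−t/2N], in agreement with the flat and low density described above.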

Fig. 4.9

Diffusion process. (a) Initial frequency = 0.5. (b) Initial frequency = 0.1

2.6 A More Realistic Process of Allele Frequency Change of Selectively Neutral Situation

In reality, the population size is not only finite but also not constant. Therefore, a more realistic process of frequency change of selectively neutral mutant alleles is as follows. Let us denote the total gene copy number of the population at generation t as N[t] and the gene copy number of a selectively neutral allele A at generation t as NA[t]. Then the allele frequency at generation t, Freq_A[t], becomes

$$ \mathrm{Freq}\_\mathrm{A}\left[\mathrm{t}\right]=\frac{{\mathrm{N}}_{\mathrm{A}}\left[\mathrm{t}\right]}{\mathrm{N}\left[\mathrm{t}\right]} $$
(4.33)

We need to consider a finite population with a finite maximum population size, or carrying capacity. Then the population size fluctuation can be approximated by a Markov process with a constant global population size or carrying capacity. The problem is that the carrying capacity itself changes with the environment. In the case of humans, the environment includes technological innovation. We need to redefine the transition probability matrix. Because an extinct population cannot produce a new population, P0,j (0<j≤N_max) remains zero; that is, state 0 keeps its character as the absorbing barrier. In contrast, PN_max,j (0≤j≤N_max) is not zero. This is because we are now considering population size fluctuation, not allele frequency change. Unfortunately, population genetics theory so far does not consider these more realistic dynamics of populations. It is left for future developments.

3 Expected Evolutionary Patterns Under Neutrality

We discuss three categories when the pure neutral evolution is occurring: fixation probability, the evolutionary rate, and the amount of DNA variation. Because the majority of eukaryotic genome is evolving in this fashion, the understanding of the pure neutral evolutionary process is quite important for evolutionary genomics.

3.1 Fixation Probability

As stated at the beginning of this chapter, neutral evolution is characterized by the egalitarian nature of the propagation of mutants. Therefore, all genes in one generation have the same potential to leave offspring. If one population is destined to continue for a long time, eventually fixation of one gene will occur. Because any of the N genes in the initial generation can become the common ancestor of later generations, the fixation probability of one gene in a population of N genes is 1/N. In an autosomal locus of diploid organisms, the fixation probability becomes 1/2N.

In reality, we do not know if one population in question at this time will continue to survive in later time. Therefore, the absolute fixation probability, Prob_fixation, of one gene should be

$$ \mathrm{Prob}\_\mathrm{fixation}=\mathrm{Prob}\_\mathrm{existence}\cdot \left[\frac{1}{\mathrm{N}}\right] $$
(4.34)

Prob_existence is the probability of existence of that population for a certain long time. Unfortunately, we do not know this probability, and almost always it is implicitly assumed to be unity. Thus,

$$ \mathrm{Prob}\_\mathrm{fixation}=\frac{1}{\mathrm{N}} $$
(4.35)
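The value 1/N can be checked by a small Monte Carlo simulation of binomial sampling. The sketch below assumes that the population persists (Prob_existence = 1) and has constant size. The function names, the seed, and the illustrative values N = 10 and 20,000 trials are assumptions.

```python
import random


def next_generation(copies, n, rng):
    """Binomial sampling of n gene copies with success probability copies/n."""
    p = copies / n
    return sum(1 for _ in range(n) if rng.random() < p)


def is_fixed(n, rng):
    """Follow one gene (1 copy among n) until its descendants are lost or fixed."""
    copies = 1
    while 0 < copies < n:
        copies = next_generation(copies, n, rng)
    return copies == n


def fixation_fraction(n, trials, seed=1):
    """Fraction of independent trials that ended in fixation."""
    rng = random.Random(seed)
    return sum(is_fixed(n, rng) for _ in range(trials)) / trials


# Illustrative run: the observed fraction should be close to 1/N = 0.1.
print(fixation_fraction(10, 20000))
```

The simulated fraction fluctuates around 1/N because of sampling error, which decreases as the number of trials increases.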

3.2 Rate of Evolution

If a gene fixation occurs in one population, there will be no further change of allele frequency, though the gene genealogy will grow as time goes on. We definitely need mutations for evolution to proceed. If a mutation happens, the population of N genes with only one allele will again become polymorphic, with a single copy of the mutant allele and N−1 copies of the original allele. If all genes in later generations become descendants of this mutant gene, gene substitution is attained. Evolution of one gene or one locus can be seen as the accumulation of mutations. Therefore, the rate of gene substitution is equated with the rate of evolution.

Let us define the mutation rate per gene locus per generation as μ. Considering all N genes in this population, Nμ mutants appear in every generation. During T generations, the total number of mutant genes becomes NμT, under the assumption of constant population size. Because the fixation probability for each mutant gene is 1/N, the total number of mutant genes fixed during T generations is NμT · [1/N] = μT. The rate, λ, or speed of evolution in terms of continuous mutant fixation is thus

$$ \lambda =\frac{\mu T}{T}=\mu $$
(4.36)

Equation 4.36 was first shown by Kimura and Ohta (1971; [26]). This explanation assumes the constant population size for a long time. We can relax this assumption to obtain Eq. 4.36. Figure 4.10 shows a schematic gene genealogy for a single lineage during time T. The vertical axis represents the whole population, and the population size can vary. Full circles are mutations accumulated in this single lineage, and thin lines represent increase of allele frequency for each mutant. The total number of mutations accumulated during time T is μT. Therefore, the evolutionary rate, λ, of gene substitutions per generation should be μT/T = μ. This argument applies to any time irrespective of population size change. Even if speciation occurs, it does not affect this argument based on the single gene lineage. This is why we can consider the long-term evolution. Of course, any gene at the present population can be the starting point for the single lineage genealogy. This generality comes from the egalitarian nature of the selectively neutral mutant gene copies.
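The independence of population size can also be seen by a short calculation that complements the single-lineage argument. As a sketch, let N_t denote the number of genes in generation t (a symbol introduced here only for illustration). Then N_tμ new mutants appear in generation t, and each of them, being selectively neutral, is assumed to have fixation probability 1/N_t:

$$ \lambda =\frac{1}{T}\sum_{t=1}^{T}{N}_t\mu \cdot \frac{1}{N_t}=\frac{1}{T}\sum_{t=1}^{T}\mu =\mu $$

The population size cancels in every generation, so the rate does not depend on how N_t fluctuates.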

Fig. 4.10

Single lineage gene genealogy (From Saitou 2007 [55])

If the mutation rate, μ, does not change for a long time and for diverse group of organisms, we can estimate the mutation rate by estimating the evolutionary rate in the neutrally evolving genomic region. This is the basis of the indirect method for estimating mutation rates discussed in Chap. 2.

3.3 Amount of DNA Variation Kept in Population

If we consider a relatively long DNA fragment, say, composed of n nucleotides, as a “locus,” there are 4^n possible alleles at this locus. If we consider a 1-kb-long DNA fragment, n = 1,000, and 4^1,000 is more than 10^600. Considering this enormous number of possible alleles for even a short DNA fragment, Kimura and Crow (1964; [27]) proposed the infinite allele model. All new mutations are different from each other in this model. The phylogenetic relationship of alleles is not considered in the infinite allele model. Kimura (1969; [28]) thus proposed the infinite site model, where an infinitely long DNA sequence is considered. Now new mutations appear by substituting one nucleotide site which was not changed before. In this sense, this model is similar to the infinite allele model, but now accumulation of nucleotide substitutions can be considered with the infinite site model. This means that a genealogical relationship of alleles is behind this model. In either case, the expected heterozygosity, H, under these two models is

$$ \mathrm{H}=\frac{4{\mathrm{N}}_{\mathrm{e}}\mu }{\left\{1+4{\mathrm{N}}_{\mathrm{e}}\mu \right\}} $$
(4.37)

Ne is effective population size and μ is mutation rate per locus per generation. The numerator of Eq. 4.37, 4Neμ, which is often denoted as M or θ, should be identical with the nucleotide diversity, π, per nucleotide site ([29]; Kimura 1968).
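Equation 4.37 is easy to evaluate numerically. The following sketch uses a hypothetical function name, and the values Ne = 10,000 and μ = 10^−5 per locus per generation are illustrative assumptions, not estimates from data.

```python
def expected_heterozygosity(ne, mu):
    """Eq. 4.37: H = 4*Ne*mu / (1 + 4*Ne*mu)."""
    theta = 4 * ne * mu
    return theta / (1 + theta)


# Illustrative values only: theta = 4 * 10,000 * 1e-5 = 0.4.
print(expected_heterozygosity(10_000, 1e-5))
```

When 4Neμ is much smaller than 1, the denominator is close to 1 and H is approximately 4Neμ itself.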

4 DNA Polymorphism

When we compare gene copies of one locus in one organism, nucleotide sequences may be slightly different because various types of mutation may accumulate. In this case, this locus has genetic or DNA polymorphism. We will classify DNA polymorphisms according to type of mutation (see Table 2.1 of Chap. 2). In classic evolutionary studies, “polymorphism” applies only to one species; however, definition of species is often ambiguous, and there is no clear difference between within-species genetic polymorphism and between-species genetic differences. Therefore, when multiple closely related species are compared, nucleotide sites which have variations are sometimes called polymorphic.

Traditionally, one locus may be called polymorphic if the major allele frequency is equal to or less than 0.99, while it is called monomorphic if the major allele frequency is more than 0.99. However, nowadays we often have sample size of more than 1,000, and if some nucleotide sequences were found to be different from the major allele, this locus may be called polymorphic, even if the frequency of the major allele is more than 0.99.

Although there are no essential differences between haploid and diploid genomes in terms of random genetic drift, the patterns of genetic composition of alleles per locus, called “genotypes,” differ between them. If there are two alleles, A1 and A2, at a locus, the possible genotypes are the same as the alleles for haploids. In diploids, there are three genotypes, or possible combinations of alleles: A1A1, A1A2, and A2A2. Genotypes with a single type of allele are called “homozygotes” and those with two types of alleles are called “heterozygotes,” after the Greek words ομός and έτερος, meaning same and different, respectively. In general, if there are N types of alleles in one population, the possible numbers of homozygotes and heterozygotes are N and N(N−1)/2, respectively.
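The counts N and N(N−1)/2 can be confirmed by enumerating unordered pairs of alleles. This is a minimal sketch; the function name and the allele labels are assumptions.

```python
from itertools import combinations_with_replacement


def diploid_genotypes(alleles):
    """All unordered pairs of alleles, i.e., the possible diploid genotypes."""
    return list(combinations_with_replacement(alleles, 2))


genotypes = diploid_genotypes(["A1", "A2", "A3", "A4"])
homozygotes = [g for g in genotypes if g[0] == g[1]]
heterozygotes = [g for g in genotypes if g[0] != g[1]]
print(len(homozygotes), len(heterozygotes))  # N and N(N-1)/2 for N = 4
```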

4.1 Single Nucleotide Polymorphism (SNP)

DNA polymorphism observed at one nucleotide, the smallest unit of the DNA molecule, is called “single nucleotide polymorphism,” or SNP. The majority of SNPs are created via nucleotide substitution-type mutation, but sometimes a 1-nucleotide-long insertion or deletion is also included as an SNP. An SNP locus is usually biallelic. In nucleotide substitution-type SNPs, there are usually only two nucleotides in the population, for the mutation rate of nucleotide substitution is quite low. However, if we sample many individuals, such as for medical studies of humans, we may encounter SNP loci with three or four nucleotide alleles. For single nucleotide indel SNPs, the two alleles are gap and no-gap.

SNPs observed in protein coding regions may be called cSNPs, and SNPs found in noncoding genomic region may be called gSNPs. There are synonymous and nonsynonymous cSNPs.

If we can estimate the ancestral SNP alleles, we can distinguish typical two alleles into ancestral and derived (mutated) alleles. If one allele has allele frequency lower than 0.5, it is called “minor” allele. Many databases for SNP are available, such as dbSNP (http://www.ncbi.nlm.nih.gov/projects/SNP/).

4.2 Insertions and Deletions (Indel)

Insertion-type and deletion-type (often abbreviated as indel) mutations create indel DNA polymorphisms. Broadly speaking, the repeat number polymorphism and copy number polymorphism discussed later also belong to this type; however, only non-repeat-type indels are usually called indel polymorphisms. When the gap length is one, the indel polymorphism may be included in SNPs, as mentioned above.

Insertions and deletions are detected as gaps in multiple alignments (see Chap. 16). Therefore, if nucleotide sequences are misaligned, we have incorrect indel information. Nucleotide sequences within the same species are expected to have quite high homology; however, if we are not aware of microinversions, misalignment will occur and often gaps are observed.

4.3 Repeat Number Polymorphism

When insertions or deletions occur within the repeat sequences, they are called repeat polymorphisms. Short repeat sequences of 1–5 nucleotides as unit are called “short tandem repeat” (STR) polymorphism or microsatellite DNA polymorphism. When the repeat length is longer, it is called “variable number of tandem repeat” (VNTR) polymorphism or minisatellite DNA polymorphism.

4.4 Copy Number Variation

If the repeat unit is much bigger, say, at least a few kilobases, it is called “copy number variation” (CNV). A classic example is the Rh blood group D+/D- polymorphism. Recently, many genes in the human genome were found to have CNV-type polymorphism [30, 31]. If CNV haplotype of more copy number is fixed in the population, the original gene is duplicated. Therefore, frequent occurrence of CNVs suggests high frequency of gene duplications.

5 Mutation Is the Major Player of Evolution

Mutations can be classified into deleterious, neutral, and advantageous ones according to their effects on organisms (see Chap. 5). Figure 4.11 shows the semiquantitative proportions of these three categories for a typical mammalian genome. Because the majority of the mammalian genome is noncoding, mutations occurring in this region are selectively neutral. If a mutation occurs in DNA regions where important genetic information such as protein amino acid sequences and some RNA sequences is coded, this mutation may be deleterious, and the mutant individual may have less chance of transmitting that gene to its offspring. In contrast, although on rare occasions, some DNA changes will cause the mutant individual to have more offspring than those without mutant genes. This type of mutant is called “advantageous.” In any case, when mutations occur, selectively neutral mutants dominate. If we consider long-term evolution, only a small fraction of these mutations will survive. Because deleterious mutations will soon disappear from the population (see Chap. 5), only neutral and advantageous mutants will survive in the population for a long time.

Fig. 4.11

Comparison between total mutations and surviving mutations under the strict neutral theory

Because the fixation probability for advantageous mutants is higher than that for selectively neutral mutants, the proportion of advantageous mutations among the surviving mutations may be slightly higher than their proportion when they were produced. However, the majority of mutations surviving for long evolutionary time are selectively neutral. This is a clear difference from the prediction made by researchers who advocated the dominant power of natural selection in the 1960s and 1970s. As we will see, the fixation of selectively neutral mutations via stochastic effects is the main power of evolution, and the natural selection that chooses advantageous mutations makes only a limited contribution, although the natural selection that eliminates deleterious mutations is quite effective in keeping the current genetic entity. In short, natural selection is mostly conservative, and chance effects, including the fixation of selectively neutral mutations, are really responsible for the creative nature of evolution.

6 Evolutionary Rate Under the Neutral Evolution

We considered the fate of selectively neutral mutants in Sect. 4.3. In reality, there are deleterious and advantageous mutations. Because the fraction of advantageous mutations is expected to be much smaller than that of deleterious ones, we consider only neutral and deleterious mutations. Let us denote the fraction of neutral mutations as f. This fraction has a rate of evolution identical with the mutation rate μ. The remaining fraction, 1−f, consists of deleterious mutations, and all of them are assumed not to be fixed and thus not to contribute to gene substitution. Thus, the evolutionary rate λ becomes

$$ \lambda =f\cdot \mu +\left(1-f\right)\cdot 0= f\mu $$
(4.38)

The value of f varies from the genomic region to region, as we will see in this section.

6.1 Molecular Clock

Vertebrate hemoglobin consists of globin (protein) and heme (porphyrin), and Fe ion in heme will attach to oxygen. There are two major globin gene families, α and β. Figure 4.12 shows the multiple alignment of 11 vertebrate β globin amino acid sequences (see Chap. 14 for the procedure). Amino acid sequence names are composed of UniProt accession number and genus. Only human amino acid sequence is fully written at the top, and the amino acids of remaining sequences are shown only when they are different from the corresponding human amino acid. If amino acid of nonhuman species globin is identical, dot (.) is given. For example, horse β globin amino acid sequence is different from that of human at 24 sites out of 146 total amino acids. This proportion, p (0.164 = 24/146), can be used to estimate the number, d, of amino acid substitutions per amino acid site:

Fig. 4.12

Multiple alignment of 11 vertebrate β-globin sequences

$$ \mathrm{d}=-{ \log}_{\mathrm{e}}\left(1-\mathrm{p}\right) $$
(4.39)

This number is often called evolutionary distance, and d stands for “distance.” Please see Chap. 15 for derivation of this equation. In any case, using this equation, d becomes 0.18 from p. Evolutionary distances between human and the other 10 vertebrate species are plotted on the vertical axis of Fig. 4.13. The horizontal axis of this figure represents the divergence time between human and the corresponding species. Interestingly, evolutionary distances and divergence times are more or less proportional. This rough constancy of the evolutionary rate is often called the “molecular clock” after Zuckerkandl and Pauling (1965; [32]). It should be noted that evolutionary distances were obtained from molecular data determined in wet experiments, while divergence times were obtained from paleontological studies.
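The human and horse example can be reproduced with Eq. 4.39 as follows. The function name is an assumption; the counts (24 differences among 146 sites) are those given in the text.

```python
from math import log


def evolutionary_distance(p):
    """Eq. 4.39: d = -log_e(1 - p), where p is the proportion of different sites."""
    return -log(1 - p)


p = 24 / 146  # human versus horse beta globin
d = evolutionary_distance(p)
print(round(p, 3), round(d, 2))
```

The distance d is slightly larger than p because it corrects for multiple substitutions at the same site.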

Fig. 4.13

Approximate linearity or molecular clock for vertebrate β-globin (Based on data of Fig. 4.12)

Existence of the molecular clock is easily explained under the neutral theory. If the mutation rate (μ) and the fraction (f) of neutral mutations are constant for a long evolutionary time, the evolutionary rate λ (=fμ) should be constant according to Eq. 4.38. In contrast, if the evolutionary rate is mainly determined by positive selection, not only the mutation rate but also the population size and the selection coefficients of mutants affect the evolutionary rate, and the latter two are known to vary considerably. Therefore, the approximate constancy of the evolutionary rate is one line of evidence supporting the neutral theory of molecular evolution.

Even if we do not assume the constancy of the evolutionary rate, it is possible to consider the average rate of evolution by comparing two sequences. Figure 4.14 shows a schematic phylogenetic tree of two sequences, 1 and 2. They have the common ancestor T years ago, and the lineage-specific evolutionary rates, λ1 and λ2, are given. Thus, the average rate, λ, of evolution between sequences 1 and 2 becomes

Fig. 4.14

Divergence of two lineages

$$ \lambda =\frac{\lambda_1+\lambda_2}{2} $$
(4.40)

Let us denote the evolutionary distance between sequences 1 and 2 as d. Then,

$$ d=\lambda \cdot 2T $$
(4.41)

We can thus estimate the evolutionary rate:

$$ \lambda =\frac{d}{2T} $$
(4.42)

If the constancy of the evolutionary rate approximately holds, we can estimate the divergence time:

$$ \mathrm{T}=\frac{\mathrm{d}}{2\lambda } $$
(4.43)

This equation is often used because the divergence time of two sequences is frequently unknown, while molecular data such as amino acid sequences or DNA sequences can be easily determined.
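Equations 4.40, 4.41, and 4.43 can be combined in a small round-trip check. All names and numerical values below (lineage-specific rates and a divergence time of 50 million years) are hypothetical and serve only to illustrate the arithmetic.

```python
def mean_rate(rate_1, rate_2):
    """Eq. 4.40: average of the two lineage-specific rates."""
    return (rate_1 + rate_2) / 2


def distance(rate, t):
    """Eq. 4.41: evolutionary distance after divergence time t."""
    return rate * 2 * t


def divergence_time(d, rate):
    """Eq. 4.43: T = d / (2 * lambda)."""
    return d / (2 * rate)


# Hypothetical rates (substitutions/site/year) and divergence time (years).
lam = mean_rate(0.8e-9, 1.2e-9)
d = distance(lam, 50e6)
print(d)                        # distance implied by Eq. 4.41
print(divergence_time(d, lam))  # recovers the assumed divergence time
```

The estimate of T is only as reliable as the assumed rate λ, which is why the approximate constancy of the rate is required.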

6.2 Heterogeneous Evolutionary Rates Among Proteins

The fraction, f, of neutral mutations in Eq. 4.38 may vary in various situations. Let us first consider the heterogeneity among different proteins. Table 4.2 lists the rates of amino acid substitutions per amino acid site per year for 12 proteins. The evolutionary rates vary considerably, from 0.01 to 9.0, and the rate for fibrinopeptide is about 900 times higher than that for histone H4. Histone is the major basic protein of the nucleosome that binds DNA, an acid. The very low evolutionary rate for this protein indicates that f is quite small, and the majority of amino acid changing mutations are deleterious.

Table 4.2 Rates of amino acid substitutions (From Nei 1987; [2])

Fibrinopeptide is the leftover of fibrinogen, which is cut into fibrin and fibrinopeptide. The main function of blood coagulation resides in fibrin, and the function of fibrinopeptide is just to prevent fibrin from becoming fibrous until it is detached from the fibrin part. It is thus understandable that many amino acid substitutions in the fibrinopeptide gene may be permissible; hence, its f became high.

Because of this relationship between f values and protein functions, it is routine to discuss the importance of one function in terms of its rate of amino acid substitutions. If the rate is slow, the protein may be called “quite important,” and it may be “less important” if the rate is relatively high.

6.3 Heterogeneous Evolutionary Rates Among Protein Parts

One protein has its specific 3D structure (see Chap. 1), and the functional part is often localized. Figure 4.15 is a 3D structure of hemoglobin, or globin and heme. Globin protein is mostly composed of α helix, and there is heme pocket that grabs heme. Kimura and Ohta [33] estimated the rate of amino acid substitutions for four parts of α and β globins. Table 4.3 shows the results. As expected, the rate at heme pocket, where the oxygen-transporting heme is anchored, is lowest compared to the other three parts.

Fig. 4.15

The 3D structure of globin and heme (pointed by arrow)

Table 4.3 The rate of amino acid substitutions for various protein components of α and β globins (Data from Kimura and Ohta 1973; [33])

Domains are often defined for many proteins because of their wide conservation (see Chap. 1). Therefore, it is natural for a domain to have a lower evolutionary rate than the remaining part of the protein. For example, hox genes have a highly conserved homeobox domain. If we compare amino acid sequences of human and mouse orthologous HoxA1–HoxA5 genes, amino acid identities are certainly higher for the homeodomain region. Table 4.4 shows the estimated rate of amino acid substitution for these genes using Eq. 4.39. As expected, the evolutionary rate of homeobox domains is quite low compared to the remaining parts.

Table 4.4 Comparison of amino acid identity between homeodomain and the other regions of HoxA. From Saitou 2007 [55]

6.4 Heterogeneous Evolutionary Rates Among Organisms

The evolutionary rate is proportional to f and μ. Therefore, if μ, the mutation rate, differs among various lineages, the molecular clock no longer holds. This is the case for the rodent lineage and other mammalian lineages, as first clearly shown by Wu and Li [34, 35]. Hominoids and Old World monkeys diverged ~30 million years ago. Because the human and rhesus macaque genomic distance is ~0.06 [36], the average evolutionary rate in terms of nucleotide substitutions is, from Eq. 4.42, λ[primates] = 0.06/[2 × 30 million] = 1 × 10^−9/site/year. The genomic distance between mouse and rat in terms of fourfold degenerate synonymous sites (see Sect. 4.7.1) is ~0.15 [37]. The divergence time between mouse and rat is not well known, so we use a range of 10–20 million years. Then λ[rodents] = 0.15/[2 × (10–20) million] = 4–8 × 10^−9/site/year. Because mammalian genomes consist mostly of junk DNA (see Sect. 4.7.2), genome-wide evolutionary rates approximate the mutation rate. It is clear that the mutation rate of rodents is 4–8 times higher than that of primates.
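The arithmetic above can be checked directly with Eq. 4.42. The function and variable names are assumptions; the distances and divergence times are those quoted in the text.

```python
def rate_per_year(d, t_years):
    """Eq. 4.42: lambda = d / (2T)."""
    return d / (2 * t_years)


primates = rate_per_year(0.06, 30e6)
# Lower and upper bounds from the assumed 20 and 10 million year divergence times.
rodents = (rate_per_year(0.15, 20e6), rate_per_year(0.15, 10e6))
ratios = tuple(r / primates for r in rodents)
print(primates, rodents, ratios)
```

The unrounded rodent rates are 3.75–7.5 × 10^−9/site/year, which the text rounds to 4–8 × 10^−9.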

Compared to ordinary DNA genome organisms, genomes of RNA viruses such as influenza virus, SARS coronavirus, and HIVs (see Chap. 8) are RNA molecules, and their evolutionary rates are about a million times higher than those of DNA genome organisms.

If the value of f, the fraction of neutral mutations, varies among lineages for a particular protein gene, the evolutionary rate obviously changes. In this case, the molecular clock no longer holds, yet this variation naturally follows the pattern of neutral evolution. Although the molecular clock is often considered an important characteristic of neutral evolution, it comes from the simple relationship shown in Eq. 4.38. Therefore, if f and/or μ changes, the evolutionary rate should change, according to the principle of neutral evolution. Figure 4.16 shows the evolutionary history of rodent α crystallin [38]. The amino acid sequence of this protein is identical among mouse, rat, and hamster, and their sequence is identical with that of the common ancestor of all rodents. In marked contrast to that situation, nine amino acid substitutions accumulated in the mole rat lineage during 40 million years. The mole rat eye is diminished, and apparently the importance of α crystallin, the major lens protein, is reduced. It is natural to expect a higher fraction (f) of selectively neutral mutations for the mole rat than for other rodents, whose eyes are necessary for their existence.

Fig. 4.16

Evolutionary history of rodent α-crystallin. Based on [38]

6.5 Unit of Evolutionary Rate

We discussed the unit of mutation rate in Chap. 2. Because mutation is the main player of evolution, unit of the evolutionary rate is closely related to that discussion. While the generation time for many organisms is not known, divergence times of some organism groups such as vertebrates have been well documented thanks to paleontological studies. Thus, the rate of evolution is often obtained by Eq. 4.42 and the time unit is years, not generations.

7 Various Features of Neutral Evolution

We discuss features of neutral evolution in terms of preponderance of synonymous substitutions to nonsynonymous ones, pure neutral evolution of junk DNA and pseudogenes, and neutral evolution at the macroscopic levels and at genomic levels.

7.1 Synonymous and Nonsynonymous Substitutions

If synonymous or nonsynonymous mutations (see Chap. 2) are fixed in populations, these are called synonymous and nonsynonymous substitutions, respectively. Historically, synonymous substitutions are called silent substitutions, and nonsynonymous ones are amino acid replacing substitutions.

If we consider the consequences of synonymous mutations, it is easy to expect that they are selectively neutral with original alleles because produced proteins are identical with each other. Nonsynonymous mutations may become deleterious because they may disrupt or reduce the protein function. As we saw in evolution of fibrinopeptides, it is also possible that the effect of a nonsynonymous substitution may be very minor and essentially selectively neutral. It is therefore a good approximation that f for synonymous mutations is 1, and the evolutionary rate is identical with mutation rate. As for nonsynonymous mutations, f is smaller than 1, and the evolutionary rate of nonsynonymous substitutions is expected to be smaller than that for synonymous substitutions. As we will see in Chap. 5, the evolutionary rate of nonsynonymous substitutions may become larger than the mutation rate when a special type of natural selection is operating, in which any amino acid change is advantageous. In this case, the rate of nonsynonymous substitutions will be higher than that of synonymous substitutions. Figure 4.17 shows a schematic comparison of the rates of synonymous and nonsynonymous substitutions.
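The distinction between synonymous and nonsynonymous changes can be made concrete with the standard genetic code. The sketch below is only an illustration and not a method for estimating Ds and Dn; the names (`CODON_TABLE`, `classify`, `synonymous_fraction`) are assumptions, and changes to or from stop codons (`*`) are simply counted as nonsynonymous.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ..., GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify(codon_before, codon_after):
    """Return 'synonymous' if both codons encode the same amino acid."""
    if CODON_TABLE[codon_before] == CODON_TABLE[codon_after]:
        return "synonymous"
    return "nonsynonymous"


def synonymous_fraction(codon):
    """Fraction of the 9 single-nucleotide changes that keep the amino acid."""
    same = 0
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                same += CODON_TABLE[mutant] == CODON_TABLE[codon]
    return same / 9


print(classify("CTT", "CTC"))      # Leu -> Leu
print(classify("GAA", "GTA"))      # Glu -> Val
print(synonymous_fraction("GGT"))  # third position of glycine codons is fourfold degenerate
```

Most single-nucleotide changes alter the amino acid, which is one reason why a nonsynonymous mutation is more likely than a synonymous one to be exposed to purifying selection.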

Fig. 4.17

A schematic comparison of synonymous and nonsynonymous substitutions

Because the numbers of synonymous substitutions (Ds) and nonsynonymous substitutions (Dn) are simultaneously estimated for the same proteins of different species (or different paralogous genes), comparison of Ds and Dn values is routinely conducted in many studies of genome comparison. Figure 4.18 is one such example. In both (a) for mouse and rat and (b) for human and rhesus macaque, Ds > Dn for the majority of protein coding genes.

Fig. 4.18 Comparison of synonymous substitutions (horizontal axis) and nonsynonymous substitutions (vertical axis). (a) Comparison between mouse and rat. (b) Comparison between human and rhesus macaque

It should be noted that the rate of synonymous substitutions may not be identical with the mutation rate, because biases in codon usage exist (Ikemura 1985; [39]). We will discuss the consequences of this sort of purifying selection on synonymous substitutions in Chap. 5.

7.2 Junk DNA

Susumu Ohno characterized mammalian genomes with the phrase “So much ‘junk’ DNA in our genome” as early as 1972 [40]. Junk DNA means functionless DNA. In fact, only 1.5 % of the human genome is used for protein coding [41], and the rest is mostly junk: interspersed repeats (LINEs and SINEs), microsatellites, other intergenic regions, and introns (see Chap. 10). It is true that a small fraction of noncoding genomic regions is highly conserved [42, 43], and these regions are expected to have some function, for example as enhancers. Some SINEs are even known to have acquired an important function during mammalian evolution [44, 45].

It is still true that the majority of noncoding genomic regions are functionless and are just junk DNA. Recently there have been some reports of transcription in many noncoding regions [46, 47]. However, these results were obtained with problematic ChIP-chip techniques and were found to be artifacts [48] when checked with ChIP-seq techniques.

Because the f value of Eq. 4.38 is 1 for junk DNA and for synonymous sites, their evolutionary rates are expected to be similar, if we ignore heterogeneity of mutation rates in one genome. In fact, the number (~0.15) of nucleotide substitutions per site in intergenic regions for mouse and rat genomes was shown to be quite similar to that of synonymous substitutions ([37]).

If we ignore the small portion of functional DNA that is highly conserved among diverse organisms, the majority (more than 90%; see Babarinde and Saitou 2013 [56]) of mammalian genomes, or of all vertebrate genomes, is junk DNA. Therefore, the genome-wide divergence between two species is a good approximation of the consequence of purely neutral evolution.

7.3 Pseudogenes

Pseudogenes are DNA sequences that are homologous to functional genes but are themselves no longer functional. For example, if a DNA sequence highly homologous to a known functional gene contains frameshift mutations and/or stop codons, it is called a “pseudogene,” because a functional protein is not expected to be produced. Pseudogenes are often products of gene duplications. Because of their nonfunctional nature, pseudogenes should be genuine members of junk DNA. Figure 4.19 shows one of the initial analyses of pseudogene evolution, by Li, Gojobori, and Nei (1980; [49]).

Fig. 4.19 Formation of a pseudogene and its nonfunctionalization. The pseudogene of gene A diverged from its functional counterpart (gene A) Td years ago and became nonfunctional Tn years ago. Gene B in mouse diverged from its rabbit counterpart (gene C) T years ago. The evolutionary rate of functional genes is assumed to be ai (i = 1, 2, or 3) for the i-th codon position, while that of the pseudogene is b for all three codon positions (From Li et al. (1980); [49])
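
The assumptions stated in the caption of Fig. 4.19 can be turned into a small calculation. The sketch below is derived only from those assumptions and is not necessarily the exact formulation used by Li et al.: the functional gene evolves at rate ai for the whole Td years, while the pseudogene lineage evolves at rate ai until it loses its function and at rate b for the last Tn years. The function name and all numerical values are hypothetical.

```python
def expected_divergence(a_i, b, t_d, t_n):
    """Expected substitutions per site between a pseudogene and its functional
    counterpart at codon position i, under the assumptions of Fig. 4.19.

    a_i : rate in functional genes at codon position i (per site per year)
    b   : rate in the pseudogene after nonfunctionalization
    t_d : years since the two genes diverged (Td)
    t_n : years since the pseudogene became nonfunctional (Tn), with Tn <= Td
    """
    if not 0 <= t_n <= t_d:
        raise ValueError("Tn must lie between 0 and Td")
    functional_branch = a_i * t_d                     # functional for all Td years
    pseudogene_branch = a_i * (t_d - t_n) + b * t_n   # functional first, then junk
    return functional_branch + pseudogene_branch      # = 2*a_i*Td + (b - a_i)*Tn


# Hypothetical values, for illustration only.
d = expected_divergence(a_i=1e-9, b=5e-9, t_d=50e6, t_n=20e6)
print(round(d, 4))
```

When Tn = 0 the expression reduces to 2 × ai × Td, the usual divergence between two functional genes. Because b is shared by all three codon positions while ai differs among them, observed divergences at the three positions can in principle be used to estimate b and Tn.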

There are four types of gene duplication (see Chap. 2). Among them, RNA-mediated duplication produces intronless sequences via reverse transcription of mRNAs. These cDNAs are integrated into a DNA region unrelated to their place of origin, away from the series of gene regulatory sequences that exists there. Therefore, such cDNA inserts are almost always ‘dead on arrival’. A clear elevation of the evolutionary rate is seen for the intronless, or processed, pseudogene of the mouse p53 gene. The estimated numbers of nucleotide substitutions between M. musculus and M. leggada are 0.0157 and 0.0651 for functional genes and pseudogenes, respectively (data from Table 16.1C; originally from Ohtsuka et al. (1996; [57])).
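
The size of this rate elevation can be checked with simple arithmetic on the two estimates quoted above. The sketch below rests on assumptions that are not stated in the source: that both estimates cover the same divergence time, that the mutation rate is the same for the two sequences, and that f = 1 for the pseudogene. Under those assumptions the ratio of the two values gives a rough value of f for the functional gene. The variable names are hypothetical.

```python
# Estimated numbers of nucleotide substitutions between M. musculus and
# M. leggada for the p53 gene, as quoted in the text.
functional_divergence = 0.0157
pseudogene_divergence = 0.0651

# How many times faster the processed pseudogene has evolved.
rate_ratio = pseudogene_divergence / functional_divergence

# Rough fraction of neutral mutations in the functional gene, ASSUMING
# f = 1 for the pseudogene and equal time and mutation rate for both.
implied_f = functional_divergence / pseudogene_divergence

print(f"rate ratio: {rate_ratio:.2f}, implied f: {implied_f:.2f}")
```

The pseudogene has thus accumulated roughly four times as many substitutions as the functional gene, which is the pattern expected if most mutations in the functional gene are removed by purifying selection.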

Nonfunctionalization can happen without gene duplication. Vitamins are molecules that are required only in small quantities but are essential for organisms, especially humans, to survive. By definition, vitamins are not produced by the organism itself and must be taken in as a part of food. Their very existence is enigmatic, for these molecules come from other organisms that produce them. If vitamins are so important, why are they not produced by a certain species such as human? The neutral theory of evolution easily resolves this paradox. If a vitamin is abundant in everyday foods, even mutants that have lost the ability to produce it are selectively neutral compared to wild types that can still produce it through the existing enzymatic pathway.

Vitamin C, or ascorbic acid, is a good example. If appropriate intake of vitamin C is stopped for a long time, humans develop scurvy. King and Jukes (1969; [50]) already predicted that the lack of ascorbic acid production could be explained by assuming neutral evolution. Not only humans but also all primates other than prosimians, as well as elephants, guinea pigs, and fruit bats, lack the ability to produce ascorbic acid [51]. Medaka, a teleost fish, also does not produce ascorbic acid [52]. In fact, nonfunctionalization of the L-gulono-γ-lactone oxidase (enzyme number EC 1.1.3.8) gene was confirmed by Nishikimi and his collaborators [53].

A more drastic case of pseudogene formation without gene duplication is found in parasitic bacterial genomes. Mycobacterium leprae, the bacterium that causes leprosy, was found to have many pseudogenes in its genome ([54]). This is because this bacterium hides deep in the host body and receives many nutrients from the host.

Gene function is often quite complex, and it is not easy to determine whether a “pseudogene” is really nonfunctional. Even if no protein is produced, the mRNA or even the DNA sequence itself may still have some function. Therefore, when we discuss the evolution of pseudogenes, it may be too simplistic to assume that f, the fraction of neutral mutations, is 1 for a pseudogene. A “pseudogene” with some function is not surprising, for such sequences were so named only on the basis of sequence comparison.

7.4 Neutral Evolution at the Macroscopic Level

So far, we have discussed the evolution of nucleotide and amino acid sequences and have seen that the fixation of selectively neutral mutations is the major process of evolution. It is thus natural to expect that evolution at the macroscopic, or so-called phenotypic, level also proceeds mostly in a neutral fashion. Unfortunately, this logically derived conjecture does not seem to be accepted by many evolutionary biologists. Ever since Charles Darwin, many biologists have been enchanted by seemingly powerful positive selection. They are biologists who study the macroscopic morphology of organisms, animal behavior, developmental processes, and so on. As we will see in Chap. 5, we should be careful about discussing adaptation without clear demonstration at the molecular level.

It may still be optimistic to expect a rapid expansion of our knowledge of the genetic basis of developmental and behavioral traits in the near future. However, modern biology is proceeding in this direction, and I personally hope that the superficial dichotomy between molecules (genotypes) and phenotypes will disappear sooner or later. Evolutionary genomics is at the foundation of this edifice of modern biology. It should be added that Nei’s (2013) recent book “Mutation-driven evolution” ([58]) covers many interesting topics related to this chapter.