TMRS: an algorithm for computing the time to the most recent substitution event from a multiple alignment column
Abstract
Background
As the number of sequenced genomes grows, researchers have access to an increasingly rich source for discovering detailed evolutionary information. However, the computational technologies for inferring biologically important evolutionary events are not sufficiently developed.
Results
We present algorithms to estimate the evolutionary time (\(t_{\text {MRS}}\)) to the most recent substitution event from a multiple alignment column by using a probabilistic model of sequence evolution. As the confidence in estimated \(t_{\text {MRS}}\) values varies depending on gap fractions and nucleotide patterns of alignment columns, we also compute the standard deviation \(\sigma\) of \(t_{\text {MRS}}\) by using a dynamic programming algorithm. We identified a number of human genomic sites at which the last substitutions occurred between two speciation events in the human lineage with confidence. A large fraction of such sites have substitutions that occurred between the concestor nodes of Hominoidea and Euarchontoglires. We investigated the correlation between tissue-specific transcribed enhancers and the distribution of the sites with specific substitution time intervals, and found that brain-specific transcribed enhancers are threefold enriched in the density of substitutions in the human lineage relative to expectations.
Conclusions
We have presented algorithms to estimate the evolutionary time (\(t_{\text {MRS}}\)) to the most recent substitution event from a multiple alignment column by using a probabilistic model of sequence evolution. Our algorithms will be useful for Evo-Devo studies, as they facilitate screening potential genomic sites that have played an important role in the acquisition of unique biological features by target species.
Keywords
Phylogenetic trees, Comparative genomics, Probabilistic models
Background
As sequenced genomes continue to accumulate, they provide an increasingly rich source of detailed evolutionary information. The UCSC genome browser provides multiple genome alignments for 100 vertebrate species, including humans (the multiz100way track) [1, 2, 3].
In previous decades, multiple DNA alignments were often used to reconstruct species trees and ancestral nucleotide states [4], and many algorithms and software tools were developed for such purposes. Some of the most widely used algorithms are the neighbor-joining algorithm [5], the maximum likelihood method [4], and the Bayesian Markov chain Monte Carlo method [6]. These algorithms usually assume evolutionary models in which each nucleotide mutates stochastically over evolutionary time, and output the most consistent phylogenetic tree from the \((2n-3)!!\) possible rooted or \((2n-5)!!\) possible unrooted trees for n species. On the other hand, since the species tree of the 100 vertebrates of multiz100way has largely been resolved by previous studies [7], finding functional genomic sites, rather than determining the phylogenetic tree, has become a more important application of multiple genome alignments in recent years.
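To give a sense of how quickly the tree space grows, the double-factorial counts quoted above can be evaluated directly. The following is a minimal Python sketch; the function names are illustrative and do not come from any cited software.

```python
def double_factorial(k):
    """Return k!! = k * (k - 2) * (k - 4) * ... down to 1 or 2 (1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def rooted_tree_count(n_species):
    """Number of rooted binary tree topologies for n_species labeled leaves."""
    return double_factorial(2 * n_species - 3)


def unrooted_tree_count(n_species):
    """Number of unrooted binary tree topologies for n_species labeled leaves."""
    return double_factorial(2 * n_species - 5)


if __name__ == "__main__":
    for n in (4, 10, 100):
        print(n, rooted_tree_count(n), unrooted_tree_count(n))
```

For four species this gives 15 rooted and 3 unrooted topologies. The count for 100 species is far too large to enumerate, which is why a previously resolved species tree is taken as given here.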
As it is difficult to visually inspect functional regions from 100-species alignments, computing genome-wide summary statistics is very important. Measuring the strength of negative or positive selection is among the most popular analyses for screening functional regions of genomes [8, 9, 10, 11, 12, 13, 14]. These statistics are computed using probabilistic models that model the stochastic processes of DNA mutations along phylogenetic species trees, which are used in tree reconstruction [4, 5, 6], and detect genomic regions that show smaller or larger mutation rates using likelihood ratio tests or similar probabilistic computations.
Such statistics have advantages over simpler statistics that do not assume a particular evolutionary model, such as nucleotide frequency of alignment columns and pairwise mismatch rates. By using a phylogenetic tree, we can appropriately count the number of ancestral mutations that are widespread within extant species. Further, stochastic processes can account for multiple nucleotide mutations whose effects are not negligible when we study evolutionarily distant species. However, conservation/divergence measures alone are not sufficient to extract all evolutionarily important events from the \(4^{100}\) potential nucleotide patterns of a 100-species alignment column.
In this study, we develop algorithms to compute three statistics, \(t_{\text {MRS}}\), \(\sigma\), and q, for each column of a multiple genome alignment based on an evolutionary model that is similar to those described above. \(t_{\text {MRS}}\) is the evolutionary time to the most recent substitution event that occurred along the lineage of a given target species in the phylogenetic tree. Since the confidence in estimated \(t_{\text {MRS}}\) values varies markedly among alignment columns depending on gap fractions and the complexity of nucleotide patterns (see Fig. 1 for an explanation), we also compute the standard deviation \(\sigma\) of \(t_{\text {MRS}}\). Further, we compute the probability q that there is no mutation in the target lineage, because the estimated \(t_{\text {MRS}}\) value has no meaning in such cases. By using q to filter out sites with a non-negligible probability of nucleotide conservation over the entire target lineage, we can remove highly conserved sites. By comparing \(t_{\text {MRS}}\) with speciation time points, we can categorize sites by the groups of species that share mutation effects with the target species. Such detailed information is difficult to obtain from conservation measures. Our algorithms can be a very useful tool for screening the genomic sites that may have been involved in the acquisition of unique biological features by target species.
In the next section, we describe our algorithms to compute \(t_{\text {MRS}}\) and our data processing procedures. We first explain the \(t_{\text {MRS}}\) algorithm on a single edge of a phylogenetic tree, and then generalize it to account for the entire tree. The algorithms for computing \(\sigma\) and q are described in Additional file 1, as they are very similar to that of \(t_{\text {MRS}}\). In the Results section, we empirically show the correctness of our algorithms by posterior sampling of mutation histories. We also show that our algorithm is fast enough to be applied to the entire human genome, and that the \(t_{\text {MRS}}\) statistic is very different from other statistics used to detect evolutionary conservation/divergence of genomic sites. We then apply our algorithms to the multiz100way dataset and investigate distributions of \(t_{\text {MRS}}\) in different genomic contexts. In particular, we investigate the correlation between the \(t_{\text {MRS}}\) distribution on bidirectionally transcribed enhancers and the tissue specificity of enhancer activities, and find that brain-specific transcribed enhancers are threefold enriched in the density of sites whose \(t_{\text {MRS}}\) is located in the human lineage.
Method
We first derive formulas for \(t_{\text {MRS}}\) and other variables for an edge of a phylogenetic tree, and then describe how to generalize them into statistics for the entire phylogenetic tree.
Single edge case
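As an illustration of what a single-edge computation involves, the following Python sketch evaluates the posterior distribution of the last substitution time on one edge whose parent and child states differ. It is a sketch under stated assumptions, not the derivation used in this paper: it substitutes the Jukes-Cantor (JC69) model, scaled to one substitution per unit time, for the strand-symmetric model so that the transition probabilities have a closed form, and it integrates numerically with the midpoint rule. All function names are hypothetical.

```python
import math


def p_diff(t):
    """JC69 probability of ending in one specific different state after time t."""
    return 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)


def last_substitution_density(s, edge_length):
    """Posterior density that the last substitution lies at time s above the child.

    Conditioned on parent and child states being different. The lineage must be
    in any state other than the child's state just before the jump
    (probability 1 - p_diff), jump into the child's state (rate 1/3), and then
    stay unchanged for time s (probability exp(-s)).
    """
    before_jump = 1.0 - p_diff(edge_length - s)
    return before_jump / 3.0 * math.exp(-s) / p_diff(edge_length)


def edge_t_mrs(edge_length, steps=10000):
    """Return (normalization, mean, standard deviation) by midpoint integration."""
    h = edge_length / steps
    m0 = m1 = m2 = 0.0
    for k in range(steps):
        s = (k + 0.5) * h
        weight = last_substitution_density(s, edge_length) * h
        m0 += weight
        m1 += s * weight
        m2 += s * s * weight
    mean = m1 / m0
    variance = max(m2 / m0 - mean * mean, 0.0)
    return m0, mean, math.sqrt(variance)


if __name__ == "__main__":
    norm, mean, sd = edge_t_mrs(0.5)
    print(norm, mean, sd)
```

The normalization integrates to one because a changed state guarantees at least one substitution on the edge. When the two end states are equal, the probability of no substitution must be handled separately, which is the role of q in the main text.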
Strand symmetric rate matrix
Rate parameters
\(R_{\text {Symmetric}} = \begin{pmatrix} * & \alpha & \beta & \gamma \\ \eta & * & \delta & \epsilon \\ \epsilon & \delta & * & \eta \\ \gamma & \beta & \alpha & * \end{pmatrix}, \quad R_{\text {GTR}}= \begin{pmatrix} * & \pi _A \alpha & \pi _A \beta & \pi _A \gamma \\ \pi _C \alpha & * & \pi _C \delta & \pi _C \epsilon \\ \pi _G \beta & \pi _G \delta & * & \pi _G \eta \\ \pi _T \gamma & \pi _T \epsilon & \pi _T \eta & * \end{pmatrix}\)
Time to most recent substitution \(t_{\text {MRS}}\). These panels schematically show situations that may affect the confidence of inferred \(t_{\text {MRS}}\) values. The leaf nodes corresponding to the target species are indicated by rectangles. In the left figure, we expect that the last substitution occurred between nodes x and y, so \(t_{\text {MRS}}\) will lie between \(t_1\) and \(t_1+t_2\). In the middle figure, the pattern of the alignment column is not simple, and the state of node y can be either A or G. Therefore, the inferred \(t_{\text {MRS}}\) will have a large variance, ranging between \(t_1\) and \(t_1+t_2+t_3\). In the right figure, there is an ambiguous nucleotide in the column. In such cases, the inferred \(t_{\text {MRS}}\) value is the same as that inferred from only three species, and the confidence will accordingly be lower than when all four nucleotides are known.
Phylogenetic tree case
Inside and outside variables. \(c_k\) denotes the concestor nodes on the target lineage. \(b_k\) denotes the sibling node of \(c_{k-1}\). \(\alpha (b_k,*)\) represents the inside variable, while \(\beta (c_k,*)\) represents the outside variable. \(\gamma (b_k,*)\) represents a dynamic programming variable in Eq. 2 in the main text
Similar algorithms can be derived for the standard deviation \(\sigma\) and the probability of no mutation q as described in Additional file 1.
Alignment gaps and ambiguous characters
We treat gaps and ambiguous nucleotide characters at non-target leaves as missing characters; that is, we sum the probabilities over all possible nucleotides at those leaves. Because the transition probabilities out of each node sum to one, the estimated values are then the same as those computed from the reduced phylogenetic tree and alignment column obtained by removing the gaps and ambiguous characters together with the corresponding edges of the tree. Missing characters therefore increase the standard deviation \(\sigma\) of the estimated \(t_{\text {MRS}}\). On the other hand, we do not consider sites at which the character of the target species is a gap or an ambiguous character.
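The equivalence between summing over a missing character and pruning the corresponding edge can be checked numerically. The sketch below is an illustration only: it assumes the Jukes-Cantor (JC69) model and a star-shaped tree so that the example stays short, whereas the paper uses the strand-symmetric model on the full species tree. The function names are hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"


def jc69_transition(t):
    """JC69 transition matrix for branch length t (substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    same = 0.25 + 0.75 * decay
    diff = 0.25 - 0.25 * decay
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def leaf_vector(character):
    """One-hot vector for a known base; all ones for a gap or ambiguous base."""
    if character in ("A", "C", "G", "T"):
        return [1.0 if base == character else 0.0 for base in NUCLEOTIDES]
    return [1.0, 1.0, 1.0, 1.0]  # missing data: sum over every possible base


def star_tree_likelihood(leaves):
    """Column likelihood for a star tree given (character, branch_length) pairs."""
    total = 0.0
    for root_state in range(4):
        term = 0.25  # uniform JC69 equilibrium frequency at the root
        for character, branch_length in leaves:
            transition = jc69_transition(branch_length)
            observed = leaf_vector(character)
            term *= sum(transition[root_state][x] * observed[x] for x in range(4))
        total += term
    return total


if __name__ == "__main__":
    with_gap = star_tree_likelihood([("A", 0.1), ("G", 0.2), ("-", 0.3)])
    reduced = star_tree_likelihood([("A", 0.1), ("G", 0.2)])
    print(with_gap, reduced)
```

The two likelihoods agree because each row of the transition matrix sums to one, so the gapped leaf contributes a factor of one for every root state.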
Software availability
We implemented our algorithms in the C++ language. The resulting software (‘TMRS’) is available at our website [21].
Dataset and data processing
We downloaded the MAF-formatted Multiz100way multiple alignment files from the UCSC genome browser site, which consists of multiple genome alignments of 100 vertebrate species, including the human genome version hg38. We also downloaded the phylogenetic tree data from the PhyloP track, whose edge lengths are trained using fourfold degenerate (4d) sites of RefSeq genes under the general time reversible model.
We used the topology of the PhyloP phylogenetic tree as it is, and trained only the edge lengths of the tree as well as the rate parameters of the strand symmetric model. For this, we collected alignment columns at human 4d sites based on gene annotations of the RefGene track from the UCSC site, following Siepel et al. [8] and Pollard et al. [9]. The reason for using 4d sites is the higher quality of alignments and higher coverage of distant species in the alignments [8, 9], though they may be subject to various evolutionary constraints. In order to investigate the uncertainty of trained parameters, we randomly sampled 100 sets of 4d sites from about three million 4d sites in the human genome such that each set has a given number of sites, ranging from 1 to \(10^5\). We generated an alignment of concatenated genomic alignment columns, and trained parameters based on the maximum likelihood method [22], using the gradient-based L-BFGS-B optimization package [17].
For studying differences in \(t_{\text {MRS}}\) distributions among genes, we sampled 100,000 alignment columns from intergenic, CDS, 3′UTR, and 5′UTR sequences based on ‘Gencode v24 Basic’ track gene models from the UCSC site [3].
Andersson et al. [23] identified genomic elements called transcribed enhancers in human and other genomes, where short RNAs are produced by bidirectional transcription as a result of chromatin openings. From the FANTOM5 enhancer atlas site [24], we downloaded the coordinates of transcribed enhancers and the list of tissue- and cell-specific enhancers where bidirectional transcription occurs in a tissue- and/or cell-specific manner.
Results and discussions
Parameter optimization and performance tests
Convergence of optimized parameters. The upper left panel shows the distributions of the pairwise relative differences of inferred parameters. The x-axis represents the number of alignment columns used to train the parameters. The upper right panel represents the distribution of the correlation coefficients of tree edge lengths between the PhyloP model of the UCSC genome browser site and the inferred parameters. The x-axis is the same as that shown in the upper left panel. The bottom panel represents the distributions of inferred time to each concestor from the present. The unit is the number of substitutions per site. Each parameter set is trained using 100,000 alignment columns sampled from 4d sites
Trained rate matrix and equilibrium distribution
Substitution type | Parameter | Rate |
---|---|---|
\(\text {A}\leftarrow \text {C}\), \(\text {T}\leftarrow \text {G}\) | \(\alpha\) | 0.16 |
\(\text {A}\leftarrow \text {G}\), \(\text {T}\leftarrow \text {C}\) | \(\beta\) | 0.57 |
\(\text {A}\leftarrow \text {T}\), \(\text {T}\leftarrow \text {A}\) | \(\gamma\) | 0.20 |
\(\text {C}\leftarrow \text {G}\), \(\text {G}\leftarrow \text {C}\) | \(\delta\) | 0.24 |
\(\text {C}\leftarrow \text {T}\), \(\text {G}\leftarrow \text {A}\) | \(\epsilon\) | 0.59 |
\(\text {G}\leftarrow \text {T}\), \(\text {C}\leftarrow \text {A}\) | \(\eta\) | 0.25 |

Nucleotide | Equilibrium frequency \(\pi\) |
---|---|
A, T | 0.23 |
C, G | 0.27 |
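The relationship between the trained rates and the equilibrium frequencies can be reproduced with a few lines of Python. This is a minimal sketch, assuming that entry (i, j) of the strand-symmetric matrix is the rate of the substitution i ← j in A, C, G, T order, as the arrows in the table suggest, and that each diagonal entry makes its column sum to zero. The helper names are illustrative, and the rates are the rounded values from the table.

```python
# Rounded rates from the table (strand-symmetric model).
RATES = dict(alpha=0.16, beta=0.57, gamma=0.20, delta=0.24, epsilon=0.59, eta=0.25)


def strand_symmetric_matrix(alpha, beta, gamma, delta, epsilon, eta):
    """Rate matrix in A, C, G, T order; entry [i][j] is the rate of i <- j."""
    R = [
        [0.0, alpha, beta, gamma],
        [eta, 0.0, delta, epsilon],
        [epsilon, delta, 0.0, eta],
        [gamma, beta, alpha, 0.0],
    ]
    # Assumed convention: each column sums to zero, so dp/dt = R p conserves probability.
    for j in range(4):
        R[j][j] = -sum(R[i][j] for i in range(4) if i != j)
    return R


def equilibrium(R, step=0.1, iterations=2000):
    """Approximate the stationary distribution by iterating p <- p + step * R p."""
    p = [0.25, 0.25, 0.25, 0.25]
    for _ in range(iterations):
        flow = [sum(R[i][j] * p[j] for j in range(4)) for i in range(4)]
        p = [p[i] + step * flow[i] for i in range(4)]
    return p


R = strand_symmetric_matrix(**RATES)
pi = equilibrium(R)

if __name__ == "__main__":
    print([round(x, 4) for x in pi])
```

With the rounded rates, the computed frequencies round to 0.23 for A and T and to 0.27 for C and G, in agreement with the table. The equality of the A and T frequencies, and of the C and G frequencies, follows from the strand symmetry of the matrix.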
Numerical tests of our algorithms. The statistics \(t_{\text {MRS}}\), \(\sigma\), and q computed by exact algorithms were compared with those estimated using sampled histories of nucleotide substitutions. The y-axes represent the relative difference between the values from the exact algorithms and those obtained by approximate sampling algorithms. The x-axes show the dependency on the number of sampled histories and the number of discrete points in the phylogenetic tree from which the states were sampled
Runtime of our implementation
Computation | Data size | Runtime |
---|---|---|
Train, gradient (1 iteration) | 100 K columns | 4.6 min |
Train, total (300 iterations) | 100 K columns | 23 h |
\(t_{\text {MRS}}\), \(\sigma\), q | 1 column | \(7.3\times 10^{-4}\) s |
\(t_{\text {MRS}}\), \(\sigma\), q | 1 G columns | 204 h |
Comparison with other statistical measures
To compare the accuracy of our algorithm with that of approximate algorithms based on ancestral state reconstruction or on the closest extant species with a differing base, we generated evolutionary histories of base substitutions and the resulting alignment columns by forward simulation under the phylogenetic model of the previous section. To imitate the gap patterns of real alignments, we masked the nucleotide positions at which sampled multiz100way alignment columns have gap or ambiguous characters. Details of the simulation algorithm are described in Section 6 of Additional file 1. As a result, we obtained 100,000 alignment columns of 100 species with ‘true’ annotations of \(t_{\text {MRS}}\) and \(q\in \{0,1\}\).
Effect of filtering. We investigated the effect of filtering by the q threshold on the accuracy of \(t_{\text {MRS}}\) estimates using the simulation dataset. The x-axis represents the fraction of alignment columns that remain after filtering with a varying threshold. a Fraction of alignment columns in the positive set that have no mutation throughout the target lineage. b Mean % error of \(t_{\text {MRS}}\) values in the dataset after filtering. The blue and green points represent the approximate \(t_{\text {MRS}}\) and q computed from the reconstruction of ancestral states and from the closest extant species whose base is different from that of the target species, respectively.
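The idea of attaching a 'true' \(t_{\text {MRS}}\) and a binary q to each simulated history can be sketched for the target lineage alone. The Python code below is a simplified illustration and not the procedure of Additional file 1: it simulates only the path from the root to the target leaf with a Gillespie-style loop, uses the rounded strand-symmetric rates reported in this paper, and takes 1.1 substitutions per site as the root-to-present time. The function names and the seed are hypothetical.

```python
import math
import random


def strand_symmetric_matrix(alpha, beta, gamma, delta, epsilon, eta):
    """Rate matrix in A, C, G, T order; entry [i][j] is the rate of i <- j."""
    R = [
        [0.0, alpha, beta, gamma],
        [eta, 0.0, delta, epsilon],
        [epsilon, delta, 0.0, eta],
        [gamma, beta, alpha, 0.0],
    ]
    for j in range(4):
        R[j][j] = -sum(R[i][j] for i in range(4) if i != j)
    return R


def simulate_lineage(R, total_time, start_state, rng):
    """Simulate one lineage forward in time.

    Returns (final_state, t_mrs), where t_mrs is the time from the present back
    to the last substitution, or None if no substitution occurred (true q = 1).
    """
    state = start_state
    elapsed = 0.0
    last_substitution = None
    while True:
        exit_rate = -R[state][state]
        elapsed += rng.expovariate(exit_rate)  # waiting time to the next event
        if elapsed >= total_time:
            break
        targets = [i for i in range(4) if i != state]
        weights = [R[i][state] for i in targets]
        state = rng.choices(targets, weights=weights)[0]
        last_substitution = elapsed
    if last_substitution is None:
        return state, None
    return state, total_time - last_substitution


R = strand_symmetric_matrix(0.16, 0.57, 0.20, 0.24, 0.59, 0.25)

if __name__ == "__main__":
    rng = random.Random(2018)
    histories = [simulate_lineage(R, 1.1, 0, rng) for _ in range(10000)]
    unchanged = sum(1 for _, t in histories if t is None) / len(histories)
    print(unchanged, math.exp(R[0][0] * 1.1))
```

For a lineage that starts in state A, the fraction of histories without any substitution should approach \(\exp (R_{AA}\, T)\), which gives a simple check on the sampler.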
Effects of filtering by probability q of no mutation
q threshold | Positive fraction | FDR for no mutation | % error of \(t_{\text {MRS}}\) |
---|---|---|---|
0.01 | 0.24 | 0.0013 | 4.4 |
0.1 | 0.36 | 0.015 | 7.9 |
0.5 | 0.68 | 0.16 | 15 |
1.0 | 1 | 0.33 | 15 |
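The three columns of this table can be computed from a simulated dataset along the following lines. This is a hedged Python sketch: the exact definitions are assumptions inferred from the figure caption (the positive set is the set of columns with q at or below the threshold, and the % error is the mean relative error over positive columns that truly have a substitution), and the four records are hypothetical toy values, not data from this study.

```python
def filter_summary(columns, q_threshold):
    """Summarize the effect of keeping only columns with q <= q_threshold.

    Each column is a dict with keys:
      q      : estimated probability of no mutation in the target lineage
      t_est  : estimated t_MRS
      t_true : simulated t_MRS, or None if the lineage had no substitution
    Returns (positive_fraction, fdr_no_mutation, percent_error) or None.
    """
    kept = [c for c in columns if c["q"] <= q_threshold]
    if not kept:
        return None
    positive_fraction = len(kept) / len(columns)
    unchanged = [c for c in kept if c["t_true"] is None]
    fdr_no_mutation = len(unchanged) / len(kept)
    changed = [c for c in kept if c["t_true"] is not None]
    if changed:
        relative = [abs(c["t_est"] - c["t_true"]) / c["t_true"] for c in changed]
        percent_error = 100.0 * sum(relative) / len(relative)
    else:
        percent_error = float("nan")
    return positive_fraction, fdr_no_mutation, percent_error


# Hypothetical toy records for illustration only.
TOY_COLUMNS = [
    {"q": 0.001, "t_est": 0.10, "t_true": 0.10},
    {"q": 0.005, "t_est": 0.22, "t_true": 0.20},
    {"q": 0.40, "t_est": 0.50, "t_true": None},
    {"q": 0.90, "t_est": 0.80, "t_true": None},
]

if __name__ == "__main__":
    for threshold in (0.01, 0.5, 1.0):
        print(threshold, filter_summary(TOY_COLUMNS, threshold))
```

Lowering the threshold trades coverage (the positive fraction) for a smaller share of unchanged sites among the retained columns, which is the pattern visible in the table.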
Correlation with other conservation measures
Significance measure | Spearman’s correlation with true \(t_{\text {MRS}}\) |
---|---|
\(t_{\text {MRS}}(q<0.01)\) | 0.965 |
\(t_{\text {MRS}}(q<0.1)\) | 0.938 |
\(t_{\text {MRS}}(q<1)\) | 0.858 |
Reconstruction (\(q<1\)) | 0.905 |
Alignment (\(q<1\)) | 0.338 |
Entropy | 0.301 |
Pairwise | 0.344 |
Phastcons | 0.112 |
Phylop | 0.108 |
Gerp | 0.129 |
Genomic distribution of \(t_{\text {MRS}}\)
Evolutionary time of reduced concestors
Concestor | Time | Sibling | Descendants |
---|---|---|---|
Homo | 0 | – | Human |
Hominoidea | 0.026 | Hylobatidae | Gibbon |
Euarchontoglires | 0.17 | Glires | Mouse, rabbit |
Eutheria | 0.22 | Atlantogenata | Elephant, armadillo |
Mammalia | 0.55 | Prototheria | Platypus |
Amniota | 0.69 | Sauropsida | Bird, reptile |
Tetrapoda | 0.80 | Amphibia | Frog |
Vertebrata | 1.1 | Cyclostomata | Lamprey |
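Given these concestor times, a site can be assigned to a concestor interval by a simple lookup. The Python sketch below is one possible rule and an assumption on our part: it returns the smallest interval between concestors that contains a supplied time range, for example \(t_{\text {MRS}} \pm \sigma\). The function name and the choice of range are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Concestor times (substitutions per site) from the table above.
CONCESTORS = [
    ("Homo", 0.0),
    ("Hominoidea", 0.026),
    ("Euarchontoglires", 0.17),
    ("Eutheria", 0.22),
    ("Mammalia", 0.55),
    ("Amniota", 0.69),
    ("Tetrapoda", 0.80),
    ("Vertebrata", 1.1),
]
TIMES = [time for _, time in CONCESTORS]


def concestor_interval(t_low, t_high):
    """Return (late, early) concestor names bracketing the range [t_low, t_high].

    Returns None if the range is invalid or extends beyond the root concestor.
    """
    if t_low < 0.0 or t_high > TIMES[-1] or t_low > t_high:
        return None
    # Latest concestor at or before t_low, kept below the root index.
    late = min(bisect_right(TIMES, t_low) - 1, len(TIMES) - 2)
    # Earliest concestor at or after t_high, strictly older than `late`.
    early = max(bisect_left(TIMES, t_high), late + 1)
    return CONCESTORS[late][0], CONCESTORS[early][0]


if __name__ == "__main__":
    print(concestor_interval(0.1, 0.1))
    print(concestor_interval(0.03, 0.9))
```

A point estimate of 0.1 falls in the Hominoidea–Euarchontoglires interval, while an uncertain estimate spanning 0.03 to 0.9 is assigned to the much wider Hominoidea–Vertebrata interval.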
Distributions of \(t_{\text {MRS}}\), \(\sigma\), and q in the human genome. The top panels show the sampling distribution of statistics \(t_{\text {MRS}}\)-q (left) and \(\sigma\)-q (right) in the human genome. In these panels, a total of 2,063,207 alignment columns were sampled from the human genome excluding repeat regions. The bottom panels show the densities of q (left) and \(t_{\text {MRS}}\) (with \(q<0.01\)) (right) for several types of genomic region: CDS, 5′UTR, 3′UTR, Intron, and Intergenic.
Figure 6 (bottom left) shows the density of q for each annotated genomic region. Compared to Intergenic, Intron, 3′UTR, and 5′UTR, CDS regions have a large fraction of sites with a high probability of no mutation, indicating many ancestral nucleotides that were fixed before the appearance of the vertebrate concestor. Since computed \(t_{\text {MRS}}\) values have less meaning if q is large, we filtered out sites with \(q>0.01\) and plotted the distributions of \(t_{\text {MRS}}\) values for the remaining sites (Fig. 6 (bottom right)). There are several peaks because, at some sites, the last substitution is constrained to lie within a specific interval between concestors. All regions have the highest peak around \(t_{\text {MRS}}\sim 0.1\), which is between the Simiiformes and Primate concestors. CDS regions have a large peak around \(t_{\text {MRS}}\sim 0.36\), which lies between the Eutheria and Theria concestors.
Concestor interval of the last substitution event
Topological relationship of reduced concestors. We show the topology of simplified phylogenetic tree of 100 vertebrate species used in the analyses of concestor intervals. See Table 6 for the numerical values of evolutionary time
Frequency of concestor intervals. The two panels show the frequencies of genomic sites categorized by the concestor intervals where their most recent substitutions occurred. The left panel shows the genomic distribution. The axes represent late (x-axis) and early (y-axis) ends of intervals. The right panel shows distributions for several types of genomic region: CDS, 5′UTR, 3′UTR, Intron, Intergenic, and Transcribed Enhancer. Only intervals with non-zero counts are shown in this panel
Tissue-concestor interval correlations for transcribed enhancers
Tissue-concestor interval correlation for transcribed enhancers
Tissue | Interval | Z-score | \(-\log _{10}(\text {p-value})\) | Enrichment | Observed |
---|---|---|---|---|---|
Brain | Homo–Vertebrata | 14.2 | 32.0 | 2.97 | 140 |
Brain | Hominoidea–Tetrapoda | 13.3 | 29.2 | 2.69 | 148 |
Brain | Hominoidea–Vertebrata | 11.5 | 22.1 | 2.17 | 115 |
Meninx | Hominoidea–Tetrapoda | 8.58 | 11.7 | 3.84 | 32 |
Meninx | Hominoidea–Vertebrata | 6.68 | 7.72 | 3.52 | 23 |
Meninx | Hominoidea–Amniota | 6.06 | 6.96 | 3.01 | 25 |
Eye | Hominoidea–Vertebrata | 8.43 | 11.5 | 3.50 | 37 |
Eye | Eutheria–Tetrapoda | 8.42 | 7.44 | 9.69 | 9 |
Eye | Eutheria–Vertebrata | 8.06 | 8.98 | 5.13 | 19 |
Examples of alignment columns. The figure shows examples of alignment columns that are assigned to the concestor interval Homo–Vertebrata, lie in transcribed enhancer regions, and show brain-specific RNA transcription. The y-axis represents nine example alignment columns and the x-axis represents the nucleotides of each column, in which gaps, ambiguous nucleotides, and unaligned regions are shown as blanks. The species are ordered consistently with the phylogenetic tree, such that species more evolutionarily distant from humans are placed further to the right.
Tissue-concestor interval correlations for genes
Tissue-concestor interval correlation for genes
Tissue | Interval | Z-score | \(-\log _{10}(\text {p-value})\) | Enrichment | Observed |
---|---|---|---|---|---|
Muscle | Eutheria–Tetrapoda | 6.44 | 11.3 | 1.26 | 250 |
Muscle | Eutheria–Amniota | 6.23 | 11.9 | 1.18 | 287 |
Muscle | Eutheria–Mammalia | 5.22 | 10.8 | 1.09 | 307 |
Artery aorta | Eutheria–Tetrapoda | 5.73 | 9.76 | 1.36 | 122 |
Artery aorta | Eutheria–Amniota | 4.22 | 6.20 | 1.18 | 131 |
Artery aorta | Euarchontoglires–Eutheria | 4.16 | 5.11 | 1.33 | 99 |
Pineal gland | Euarchontoglires–Eutheria | 5.50 | 8.05 | 1.28 | 217 |
Pineal gland | Eutheria–Amniota | 5.45 | 9.04 | 1.15 | 290 |
Pineal gland | Eutheria–Tetrapoda | 5.21 | 7.60 | 1.21 | 247 |
Conclusions
We have developed algorithms to infer the time \(t_{\text {MRS}}\) to the most recent substitution in the lineage from a given target species to the root of a phylogenetic tree. In order to filter out highly conserved sites and ambiguous sites where the confidence of the estimated \(t_{\text {MRS}}\) is low, we also compute the probability q of no mutation and the standard deviation \(\sigma\) of \(t_{\text {MRS}}\). We computed these variables efficiently using dynamic programming algorithms on the phylogenetic tree such that the algorithms can be applied to multiple genomic alignments with 100 species. We have empirically checked the correctness of our algorithms by posterior sampling of mutation histories on the tree. Our algorithms are exact under the assumptions of the model: genome evolution follows a site-independent continuous-time Markov process along the phylogenetic tree. Our results also depend on the quality of the Multiz alignment, which was debated previously [27]. Although alignment errors can be less influential if the corresponding leaf nodes are far from the target lineage, the incomplete coverage of sequenced genomes directly affects the number of sites whose \(t_{\text {MRS}}\) can be determined with confidence. We expect that the number of sites with confident \(t_{\text {MRS}}\) values will increase as the coverage of genome sequences improves in the future.
We have applied our tool to 100-species multiple genome alignments with human target and obtained a frequency spectrum of concestor intervals that categorized the time points at which the last substitutions occurred. Furthermore, we studied the correlation between the frequency of concestor intervals and the tissue-specificity of transcribed enhancers and found that brain-specific transcribed enhancers are highly enriched among the sites with mutations in the human lineage. It may be very interesting to combine our method with genome editing experiments to see if nucleotide changes at the screened sites affect tissue functions.
Notes
Acknowledgements
This work was supported by JSPS KAKENHI [Grant Numbers 16H01532, 17K00398] (H.K.).
Authors' contributions
HK designed the project, developed the algorithms, and wrote the manuscript. KI and YK contributed to the development of algorithms and their implementation and computational experiments in the early stages of the study. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Supplementary material
References
- 1. Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, Haussler D, Miller W. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004;14(4):708–15.
- 2. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. The human genome browser at UCSC. Genome Res. 2002;12(6):996–1006.
- 3. UCSC Genome Browser. http://genome.ucsc.edu/. Accessed 15 Jun 2018.
- 4. Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981;17(6):368–76.
- 5. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987;4(4):406–25.
- 6. Huelsenbeck JP, Ronquist F. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics. 2001;17(8):754–5.
- 7. Murphy WJ, Eizirik E, O’Brien SJ, Madsen O, Scally M, Douady CJ, Teeling E, Ryder OA, Stanhope MJ, de Jong WW, Springer MS. Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science. 2001;294(5550):2348–51.
- 8. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005;15(8):1034–50.
- 9. Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 2010;20(1):110–21.
- 10. Cooper GM, Stone EA, Asimenos G, Green ED, Batzoglou S, Sidow A. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 2005;15(7):901–13.
- 11. Garber M, Guttman M, Clamp M, Zody MC, Friedman N, Xie X. Identifying novel constrained elements by exploiting biased substitution patterns. Bioinformatics. 2009;25(12):54–62.
- 12. Gu X, Fu YX, Li WH. Maximum likelihood estimation of the heterogeneity of substitution rate among nucleotide sites. Mol Biol Evol. 1995;12(4):546–57.
- 13. Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J Mol Evol. 1994;39(3):306–14.
- 14. Siepel A, Pollard KS, Haussler D. New methods for detecting lineage-specific selection. In: Apostolico A, Guerra C, Istrail S, Pevzner PA, Waterman M, editors. Research in computational molecular biology. RECOMB. Lecture notes in computer science. Berlin: Springer; 2006.
- 15. Yang Z. Computational molecular evolution. Oxford: Oxford University; 2006.
- 16. Karro JE, Peifer M, Hardison RC, Kollmann M, von Grunberg HH. Exponential decay of GC content detected by strand-symmetric substitution rates influences the evolution of isochore structure. Mol Biol Evol. 2008;25(2):362–74.
- 17. Zhu C, Byrd RH, Lu P, Nocedal J. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans Math Softw. 1997;23(4):550–60.
- 18. Siepel A, Haussler D. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol Evol. 2004;21(3):468–88.
- 19. Kiryu H. Sufficient statistics and expectation maximization algorithms in phylogenetic tree models. Bioinformatics. 2011;27(17):2346–53.
- 20. Dawkins R. The Ancestor’s tale. London: Weidenfeld and Nicolson; 2004.
- 21. TMRS Software. https://github.com/hmatsu1226/SCODE. Accessed 15 Jun 2018.
- 22. Fisher R. On the mathematical foundation of theoretical statistics. Philos Trans R Soc Lond Ser A. 1922;222:309–68.
- 23. Andersson R, et al. An atlas of active enhancers across human cell types and tissues. Nature. 2014;507(7493):455–61.
- 24. FANTOM5 human enhancer tracks. http://slidebase.binf.ku.dk/human_enhancers/. Accessed 15 Jun 2018.
- 25. Mungall CJ, Torniai C, Gkoutos GV, Lewis SE, Haendel MA. Uberon, an integrative multi-species anatomy ontology. Genome Biol. 2012;13(1):5.
- 26. Bard J, Rhee SY, Ashburner M. An ontology for cell types. Genome Biol. 2005;6(2):21.
- 27. Frith MC, Park Y, Sheetlin SL, Spouge JL. The whole alignment and nothing but the alignment: the problem of spurious alignment flanks. Nucleic Acids Res. 2008;36(18):5863–71.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.