MC-Net: a method for the construction of phylogenetic networks based on the Monte-Carlo method
Abstract
Background
A phylogenetic network is a generalization of phylogenetic trees that allows conflicting signals or alternative evolutionary histories to be represented in a single diagram. There are several methods for constructing these networks, some of which are based on distances among taxa. In practice, distance-based methods run faster than other methods. The Neighbor-Net (N-Net) is a distance-based method: it produces a circular ordering from a distance matrix and then constructs a collection of weighted splits from that ordering. The SplitsTree program uses these weighted splits to draw a phylogenetic network. In general, finding an optimal circular ordering is an NP-hard problem. The N-Net is a heuristic algorithm, based on the neighbor-joining algorithm, for finding an optimal circular ordering.
Results
In this paper, we present a heuristic algorithm, called the MC-Net algorithm, that searches for an optimal circular ordering using the Monte-Carlo method. To show that the MC-Net performs better than the N-Net, we apply both algorithms to different data sets. We then use SplitsTree to draw the phylogenetic networks corresponding to the outputs of these algorithms and compare the results.
Conclusions
We find that the circular ordering produced by the MC-Net is closer to the optimal circular ordering than the one produced by the N-Net. Furthermore, the networks that SplitsTree draws from the MC-Net output are simpler than those drawn from the N-Net output.
Keywords
Markov Chain, Ordinary Little Square, Energy Function, Distance Matrix, Phylogenetic Network
Background
Phylogenetics is concerned with the construction and analysis of phylogenetic trees or networks to understand the evolution of species, populations, and individuals. Evolutionary processes such as hybridization between species, lateral transfer of genes, recombination within a population, and convergent evolution can all lead to evolutionary histories that are distinctly non-treelike. Moreover, even when the underlying evolution is treelike, the presence of conflicting or ambiguous signals can make a single tree representation inappropriate. In these situations, phylogenetic network methods can be particularly useful.
A phylogenetic network is a generalization of phylogenetic trees that can represent several trees simultaneously. Any network construction method should represent the conflicting signals in the network, but it is vital that the network does not depict more conflict than is found in the data. At the same time, when the data fit a tree well, the method should return a network that is close to a tree. Recently, phylogenetic network methods have also been widely used outside biology to classify other types of data, such as those found in linguistics and music. There are many distance-based methods that construct phylogenetic trees or networks from a distance matrix, such as ME (minimum evolution) [1], LS (least squares) [2, 3], NJ (neighbor-joining) [4], AddTree [5], N-Net (neighbor-net) [6] and Q-Net [7].
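The LS approach seeks a tree that minimizes the sum of squared differences $\sum_{i,j \in X} (d_{ij} - \delta_{ij})^{2}$, where δ_{ ij } is the tree-induced estimate of the input distance d_{ ij } and X is the set of taxa. In other words, the main goal is to find a tree whose induced metric is as close as possible to d_{ ij }. The LS was first introduced in [2] and [3].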
Nearly 20 years have passed since the landmark paper in Molecular Biology and Evolution that introduced NJ [4]. NJ has become the most widely used method for building phylogenetic trees from distances. Gascuel and Steel showed that NJ is a greedy algorithm for the ME principle [9]. The N-Net, an algorithm for constructing phylogenetic networks, is a hybrid of NJ and split decomposition [10]. It is applicable to data sets containing hundreds of taxa.
Split decomposition, implemented in SplitsTree [11], decomposes the distance matrix into simple components based on weighted splits. These splits are then represented using a special type of phylogenetic network called a split network. The N-Net works in a similar way: it first produces a circular ordering from the distance matrix and then constructs a collection of weighted splits. Dan Levy and Lior Pachter showed that the N-Net is a greedy algorithm for the traveling salesman problem that minimizes the balanced length of the split system at every step, and that it is optimal for circular distance matrices [12]. Balanced minimum evolution (BME) is designed under the ME principle [13]. The BME is a special version of the ME principle in which tree length is estimated by weighted least squares [13].
In this work, we introduce the MC-Net algorithm (Monte-Carlo Network algorithm), which works in a similar way. First, it finds a circular ordering of the taxa using the Monte-Carlo method with simulated annealing. It then extracts splits from the circular ordering and weights them using non-negative least squares. We compare the results of the N-Net and the MC-Net on several data sets.
Preliminaries
In an edge-weighted tree, the weight of each edge is assigned to its corresponding split. The phyletic distance between any two taxa x and y in an edge-weighted tree is the sum of the weights of the edges along the path from x to y. Hence, the phyletic distance between x and y equals the sum of the split weights over all splits in which x and y belong to separate components.
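The relation between split weights and phyletic distances can be illustrated with a short Python sketch. The four-taxon tree, its edge weights, and the function name `phyletic_distances` are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def phyletic_distances(taxa, weighted_splits):
    """Return {(x, y): distance}, summing the weights of splits that separate x and y.

    weighted_splits is a list of (side, weight) pairs; `side` is one block of
    the split, and the other block is taken to be its complement in `taxa`.
    """
    dist = {}
    for x, y in combinations(taxa, 2):
        dist[(x, y)] = sum(w for side, w in weighted_splits
                           if (x in side) != (y in side))
    return dist

# Hypothetical tree ((a,b),(c,d)): four pendant edges and one internal edge.
taxa = ["a", "b", "c", "d"]
splits = [
    ({"a"}, 1.0), ({"b"}, 2.0), ({"c"}, 3.0), ({"d"}, 4.0),
    ({"a", "b"}, 0.5),
]
d = phyletic_distances(taxa, splits)
print(d[("a", "b")])  # 3.0: only the two pendant edges separate a and b
print(d[("a", "c")])  # 4.5: the internal edge also lies on the path
```

The same function applies unchanged to incompatible (non-tree) split collections, because it only asks whether a split separates the two taxa.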
A collection of splits is called compatible if every pair of splits in it is compatible. A compatible collection of splits can be represented by a phylogenetic tree [14, 15]. Dress and Huson introduced SplitsTree to display more complex evolutionary patterns [16]. For a set of incompatible splits, SplitsTree outputs a split network that uses bands of parallel edges.
A circular collection of splits is a mathematical generalization of a compatible collection of splits. Formally, a collection of splits of X is circular if there exists an ordering x_{1},⋯,x_{ n } of X such that every split is of the form {x_{ i }, x_{i+1},⋯,x_{ j } }|X - {x_{ i },⋯,x_{ j } } for some i and j, 1 ≤ i ≤ j ≤ n. A compatible collection of splits is always circular [10]. In other words, the class of circular collections of splits contains the class of collections of splits corresponding to a tree. Andreas Dress and Daniel Huson proved that circular collections of splits always have a planar splits graph representation [16]. A distance matrix is circular (also called Kalmanson) if it consists of the phyletic distances for a circular collection of splits with positive weights. Because compatible splits are circular, treelike distances are circular too [6].
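As a minimal sketch of the definition above, the following Python function enumerates all splits of the form {x_i,...,x_j} | rest for a given circular ordering. The function name and the convention of storing the block that omits the first taxon are assumptions made here for illustration.

```python
def circular_splits(ordering):
    """List every split {x_i..x_j} | rest induced by a circular ordering.

    Each split is stored as a frozenset holding the block that excludes the
    first taxon of the ordering, so a split and its complement are not
    counted twice.
    """
    n = len(ordering)
    splits = []
    for i in range(1, n):
        for j in range(i, n):
            splits.append(frozenset(ordering[i:j + 1]))
    return splits

order = ["a", "b", "c", "d"]
for s in circular_splits(order):
    print(sorted(s), "|", sorted(set(order) - s))
```

For n taxa this yields n(n - 1)/2 splits, including the n trivial splits that separate a single taxon from the rest.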
Let Σ denote the set of all circular orderings of the taxa x_{1},...,x_{ n } . We call the function η, defined on Σ, the energy function, and any circular ordering that minimizes η is called an optimal circular ordering.
Methods
There are a number of different methods for constructing various kinds of phylogenetic networks. A phylogenetic network can be constructed from a collection of weighted splits, and the N-Net uses a circular ordering to construct such a collection. Since finding an optimal circular ordering is an NP-hard problem, we introduce a heuristic algorithm based on the Monte-Carlo method to search for one. The MC-Net seeks an optimal circular ordering from the distance matrix and then extracts a collection of weighted splits based on that ordering.
Algorithms
In this section, we present a new algorithm, called the MC-Net, that constructs a set of weighted splits for a taxa set X = {x_{1},...,x_{ n } } with a given distance matrix. The MC-Net consists of two steps. In the first step, we find a circular ordering. In the second step, the splits obtained from the circular ordering are weighted. The core of the first step consists of two procedures, namely INITIAL and the Monte-Carlo. INITIAL is a greedy algorithm that produces a circular ordering, called the initial circular ordering. INITIAL works in the following way:
If r = x_{σ(1)}, we consider the new ordering $\overline{x}, x_{\sigma(1)}, \dots, x_{\sigma(k)}$. Otherwise, the ordering $x_{\sigma(1)}, \dots, x_{\sigma(k)}, \overline{x}$ is considered. This process stops when all taxa are ordered.
where Σ is the set of all circular orderings.
Pseudo code of the Monte-Carlo algorithm with simulated annealing
Input: T, the initial temperature |
---|
σ_{ 0 }, the initial ordering |
T_{ low }, the low temperature |
t, a constant number of repetitions |
σ = σ_{ 0 } |
While T > T_{ low } |
Repeat t times |
choose random $\tilde{\sigma} \in N(\sigma)$ |
If $\eta(\tilde{\sigma}) \le \eta(\sigma)$ |
$\sigma = \tilde{\sigma}$ |
Else |
x = random(0, 1) |
If $x < e^{\frac{-\eta(\tilde{\sigma}) + \eta(\sigma)}{T}}$ |
$\sigma = \tilde{\sigma}$ |
End If |
End If |
End Repeat |
T = T * 0.9 |
End While |
Return σ and η(σ) |
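The pseudo code above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the energy function `tour_length` (the sum of distances between consecutive taxa around the cycle) and the neighborhood `random_neighbor` (swapping two positions) are hypothetical stand-ins for the paper's η and N(σ), and the names and the `seed` parameter are introduced here only for the example.

```python
import math
import random

def tour_length(order, dist):
    """Assumed energy: total distance between consecutive taxa around the cycle."""
    n = len(order)
    return sum(dist[order[i]][order[(i + 1) % n]] for i in range(n))

def random_neighbor(order, rng):
    """Assumed neighborhood: swap the taxa at two distinct positions."""
    i, j = rng.sample(range(len(order)), 2)
    new = list(order)
    new[i], new[j] = new[j], new[i]
    return new

def anneal(order, dist, T, T_low, t, energy=tour_length, cooling=0.9, seed=0):
    """Monte-Carlo search with simulated annealing, following the pseudo code."""
    rng = random.Random(seed)
    sigma, e_sigma = list(order), energy(order, dist)
    while T > T_low:
        for _ in range(t):
            cand = random_neighbor(sigma, rng)
            e_cand = energy(cand, dist)
            # Always accept a move that does not increase the energy; accept
            # a worse move with probability exp((eta(sigma) - eta(cand)) / T).
            if e_cand <= e_sigma or rng.random() < math.exp((e_sigma - e_cand) / T):
                sigma, e_sigma = cand, e_cand
        T *= cooling  # geometric cooling with coefficient 0.9
    return sigma, e_sigma

# Hypothetical data: five taxa placed on a line, indexed 0..4.
pts = [0.0, 1.0, 2.0, 3.0, 4.0]
dist = [[abs(a - b) for b in pts] for a in pts]
final_order, final_energy = anneal([0, 3, 1, 4, 2], dist, T=1.0, T_low=1e-3, t=50)
print(final_order, final_energy)
```

As in the pseudo code, the function returns the last accepted ordering rather than the best ordering seen during the run; keeping track of the best ordering would be a small variation.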
The matrix A = [A_{ ij,k } ] is full rank [17].
Let d = (d_{12}, d_{13},...,d_{(n-1)n}) be the n(n - 1)/2-dimensional vector whose entries correspond to the rows of A, where each d_{ ij } is taken from the distance matrix. Let b be the weight vector of the splits; then the phyletic distance vector is p = Ab.
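A small Python sketch of the product p = Ab is given below. It assumes the usual convention that the entry A_{ij,k} is 1 when split k separates taxa i and j and 0 otherwise; the split weights and the function names are hypothetical.

```python
from itertools import combinations

def split_matrix(taxa, splits):
    """Rows follow the pair order (1,2), (1,3), ..., (n-1,n); columns follow `splits`.

    An entry is 1 when the split separates the pair, else 0 (assumed definition).
    """
    return [[1 if (x in s) != (y in s) else 0 for s in splits]
            for x, y in combinations(taxa, 2)]

def phyletic_vector(A, b):
    """Compute p = A b with plain Python lists."""
    return [sum(a * w for a, w in zip(row, b)) for row in A]

taxa = ["x1", "x2", "x3", "x4"]
# The six circular splits of the ordering x1, x2, x3, x4,
# each given by the block that does not contain x1.
splits = [{"x2"}, {"x2", "x3"}, {"x2", "x3", "x4"}, {"x3"}, {"x3", "x4"}, {"x4"}]
b = [1.0, 0.5, 1.0, 2.0, 0.25, 1.5]  # hypothetical split weights
A = split_matrix(taxa, splits)
p = phyletic_vector(A, b)
print(p)  # [2.5, 3.75, 2.75, 3.25, 3.25, 4.0]
```

With the full set of circular splits, A is a square matrix of order n(n - 1)/2, here 6 by 6.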
If we discard splits with negative weights and leave the remaining splits unchanged, the weights of the remaining splits are often grossly overestimated. As in the N-Net algorithm, we therefore compute the optimal least squares estimates under a non-negativity constraint. In this paper, we use the FNNLS algorithm [18].
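In the notation above, and assuming the fit is measured in the Euclidean norm, the non-negative least squares problem can be written as:

```latex
\hat{b} \;=\; \operatorname*{arg\,min}_{b \,\ge\, 0} \; \lVert A b - d \rVert_2^{2}
\;=\; \operatorname*{arg\,min}_{b \,\ge\, 0} \; \sum_{i<j} \Bigl( d_{ij} - \sum_{k} A_{ij,k}\, b_k \Bigr)^{2}
```

Here the constraint b ≥ 0 is understood componentwise, so every split receives a weight that is zero or positive.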
Results and Discussion
In this section, we compare the results of the MC-Net and the N-Net on several data sets. We use the SplitsTree4 program [19] to draw the phylogenetic networks. Due to space limitations, we include only six figures in this article.
Data sets
One of the data sets, a collection of 110 Salmonella MLST sequences, was obtained from the authors of the N-Net. The other data sets are provided as examples in the SplitsTree4 program (version 4.10): Its (46 taxa), Jsa (46 taxa), Mammals (30 taxa), Primates (12 taxa), Rubber (23 taxa), Dolphins (36 taxa) and Myosin (143 taxa).
Optimal threshold for cooling coefficient and T_{ low }
Results
Values of the energy function for the circular orderings obtained by the N-Net, the MC-Net, and the MC-Net started from the ordering of the N-Net.
Data set | Its | Jsa | Mammals | Primates | Rubber | Dolphins | Salmonella | Myosin |
---|---|---|---|---|---|---|---|---|
N-Net | 0.4096 | 0.2808 | 4.4275 | 2.1465 | 0.7723 | 2.2 | 0.2546 | 43.8199 |
MC-Net | 0.4079 | 0.2728 | 4.4172 | 2.1410 | 0.7596 | 2.1667 | 0.2575 | 43.8019 |
MC-Net (start N-Net) | 0.3979 | 0.2767 | 4.4202 | 2.1410 | 0.7547 | 2.2 | 0.2515 | 43.6935 |
The number of splits obtained by the MC-Net and the N-Net for all data sets.
Data set | Its | Jsa | Mammals | Primates | Rubber | Dolphins | Salmonella | Myosin |
---|---|---|---|---|---|---|---|---|
N-Net | 110 | 83 | 103 | 34 | 55 | 67 | 107 | 520 |
MC-Net | 105 | 78 | 99 | 34 | 53 | 62 | 90 | 507 |
The value of the norm for all data sets.
Data set | Its | Jsa | Mammals | Primates | Rubber | Dolphins | Salmonella | Myosin |
---|---|---|---|---|---|---|---|---|
N-Net | 0.0444 | 0.0329 | 0.0717 | 0.0385 | 0.0362 | 0.1068 | 0.0487 | 0.0291 |
MC-Net | 0.0358 | 0.0292 | 0.0648 | 0.0358 | 0.0316 | 0.1019 | 0.0405 | 0.0207 |
Conclusions
In this work, we propose an algorithm, MC-Net, which is a distance-based method for constructing phylogenetic networks. The MC-Net scales well and can quickly produce detailed and informative networks for large numbers of taxa. We compare the performance of the MC-Net with that of the N-Net on eight different data sets. We have shown (Tables 2, 3 and 4) that the MC-Net performs better than the N-Net in almost all test cases and that the networks obtained by the MC-Net are simpler than those of the N-Net while retaining the same major splits. The N-Net is part of the SplitsTree program, so the results of the MC-Net could be used in SplitsTree too.
Appendix
Let S = {E_{1},...,E_{ s } } be a finite set of states, and consider a physical process that occupies one of these discrete states at each time t. A Markov chain is a stochastic model of this system such that the state of the system at time t + 1 depends only on the state of the system at time t.
Theorem 1 (Convergence to stationary Markov chain, [20])
such that π = (π_{1},...,π_{ s } ) is a unique probability distribution and $\pi_j = \sum_{i=1}^{s} \pi_i p_{ij}$.
The probability distribution π is called the stationary probability of the Markov chain.
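The stationarity condition π_j = Σ_i π_i p_ij can be checked numerically. The following Python sketch approximates π by repeated multiplication with the transition matrix; it assumes a chain for which this iteration converges, and the two-state matrix `P` and the function name are hypothetical.

```python
def stationary_distribution(P, iterations=1000):
    """Approximate pi with pi = pi P by repeatedly multiplying a start vector by P."""
    s = len(P)
    pi = [1.0 / s] * s
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(s)) for j in range(s)]
    return pi

# Hypothetical two-state chain; P[i][j] is the probability of moving from i to j.
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = stationary_distribution(P)
print(pi)  # close to [5/6, 1/6]
```

For this example, solving π_1 · 0.1 = π_2 · 0.5 together with π_1 + π_2 = 1 by hand gives π = (5/6, 1/6), which the iteration approaches.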
- 1. i ∉ N(i).
- 2. i ∈ N(j) ⇔ j ∈ N(i).
- 3. If i ≠ j, then there exist i_{1}, i_{2},...,i_{ l } ∈ Σ such that $i \in N(i_1),\ i_1 \in N(i_2),\ \dots,\ i_l \in N(j)$.
Where $\pi_i^{T} = \frac{e^{-\eta(i)/T}}{\sum_{j \in \Sigma} e^{-\eta(j)/T}}$ (see page 45 in [20]).
Proof: The proof is presented in [20] (claim 2.8 and claim 2.9).
Corollary 1 shows that, as the temperature is cooled (T → 0^{+}), the system enters one of the states of η_{0} with probability 1 as t → ∞. In this article, we define the set of all circular orderings of the taxa as the finite set of states. Our definition of neighborhood in the MC-Net satisfies the three properties of a neighborhood, and every element of η_{0} is an optimal circular ordering. Therefore, the MC-Net yields a circular ordering with approximately minimal energy.
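The effect of cooling on the distribution π^T can be illustrated numerically. The Python sketch below evaluates π_i^T = e^{-η(i)/T} / Σ_j e^{-η(j)/T} for a hypothetical list of four energies; subtracting the minimum energy before exponentiating is a numerical convenience that leaves the ratio unchanged.

```python
import math

def boltzmann(energies, T):
    """pi_i^T = exp(-eta(i)/T) / sum_j exp(-eta(j)/T), computed stably."""
    lo = min(energies)
    weights = [math.exp(-(e - lo) / T) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]

energies = [1.0, 1.0, 2.0, 3.0]  # hypothetical energies of four states
for T in (10.0, 1.0, 0.1, 0.01):
    print(T, [round(p, 4) for p in boltzmann(energies, T)])
```

As T decreases, the probability mass concentrates on the states of minimal energy, here split evenly between the first two states.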
Notes
Acknowledgements
We are grateful to the faculty of mathematics of Shahid Beheshti University. This work is supported in part by IPM(cs-1385-02). The authors would like to thank Prof. Hamid Pezeshk for many useful comments.
References
- 1. Kidd KK, Sgaramella-Zonta LA: Phylogenetic analysis: concepts and methods. Am J Hum Genet. 1971, 23: 235-252.
- 2. Cavalli-Sforza LL, Edwards AWF: Phylogenetic analysis: models and estimating procedures. Am J Hum Genet. 1967, 19: 233-257.
- 3. Fitch WM, Margoliash E: Construction of phylogenetic trees. Science. 1967, 155: 279-284. 10.1126/science.155.3760.279.
- 4. Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987, 4: 406-425.
- 5. Sattath S, Tversky A: Phylogenetic similarity trees. Psychometrika. 1977, 42: 319-345. 10.1007/BF02293654.
- 6. Bryant D, Moulton V: NeighborNet: An agglomerative method for the construction of planar phylogenetic networks. Mol Biol Evol. 2004, 21: 255-265. 10.1093/molbev/msh018.
- 7. Grunewald S, Forslund K, Dress A, Moulton V: QNet: An agglomerative method for the construction of phylogenetic networks from weighted quartets. Mol Biol Evol. 2007, 24: 532-538. 10.1093/molbev/msl180.
- 8. Rzhetsky A, Nei M: Theoretical foundation of the minimum-evolution method of phylogenetic inference. Mol Biol Evol. 1993, 10: 1073-1095.
- 9. Gascuel O, Steel M: Neighbor-joining revealed. Mol Biol Evol. 2006, 23: 1997-2000. 10.1093/molbev/msl072.
- 10. Bandelt HJ, Dress AWM: Split decomposition: A new and useful approach to phylogenetic analysis of distance data. Mol Phyl Evol. 1992, 1: 242-252. 10.1016/1055-7903(92)90021-8.
- 11. Huson DH: SplitsTree: A program for analyzing and visualizing evolutionary data. Bioinformatics. 1998, 14 (1): 68-73. 10.1093/bioinformatics/14.1.68.
- 12. Levy D, Pachter L: The Neighbor-Net Algorithm. Adv Appl Math.
- 13. Desper R, Gascuel O: The Minimum-Evolution Distance Based Approach to Phylogenetic Inference. Mathematics of Evolution and Phylogeny. Edited by: Gascuel O. 2005, Oxford University Press
- 14. Semple C, Steel M: Cyclic permutations and evolutionary trees. Adv Appl Math. 2004, 32 (4): 669-680. 10.1016/S0196-8858(03)00098-8.
- 15. Semple C, Steel M: Phylogenetics. 2003, Oxford, UK: Oxford University Press
- 16. Dress A, Huson DH: Constructing splits graphs. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2004, 1: 109-115. 10.1109/TCBB.2004.27.
- 17. Bandelt H-J, Dress A: A canonical decomposition theory for metrics on a finite set. Adv Math. 1992, 92: 47-105. 10.1016/0001-8708(92)90061-O.
- 18. Bro R, de Jong S: A Fast Non-negativity-constrained Least Squares Algorithm. Journal of Chemometrics. 1997, 11 (5): 393-401. 10.1002/(SICI)1099-128X(199709/10)11:5<393::AID-CEM483>3.0.CO;2-L.
- 19. Huson D, Bryant D: Application of phylogenetic networks in evolutionary studies. Mol Biol Evol. 2005, 23: 254-267. 10.1093/molbev/msj030.
- 20. Clote P, Backofen R: Computational molecular biology. 2000, New York: Wiley
Copyright information
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.