Annals of Operations Research

Volume 276, Issue 1–2, pp 293–313

Multilocus phylogenetic analysis with gene tree clustering

  • Ruriko Yoshida
  • Kenji Fukumizu
  • Chrysafis Vogiatzis
Computational Biomedicine


Both theoretical and empirical evidence indicate that phylogenetic trees of different genes (loci) do not display precisely matched topologies. Nonetheless, most genes do display related phylogenies; this implies they form cohesive subsets (clusters). In this work, we discuss gene tree clustering, focusing on the normalized cut (Ncut) framework as a suitable method for phylogenetics. We proceed to show that this framework is both efficient and statistically accurate when clustering gene trees using the geodesic distance between them over the Billera–Holmes–Vogtmann tree space. We also conduct a computational study on the performance of different clustering methods, with and without preprocessing, under different distance metrics, and using a series of dimensionality reduction techniques. Our results with simulated data reveal that Ncut accurately clusters the set of gene trees, given a species tree under the coalescent process. Other observations from our computational study include the similar performance displayed by Ncut and k-means under most dimensionality reduction schemes, the worse performance of hierarchical clustering, and the significantly better performance of the neighbor-joining method with the p-distance compared to the maximum-likelihood estimation method. Supplementary material, all codes, and the data used in this work are freely available online.
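
As a rough illustration of the normalized cut idea summarized above, the following minimal sketch splits a set of items into two clusters from a pairwise distance matrix, using the standard spectral relaxation of the two-way normalized cut. It is not the authors' code. The Gaussian affinity, the median bandwidth heuristic, the sign-based split, and the function name `ncut_bipartition` are all assumptions made for this example, and the toy distances between points on a line merely stand in for geodesic distances between gene trees, which would have to be computed separately.

```python
import numpy as np


def ncut_bipartition(dist, sigma=None):
    """Two-way normalized-cut split from a symmetric distance matrix.

    Uses the spectral relaxation: the second-smallest eigenvector of the
    symmetric normalized Laplacian, thresholded at zero.
    """
    dist = np.asarray(dist, dtype=float)
    if sigma is None:
        # Assumed bandwidth heuristic: median of the off-diagonal distances.
        sigma = np.median(dist[np.triu_indices_from(dist, k=1)])
    # Gaussian affinity (an assumption; other similarity kernels are possible).
    w = np.exp(-dist**2 / (2.0 * sigma**2))
    np.fill_diagonal(w, 0.0)
    deg = w.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    lap = np.eye(len(w)) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    _, eigvecs = np.linalg.eigh(lap)
    # Map the second eigenvector back to the generalized problem
    # (D - W) y = lambda * D y, then split by sign.
    y = d_inv_sqrt * eigvecs[:, 1]
    return (y > 0).astype(int)


# Toy stand-in for a tree-to-tree distance matrix: two well-separated groups.
points = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
toy_dist = np.abs(points[:, None] - points[None, :])
labels = ncut_bipartition(toy_dist)
print(labels)
```

Which group receives label 0 or 1 is arbitrary, because an eigenvector is only defined up to sign. More than two clusters could be obtained by recursive bipartitioning or by clustering several eigenvectors at once; which variant the paper uses is not stated in this summary.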


Keywords: Phylogenetics, Normalized cut, Clustering



The authors would like to thank the editor and the anonymous referees for their useful comments for improving the manuscript.

Funding: K. F. and R. Y. were supported by JSPS KAKENHI 26540016. C. V. would also like to acknowledge support from ND EPSCoR NSF #1355466.

Supplementary material

Supplementary material 1: 10479_2017_2456_MOESM1_ESM.pdf (PDF, 3824 KB)



Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. Department of Operations Research, Naval Postgraduate School, Monterey, USA
  2. The Institute of Statistical Mathematics, Tachikawa, Japan
  3. Department of Multidisciplinary Sciences, Graduate University of Advanced Studies, Hayama, Japan
  4. Department of Industrial and Manufacturing Engineering, North Dakota State University, Fargo, USA
