Syntactic Phylogenetic Trees

Abstract

In light of recent controversies surrounding the use of computational methods for the reconstruction of phylogenetic trees of language families (especially the Indo-European family), an approach based on syntactic information, complementing other linguistic methods, has appeared promising; it has been largely developed in recent years in Longobardi’s Parametric Comparison Method. In this paper we identify several serious problems that arise in the use of syntactic data from the SSWL database for the purpose of computational phylogenetic reconstruction. We show that the most naive approach fails to produce reliable linguistic phylogenetic trees. We identify some of the sources of the observed problems and discuss how they may be corrected, at least in part, by using additional information, such as a prior subdivision into language families and subfamilies, and by making better use of the information about ancient languages. We also describe how phylogenetic algebraic geometry can help estimate to what extent the probability distribution at the leaves of the phylogenetic tree obtained from the SSWL data can be considered reliable, by testing it on phylogenetic trees established by other forms of linguistic analysis. In simple examples, we find that, after restricting to smaller language subfamilies and considering only those SSWL parameters that are fully mapped for the whole subfamily, the SSWL data match reliable phylogenetic trees extremely well, according to the evaluation of phylogenetic invariants. This is a promising sign for the use of SSWL data in linguistic phylogenetics. We also discuss how dependencies and nontrivial geometry/topology in the space of syntactic parameters would have to be taken into account in phylogenetic reconstructions based on syntactic data. A more detailed analysis of syntactic phylogenetic trees and their algebro-geometric invariants will appear elsewhere [33].

Notes

  1. http://evolution.genetics.washington.edu/phylip/software.html.

  2. http://ab.inf.uni-tuebingen.de/data/software/splitstree4/download/manual.pdf.

  3. http://www.fluxus-engineering.com/Network5000_user_guide.pdf.

References

  1. E. Allman, J. Rhodes, Phylogenetic ideals and varieties for general Markov models. Adv. Appl. Math. 40, 127–148 (2008)

  2. M. Baker, The Atoms of Language (Basic Books, USA, 2001)

  3. F. Barbançon, S.N. Evans, L. Nakhleh, D. Ringe, T. Warnow, An experimental study comparing linguistic phylogenetic reconstruction methods. Diachronica 30(2), 143–170 (2013)

  4. C. Bocci, Topics in phylogenetic algebraic geometry. Expo. Math. 25, 235–259 (2007)

  5. A. Bouchard-Côté, D. Hall, T.L. Griffiths, D. Klein, Automated reconstruction of ancient languages using probabilistic models of sound change. Proc. Natl. Acad. Sci. (PNAS) 110(11), 4224–4229 (2013)

  6. R. Bouckaert, P. Lemey, M. Dunn, S.J. Greenhill, A.V. Alekseyenko, A.J. Drummond, R.D. Gray, M.A. Suchard, Q.D. Atkinson, Mapping the origins and expansion of the Indo-European language family. Science 337, 957–960 (2012)

  7. N. Chomsky, Lectures on Government and Binding (Foris Publications, Dordrecht, 1982)

  8. N. Chomsky, H. Lasnik, The theory of Principles and Parameters. in “Syntax: An International Handbook of Contemporary Research”, (de Gruyter, 1993), pp. 506–569

  9. M. DeGiorgio, J.H. Degnan, Robustness to divergence time underestimation when inferring species trees from estimated gene trees. Syst. Biol. 63(1), 66–82 (2014)

  10. A. Delmestri, N. Cristianini, Linguistic phylogenetic inference by PAM-like matrices. J. Quant. Linguist. 19, 95–120 (2012)

  11. N. Eriksson, K. Ranestad, B. Sturmfels, S. Sullivant, Phylogenetic algebraic geometry. in “Projective Varieties with Unexpected Properties”, (Walter de Gruyter, 2005), pp. 237–255

  12. P. Forster, C. Renfrew, Phylogenetic Methods and the Prehistory of Language (McDonald Institute Monographs, 2006)

  13. C. Galves (ed.), Parameter Theory and Linguistic Change (Oxford University Press, Oxford, 2012)

  14. D. Gusfield, ReCombinatorics (MIT Press, Cambridge, 2014)

  15. D.H. Huson, R. Rupp, C. Scornavacca, Phylogenetic Networks: Concepts, Algorithms and Applications (Cambridge University Press, Cambridge, 2010)

  16. P. Kanerva, Sparse Distributed Memory (MIT Press, Cambridge, 1988)

  17. G. Longobardi, Methods in parametric linguistics and cognitive history. Linguist. Var. Yearb. 3, 101–138 (2003)

  18. G. Longobardi, L. Bortolussi, M.A. Irimia, N. Radkevich, A. Ceolin, C. Guadagno, D. Michelioudakis, A. Sgarro, Mathematical modeling of grammatical diversity supports the historical reality of formal syntax. in “Proceedings of the Leiden Workshop on Capturing Phylogenetic Algorithms for Linguistics” (2016)

  19. G. Longobardi, S. Ghirotto, C. Guardiano, F. Tassi, A. Benazzo, A. Ceolin, G. Barbujani, Across language families: genome diversity mirrors linguistic variation within Europe. Am. J. Phys. Anthropol. 157(4), 630–640 (2015)

  20. G. Longobardi, C. Guardiano, Evidence for syntax as a signal of historical relatedness. Lingua 119, 1679–1706 (2009)

  21. G. Longobardi, C. Guardiano, G. Silvestri, A. Boattini, A. Ceolin, Towards a syntactic phylogeny of modern Indo-European languages. J. Hist. Linguist. 3(1), 122–152 (2013)

  22. M. Marcolli, Syntactic parameters and a coding theory perspective on entropy and complexity of language families. Entropy 18, 110 [17 pages] (2016)

  23. L. Nakhleh, D. Ringe, T. Warnow, Perfect phylogenetic networks: a new methodology for reconstructing the evolutionary history of natural languages. Language 81(2), 382–420 (2005)

  24. L. Pachter, B. Sturmfels, The mathematics of phylogenomics. SIAM Rev. 49(1), 3–31 (2007)

  25. L. Pachter, B. Sturmfels, Tropical geometry of statistical models. Proc. Natl. Acad. Sci. (PNAS) 101(46), 16132–16137 (2004)

  26. J.J. Park, R. Boettcher, A. Zhao, A. Mun, K. Yuh, V. Kumar, M. Marcolli, Prevalence and recoverability of syntactic parameters in sparse distributed memories. in Geometric Science of Information. Third International Conference GSI 2017. Lecture Notes in Computer Science, vol. 10589 (Springer, 2017), pp. 265–272

  27. A. Pereltsvaig, M.W. Lewis, The Indo-European Controversy: Facts and Fallacies in Historical Linguistics (Cambridge University Press, Cambridge, 2015)

  28. F. Petroni, M. Serva, Language distance and tree reconstruction. J. Stat. Mech. 2008, P08012 [16 pages] (2008)

  29. A. Port, I. Gheorghita, D. Guth, J.M. Clark, C. Liang, S. Dasu, M. Marcolli, Persistent Topology of Syntax. Math. Comput. Sci. 12(1), 33–50 (2018)

  30. L. Rizzi, On the Format and Locus of Parameters: The Role of Morphosyntactic Features, preprint, 2016

  31. N. Saitou, M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4(4), 406–425 (1987)

  32. K. Shu, M. Marcolli, Syntactic Structures and Code Parameters. Math. Comput. Sci. 11(1), 79–90 (2017)

  33. K. Shu, A. Ortegaray, R. C. Berwick, M. Marcolli, Phylogenetics of Indo-European Language families via an Algebro-Geometric Analysis of their Syntactic Structures. arXiv:1712.01719

  34. K. Siva, J. Tao, M. Marcolli, Syntactic Parameters and Spin Glass Models of Language Change. Linguist. Anal. 41(3–4), 559–608 (2017)

  35. B. Sturmfels, S. Sullivant, Toric ideals of phylogenetic invariants. J. Comput. Biol. 12(2), 204–228 (2005)

  36. T. Warnow, S.N. Evans, D. Ringe, L. Nakhleh, Stochastic Models of Language Evolution and an Application to the Indo-European Family of Languages. Available at http://www.stat.berkeley.edu/users/evans/659.pdf

  37. SSWL Database of Syntactic Parameters: http://sswl.railsplayground.net/

Acknowledgements

The first author is supported by a Summer Undergraduate Research Fellowship at Caltech. Part of this work was carried out within the activities of the last author’s Mathematical and Computational Linguistics lab and CS101/Ma191 class at Caltech. The last author is partially supported by NSF grants DMS-1201512, PHY-1205440, and DMS-1707882.

Author information

Corresponding author

Correspondence to Matilde Marcolli.

Appendix: The SSWL Parameters of the Latin languages

The phylogenetic invariants for the tree of Latin languages of Fig. 11 are evaluated at the probability distribution \(p_{i_1,i_2,i_3,i_4,i_5}\) at the leaves, based on the SSWL parameters for this group of languages. There are 106 parameters in the SSWL database that are completely mapped for all five of these languages; we have excluded all SSWL parameters that are mapped for only some of the languages in this group. With the notation \(\ell _1=\) French, \(\ell _2=\) Italian, \(\ell _3 =\) Latin, \(\ell _4=\) Spanish, and \(\ell _5=\) Portuguese, the syntactic parameters are given by the following list. The left column lists the SSWL parameters P as labeled in the database [37].

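[Figure a: list of the 106 SSWL parameters P, one per row, with their binary values for \(\ell _1,\ldots ,\ell _5\).]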

Inspecting the different groups of parameters in this list shows that parameters within the same group tend to behave in the same way (e.g. all the Neg parameters), or at least in a more highly correlated way than parameters from different groups. This is consistent with the more general dependencies observed through the Kanerva networks method in [26]. Thus, in order to better fit this set of binary variables to the hypothesis of independent identically distributed variables in Markov processes, it may be better to select a subset of the SSWL parameters that cuts across the various groups of more closely correlated variables. We will discuss this aspect in more detail elsewhere.

The probability \(p_{i_1,i_2,i_3,i_4,i_5}\) is then computed by counting the frequencies of occurrence of the binary vectors \([i_1,i_2,i_3,i_4,i_5] \in \{0,1\}^5\) among the 106 vectors of SSWL parameters above. The only nonzero frequencies are

$$ p_{0,0,0,0,0}=\frac{31}{106}, \ \ \ p_{0,0,0,0,1}=\frac{1}{106}, \ \ \ p_{0,0,0,1,0}=\frac{1}{106}, \ \ \ p_{0,0,1,0,0}=\frac{23}{106}, $$
$$ p_{0,0,1,0,1}= \frac{3}{106}, \ \ \ p_{0,0,1,1,1}=\frac{2}{106}, \ \ \ p_{0,1,0,0,0}=\frac{1}{106}, \ \ \ p_{0,1,0,1,1}=\frac{1}{106}, $$
$$ p_{0,1,1,0,1}= \frac{1}{106}, \ \ \ p_{0,1,1,1,1}=\frac{3}{106}, \ \ \ p_{1,0,0,0,0} = \frac{5}{106}, \ \ \ p_{1,0,0,1,0}=\frac{2}{106}, $$
$$ p_{1,1,0,1,0}=\frac{1}{106}, \ \ \ p_{1,1,0,0,0}=\frac{2}{106}, \ \ \ p_{1,1,0,1,1}=\frac{8}{106}, \ \ \ p_{1,1,1,1,1}=\frac{21}{106}. $$
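As a quick consistency check, the sixteen nonzero frequencies listed above can be stored as a table of counts and summed; the counts should total 106. The following is a minimal Python sketch under stated assumptions: the names `COUNTS`, `expand`, and `empirical_distribution` are illustrative and do not come from the chapter, and the list of vectors is rebuilt from the printed counts rather than read from the SSWL table itself.

```python
from collections import Counter
from fractions import Fraction

# Counts of each binary pattern, in the order
# (French, Italian, Latin, Spanish, Portuguese),
# transcribed from the nonzero frequencies listed in the text.
COUNTS = {
    (0, 0, 0, 0, 0): 31, (0, 0, 0, 0, 1): 1, (0, 0, 0, 1, 0): 1,
    (0, 0, 1, 0, 0): 23, (0, 0, 1, 0, 1): 3, (0, 0, 1, 1, 1): 2,
    (0, 1, 0, 0, 0): 1, (0, 1, 0, 1, 1): 1, (0, 1, 1, 0, 1): 1,
    (0, 1, 1, 1, 1): 3, (1, 0, 0, 0, 0): 5, (1, 0, 0, 1, 0): 2,
    (1, 1, 0, 0, 0): 2, (1, 1, 0, 1, 0): 1, (1, 1, 0, 1, 1): 8,
    (1, 1, 1, 1, 1): 21,
}


def expand(counts):
    """Rebuild a list of vectors with the given multiplicities."""
    return [pattern for pattern, n in counts.items() for _ in range(n)]


def empirical_distribution(vectors):
    """Map each observed binary pattern to its relative frequency."""
    counts = Counter(tuple(v) for v in vectors)
    total = sum(counts.values())
    return {pattern: Fraction(n, total) for pattern, n in counts.items()}


total = sum(COUNTS.values())
P = empirical_distribution(expand(COUNTS))

print(total)                  # number of parameter vectors
print(sum(P.values()))        # total probability mass
print(P[(0, 0, 1, 0, 0)])     # allowed in Latin only
```

Exact fractions are used instead of floats so that the normalization can be checked without rounding error.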

Note how these frequencies confirm some well-known facts about the Latin languages. Syntactic parameters (as recorded in SSWL) are very likely to have remained the same across all five languages in the family, with a higher probability of a feature not allowed in Latin remaining not allowed in the other languages (31/106) than of a feature allowed in Latin remaining allowed in the other languages (21/106). It is also very likely that a feature is the same in all the modern languages but different from Latin, with a much higher incidence of a feature allowed in Latin becoming disallowed in all the other languages (23/106) than the other way around (8/106). Among the remaining possibilities, we see cases where French has an allowed feature that is disallowed in the other languages (5/106), or a disallowed feature that is allowed in the other languages (3/106), and cases where Latin and Portuguese have the same feature allowed, which is disallowed in the other languages (3/106). All other nonzero entries have two or fewer occurrences. The resulting matrices for the edge flattenings of the tree of Fig. 11 are then as computed in Sect. 5.
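To illustrate what an edge flattening of this distribution looks like, the sketch below reshapes \(p_{i_1,i_2,i_3,i_4,i_5}\) into a matrix whose rows are indexed by the states of the leaves on one side of a split and whose columns are indexed by the states of the leaves on the other side. For a binary general Markov model, the flattening along an actual edge of the tree has rank at most 2, so its \(3\times 3\) minors are phylogenetic invariants (see [1, 11]). The split used here, \(\{\ell _1,\ell _2\}\,|\,\{\ell _3,\ell _4,\ell _5\}\), is a hypothetical choice made only for illustration, since the topology of Fig. 11 is not reproduced in this appendix; the function names are likewise illustrative and this is not the computation of Sect. 5.

```python
from fractions import Fraction
from itertools import combinations, product

# Pattern counts in the order (French, Italian, Latin, Spanish, Portuguese),
# transcribed from the nonzero frequencies listed in the text.
COUNTS = {
    (0, 0, 0, 0, 0): 31, (0, 0, 0, 0, 1): 1, (0, 0, 0, 1, 0): 1,
    (0, 0, 1, 0, 0): 23, (0, 0, 1, 0, 1): 3, (0, 0, 1, 1, 1): 2,
    (0, 1, 0, 0, 0): 1, (0, 1, 0, 1, 1): 1, (0, 1, 1, 0, 1): 1,
    (0, 1, 1, 1, 1): 3, (1, 0, 0, 0, 0): 5, (1, 0, 0, 1, 0): 2,
    (1, 1, 0, 0, 0): 2, (1, 1, 0, 1, 0): 1, (1, 1, 0, 1, 1): 8,
    (1, 1, 1, 1, 1): 21,
}


def flattening(counts, side_a, side_b):
    """Flatten the distribution along the split side_a | side_b.

    Rows are indexed by the states of the leaves in side_a, columns by
    the states of the leaves in side_b (leaf positions are 0-based).
    """
    n = len(side_a) + len(side_b)
    total = sum(counts.values())
    matrix = []
    for row_state in product((0, 1), repeat=len(side_a)):
        row = []
        for col_state in product((0, 1), repeat=len(side_b)):
            pattern = [0] * n
            for leaf, state in zip(side_a, row_state):
                pattern[leaf] = state
            for leaf, state in zip(side_b, col_state):
                pattern[leaf] = state
            row.append(Fraction(counts.get(tuple(pattern), 0), total))
        matrix.append(row)
    return matrix


def det3(m):
    """Determinant of a 3x3 matrix given as nested sequences."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def max_abs_minor3(matrix):
    """Largest absolute value among all 3x3 minors of the matrix."""
    best = Fraction(0)
    for rows in combinations(range(len(matrix)), 3):
        for cols in combinations(range(len(matrix[0])), 3):
            sub = [[matrix[r][c] for c in cols] for r in rows]
            best = max(best, abs(det3(sub)))
    return best


# Hypothetical split: {French, Italian} | {Latin, Spanish, Portuguese}.
F = flattening(COUNTS, side_a=(0, 1), side_b=(2, 3, 4))
print(len(F), len(F[0]))          # shape of the flattening matrix
print(float(max_abs_minor3(F)))   # how far the 3x3 minors are from zero
```

A value of the largest minor close to zero would indicate that the empirical distribution nearly satisfies the rank condition for the chosen split; how small counts as "close" depends on the sample size and is not addressed by this sketch.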

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter

Cite this chapter

Shu, K., Aziz, S., Huynh, VL., Warrick, D., Marcolli, M. (2018). Syntactic Phylogenetic Trees. In: Kouneiher, J. (eds) Foundations of Mathematics and Physics One Century After Hilbert. Springer, Cham. https://doi.org/10.1007/978-3-319-64813-2_14
