SCA: Phonetic Alignment Based on Sound Classes

  • Johann-Mattis List
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7415)

Abstract

In this paper I present the most recent version of the SCA method for pairwise and multiple alignment analyses. Unlike previously proposed alignment methods, SCA is based on a novel framework of sequence alignment which combines new approaches to sequence modeling in historical linguistics with recent developments in computational biology. Compared with earlier versions of SCA [1,2], the new version introduces several modifications that significantly improve the performance and the application range of the algorithm: a new sound class model was defined which works well on highly divergent sequences; the algorithm for pairwise alignment was modified to be sensitive to secondary sequence structures such as syllable boundaries; and an algorithm for pre-processing the data in multiple alignment analyses [3] was included to compensate for the bias resulting from progressive alignment analyses. To test the method, a new gold standard for pairwise and multiple alignment analyses was created, consisting of 45 947 sequences that cover a total of 435 different taxa belonging to six different language families.
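
The general idea of aligning words by their sound classes can be illustrated with a minimal sketch. It is not the SCA implementation: the class table TOY_CLASSES, the unit scores, and the function names to_classes and align are assumptions made for illustration, and a plain Needleman-Wunsch global alignment [16] stands in for the pairwise algorithm. The sketch ignores the sound class model proposed in the paper, syllable boundaries, and the pre-processing step for multiple alignment.

```python
# Toy sound-class table: an assumption for illustration, NOT the SCA model.
TOY_CLASSES = {
    "p": "P", "b": "P", "f": "P", "v": "P",
    "t": "T", "d": "T",
    "s": "S", "z": "S",
    "k": "K", "g": "K", "x": "K",
    "m": "M", "n": "N",
    "r": "R", "l": "R",
    "a": "V", "e": "V", "i": "V", "o": "V", "u": "V",
}


def to_classes(word):
    """Map each segment to its toy class; unknown segments stay unchanged."""
    return [TOY_CLASSES.get(seg, seg) for seg in word]


def align(word_a, word_b, match=1, mismatch=-1, gap=-1):
    """Globally align two words by comparing sound classes, not segments.

    Returns (aligned_a, aligned_b, score); gaps are shown as "-".
    The scoring values are hypothetical defaults.
    """
    a, b = to_classes(word_a), to_classes(word_b)
    n, m = len(a), len(b)

    # Fill the dynamic-programming matrix.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align the two segments
                score[i - 1][j] + gap,      # gap in word_b
                score[i][j - 1] + gap,      # gap in word_a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(word_a[i - 1])
                out_b.append(word_b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(word_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(word_b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1], score[n][m]


if __name__ == "__main__":
    # Simplified one-character-per-segment transcriptions (hypothetical input).
    for left, right in [("fater", "pader"), ("nat", "nakt")]:
        seq_a, seq_b, total = align(left, right)
        print(" ".join(seq_a))
        print(" ".join(seq_b))
        print("score:", total)
```

Because the comparison operates on classes rather than on individual segments, pairs such as f/p or t/d count as matches in this sketch, which is the property that makes class-based alignment tolerant of regular sound change.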

Supplementary material

978-3-642-31467-4_3_MOESM1_ESM.zip: Electronic Supplementary Material (1,398 KB)
978-3-642-31467-4_3_MOESM2_ESM.zip: Electronic Supplementary Material (3,554 KB)

References

  1. List, J.M.: Phonetic alignment based on sound classes. In: Slavkovik, M. (ed.) Proceedings of the 15th Student Session of the European Summer School for Logic, Language and Information, Kopenhagen, pp. 192–202 (2010)
  2. List, J.M.: Multiple sequence alignment in historical linguistics. A sound class based approach. In: Proceedings of ConSOLE XIX (2011) (forthcoming)
  3. Notredame, C., Higgins, D.G., Heringa, J.: T-Coffee. A novel method for fast and accurate multiple sequence alignment. Journal of Molecular Biology 302, 205–217 (2000)
  4. Gray, R.D., Atkinson, Q.D.: Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature 426(6965), 435–439 (2003)
  5. Holman, E.W., Brown, C.H., Wichmann, S., Müller, A., Velupillai, V., Hammarström, H., Sauppe, S., Jung, H., Bakker, D., Brown, P., Belyaev, O., Urban, M., Mailhammer, R., List, J.M., Egorov, D.: Automated dating of the world’s language families based on lexical similarity. Current Anthropology 52(6), 841–875 (2011)
  6. Baxter, W.H., Manaster Ramer, A.: Beyond lumping and splitting. Probabilistic issues in historical linguistics. In: Renfrew, C., McMahon, A., Trask, L. (eds.) Time Depth in Historical Linguistics, pp. 167–188. McDonald Institute for Archaeological Research, Cambridge (2000)
  7. Kessler, B.: The significance of word lists. Statistical tests for investigating historical connections between languages. CSLI Publications, Stanford (2001)
  8. Kondrak, G.: Algorithms for language reconstruction. Dissertation. University of Toronto, Toronto (2002)
  9. Prokić, J., Wieling, M., Nerbonne, J.: Multiple sequence alignments in linguistics. In: Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education, pp. 18–25. Association for Computational Linguistics, Stroudsburg (2009)
  10. Turchin, P., Peiros, I., Gell-Mann, M.: Analyzing genetic connections between languages by matching consonant classes. Journal of Language Relationship 3, 117–126 (2010)
  11. Covington, M.A.: An algorithm to align words for historical comparison. Computational Linguistics 22(4), 481–496 (1996)
  12. Ross, M., Durie, M.: Introduction. In: Durie, M. (ed.) The Comparative Method Reviewed. Regularity and Irregularity in Language Change, pp. 3–38. Oxford University Press, New York (1996)
  13. Trask, R.L. (ed.): The dictionary of historical and comparative linguistics. Edinburgh University Press, Edinburgh (2000)
  14. Lass, R.: Historical linguistics and language change. Cambridge University Press, Cambridge (1997)
  15. Gusfield, D.: Algorithms on strings, trees and sequences. Cambridge University Press, Cambridge (1997)
  16. Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48, 443–453 (1970)
  17. Wagner, R.A., Fischer, M.J.: The string-to-string correction problem. Journal of the Association for Computing Machinery 21(1), 168–173 (1974)
  18. Eddy, S.R.: Where did the BLOSUM62 alignment score matrix come from? Nature Biotechnology 22(8), 1035–1036 (2004)
  19. Rosenberg, M.S.: Sequence alignment. Concepts and history. In: Rosenberg, M.S. (ed.) Sequence Alignment. Methods, Models, Concepts, and Strategies, pp. 1–22. University of California Press, Berkeley and Los Angeles and London (2009)
  20. Durbin, R., Eddy, S.R., Krogh, A., Mitchison, G.: Biological sequence analysis. Probabilistic models of proteins and nucleic acids, 7th edn. Cambridge University Press, Cambridge (2002)
  21. Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. Journal of Molecular Biology 147(1), 195–197 (1981)
  22. Morgenstern, B., Dress, A., Werner, T.D.: Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proceedings of the National Academy of Sciences, USA 93, 12098–12103 (1996)
  23. Henikoff, S., Henikoff, J.G.: Amino acid substitution matrices from protein blocks. PNAS 89(22), 10915–10919 (1992)
  24. Sokal, R.R., Michener, C.D.: A statistical method for evaluating systematic relationships. University of Kansas Scientific Bulletin 28, 1409–1438 (1958)
  25. Saitou, N., Nei, M.: The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4(4), 406–425 (1987)
  26. Feng, D.F., Doolittle, R.F.: Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of Molecular Evolution 25(4), 351–360 (1987)
  27. Dolgopolsky, A.B.: Gipoteza drevnejšego rodstva jazykovych semej Severnoj Evrazii s verojatnostej točky zrenija (A probabilistic hypothesis concerning the oldest relationships among the language families of Northern Eurasia). Voprosy Jazykoznanija 2, 53–63 (1964)
  28. Dolgopolsky, A.B.: A probabilistic hypothesis concerning the oldest relationships among the language families of northern Eurasia. In: Shevoroshkin, V.V. (ed.) Typology, Relationship and Time, pp. 27–50. Karoma Publisher, Ann Arbor (1986)
  29. Brown, C.H., Holman, E.W., Wichmann, S.: Sound correspondences in the world’s languages (2011), Online manuscript, PDF, http://wwwstaff.eva.mpg.de/~wichmann/wwcPaper23.pdf
  30. Brown, C.H., Holman, E.W., Wichmann, S., Velupillai, V., Cysouw, M.: Automated classification of the world’s languages. Sprachtypologie und Universalienforschung 61(4), 285–308 (2008)
  31. Thompson, J.D., Higgins, D.G., Gibson, T.J.: CLUSTAL W. Nucleic Acids Research 22(22), 4673–4680 (1994)
  32. Geisler, H.: Akzent und Lautwandel in der Romania. Narr, Tübingen (1992)
  33. Hóu, J. (ed.): Xiàndài Hànyǔ fāngyán yīnkù (Phonological database of Chinese dialects). Shànghǎi Jiàoyǔ, Shanghai (2004)
  34. Downey, S.S., Hallmark, B., Cox, M.P., Norquest, P., Lansing, S.: Computational feature-sensitive reconstruction of language relationships: Developing the ALINE distance for comparative historical linguistic reconstruction. Journal of Quantitative Linguistics 15(4), 340–369 (2008)
  35. Wang, F.: Comparison of languages in contact. Institute of Linguistics Academia Sinica, Taipei (2006)
  36. Thompson, J.D., Plewniak, F., Poch, O.: A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Research 27(13), 2682–2690 (1999)
  37. Raghava, G.P.S., Barton, G.J.: Quantification of the variation in percentage identity for protein sequence alignments. BMC Bioinformatics 7(415) (2006)
  38. Heggarty, P.: Sounds of the Andean languages. Online resource, http://www.quechua.org.uk/
  39. Allen, B.: Bai Dialect Survey. SIL International (2007)
  40. Almberg, J., Skarbø, K.: Nordavinden og sola. En norsk dialektprøvedatabase på nettet (The North Wind and the Sun. A Norwegian dialect database on the web) (2011), Online resource, http://www.ling.hf.ntnu.no/nos/
  41. Gauchat, L., Jeanjaquet, J., Tappolet, E.: Tableaux phonétiques des patois suisses romands. Attinger, Neuchâtel (1925)
  42. Renfrew, C., Heggarty, P.: Languages and origins in Europe. Online resource, http://www.languagesandpeoples.com/

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Johann-Mattis List, Heinrich Heine University Düsseldorf, Germany
