Measuring Spelling Similarity for Cognate Identification

Gomes, Luís; Pereira Lopes, José Gabriel

doi:10.1007/978-3-642-24769-9_45


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7026)

Included in the following conference series:

Portuguese Conference on Artificial Intelligence


Abstract

The most commonly used measures of string similarity, such as the Longest Common Subsequence Ratio (LCSR) and those based on Edit Distance, only take into account the number of matched and mismatched characters. However, we observe that cognates belonging to a pair of languages exhibit recurrent spelling differences such as “ph” and “f” in English-Portuguese cognates “phase” and “fase”. Those differences are attributable to the evolution of the spelling rules of each language over time, and thus they should not be penalized in the same way as arbitrary differences found in non-cognate words, if we are using word similarity as an indicator of cognaticity.
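The two baseline measures can be illustrated on the abstract's own example. The sketch below assumes the usual definitions: LCSR as the length of the longest common subsequence divided by the length of the longer word, and EdSim as one minus the unit-cost edit distance divided by the length of the longer word. The function names are chosen here for illustration and do not come from the paper.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Longest Common Subsequence Ratio (assumed normalisation: longer word)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def edsim(a: str, b: str) -> float:
    """Edit Distance-based similarity (assumed normalisation: longer word)."""
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0


print(lcsr("phase", "fase"))   # 0.6
print(edsim("phase", "fase"))  # 0.6
print(lcsr("phase", "vase"))   # 0.6
```

Under these definitions the cognate pair "phase"/"fase" and the unrelated pair "phase"/"vase" (a contrast added here for illustration) receive the same score, because both measures count only how many characters match and not which characters differ.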

This paper describes SpSim, a new spelling similarity measure for cognate identification that is tolerant towards characteristic spelling differences, which are automatically extracted from a set of cognates known a priori. Compared to LCSR and EdSim (Edit Distance-based similarity), SpSim yields an F-measure 10% higher when used for cognate identification on five different language pairs.
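The abstract does not give the SpSim formula, so the following is only a simplified illustration of the tolerance idea and should not be read as the authors' algorithm. It assumes that differing segments are found with Python's `difflib.SequenceMatcher`, that a "pattern" is a plain pair of differing substrings, and that unknown differences are penalised by the length of the longer differing segment. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def differing_segments(a: str, b: str):
    """Yield (segment_of_a, segment_of_b) for every stretch where a and b differ."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            yield a[i1:i2], b[j1:j2]


def extract_patterns(known_cognates):
    """Collect the differing segments seen in cognate pairs known in advance."""
    return {seg for a, b in known_cognates for seg in differing_segments(a, b)}


def tolerant_similarity(a: str, b: str, patterns) -> float:
    """Similarity in [0, 1] that does not penalise differences found in `patterns`.

    Assumption: an unknown difference costs the length of its longer segment,
    and the total is normalised by the length of the longer word.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    penalty = sum(
        max(len(x), len(y))
        for x, y in differing_segments(a, b)
        if (x, y) not in patterns
    )
    return 1.0 - penalty / longest


patterns = extract_patterns([("phase", "fase"), ("photo", "foto")])
print(patterns)                                         # {('ph', 'f')}
print(tolerant_similarity("phrase", "frase", patterns))  # 1.0
print(tolerant_similarity("chase", "fase", patterns))    # 0.6
```

With the "ph"/"f" difference learned from two known pairs, "phrase"/"frase" is no longer penalised, while the unlearned "ch"/"f" difference in "chase"/"fase" still is.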



Author information

Authors and Affiliations

Centro de Informática e Tecnologias da Informação (CITI), Universidade Nova de Lisboa, 2829-516, Caparica, Portugal
Luís Gomes & José Gabriel Pereira Lopes


Editor information

Editors and Affiliations

Faculdade de Ciências, Departamento de Informática, GUESS/LabMAg/Universidade de Lisboa, Campo Grande, 749-016, Lisboa, Portugal
Luis Antunes
Department of Computer Science and Engineering, INESC-ID, Instituto Superior Técnico, IST, Avenida Rovisco Pais, 1049-001, Lisboa, Portugal
H. Sofia Pinto


About this paper

Cite this paper

Gomes, L., Pereira Lopes, J.G. (2011). Measuring Spelling Similarity for Cognate Identification. In: Antunes, L., Pinto, H.S. (eds) Progress in Artificial Intelligence. EPIA 2011. Lecture Notes in Computer Science, vol 7026. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24769-9_45


DOI: https://doi.org/10.1007/978-3-642-24769-9_45
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24768-2
Online ISBN: 978-3-642-24769-9
eBook Packages: Computer Science, Computer Science (R0)
