Methods for extracting and classifying pairs of cognates and false friends

Mitkov, Ruslan; Pekar, Viktor; Blagoev, Dimitar; Mulloni, Andrea

doi:10.1007/s10590-008-9034-5

Published: 17 May 2008

Volume 21, pages 29–53 (2007)

Machine Translation

Abstract

The identification of cognates has attracted the attention of researchers working in the area of Natural Language Processing, but the identification of false friends is still an under-researched area. This paper proposes novel methods for the automatic identification of both cognates and false friends from comparable bilingual corpora. The methods are not dependent on the existence of parallel texts, and make use of only monolingual corpora and a bilingual dictionary necessary for the mapping of co-occurrence data across languages. In addition, the methods do not require that the newly discovered cognates or false friends are present in the dictionary and hence are capable of operating on out-of-vocabulary expressions. These methods are evaluated on English, French, German and Spanish corpora in order to identify English–French, English–German, English–Spanish and French–Spanish pairs of cognates or false friends. The experiments were performed in two settings: (i) assuming ‘ideal’ extraction of cognates and false friends from plain-text corpora, i.e. when the evaluation data contains only cognates and false friends, and (ii) a real-world extraction scenario where cognates and false friends have to first be identified among words found in two comparable corpora in different languages. The evaluation results show that the developed methods identify cognates and false friends with very satisfactory results for both recall and precision, with methods that incorporate background semantic knowledge, in addition to co-occurrence data obtained from the corpora, delivering the best results.
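The idea of mapping co-occurrence data across languages through a bilingual dictionary can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the context window, the similarity measure or any decision threshold, so the function names, the window size, the cosine measure and the 0.5 threshold below are all assumptions chosen for illustration. The sketch builds a co-occurrence vector for each word of a candidate pair from its own monolingual corpus, translates the source-language context words with the dictionary, and compares the two vectors. Note that the pair itself does not need to be in the dictionary; only its context words do.

```python
from collections import Counter
from math import sqrt


def cooccurrence_vector(sentences, target, window=2):
    """Count words appearing within `window` tokens of `target` (assumed context definition)."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


def translate_vector(vector, dictionary):
    """Map context words into the other language; words missing from the dictionary are dropped."""
    mapped = Counter()
    for word, count in vector.items():
        for translation in dictionary.get(word, ()):
            mapped[translation] += count
    return mapped


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (an assumed measure)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def classify_pair(src_vec, tgt_vec, dictionary, threshold=0.5):
    """Label a candidate pair by the similarity of its contexts (hypothetical threshold)."""
    sim = cosine(translate_vector(src_vec, dictionary), tgt_vec)
    return ("cognate" if sim >= threshold else "false friend"), sim


if __name__ == "__main__":
    # Toy corpora and dictionary, invented purely for illustration.
    en = [["the", "library", "lends", "books"]]
    fr = [["la", "librairie", "vend", "des", "livres"]]
    en_fr = {"books": ["livres"], "lends": ["prête"], "the": ["la", "le"]}
    label, sim = classify_pair(
        cooccurrence_vector(en, "library"),
        cooccurrence_vector(fr, "librairie"),
        en_fr,
    )
    print(label, round(sim, 3))
```

With realistic corpora the vectors would be far larger, and the threshold would need to be tuned on labelled pairs rather than fixed by hand.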

Author information

Authors and Affiliations

Research Institute for Information and Language Processing, University of Wolverhampton, Stafford Street, Wolverhampton, WV1 1SB, UK
Ruslan Mitkov, Viktor Pekar & Andrea Mulloni
Mathematics and Informatics Department, University of Plovdiv, 4003, Plovdiv, Bulgaria
Dimitar Blagoev

Corresponding author

Correspondence to Ruslan Mitkov.

About this article

Cite this article

Mitkov, R., Pekar, V., Blagoev, D. et al. Methods for extracting and classifying pairs of cognates and false friends. Machine Translation 21, 29–53 (2007). https://doi.org/10.1007/s10590-008-9034-5

Received: 18 January 2007
Accepted: 27 February 2008
Published: 17 May 2008
Issue Date: March 2007
DOI: https://doi.org/10.1007/s10590-008-9034-5
