
Methods for extracting and classifying pairs of cognates and false friends

Machine Translation

Abstract

The identification of cognates has attracted the attention of researchers working in the area of Natural Language Processing, but the identification of false friends is still an under-researched area. This paper proposes novel methods for the automatic identification of both cognates and false friends from comparable bilingual corpora. The methods are not dependent on the existence of parallel texts, and make use of only monolingual corpora and a bilingual dictionary necessary for the mapping of co-occurrence data across languages. In addition, the methods do not require that the newly discovered cognates or false friends are present in the dictionary and hence are capable of operating on out-of-vocabulary expressions. These methods are evaluated on English, French, German and Spanish corpora in order to identify English–French, English–German, English–Spanish and French–Spanish pairs of cognates or false friends. The experiments were performed in two settings: (i) assuming ‘ideal’ extraction of cognates and false friends from plain-text corpora, i.e. when the evaluation data contains only cognates and false friends, and (ii) a real-world extraction scenario where cognates and false friends have to first be identified among words found in two comparable corpora in different languages. The evaluation results show that the developed methods identify cognates and false friends with very satisfactory results for both recall and precision, with methods that incorporate background semantic knowledge, in addition to co-occurrence data obtained from the corpora, delivering the best results.
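The idea of mapping co-occurrence data across languages through a bilingual dictionary can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the cosine measure, the toy counts and the 0.5 threshold are all assumptions made for illustration. Each candidate word is represented by counts of the words it co-occurs with in its own monolingual corpus; the source-language counts are projected into the target language via the dictionary, and the two vectors are then compared. Note that the candidate words themselves do not need to appear in the dictionary.

```python
from collections import Counter
from math import sqrt


def map_contexts(contexts, dictionary):
    """Project source-language co-occurrence counts into the target language.

    `dictionary` maps a source word to a list of target translations
    (hypothetical format). Context words missing from it are dropped.
    """
    mapped = Counter()
    for word, count in contexts.items():
        for translation in dictionary.get(word, ()):
            mapped[translation] += count
    return mapped


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def classify_pair(src_contexts, tgt_contexts, dictionary, threshold=0.5):
    """Label an orthographically similar pair by its contextual similarity.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    sim = cosine(map_contexts(src_contexts, dictionary), Counter(tgt_contexts))
    label = "cognate" if sim >= threshold else "false friend"
    return label, sim


if __name__ == "__main__":
    # Toy English-French dictionary and made-up co-occurrence counts.
    en_fr = {
        "book": ["livre"], "borrow": ["emprunter"], "read": ["lire"],
        "student": ["étudiant"], "professor": ["professeur"],
    }
    # English "library" vs French "librairie" (bookshop)
    print(classify_pair({"book": 5, "borrow": 4, "read": 3},
                        {"livre": 3, "acheter": 5, "vendre": 4}, en_fr))
    # English "university" vs French "université"
    print(classify_pair({"student": 5, "professor": 3},
                        {"étudiant": 4, "professeur": 3}, en_fr))
```

With these toy counts the first pair shares little mapped context and falls below the threshold, while the second pair shares most of it. In practice the counts would come from large comparable corpora, and the abstract indicates that adding background semantic knowledge to such co-occurrence data gave the best results.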



Author information

Corresponding author

Correspondence to Ruslan Mitkov.

About this article

Cite this article

Mitkov, R., Pekar, V., Blagoev, D. et al. Methods for extracting and classifying pairs of cognates and false friends. Machine Translation 21, 29–53 (2007). https://doi.org/10.1007/s10590-008-9034-5


  • DOI: https://doi.org/10.1007/s10590-008-9034-5
