Machine Translation, 21:29

Methods for extracting and classifying pairs of cognates and false friends

  • Ruslan Mitkov
  • Viktor Pekar
  • Dimitar Blagoev
  • Andrea Mulloni

Abstract

The identification of cognates has attracted the attention of researchers working in the area of Natural Language Processing, but the identification of false friends is still an under-researched area. This paper proposes novel methods for the automatic identification of both cognates and false friends from comparable bilingual corpora. The methods are not dependent on the existence of parallel texts, and make use of only monolingual corpora and a bilingual dictionary necessary for the mapping of co-occurrence data across languages. In addition, the methods do not require that the newly discovered cognates or false friends are present in the dictionary and hence are capable of operating on out-of-vocabulary expressions. These methods are evaluated on English, French, German and Spanish corpora in order to identify English–French, English–German, English–Spanish and French–Spanish pairs of cognates or false friends. The experiments were performed in two settings: (i) assuming ‘ideal’ extraction of cognates and false friends from plain-text corpora, i.e. when the evaluation data contains only cognates and false friends, and (ii) a real-world extraction scenario where cognates and false friends have to first be identified among words found in two comparable corpora in different languages. The evaluation results show that the developed methods identify cognates and false friends with very satisfactory results for both recall and precision, with methods that incorporate background semantic knowledge, in addition to co-occurrence data obtained from the corpora, delivering the best results.
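The pipeline summarised above can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not name the specific similarity measures, so this sketch assumes normalised edit distance for orthographic similarity and cosine similarity over co-occurrence counts for distributional similarity. All function names, the toy dictionary, the context counts and both thresholds are hypothetical.

```python
from collections import Counter
from math import sqrt


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def orthographic_similarity(a: str, b: str) -> float:
    """1 minus the edit distance normalised by the longer word's length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def map_contexts(contexts: Counter, dictionary: dict) -> Counter:
    """Map co-occurrence counts into the other language via a bilingual
    dictionary. Context words missing from the dictionary are dropped;
    the target word itself does not need a dictionary entry."""
    mapped = Counter()
    for word, count in contexts.items():
        for translation in dictionary.get(word, []):
            mapped[translation] += count
    return mapped


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def classify_pair(src, tgt, src_contexts, tgt_contexts, dictionary,
                  orth_threshold=0.6, sem_threshold=0.5):
    """Label a word pair; both thresholds are illustrative assumptions."""
    if orthographic_similarity(src, tgt) < orth_threshold:
        return "unrelated"
    similarity = cosine(map_contexts(src_contexts, dictionary), tgt_contexts)
    return "cognate" if similarity >= sem_threshold else "false friend"


# Toy English-French data, invented for illustration only.
EN_FR = {"book": ["livre"], "borrow": ["emprunter"], "read": ["lire"],
         "student": ["étudiant"], "professor": ["professeur"]}

print(classify_pair("library", "librairie",
                    Counter({"book": 5, "borrow": 4, "read": 3}),
                    Counter({"livre": 3, "acheter": 5, "vendre": 4}),
                    EN_FR))
print(classify_pair("university", "université",
                    Counter({"student": 4, "professor": 2}),
                    Counter({"étudiant": 4, "professeur": 2}),
                    EN_FR))
```

In this toy setup the orthographic filter proposes candidate pairs, and the dictionary is used only to translate context words. That is why the candidate words themselves may be out of vocabulary.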

Keywords

Cognates, Faux amis, Orthographic similarity, Distributional similarity, Semantic similarity, Translational equivalence

Copyright information

© Springer Science+Business Media B.V. 2008

Authors and Affiliations

  • Ruslan Mitkov (1)
  • Viktor Pekar (1)
  • Dimitar Blagoev (2)
  • Andrea Mulloni (1)

  1. Research Institute for Information and Language Processing, University of Wolverhampton, Wolverhampton, UK
  2. Mathematics and Informatics Department, University of Plovdiv, Plovdiv, Bulgaria
