Machine Translation, Volume 27, Issue 1, pp 1–23

Generalizing sampling-based multilingual alignment

  • Adrien Lardilleux
  • François Yvon
  • Yves Lepage

Abstract

Sub-sentential alignment is the process by which multi-word translation units are extracted from sentence-aligned multilingual parallel texts. This process is required, for instance, when training statistical machine translation systems. Standard approaches typically rely on the estimation of several probabilistic models of increasing complexity and on the use of various heuristics that make it possible to align first isolated words and then, by extension, groups of words. In this paper, we explore an alternative approach that relies on a much simpler principle: the comparison of occurrence profiles in sub-corpora obtained by sampling. After analyzing the strengths and weaknesses of this approach, we show how to improve the detection of multi-word translation units and evaluate these improvements on machine translation tasks.
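
The principle named in the abstract, comparing occurrence profiles in sampled sub-corpora, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy corpus, the whitespace tokenisation, and the rule that words sharing exactly the same profile form a candidate translation unit are all assumptions made here for illustration.

```python
import random
from collections import defaultdict


def occurrence_profiles(subcorpus):
    """Map each (language index, word) to the set of line indices where it occurs."""
    profiles = defaultdict(set)
    for line_no, sentence_tuple in enumerate(subcorpus):
        for lang, sentence in enumerate(sentence_tuple):
            for word in sentence.split():  # assumption: whitespace tokenisation
                profiles[(lang, word)].add(line_no)
    return profiles


def group_by_profile(subcorpus):
    """Group words that occur on exactly the same lines of the sub-corpus."""
    groups = defaultdict(set)
    for key, lines in occurrence_profiles(subcorpus).items():
        groups[frozenset(lines)].add(key)
    # Keep only groups that contain words from at least two languages.
    return [g for g in groups.values() if len({lang for lang, _ in g}) > 1]


def sample_candidates(corpus, n_samples, sample_size, seed=0):
    """Count how often each cross-lingual group appears over random sub-corpora."""
    rng = random.Random(seed)
    counts = defaultdict(int)
    for _ in range(n_samples):
        subcorpus = rng.sample(corpus, sample_size)
        for group in group_by_profile(subcorpus):
            counts[frozenset(group)] += 1
    return counts


# Hypothetical toy English-French corpus, one aligned sentence pair per entry.
corpus = [
    ("one coffee", "un café"),
    ("one tea", "un thé"),
    ("the coffee", "le café"),
]
counts = sample_candidates(corpus, n_samples=20, sample_size=2)
for group, count in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(count, sorted(group))
```

In this sketch, small samples tend to group whole sentence fragments together, while larger samples separate individual words; counting over many samples gives each candidate a score.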

Keywords

Association measures · Sub-sentential alignment · Phrase-based machine translation

Copyright information

© Springer Science+Business Media B.V. 2012

Authors and Affiliations

  • Adrien Lardilleux (1)
  • François Yvon (1, 2)
  • Yves Lepage (3)

  1. LIMSI-CNRS, Orsay Cedex, France
  2. University Paris Sud, Orsay Cedex, France
  3. Graduate School of Information, Production and Systems, Waseda University, Kitakyuusyuu-si, Japan