Generalizing sampling-based multilingual alignment

Machine Translation

Abstract

Sub-sentential alignment is the process by which multi-word translation units are extracted from sentence-aligned multilingual parallel texts. This process is required, for instance, when training statistical machine translation systems. Standard approaches typically rely on estimating several probabilistic models of increasing complexity and on various heuristics that make it possible to align first isolated words and then, by extension, groups of words. In this paper, we explore an alternative approach that relies on a much simpler principle: the comparison of occurrence profiles in sub-corpora obtained by sampling. After analyzing the strengths and weaknesses of this approach, we show how to improve the detection of multi-word translation units, and we evaluate these improvements on machine translation tasks.
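The principle named in the abstract, comparing occurrence profiles in sampled sub-corpora, can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the whitespace tokenization, the rule that words are grouped only when their profiles are exactly identical, and the uniform sampling scheme are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def occurrence_profiles(subcorpus):
    """Map each (language index, word) pair to the set of sentence
    positions of the sub-corpus in which that word occurs."""
    profiles = defaultdict(set)
    for pos, sentence_tuple in enumerate(subcorpus):
        for lang, sentence in enumerate(sentence_tuple):
            for word in sentence.split():  # assumption: whitespace tokens
                profiles[(lang, word)].add(pos)
    return profiles


def group_by_profile(subcorpus):
    """Group words whose occurrence profiles are exactly identical
    (an assumed comparison rule), keeping only the groups that
    contain at least one word from every language."""
    if not subcorpus:
        return []
    groups = defaultdict(set)
    for key, positions in occurrence_profiles(subcorpus).items():
        groups[frozenset(positions)].add(key)
    all_langs = set(range(len(subcorpus[0])))
    return [g for g in groups.values()
            if {lang for lang, _ in g} == all_langs]


def sample_and_count(corpus, n_samples, sample_size, seed=0):
    """Draw random sub-corpora (assumed: uniform, without replacement)
    and count how often each multilingual word group is found."""
    rng = random.Random(seed)
    counts = defaultdict(int)
    for _ in range(n_samples):
        subcorpus = rng.sample(corpus, sample_size)
        for group in group_by_profile(subcorpus):
            counts[frozenset(group)] += 1
    return counts


# Hypothetical two-language corpus; each entry is one aligned sentence pair.
corpus = [("a b", "x y"), ("a c", "x z")]
counts = sample_and_count(corpus, n_samples=3, sample_size=2)
```

In this toy corpus, "a" and "x" occur in exactly the same sentences and so end up in the same group, as do "b" with "y" and "c" with "z". Counting groups over many samples gives a rough association score; how such counts are turned into translation probabilities is not specified in the abstract and is left out here.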

Author information

Corresponding author

Correspondence to Adrien Lardilleux.

About this article

Cite this article

Lardilleux, A., Yvon, F. & Lepage, Y. Generalizing sampling-based multilingual alignment. Machine Translation 27, 1–23 (2013). https://doi.org/10.1007/s10590-012-9126-0

  • DOI: https://doi.org/10.1007/s10590-012-9126-0
