Abstract
Sub-sentential alignment is the process by which multi-word translation units are extracted from sentence-aligned multilingual parallel texts. This process is required, for instance, when training statistical machine translation systems. Standard approaches typically rely on the estimation of several probabilistic models of increasing complexity and on various heuristics that make it possible to align first isolated words and then, by extension, groups of words. In this paper, we explore an alternative approach that relies on a much simpler principle: the comparison of occurrence profiles in sub-corpora obtained by sampling. After analyzing the strengths and weaknesses of this approach, we show how to improve the detection of multi-word translation units and evaluate these improvements on machine translation tasks.
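To make the principle of comparing occurrence profiles more concrete, the following Python fragment gives a minimal sketch of one possible reading of it. It is not the implementation evaluated in the article. It assumes whitespace tokenization, a bilingual corpus given as a list of (source, target) sentence pairs, and that words are grouped only when they occur in exactly the same sentence pairs of a random sub-corpus. The function names `occurrence_profiles`, `group_by_profile` and `sample_and_align` are hypothetical, and the sketch ignores word order inside a group.

```python
import random
from collections import defaultdict


def occurrence_profiles(pairs):
    """Map each (side, word) to the set of sentence-pair indices where it occurs."""
    profiles = defaultdict(set)
    for idx, (src, tgt) in enumerate(pairs):
        for word in src.split():
            profiles[("src", word)].add(idx)
        for word in tgt.split():
            profiles[("tgt", word)].add(idx)
    return profiles


def group_by_profile(pairs):
    """Group words of both languages that share exactly the same profile."""
    groups = defaultdict(lambda: {"src": set(), "tgt": set()})
    for (side, word), lines in occurrence_profiles(pairs).items():
        groups[frozenset(lines)][side].add(word)
    # Keep only groups that contain words on both sides (assumed filter).
    return [
        (sorted(g["src"]), sorted(g["tgt"]))
        for g in groups.values()
        if g["src"] and g["tgt"]
    ]


def sample_and_align(corpus, sample_size, n_samples, seed=0):
    """Count how often each candidate pair is produced over random sub-corpora."""
    rng = random.Random(seed)
    counts = defaultdict(int)
    for _ in range(n_samples):
        sub_corpus = rng.sample(corpus, sample_size)
        for src_words, tgt_words in group_by_profile(sub_corpus):
            counts[(tuple(src_words), tuple(tgt_words))] += 1
    return counts


# Toy corpus, invented for illustration only.
corpus = [
    ("one coffee", "un café"),
    ("one tea", "un thé"),
    ("strong coffee", "café fort"),
]
candidates = group_by_profile(corpus)
counts = sample_and_align(corpus, sample_size=3, n_samples=2)
```

In this sketch, the counts accumulated over many small samples could serve as a crude association score for each candidate pair. How sample sizes are chosen and how counts are turned into translation probabilities are left open here.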
Cite this article
Lardilleux, A., Yvon, F. & Lepage, Y. Generalizing sampling-based multilingual alignment. Machine Translation 27, 1–23 (2013). https://doi.org/10.1007/s10590-012-9126-0
DOI: https://doi.org/10.1007/s10590-012-9126-0