Improved Alignment Based Algorithm for Multilingual Text Compression

  • Conference paper
Language and Automata Theory and Applications (LATA 2011)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6638)

Abstract

Multilingual text compression exploits the existence of the same text in several languages to compress the second and subsequent copies by reference to the first. This is done based on bilingual text alignment, a mapping of words and phrases in one text to their semantic equivalents in the translation. A new multilingual text compression scheme is suggested, which improves over an immediate generalization of bilingual algorithms. The idea is to store the necessary markup data within the source language text; the compression loss incurred by this overhead is smaller than the savings in the compressed target language texts, provided the number of target languages is large enough. Experimental results are presented for a parallel corpus in six languages extracted from the EUR-Lex website of the European Union. These results show the superiority of the new algorithm as a function of the number of languages.

This work was done while the first author was a PhD student at Bar Ilan University.
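
The idea of compressing a translation "by reference to the first" text can be illustrated with a toy sketch. This is not the algorithm of the paper: the word-for-word `lexicon`, the `("ref", i)` / `("lit", w)` token format, and the function names `encode_target` and `decode_target` are all assumptions made for illustration, and a real system would rely on a proper alignment and an entropy coder for the resulting tokens.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical token format: ("ref", source_index) or ("lit", literal_word).
Token = Tuple[str, Union[int, str]]


def encode_target(source: List[str], target: List[str],
                  lexicon: Dict[str, str]) -> List[Token]:
    """Encode each target word as a pointer to a source position whose
    lexicon translation equals it, or as a literal when no such position exists."""
    encoded: List[Token] = []
    for word in target:
        ref = next((i for i, s in enumerate(source)
                    if lexicon.get(s) == word), None)
        encoded.append(("ref", ref) if ref is not None else ("lit", word))
    return encoded


def decode_target(source: List[str], encoded: List[Token],
                  lexicon: Dict[str, str]) -> List[str]:
    """Rebuild the target text from the source text, the lexicon, and the tokens."""
    return [lexicon[source[value]] if kind == "ref" else value
            for kind, value in encoded]


# Toy data (assumed, not taken from the paper's corpus).
source = ["the", "white", "house"]
target = ["la", "casa", "muy", "blanca"]
lexicon = {"the": "la", "white": "blanca", "house": "casa"}

encoded = encode_target(source, target, lexicon)
restored = decode_target(source, encoded, lexicon)
```

In this sketch only the word missing from the lexicon is stored literally; every other target word becomes a small integer pointing into the source text, which is the kind of redundancy an alignment-based scheme can exploit.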



Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Conley, E.S., Klein, S.T. (2011). Improved Alignment Based Algorithm for Multilingual Text Compression. In: Dediu, AH., Inenaga, S., Martín-Vide, C. (eds) Language and Automata Theory and Applications. LATA 2011. Lecture Notes in Computer Science, vol 6638. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21254-3_18

  • DOI: https://doi.org/10.1007/978-3-642-21254-3_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-21253-6

  • Online ISBN: 978-3-642-21254-3

  • eBook Packages: Computer Science, Computer Science (R0)
