Improved Alignment-Based Algorithm for Multilingual Text Compression

Mathematics in Computer Science

Abstract

Multilingual text compression exploits the existence of the same text in several languages to compress the second and subsequent copies by reference to the first. This is done based on bilingual text alignment, a mapping of words and phrases in one text to their semantic equivalents in the translation. A new multilingual text compression scheme is suggested, which improves over an immediate generalization of bilingual algorithms. The idea is to store the necessary markup data within the source language text; the incurred compression loss due to this overhead is smaller than the savings in the compressed target language texts, for a large enough number of the latter. Experimental results are presented for a parallel corpus in six languages extracted from the EUR-Lex website of the European Union. These results show the superiority of the new algorithm as a function of the number of languages.
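The core idea of compressing a translation by reference to the source can be illustrated with a minimal sketch. This is not the algorithm of the paper: the function names `encode_target` and `decode_target`, the toy lexicon, and the representation of the alignment as a list of source indices are all assumptions made for illustration. Each target word is replaced by a small index into the list of candidate translations of its aligned source word whenever possible, and stored literally otherwise.

```python
from typing import Dict, List, Optional, Tuple, Union

# A code is either ("ref", rank) or ("lit", word). Hypothetical format.
Code = Tuple[str, Union[int, str]]


def encode_target(
    source: List[str],
    target: List[str],
    alignment: List[Optional[int]],
    lexicon: Dict[str, List[str]],
) -> List[Code]:
    """Encode target words by reference to aligned source words.

    alignment[j] is the index of the source word aligned with target[j],
    or None if the target word has no aligned source word.
    """
    codes: List[Code] = []
    for j, word in enumerate(target):
        i = alignment[j]
        candidates = lexicon.get(source[i], []) if i is not None else []
        if word in candidates:
            # Small integer: position among the candidate translations.
            codes.append(("ref", candidates.index(word)))
        else:
            # Fall back to storing the word itself.
            codes.append(("lit", word))
    return codes


def decode_target(
    source: List[str],
    codes: List[Code],
    alignment: List[Optional[int]],
    lexicon: Dict[str, List[str]],
) -> List[str]:
    """Invert encode_target, given the same source, alignment and lexicon."""
    out: List[str] = []
    for j, (kind, value) in enumerate(codes):
        if kind == "ref":
            out.append(lexicon[source[alignment[j]]][value])
        else:
            out.append(value)
    return out


# Toy data, invented for this example.
source = ["the", "house", "is", "red"]
target = ["das", "Haus", "ist", "ja", "rot"]
alignment = [0, 1, 2, None, 3]
lexicon = {
    "the": ["der", "die", "das"],
    "house": ["Haus"],
    "is": ["ist"],
    "red": ["rot"],
}

codes = encode_target(source, target, alignment, lexicon)
restored = decode_target(source, codes, alignment, lexicon)
```

In this sketch the decoder needs the alignment as side information. The abstract's proposal of storing the markup data within the source language text concerns where such information lives, so that its cost can be shared across several target languages; the sketch does not model that trade-off.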

Author information

Corresponding author

Correspondence to Shmuel T. Klein.

Additional information

This is an extended version of a paper that appeared in the Proceedings of the LATA’11 conference. The work was done while the first author was a PhD student at Bar Ilan University.

About this article

Cite this article

Conley, E.S., Klein, S.T.: Improved Alignment-Based Algorithm for Multilingual Text Compression. Math. Comput. Sci. 7, 137–153 (2013). https://doi.org/10.1007/s11786-012-0138-1

  • DOI: https://doi.org/10.1007/s11786-012-0138-1
