
A Modified Burrows-Wheeler Transform for Highly Scalable Example-Based Translation

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3265)

Abstract

The Burrows-Wheeler Transform (BWT) was originally developed for data compression, but it can also be applied to indexing text. This paper presents an adaptation of the BWT to word-based indexing of the training corpus for an example-based machine translation (EBMT) system. The adapted BWT embeds the information needed to retrieve matched training instances without requiring any additional space. It can also be instantiated in a compressed form that reduces disk space and memory requirements by about 40% while remaining searchable without decompression.
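As a rough illustration of word-based indexing, the sketch below builds a suffix array and the standard (unmodified) BWT over a list of words, treating each word as a single symbol. It does not reproduce the paper's adapted transform or its compressed form. The function name `build_word_index` and the `"\0"` end-of-text sentinel are assumptions made for this example, and the naive sort is only suitable for tiny inputs.

```python
def build_word_index(tokens):
    """Return (suffix_array, bwt) for a token list, one word per symbol."""
    # Hypothetical sentinel: assumed absent from the corpus; it sorts
    # before any ordinary word and makes every suffix unique.
    eos = "\0"
    text = list(tokens) + [eos]
    # Sort suffix start positions by the word sequence beginning there.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the word that precedes each sorted suffix (wrapping at position 0).
    bwt = [text[i - 1] for i in suffix_array]
    return suffix_array, bwt


tokens = "the cat sat on the mat".split()
suffix_array, bwt = build_word_index(tokens)
```

In this toy corpus the two occurrences of "the" (positions 0 and 4) end up next to each other in the suffix array, which is the grouping property the index relies on.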

Two properties of the index enable additional capabilities and faster run-time performance: its O(log N) lookups, compared with the O(N) lookups of the inverted-file index used previously, and the structure of the index itself. Because the BWT groups all instances of any n-gram together, it can be used to quickly enumerate the most frequent n-grams, for which translations can be precomputed and stored, resulting in an order-of-magnitude speedup at run time.
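The logarithmic lookup and the grouping of n-gram instances can be sketched on a plain word-level suffix array, used here as a stand-in for the paper's index rather than a reproduction of it. The names `find_ngram` and `frequent_ngrams` are hypothetical. The binary search needs O(log N) comparisons, and because equal n-grams occupy one contiguous run of the sorted suffixes, a single linear scan enumerates the frequent ones.

```python
def find_ngram(text, suffix_array, ngram):
    """Return (lo, hi) such that suffix_array[lo:hi] holds every start of ngram."""
    ngram = list(ngram)
    n = len(ngram)

    def first_index(past_equal):
        # Binary search over the sorted suffixes, comparing n-word prefixes.
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            start = suffix_array[mid]
            prefix = text[start:start + n]
            if prefix < ngram or (past_equal and prefix == ngram):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return first_index(False), first_index(True)


def frequent_ngrams(text, suffix_array, n, min_count):
    """List (ngram, count) for n-grams occurring at least min_count times."""
    results = []
    run_start = 0
    for k in range(1, len(suffix_array) + 1):
        first = suffix_array[run_start]
        prev = text[first:first + n]
        if k < len(suffix_array):
            cur = text[suffix_array[k]:suffix_array[k] + n]
        else:
            cur = None  # forces the final run to be closed
        if cur != prev:
            # Skip truncated prefixes near the end of the text.
            if len(prev) == n and k - run_start >= min_count:
                results.append((tuple(prev), k - run_start))
            run_start = k
    return results


text = "a b a b a c".split()
# Naive construction, adequate only for a toy corpus.
suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
lo, hi = find_ngram(text, suffix_array, ["a", "b"])
positions = sorted(suffix_array[lo:hi])
common = frequent_ngrams(text, suffix_array, n=2, min_count=2)
```

Here `positions` lists where the bigram "a b" starts in the text, and `common` holds the bigrams seen at least twice, the kind of candidates for which translations could be precomputed.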



Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Brown, R.D. (2004). A Modified Burrows-Wheeler Transform for Highly Scalable Example-Based Translation. In: Frederking, R.E., Taylor, K.B. (eds) Machine Translation: From Real Users to Research. AMTA 2004. Lecture Notes in Computer Science, vol 3265. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30194-3_4


  • DOI: https://doi.org/10.1007/978-3-540-30194-3_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23300-8

  • Online ISBN: 978-3-540-30194-3

  • eBook Packages: Springer Book Archive
