The NAIST Simultaneous Translation Corpus

Part of the book series: New Frontiers in Translation Studies (NFTS)

Abstract

This chapter describes an English-Japanese/Japanese-English simultaneous interpretation corpus collected at the Nara Institute of Science and Technology (NAIST). Two main features set the corpus apart from others. First, it contains recorded interpretations by professional simultaneous interpreters with different amounts of experience. This makes it possible to compare interpreters of different levels and to examine the effect of experience on the objective and subjective quality of their output. Second, part of the corpus has also been translated. This makes it possible to compare and contrast the results when a particular talk is translated from text without time constraints (the translation data) and when it is interpreted from speech under time constraints (the simultaneous interpretation data). The corpus contains a total of 387k words of data, covering lectures and news. All transcriptions are time-aligned. The corpus should be helpful for analyzing differences in interpretation styles, and may also serve as a reference in the construction of simultaneous interpretation systems.

This work was performed while all authors were affiliated with the Nara Institute of Science and Technology. This chapter is based on a manuscript in the proceedings of the International Conference on Language Resources and Evaluation (LREC) (Shimizu et al. 2014).

Notes

  1. The corpus is available at http://ahclab.naist.jp/resource/stc/.

  2. http://www.ted.com.

  3. http://cnnradio.cnn.com/.

  4. http://www3.nhk.or.jp/.

References

  • Bangalore, Srinivas, Vivek Kumar Rangarajan Sridhar, Prakash Kolan, Ladan Golipour, and Aura Jimenez. 2012. Real-time incremental speech-to-speech translation of dialogs. In Proceedings of the 2012 conference of the North American chapter of the Association for Computational Linguistics: Human language technologies (NAACL-HLT), ed. T. Chandra. Montreal: Association for Computational Linguistics.

  • Bendazzoli, Claudio, and Annalisa Sandrelli. 2005. An approach to corpus-based interpreting studies: Developing EPIC (European Parliament Interpreting Corpus). In Proceedings of MuTra 2005: Challenges of multidimensional translation, eds. H. Gerzymisch-Arbogast and S. Nauert, 149–160. Saarbrücken: Saarland University.

  • Fügen, Christian, Alex Waibel, and Muntsin Kolss. 2007. Simultaneous translation of lectures and speeches. Machine Translation 21 (4): 209–252.

  • Fujita, Tomoki, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2013. Simple, lexicalized choice of translation timing for simultaneous speech translation. In Proceedings of the 14th annual conference of the International Speech Communication Association (InterSpeech), eds. F. Bimbot, C. Cerisara, C. Fougeron, G. Gravier, L. Lamel, F. Pellegrino, and P. Perrier, 3487–3491. Lyon: International Speech Communication Association.

  • Grissom, Alvin, He He, Jordan Boyd-Graber, John Morgan, and Hal Daumé. 2014. Don’t until the final verb wait: Reinforcement learning for simultaneous machine translation. In Proceedings of the 2014 Conference on empirical methods in natural language processing (EMNLP), 1342–1352. Qatar: Association for Computational Linguistics.

  • Hamon, Olivier, Djamel Mostefa, and Khalid Choukri. 2007. End-to-end evaluation of a speech-to-speech translation system in TC-STAR. In Proceedings of the Machine Translation summit XI, ed. B. Maegaard, s.l. Copenhagen: European Association for Machine Translation.

  • He, He, Jordan Boyd-Graber, and Hal Daumé. 2016. Interpretese vs. translationese: The uniqueness of human strategies in simultaneous interpretation. In Proceedings of the 2016 conference of the North American chapter of the Association for Computational Linguistics: Human language technologies (NAACL-HLT), ed. K. Knight, 944–952. San Diego: Association for Computational Linguistics.

  • Isozaki, Hideki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010. Automatic evaluation of translation quality for distant language pairs. In Proceedings of the 2010 conference on empirical methods in natural language processing (EMNLP), eds. H. Li and L. Márquez, 971–976. Cambridge: Association for Computational Linguistics.

  • Maekawa, Kikuo. 2003. Corpus of spontaneous Japanese: Its design and evaluation. In Proceedings of the ISCA/IEEE workshop on spontaneous speech, paper MM02. Tokyo: Tokyo Institute of Technology.

  • Matsubara, Shigeki, Akira Takagi, Nobuo Kawaguchi, and Yasuyoshi Inagaki. 2002. Bilingual spoken monologue corpus for simultaneous machine interpretation research. In Proceedings of the 3rd international conference on language resources and evaluation (LREC), 153–159. Las Palmas: LREC.

  • Oda, Yusuke, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2014. Optimizing segmentation strategies for simultaneous speech translation. In Proceedings of the 52nd annual meeting of the Association for Computational Linguistics (ACL), ed. D. Marcu, 551–556. Baltimore: Association for Computational Linguistics.

  • Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics (ACL), ed. I. Pierre, 311–318. Philadelphia: Association for Computational Linguistics.

  • Paulik, Matthias, and Alex Waibel. 2008. Extracting clues from human interpreter speech for spoken language translation. In Proceedings of the international conference on acoustics, speech, and signal processing (ICASSP), ed. A. Sayed, 5097–5100. Las Vegas: IEEE.

  • Ryu, Koichiro, Shigeki Matsubara, and Yasuyoshi Inagaki. 2006. Simultaneous English-Japanese spoken language translation based on incremental dependency parsing and transfer. In Proceedings of the 44th annual meeting of the Association for Computational Linguistics (ACL), ed. N. Calzolari, 683–690. Sydney: Association for Computational Linguistics.

  • Shimizu, Hiroaki, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2014. Collection of a simultaneous translation corpus for comparative analysis. In Proceedings of the 9th international conference on language resources and evaluation (LREC), eds. N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis, 670–673. Reykjavik: LREC.

  • Snover, Matthew, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the conference of the Association for Machine Translation in the Americas (AMTA), 223–231. Boston: Association for Machine Translation in the Americas.

  • Sridhar, Vivek Kumar Rangarajan, John Chen, and Srinivas Bangalore. 2013a. Corpus analysis of simultaneous interpretation data for improving real time speech translation. In Proceedings of the 14th annual conference of the International Speech Communication Association (InterSpeech), eds. F. Bimbot, C. Cerisara, C. Fougeron, G. Gravier, L. Lamel, F. Pellegrino, and P. Perrier, 3468–3472. Lyon: International Speech Communication Association.

  • Sridhar, Vivek Kumar Rangarajan, John Chen, Srinivas Bangalore, Andrej Ljolje, and Rathinavelu Chengalvarayan. 2013b. Segmentation strategies for streaming speech translation. In Proceedings of the 2013 conference of the North American chapter of the Association for Computational Linguistics: Human language technologies (NAACL-HLT), ed. L. Vanderwende, 230–238. Atlanta: Association for Computational Linguistics.

Acknowledgements

Part of this work was supported by JSPS KAKENHI Grant Number 24240032.

Author information

Correspondence to Graham Neubig.

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Neubig, G., Shimizu, H., Sakti, S., Nakamura, S., Toda, T. (2018). The NAIST Simultaneous Translation Corpus. In: Russo, M., Bendazzoli, C., Defrancq, B. (eds) Making Way in Corpus-based Interpreting Studies. New Frontiers in Translation Studies. Springer, Singapore. https://doi.org/10.1007/978-981-10-6199-8_11

  • DOI: https://doi.org/10.1007/978-981-10-6199-8_11

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-6198-1

  • Online ISBN: 978-981-10-6199-8

  • eBook Packages: Social Sciences (R0)
