The NAIST Simultaneous Translation Corpus

Neubig, Graham; Shimizu, Hiroaki; Sakti, Sakriani; Nakamura, Satoshi; Toda, Tomoki

doi:10.1007/978-981-10-6199-8_11

Chapter
First Online: 25 October 2017

Part of the book series: New Frontiers in Translation Studies (NFTS)

Abstract

This chapter describes an English-Japanese/Japanese-English simultaneous interpretation corpus collected at the Nara Institute of Science and Technology (NAIST). Two main features set the corpus apart from others. First, it contains recorded interpretations by professional simultaneous interpreters with different amounts of experience. This makes it possible to compare interpreters of different levels and to elucidate the effect of interpreter experience on the objective and subjective quality of the results. Second, part of the corpus has also been translated. This makes it possible to compare and contrast the results when a particular talk is translated from text without time constraints (using the translation data) or interpreted from speech with time constraints (using the simultaneous interpretation data). The corpus contains a total of 387k words of data, covering lectures and news. All transcriptions are time-aligned. The corpus will be helpful for analyzing differences in interpretation styles, and may also serve as a reference in the construction of simultaneous interpretation systems.

This work was performed while all authors were affiliated with the Nara Institute of Science and Technology. This chapter is based on a manuscript in the proceedings of the International Conference on Language Resources and Evaluation (LREC) (Shimizu et al. 2014).

Notes

1. The corpus is available at http://ahclab.naist.jp/resource/stc/.
2. http://www.ted.com.
3. http://cnnradio.cnn.com/.
4. http://www3.nhk.or.jp/.

References

Bangalore, Srinivas, Vivek Kumar Rangarajan Sridhar, Prakash Kolan, Ladan Golipour, and Aura Jimenez. 2012. Real-time incremental speech-to-speech translation of dialogs. In Proceedings of the 2012 conference of the North American chapter of the Association for Computational Linguistics: Human language technologies (NAACL-HLT), ed. T. Chandra. Montreal: Association for Computational Linguistics.
Bendazzoli, Claudio, and Annalisa Sandrelli. 2005. An approach to corpus-based interpreting studies: Developing EPIC (European Parliament Interpreting Corpus). In Proceedings of MuTra 2005: Challenges of multidimensional translation, eds. H. Gerzymisch-Arbogast, and S. Nauert, 149–160. Saarbrücken: Saarland University.
Fügen, Christian, Alex Waibel, and Muntsin Kolss. 2007. Simultaneous translation of lectures and speeches. Machine Translation 21 (4): 209–252.
Fujita, Tomoki, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2013. Simple, lexicalized choice of translation timing for simultaneous speech translation. In Proceedings of the 14th annual conference of the International Speech Communication Association (InterSpeech), eds. F. Bimbot, C. Cerisara, C. Fougeron, G. Gravier, L. Lamel, F. Pellegrino, and P. Perrier, 3487–3491. Lyon: International Speech Communication Association.
Grissom, Alvin, He He, Jordan Boyd-Graber, John Morgan, and Hal Daumé. 2014. Don’t until the final verb wait: Reinforcement learning for simultaneous machine translation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 1342–1352. Qatar: Association for Computational Linguistics.
Hamon, Olivier, Djamel Mostefa, and Khalid Choukri. 2007. End-to-end evaluation of a speech-to-speech translation system in TC-STAR. In Proceedings of the Machine Translation summit XI, ed. B. Maegaard. Copenhagen: European Association for Machine Translation.
He, He, Jordan Boyd-Graber, and Hal Daumé. 2016. Interpretese vs. translationese: The uniqueness of human strategies in simultaneous interpretation. In Proceedings of the 2016 conference of the North American chapter of the Association for Computational Linguistics: Human language technologies (NAACL-HLT), ed. K. Knight, 944–952. San Diego: Association for Computational Linguistics.
Isozaki, Hideki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010. Automatic evaluation of translation quality for distant language pairs. In Proceedings of the 2010 conference on empirical methods in natural language processing (EMNLP), eds. H. Li, and L. Márquez, 971–976. Cambridge: Association for Computational Linguistics.
Maekawa, Kikuo. 2003. Corpus of spontaneous Japanese: Its design and evaluation. In Proceedings of the ISCA/IEEE workshop on spontaneous speech, paper MM02. Tokyo: Tokyo Institute of Technology.
Matsubara, Shigeki, Akira Takagi, Nobuo Kawaguchi, and Yasuyoshi Inagaki. 2002. Bilingual spoken monologue corpus for simultaneous machine interpretation research. In Proceedings of the 3rd international conference on language resources and evaluation (LREC), 153–159. Las Palmas: LREC.
Oda, Yusuke, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2014. Optimizing segmentation strategies for simultaneous speech translation. In Proceedings of the 52nd annual meeting of the Association for Computational Linguistics (ACL), ed. D. Marcu, 551–556. Baltimore: Association for Computational Linguistics.
Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics (ACL), ed. I. Pierre, 311–318. Philadelphia: Association for Computational Linguistics.
Paulik, Matthias, and Alex Waibel. 2008. Extracting clues from human interpreter speech for spoken language translation. In Proceedings of the international conference on acoustics, speech, and signal processing (ICASSP), ed. A. Sayed, 5097–5100. Las Vegas: IEEE.
Ryu, Koichiro, Shigeki Matsubara, and Yasuyoshi Inagaki. 2006. Simultaneous English-Japanese spoken language translation based on incremental dependency parsing and transfer. In Proceedings of the 44th annual meeting of the Association for Computational Linguistics (ACL), ed. N. Calzolari, 683–690. Sydney: Association for Computational Linguistics.
Shimizu, Hiroaki, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2014. Collection of a simultaneous translation corpus for comparative analysis. In Proceedings of the 9th international conference on language resources and evaluation (LREC), eds. N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis, 670–673. Reykjavik: LREC.
Snover, Matthew, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the conference of the Association for Machine Translation in the Americas (AMTA), 223–231. Boston: Association for Machine Translation in the Americas.
Sridhar, Vivek Kumar Rangarajan, John Chen, and Srinivas Bangalore. 2013a. Corpus analysis of simultaneous interpretation data for improving real time speech translation. In Proceedings of the 14th annual conference of the International Speech Communication Association (InterSpeech), eds. F. Bimbot, C. Cerisara, C. Fougeron, G. Gravier, L. Lamel, F. Pellegrino, and P. Perrier, 3468–3472. Lyon: International Speech Communication Association.
Sridhar, Vivek Kumar Rangarajan, John Chen, Srinivas Bangalore, Andrej Ljolje, and Rathinavelu Chengalvarayan. 2013b. Segmentation strategies for streaming speech translation. In Proceedings of the 2013 conference of the North American chapter of the Association for Computational Linguistics: Human language technologies (NAACL-HLT), ed. L. Vanderwende, 230–238. Atlanta: Association for Computational Linguistics.

Acknowledgements

Part of this work was supported by JSPS KAKENHI Grant Number 24240032.

Author information

Authors and Affiliations

Carnegie Mellon University, Pittsburgh, United States of America
Graham Neubig
Fuji Xerox, Yokohama, Japan
Hiroaki Shimizu
Nara Institute of Science and Technology, Ikoma, Japan
Sakriani Sakti & Satoshi Nakamura
Nagoya University, Nagoya, Japan
Tomoki Toda

Corresponding author

Correspondence to Graham Neubig.

Editor information

Editors and Affiliations

Department of Interpreting and Translation, University of Bologna, Forlì, Italy
Mariachiara Russo
Department of Economic and Social Studies, Mathematics and Statistics, University of Turin, Torino, Italy
Claudio Bendazzoli
Department of Translation, Interpreting and Communication, Ghent University, Ghent, Belgium
Bart Defrancq

About this chapter

Cite this chapter

Neubig, G., Shimizu, H., Sakti, S., Nakamura, S., Toda, T. (2018). The NAIST Simultaneous Translation Corpus. In: Russo, M., Bendazzoli, C., Defrancq, B. (eds) Making Way in Corpus-based Interpreting Studies. New Frontiers in Translation Studies. Springer, Singapore. https://doi.org/10.1007/978-981-10-6199-8_11

DOI: https://doi.org/10.1007/978-981-10-6199-8_11
Published: 25 October 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6198-1
Online ISBN: 978-981-10-6199-8
eBook Packages: Social Sciences, Social Sciences (R0)