Abstract
This chapter describes an English-Japanese/Japanese-English simultaneous interpretation corpus collected at the Nara Institute of Science and Technology (NAIST). Two main features set the corpus apart from others. The first is that it contains recorded interpretations from professional simultaneous interpreters with different amounts of experience. This makes it possible to compare interpreters of different levels and to examine the effect of interpreter experience on the objective and subjective quality of the results. The second feature is that part of the corpus has also been translated. This data makes it possible to compare and contrast the results when a particular talk is translated from text without time constraints (using the translation data) or from speech with time constraints (using the simultaneous interpretation data). The corpus contains a total of 387k words of data, with the material covering lectures and news. All transcriptions are time aligned. The corpus will be helpful for analyzing differences in interpretation styles, and may also be used as a reference in the construction of simultaneous interpretation systems.
This work was performed while all authors were affiliated with the Nara Institute of Science and Technology. This chapter is based on a manuscript in the proceedings of the International Conference on Language Resources and Evaluation (LREC) (Shimizu et al. 2014).
Notes
- 1.
The corpus is available at http://ahclab.naist.jp/resource/stc/.
References
Bangalore, Srinivas, Vivek Kumar Rangarajan Sridhar, Prakash Kolan, Ladan Golipour, and Aura Jimenez. 2012. Real-time incremental speech-to-speech translation of dialogs. In Proceedings of the 2012 conference of the North American chapter of the Association for Computational Linguistics: Human language technologies (NAACL-HLT), ed. T. Chandra. Montreal: Association for Computational Linguistics.
Bendazzoli, Claudio, and Annalisa Sandrelli. 2005. An approach to corpus-based interpreting studies: Developing EPIC (European Parliament Interpreting Corpus). In Proceedings of MuTra 2005—Challenges of multidimensional translation, eds. H. Gerzymisch-Arbogast, and S. Nauert, 149–160. Saarbrücken: Saarland University.
Fügen, Christian, Alex Waibel, and Muntsin Kolss. 2007. Simultaneous translation of lectures and speeches. Machine Translation 21 (4): 209–252.
Fujita, Tomoki, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2013. Simple, lexicalized choice of translation timing for simultaneous speech translation. In Proceedings of the 14th annual conference of the International Speech Communication Association (InterSpeech), eds. F. Bimbot, C. Cerisara, C. Fougeron, G. Gravier, L. Lamel, F. Pellegrino, and P. Perrier, 3487–3491. Lyon: International Speech Communication Association.
Grissom, Alvin, He He, Jordan Boyd-Graber, John Morgan, and Hal Daumé. 2014. Don’t until the final verb wait: Reinforcement learning for simultaneous machine translation. In Proceedings of the 2014 Conference on empirical methods in natural language processing (EMNLP), 1342–1352. Qatar: Association for Computational Linguistics.
Hamon, Olivier, Djamel Mostefa, and Khalid Choukri. 2007. End-to-end evaluation of a speech-to-speech translation system in TC-STAR. In Proceedings of the Machine Translation summit XI, ed. B. Maegaard, s.l. Copenhagen: European Association of Machine Translation.
He, He, Jordan Boyd-Graber, and Hal Daumé. 2016. Interpretese vs. translationese: The uniqueness of human strategies in simultaneous interpretation. In Proceedings of the 2016 conference of the North American chapter of the Association for Computational Linguistics: Human language technologies (NAACL-HLT), ed. K. Knight, 944–952. San Diego: Association for Computational Linguistics.
Isozaki, Hideki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010. Automatic evaluation of translation quality for distant language pairs. In Proceedings of the 2010 conference on empirical methods in natural language processing (EMNLP), eds. H. Li, and L. Márquez, 971–976. Cambridge: Association for Computational Linguistics.
Maekawa, Kikuo. 2003. Corpus of spontaneous Japanese: Its design and evaluation. In Proceedings of the ISCA/IEEE workshop on spontaneous speech, paper MM02. Tokyo: Tokyo Institute of Technology.
Matsubara, Shigeki, Akira Takagi, Nobuo Kawaguchi, and Yasuyoshi Inagaki. 2002. Bilingual spoken monologue corpus for simultaneous machine interpretation research. In Proceedings of the 3rd international conference on language resources and evaluation (LREC), 153–159. Las Palmas: LREC.
Oda, Yusuke, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2014. Optimizing segmentation strategies for simultaneous speech translation. In Proceedings of the 52nd annual meeting of the Association for Computational Linguistics (ACL), ed. D. Marcu, 551–556. Baltimore: Association for Computational Linguistics.
Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics (ACL), ed. I. Pierre, 311–318. Philadelphia: Association for Computational Linguistics.
Paulik, Matthias, and Alex Waibel. 2008. Extracting clues from human interpreter speech for spoken language translation. In Proceedings of the international conference on acoustics, speech, and signal processing (ICASSP), ed. A. Sayed, 5097–5100. Las Vegas: IEEE.
Ryu, Koichiro, Shigeki Matsubara, and Yasuyoshi Inagaki. 2006. Simultaneous English-Japanese spoken language translation based on incremental dependency parsing and transfer. In Proceedings of the 44th annual meeting of the Association for Computational Linguistics (ACL), ed. N. Calzolari, 683–690. Sydney: Association for Computational Linguistics.
Shimizu, Hiroaki, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2014. Collection of a simultaneous translation corpus for comparative analysis. In Proceedings of the 9th international conference on language resources and evaluation (LREC), eds. N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis, 670–673. Reykjavik: LREC.
Snover, Matthew, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the conference of the Association for Machine Translation in the Americas (AMTA), 223–231. Boston: Association for Machine Translation in the Americas.
Sridhar, Vivek Kumar Rangarajan, John Chen, and Srinivas Bangalore. 2013a. Corpus analysis of simultaneous interpretation data for improving real time speech translation. In Proceedings of the 14th annual conference of the International Speech Communication Association (InterSpeech), eds. F. Bimbot, C. Cerisara, C. Fougeron, G. Gravier, L. Lamel, F. Pellegrino, and P. Perrier, 3468–3472. Lyon: International Speech Communication Association.
Sridhar, Vivek Kumar Rangarajan, John Chen, Srinivas Bangalore, Andrej Ljolje, and Rathinavelu Chengalvarayan. 2013b. Segmentation strategies for streaming speech translation. In Proceedings of the 2013 conference of the North American chapter of the Association for Computational Linguistics: Human language technologies (NAACL-HLT), ed. L. Vanderwende, 230–238. Atlanta: Association for Computational Linguistics.
Acknowledgements
Part of this work was supported by JSPS KAKENHI Grant Number 24240032.
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Neubig, G., Shimizu, H., Sakti, S., Nakamura, S., Toda, T. (2018). The NAIST Simultaneous Translation Corpus. In: Russo, M., Bendazzoli, C., Defrancq, B. (eds) Making Way in Corpus-based Interpreting Studies. New Frontiers in Translation Studies. Springer, Singapore. https://doi.org/10.1007/978-981-10-6199-8_11
DOI: https://doi.org/10.1007/978-981-10-6199-8_11
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6198-1
Online ISBN: 978-981-10-6199-8
eBook Packages: Social Sciences (R0)