HAVRUS Corpus: High-Speed Recordings of Audio-Visual Russian Speech

Verkhodanova, Vasilisa; Ronzhin, Alexander; Kipyatkova, Irina; Ivanko, Denis; Karpov, Alexey; Železný, Miloš

doi:10.1007/978-3-319-43958-7_40

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9811)

Included in the following conference series:

International Conference on Speech and Computer

Abstract

In this paper we present a hardware-software system for collecting audio-visual speech databases with a high-speed camera and a dynamic microphone. We describe the architecture of the developed software and give some details of HAVRUS, the collected database of audio-visual Russian speech. The software synchronizes and fuses the audio and video channels, and it accounts for a natural property of human speech: the asynchrony between the audio and visual speech modalities. The corpus comprises recordings of 20 native speakers of Russian and is intended for further research and experiments on audio-visual Russian speech recognition.
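The abstract does not say how the two channels are aligned. As a rough illustration of what frame-level synchronization involves, the following Python sketch maps each video frame to the range of audio samples recorded during it, and applies an optional constant offset as a crude stand-in for audio-visual asynchrony. The names `StreamRates` and `audio_span_for_frame`, the 200 fps frame rate, and the 48 kHz sampling rate are hypothetical values chosen for the example, not parameters reported for HAVRUS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamRates:
    """Capture rates of the two channels (hypothetical example values)."""
    video_fps: float    # frames per second of the high-speed camera (assumed)
    audio_rate_hz: int  # microphone sampling rate in Hz (assumed)


def audio_span_for_frame(frame_idx: int, rates: StreamRates,
                         av_offset_s: float = 0.0) -> tuple[int, int]:
    """Return the [start, end) audio sample indices covered by one video frame.

    Frame k occupies [k / fps, (k + 1) / fps) seconds on the video clock.
    av_offset_s is an illustrative constant shift added to those times before
    converting them to audio sample indices; real audio-visual asynchrony
    varies over time and is not modelled here.
    """
    if frame_idx < 0:
        raise ValueError("frame index must be non-negative")
    t_start = frame_idx / rates.video_fps + av_offset_s
    t_end = (frame_idx + 1) / rates.video_fps + av_offset_s
    # Convert seconds to the nearest sample index, clamped at the recording start.
    start = max(0, round(t_start * rates.audio_rate_hz))
    end = max(0, round(t_end * rates.audio_rate_hz))
    return start, end


if __name__ == "__main__":
    rates = StreamRates(video_fps=200.0, audio_rate_hz=48_000)
    print(audio_span_for_frame(10, rates))  # (2400, 2640)
```

With these example rates each frame covers 48000 / 200 = 240 audio samples; a setup with other rates only changes the `StreamRates` values.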

Acknowledgments

This research is financially supported by the Ministry of Education and Science of the Russian Federation, agreement No 14.616.21.0056 (reference RFMEFI61615X0056), project “Research and development of audio-visual speech recognition system based on a microphone and a high-speed camera”, as well as by the Czech Ministry of Education, Youth and Sports, project No LO1506.

Author information

Authors and Affiliations

SPIIRAS, St. Petersburg, Russia
Vasilisa Verkhodanova, Alexander Ronzhin, Irina Kipyatkova, Denis Ivanko & Alexey Karpov
University of West Bohemia, Pilsen, Czech Republic
Miloš Železný

Corresponding author

Correspondence to Vasilisa Verkhodanova.

Editor information

Editors and Affiliations

SPIIRAS, Saint-Petersburg, Russia
Andrey Ronzhin
Moscow State Linguistic University, Moscow, Russia
Rodmonga Potapova
Budapest University of Technology and Economics, Budapest, Hungary
Géza Németh

About this paper

Cite this paper

Verkhodanova, V., Ronzhin, A., Kipyatkova, I., Ivanko, D., Karpov, A., Železný, M. (2016). HAVRUS Corpus: High-Speed Recordings of Audio-Visual Russian Speech. In: Ronzhin, A., Potapova, R., Németh, G. (eds) Speech and Computer. SPECOM 2016. Lecture Notes in Computer Science, vol 9811. Springer, Cham. https://doi.org/10.1007/978-3-319-43958-7_40

DOI: https://doi.org/10.1007/978-3-319-43958-7_40
Published: 13 August 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43957-0
Online ISBN: 978-3-319-43958-7
eBook Packages: Computer Science (R0)
