Abstract
Audio description (AD) is an assistive technology that allows visually impaired people to access cinema and follow the story of a movie: the visual content is narrated by a voice during gaps of silence in the film's dialogue. Nonetheless, this assistive technology is not widely used, largely because of the high cost and time involved in creating audio descriptions. To address this problem, this work proposes a solution, named CineAD, that automatically generates AD scripts for recorded audiovisual content. The solution detects the breaks between spoken lines in the video and generates the descriptions from the original script and subtitles. The generated script can then be fed to a speech synthesizer or read by an audio description narrator to produce the audio containing the descriptions. To evaluate the proposed solution, qualitative tests were conducted with visually impaired users and audio description narrators. The results show that the solution can generate descriptions of the most important events in a video and can therefore help reduce the barriers visually impaired people face in accessing video, provided the script and subtitles are available.
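The first stage described above — finding gaps of silence between spoken lines where descriptions can be inserted — can be sketched with a simple short-time-energy detector. This is a minimal illustrative sketch, not the authors' implementation (the paper's pipeline relies on audio analysis tooling such as pyAudioAnalysis [24]); the function name, threshold, and frame parameters below are hypothetical choices.

```python
import math

def find_silence_gaps(samples, sample_rate, frame_ms=20,
                      energy_threshold=0.01, min_gap_s=1.0):
    """Return (start_s, end_s) intervals whose frames all fall below the
    energy threshold and that last at least min_gap_s seconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    gaps, gap_start = [], None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len  # short-time energy
        t = i * frame_ms / 1000.0
        if energy < energy_threshold:
            if gap_start is None:
                gap_start = t  # a silent stretch begins at this frame
        else:
            if gap_start is not None and t - gap_start >= min_gap_s:
                gaps.append((gap_start, t))
            gap_start = None
    end_t = n_frames * frame_ms / 1000.0
    if gap_start is not None and end_t - gap_start >= min_gap_s:
        gaps.append((gap_start, end_t))
    return gaps

# Synthetic check: 2 s of a 440 Hz tone, 1.5 s of silence, 1 s of tone.
sr = 8000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(2 * sr)]
samples = tone + [0.0] * int(1.5 * sr) + tone[:sr]
gaps = find_silence_gaps(samples, sr)
print(gaps)  # → [(2.0, 3.5)]
```

In a real system the candidate gaps would additionally be cross-checked against the subtitle timestamps, so that descriptions are only placed where no dialogue is scheduled.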
Notes
The questionnaire can be accessed at this link: https://www.dropbox.com/s/sn2iejtpzapqass/Questionnaire%3AComprehensionTests.pdf?dl=0.
A 1–6 scale was chosen because, according to Morrissey [23], even-point scales (which lack a neutral midpoint) encourage users to commit to a positive or negative evaluation. This scale has also been used in other works that evaluate solutions for people with disabilities (e.g., [10, 32]).
References
ANCINE: Brazilian National Cinema Agency (ANCINE) - Regulatory News: accessibility (2015). http://www.ancine.gov.br/sites/default/files/consultas-publicas/Not%C3%ADcia%20Regulat%C3%B3ria%20-%20acessibilidade%20exibicao.pdf. Accessed Dec 2015
Araujo, V.L.S.: O processo de legendagem no Brasil (the subtitling process in Brazil). Revista do GELNE (GELNE Magazine), Fortaleza 1/2, 156–159 (2006)
Benecke, B.: Audio-description. Meta Transl. J. 49(1), 78–80 (2004)
Bojanowski, P., Lajugie, R., Bach, F.R., Laptev, I., Ponce, J., Schmid, C., Sivic, J.: Weakly supervised action labeling in videos under ordering constraints. In: European Conference on Computer Vision (ECCV 2014), Zurich, Switzerland. Lecture Notes in Computer Science, vol. 8693 (Part V), pp. 628–643. Springer (2014)
Wohlin, C., Runeson, P., Host, M., Ohlsson, M.C., Regnell, B., Wesslén, A.: Experimentation in Software Engineering: An Introduction. Kluwer Academic Publisher, Norwell, MA, USA (2000)
Chapdelaine, C., Gagnon, L.: Accessible videodescription on-demand. In: Proceedings of the 11th International ACM SIGACCESS Conference on Computers and Accessibility, Assets ’09, pp. 221–222. ACM, New York, NY, USA (2009). https://doi.org/10.1145/1639642.1639685
Chen, X., Zitnick, C.L.: Mind’s eye: a recurrent visual representation for image caption generation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2422–2431. IEEE, Boston, MA (2015)
Cour, T., Sapp, B., Jordan, C., Taskar, B.: Learning from ambiguously labeled images. In: 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20–25 June 2009, Miami, Florida, USA, pp. 919–926 (2009). https://doi.org/10.1109/CVPRW.2009.5206667
Dai, J., Li, Y., He, K., Sun, J.: R-FCN: object detection via region-based fully convolutional networks. In: Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16), Barcelona, Spain, pp. 379–387 (2016)
De Araújo, T.M.U., Ferreira, F.L.S., Silva, D.A.N.S., Oliveira, L.D., Falcão, E.L., Domingues, L.A., Martins, V.F., Portela, I.A.C., Nóbrega, Y.S., Lima, H.R.G., Souza Filho, G.L., Tavares, T.A., Duarte, A.N.: An approach to generate and embed sign language video tracks into multimedia contents. Inf. Sci. 281, 762–780 (2014). https://doi.org/10.1016/j.ins.2014.04.008
Donahue, J., Hendricks, L.A., Rohrbach, M., Venugopalan, S., Guadarrama, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. IEEE Trans. Pattern Anal. Mach. Intell. 39(4), 677–691 (2017). https://doi.org/10.1109/TPAMI.2016.2599174
Duchenne, O., Laptev, I., Sivic, J., Bach, F.R., Ponce, J.: Automatic annotation of human actions in video. In: 2009 IEEE 12th International Conference on Computer Vision (2009)
Edmundson, H.P.: New methods in automatic extracting. J. ACM 16(2), 264–285 (1969). https://doi.org/10.1145/321510.321519
Encelle, B., Beldame, M.O., Prié, Y.: Towards the usage of pauses in audio-described videos. In: Proceedings of the 10th International Cross-Disciplinary Conference on Web Accessibility, W4A ’13, pp. 31:1–31:4. ACM, New York, NY, USA (2013). https://doi.org/10.1145/2461121.2461130
Fang, H., Gupta, S., Iandola, F.N., Srivastava, R.K., Deng, L., Dollár, P., Gao, J., He, X., Mitchell, M., Platt, J.C., Zitnick, C.L., Zweig, G.: From captions to visual concepts and back. CoRR (2014). arXiv:1411.4952
Fernández-Torné, A.: Audio description and technologies: study on the semi-automatisation of the translation and voicing of audio descriptions. Ph.D. thesis, Universitat Autònoma de Barcelona, Barcelona, Spain (2016)
Giannakopoulos, T.: pyAudioAnalysis: an open-source python library for audio signal analysis. PloS One 10(12):e0144610 (2015). https://doi.org/10.1371/journal.pone.0144610
Kobayashi, M., Nagano, T., Fukuda, K., Takagi, H.: Describing online videos with text-to-speech narration. In: Proceedings of the 2010 International Cross Disciplinary Conference on Web Accessibility (W4A), W4A ’10, pp. 29:1–29:2. ACM, New York, NY, USA (2010). https://doi.org/10.1145/1805986.1806025
Kobayashi, M., O’Connell, T., Gould, B., Takagi, H., Asakawa, C.: Are synthesized video descriptions acceptable? In: Proceedings of the 12th International ACM SIGACCESS Conference on Computers and Accessibility, ASSETS ’10, pp. 163–170. ACM, New York, NY, USA (2010). https://doi.org/10.1145/1878803.1878833
Lakritz, J., Salway, A.: The semi-automatic generation of audio description from screenplays. Technical report CS-06-05, Dept. of Computing, University of Surrey (2006)
Laptev, I., Marszałek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE, Anchorage, AK (2008). https://doi.org/10.1109/CVPR.2008.4587756
Marszalek, M., Laptev, I., Schmid, C.: Actions in context. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2929–2936. IEEE, Miami, FL (2009). https://doi.org/10.1109/CVPR.2009.5206557
Morrissey, S.: Data-driven machine translation for sign languages. Ph.D. thesis, Dublin City University, Dublin, Ireland (2008)
Nenkova, A., Maskey, S., Liu, Y.: Automatic summarization. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts of ACL 2011, HLT-11, pp. 3:1–3:86. Association for Computational Linguistics, Stroudsburg, PA, USA, Article 3, 86 pp (2011)
Nunes, E.V., Machado, F.O., Vanzin, T.: Audiodescricao como Tecnologia Assistiva para o Acesso ao Conhecimento por Pessoas Cegas. (Audio description as assistive technology for access to knowledge for the blind). In: Ulbricht, V.R., Vanzin, T., Villarouco, V. (eds.) Ambiente Virtual de Aprendizagem Inclusivo (Inclusive Virtual Learning Environment), p. 352. Pandion, Florianopolis (2011)
Pan, Y., Mei, T., Yao, T., Li, H., Rui, Y.: Jointly modeling embedding and translation to bridge video and language. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4594–4602. IEEE, Las Vegas, NV (2016)
Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6517–6525. IEEE, Honolulu, HI (2017)
Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149 (2017)
Rohrbach, A., Rohrbach, M., Schiele, B.: The long-short story of movie description. In: Gall J., Gehler P., Leibe B. (eds.) Pattern recognition. DAGM 2015. Lecture Notes in Computer Science, vol. 9358. Springer, Cham (2015)
Rohrbach, A., Torabi, A., Rohrbach, M., Tandon, N., Pal, C., Larochelle, H., Courville, A., Schiele, B.: Movie description. Int. J. Comput. Vis. 123, 94–120 (2017). https://doi.org/10.1007/s11263-016-0987-1
Salway, A., Vassiliou, A., Ahmad, K.: What happens in films? In: Proceedings of the IEEE International Conference on Multimedia and Expo, ICME (2005)
San-Segundo, R., Montero, J., Córdoba, R., Sama, V., Fernández, F., D'Haro, L., López-Ludeña, V., Sánchez, D., García, A.: Design, development and field evaluation of a Spanish into sign language translation system. Pattern Anal. Appl. 15, 203–224 (2012)
Szarkowska, A.: Text-to-speech audio description: towards wider availability of AD. J. Spec. Transl. 15, 142–162 (2011)
Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R.J., Darrell, T., Saenko, K.: Sequence to sequence - video to text. In: Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), pp. 4534–4542 (2015)
Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R.J., Saenko, K.: Translating videos to natural language using deep recurrent neural networks. In: Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), Denver, Colorado, USA, pp. 1494–1504, May 31–June 5 (2015)
Wang, K.C., Yang, Y.M., Yang, Y.R.: Speech/music discrimination using hybrid-based feature extraction for audio data indexing. In: 2017 International Conference on System Science and Engineering (ICSSE), pp. 515–519 (2017). https://doi.org/10.1109/ICSSE.2017.8030927
Campos, V.P., de Araújo, T.M.U., de Souza Filho, G.L. et al. CineAD: a system for automated audio description script generation for the visually impaired. Univ Access Inf Soc 19, 99–111 (2020). https://doi.org/10.1007/s10209-018-0634-4