
CineAD: a system for automated audio description script generation for the visually impaired

Long Paper. Published in Universal Access in the Information Society.

Abstract

Audio description (AD) is an assistive technology that allows visually impaired people to access cinema and follow the story of a movie: the visual content is narrated by a voice during the gaps of silence in the film. Nonetheless, this technology is still not widely used, owing to several factors, among them the high cost and long time involved in producing audio descriptions. To address this problem, this work proposes CineAD, a solution that automatically generates AD scripts for recorded audiovisual content. The solution detects the breaks between spoken lines in the video to be described and generates the descriptions from the original script and subtitles. The generated script can then be fed to a speech synthesizer or read by a human audio description narrator to produce the audio containing the descriptions. To evaluate the proposed solution, qualitative tests with visually impaired users and audio description narrators were conducted. The results show that the solution can generate descriptions of the most important events in a video and can therefore help reduce the barriers to accessing video faced by visually impaired people, provided that the script and subtitles are available.
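
To make the gap-detection step concrete, the sketch below illustrates one way such breaks between spoken lines could be located: it scans the soundtrack in fixed windows, flags low-energy stretches, and keeps only those long enough to hold a narrated description. This is a minimal illustration under assumed parameters, not the authors' implementation; the window length, energy threshold, minimum gap duration, and the synthetic input signal are all placeholders.

```python
import numpy as np

def find_silence_gaps(samples, sample_rate, win_s=0.5,
                      energy_thresh=0.01, min_gap_s=2.0):
    """Return (start, end) times, in seconds, of low-energy stretches
    long enough to hold a narrated description (illustrative only)."""
    win = int(win_s * sample_rate)
    gaps, gap_start = [], None
    for i in range(0, len(samples) - win + 1, win):
        frame = samples[i:i + win].astype(float)
        rms = np.sqrt(np.mean(frame ** 2))      # energy of this window
        if rms < energy_thresh:                 # quiet window
            if gap_start is None:
                gap_start = i / sample_rate
        elif gap_start is not None:             # sound resumes: close the gap
            end = i / sample_rate
            if end - gap_start >= min_gap_s:
                gaps.append((gap_start, end))
            gap_start = None
    return gaps

# Toy check: 10 s of noise with a quiet stretch between 3 s and 6 s.
sr = 16_000
audio = np.random.uniform(-0.5, 0.5, 10 * sr)
audio[3 * sr:6 * sr] *= 0.001
print(find_silence_gaps(audio, sr))             # roughly [(3.0, 6.0)]
```

In CineAD itself, the text placed into such gaps is drawn from the original script and subtitles rather than inferred from the audio, so a step like this only decides where a description can fit, not what it should say.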

Notes

  1. http://www.acb.org/adp/dvdsoverview.html.

  2. https://www.audacityteam.org/.

  3. https://www.steinberg.net/en/products/cubase/start.html.

  4. https://www.celtx.com.

  5. http://lucene.apache.org/.

  6. http://oca.ancine.gov.br/sites/default/files/publicacoes/pdf/anuario_2015.pdf.

  7. The questionnaire can be accessed at this link: https://www.dropbox.com/s/sn2iejtpzapqass/Questionnaire%3AComprehensionTests.pdf?dl=0.

  8. A 1–6 scale was chosen because, according to Morrissey [23], even-numbered scales encourage users to make positive or negative evaluations, avoiding neutral responses. In addition, this scale was also used in other works involving the evaluation of solutions for people with disabilities (e.g., [10, 32]).

References

  1. ANCINE: Brazilian National Cinema Agency (ANCINE) – Regulatory News: accessibility (2015). http://www.ancine.gov.br/sites/default/files/consultas-publicas/Not%C3%ADcia%20Regulat%C3%B3ria%20-%20acessibilidade%20exibicao.pdf. Accessed Dec 2015

  2. Araujo, V.L.S.: O processo de legendagem no Brasil (the subtitling process in Brazil). Revista do GELNE (GELNE Magazine), Fortaleza 1/2, 156–159 (2006)

  3. Benecke, B.: Audio-description. Meta Transl. J. 49(1), 78–80 (2004)

  4. Bojanowski, P., Lajugie, R., Bach, F.R., Laptev, I., Ponce, J., Schmid, C., Sivic, J.: Weakly supervised action labeling in videos under ordering constraints. In: European Conference on Computer Vision (ECCV 2014), Zurich, Switzerland, LNCS vol. 8693 (Part V), pp. 628–643. Springer (2014)

  5. Wohlin, C., Runeson, P., Host, M., Ohlsson, M.C., Regnell, B., Wesslén, A.: Experimentation in Software Engineering: An Introduction. Kluwer Academic Publisher, Norwell, MA, USA (2000)

  6. Chapdelaine, C., Gagnon, L.: Accessible videodescription on-demand. In: Proceedings of the 11th International ACM SIGACCESS Conference on Computers and Accessibility, Assets ’09, pp. 221–222. ACM, New York, NY, USA (2009). https://doi.org/10.1145/1639642.1639685

  7. Chen, X., Zitnick, C.L.: Mind’s eye: a recurrent visual representation for image caption generation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2422–2431. IEEE, Boston, MA (2015)

  8. Cour, T., Sapp, B., Jordan, C., Taskar, B.: Learning from ambiguously labeled images. In: 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20–25 June 2009, Miami, Florida, USA, pp. 919–926 (2009). https://doi.org/10.1109/CVPRW.2009.5206667

  9. Dai, J., Li, Y., He, K., Sun, J.: R-FCN: object detection via region-based fully convolutional networks. In: Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, pp. 379–387 (2016)

  10. De Araújo, T.M.U., Ferreira, F.L.S., Silva, D.A.N.S., Oliveira, L.D., Falcão, E.L., Domingues, L.A., Martins, V.F., Portela, I.A.C., Nóbrega, Y.S., Lima, H.R.G., Souza Filho, G.L., Tavares, T.A., Duarte, A.N.: An approach to generate and embed sign language video tracks into multimedia contents. Inf. Sci. 281, 762–780 (2014). https://doi.org/10.1016/j.ins.2014.04.008

  11. Donahue, J., Hendricks, L.A., Rohrbach, M., Venugopalan, S., Guadarrama, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. IEEE Trans. Pattern Anal. Mach. Intell. 39(4), 677–691 (2017). https://doi.org/10.1109/TPAMI.2016.2599174

  12. Duchenne, O., Laptev, I., Sivic, J., Bach, F.R., Ponce, J.: Automatic annotation of human actions in video. In: 2009 IEEE 12th International Conference on Computer Vision (2009)

  13. Edmundson, H.P.: New methods in automatic extracting. J. ACM 16(2), 264–285 (1969). https://doi.org/10.1145/321510.321519

  14. Encelle, B., Beldame, M.O., Prié, Y.: Towards the usage of pauses in audio-described videos. In: Proceedings of the 10th International Cross-Disciplinary Conference on Web Accessibility, W4A ’13, pp. 31:1–31:4. ACM, New York, NY, USA (2013). https://doi.org/10.1145/2461121.2461130

  15. Fang, H., Gupta, S., Iandola, F.N., Srivastava, R.K., Deng, L., Dollár, P., Gao, J., He, X., Mitchell, M., Platt, J.C., Zitnick, C.L., Zweig, G.: From captions to visual concepts and back. CoRR arXiv:1411.4952 (2014). http://arxiv.org/abs/1411.4952

  16. Fernández-Torné, A.: Audio description and technologies: study on the semi-automatisation of the translation and voicing of audio descriptions. Ph.D. thesis, Universitat Autònoma de Barcelona, Barcelona, Spain (2016)

  17. Giannakopoulos, T.: pyAudioAnalysis: an open-source Python library for audio signal analysis. PLoS ONE 10(12), e0144610 (2015). https://doi.org/10.1371/journal.pone.0144610

  18. Kobayashi, M., Nagano, T., Fukuda, K., Takagi, H.: Describing online videos with text-to-speech narration. In: Proceedings of the 2010 International Cross Disciplinary Conference on Web Accessibility (W4A), W4A ’10, pp. 29:1–29:2. ACM, New York, NY, USA (2010). https://doi.org/10.1145/1805986.1806025

  19. Kobayashi, M., O’Connell, T., Gould, B., Takagi, H., Asakawa, C.: Are synthesized video descriptions acceptable? In: Proceedings of the 12th International ACM SIGACCESS Conference on Computers and Accessibility, ASSETS ’10, pp. 163–170. ACM, New York, NY, USA (2010). https://doi.org/10.1145/1878803.1878833

  20. Lakritz, J., Salway, A.: The semi-automatic generation of audio description from screenplays. Technical report CS-06-05, Dept. of Computing, University of Surrey (2002)

  21. Laptev, I., Marszaek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE, Anchorage, AK (2008). https://doi.org/10.1109/CVPR.2008.4587756

  22. Marszalek, M., Laptev, I., Schmid, C.: Actions in context. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2929–2936. IEEE, Miami, FL (2009). https://doi.org/10.1109/CVPR.2009.5206557

  23. Morrissey, S.: Data-driven machine translation for sign languages. Ph.D. thesis, Dublin City University, Dublin, Ireland (2008)

  24. Nenkova, A., Maskey, S., Liu, Y.: Automatic summarization. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts of ACL 2011, HLT-11, pp. 3:1–3:86. Association for Computational Linguistics, Stroudsburg, PA, USA, Article 3, 86 pp  (2011)

  25. Nunes, E.V., Machado, F.O., Vanzin, T.: Audiodescricao como Tecnologia Assistiva para o Acesso ao Conhecimento por Pessoas Cegas. (Audio description as assistive technology for access to knowledge for the blind). In: Ulbricht, V.R., Vanzin, T., Villarouco, V. (eds.) Ambiente Virtual de Aprendizagem Inclusivo (Inclusive Virtual Learning Environment), p. 352. Pandion, Florianopolis (2011)

  26. Pan, Y., Mei, T., Yao, T., Li, H., Rui, Y.: Jointly modeling embedding and translation to bridge video and language. IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4594–4602. IEEE, Las Vegas, NV (2016)

  27. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6517–6525. IEEE, Honolulu, HI (2017)

  28. Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149  (2017)

  29. Rohrbach, A., Rohrbach, M., Schiele, B.: The long-short story of movie description. In: Gall J., Gehler P., Leibe B. (eds.) Pattern recognition. DAGM 2015. Lecture Notes in Computer Science, vol. 9358. Springer, Cham (2015)

  30. Rohrbach, A., Torabi, A., Rohrbach, M., Tandon, N., Pal, C., Larochelle, H., Courville, A., Schiele, B.: Movie description. Int. J. Comput. Vis. 123, 94–120 (2017). https://doi.org/10.1007/s11263-016-0987-1

  31. Salway, A., Vassiliou, A., Ahmad, K.: What happens in films? In: Proceedings of the IEEE International Conference on Multimedia and Expo, ICME (2005)

  32. San-Segundo, R., Montero, J., Córdoba, R., Sama, V., Fernández, F., D'Haro, L., López-Ludeña, V., Sánchez, D., García, A.: Design, development and field evaluation of a Spanish into sign language translation system. Pattern Anal. Appl. 15, 203–224 (2012)

  33. Szarkowska, A.: Text-to-speech audio description: towards wider availability of AD. J. Spec. Transl. 15, 142–162 (2011)

  34. Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R.J., Darrell, T., Saenko, K.: Sequence to sequence – video to text. In: Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), pp. 4534–4542 (2015)

  35. Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R.J., Saenko, K.: Translating videos to natural language using deep recurrent neural networks. In: Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), Denver, Colorado, USA, pp. 1494–1504, May 31–June 5 (2015)

  36. Wang, K.C., Yang, Y.M., Yang, Y.R.: Speech/music discrimination using hybrid-based feature extraction for audio data indexing. In: 2017 International Conference on System Science and Engineering (ICSSE), pp. 515–519 (2017). https://doi.org/10.1109/ICSSE.2017.8030927

Author information

Correspondence to Tiago M. U. de Araújo.

Cite this article

Campos, V.P., de Araújo, T.M.U., de Souza Filho, G.L. et al. CineAD: a system for automated audio description script generation for the visually impaired. Univ Access Inf Soc 19, 99–111 (2020). https://doi.org/10.1007/s10209-018-0634-4
