Abstract
Audio description (AD) is an assistive technology that allows visually impaired people to access cinema and follow the story of a movie: the visual content is narrated by a voice during gaps of silence in the film's dialogue. Nonetheless, this assistive technology is not widely used, largely because of the high cost and time involved in creating audio descriptions. To address this problem, this work proposes a solution, named CineAD, that automatically generates AD scripts for recorded audiovisual content. The solution detects the breaks between spoken lines in the video and generates the descriptions from the original script and subtitles. The generated script can then be fed to a speech synthesizer or read by an audio description narrator to produce the audio containing the descriptions. To evaluate the proposed solution, qualitative tests were conducted with visually impaired users and audio description narrators. The results show that the solution can generate descriptions of the most important events in a video and can therefore help reduce the barriers visually impaired people face in accessing video, provided the script and subtitles are available.
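The first stage described above — finding gaps of silence between spoken lines where descriptions can be inserted — can be sketched with a simple short-time-energy detector. This is a minimal illustrative sketch, not the authors' implementation (the paper's pipeline relies on audio analysis tooling such as pyAudioAnalysis [24]); the function name, threshold, and frame parameters below are hypothetical choices.

```python
import math

def find_silence_gaps(samples, sample_rate, frame_ms=20,
                      energy_threshold=0.01, min_gap_s=1.0):
    """Return (start_s, end_s) intervals whose frames all fall below the
    energy threshold and that last at least min_gap_s seconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    gaps, gap_start = [], None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len  # short-time energy
        t = i * frame_ms / 1000.0
        if energy < energy_threshold:
            if gap_start is None:
                gap_start = t  # a silent stretch begins at this frame
        else:
            if gap_start is not None and t - gap_start >= min_gap_s:
                gaps.append((gap_start, t))
            gap_start = None
    end_t = n_frames * frame_ms / 1000.0
    if gap_start is not None and end_t - gap_start >= min_gap_s:
        gaps.append((gap_start, end_t))
    return gaps

# Synthetic check: 2 s of a 440 Hz tone, 1.5 s of silence, 1 s of tone.
sr = 8000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(2 * sr)]
samples = tone + [0.0] * int(1.5 * sr) + tone[:sr]
gaps = find_silence_gaps(samples, sr)
print(gaps)  # → [(2.0, 3.5)]
```

In a real system the candidate gaps would additionally be cross-checked against the subtitle timestamps, so that descriptions are only placed where no dialogue is scheduled.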
Notes
The questionnaire can be accessed at this link: https://www.dropbox.com/s/sn2iejtpzapqass/Questionnaire%3AComprehensionTests.pdf?dl=0.
A 1–6 scale was chosen because, according to Morrissey [23], even-point scales (which lack a neutral midpoint) encourage users to commit to a positive or negative evaluation. This scale has also been used in other works that evaluate solutions for people with disabilities (e.g., [10, 32]).
References
ANCINE: Brazilian National Cinema Agency (ANCINE) - Regulatory News: accessibility (2015). http://www.ancine.gov.br/sites/default/files/consultas-publicas/Not%C3%ADcia%20Regulat%C3%B3ria%20-%20acessibilidade%20exibicao.pdf. Accessed Dec 2015
Araujo, V.L.S.: O processo de legendagem no Brasil (the subtitling process in Brazil). Revista do GELNE (GELNE Magazine), Fortaleza 1/2, 156–159 (2006)
Benecke, B.: Audio-description. Meta Transl. J. 49(1), 78–80 (2004)
Bojanowski, P., Lajugie, R., Bach, F.R., Laptev, I., Ponce, J., Schmid, C., Sivic, J.: Weakly supervised action labeling in videos under ordering constraints. In: European Conference on Computer Vision (ECCV 2014), Zurich, Switzerland. Lecture Notes in Computer Science, vol. 8693 (Part V), pp. 628–643. Springer (2014)
Wohlin, C., Runeson, P., Host, M., Ohlsson, M.C., Regnell, B., Wesslén, A.: Experimentation in Software Engineering: An Introduction. Kluwer Academic Publisher, Norwell, MA, USA (2000)
Chapdelaine, C., Gagnon, L.: Accessible videodescription on-demand. In: Proceedings of the 11th International ACM SIGACCESS Conference on Computers and Accessibility, Assets ’09, pp. 221–222. ACM, New York, NY, USA (2009). https://doi.org/10.1145/1639642.1639685
Chen, X., Zitnick, C.L.: Mind’s eye: a recurrent visual representation for image caption generation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2422–2431. IEEE, Boston, MA (2015)
Cour, T., Sapp, B., Jordan, C., Taskar, B.: Learning from ambiguously labeled images. In: 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20–25 June 2009, Miami, Florida, USA, pp. 919–926 (2009). https://doi.org/10.1109/CVPRW.2009.5206667
Dai, J., Li, Y., He, K., Sun, J.: R-FCN: object detection via region-based fully convolutional networks. In: Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16), Barcelona, Spain, pp. 379–387 (2016)
De Araújo, T.M.U., Ferreira, F.L.S., Silva, D.A.N.S., Oliveira, L.D., Falcão, E.L., Domingues, L.A., Martins, V.F., Portela, I.A.C., Nóbrega, Y.S., Lima, H.R.G., Souza Filho, G.L., Tavares, T.A., Duarte, A.N.: An approach to generate and embed sign language video tracks into multimedia contents. Inf. Sci. 281, 762–780 (2014). https://doi.org/10.1016/j.ins.2014.04.008
Donahue, J., Hendricks, L.A., Rohrbach, M., Venugopalan, S., Guadarrama, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. IEEE Trans. Pattern Anal. Mach. Intell. 39(4), 677–691 (2017). https://doi.org/10.1109/TPAMI.2016.2599174
Duchenne, O., Laptev, I., Sivic, J., Bach, F.R., Ponce, J.: Automatic annotation of human actions in video. In: 2009 IEEE 12th International Conference on Computer Vision (2009)
Edmundson, H.P.: New methods in automatic extracting. J. ACM 16(2), 264–285 (1969). https://doi.org/10.1145/321510.321519
Encelle, B., Beldame, M.O., Prié, Y.: Towards the usage of pauses in audio-described videos. In: Proceedings of the 10th International Cross-Disciplinary Conference on Web Accessibility, W4A ’13, pp. 31:1–31:4. ACM, New York, NY, USA (2013). https://doi.org/10.1145/2461121.2461130
Fang, H., Gupta, S., Iandola, F.N., Srivastava, R.K., Deng, L., Dollár, P., Gao, J., He, X., Mitchell, M., Platt, J.C., Zitnick, C.L., Zweig, G.: From captions to visual concepts and back. CoRR (2014). arXiv:1411.4952
Fernández-Torné, A.: Audio description and technologies: study on the semi-automatisation of the translation and voicing of audio descriptions. Ph.D. thesis, Universitat Autònoma de Barcelona, Barcelona, Spain (2016)
Giannakopoulos, T.: pyAudioAnalysis: an open-source python library for audio signal analysis. PloS One 10(12):e0144610 (2015). https://doi.org/10.1371/journal.pone.0144610
Kobayashi, M., Nagano, T., Fukuda, K., Takagi, H.: Describing online videos with text-to-speech narration. In: Proceedings of the 2010 International Cross Disciplinary Conference on Web Accessibility (W4A), W4A ’10, pp. 29:1–29:2. ACM, New York, NY, USA (2010). https://doi.org/10.1145/1805986.1806025
Kobayashi, M., O’Connell, T., Gould, B., Takagi, H., Asakawa, C.: Are synthesized video descriptions acceptable? In: Proceedings of the 12th International ACM SIGACCESS Conference on Computers and Accessibility, ASSETS ’10, pp. 163–170. ACM, New York, NY, USA (2010). https://doi.org/10.1145/1878803.1878833
Lakritz, J., Salway, A.: The semi-automatic generation of audio description from screenplays. Technical report CS-06-05, Dept. of Computing, University of Surrey (2006)
Laptev, I., Marszałek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE, Anchorage, AK (2008). https://doi.org/10.1109/CVPR.2008.4587756
Marszalek, M., Laptev, I., Schmid, C.: Actions in context. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2929–2936. IEEE, Miami, FL (2009). https://doi.org/10.1109/CVPR.2009.5206557
Morrissey, S.: Data-driven machine translation for sign languages. Ph.D. thesis, Dublin City University, Dublin, Ireland (2008)
Nenkova, A., Maskey, S., Liu, Y.: Automatic summarization. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts of ACL 2011, HLT-11, pp. 3:1–3:86. Association for Computational Linguistics, Stroudsburg, PA, USA, Article 3, 86 pp (2011)
Nunes, E.V., Machado, F.O., Vanzin, T.: Audiodescricao como Tecnologia Assistiva para o Acesso ao Conhecimento por Pessoas Cegas. (Audio description as assistive technology for access to knowledge for the blind). In: Ulbricht, V.R., Vanzin, T., Villarouco, V. (eds.) Ambiente Virtual de Aprendizagem Inclusivo (Inclusive Virtual Learning Environment), p. 352. Pandion, Florianopolis (2011)
Pan, Y., Mei, T., Yao, T., Li, H., Rui, Y.: Jointly modeling embedding and translation to bridge video and language. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4594–4602. IEEE, Las Vegas, NV (2016)
Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6517–6525. IEEE, Honolulu, HI (2017)
Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149 (2017)
Rohrbach, A., Rohrbach, M., Schiele, B.: The long-short story of movie description. In: Gall J., Gehler P., Leibe B. (eds.) Pattern recognition. DAGM 2015. Lecture Notes in Computer Science, vol. 9358. Springer, Cham (2015)
Rohrbach, A., Torabi, A., Rohrbach, M., Tandon, N., Pal, C., Larochelle, H., Courville, A., Schiele, B.: Movie description. Int. J. Comput. Vis. 123, 94–120 (2017). https://doi.org/10.1007/s11263-016-0987-1
Salway, A., Vassiliou, A., Ahmad, K.: What happens in films? In: Proceedings of the IEEE International Conference on Multimedia and Expo, ICME (2005)
San-Segundo, R., Montero, J., Córdoba, R., Sama, V., Fernández, F., D'Haro, L., López-Ludeña, V., Sánchez, D., García, A.: Design, development and field evaluation of a Spanish into sign language translation system. Pattern Anal. Appl. 15, 203–224 (2012)
Szarkowska, A.: Text-to-speech audio description: towards wider availability of AD. J. Spec. Transl. 15, 142–162 (2011)
Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R.J., Darrell, T., Saenko, K.: Sequence to sequence - video to text. In: Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), pp. 4534–4542 (2015)
Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R.J., Saenko, K.: Translating videos to natural language using deep recurrent neural networks. In: Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), Denver, Colorado, USA, pp. 1494–1504, May 31–June 5 (2015)
Wang, K.C., Yang, Y.M., Yang, Y.R.: Speech/music discrimination using hybrid-based feature extraction for audio data indexing. In: 2017 International Conference on System Science and Engineering (ICSSE), pp. 515–519 (2017). https://doi.org/10.1109/ICSSE.2017.8030927
Campos, V.P., de Araújo, T.M.U., de Souza Filho, G.L. et al. CineAD: a system for automated audio description script generation for the visually impaired. Univ Access Inf Soc 19, 99–111 (2020). https://doi.org/10.1007/s10209-018-0634-4