Abstract
In the real world, visually challenged people face the considerable challenge of understanding the objects and movements in their vicinity. They depend mainly on hearing and touch to recognize what is happening around them; a well-formed verbal description of the surrounding environment is available only when a sighted person is present to provide it. We propose an application that addresses this task by generating descriptions of real-time video captured from a mobile phone camera, aiding the visually challenged in their day-to-day activities. In this paper, we combine concepts from object detection and caption generation and present an approach that enables the model to run on smartphone devices in real time. The descriptions generated for the objects seen in the real-time video are converted to audio output. We train the proposed model on several datasets so that the generated descriptions are accurate. By combining a Convolutional Neural Network with a Recurrent Neural Network, together with our own modifications, we create a new model. We also implement an Android application for visually challenged people to demonstrate the real-life applicability and usefulness of the neural network.
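To make the described pipeline concrete, the following is a minimal sketch of the CNN-encoder/RNN-decoder pattern the abstract refers to, written in PyTorch. The paper's exact architecture, backbone, vocabulary, and modifications are not specified here, so the ResNet-18 backbone, layer sizes, and all names below are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a CNN encoder + RNN decoder for caption generation.
# Assumes PyTorch/torchvision; all architectural choices are illustrative.
import torch
import torch.nn as nn
import torchvision.models as models


class EncoderCNN(nn.Module):
    """Extracts a fixed-size feature vector from a video frame."""
    def __init__(self, embed_size):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        # Drop the classification head; keep the convolutional features.
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        self.fc = nn.Linear(backbone.fc.in_features, embed_size)

    def forward(self, images):            # images: (B, 3, 224, 224)
        with torch.no_grad():             # frozen pretrained backbone
            features = self.cnn(images).flatten(1)
        return self.fc(features)          # (B, embed_size)


class DecoderRNN(nn.Module):
    """Generates a caption, one word at a time, from image features."""
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):   # captions: (B, T) word ids
        # Prepend the image features as the first step of the sequence.
        inputs = torch.cat([features.unsqueeze(1),
                            self.embed(captions)], dim=1)
        hiddens, _ = self.lstm(inputs)
        return self.fc(hiddens)               # (B, T+1, vocab_size) logits
```

In a deployment such as the one the abstract describes, a model of this kind would be exported to a mobile-friendly runtime and its decoded caption passed to the platform's text-to-speech service; the details of that step depend on the target device and are not covered by this sketch.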