A sketch is worth a thousand navigational instructions


Is it possible for a robot to navigate an unknown area without the ability to understand verbal instructions? This work proposes the use of pictorial cues (hand drawn sketches) to assist navigation in scenarios where verbal instructions seem less practical. These scenarios include verbal instructions referring to novel objects or complex instructions describing fine details. Furthermore, there are patterns (textures, languages) which are difficult to describe verbally. Given a single sketch, our novel “draw in 2D and match in 3D” algorithm spots the desired content under large view variations. We show that off-the-shelf deep features, for sketch matching, have limited view point invariance. Additionally, this work exposes the challenges of using the scene text as a pictorial cue. We propose a novel strategy to overcome these challenges across multiple languages. Our “just draw it” method overcomes the language understanding barrier. We show that sketch based text spotting works, without alteration, for arbitrary font shapes, which standard text detectors find hard to spot. Even in case of custom made text detector (for arbitrary shaped fonts), sketch based text spotting demonstrates complimentary performance. We provide extensive evaluation on public datasets. We also provide a fine grained dataset “Crossroads” which includes tough scenarios for generating navigational instructions. Finally we demonstrate the performance of our view invariant sketch detectors in robotic navigation scenarios using MINOS simulator which contains reconstructed indoor environments.

This is a preview of subscription content, access via your institution.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16
Fig. 17
Fig. 18
Fig. 19
Fig. 20
Fig. 21


  1. 1.


  2. 2.

    \({\dagger }\) will be made public.


  1. Ammirato, P., Poirson, P., Park, E., Košecká, J., & Berg, A. C. (2017). A dataset for developing and benchmarking active vision. In ICRA. IEEE

  2. Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., & van den Hengel, A. (2018). Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In CVPR.

  3. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., & Parikh, D. (2015). VQA: Visual question answering. In ICCV.

  4. Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

  5. Bansal, A., Russell, B., & Gupta, A. (2016). Marr revisited: 2d–3d alignment via surface normal prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5965–5974).

  6. Boniardi, F., Valada, A., Burgard, W., & Tipaldi, G. D. (2016). Autonomous indoor robot navigation using a sketch interface for drawing maps and routes. In ICRA.

  7. Busta, M., Neumann, L., & Matas, J. (2017). Deep textspotter: An end-to-end trainable scene text localization and recognition framework. In ICCV.

  8. Chen, D. L., & Mooney, R. J. (2011). Learning to interpret natural language navigation instructions from observations. In AAAI.

  9. Chen, X., Fang, H., Lin, T. Y., Vedantam, R., Gupta, S., Dollár, P., & Zitnick, C. L. (2015). Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.

  10. Chen, X., Shrivastava, A., & Gupta, A. (2013). Neil: Extracting visual knowledge from web data. In ICCV. IEEE

  11. Cherubini, A., Spindler, F., & Chaumette, F. (2014). Autonomous visual navigation and laser-based moving obstacle avoidance. IEEE Transactions on Intelligent Transportation Systems, 15(5), 2101–2110.

    Article  Google Scholar 

  12. Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., & Vedaldi, A. (2014). Describing textures in the wild. In CVPR.

  13. Coronado, E., Villalobos, J., Bruno, B., & Mastrogiovanni, F. (2017). Gesture-based robot control: Design challenges and evaluation with humans. In ICRA. IEEE

  14. Costante, G., Forster, C., Delmerico, J., Valigi, P., & Scaramuzza, D. (2016). Perception-aware path planning. arXiv preprint arXiv:1605.04151.

  15. Das, A., Kottur, S., Gupta, K., Singh, A., Yadav, D., Moura, J. M., Parikh, D., & Batra, D. (2017). Visual dialog. In CVPR.

  16. Doumanoglou, A., Kouskouridas, R., Malassiotis, S., & Kim, T. K. (2016). Recovering 6d object pose and predicting next-best-view in the crowd. In CVPR.

  17. Flint, A., Murray, D., & Reid, I. (2011). Manhattan scene understanding using monocular, stereo, and 3d features. In ICCV. IEEE

  18. Furlan, A., Miller, S. D., Sorrenti, D. G., Li, F. F., Savarese, S. (2013). Free your camera: 3d indoor scene understanding from arbitrary camera motion. In BMVC.

  19. Gupta, A., & Davis, L. S. (2008). Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. In ECCV. Springer.

  20. Gupta, S., Davidson, J., Levine, S., Sukthankar, R., & Malik, J. (2017). Cognitive mapping and planning for visual navigation. In CVPR.

  21. Hartley, R., & Zisserman, A. (2003). Multiple view geometry in computer vision. Cambridge: Cambridge University Press.

    Google Scholar 

  22. Hedau, V., Hoiem, D., & Forsyth, D. (2009). Recovering the spatial layout of cluttered rooms. In ICCV. IEEE

  23. Heitz, G., & Koller, D. (2008). Learning spatial context: Using stuff to find things. In ECCV.

  24. Hemachandra, S., Duvallet, F., Howard, T. M., Roy, N., Stentz, A., & Walter, M. R. (2015). Learning models for following natural language directions in unknown environments. In ICRA. IEEE

  25. Hussain, W., Civera, J., Montano, L., & Hebert, M. (2016). Dealing with small data and training blind spots in the manhattan world. In WACV. IEEE

  26. Khosla, A., An An, B., Lim, J. J., & Torralba, A. (2014). Looking beyond the visible scene. In CVPR.

  27. Kong, C., Lin, D., Bansal, M., Urtasun, R., & Fidler, S. (2014). What are you talking about? text-to-image coreference. In CVPR.

  28. Lam, O., Dayoub, F., Schulz, R., & Corke, P. (2015). Automated topometric graph generation from floor plan analysis. In ACRA.

  29. Lee, D. C., Hebert, M., & Kanade, T. (2009). Geometric reasoning for single image structure recovery. In 2009 IEEE conference on computer vision and pattern recognition (pp. 2136–2143). IEEE.

  30. Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., et al. (2014). Microsoft coco: Common objects in context. In ECCV. Springer.

  31. Liu, C., Schwing, A. G., Kundu, K., Urtasun, R., & Fidler, S. (2015). Rent3d: Floor-plan priors for monocular layout estimation. In CVPR.

  32. Liu, C., Wu, J., & Furukawa, Y. (2018). Floornet: A unified framework for floorplan reconstruction from 3d scans. In ECCV. Springer.

  33. Liu, Y., Jin, L., Zhang, S., Luo, C., & Zhang, S. (2019). Curved scene text detection via transverse and longitudinal sequence connection. Pattern Recognition, 90, 337–345.

    Article  Google Scholar 

  34. MacMahon, M., Stankiewicz, B., & Kuipers, B. (2006). Walk the talk: Connecting language, knowledge, and action in route instructions. In AAAI.

  35. Matuszek, C., Fox, D., Koscher, K. (2010). Following directions using statistical machine translation. In 2010 5th ACM/IEEE international conference on human–robot interaction (HRI). IEEE.

  36. Mishra, A., Alahari, K., & Jawahar, C. (2012). Top-down and bottom-up cues for scene text recognition. In CVPR.

  37. Nabbe, B., Hoiem, D., Efros, A. A., & Hebert, M. (2006). Opportunistic use of vision to push back the path-planning horizon. In IROS. IEEE

  38. Quattoni, A., & Torralba, A. (2009). Recognizing indoor scenes. In CVPR.

  39. Quy Phan, T., Shivakumara, P., Tian, S., & Lim Tan, C. (2013). Recognizing text with perspective distortion in natural scenes. In ICCV.

  40. Redmon, J., & Farhadi, A. (2017). Yolo9000: Better, faster, stronger. In CVPR.

  41. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). Imagenet large scale visual recognition challenge. IJCV, 115(3), 211–252.

    MathSciNet  Article  Google Scholar 

  42. Salas, M., Hussain, W., Concha, A., Montano, L., Civera, J., & Montiel, J. (2015). Layout aware visual tracking and mapping. In IROS. IEEE

  43. Sangkloy, P., Burnell, N., Ham, C., & Hays, J. (2016). The sketchy database: Learning to retrieve badly drawn bunnies. SIGGRAPH

  44. Savva, M., Chang, A. X., Dosovitskiy, A., Funkhouser, T., & Koltun, V. (2017). Minos: Multimodal indoor simulator for navigation in complex environments. arXiv preprint arXiv:1712.03931.

  45. Sharif Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). Cnn features off-the-shelf: an astounding baseline for recognition. In CVPR workshops.

  46. Shrivastava, A., Malisiewicz, T., Gupta, A., & Efros, A. A. (2011). Data-driven visual similarity for cross-domain image matching. In ACM transactions on graphics (Vol. 30, p. 154). ACM.

  47. Skubic, M., Blisard, S., Carle, A., & Matsakis, P. (2002). Hand-drawn maps for robot navigation. In AAAI.

  48. Tellex, S., Kollar, T., Dickerson, S., Walter, M. R., Banerjee, A. G., Teller, S. J., & Roy, N. (2011). Understanding natural language commands for robotic navigation and mobile manipulation. In AAAI.

  49. Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The caltech-ucsd birds-200-2011 dataset.

  50. Wang, S., Fidler, S., & Urtasun, R. (2015). Lost shopping! monocular localization in large indoor spaces. In ICCV.

  51. Winograd, T. (1971). Procedures as a representation for data in a computer program for understanding natural language. Massachusetts Institute of Tech Cambridge Project MAC, Technical report.

  52. Xu, K., Chen, K., Fu, H., Sun, W. L., & Hu, S. M. (2013). Sketch2scene: Sketch-based co-retrieval and co-placement of 3d models. ACM Transactions on Graphics (TOG), 32(4), 1–15.

    Article  Google Scholar 

  53. Yamauchi, B. (1997). A frontier-based approach for autonomous exploration. In 1997 IEEE international symposium on computational intelligence in robotics and automation, 1997. CIRA’97, Proceedings. IEEE.

  54. Yuliang, L., Lianwen, J., Shuaitao, Z., & Sheng, Z. (2017). Detecting curve text in the wild: New dataset and new solution. arXiv preprint arXiv:1712.02170.

  55. Zhu, Y., Mottaghi, R., Kolve, E., Lim, J. J., Gupta, A., Fei-Fei, L., et al. (2017). Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA. IEEE

  56. Zitnick, C. L., & Parikh, D. (2013). Bringing semantics into focus using visual abstraction. In CVPR.

Download references


This work was supported by Higher Education Commission (HEC), Govt of Pakistan through its research Grant Number 6025/Federal/NRPU/RD/HEC/2016. We really appreciate the continuous guidance we received from anonymous reviewers. We are thankful to Khadija Azhar, Shanza Nasir, Tuba Tanveer and other ROMI lab members for helping out with sketches. We are also thankful to Saran Khaliq for providing assistance with deep object detectors.

Author information



Corresponding author

Correspondence to Wajahat Hussain.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Ahmad, H., Usama, S.M., Hussain, W. et al. A sketch is worth a thousand navigational instructions. Auton Robot 45, 313–333 (2021). https://doi.org/10.1007/s10514-020-09965-2

Download citation


  • Navigation
  • Collaborative mapping
  • Human–robot interaction