Efficient Object Annotation via Speaking and Pointing

  • Michael Gygli
  • Vittorio Ferrari


Deep neural networks deliver state-of-the-art visual recognition, but they rely on large datasets, which are time-consuming to annotate. These datasets are typically annotated in two stages: (1) determining the presence of object classes at the image level and (2) marking the spatial extent of all objects of these classes. In this work we use speech, together with mouse inputs, to speed up this process. We first improve stage one by letting annotators indicate object class presence via speech. We then combine the two stages: annotators draw an object bounding box with the mouse and simultaneously provide its class label via speech. Using speech has distinct advantages over relying on mouse inputs alone. First, it is fast and gives direct access to the class name: the annotator simply says it. Second, annotators can speak and mark an object location at the same time. Finally, speech-based interfaces can be kept extremely simple, so using them requires less mouse movement than existing approaches. Through extensive experiments on the COCO and ILSVRC datasets we show that our approach yields high-quality annotations at significant speed gains. Stage one takes \(2.3\times\) to \(14.9\times\) less annotation time than existing methods based on a hierarchical organization of the classes to be annotated. Moreover, when combining the two stages, we find that object class labels come for free: annotating them at the same time as bounding boxes has zero additional cost. On COCO, this makes the overall process \(1.9\times\) faster than the two-stage approach.


Speech-based annotation, Object annotation, Multimodal interfaces, Large-scale computer vision




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Google Research, Zurich, Switzerland
