Self-directed Lifelong Learning for Robot Vision
Efforts towards robust visual scene understanding tend to rely heavily on manual annotations. When human labels are required, collecting a dataset large enough to train a successful robot vision system is almost certain to be prohibitively expensive. However, we argue that a robot with a vision sensor can learn powerful visual representations in a self-directed manner by relying on fundamental physical priors and bootstrapping techniques. For example, it has been shown that basic visual tracking systems can be used to automatically label short-range correspondences in video that allow one to train a system with capabilities analogous to object permanence in humans. An object permanence system can in turn be used to automatically label long-range correspondences, allowing one to train a system able to compare and contrast objects and scenes. In the end, the agent will develop a representation that encodes persistent material properties, state, lighting, etc. of various parts of a visual scene. Starting with a strong visual representation, the agent can then learn to solve traditional vision tasks such as class and/or instance recognition using only a sparse set of labels that can be found on the Internet or solicited at little cost from humans. More importantly, such a representation would also enable truly robust solutions to challenges in robotics such as global localization, loop closure detection, and object pose estimation.
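The first bootstrapping step described above, using a tracker to label short-range correspondences automatically, can be illustrated with a minimal sketch. It is not the authors' implementation. The track format (a mapping from a track ID to `(frame_index, patch_id)` observations), the function names `mine_triplets` and `triplet_loss`, the `max_gap` threshold, and the choice of a triplet margin loss are all assumptions made for illustration. The sketch also assumes that patches from different tracks show different objects, which a real tracker does not guarantee.

```python
import math
import random
from itertools import combinations


def mine_triplets(tracks, max_gap=5, seed=0):
    """Build (anchor, positive, negative) patch triplets from tracker output.

    tracks: dict mapping a hypothetical track ID to a list of
            (frame_index, patch_id) observations.
    Two observations of the same track at most `max_gap` frames apart are
    treated as a positive pair (a short-range correspondence). The negative
    is drawn at random from any other track (an assumption, see lead-in).
    """
    rng = random.Random(seed)
    triplets = []
    for track_id, observations in tracks.items():
        # Candidate negatives: every patch that belongs to another track.
        others = [
            patch
            for other_id, other_obs in tracks.items()
            if other_id != track_id
            for _, patch in other_obs
        ]
        if not others:
            continue
        ordered = sorted(observations)  # sort by frame index
        for (f1, p1), (f2, p2) in combinations(ordered, 2):
            if f2 - f1 <= max_gap:
                triplets.append((p1, p2, rng.choice(others)))
    return triplets


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss that pulls the positive closer to the anchor than the negative.

    Each argument is an embedding given as a sequence of floats.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    # Toy tracker output: two tracks with string patch identifiers.
    tracks = {
        "a": [(0, "a0"), (1, "a1"), (10, "a10")],
        "b": [(0, "b0"), (2, "b2")],
    }
    for anchor, positive, negative in mine_triplets(tracks):
        print(anchor, positive, negative)
```

In a full system the patch identifiers would index image crops, an embedding network would map each crop to a vector, and the loss would be minimized over the mined triplets. The same mining step could in principle be repeated with the learned embedding in place of the tracker to propose long-range correspondences.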