3D Object Detection with Multiple Kinects

  • Wandi Susanto
  • Marcus Rohrbach
  • Bernt Schiele
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7584)


Categorizing and localizing multiple objects in 3D space is a challenging but essential task for many robotics and assisted-living applications. While RGB cameras as well as depth information have been widely explored in computer vision, there is surprisingly little recent work combining multiple cameras and depth information. Given the recent emergence of consumer depth cameras such as the Kinect, we explore how multiple cameras and active depth sensors can be used to tackle the challenge of 3D object detection. More specifically, we generate point clouds from the depth information of multiple registered cameras and describe them with the VFH descriptor [20]. For color images we employ the DPM [3], and we combine both approaches with a simple voting scheme across multiple cameras.

On the large RGB-D dataset [12] we show improved performance for object classification on multi-camera point clouds and for object detection on color images. To evaluate the benefit of combining color and depth information from multiple cameras, we recorded a novel dataset with four Kinects, showing significant improvements over a DPM baseline for 9 object classes arranged in challenging scenes. In contrast to related datasets, our dataset provides color and depth information recorded with multiple Kinects and requires localizing and categorizing multiple objects in 3D space. To foster research in this field, the dataset, including annotations, is available on our web page.


Keywords: Point Cloud, Object Detection, Depth Information, Depth Image, Camera View


References

  1. Coates, A., Ng, A.Y.: Multi-camera object detection for robotics. In: ICRA (2010)
  2. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005)
  3. Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. PAMI (2010)
  4. Franzel, T., Schmidt, U., Roth, S.: Object Detection in Multi-view X-Ray Images. In: Pinz, A., Pock, T., Bischof, H., Leberl, F. (eds.) DAGM/OAGM 2012. LNCS, vol. 7476, pp. 144–154. Springer, Heidelberg (2012)
  5. Frome, A., Huber, D., Kolluri, R., Bülow, T., Malik, J.: Recognizing Objects in Range Data Using Regional Point Descriptors. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3023, pp. 224–237. Springer, Heidelberg (2004)
  6. Gould, S., Baumstarck, P., Quigley, M., Ng, A.Y., Koller, D.: Integrating visual and range data for robotic object detection. In: M2SFA2 (2008)
  7. Helmer, S., Meger, D., Muja, M., Little, J.J., Lowe, D.G.: Multiple Viewpoint Recognition and Localization. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part I. LNCS, vol. 6492, pp. 464–477. Springer, Heidelberg (2011)
  8. Hinterstoisser, S., Cagniart, C., Ilic, S., Konolige, K., Navab, N., Lepetit, V.: Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In: ICCV (2011)
  9. Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe, R., Kohli, P., Shotton, J., Hodges, S., Freeman, D., Davison, A., Fitzgibbon, A.: KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera. In: UIST (2011)
  10. Janoch, A., Karayev, S., Jia, Y., Barron, J.T., Fritz, M., Saenko, K., Darrell, T.: A category-level 3-D object dataset: Putting the Kinect to work. In: ICCV (2011)
  11. Johnson, A., Hebert, M.: Using spin images for efficient object recognition in cluttered 3D scenes. PAMI (1999)
  12. Lai, K., Bo, L., Ren, X., Fox, D.: A large-scale hierarchical multi-view RGB-D object dataset. In: ICRA (2011)
  13. Liu, J., Shah, M., Kuipers, B., Savarese, S.: Cross-view action recognition via view knowledge transfer. In: CVPR (2011)
  14. Meger, D., Wojek, C., Little, J.J., Schiele, B.: Explicit occlusion reasoning for 3D object detection. In: BMVC (2011)
  15. Muja, M., Lowe, D.: Fast approximate nearest neighbors with automatic algorithm configuration. In: VISAPP (2009)
  16. Redondo-Cabrera, C., López-Sastre, R.J., Acevedo-Rodríguez, J., Maldonado-Bascón, S.: Surfing the point clouds: Selective 3D spatial pyramids for category-level object recognition. In: CVPR (2012)
  17. Rohrbach, M., Enzweiler, M., Gavrila, D.M.: High-Level Fusion of Depth and Intensity for Pedestrian Classification. In: Denzler, J., Notni, G., Süße, H. (eds.) DAGM 2009. LNCS, vol. 5748, pp. 101–110. Springer, Heidelberg (2009)
  18. Rohrbach, M., Regneri, M., Andriluka, M., Amin, S., Pinkal, M., Schiele, B.: Script Data for Attribute-Based Recognition of Composite Activities. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part I. LNCS, vol. 7572, pp. 144–157. Springer, Heidelberg (2012)
  19. Roig, G., Boix, X., Shitrit, H.B., Fua, P.: Conditional random fields for multi-camera object detection. In: ICCV (2011)
  20. Rusu, R.B., Bradski, G., Thibaux, R., Hsu, J.: Fast 3D Recognition and Pose Using the Viewpoint Feature Histogram. In: IROS (2010)
  21. Rusu, R.B., Cousins, S.: 3D is here: Point Cloud Library (PCL). In: ICRA (2011)
  22. Saenko, K., Karayev, S., Jia, Y., Shyr, A., Janoch, A., Long, J., Fritz, M., Darrell, T.: Practical 3-D object detection using category and instance-level appearance models. In: IROS (2011)
  23. Vedaldi, A., Zisserman, A.: Efficient additive kernels via explicit feature maps. In: CVPR (2010)
  24. Wilson, A.D., Benko, H.: Combining multiple depth cameras and projectors for interactions on, above and between surfaces. In: UIST (2010)
  25. Wojek, C., Roth, S., Schindler, K., Schiele, B.: Monocular 3D Scene Modeling and Inference: Understanding Multi-Object Traffic Scenes. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 467–481. Springer, Heidelberg (2010)

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Wandi Susanto (1)
  • Marcus Rohrbach (1)
  • Bernt Schiele (1)
  1. Max Planck Institute for Informatics, Saarbrücken, Germany
