International Conference on Human-Computer Interaction

HCI 2015: HCI International 2015 – Posters' Extended Abstracts, pp. 383–388

Smart Playground: A Tangible Interactive Platform with Regular Toys for Young Kids

  • Duc-Minh Pham
  • Thinh Nguyen-Vo
  • Minh-Triet Tran
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 528)


In the modern world, children need to become familiar with interactive toys to quickly develop their learning and imagination. Our approach is to add augmented information and interaction to common toys on the surface containing them, a system we call Smart Playground. Popular methods use three color channels and local features to recognize objects. However, children's toys usually bear various pictures with different colors drawn on many small components, so color alone is unreliable; depth data is useful in this case, since each toy usually has a unique shape that distinguishes it from others. In this paper, we use an RGB-D sensor to collect information about both the color and the shape of objects. To learn the training set of toys, a convolutional neural network is used to represent the data (both color and depth) by high-level feature vectors. Using the combined results, the accuracy of 3D recognition is more than 90 %.


Keywords: RGB-D · 3D recognition · Deep learning · Convolutional neural network

1 Introduction

Young kids love playing with toys. Although young children in modern life have opportunities to interact and play games with mobile devices, such as tablets or smartphones, toys are still an essential part of their childhood. The challenge is to integrate into regular toys the smart features and interactions that a kid can experience with computers. Making regular toys more lively with smart interactive features would make young kids more excited when playing with them.

In this paper, we propose to develop a Smart Playground, a platform in which a regular toy becomes a tangible user interface object. When a kid puts his or her bear or doll on the surface of Smart Playground, our system recognizes which toy it is from RGB-D information captured by color and depth cameras, then displays visual effects on that toy and on other toys already on the playground, and plays sound effects or background music to augment the playground environment.

There are two main components in our proposed Smart Playground. The first component is a 3D toy recognition module that can recognize a regular 3D toy from its 3D shape and color/texture. The second component is a module to manage augmented multimedia objects linked to a given 3D toy.

The main contributions in our proposed system are as follows:
  • First, we propose to transform regular toys into tangible UI objects that can be used to activate certain interactions and events on the smart playground. No special modification of the toys, such as magnetic tags or RFID chips, is required. We employ a vision-based approach to recognize 3D objects, i.e., regular toys, from their 3D shapes, colors, and textures.

  • Second, we propose a method following the new trend in computer vision of applying a deep convolutional neural network to learn higher-level features [9] from depth and color data, boosting the accuracy of 3D object recognition in real time.

  • Third, we propose an architecture for Smart Playground to become a flexible platform that can accept more toys with augmented multimedia data. Each toy, after being trained to be recognized, is associated with a collection of multimedia data, such as video clips, audio clips, or images.

The rest of this paper has the following structure: In Sect. 2, we briefly review related work; the proposed method for 3D recognition with RGB-D features is presented in Sect. 3; after that, the architecture of our smart playground is introduced in Sect. 4; Sect. 5 presents the experimental results and evaluations; finally, the conclusion is given in Sect. 6.

2 Related Works

Nowadays, many depth sensors are available, opening a new era for 3D research in computer vision. With depth data, not only the appearance but also the shape of objects is known. The new trend is to combine color and depth features for more effective object recognition. Depth data also supports scene segmentation and classification [2]. Some approaches use sparse coding to learn hierarchical feature representations from raw RGB-D data in an unsupervised way [3]. Capturing the newest ideas of deep learning, several research groups use neural networks to train on RGB-D images [4, 5, 6]. Following this trend, we apply a convolutional neural network (CNN) in our system.

3 Proposed Method for 3D Toy Recognition with RGB-D Data

To recognize a regular toy, we use both color data and depth data captured from the cameras in real time. Figure 1 illustrates the overview of our proposed method to recognize a toy with RGB-D data. The recognition process is performed on color and depth data independently; the results are then processed in the fusion module to determine the output toy ID corresponding to a regular 3D toy.
Fig. 1.

Overview of our proposed method for 3D toy recognition with RGB-D data

Fig. 2.

Toy recognition using convolutional neural network

Instead of depending on the common approach of recognizing an object from color data with the Bag-of-Words (BoW) model [1], we follow the new trend in visual object recognition and apply a deep convolutional neural network (CNN) [9] to process both color data and raw 3D data into features in a high-level representation (cf. Fig. 2). One of the most popular frameworks for building and testing CNNs is Caffe [7]. This tool not only supports an engine to compute the parameters of a CNN but also provides pre-trained models from popular image data sets such as ImageNet ILSVRC2012 and MNIST. These sets have a very large number of classes and images, especially ILSVRC2012 [11]. This 1000-class data set consists of millions of images of various things in the real world, such as animals (cat, monkey, fox), daily items (umbrella, soccer ball, balloon), means of transportation (car, canoe, boat), and natural scenes (volcano, sea, forest). Hence, a CNN trained on ILSVRC2012 covers most of the visual features we encounter in everyday life.

Based on the above characteristics, we propose to use the color image of a specific object as input to the CNN trained on ILSVRC2012. The corresponding output vector of the feed-forward pass through this CNN can be considered a high-level global feature for the given object. For depth information, the raw 3D data needs to be normalized into a 2D grayscale image that describes the shape of the object before being fed to the CNN. Each object is thus represented by two vectors, one for color data and one for depth data.
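The paper does not specify how raw depth is normalized into a grayscale image; a minimal sketch of one plausible scheme (the function name, depth range, and brighter-is-nearer convention are our assumptions, not the authors') is:

```python
import numpy as np

def depth_to_grayscale(depth_mm, near=400.0, far=1500.0):
    """Normalize a raw depth map (millimeters) into an 8-bit
    grayscale image describing the object's shape.

    Depth values are clipped to [near, far] (values of 0, i.e.
    missing readings, clip to near), then scaled linearly to
    0..255 so that nearer surfaces appear brighter.
    """
    d = np.asarray(depth_mm, dtype=np.float32)
    d = np.clip(d, near, far)
    gray = (far - d) / (far - near) * 255.0  # nearer -> brighter
    return gray.astype(np.uint8)
```

The resulting single-channel image can be replicated across the three input channels before being fed to an RGB-trained CNN.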

The CNN trained on ILSVRC2012 consists of multiple layers that generate vectors of different dimensions. We decided to use the last feature vector of 4096 elements to represent our input data. For each type of data (color or depth), all of the corresponding vectors from our training set of 3D toys are used to construct a prediction model of multiple SVM classifiers [8]. The purpose of this additional step is to adapt and transform the feature vectors to our own data set. Finally, for each test toy, we collect two probability vectors predicted by the SVM classifiers on color and depth data. These two vectors are used to generate the final prediction by assigning the given toy to the class with the maximum predicted probability over both color and depth data.
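The exact fusion rule is left open in the text; a minimal sketch, assuming the two per-class probability vectors are simply averaged before taking the arg-max (weighted sums or products are equally plausible readings), is:

```python
import numpy as np

def fuse_predictions(p_color, p_depth):
    """Fuse per-class probability vectors from the color-SVM and
    depth-SVM predictions into a final toy class ID.

    Assumption: equal-weight averaging of the two probability
    vectors, then arg-max over classes.
    """
    p_color = np.asarray(p_color, dtype=np.float64)
    p_depth = np.asarray(p_depth, dtype=np.float64)
    combined = (p_color + p_depth) / 2.0
    return int(np.argmax(combined))
```

For example, with color probabilities (0.2, 0.5, 0.3) and depth probabilities (0.7, 0.2, 0.1), the averaged vector (0.45, 0.35, 0.2) selects class 0: confident depth evidence overrides a weak color prediction.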

4 Architecture of Smart Playground

Figure 3 shows the overview of our proposed Smart Playground. Any regular flat surface can be transformed into a smart playground. A depth camera and a regular color camera are mounted above the surface to continuously capture images of the playground below. A projector is also hung above the surface to create visual effects and presentations on the surface and the objects on it. A calibration step aligns the captured data from the two cameras, as well as the camera pair with the projector.

When a young kid puts a toy on the ground surface, the depth camera detects the difference in the playground surface and activates the toy recognition process. After identifying which toy was put on the surface, the server retrieves the augmented multimedia objects, such as video clips, audio clips, or images, linked to that toy. These multimedia objects are then presented through the projector and/or the speaker. For example, when a child puts a house model on the surface, a garden with trees and a swimming pool is projected onto the surface next to the real physical house toy. As another example, an audio clip of the Snow White story is played when a child puts a Snow White doll on the playground surface.
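The change-detection step that triggers recognition is not detailed in the paper; one hedged sketch (the threshold values and function name are our assumptions) compares each depth frame against a pre-captured empty-surface background:

```python
import numpy as np

def toy_placed(background_depth, current_depth,
               diff_mm=30.0, min_pixels=200):
    """Detect that a toy has been placed on the surface by
    comparing the current depth frame against a background frame
    captured when the surface was empty.

    A pixel counts as changed when its depth differs from the
    background by more than diff_mm millimeters; recognition is
    triggered once at least min_pixels pixels have changed.
    """
    diff = np.abs(current_depth.astype(np.float32)
                  - background_depth.astype(np.float32))
    return int((diff > diff_mm).sum()) >= min_pixels
```

Thresholding on a pixel count rather than a single pixel makes the trigger robust to isolated sensor noise.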
Fig. 3.

Overview of smart playground

Figure 4 illustrates the architecture to manage and process augmented multimedia objects. For each toy registered in the system, there is a specification of all augmented multimedia objects linked to that toy. The Specification Processor processes an XML-based specification of an augmented multimedia object to generate an appropriate instance of a multimedia object, such as an audio clip, a video clip, a single image, an image sequence, or a 3D model. The object is then presented by a corresponding presenter.
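The XML specification format is not given in the paper; a minimal sketch, assuming a hypothetical schema in which each toy element lists its media children, could look like:

```python
import xml.etree.ElementTree as ET

# Hypothetical specification for one registered toy.
SPEC = """
<toy id="snow-white-doll">
  <media type="audio" src="snow_white_story.mp3"/>
  <media type="image" src="castle_backdrop.png"/>
</toy>
"""

def parse_toy_spec(xml_text):
    """Parse a toy's augmented-media specification into
    (toy_id, [(media_type, source), ...]) so that each entry can
    be dispatched to the matching presenter (audio player,
    image projector, ...)."""
    root = ET.fromstring(xml_text)
    media = [(m.get("type"), m.get("src"))
             for m in root.findall("media")]
    return root.get("id"), media
```

Keeping the specification declarative is what makes the platform extensible: registering a new toy only requires adding a new XML file, not new code.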
Fig. 4.

Augmented multimedia object manager

5 Experiments and Implementation

In our implementation, we use PrimeSense Carmine 1.09 and Microsoft Lifecam Cinema 720p to capture depth and color data respectively.

To evaluate the accuracy of recognizing toys with RGB-D data, we first apply our proposed method to a subset of the RGB-D Object Dataset [1]. The chosen subset consists of 32 classes, each with approximately 250 images. All of the images depict daily items such as hats, balls, and cameras. We use three quarters of the images in each class for training and the remainder for testing. To determine the benefit of combining color and depth data for recognition, the recognition process is applied to color data only, then to depth data only, and finally to both.

On the 1662 images of the test set, we obtain an accuracy of 88.45 % for recognition on color data only and 82.07 % on depth data only. When we combine the predictions from both color and depth data, we reach an accuracy of 90.49 %.
Fig. 5.

Our collection with color and depth data

Another experiment is performed on our own collection of color and depth data for 10 toys, each about 10 cm along each dimension. For each toy, we collect 50 samples from different views and at different distances. All of the toys are white and are placed on tables (Fig. 5). The accuracy is 90.63 % for color data only, 88.13 % for depth data only, and 97.12 % for the RGB-D combination.

6 Conclusion

In this paper, we propose to develop a Smart Playground that transforms regular toys into multimedia-augmented tangible UI objects. To recognize a toy, we use both color and depth data, representing each with high-level features learned by a deep convolutional neural network and classified with SVMs. Through a plug-in mechanism, more types of multimedia objects and actions can also be added to our platform.

From the experiments, we verify that although the accuracy of toy recognition with depth data alone is lower than with color data alone, the fusion of results from both depth and color data provides the best accuracy. Furthermore, the high-level representation of depth features produced by the convolutional neural network boosts the overall accuracy of the toy recognition process.


  1. Fei-Fei, L., Fergus, R., Torralba, A.: Recognizing and Learning Object Categories. Short course at ICCV (2005)
  2. Gupta, S., Arbelaez, P., Malik, J.: Perceptual organization and recognition of indoor scenes from RGB-D images. In: CVPR (2013)
  3. Bo, L., Ren, X., Fox, D.: Unsupervised feature learning for RGB-D based object recognition. In: Desai, J.P., Dudek, G., Khatib, O., Kumar, V. (eds.) Experimental Robotics. STAR, vol. 88, pp. 387–402. Springer, Heidelberg (2013)
  4. Socher, R., Huval, B., Bhat, B., Manning, C.D., Ng, A.Y.: Convolutional-recursive deep learning for 3D object classification. In: NIPS (2012)
  5. Gupta, S., Girshick, R., Arbeláez, P., Malik, J.: Learning rich features from RGB-D images for object detection and segmentation. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014, Part VII. LNCS, vol. 8695, pp. 345–360. Springer, Heidelberg (2014)
  6. Liu, L., Shao, L.: Learning discriminative representations from RGB-D video data. In: IJCAI (2013)
  7. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: convolutional architecture for fast feature embedding (2014)
  8. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
  9. Le, Q.V., Ranzato, M.A., Monga, R., Devin, M., Corrado, G., Chen, K., Dean, J., Ng, A.Y.: Building high-level features using large scale unsupervised learning. In: ICML (2012)
  10. Sun, Y., Bo, L., Fox, D.: Learning to identify new objects. In: ICRA, pp. 3165–3172 (2014)
  11. ImageNet: Large Scale Visual Recognition Challenge, ILSVRC2012 (2012)

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Duc-Minh Pham¹
  • Thinh Nguyen-Vo¹
  • Minh-Triet Tran¹

  1. Faculty of Information Technology, University of Science, VNU-HCM, Ho Chi Minh City, Vietnam
