
1 Introduction

Human-object interaction recognition has been studied in computer vision for years and has broad prospective applications. Although humans can distinguish interactions easily, it remains a difficult task for computers, for two main reasons: (1) The appearance of objects and humans varies greatly because of the occlusion between them during interactions, which often leads to recognition failure. As shown in Fig. 1(a), the objects are occluded by the hand; in this situation, recognizing objects from appearance alone is challenging. (2) Human actions are ambiguous if we consider only the pose in a single frame, and even pose sequences are not always sufficient, because different actions sometimes have similar pose sequences. In Fig. 1(b) and (c), “calling” and “drinking” cannot be separated well from their skeleton sequences.

Inspired by the work of Yao and Fei-Fei [22], we propose a method that recognizes human-object interactions by modeling the context between human actions and manipulated objects in RGBD videos captured by a Kinect sensor. First, we train an SVM classifier on pose-sequence features, which scores each action, and then we search for all possible manipulated objects with a sliding window near the human hand region. We keep all possible interpretations of the action and object labels; by modeling the context between them, we obtain the most reasonable human action and manipulated object labels through optimization. Our framework is shown in Fig. 2.

The rest of this paper is organized as follows. In Sect. 2, we review related work. In Sect. 3, we present our method in detail. In Sect. 4, we describe our dataset and experimental results. Finally, we conclude the paper in Sect. 5 with a discussion.

Fig. 1.

Human-object interaction examples. (a) The manipulated object is small and occluded during the interaction; (b) and (c) are ambiguous because of their similar pose sequences.

2 Related Work

Action Recognition. Action recognition is an important problem in computer vision, and researchers have proposed many different approaches to it. Most traditional methods focused on 2D data, using features such as silhouettes and shapes to build spatio-temporal descriptors for action recognition [4, 6, 17, 19]. Raptis and Soatto [17] proposed a hierarchical structure that included a SIFT average descriptor, a trajectory transition descriptor, and a trajectory proximity descriptor. Other works [13, 14] modeled actions with the coordinates of skeleton joints or the relative positions of body parts. The approach in [13] extracted local joint structures as local skeleton features and histograms of 3D joints as global skeleton features for action recognition. In most of the above work, actions are performed without occlusion; when occlusions occur, these features tend to fail, which is one of the most difficult issues in many vision tasks.

Object Features. Good features are crucial for object classification, and researchers have presented many kinds of them. Among low-level image features, SIFT [8, 15] and HOG [7, 25] are the most popular in vision tasks. Many researchers adopted multiple kinds of features to represent different aspects of objects for classification [1, 2, 11]. Ito and Kubota [11] introduced three co-occurrence features, color-CoHOG (color Co-occurrence Histograms of Oriented Gradients), CoHED (Co-occurrence Histograms of pairs of Edge orientations and color Difference), and CoHD (Co-occurrence Histograms of color Difference), to classify objects. Benefiting from the performance of 3D sensors such as the Kinect, some researchers [1, 2] extracted features from RGBD data. Bo et al. [2] presented five depth kernel descriptors that capture different cues, including size, shape, and edges. Other approaches use sets of semantic attributes to classify objects [12, 18]. Su et al. [18] used five groups of semantic attributes (scene, color, part, shape, and material) and demonstrated that semantic attributes can help improve object classification.

Context. Psychology experiments have shown that context plays an important role in recognition in the human visual system. Context has been used for many computer vision tasks, such as object detection, object classification, action recognition, scene recognition, and semantic segmentation [10, 16, 19, 20, 23]. Marszalek et al. [16] observed that human actions are related to the scene and its purpose, and therefore modeled the context between human actions and scenes to recognize actions. Sun et al. [19] adopted a hierarchical structure to represent spatio-temporal context information for action recognition. The authors of [20] proposed a 4D human-object interaction model to recognize events and objects in video sequences. Gupta et al. [10] combined spatial and functional constraints between humans and objects to recognize actions and objects. Yao et al. [23] modeled the mutual context of objects and human poses in human-object interaction activities.

Fig. 2.

An overview of our framework

3 Modeling Context of Actions and Objects

We define a human-object interaction as \(I = \langle A, O \rangle \). Given an RGBD video V over the time interval \(T = [1, T]\), our goal is to recognize the manipulated object O and the action A. We cast recognition as an optimization problem:

$$\begin{aligned} E(I, V) = \sum _{t = 1}^{T} (\lambda _1 E(A, V_{t}) + \lambda _2 E(O, V_{t}) + E(A, O)). \end{aligned}$$
(1)

where \(E(A, V_{t})\) is the energy of the human action in the temporal domain, \(E(O, V_{t})\) is the energy of the manipulated object in the spatial domain, and \(E(A, O)\) is the energy of the context between human actions and manipulated objects. \(\lambda _1\) and \(\lambda _2\) are weights that balance the contribution of each energy term.

3.1 Action Energy

\(E(A, V_{t})\) models the energy of the human action. We train a multi-class SVM classifier [5] that scores each skeleton sequence for an action class a. The energy is:

$$\begin{aligned} E(A, V_{t}) = \omega _{a}\theta _{a}. \end{aligned}$$
(2)

where \(\omega _{a}\) is the template parameter of action class a and \(\theta _{a}\) is the skeleton feature vector of the human action.

As shown in Fig. 3, we improve the global feature of [21], histograms of 3D joints (HOJ3D), by combining it with local features. The feature is computed in an aligned spherical coordinate system and is therefore view-independent. We denote the features as \(\theta _{a} = \{ F_{L},F_{G} \}\), which consist of two parts: local features \(F_{L} = (l_{1}, l_{2}, \ldots , l_{N_{l}})\) and global features \(F_{G} = (g_{1}, g_{2}, \ldots , g_{N_{g}})\). For the global features, 3D space is first divided into spherical bins and we count how many skeleton joints fall into each bin. This describes the overall statistical distribution of the skeleton joints but ignores local structure. Therefore, we also extract a local feature that describes the local structure of the skeleton, using the triangle areas formed by triplets of joints. To obtain more robust and compact features, we apply linear discriminant analysis (LDA) to reduce the dimensionality of the feature space; LDA extracts dominant features and produces good discrimination between classes by searching for the optimal subspace.
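To make the feature construction concrete, the following is a minimal sketch of the local (triangle-area) and global (spherical-bin histogram) skeleton features followed by LDA reduction, assuming 3D joint coordinates per frame; the bin counts, the temporal pooling by averaging, and the LDA dimensionality are illustrative assumptions rather than the exact settings used in the paper.

```python
import numpy as np
from itertools import combinations
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def local_features(joints):
    """Triangle areas for every triplet of 3D joints; joints has shape (N, 3)."""
    areas = []
    for i, j, k in combinations(range(len(joints)), 3):
        # Area of the triangle spanned by three joints via the cross product.
        areas.append(0.5 * np.linalg.norm(
            np.cross(joints[j] - joints[i], joints[k] - joints[i])))
    return np.asarray(areas)

def global_features(joints, n_azimuth=12, n_elevation=6):
    """HOJ3D-style histogram: count joints falling into spherical bins
    around a body-centered origin (joints[0] is assumed to be the hip center)."""
    rel = joints - joints[0]
    azimuth = np.arctan2(rel[:, 1], rel[:, 0])          # [-pi, pi]
    elevation = np.arctan2(rel[:, 2], np.linalg.norm(rel[:, :2], axis=1))
    hist, _, _ = np.histogram2d(
        azimuth, elevation,
        bins=[n_azimuth, n_elevation],
        range=[[-np.pi, np.pi], [-np.pi / 2, np.pi / 2]])
    return hist.ravel()

def sequence_feature(skeleton_seq):
    """Concatenate local and global features per frame, then pool over time."""
    frames = [np.concatenate([local_features(f), global_features(f)])
              for f in skeleton_seq]
    return np.mean(frames, axis=0)   # simple temporal pooling (an assumption)

# Dimensionality reduction with LDA over the training set, e.g.:
#   X: (n_sequences, feature_dim), y: action labels
#   lda = LinearDiscriminantAnalysis(n_components=5).fit(X, y)
#   X_reduced = lda.transform(X)
```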

Fig. 3.

The features for action recognition. (1) Extracting local features by computing triangle areas. (2) Extracting global features by mapping the skeleton coordinates into spherical bins.

3.2 Object Energy

\(E(O, V_{t})\) models the energy of the object label. Extracting discriminative features is critical for object classification, and various features have been explored [3, 9, 24]. In this paper, we apply kernel descriptors [1] to model objects. We use the gradient, color, and local binary pattern match kernels to turn images into patch-level features, sampling image patches of different sizes, such as \(4 \times 4\) and \(8 \times 8\) rectangles. The gradient match kernel measures the similarity of gradient orientations between patches from two different images; in the same way, the color match kernel and the local binary pattern match kernel represent image appearance and local shape, respectively. We visualize these three types of kernel features in Fig. 4. Because the number of image patches is large and evaluating the kernels is time consuming, we use kernel principal component analysis to extract compact basis vectors. In human-object interactions, the object is often small and occluded; fortunately, thanks to the Kinect, we can obtain stable and reliable human poses, so we use sliding windows of several sizes to detect the object near the human hand region. For each window, we extract the above features and compute the cost of assigning label o to the object with a linear SVM classifier. The energy is defined as:

$$\begin{aligned} E(O, V_{t}) = \omega _{o}\theta _{o}. \end{aligned}$$
(3)

where \(\omega _{o}\) is the template parameter of object class o and \(\theta _{o}\) is the kernel feature vector of the manipulated object.
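As an illustration of how the sliding-window object energy might be computed, the sketch below scores several window sizes centered on the detected hand position with a linear classifier. Here `extract_kernel_descriptor` is a hypothetical placeholder for the kernel descriptor pipeline (gradient, color, and local binary pattern match kernels compressed with kernel PCA), and the window sizes are assumptions.

```python
import numpy as np

WINDOW_SIZES = [48, 64, 96]          # assumed window sizes in pixels

def object_energy(rgb_frame, hand_xy, svm_weights, svm_bias,
                  extract_kernel_descriptor):
    """Return the lowest energy per object class over all windows centered
    on the hand position given by the Kinect skeleton."""
    n_classes = svm_weights.shape[0]
    best = np.full(n_classes, np.inf)
    cx, cy = hand_xy
    for s in WINDOW_SIZES:
        x0, y0 = int(cx - s / 2), int(cy - s / 2)
        patch = rgb_frame[max(y0, 0):y0 + s, max(x0, 0):x0 + s]
        if patch.size == 0:
            continue
        theta_o = extract_kernel_descriptor(patch)   # patch-level feature
        scores = svm_weights @ theta_o + svm_bias    # one score per class
        best = np.minimum(best, -scores)             # lower energy = better
    return best   # E(O = o, V_t) for every candidate object label o
```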

Fig. 4.

The gradient, color, and shape kernel features of the cup (Color figure online)

3.3 Context Between Actions and Objects

Humans can recognize human-object interactions easily even when some information is missing, because the context between an action and its related object is stored in the human mind. This contextual knowledge helps to find the most reasonable joint interpretation of the action and the manipulated object. For example, when someone is drinking, severe occlusion usually occurs because the hand holds the cup; for a human observer, however, it is easy to fill in the missing object information, since a cup is assumed to be in the hand when the person performs the action “drink”. Some kinds of actions are more likely to involve certain objects; in other words, we can infer the human action from the related object and vice versa. The context between human actions and manipulated objects is defined as:

$$\begin{aligned} E(A, O) = \sum _{i=1}^{N_{A}}\sum _{j=1}^{N_{O}}N_{(A=i)}N_{(O=j)} . \end{aligned}$$
(4)

where \(N_{A}\) and \(N_{O}\) are the numbers of action classes and object classes, respectively, and \(N_{(A=i)}\) and \(N_{(O=j)}\) are the numbers of occurrences of the ith action category and the jth object category, respectively.
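Under the assumption that the context term is built from co-occurrence counts of action and object labels in the training data, a minimal sketch looks as follows; turning the counts into a (negative) energy so that compatible pairs are cheaper is our own illustrative choice.

```python
import numpy as np

def cooccurrence_counts(train_actions, train_objects, n_actions, n_objects):
    """N[i, j] = number of training interactions with action i and object j."""
    N = np.zeros((n_actions, n_objects))
    for a, o in zip(train_actions, train_objects):
        N[a, o] += 1
    return N

def context_energy(N, action, obj):
    """A frequent pair such as ("drink", "cup") gets a large count and
    therefore a low context energy."""
    return -N[action, obj] / N.sum()
```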

3.4 Learning and Inference

We adopt the SVM algorithm to learn the action model parameter \(\omega _{a}\), defined by:

$$\begin{aligned} \mathop {\min }_{\omega , \xi _{i}}\quad \frac{1}{2} \omega ^{T}\omega + C\sum _{i=1}^{N}\xi _{i} . \end{aligned}$$
(5)
$$\begin{aligned} {\text {s.t.}}\quad y_i (\omega ^{T}\theta _{i}+b) \ge 1-\xi _{i}. \end{aligned}$$

where \(y_{i}\) is the label of training sample \(x_i\) and \(\theta _{i}\) is its feature vector. The object model parameter \(\omega _{o}\) is learned by SVM in the same way.
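A minimal sketch of learning the action model parameter with a soft-margin linear SVM, in the spirit of Eq. (5), using scikit-learn is shown below; the regularization constant C is an assumed value.

```python
from sklearn.svm import LinearSVC

def train_action_model(X_train, y_train, C=1.0):
    """X_train: reduced skeleton features, y_train: action labels."""
    clf = LinearSVC(C=C)                # L2-regularized soft-margin SVM, one-vs-rest
    clf.fit(X_train, y_train)
    return clf.coef_, clf.intercept_    # omega_a and bias for each action class

# The object model omega_o can be trained the same way from the kernel
# descriptors of the training windows.
```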

Our final optimization problem is to find the interpretation with minimum energy in Eq. (1):

$$\begin{aligned} I^{\star } = \mathop {\arg \min }_{I_{1} I_{2} \cdots I_{t}} \sum _{t = 1}^{T}{E(I_{t}, V_{t})}. \end{aligned}$$
(6)

We optimize Eq. (6) with a greedy algorithm: at each stage we make the locally optimal choice, with the hope of finding a global optimum.
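The following sketch illustrates one way such a greedy, per-frame minimization of Eq. (6) could be implemented, given per-frame action and object energies and the context table; the weights \(\lambda _1\) and \(\lambda _2\) are assumed values.

```python
import numpy as np

def greedy_inference(E_action, E_object, E_context, lam1=1.0, lam2=1.0):
    """E_action: (T, n_actions), E_object: (T, n_objects),
    E_context: (n_actions, n_objects). Returns a per-frame (action, object) label."""
    labels = []
    for t in range(E_action.shape[0]):
        # Combined energy for every (action, object) pair at frame t.
        E = (lam1 * E_action[t][:, None]
             + lam2 * E_object[t][None, :]
             + E_context)
        a, o = np.unravel_index(np.argmin(E), E.shape)
        labels.append((a, o))           # locally optimal choice for this frame
    return labels
```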

4 Experiments

We collect a human-object interaction dataset with a Kinect sensor, which provides RGB, depth, and skeleton data. There are ten subjects, six males and four females, and each person performs each interaction four or five times. The dataset contains eight daily interactions: calling with a phone, drinking from a cup, picking up a mouse, putting down a cup, opening a laptop, turning off a laptop, opening a soda can, and pouring into a cup. Some examples from our dataset are shown in Fig. 5.

Fig. 5.

Some examples of the human-object interactions in our dataset

Action Recognition. Human actions are ambiguous without the manipulated objects; for example, making a call has skeleton sequences similar to those of drinking. We use the global and local skeleton features described in Sect. 3.1. The confusion matrices of action recognition without and with the manipulated-object context are shown in Fig. 7(a) and (b). The results in Fig. 7(a) show that making a call, drinking, putting down, and pouring water are confused with high probability. In Fig. 6, the “drink” action has the minimum action energy, but since the manipulated object is most likely a phone, we finally infer that the action label is “make a call”. These results show that action recognition accuracy can be improved by using the related objects as context.

Fig. 6.

Optimizing action recognition and object recognition with the context between them.

Object Classification. In human-object interactions, the related objects are often small or occluded by the human hand, and occlusion is one of the most difficult issues in computer vision. Bo et al. [1] designed a family of kernel features that describe gradient, color, and shape. Our dataset is randomly split into ten parts, and we adopt a leave-one-sample-out cross-validation strategy: one original video sequence is used as test data while the remaining sequences are used as training data. The confusion matrices of object recognition without and with the human action context are shown in Fig. 8(a) and (b). The context between human actions and manipulated objects is effective in improving the accuracy of object classification.

Human-Object Interaction Recognition. In addition to recognizing human actions and objects in an RGBD video, we can also recognize the human-object interaction itself. Figure 9 shows the performance of our method on human-object interaction recognition. The results demonstrate that combining human action recognition and object classification leads to better human-object interaction recognition.

Fig. 7.

(a) The confusion matrix of action recognition in [21]. (b) The confusion matrix of our method

Fig. 8.

(a) The confusion matrix of object recognition in [1]. (b) The confusion matrix of our method

Fig. 9.

The confusion matrix of human-object interaction recognition

5 Conclusions

Human-object interaction recognition is one of the most important topics in computer vision. In this paper, we model human actions and manipulated objects in a unified framework for recognition. For human actions, we combine local and global features to improve recognition accuracy; for objects, we apply kernel features, which our experiments show to be more discriminative. We then model the context between actions and manipulated objects, which helps to improve human-object interaction recognition. In the future, we will extend our dataset and consider more kinds of actions and objects.