Human Gesture Recognition in Still Images Using GMM Approach

  • Soumya Ranjan Mishra
  • Tusar Kanti Mishra
  • Goutam Sanyal
  • Anirban Sarkar
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 695)

Abstract

Human gesture and activity recognition is an important topic that is gaining popularity in several research sectors associated with computer vision. The requirements are still challenging, and researchers have proposed a handful of methods to meet them. In this work, the objective is to compute and analyze native space-time features in a common experimental setting for the recognition of several human gestures. In particular, we have considered four distinct feature extraction methods and six native feature representation methods within a bag-of-features framework. The support vector machine (SVM) is used as the classifier. The performance of the scheme has been analyzed using ten distinct gesture images derived from the Willow 7-action dataset (Delaitre et al., Proceedings British Machine Vision Conference, 2010). Interesting experimental results are obtained that validate the efficiency of the proposed technique.

1 Introduction

There exist several video-based human action and gesture recognition schemes in the literature [2, 3]. However, few schemes are available for recognizing human gestures and actions in static images. The recognition of human gestures from static images plays a very important role, especially in forensic science and investigations, and the problem has therefore drawn the attention of many researchers over the last decade. The PASCAL-VOC action recognition competition [4] has been organized to facilitate research in this area. There is a core difference between human action and gesture recognition in video and in still images: in video, temporal sequences of image frames are available for analysis [5, 6], whereas in a still image the researcher has to establish correlations among the available objects, human body components, and the background [7].

The main limitation of the existing techniques for gesture recognition in static images is that they involve manual intervention for drawing bounding boxes and for labeling/annotation [8, 9, 10, 11, 12, 13, 14]. The same manual annotation is also required during the training and testing phases. This manual intervention is time-consuming and may introduce unwanted human errors. In [15], the authors proposed a scheme that does not require human intervention for training; however, manual intervention is still used during the testing phase. Similarly, in [16], automated recognition of human gestures has been performed. Nevertheless, classifying and recognizing human gestures in still images remains a research challenge.

In this work, we have proposed a systematic and automated approach for addressing this challenge. Its main components are stated below:
  • Candidate objects are first generated from the input still image. The image is decomposed into different parts containing individual objects, which are suitable for detailed, individual shape analysis. Among these, the human-only objects are extracted and the rest are eliminated. One successful method for this is reported in [17].

  • An efficient product quantization scheme is applied to attach prediction annotations to the extracted (human-only) objects. No bounding box is required during the input phase (a brief sketch of such a quantizer is given at the end of this section).

In most cases, the proposed scheme accurately delineates the foreground region of the underlying human (action masking).
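As a rough illustration of the second component, the sketch below builds a standard product quantizer over part descriptors using scikit-learn's KMeans. The sub-space count, codebook size, and descriptor dimensionality are illustrative assumptions, not values taken from the paper.

```python
# Hypothetical sketch of a product quantizer for annotating part descriptors.
# The number of subspaces, codebook size, and descriptor size are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def train_pq(descriptors, n_subspaces=4, n_codewords=64):
    """Split each descriptor into sub-vectors and learn one codebook per subspace."""
    d = descriptors.shape[1]
    sub_dim = d // n_subspaces
    codebooks = []
    for s in range(n_subspaces):
        block = descriptors[:, s * sub_dim:(s + 1) * sub_dim]
        codebooks.append(KMeans(n_clusters=n_codewords, n_init=10).fit(block))
    return codebooks

def encode_pq(descriptor, codebooks):
    """Encode a single descriptor as one codeword index per subspace."""
    sub_dim = descriptor.shape[0] // len(codebooks)
    return np.array([cb.predict(descriptor[s * sub_dim:(s + 1) * sub_dim][None, :])[0]
                     for s, cb in enumerate(codebooks)])

# Example: 500 random 128-D part descriptors quantized to 4 codeword indices each.
parts = np.random.rand(500, 128).astype(np.float32)
codebooks = train_pq(parts)
print(encode_pq(parts[0], codebooks))
```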

2 Proposed Scheme

The entire recognition problem has been divided into two sub-problems, in line with the divide-and-conquer strategy. An overview is depicted in Fig. 1. The first step involves delineating action masks; as a solution, we have successfully used the optimization technique given below. A corresponding sample from the experiment is shown in Fig. 2.
Fig. 1

Overview of the proposed scheme

Fig. 2

Sample object delineation

$$\begin{aligned} y^N_{p, q, r} = \max \left[\, Z_{p', q', r} \, : \, p \le p' < p+N,\; q \le q' < q+N \,\right] \end{aligned}$$
(1)
such that

$$1 \le q \le Q; \quad 1 \le r \le R$$

where (p, q) are the coordinates of the top-left corner of the object bounding box, N is the scale index of the part object from which the remaining parts are seeded, y is the image under consideration, and Z is the resultant.
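For clarity, a small NumPy sketch of the window maximum in Eq. (1) is given below; the shape of Z and the chosen indices are illustrative assumptions.

```python
# Illustrative NumPy sketch of the window maximum in Eq. (1).
# Z is assumed to be a response volume indexed as Z[p, q, r]; shapes are toy values.
import numpy as np

def window_max(Z, p, q, r, N):
    """y^N_{p,q,r}: maximum of Z over an N x N spatial window anchored at (p, q)."""
    return Z[p:p + N, q:q + N, r].max()

Z = np.random.rand(64, 64, 8)          # toy response volume (Q = 64, R = 8)
y = window_max(Z, p=10, q=20, r=3, N=4)
print(y)
```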

2.1 Computation of the Gesture-Mask

This computation is performed under the base assumption that the objects/parts involved in a particular gesture/action are neighbors of each other. Each individual part is learned with respect to all other classes (the remaining six). The visual coherence of all gesture-related parts (their pixel intensities) is considered in the process, and these parts are thereby isolated from the main input image. The task now resembles a typical energy minimization problem and can be formulated as a hidden Markov random field. Following [18], it can be expressed mathematically as
$$\begin{aligned} \min_{\alpha,\, \beta^h,\, \beta_i^l,\, \gamma^c} \; \sum_{i} \sum_{m} V\!\left(\alpha_{i,m}, y_{i,m}, \beta^h, \beta_i^l, \gamma^c\right) + \sum_{i} \sum_{m, n} U\!\left(\alpha_{i,n}, \alpha_{i,m}\right) \end{aligned}$$
(2)
Each of the parts is obtained by clustering pixels, and these parts constitute the part feature space (a minimal clustering sketch is given below). Each gesture class has a few part-groups. Since the number of parts in each group is very small, the action masks are obtained through group sparsity, which is treated as a constant. For gesture representation, a feature representation is then computed on the result. As each gesture involves a human object, an action feature vector is generated for each of the seven actions considered in this work. Instead of generating feature vectors for all possible actions, only these seven gesture templates are used.
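The sketch below illustrates one way such a pixel clustering could be carried out; the use of raw color plus pixel coordinates as the per-pixel descriptor and the number of clusters are assumptions, since the text does not fix them.

```python
# Illustrative sketch of clustering pixels into candidate part groups.
# Using RGB intensities plus pixel coordinates as the descriptor is an assumption.
import numpy as np
from sklearn.cluster import KMeans

def cluster_parts(image, n_parts=6):
    """Return a label map assigning every pixel of `image` (H x W x 3) to a part group."""
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    feats = np.column_stack([image.reshape(-1, 3),                   # colour
                             xs.reshape(-1, 1), ys.reshape(-1, 1)])  # position
    labels = KMeans(n_clusters=n_parts, n_init=10).fit_predict(feats)
    return labels.reshape(h, w)

label_map = cluster_parts(np.random.rand(48, 64, 3))
print(np.unique(label_map))   # part indices 0..5
```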

The overall steps involved in generating the gesture masks are listed below.

  1. A Fisher vector (FV) [19] is generated from each input image using all of its part-based features. A Gaussian mixture model (GMM) is learned from these part features, and the feature values are normalized. The final feature vectors are computed from the mean and variance values of the Gaussians (a simplified sketch follows this list).
  2. Individual gesture models are learned from the action models computed in the above step. The feature vector corresponding to a specific model forms a single row, and the label of the corresponding image is attached to that row; thus, the images of the respective classes are labeled accordingly.
  3. The Gaussian components with nonzero values, along with their centers, are computed.
  4. The action masks are computed for all the images by taking the parts that are closest to the Gaussian components.
  5. Grab-cut is used to refine the gesture masks into low-level profiles. The foreground and background GMMs are learned, and further refinement is applied to the masks.
  6. The global model is updated from these gesture parts.
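For illustration, the following is a minimal Python sketch of step 1: a GMM is fitted over pooled part descriptors and a simplified Fisher-vector-style encoding is built from the per-component mean and variance statistics. It omits the full Fisher information normalization of [19], and the descriptor dimensionality and component count are illustrative assumptions rather than values from the paper.

```python
# Simplified sketch of step 1: GMM over part descriptors and a Fisher-vector-style
# encoding from first- and second-order statistics. This omits the full Fisher
# information normalization of [19]; component count and dimensions are assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_part_gmm(all_part_descriptors, n_components=16):
    return GaussianMixture(n_components=n_components,
                           covariance_type='diag').fit(all_part_descriptors)

def fisher_like_vector(gmm, parts):
    """Encode one image's part descriptors against the learned GMM."""
    resp = gmm.predict_proba(parts)                      # (n_parts, K) soft assignments
    fv = []
    for k in range(gmm.n_components):
        w = resp[:, k:k + 1]
        diff = (parts - gmm.means_[k]) / np.sqrt(gmm.covariances_[k])
        fv.append((w * diff).mean(axis=0))               # mean (1st-order) statistic
        fv.append((w * (diff ** 2 - 1.0)).mean(axis=0))  # variance (2nd-order) statistic
    fv = np.concatenate(fv)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))               # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)             # L2 normalization

train_parts = np.random.rand(2000, 64)                   # pooled part descriptors (toy)
gmm = fit_part_gmm(train_parts)
print(fisher_like_vector(gmm, np.random.rand(40, 64)).shape)  # (16 * 2 * 64,) = (2048,)
```

The mask refinement of step 5 corresponds to the standard grab-cut procedure [18]; in practice it could be carried out with OpenCV's cv2.grabCut, which iteratively re-estimates the foreground and background GMMs.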

     
For gesture recognition, a feature vector representation is needed. A pilot experiment is carried out to determine the best model for computing the gesture feature vectors. A very deep convolutional neural network (VD-CNN) is applied directly to obtain the gesture vectors, as shown in Fig. 3. The outputs of the last layer of the network are used to compute the parts. Let the output for a bounding box be \(X \subset R^{s\times s\times d}\), where s denotes the spatial dimension and d the number of filters. The pooling strategy used to extract the parts is given below (Fig. 4):
$$\begin{aligned} z^M_{r, s, d} = \max_{\, r \le r' < r+m;\; s \le s' < s+m} X(r', s', d) \end{aligned}$$
(3)
where (r, s) are the relative coordinates of the first corner of the part bounding box with respect to its surrounding object bounding box, and m is the scale index of the parts. Figure 5 shows the difference between the plain CNN approach and the VD-CNN approach; it is clear that the background scene is eliminated.
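As an illustration of Eq. (3), the sketch below max-pools an m x m window of the last convolutional feature map to obtain one part descriptor. VGG-16 is used here only as a stand-in for the very deep CNN, since the exact architecture is not named in the text, and the input is a random tensor in place of a cropped object box.

```python
# Hedged sketch of Eq. (3): channel-wise max over an m x m window of the last
# convolutional feature map. VGG-16 stands in for the unnamed "very deep CNN".
import torch
import torchvision.models as models

vgg = models.vgg16().eval()                    # pretrained weights would normally be loaded
conv_features = vgg.features                   # convolutional backbone

def part_descriptor(feature_map, r, s, m):
    """z^M_{r,s,d}: channel-wise max over the m x m window anchored at (r, s)."""
    window = feature_map[:, :, r:r + m, s:s + m]
    return window.amax(dim=(2, 3)).squeeze(0)  # one value per filter d

with torch.no_grad():
    image = torch.rand(1, 3, 224, 224)         # toy input in place of a cropped object box
    fmap = conv_features(image)                # (1, 512, 7, 7) for a 224 x 224 input
    print(part_descriptor(fmap, r=1, s=2, m=3).shape)   # torch.Size([512])
```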
Fig. 3

Sample CNN model representation

Fig. 4

Sample fusion model representation

Fig. 5

Output comparison of VD-CNN with only CNN

Next, the objects are fused into gesture representation models, which leads to the formation of the gesture vector, as shown in Fig. 4. The pilot experiment favors the fusion model for generating the gesture feature vectors.
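A minimal sketch of such a fusion is given below; since the exact fusion rule is not spelled out in the text, plain concatenation of L2-normalized blocks is an assumption.

```python
# Minimal sketch of the fusion step: concatenate the object-level CNN descriptor
# with the part-level Fisher-style vector into a single gesture vector.
# Concatenation of L2-normalized blocks is an assumption, not the paper's exact rule.
import numpy as np

def fuse(object_vec, part_vec):
    blocks = [v / (np.linalg.norm(v) + 1e-12) for v in (object_vec, part_vec)]
    return np.concatenate(blocks)

gesture_vector = fuse(np.random.rand(512), np.random.rand(2048))
print(gesture_vector.shape)   # (2560,)
```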

3 Gesture Recognition by Classification

For gesture recognition by classification, a linear support vector machine (SVM) is used. The classifier is trained on the feature vectors obtained from the training images, and testing is then performed on suitable still images. The database used for this purpose is described in the subsequent section.
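For illustration, a minimal sketch of this classification stage is given below; scikit-learn's LinearSVC is used as a stand-in for the linear SVM, and the feature dimensionality and per-class sample counts are toy values.

```python
# Sketch of the classification stage with a linear SVM (LinearSVC is a stand-in;
# the paper does not name a specific implementation). Data shapes are toy values.
import numpy as np
from sklearn.svm import LinearSVC

classes = ['walk', 'stand', 'sit', 'bend', 'bike', 'stretch', 'hold']
X_train = np.random.rand(210, 2560)              # 30 training vectors per class (toy)
y_train = np.repeat(np.arange(7), 30)
X_test = np.random.rand(140, 2560)               # 20 test vectors per class (toy)

clf = LinearSVC(C=1.0, max_iter=10000).fit(X_train, y_train)
pred = clf.predict(X_test)
print([classes[p] for p in pred[:5]])
```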

3.1 Input Database of Still Images

The Willow 7-action dataset [1] is used to validate the effectiveness of the proposed scheme; this dataset consists of 108 images per class. We have taken seven classes of gestures into account: walking, standing, sitting, bending, biking, stretched-hands, and holding-prop. Fifty samples from each class are used as input for our experiment, with a train/test split of 30/20, respectively. Thus, the total number of images used in our case is 350. A snapshot of the sample dataset with preprocessed outputs is shown in Fig. 6.
Fig. 6

Sample dataset and preprocessed outputs

Fig. 7

Plot of comparison of rates of accuracy

3.2 Experimental Evaluation

The proposed scheme was implemented on the dataset discussed in the previous section, and the results obtained are quite satisfactory. A k-fold (k = 5) cross-validation strategy is adopted for determining the rate of accuracy; the overall accuracy obtained is 84.5%. The gesture-mask learning method thus performs better than the existing bounding-box methods for gesture recognition. A proper labeling mechanism is also incorporated in the corresponding code of the proposed scheme. The class-wise classification rates are presented in Table 1 in the form of a confusion matrix, while Table 2 compares our method with two other state-of-the-art schemes and shows that it outperforms them. A comparative plot of the k-fold cross-validation is shown in Fig. 7. As the number of samples increases, the accuracy of the scheme increases gradually and then becomes stable, which is a good sign of the persistence of the proposed scheme.
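A minimal sketch of the evaluation protocol described above (5-fold cross-validation for the accuracy figure and a 7 x 7 confusion matrix over the gesture labels) is given below; the data are random placeholders, and only the procedure mirrors the text.

```python
# Sketch of the evaluation protocol: 5-fold cross-validation for accuracy and a
# 7 x 7 confusion matrix. Data here is random; only the procedure mirrors the text.
import numpy as np
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix

X = np.random.rand(350, 2560)                    # 50 gesture vectors per class (toy)
y = np.repeat(np.arange(7), 50)

clf = LinearSVC(C=1.0, max_iter=10000)
scores = cross_val_score(clf, X, y, cv=5)
print('mean accuracy: %.1f%%' % (100 * scores.mean()))

pred = cross_val_predict(clf, X, y, cv=5)
print(confusion_matrix(y, pred))                 # rows: true class, columns: predicted
```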
Table 1

Gesture-wise misclassification rate (rows: true gesture; columns: predicted gesture)

Gestures   Walk   Stand   Sit   Bend   Bike   Stretch   Hold   Accuracy (%)
Walk         44       5     0      0      0         1      0             88
Stand         4      46     0      0      0         0      0             92
Sit           0       0    45      5      0         0      0             90
Bend          2       1     4     40      0         1      2             80
Bike          2       4     0      0     41         0      3             82
Stretch       1       2     1      1      2        40      3             80
Hold          3       3     1      0      1         2     40             80

Table 2

Gesture-wise comparison of overall accuracy rates with other schemes

Methods/Gestures   Walk (%)   Stand (%)   Sit (%)   Bend (%)   Bike (%)   Stretched (%)   Holding (%)   Overall (%)
EPM [13]                 88          87        86         80         78              76            79           82
LLC [20]                 89          82        80         80         81              78            79         81.3
Proposed                 88          92        90         80         82              80            80         84.5

4 Conclusion

An efficient scheme has been proposed for the recognition of seven important gestures from still images. The scheme utilizes a very deep CNN to extract the relationships between human parts and objects. The Fisher vector approach has been used to generate the feature vectors, and gesture masks are computed with the help of GMM and grab-cut techniques. Finally, a support vector machine has been used to classify the feature vectors into the seven different gestures efficiently. The Willow 7-action dataset has been used for validation; in addition, local data captured at our end have been used for testing only. Future work includes the recognition of a larger number of complex human gestures, as well as the study and analysis of forensic data.

References

  1. Delaitre, V., Laptev, I., Sivic, J.: Recognizing human actions in still images: a study of bag-of-features and part-based representations. In: Proceedings British Machine Vision Conference (2010)
  2. Poppe, R.: A survey on vision-based human action recognition. Image Vis. Comput. 28(6), 976–990 (2010)
  3. Cheng, G., Wan, Y., Saudagar, A., Namuduri, K., Buckles, B.: Advances in human action recognition: a survey, 130 (2015)
  4. Everingham, M., Van Gool, L., Williams, C., Winn, J., Zisserman, A.: The PASCAL visual object classes challenge 2012 (VOC2012) results. http://www.pascalnetwork.org/challenges/VOC/voc2012/workshop/index.html
  5. Wu, J., Zhang, Y., Lin, W.: Towards good practices for action video encoding. In: Proceedings IEEE International Conference on Computer Vision and Pattern Recognition, pp. 2577–2584 (2014)
  6. Zhang, T., Zhang, Y., Cai, J., Kot, A.: Efficient object feature selection for action recognition. In: International Conference on Acoustics, Speech and Signal Processing (2016)
  7. Guo, G.-D., Lai, A.: A survey on still image based human action recognition. Pattern Recogn. 47(10), 3343–3361 (2014)
  8. Maji, S., Bourdev, L., Malik, J.: Action recognition from a distributed representation of pose and appearance. In: Proceedings IEEE International Conference on Computer Vision and Pattern Recognition, pp. 3177–3184 (2011)
  9. Hoai, M.: Regularized max pooling for image categorization. In: Proceedings British Machine Vision Conference (2014)
  10. Oquab, M., Bottou, L., Laptev, I., Sivic, J.: Learning and transferring mid-level image representations using convolutional neural networks. In: Proceedings IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1717–1724 (2014)
  11. Gupta, S., Malik, J.: Visual semantic role labeling (2015). arXiv:1505.0447
  12. Gkioxari, G., Girshick, R., Malik, J.: Contextual action recognition with R*CNN. In: Proceedings IEEE International Conference on Computer Vision, pp. 1080–1088 (2015)
  13. Sharma, G., Jurie, F., Schmid, C.: Expanded parts model for semantic description of humans in still images (2015). arXiv:1509.04186
  14. Yang, H., Zhou, J.T., Zhang, Y., Gao, B.-B., Wu, J., Cai, J.: Exploit bounding box annotations for multi-label object recognition. In: Proceedings IEEE International Conference on Computer Vision and Pattern Recognition, pp. 280–288 (2016)
  15. Gkioxari, G., Girshick, R., Malik, J.: Actions and attributes from wholes and parts. In: Proceedings IEEE International Conference on Computer Vision, pp. 2470–2478 (2015)
  16. Prest, A., Schmid, C., Ferrari, V.: Weakly supervised learning of interactions between humans and objects. IEEE Trans. Pattern Anal. Mach. Intell. 34(3), 601–614 (2012)
  17. Mahapatra, A., Mishra, T.K., Sa, P.K., Majhi, B.: Human recognition system for outdoor videos using hidden Markov model. AEU-Int. J. Electron. Commun. 68(3), 227–236 (2014)
  18. Rother, C., Kolmogorov, V., Blake, A.: GrabCut: interactive foreground extraction using iterated graph cuts. In: SIGGRAPH, pp. 309–314 (2004)
  19. Sánchez, J., Perronnin, F., Mensink, T., Verbeek, J.: Image classification with the Fisher vector: theory and practice. Int. J. Comput. Vis. 105(3), 222–245 (2013)
  20. Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Gong, Y.: Locality-constrained linear coding for image classification. In: Proceedings IEEE International Conference on Computer Vision and Pattern Recognition, pp. 3360–3367 (2010)

Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  • Soumya Ranjan Mishra (1, 2)
  • Tusar Kanti Mishra (1, 2)
  • Goutam Sanyal (1, 2)
  • Anirban Sarkar (1, 2)

  1. Department of Computer Science and Engineering, NIT Durgapur, Durgapur, India
  2. Department of Computer Science and Engineering, ANITS, Visakhapatnam, India
