1 Introduction

This paper presents a solution for tracking hand gestures in three-dimensional space that can be integrated into a CAVE3D (Cave Automatic Virtual Environment, see Fig. 1).

Fig. 1. The Cave3D experimental setup consists of three screens with 3D projectors surrounding the user.

The system consists of two main parts: a gesture recognition tool and a graphical environment. The gesture recognition tool allows the user to create a gesture database, train it using one of the selected classifiers, and then recognize gestures performed by the user in real time. The implemented solution recognizes gestures recorded with varying velocity and with different placements of the user relative to the controller, so it is velocity and position invariant. It also contains built-in features for testing recognition accuracy, which are helpful during research and evaluation of the solution's quality.

The whole system allows real-time, position and velocity invariant gesture recognition: preparation of the user's own set of gestures, classifier training, and then recognition of gestures in real time using the selected classifiers. This makes it a novel and innovative solution.

The main purpose of the work described in this paper is to reduce the size of the data set and the computation time of gesture recognition while maintaining the highest possible classification accuracy. To achieve this, the number of attributes was reduced by extracting features from a prepared data set, performing dimensionality reduction, and checking the minimal number of features needed to achieve satisfactory accuracy. This goal was achieved, and the relevant issues are described in the following sections.

This work is a continuation of the research presented in [4, 6].

2 Related Work

The problem of gesture recognition is a challenging and popular one, and many researchers have tried to solve it in their own way. The diversity of approaches is, inter alia, a consequence of the different possible ways of representing gestures. Very often a gesture is represented as the movement of a single body part, usually a hand. This approach matches the following gesture definition: gestures are "movements of the arms and hands which are closely synchronized with the flow of speech" [3].

In [7] the authors studied the possibility of gesture recognition using a MEMS (Microelectromechanical System) accelerometer. The device was held in the user's hand and its movement in three dimensions was observed; the research covered seven simple gestures. The same device was used in [1] to solve a similar problem. In that work, gesture classification was performed using the Dynamic Time Warping (DTW) algorithm on a database of 3780 gesture instances grouped into 18 decision classes. Another solution based on an accelerometer and the same classification algorithm is described in [8]. In that publication the authors tested classification accuracy in two cases: user-dependent and user-independent. Their research database contained 3200 instances, each representing one of eight different gestures; in this case every gesture instance took the same time (3.40 s). In all these solutions gestures were treated the same way as in the following sections: as the movement of a single point representing the user's palm in three-dimensional space.

Another example of tracking hand movement was described in [9]. In that case the authors tracked gestures with a finger-worn device similar to a ring, called the Magic Ring (MR), and presented personalized gesture recognition using a method of adaptive template adjustment. Another exemplary gesture recognition solution is called $1 [5]. The name symbolizes the algorithm's low cost and simplicity: its implementation took only about a hundred lines of code. According to the authors' description it works satisfactorily even when there is only one training instance of each gesture. This work was an inspiration for the authors of [2], who designed a solution based on the algorithm described in [5] using Sparse Representation (SR) and Compressed Sensing (CS) methods.

3 Gesture Data Collecting and Processing

In this section the authors describe issues connected with gesture data collection and processing. These were described in detail in [4, 6]; here we outline them briefly. First, the user has to prepare his own data set: he stands in front of the controller and performs gestures, signaling the beginning and end of each gesture. He can then save the created data set and use it for two purposes:

  1. Training a selected classifier and using it to perform gesture recognition in real time,

  2. Using the selected data set to optimize the classifier's parameters and to measure the recognition accuracy, in order to evaluate the solution quality.

The first purpose was described extensively in [6]. In this publication the authors concentrate on progress concerning the second one.

The issues connected with position and speed invariance were described extensively in [4, 6]; for this reason, we do not concentrate on them in this paper.

3.1 Gesture Dataset

The gesture database includes 12 different gestures, shown in Table 1, recorded as values in the relative data format. Each gesture type (a decision class) was recorded 80 times, which sums up to a database of 960 gestures. All gestures were performed by four different users; each user recorded 20 instances of each gesture type. Moreover, the users were asked to perform the gestures in different ways and to change their position slightly between gestures. As a result, every recorded instance differs slightly from the others: gestures were recorded with different velocities, and users stood at different distances from the controller and in different parts of its detection range. Performing gestures in this way provided research data that allows testing position and velocity invariance in a realistic scenario. In addition, imperfect recordings were also included in the dataset (recordings were not repeated), making the data set even more difficult to analyze. The only limitation was the assumption that every gesture should be performed in the same direction; for instance, every horizontal line is drawn from left to right.

As shown in Table 1, all gestures are two-dimensional by definition (for example, brackets are designed to be two-dimensional). However, they are captured in three-dimensional space: the device captures depth, which was recorded and written to the dataset the same way as width and height. It therefore also matters whether the participants performed movements in the depth dimension. We chose flat gestures, but there would be no problem choosing, for example, "push" and "pull" ones; whichever gestures were selected, the algorithm would work the same way.

All recorded gestures were shuffled and written into a single dataset. Information about the gesture performer was not saved, which means that the gesture recognition is fully user-independent.

Table 1. Gestures dataset

3.2 Feature Data Representation

The number of attributes in a data set depends on the length of the gesture. The authors assumed that all gestures have 40 samples of three dimensions each, which gives 120 attributes. Recording 40 samples with the Kinect controller takes slightly more than one second. This does not mean the user has to perform the gesture in exactly one second; it can take any amount of time. The assumed number is just the final length of the gesture after scaling.
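As an illustration, scaling a gesture to a fixed length of 40 samples could be implemented as in the following sketch; the use of linear interpolation here is an assumption, since the paper does not specify the exact scaling method.

```python
import numpy as np

def resample_gesture(samples: np.ndarray, target_len: int = 40) -> np.ndarray:
    """Rescale a gesture of arbitrary length to a fixed number of samples.

    samples: array of shape (n, 3), one (x, y, z) hand sample per frame.
    Returns an array of shape (target_len, 3), i.e. 120 attributes.
    """
    src = np.linspace(0.0, 1.0, len(samples))   # original sample positions
    dst = np.linspace(0.0, 1.0, target_len)     # target sample positions
    # Interpolate each of the three axes independently.
    return np.stack([np.interp(dst, src, samples[:, ax]) for ax in range(3)],
                    axis=1)

# A gesture recorded over 65 frames becomes a 40 x 3 array.
fixed = resample_gesture(np.random.rand(65, 3))
assert fixed.shape == (40, 3)
```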

To reduce this large number of attributes, the authors decided to extract features from the gesture instances. The advantage of this solution is that it is possible to extract a given number of features independently of the gesture length: no matter how long the gesture is, the number of features is always the same. This reduces (or extends) every gesture to the same number of dimensions and therefore allows gestures of unlimited length.

To extract data features, the authors decided to transform the prepared data set in the following way: from the relative hand positions, absolute values were computed, always starting from the point (0, 0, 0). That means the gesture's sample values were translated to the origin of the coordinate system. The authors performed this operation to express the real movements of the hand: the representation used in [6] was appropriate for direct recognition, but in the authors' opinion it needs the above transformation to yield features that express the given problem best.
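A minimal sketch of this transformation, assuming the relative format stores per-sample displacement vectors (the exact relative format is defined in [6], so this reading is an assumption):

```python
import numpy as np

def to_absolute(relative: np.ndarray) -> np.ndarray:
    """Integrate per-sample displacements into an absolute hand trajectory,
    then translate it so that it starts at the origin (0, 0, 0)."""
    trajectory = np.cumsum(relative, axis=0)
    return trajectory - trajectory[0]
```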

Table 2 shows the features extracted from the prepared dataset. Popular statistical and signal features were selected. As presented, most of these features were computed independently for each axis and for all axes together. Axis-to-axis features were computed for each Cartesian pair of axes. In total, 49 features were extracted. In Table 2, n is the number of samples, k and l denote a pair of axes, and a is a sample value.

Table 2. Extracted features
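As an illustration of the kind of features listed in Table 2, the sketch below computes a representative subset (per-axis mean, standard deviation and energy, the same statistics over all axes together, and pairwise correlation as an axis-to-axis feature); the full 49-feature set is given in the table, so the exact selection here is an assumption.

```python
import numpy as np
from itertools import combinations

def extract_features(gesture: np.ndarray) -> np.ndarray:
    """Compute a fixed-length feature vector from a gesture of any length.

    gesture: array of shape (n, 3), absolute positions starting at the origin.
    """
    feats = []
    # Statistical features computed independently for the x, y and z axes.
    for ax in range(3):
        a = gesture[:, ax]
        feats += [a.mean(), a.std(), (a ** 2).sum() / len(a)]
    # The same statistics computed over all axes together.
    a = gesture.ravel()
    feats += [a.mean(), a.std(), (a ** 2).sum() / len(a)]
    # Axis-to-axis features: one value per pair of axes (xy, xz, yz).
    for k, l in combinations(range(3), 2):
        feats.append(np.corrcoef(gesture[:, k], gesture[:, l])[0, 1])
    return np.array(feats)
```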

4 Dimensionality Reduction

One of the main objectives of this publication is to check whether there exists a minimal number of features that allows achieving rewarding gesture classification accuracy. As mentioned in Sect. 3.2, 49 features were extracted; this is the maximal number of dimensions considered in our computations. The next step is dimensionality reduction, whose objective is to reduce the number of features. To achieve this, the Singular Value Decomposition (SVD) algorithm was used, restricted to real numbers (a valid assumption for our gesture recognition problem, where the values cannot be complex).

The singular value decomposition of an \(m \times n\) matrix M is a factorization \(M=U\varSigma {V}^* \), where:

  • U is an \(m \times m\) unitary matrix,

  • \(\varSigma \) is an \(m \times n\) rectangular diagonal matrix with non-negative real numbers on the diagonal,

  • \(V^*\) is an \(n \times n\) unitary matrix, the transpose of V (in general the conjugate transpose, but here the matrices are real).

The values placed on the diagonal of \(\varSigma \) are called the singular values of matrix M. The m columns of U are known as the left-singular vectors of M and the n columns of V are called the right-singular vectors of M.

First, the SVD algorithm is performed on the full set of n features. Then, to reduce the data set to k dimensions, all elements of columns k+1 to n of the \(\varSigma \), V and U matrices are set to zero. A new matrix \(M'\) is then computed according to the factorization presented above. As a result, only the first k columns of the product \(U\varSigma \) are different from 0; these columns form the new, dimensionally reduced set of features.
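A minimal numpy sketch of this reduction, under the assumption that the rows of M are the gesture instances and its columns are the 49 extracted features:

```python
import numpy as np

def reduce_features(M: np.ndarray, k: int) -> np.ndarray:
    """Reduce an (m x n) feature matrix M to k dimensions via truncated SVD."""
    # full_matrices=False yields U: (m, n), s: (n,), Vt: (n, n) for m >= n.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Zeroing singular values k+1..n is equivalent to keeping only the
    # first k columns of U * diag(s); these form the reduced feature set.
    return U[:, :k] * s[:k]                     # shape (m, k)
```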

5 Research Description

For accuracy-testing purposes, and to achieve the best results in the gesture recognition problem, the authors performed several experiments with the collected gesture data. The aim of this study was to examine the classifiers, primarily to determine their accuracy in recognizing different gestures, as well as their computation speed.

5.1 Parameters Optimization

The basic issue connected with classification using many classifiers, especially SVM, is parameter optimization. This process is essential because classification accuracy depends strongly on the classifier's parameters, which have to fit the character of the data. It is a serious problem because there is no simple way of selecting proper parameters. A popular method of obtaining kernel parameters is grid search. Note that it is also possible that the selected parameters do not fit the testing set. All these facts mean that parameter optimization has no perfect solution: choosing good parameters is a compromise rather than a definite answer.

To minimize the risk of overfitting, the authors decided to perform parameter optimization on a single random division of the data set (the same for every parameter combination) using 5-fold cross-validation. The division ensures that each of the five parts contains the same number of instances of each gesture class. The classifier is then trained on four of these parts and tested on the fifth, out-of-bag (OOB) part. This process is repeated five times (each time a different part serves as the OOB part) and the results are averaged. The same procedure is carried out for every parameter combination, and the best one is selected.
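The procedure could be sketched as follows using scikit-learn; the library, the RBF kernel, and the parameter grid below are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Stand-in data: 960 gestures x 49 features, 12 classes (as in Sect. 3.1).
rng = np.random.default_rng(0)
X = rng.standard_normal((960, 49))
y = np.repeat(np.arange(12), 80)

# Hypothetical candidate grid for an RBF-kernel SVM.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}

# Stratified folds keep the same number of instances of each gesture
# class in every one of the five parts, as described above.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# n_jobs=-1 evaluates the parameter combinations in parallel.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```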

Because of the long computation time, the parameter optimization procedure was performed in parallel. The computations were run on a single personal computer with an Intel i7 processor (8 cores, 16 threads).

5.2 Classification

After selecting the best classifier parameters (and kernel function parameters for the SVM classifier), the authors performed data classification using the obtained parameter set. As in parameter optimization, 5-fold cross-validation was used, but the classification was performed with 100 different random divisions rather than a single one. The classification process was otherwise the same as for parameter optimization; the difference is that the results were averaged over all 100 divisions, and that value was recorded as the final classification accuracy.
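Continuing the previous sketch, the 100 random divisions could be expressed with a repeated stratified split (again an illustrative scikit-learn formulation, not the authors' original code):

```python
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

# 100 different random 5-fold divisions: 500 train/test runs in total,
# averaged to obtain the final classification accuracy.
cv100 = RepeatedStratifiedKFold(n_splits=5, n_repeats=100, random_state=0)
clf = SVC(kernel="rbf", **search.best_params_)  # parameters from grid search
scores = cross_val_score(clf, X, y, cv=cv100, n_jobs=-1)
print(f"accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```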

5.3 Research Parts – Two Experiments

The authors performed the research and the recognition accuracy evaluation in two main parts.

The first experiment analyzes how the tested classifiers deal with the gesture classification problem. All four classifiers were evaluated in two cases: with and without normalization. Additionally, for the SVM classifier, five kernel functions were tested independently. This research produced a large number of results, which are summarized in Sect. 6.

In the second experiment, Singular Value Decomposition was used to find the minimal number of features that gives satisfying gesture classification results, in order to obtain a compromise between accuracy and computation time. To achieve this, gesture classification accuracy was compared as the number of dimensions increased. In addition, the best results obtained with the full data representation were compared against the feature data representation, in order to check whether the new way of expressing the data causes a severe drop in classification accuracy.

One of the main purposes of the second experiment was to check how many features are enough to achieve satisfactory classification accuracy. To accomplish this, the dimensionality of the data set was reduced iteratively: the 49-feature set was reduced into 48 new data sets consisting of 1 to 48 features. Each of these data sets was tested using the method described above, as shown in the sketch below. This allowed us to judge how adding a single dimension to the data set affects the classification accuracy.
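Combining the earlier sketches, the sweep over feature counts could look like this (reduce_features, clf, cv100, X and y come from the sketches in Sects. 4 and 5.1–5.2):

```python
from sklearn.model_selection import cross_val_score

accuracies = {}
for k in range(1, 50):
    # Project the 49-feature set onto its first k singular directions.
    X_k = reduce_features(X, k)
    accuracies[k] = cross_val_score(clf, X_k, y, cv=cv100, n_jobs=-1).mean()

# Plotting accuracies against k reproduces the shape of Fig. 3.
```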

All the results obtained are presented and discussed in Sect. 6.

6 Results and Discussion

6.1 First Experiment

The first experiment was performed using the relative data representation. For each classifier, the configuration that achieved the best results was selected based on a recognition accuracy comparison, and these best configurations were then compared across classifiers. The summarized results are presented in Table 3.

Table 3. Results of measurements obtained using selected classifiers

On the basis of these observations, the authors conclude that for the given gesture classification problem the best results are obtained using the SVM classifier. SVM performed best in the shortest time and showed low variance of results across repetitions.

The results of the SVM kernel comparison are shown in Table 4. As can be seen, in all cases the accuracy obtained with normalization is better than without it. The best result was produced by the wavelet kernel, but the difference between this kernel and the others was not large. It is important to note that without normalization the wavelet kernel generally gives much worse results. The authors chose the best possible parameters and the final results were good, but without cross-validation the choice would have been very hard, because without normalization most parameter combinations yielded bad results with the wavelet kernel.

Table 4. Results of measurements obtained using selected SVM kernels

Considering the results of the presented research, it is also worth noting the reasons for recognition mistakes. Figure 2 shows the classification errors for the opening bracket gesture. The most problematic confusions were with the less-than sign, and the gesture was also often incorrectly recognized as a vertical line. The visual similarity between these gestures is vital: when the user marks the curve too sharply while performing the opening bracket gesture, it becomes similar to the less-than sign; when the curve is not sharp enough, the gesture starts to look like a vertical line. This explains the classification errors. Analysis of incorrectly classified instances of other gestures confirmed this observation.

Fig. 2. Incorrect recognition of the opening bracket gesture as different decision classes

6.2 Second Experiment

The first test checked the difference between the classification accuracy obtained using the relative data representation in the first experiment and that obtained using the proposed feature representation with 49 or fewer features. This was checked for each of the five proposed kernel functions. The results are presented in Table 5.

Table 5. Comparison of relative data representation and feature data representation classification accuracy

As shown in Table 5, for four of the five kernels the difference was about 1.5%–3% (it was larger for the sigmoid kernel). This is a noticeable drop in classification accuracy, which confirms that feature extraction causes the loss of some information. Another possible reason is an imperfect choice of extracted features; this can be examined in further research. On the other hand, by performing feature extraction we reduced the number of dimensions by more than half (from 120 to 49) and, as a result, also reduced the classification time. We judge the 1.5%–3% difference to be a price worth paying for a more than twofold reduction in computation time.

The main part of our research dealt with classification accuracy for data sets consisting of different numbers of dimensions. The authors checked 49 data sets with dimensionality from 1 to 49 (with a step of 1). The results are presented in Fig. 3.

Fig. 3. Classification accuracy referring to the number of dimensions

First of all, the addition of each dimension significantly increases the classification accuracy for each kernel, but this tendency stops after 7–12 dimensions. At that point classification attains stable and satisfactory results. Most kernels (except the sigmoid) started to achieve their best results at the 16th dimension. A further increase from 16 to 31 dimensions does not result in significant classification accuracy growth, which means the additional features do not provide more important information about the data. Above 31 dimensions, the classification accuracy of three of the five kernels drops, which means some information is excessive and brings unnecessary noise into the data set for these kernel functions.

The best results in the 16–31 dimension range were comparable for each kernel function except the sigmoid; the wavelet function was slightly better in this range. For larger numbers of dimensions, the best classification accuracy was achieved by the linear and polynomial kernel functions, and they achieved the best results in the whole study. The sigmoid function, compared to the others, gave unsatisfactory results; we were unable to select a parameter set for this function that achieved results comparable to the other kernels.

Figure 4 shows the best results achieved during the parameter optimization process. Figure 5 shows the differences between the classification accuracy achieved during parameter optimization and during the research.

Fig. 4. Classification accuracy referring to the number of dimensions during parameter optimization

Fig. 5. Differences between parameter optimization accuracy and research accuracy

In almost all cases the results achieved during parameter optimization were better than those achieved during the research; the differences are the result of overfitting. For four of the five kernels (all except the wavelet one) the differences oscillated between 0% and 3% throughout. For the wavelet kernel the differences were much larger: up to 30 dimensions they did not exceed 10%, while for larger numbers of dimensions (above 30) they oscillated between 15% and 30%. This means that this kernel function is the most sensitive to the selection of parameters.

7 Conclusion

The method and algorithm for real-time gesture recognition described in this paper can be integrated into the CAVE3D system. Gestures can be successfully recognized using classifiers, and the selection of an appropriate classifier for the gesture recognition problem is crucial. Based on the studies presented in this paper, it can be concluded that the choice should fall on the SVM classifier. It should be emphasized, however, that the results could be slightly different for other sets of gestures or other classifier parameters; nevertheless, taking into account the specific nature of the problem and the carefully conducted study, the results can be considered representative for the given research problem.

Also, according to the research presented in this paper, only 16 features are enough to achieve results that are about 1.5%–3% worse than those obtained using the full data representation. This means that it is possible to reduce the data set size about 7–8 times (from 120 attributes to 16 features) at a slight and probably unnoticeable cost in classification accuracy.

The authors tested the selected classifiers and found the one that best fits the gesture recognition problem. Then, using this classifier, the authors showed that it is possible to reduce the number of data set dimensions using a different, feature-based data representation. The minimal number of features giving satisfying results was also found for the data set used in this research.