1 Introduction

Dynamic textures can be defined as sequences of images (or videos) that exhibit a certain stationarity in time [6]. Examples of dynamic textures in the real world include sea waves, smoke, swaying trees, moving flags, fire, and crowds of people. Approaches for dynamic texture representation are applied to different problems, such as traffic condition recognition [9], human activity recognition [14], and surveillance [29].

In the literature, many approaches have been proposed based on different strategies to analyze the spatial and temporal characteristics of dynamic textures. These approaches can be separated into five categories: motion-based methods [17], model-based methods [11, 13, 18], filter-based methods [7], statistical methods [24, 29], and agent-based methods [9, 10]. The agent-based methods use deterministic partially self-avoiding (DPS) walks to describe dynamic textures and have achieved promising results in the classification, clustering, and segmentation of dynamic textures.

In this paper, we propose a new method for dynamic texture analysis and classification based on deterministic partially self-avoiding walks on complex networks. DPS walks were initially introduced to investigate the effects of simple walks in random media [15]. Since then, the DPS walk methodology has been applied to texture and dynamic texture analysis [3, 9, 10]. Basically, a DPS walk can be understood as an agent that visits points (e.g. pixels, vertices) distributed in a map (e.g. image, video, graph) based on a neighborhood, a rule of movement, and a memory. Starting from a given point, the next step follows the rule: go to the nearest point in the neighborhood that has not been visited in the last \(\mu \) steps (memory) [15]. Statistical features of the trajectories of the DPS walks are used to study the map.

In the proposed approach, we model the dynamic texture as two graphs (i.e. networks): a spatial graph and a temporal graph. The spatial graph models the appearance characteristics, while the temporal graph contains the motion properties of the dynamic texture. We then apply DPS walks on these two graphs and use statistical characteristics of the trajectories to represent the appearance and motion of the dynamic texture in a feature vector. The proposed approach differs from previous works [9,10,11] because it combines graph modeling and DPS walk characteristics, whereas in [9, 10] the DPS walks are applied directly to the videos, and in [11] only complex network theory is used for modeling and characterization. In Sect. 2 our proposed method for dynamic texture analysis is detailed. The experimental setup is described in Sect. 3. In Sect. 4 the experimental results are presented and discussed. Finally, the paper is concluded in Sect. 5.

2 Proposed Approach

2.1 Network Modeling

The network science (also called complex networks) field uses the formalism of graph theory with the incorporation of statistical mechanics and complex systems. It has attracted increasing attention because of its ability to represent and study a wide range of systems and data. In computer vision, networks have been used to model and analyze images and videos [2, 21,22,23]. In this paper, we use graphs to represent the dynamic texture video. In dynamic texture analysis, it is important to obtain appearance and motion features in order to accurately represent the video. To achieve this, we model the dynamic texture video as two graphs (networks): the spatial graph \(G_S=(V_S, E_S)\), which characterizes the appearance properties, and the temporal graph \(G_T=(V_T, E_T)\), which contains the motion properties.

In both graphs, each pixel \(i = (x_i,y_i,t_i)\) is mapped into a vertex \(i \in V\), where \(x_i\) and \(y_i\) are the spatial coordinates and \(t_i\) is the temporal coordinate of pixel i. The main difference between the two graphs is the definition of the set of edges. In the spatial graph, the set \(E_S\) is defined by connecting all pairs of vertices whose Euclidean distance is smaller than or equal to a radius \(\sqrt{2}\) and whose time coordinates \(t_i\) and \(t_j\) are equal,

$$\begin{aligned} e_{ij} \in E_S \iff \sqrt{(x_i-x_j)^2 + (y_i-y_j)^2} \le \sqrt{2} \text { and } t_i = t_j \end{aligned}$$
(1)

On the other hand, in the temporal graph, the set of edges is defined by connecting the vertices whose Euclidean distance is smaller than or equal to \(\sqrt{3}\) and whose time coordinates are different,

$$\begin{aligned} e_{ij} \in E_T \iff \sqrt{(x_i-x_j)^2 + (y_i-y_j)^2 + (t_i-t_j)^2} \le \sqrt{3} \text { and } t_i \ne t_j \end{aligned}$$
(2)

Figure 1 illustrates three frames modeled as a graph, with the frames represented by the spheres in blue, green, and red. For each edge \(e_{ij}\) connecting two vertices i and j, a weight \(w(e_{ij})\) is defined as the normalized absolute difference of intensities between the two pixels that the vertices represent:

$$\begin{aligned} w(e_{ij}) = \frac{|I(i) - I(j)|}{255}, \end{aligned}$$
(3)

where \(I(i) \in [0,255]\) is the gray intensity of a pixel i.

Fig. 1. Three frames modeled as a temporal graph. The edges connect only vertices of different frames. (Color figure online)
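To make the modeling step concrete, the sketch below (Python/NumPy) enumerates the weighted neighborhoods defined by Eqs. 1–3 for a grayscale video stored as an array of shape (T, H, W). The array layout, the function name neighbors, and the offset lists are illustrative choices of ours, not prescribed by the method.

```python
import numpy as np

# Offsets satisfying Eq. 1: same frame (dt = 0), Euclidean distance <= sqrt(2).
SPATIAL_OFFSETS = [(0, dy, dx)
                   for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                   if (dy, dx) != (0, 0)]                       # 8 neighbors

# Offsets satisfying Eq. 2: adjacent frame (dt != 0), distance <= sqrt(3).
TEMPORAL_OFFSETS = [(dt, dy, dx)
                    for dt in (-1, 1)
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # 18 neighbors

def neighbors(video, vertex, offsets):
    """Yield (neighbor, weight) pairs for the pixel/vertex (t, y, x).

    `video` is a NumPy array of shape (T, H, W) with intensities in [0, 255].
    The edge weight follows Eq. 3: |I(i) - I(j)| / 255.
    """
    T, H, W = video.shape
    t, y, x = vertex
    for dt, dy, dx in offsets:
        nt, ny, nx = t + dt, y + dy, x + dx
        if 0 <= nt < T and 0 <= ny < H and 0 <= nx < W:
            w = abs(int(video[t, y, x]) - int(video[nt, ny, nx])) / 255.0
            yield (nt, ny, nx), w
```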

2.2 DPS Walks on Networks

The deterministic partially self-avoiding (DPS) walk is an agent-based process that was initially used to study regular and random media [15]. This deterministic walk produces a set of trajectories that are used to characterize the environment in which they were performed. The DPS walk has been successfully applied to feature extraction in different classification tasks, such as texture analysis [3], dynamic texture classification [9, 10], shape analysis [20], and complex network classification [12].

In the proposed approach, DPS walks are used to extract features from the graphs that model the dynamic texture videos. The DPS walk is an agent that moves over the vertices of the graph according to a deterministic rule r. The agent starts the walk from a pre-defined vertex i, and the movement to the next vertex j is given by: go to the vertex j in the neighborhood \(\eta (i)\) (vertices connected to vertex i) that minimizes the edge weight \(w(e_{ij})\) and has not been visited in the previous \(\mu \) steps (i.e. that is not in the memory, \(j \notin M_\mu \)). We call this rule of movement \(r = min\). We also consider another rule of movement that moves the agent in the direction of the maximum edge weight \(w(e_{ij})\) (\(r=max\)). The two rules of movement are used because each one produces different trajectories and, consequently, captures different characteristics of the graph. The walk ends when the agent finds a set of vertices from which it cannot escape, called an attractor.

The memory \(M_\mu \) of size \(\mu \) contains the last \(\mu \) vertices visited by the agent, which cannot be revisited. This memory is updated at each step of the agent to store the last \(\mu \) vertices visited. The trajectory of the agent can be divided into two parts: an initial part of size \(\tau \), called the transient, and a final part, named the attractor, which is composed of vertices that form a cycle of period \(\rho \ge \mu +1\) in which the agent gets stuck. In cases where the agent cannot find an attractor, the trajectory consists only of the transient part. For each vertex of the graph, a DPS walk is started with a given memory size \(\mu \) and a rule of movement r. Therefore, for a graph with N vertices, we obtain N different trajectories. In order to measure this set of trajectories, the joint distribution of transient time and attractor period \(S_{\mu ,r}(\tau ,\rho )\) is considered. Each position of this distribution stores the frequency of trajectories with transient \(\tau \) and attractor \(\rho \) [3],

$$\begin{aligned} S_{\mu ,r}(\tau ,\rho ) = \frac{1}{N} \sum _{i \in V} \left\{ \begin{matrix} 1, & \text {if } \tau _i = \tau \text { and } \rho _i = \rho \\ 0, & \text {otherwise} \end{matrix} \right. , \end{aligned}$$
(4)

where \(\mu \) is the memory size and r the rule of movement used.
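To illustrate the walk dynamics and Eq. 4, the sketch below reuses the neighbors helper from the previous sketch. Because the next move is fully determined by the current vertex and the memory window, a repeated state marks entry into an attractor; the 20-step cap for walks that find no attractor follows Sect. 2.4. The sparse dictionary form of \(S_{\mu ,r}\) and the function names are our assumptions.

```python
MAX_STEPS = 20  # M in Sect. 2.4: walks without an attractor stop here

def dps_walk(video, start, offsets, mu, rule=min):
    """One DPS walk from `start`; returns (tau, rho), with rho = 0
    when no attractor is found within MAX_STEPS."""
    path = [start]
    seen = {}                                   # state -> step of first occurrence
    while len(path) - 1 <= MAX_STEPS:
        state = tuple(path[-(mu + 1):])         # window determining the next move
        if state in seen:                       # repeated state => attractor
            tau = seen[state]                   # taken as the transient time
            return tau, (len(path) - 1) - tau   # attractor period, rho >= mu + 1
        seen[state] = len(path) - 1
        forbidden = path[-mu:] if mu > 0 else []  # last mu visited vertices
        candidates = [(w, v) for v, w in neighbors(video, path[-1], offsets)
                      if v not in forbidden]
        if not candidates:                      # agent is blocked: transient only
            break
        _, nxt = rule(candidates)               # r = min or max edge weight
        path.append(nxt)
    return len(path) - 1, 0                     # no attractor found

def joint_distribution(video, offsets, mu, rule=min):
    """S_{mu,r}(tau, rho) of Eq. 4 as a sparse dictionary."""
    T, H, W = video.shape
    S = {}
    for t in range(T):
        for y in range(H):
            for x in range(W):
                tau, rho = dps_walk(video, (t, y, x), offsets, mu, rule)
                S[(tau, rho)] = S.get((tau, rho), 0) + 1
    return {k: count / (T * H * W) for k, count in S.items()}
```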

2.3 Proposed Signature

The joint distribution contains relevant information about the trajectories of the DPS walks performed in a given graph. Thus, previous works [3] have used features obtained from this joint distribution for characterization. The best results were obtained using the histogram \(h_{\mu ,r}(l)\), which counts the trajectories of size \(l= \tau + \rho \) in a joint distribution computed with memory size \(\mu \) and rule of movement r,

$$\begin{aligned} h_{\mu ,r}(l)= \sum _{b=0}^{l-1} S_{\mu ,r}(b,l-b). \end{aligned}$$
(5)

In order to characterize the dynamic texture videos, the DPS walks are performed on the two graphs \(G_S\) and \(G_T\). Thus, a histogram \(h_{\mu ,r}(l)\) can be obtained for each graph. Several previous works have shown that the most discriminative information of the histogram \(h_{\mu ,r}(l)\) is concentrated in its first elements [3, 9]. In this work, we use the first n elements of the histogram \(h_{\mu ,r}(l)\), with the first position defined as \((\mu + 1)\), since there is no attractor smaller than \((\mu + 1)\). Thus, given a memory size \(\mu \) and a rule of movement r, a feature vector \(\nu _{\mu ,r}^{\varTheta }\) is obtained:

$$\begin{aligned} \nu _{\mu ,r}^\varTheta = [h_{\mu ,r}^{\varTheta }(\mu +1), h_{\mu ,r}^{\varTheta }(\mu +2), ..., h_{\mu ,r}^{\varTheta }(\mu +n)] \end{aligned}$$
(6)

where \(\varTheta \) is the type of graph: spatial S or temporal T.
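A small companion sketch for Eqs. 5 and 6, assuming the dictionary representation of \(S_{\mu ,r}\) from the sketch above; the function names are illustrative.

```python
def histogram(S, l):
    """h_{mu,r}(l) of Eq. 5: fraction of trajectories of size l = tau + rho."""
    return sum(S.get((b, l - b), 0.0) for b in range(l))

def feature_vector(S, mu, n):
    """nu of Eq. 6: the first n histogram elements, starting at l = mu + 1."""
    return [histogram(S, mu + 1 + k) for k in range(n)]
```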

The size of the memory directly influences the complexity of the trajectories and, consequently, the information extracted by the method. For example, DPS walks with low memory values perform a more local analysis [10]. Therefore, histograms obtained with different memory sizes are combined for a more robust characterization of the different patterns present in the graphs (i.e. dynamic texture videos), according to:

$$\begin{aligned} \vartheta ^\varTheta _r = [\nu _{\mu _1,r}^\varTheta , \nu _{\mu _2,r}^\varTheta , ..., \nu _{\mu _m,r}^\varTheta ]. \end{aligned}$$
(7)

To characterize patterns of appearance and movement of dynamic textures, a feature vector that consists of the concatenation of the spatial and temporal descriptors is considered. Thus, this feature vector is composed of the characteristics extracted from the spatial \(\vartheta ^S_r\) and temporal \(\vartheta ^T_r\) graphs using different memory values, as described:

$$\begin{aligned} \lambda _r = [\vartheta ^S_r, \vartheta ^T_r]. \end{aligned}$$
(8)

The feature vector obtained above refers to a single rule of movement. Although this vector may be able to properly characterize dynamic textures, the rule of movement is another parameter that influences the trajectory of the agent, so a further possibility is to combine the two rules. We therefore consider a final feature vector that concatenates the vectors obtained with the two rules of movement, \(r = min\) and \(r = max\), as follows:

$$\begin{aligned} \lambda = [\lambda _{min}, \lambda _{max}] \end{aligned}$$
(9)

2.4 Computational Complexity

Basically, the proposed approach models a dynamic texture with \(N = w \times h \times T\) pixels as two graphs. For modeling, each pixel is mapped into a vertex, which is linked to 8 and 18 neighbors in the spatial and temporal graphs, respectively. As the number of neighbors is a multiplicative constant much smaller than the number of pixels in the video, it can be disregarded. Thus, the computational complexity of the modeling is O(N) for each type of graph. Next, a DPS walk is started from each pixel, producing a trajectory of size \(l = \tau + \rho \), where \(\tau \) is the transient time and \(\rho \) is the attractor period. For cases where an attractor is not found, we finish the walk after M steps. In this work, we define \(M=20\) because we use only the first elements of the joint distribution for characterization. Therefore, the computational complexity to compute the features from each graph (temporal and spatial) is given by the number of pixels and the size of the trajectories, \(O(N\times (\tau +2\rho ))\). The factor \(2\rho \) arises because the agent must traverse the same cycle of pixels twice for the attractor to be identified.

3 Experimental Setup

To classify the feature vectors, we adopted the 1-Nearest Neighbor (1NN) classifier and a specific scheme for each database to separate the training and test sets. The Dyntex++ [8] database is a compiled version of the Dyntex database [16]. The samples are preprocessed in order to eliminate non-representative static or dynamic backgrounds, zoom, and textures without movement. The database has 3600 samples divided into 36 classes (e.g. boiling water, river water, ant colony, and smoke). In the experiments, a 10-fold cross-validation scheme with 10 trials was used [11]. The accuracy is reported as the average performance over all experimental trials.

The UCLA [4] database is composed of 200 dynamic texture videos separated into 50 classes with 4 samples per class (the UCLA-50 version). Each sample has \( 48 \times 48 \times 75 \) pixels. This database also has two variations of the original database, proposed in [19]. In the UCLA-9 version, the samples are reorganized into 9 classes: boiling water (8 samples), fire (8), flower (12), fountains (20), plants (108), sea (12), smoke (4), water (12), and waterfall (16). In the UCLA-8 version, the plant class is eliminated due to its large number of samples. The experimental setup adopted on these databases is similar to [19]. For UCLA-50, a 4-fold cross-validation scheme with 10 repetitions is used. For the other versions, half of the sequences (randomly selected from each class) are used for testing and the remaining half for training. For these databases, the correct classification rate (CCR), or accuracy, is reported as the average performance over all experimental trials.
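As an illustration of the Dyntex++ protocol, a minimal scikit-learn sketch is given below; stratification and the fixed random seed are our assumptions, since the setup specifies only 10-fold cross-validation with 10 trials and a 1NN classifier. X and y denote precomputed feature vectors (e.g. from dynamic_texture_signature) and class labels.

```python
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def evaluate(X, y):
    clf = KNeighborsClassifier(n_neighbors=1)  # 1NN classifier
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
    scores = cross_val_score(clf, X, y, cv=cv)
    return scores.mean(), scores.std()         # average accuracy over all trials
```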

4 Results and Discussion

First, we investigate the effects of the parameters of our proposed approach on the task of dynamic texture classification. The parameters analyzed were: (i) the memory sizes \(\mu \) and (ii) the rules of movement r. In the experiments, the first \(n=3\) elements of the histogram \(h_{\mu ,r}^{\varTheta }(l)\) were used for the UCLA databases and the first \(n=5\) positions for the Dyntex++ database. These values were defined based on the premise that the main information is in the first elements, and on experimental tests.

Figure 2 shows the results of our proposed method on the two databases for different combinations of memory sizes and rules of movement. On both databases, the rule of movement \(r=max\) obtained higher accuracies than the rule \(r=min\). The rule \(r=max\) is related to heterogeneous regions of the video (i.e. graph); that is, the agent walks over regions where the difference in gray level between pixels (i.e. the edge weight) is high. Under the rule \(r=min\), on the other hand, the agent walks over homogeneous regions, where the edge weight is small. This indicates that heterogeneous regions carry more discriminative information about the dynamic texture. However, the best results are obtained when both rules of movement are combined.

Concerning the memory sizes, we note that low memory sizes provide inferior accuracies. As we increase the memory sizes, the accuracy also increases. However, when using a combination of memory sizes beyond [0, 1, 2, 3], the accuracy starts to stabilize, suggesting that the proposed descriptors have reached their limit in terms of discrimination ability. Such behavior is expected: the larger the memory size \(\mu \), the harder it is to find an attractor. Based on these results, we set \(\mu =[0,1,2,3,4,5,6,7,8]\) and \(r = [min,max]\) as the default parameters of the proposed method. On both databases, the highest accuracy was obtained using this configuration (94.5% and 96.0% for Dyntex++ and UCLA-50, respectively). These results are interesting because they indicate that the method is not strongly parameter dependent.

Fig. 2. Accuracies using different combinations of memory sizes and rules of movement.

In order to deepen the analysis of our proposed approach, we performed a comparison against dynamic texture methods from the literature. To this end, we considered the accuracy, standard deviation, and number of features of the methods, when described in the original papers. In all comparisons, we use the same experimental setup described in Sect. 3.

Table 1 presents the classification results of the proposed method and others on the UCLA-50 database. Note that the proposed method obtained the best accuracy compared to the others. Concerning the complex network based methods, the proposed method improves the accuracy of the CNDT [11] method by 1.0%. That method uses traditional complex network measures, while our method uses the DPS walk for complex network characterization. Thus, the results suggest that the DPS walk is more effective at describing the graph and, consequently, the dynamic texture.

Table 1. Comparison of the classification results of the proposed method and others on UCLA-50 database.

Table 2 summarizes the results on the UCLA-9 database. On this database, our approach yields the second best result (96.80%), slightly inferior to the one obtained by the CVLBP method (96.90%). On the other hand, on the UCLA-8 database, the proposed method achieved the best accuracy, as can be seen in Table 3: 96.59% against 95.65% for the CVLBP method. The proposed method also outperformed DPSW-TOP, a DPS walk based method that applies the DPS walk on three orthogonal planes to analyze the appearance and motion properties of dynamic textures. These results indicate that our approach, based on DPS walks applied on graphs, is more effective for dynamic texture characterization.

Table 2. Classification results for all methods on the UCLA-9 database.
Table 3. Comparison results on the UCLA-8 database.

Table 4 presents the results on the Dyntex++ database for different methods. The proposed method shows an improvement of 10.74% and 3.21% over the CNDT and DPSW-TOP methods, respectively. However, on this database, the proposed method obtained a performance lower than the RI-VLBP and LBP-TOP methods. Nevertheless, it is important to emphasize that the feature vector size of these methods is significantly larger than that of our method. Our method therefore remains competitive due to its small feature vector: for example, RI-VLBP extracts a feature vector of dimension 16384, whereas our method produces only 180 features (n = 5 elements \(\times \) 9 memory sizes \(\times \) 2 graphs \(\times \) 2 rules of movement).

Table 4. Comparison results for different dynamic texture methods on the Dyntex++ database.

Besides these compared methods, known as hand-crafted methods, we also compare our proposed signature with a method based on learned features. This method, proposed in [1], uses a convolutional neural network (GoogleNet) to learn the characteristics of the dynamic textures on three orthogonal planes and obtain a signature. On the UCLA-50, UCLA-9, and UCLA-8 databases, the CNN-based method obtained accuracies of 99.50%, 98.35%, and 99.02%, respectively. These results are higher than those obtained by our method. However, it is important to highlight that, even with inferior results, our method remains competitive due to its computational simplicity.

5 Conclusion

This paper presented a new method for the characterization and classification of dynamic textures using deterministic partially self-avoiding walks on complex networks. In this method, we model dynamic texture videos as graphs, which allows us to analyze appearance (spatial graph) and motion (temporal graph) properties. DPS walks are performed on these two graphs, and statistical information from their trajectories is used to compose a feature vector. Experimental results obtained on the UCLA and Dyntex++ databases showed that our method is very competitive with other methods, and that it outperforms previous DPSW-based and complex network based methods. In addition, the proposed approach is competitive in terms of dimensionality, producing feature vectors significantly smaller than those of other literature methods. This tradeoff between performance and feature vector size demonstrates the great potential of the proposed method for dynamic texture classification.