1 Introduction

Abnormal behavior analysis in crowded scenes is an important and growing research field. Video cameras, given their ease of installation and low cost, have been widely used for monitoring indoor and outdoor areas such as buildings, parks, and stadiums. As the world's population grows, the presence of people in common areas increases as well. Algorithms for pose detection and action recognition for single persons or, in some cases, very low density groups of people are extensively treated in the pattern recognition community. Nevertheless, abnormal behavior detection and localization in crowded scenes remains an open problem due to high levels of occlusion, which make segmenting individuals impractical.

The concept of abnormal behavior is always associated with the scene context: a behavior considered normal in one scene may be considered abnormal in another. These scene-specific conditions increase the difficulty of automatic analysis and require modeling abnormal behavior separately for each particular scene.

In order to build such models, many algorithms have been proposed. In [6], optical flow is used to compute interaction forces between adjacent pixels, and a model known as the Social Force Model is created based on a bag-of-words approach to classify frames as either normal or abnormal. In [5], dynamic textures (DT) are used to model the appearance and dynamics of normal behavior; samples with low probability values under the model are labeled as abnormal. In [7], entropy and energy concepts are used as features to model the probability of finding abnormal behavior in the scene. Natural language processing is used in [10] as the classification algorithm for features based on viscous fluid field concepts.

Many algorithms employ machine learning techniques as the classification tool. Support Vector Machines (SVMs) are used in [8, 11] to classify histograms of the orientation of optical flow. Multilayer perceptron neural networks are used in [13]. k-Nearest Neighbors is used in [1] to classify outlier trajectories as abnormal behavior. Finally, Fuzzy C-Means is used in [2, 3] to derive an unsupervised model of the crowd's trajectory patterns.

In general, to construct the feature vectors used in many of the algorithms described above, several parameters must be correctly set in order to achieve the performance reported by the authors. Some of the state-of-the-art methods are based on complex probabilistic models, which leads to high processing times. Although the processing time per frame is reported in only a few papers, it is in general high. For example, in [5] the authors report a test time of 25 s per frame for 160\(\,\times \,\)240 pixel images, and in [12] the reported test time per frame is 5 s for videos with 320\(\,\times \,\)240 pixel resolution.

The main contribution of this paper is a simple but efficient method that reduces the processing time per frame to near real time, allowing practical use.

The rest of this paper is organized as follows. Section 2 describes the proposed approach. Section 3 presents the experimental results. Section 4 presents the conclusions.

Fig. 1. General pipeline for the proposed method.

2 Proposed Method

The general pipeline for the proposed approach is shown in Fig. 1. The five initial modules (1 to 5 in Fig. 1) aim to compute the model features and are the same for the training and test phases. These initial modules are described in Sect. 2.1. In the training phase, represented by module 6, frames with normal behavior are used to update the model as described in Sect. 2.2. In the test phase, represented by module 7, each new sample is compared with the model and classified as normal or abnormal as described in Sect. 2.3. A false positive reduction methodology, represented by module 8, is also described in Sect. 2.3.

2.1 OFCCs Computation

In the training phase, a sequence of N frames is used to build the normal behavior model. The algorithm presented in [9] is used to compute the background model, and a foreground mask \(I_{fm}\) of each frame is obtained.

In order to reduce noise and the computational load, a connected components labeling algorithm is used to obtain the blobs \((b_1, b_2, \dots , b_n)\), where n is the total number of blobs in the foreground mask \(I_{fm}\). In parallel with the foreground extraction, the dense optical flow of each frame is computed using [14]. The optical flow vectors are used to obtain the magnitude \(m(x,y)\) and direction \(\theta (x,y)\) values of each \((x,y)\) point in the input image.
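A minimal C++/OpenCV sketch of this stage is given below. It assumes OpenCV's MOG2 background subtractor as a stand-in for the model of [9] and Farnebäck dense optical flow as one possible implementation of [14]; the struct name and parameter values are illustrative, not the exact configuration used in the paper.

```cpp
#include <opencv2/opencv.hpp>

// Per-frame feature extraction: foreground blobs plus dense optical flow.
struct FrameFeatures {
    cv::Mat labels;     // 32-bit blob label image (0 = background)
    int numBlobs = 0;   // number of foreground blobs n
    cv::Mat magnitude;  // optical flow magnitude m(x,y)
    cv::Mat direction;  // optical flow direction theta(x,y), in degrees
};

FrameFeatures extractFeatures(cv::Ptr<cv::BackgroundSubtractor>& bg,
                              const cv::Mat& prevGray, const cv::Mat& gray)
{
    FrameFeatures f;

    // Modules 1-2: foreground mask from the background model, lightly cleaned.
    cv::Mat fgMask;
    bg->apply(gray, fgMask);
    cv::threshold(fgMask, fgMask, 127, 255, cv::THRESH_BINARY); // drop shadow labels
    cv::morphologyEx(fgMask, fgMask, cv::MORPH_OPEN,
                     cv::getStructuringElement(cv::MORPH_RECT, {3, 3}));

    // Module 3: connected components labeling; each blob b_i gets label i.
    f.numBlobs = cv::connectedComponents(fgMask, f.labels, 8, CV_32S) - 1;

    // Module 4: dense optical flow between consecutive frames.
    cv::Mat flow;
    cv::calcOpticalFlowFarneback(prevGray, gray, flow,
                                 0.5, 3, 15, 3, 5, 1.2, 0);

    // Module 5: convert flow vectors to magnitude and direction (degrees).
    cv::Mat xy[2];
    cv::split(flow, xy);
    cv::cartToPolar(xy[0], xy[1], f.magnitude, f.direction, true);
    return f;
}
```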

An Optical Flow Connected Component \(OFCC_i\) can be defined as the set of values \([m(x,y), \theta (x,y)]\) for all \((x,y)\) points belonging to the i-th blob, as expressed in Eq. 1.

$$\begin{aligned} OFCC_i = \{[m(x,y), \theta (x,y)] \; | \; (x,y) \in b_i\}. \end{aligned}$$
(1)

The main direction \(\overline{\theta }_i\) of the i-th OFCC is computed as follows. A histogram of the direction values of \(OFCC_i\) is obtained with a fixed bin width of \(\varDelta \theta = 45^{\circ }\). The angle associated with the highest bin is used as the main direction \(\overline{\theta }_i\) of \(OFCC_i\).

The main magnitude \(\overline{m}_i\) of \(OFCC_i\) is obtained as the mean of the magnitude values in \(OFCC_i\), as shown in Eq. 2,

$$\begin{aligned} \overline{m}_i = \frac{1}{S} \sum _{(x,y) \in b_i} m(x,y) \end{aligned}$$
(2)

where S is the total number of magnitude values in \(OFCC_i\). Finally, the main direction \(\overline{\theta }_i\) and the main magnitude \(\overline{m}_i\) values of each \(OFCC_i\) are used to construct the normal behavior model.
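The per-blob statistics can be sketched as follows, reusing the images produced above. The bin convention (each bin represented by its upper edge, so that \(\overline{\theta }_i / \varDelta \theta \) is an integer index) is an assumption made here for consistency with Eq. 4.

```cpp
#include <algorithm>
#include <vector>
#include <opencv2/opencv.hpp>

// Main direction (mode of a 45-degree histogram) and main magnitude
// (mean, Eq. 2) of one OFCC.
struct OFCCStats { float mainDir; float mainMag; };

OFCCStats ofccStats(const cv::Mat& labels, const cv::Mat& magnitude,
                    const cv::Mat& direction, int blobId,
                    float binWidth = 45.0f)
{
    const int numBins = static_cast<int>(360.0f / binWidth);
    std::vector<int> hist(numBins, 0);
    double magSum = 0.0;
    int count = 0;

    for (int y = 0; y < labels.rows; ++y)
        for (int x = 0; x < labels.cols; ++x) {
            if (labels.at<int>(y, x) != blobId) continue;
            float theta = direction.at<float>(y, x);   // in [0, 360)
            int bin = std::min(static_cast<int>(theta / binWidth),
                               numBins - 1);
            ++hist[bin];
            magSum += magnitude.at<float>(y, x);
            ++count;
        }

    OFCCStats s;
    int best = static_cast<int>(
        std::max_element(hist.begin(), hist.end()) - hist.begin());
    // Represent the winning bin by its upper edge so that
    // mainDir / binWidth yields the integer matrix index eta of Eq. 4.
    s.mainDir = (best + 1) * binWidth;
    s.mainMag = count > 0 ? static_cast<float>(magSum / count) : 0.0f;
    return s;
}
```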

2.2 Normal Behavior Model

In this algorithm the behavioral model is composed of m matrices \((A_1, A_2, \dots , A_m)\) where m is computed as

$$\begin{aligned} m = \frac{360}{\varDelta \theta } \end{aligned}$$
(3)

and represents the number of possible values that \(\overline{\theta }_i\) can adopt. For instance, if \(\varDelta \theta = 45^{\circ }\) then \(m = 8\) matrices will be defined.
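A minimal sketch of the model allocation is shown below; rows and cols are assumed to be the frame dimensions, and since \(\eta \) is 1-based in the text, matrix \(A_\eta \) maps to index \(\eta - 1\) in 0-based storage.

```cpp
// Model allocation: m = 360 / Delta-theta zero matrices (Eq. 3),
// one per direction bin. A_eta is stored at model[eta - 1].
const float deltaTheta = 45.0f;
const int m = static_cast<int>(360.0f / deltaTheta);      // 8 matrices

std::vector<cv::Mat> model;
model.reserve(m);
for (int i = 0; i < m; ++i)
    model.push_back(cv::Mat::zeros(rows, cols, CV_32F));  // fresh buffer per entry
```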

For each frame in the training video, a set of n OFCCs is obtained as described in the previous section. After computing the \(\overline{\theta }_i\) and \(\overline{m}_i\) values of each \(OFCC_i\), the index \(\eta \) of the corresponding A matrix is obtained as

$$\begin{aligned} \eta = \frac{\overline{\theta }_i}{\varDelta \theta } \end{aligned}$$
(4)

Then, the values of the \(A_\eta \) matrix are updated according to the following rule:

$$\begin{aligned} A_\eta (x,y) = {\left\{ \begin{array}{ll} \overline{m}_i, &{} \text {if } \, \overline{m}_i > A_\eta (x,y)\\ A_\eta (x,y), &{} \text {otherwise} \end{array}\right. }, \forall \; (x,y) \in b_i. \end{aligned}$$
(5)

At the end of the training phase, each matrix \(A_\eta \) stores, at each point \((x,y)\), the maximum main magnitude \(\overline{m}_i\) observed over the full training video for the direction \(\eta \cdot \varDelta \theta \).
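A sketch of the training update (Eqs. 4 and 5) is given below, reusing the ofccStats output above; the 1-based-to-0-based index mapping is an implementation assumption.

```cpp
// Training update for one OFCC: select A_eta from the blob's main
// direction (Eq. 4) and keep the per-pixel maximum main magnitude (Eq. 5).
void updateModel(std::vector<cv::Mat>& model, const cv::Mat& labels,
                 int blobId, const OFCCStats& s, float binWidth = 45.0f)
{
    const int eta = static_cast<int>(s.mainDir / binWidth);  // Eq. 4, 1-based
    cv::Mat& A = model[eta - 1];

    for (int y = 0; y < labels.rows; ++y)
        for (int x = 0; x < labels.cols; ++x)
            if (labels.at<int>(y, x) == blobId &&
                s.mainMag > A.at<float>(y, x))
                A.at<float>(y, x) = s.mainMag;               // Eq. 5
}
```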

Figure 2 shows an example of a normal behavior model with eight matrices \(A_\eta \), \(\eta = 1, 2, \dots , 8\), for \(\varDelta \theta = 45^{\circ }\). A color map was applied to each matrix \(A_\eta \) for better visualization.

Fig. 2. The magnitude model for eight directions \((m = 8)\). Each image shows the highest per-pixel magnitude for direction angles in (a) (\(0^{\circ }\), \(45^{\circ }\)], (b) (\(45^{\circ }\), \(90^{\circ }\)], (c) (\(90^{\circ }\), \(135^{\circ }\)], (d) (\(135^{\circ }\), \(180^{\circ }\)], (e) (\(180^{\circ }\), \(225^{\circ }\)], (f) (\(225^{\circ }\), \(270^{\circ }\)], (g) (\(270^{\circ }\), \(315^{\circ }\)] and (h) (\(315^{\circ }\), \(360^{\circ }\)]. (Color figure online)

2.3 Abnormality Detection

After all the training frames have been processed and the model is complete, test videos containing both normal and abnormal behaviors can be analyzed.

The set of OFCCs and their main directions \(\overline{\theta }_i\) are obtained as described in Sect. 2.1 for each video frame. To determine whether \(OFCC_i\) is abnormal, its main direction \(\overline{\theta }_i\) is used to find the corresponding \(A_\eta \) matrix, with \(\eta \) computed using Eq. 4. Next, the maximum value \(\hat{a}_\eta \) of \(A_\eta \) within the region defined by the blob \(b_i\) is found according to

$$\begin{aligned} \hat{a}_\eta = \max _{(x,y) \in b_i} A_\eta (x,y). \end{aligned}$$
(6)

Then, each \(m(x,y)\) value in \(OFCC_i\) is compared with \(\hat{a}_\eta \) as follows: if \(m(x,y)\) is greater than \(\hat{a}_\eta \), the pixel \((x,y)\) is marked as abnormal; otherwise it is marked as normal.

After comparing all the magnitude values in \(OFCC_i\), an abnormal binary mask image \(I_{ab}(x,y)\), with the same size as the input frames, can be used to store the abnormal pixels as \(I_{ab}(x,y) = 1\) and the normal ones as \(I_{ab}(x,y) = 0\).

In order to improve the algorithm performance, a FIFO list of fixed size M is defined and filled with the latest M binary images \(I_{ab}(x,y)\). To be considered abnormal, an OFCC must appear at least W times in the list. The list size M and the count W are user-controlled parameters that can be used for sensitivity adjustment, since a higher value of W means a longer alarm delay.
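The test phase and the false positive reduction can be sketched as below, reusing the types from the earlier sketches; the default values of M and W are placeholders, not values reported by the authors.

```cpp
#include <algorithm>
#include <deque>
#include <vector>
#include <opencv2/opencv.hpp>

// Eq. 6 and the per-pixel test for one frame: build the binary mask I_ab.
cv::Mat abnormalMask(const std::vector<cv::Mat>& model,
                     const FrameFeatures& f,
                     const std::vector<OFCCStats>& stats, // one per blob
                     float binWidth = 45.0f)
{
    cv::Mat iab = cv::Mat::zeros(f.labels.size(), CV_8U);
    for (int id = 1; id <= f.numBlobs; ++id) {
        int eta = static_cast<int>(stats[id - 1].mainDir / binWidth);
        const cv::Mat& A = model[eta - 1];

        // Eq. 6: maximum model magnitude within the blob's region.
        float aHat = 0.0f;
        for (int y = 0; y < f.labels.rows; ++y)
            for (int x = 0; x < f.labels.cols; ++x)
                if (f.labels.at<int>(y, x) == id)
                    aHat = std::max(aHat, A.at<float>(y, x));

        // A pixel is abnormal if its flow magnitude exceeds aHat.
        for (int y = 0; y < f.labels.rows; ++y)
            for (int x = 0; x < f.labels.cols; ++x)
                if (f.labels.at<int>(y, x) == id &&
                    f.magnitude.at<float>(y, x) > aHat)
                    iab.at<uchar>(y, x) = 1;
    }
    return iab;
}

// FIFO persistence filter: a pixel is raised as abnormal only if it was
// flagged in at least W of the last M masks (placeholder defaults).
cv::Mat persistenceFilter(std::deque<cv::Mat>& history, const cv::Mat& iab,
                          int M = 10, int W = 5)
{
    history.push_back(iab.clone());
    if (static_cast<int>(history.size()) > M) history.pop_front();

    cv::Mat votes = cv::Mat::zeros(iab.size(), CV_8U);
    for (const cv::Mat& h : history) votes += h;   // per-pixel counts
    return votes >= W;                             // 255 where persistent
}
```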

3 Results and Comparisons

The proposed algorithm was implemented in Qt/C++ using OpenCV on a 2.7 GHz Intel Core i7 PC with 16 GB of RAM. The method was tested on two popular datasets: UMN and UCSD. Figure 3 shows a frame for each of the scenarios in the UMN dataset and the abnormality detected by the proposed approach. The frame size in all UMN videos is 320\(\,\times \,\)240 pixels. The frame size in the UCSDped1 videos is 238\(\,\times \,\)158 pixels and in UCSDped2 is 360\(\,\times \,\)240 pixels.

Fig. 3. Examples of normal (top) and abnormal (bottom) situations in the UMN dataset.

Figure 4 shows three example frames with abnormal behavior for each of the two scenarios in the UCSD dataset.

Fig. 4. Examples of abnormal behavior detected in the UCSD dataset: UCSDped1 (top) and UCSDped2 (bottom).

Fig. 5. Quantitative comparison of abnormal behavior detection in (a) UCSDped1 and (b) UCSDped2 against state-of-the-art algorithms.

The proposed method was compared with similar state-of-the-art algorithms, including Mixture of Dynamic Textures (MDT) [5], Mixture of Probabilistic Principal Component Analyzers (MPPCA) [4], Social Force [6], Social Force with MPPCA [4] and the Hierarchical Activity Approach [12]. Figure 5 shows the Receiver Operating Characteristic (ROC) curves for the proposed method and the comparative algorithms, taken from [12]. Table 1 shows the Area Under the ROC Curve (AUC) for the five comparative methods and the proposed one. Finally, Fig. 6 shows the processing time per frame for some state-of-the-art algorithms and the method proposed in this paper.

The ground truth provided by the UCSD dataset, used for performance evaluation in all the comparison methods, labels people in wheelchairs as abnormal behavior, even when their speed is lower than that of walking people. This leads to additional false negative frames because, in the presented algorithm, this situation is not considered abnormal. A second situation in which the output of the presented algorithm differs from the ground truth is when somebody, in the test phase, walks in a region where no people walked during the training phase. Examples of this type of abnormality detection are shown in Fig. 4(a), (b), (e) and (f). Frames that present only this kind of abnormality are ignored in the comparison results.

Table 1. Comparison of the Area Under the ROC Curve (AUC) of the proposed method with the other algorithms.
Fig. 6. Comparison of processing time per frame with other state-of-the-art algorithms. The times shown are for the test phase on UCSDped1.

4 Conclusions

This paper presents a new method for abnormal behavior detection based on optical flow and connected component analysis. From the experimental results it can be concluded that, compared to other state-of-the-art methods, the proposed method achieves better abnormality detection performance on the UCSDped2 dataset and is very close to the best one on UCSDped1. Moreover, as shown in Fig. 6, it has the lowest processing time per frame, close to real time, which allows practical use on modern computers.