
1 Introduction

Surveillance videos contain rich information and are ubiquitous in our lives. They play an important role in public security (e.g., evidence recording, case analysis, crime prevention). Surveillance videos have typical spatio-temporal characteristics, combining the two dimensions of time and space. At present, surveillance video investigation mainly relies on an analyst's scanning, monitoring, judgment, and reasoning to discover clues related to a case. To reconstruct the course of a real case and track the information closely related to it, a large number of surveillance videos with extremely long footage must be collected, making video analysis laborious and time-consuming. The large volume of surveillance video data also leads to poor performance (e.g., misjudgments, missed detections), lack of concentration, high mental workload, fatigue, and negative mood (e.g., frustration, anxiety). Therefore, an interactive system that can facilitate the process of video analysis and improve job performance and emotional experience is needed.

Paper-based video image investigation is one of the most widely used methods in practical work. Analysts save a screenshot whenever they find information relevant to the case and write down all details (e.g., time, location, criminal characteristics) on paper for further analysis. Because of individual differences in note-taking habits, paper-based investigation records are easily disrupted without timely data storage and are difficult for other analysts to interpret. Moreover, clues are usually scattered across different parts of a video or across different videos. Analysts need to integrate many video and image fragments into a chain of clues to analyze the whole process of a crime. The existing paper-based video image investigation approach falls far short of these goals. Therefore, how to effectively extract clues from a large number of surveillance videos, simplify the process of saving video evidence, organize the relationships between different clues, and improve the efficiency of video information retrieval are the focus of the current study. In addition, how to develop a user-centric video image investigation system with an effective and natural way to interact, and how to evaluate the naturalness and efficiency of the system interaction, are the most challenging parts of this study.

2 Natural Interaction in Video Image Investigation

2.1 Natural Interaction

Hand-drawn sketching is a natural and direct mode of externalizing and communicating human thought. A sketch can use simple shapes to express abstract intentions. It has the semantic features of both text and images, so when people see a rough sketch, they immediately grasp the semantics behind it. In addition, sketches can serve as simple interactive gestures, owing to their abstract, symbolic, and fuzzy characteristics. Their simplicity, speed, and flexibility also make them a good medium of information expression. Extracting and organizing important clues from surveillance videos is therefore the key research direction of sketch-based surveillance video analysis.

How can a user-centric video image investigation system with a natural way to interact be developed? Researchers have made efforts to interview actual users and record their subjective thoughts and feelings. In this study, we developed a mental model of video image investigation to help us understand surveillance video analysts, and designed an interactive video image investigation system for them.

2.2 Mental Model of the Video Image Investigation

A mental model is an internal symbolic representation of the external world that helps an individual understand, interpret, and anticipate how things work [1]. It is dynamic and can be manipulated in the mind to obtain outcomes [2]. Mental models have been applied in a variety of product designs (e.g., domestic appliances, traffic and military facilities) to help designers understand users and make better products for them [3,4,5,6]. The most commonly used method is the interview, especially in the development of mental models for practical applications. There are two categories of interview: structured and semi-structured. A structured interview uses questions settled before the interview is conducted, while a semi-structured interview is more flexible and varies among interviewees. Through interviews, mental models of various products (e.g., flush toilets, home heating, electronic health record systems [7,8,9]) have been developed.

In this study, to obtain a mental model of video image investigation, 27 target users (i.e., surveillance video analysts) were interviewed by two experienced experimenters. The interview comprised two sessions. Demographic information, working history and experience, daily workload, etc. were collected to control for individual differences in the development of the mental model. In the first session, each video analyst talked about their daily work, for example, what they usually look for in a video to find a suspect, how they deal with different video clips and clue information, and the common contents they write down or save. Two real cases were replayed, and each video analyst was required to explain the detailed investigation procedure step by step. One simple case was analyzed with 10 h of surveillance video from 5 camera locations, while the other case was more complicated and was analyzed with hundreds of hours of surveillance video from more than 20 camera locations.

In the second session of our interview, all video analysts were gathered for a group interview in which they were required to complete a real case together. Before the group interview, all video analysts were informed of what kind of case had happened, the time and place of the case, and the target suspect and vehicles. Their primary work was to watch each surveillance video, mark the target suspect whenever he appeared, and draw a road map of the target suspect from his first appearance in the videos to the time and place of the case. They were required to work in their usual mode, using a common video player to play each video and a notebook to write down any relevant clue information. During the task, video analysts were interrupted at each necessary step and asked to explain their operations, discuss the problems they encountered at that moment, and describe the solutions they expected. Here, we list three major problems reflected in both sessions of the interview:

  • Some video clips (e.g., from the private cameras of small merchants) were in nonstandard formats and could not be played by a common video player.

  • Important clues were scattered across the notebook in different marking/symbolic styles, which made them difficult for other analysts to recognize and use.

  • Clue information retrieval was troublesome because there was no hyperlink between video clips and important clues. Videos had to be located again and played back to the relevant time point based on the notes.

Based on the in-depth interviews with actual video analysts, we drew a mental model (see Fig. 1) to reflect how the video image investigation system should work and interact with video analysts. There were four major components: video database, video player, material warehouse, and case management. The case management component was used to create a new case or close an existing one, assign tasks to different video analysts, and organize all case information by combining and retrieving text and image clues. The video database and video player components were used to process video clips. Nonstandard video clips were first transformed into a standard format. A standard video clip was then selected and played, and its time and location information were automatically marked on the road map. When video analysts found a relevant clue, the annotation function was activated to generate a hyperlink connecting a video segment (whose length could be customized) with the individual analyst's annotation. When a number of videos had been investigated, all hyperlinks with their annotations were connected to form a road map illustrating the suspect's spatio-temporal activity trajectory. Besides surveillance videos, other types of evidence (e.g., images, testimony) were stored in the material warehouse component. In Fig. 1, green parts represent storage space: surveillance videos with hyperlinks, target images with corresponding notes, electronic evidence, and testimony were stored in these two components. Blue parts represent the video player's major operations and functions. Black parts represent information management and organization during the process of video image investigation.

Fig. 1.

Mental model of the video image investigation, consisting of four components: video database, video player, material warehouse, and case management. Arrows indicate how videos and clue information are passed among the four components. (Color figure online)

2.3 Major Functions of the Proposed Video Image Investigation System

According to the mental model of video image investigation, we designed a new video image investigation system for video analysts to facilitate the process of video analysis and enhance their job performance and emotional experience. Due to page limitations, we describe only two of the system's major features.

Sketch-Based Video Annotation.

The proposed system combines sketches and text to generate video annotations, with sketch tags chosen from our sketch tag library. A video annotation includes the time point at which the annotation is added, the annotation tag, and the current analyst's information. Specifically, video annotations are divided into two categories: video frame annotations and video clip annotations. A video frame annotation is added while a video is paused and is used to annotate an object or event in an individual frame. A video clip annotation is added while a video is playing and is used to annotate a video over a period of time. We take advantage of the simple, intuitive characteristics of sketches to achieve effective content annotation of suspects and vehicles, and use text comments to describe their attributes and characteristics. We use the frame image for video frame annotation and a key frame extracted from the video for video clip annotation. The key frame extraction algorithm is based on clustering, using the similarity between frames' color histograms; a minimal sketch of this approach is given below. Compared to traditional paper-based annotation, sketch-based video annotation was expected to improve an analyst's work efficiency and the collaboration between multiple analysts on the basis of standard annotation principles.
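The following sketch illustrates the clustering idea, assuming OpenCV and NumPy; the histogram bin counts, the similarity threshold, and all function names are illustrative choices rather than the actual parameters of the proposed system. Frames join the most similar cluster (by color histogram correlation) or start a new one, and each cluster contributes the frame closest to its centroid as a key frame.

```python
import cv2
import numpy as np

def color_histogram(frame, bins=(8, 8, 8)):
    """Normalized 3-D BGR color histogram of a frame."""
    hist = cv2.calcHist([frame], [0, 1, 2], None, list(bins),
                        [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def extract_key_frames(video_path, similarity_threshold=0.9):
    """Cluster frames by histogram similarity and return one key
    frame index (closest to the cluster centroid) per cluster."""
    cap = cv2.VideoCapture(video_path)
    clusters = []  # each: {"centroid": histogram, "members": [(index, histogram)]}
    index = 0
    while True:  # for brevity every frame is processed; real systems would sample
        ok, frame = cap.read()
        if not ok:
            break
        hist = color_histogram(frame)
        # Assign the frame to the most similar cluster, if similar enough.
        best, best_sim = None, similarity_threshold
        for cluster in clusters:
            sim = cv2.compareHist(cluster["centroid"], hist,
                                  cv2.HISTCMP_CORREL)
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": hist, "members": [(index, hist)]})
        else:
            best["members"].append((index, hist))
            stacked = np.stack([h for _, h in best["members"]])
            best["centroid"] = stacked.mean(axis=0).astype(np.float32)
        index += 1
    cap.release()
    # The key frame of each cluster is the member closest to its centroid.
    return sorted(
        max(c["members"],
            key=lambda m: cv2.compareHist(c["centroid"], m[1],
                                          cv2.HISTCMP_CORREL))[0]
        for c in clusters)
```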

Hyperlink.

When a video annotation is created, a hyperlink is automatically generated connecting the video segment with the corresponding annotation. The hyperlink makes two major contributions to the process of video analysis. First, it is quite useful when analysts want to play back a video to browse the clues. The traditional method is to locate the target video and navigate to the time point based on screenshots or notes. In contrast, the proposed system can track back to the target video segment, with its detailed annotations, through the hyperlink function in a more efficient way. Second, surveillance videos are usually kept for three months, and only those relevant to important cases are kept, for no longer than six months. The proposed system provides a solution for saving and organizing case-relevant video segments, with their hyperlinks, for a longer time. Together, the hyperlinks form a chain of evidence (i.e., the road map in the mental model) that facilitates video analysis; an illustrative data model is sketched below.
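The data model behind the hyperlink mechanism could look like the following sketch; all class and field names are hypothetical illustrations, not the proposed system's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class VideoSegment:
    video_id: str
    start_s: float  # segment start within the source video (seconds)
    end_s: float    # segment end; the length is customizable per analyst

@dataclass
class Annotation:
    analyst_id: str
    timestamp_s: float  # time point at which the annotation was added
    sketch_tag: str     # tag chosen from the sketch tag library
    comment: str        # free-text description of attributes

@dataclass
class Hyperlink:
    segment: VideoSegment
    annotation: Annotation
    camera_location: str

@dataclass
class RoadMap:
    """Chain of evidence: hyperlinks ordered by annotation time."""
    links: List[Hyperlink] = field(default_factory=list)

    def add(self, link: Hyperlink) -> None:
        self.links.append(link)
        self.links.sort(key=lambda l: l.annotation.timestamp_s)

    def trajectory(self):
        """The suspect's spatio-temporal trajectory across cameras."""
        return [(l.annotation.timestamp_s, l.camera_location)
                for l in self.links]
```

Ordering the links by annotation time is what turns isolated clues into the road map described in the mental model.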

2.4 Evaluation Index System of Natural Interaction

After the video image investigation system was proposed, another key research question was how to evaluate the naturalness and efficiency of the system interaction. In this study, work efficiency, mental workload, and emotional experience were considered as the three evaluation indices.

Work Efficiency.

Here, work efficiency refers to an analyst's activities directed toward the accomplishment of video analysis. Time to completion (TTC) and learning time are two common indices of work efficiency. A shorter time to become familiar with an interactive system, or to complete a task using it, indicates better work efficiency. A previous study compared three levels of learning time (no learning time, 15 min, and 30 min) and examined the effects of learning time on the work efficiency of three video summarization systems. The authors found shorter TTCs with longer learning time, suggesting that learning time is an effective indicator of work efficiency [10].

Mental Workload.

Mental workload reflects the interaction of the mental demands imposed on operators by the tasks they attend to [11], or the mental cost of accomplishing task demands [12]. Mental workload can be measured by subjective ratings, task performance, and physiological signals. Among them, EEG signals are sensitive to subtle changes in mental workload. Under high mental workload, alpha band activity is suppressed while theta band activity increases [13,14,15]. When dealing with a complex or multitasking situation, theta band activity increases over the frontal and parietal areas [16, 17].

Emotional Experience.

Emotional experience refers to affective feelings at work. EEG signals are sensitive to changes in emotional states, and frontal alpha asymmetry (FAA) is the most widely used indicator. FAA reflects the difference in alpha band activity between the left and right frontal lobes. The alpha asymmetry pattern can be explained by two models: the motivational model and the valence model. Emotions with approach motivational tendencies are linked to higher left frontal activity, whereas emotions with withdrawal motivational tendencies are linked to higher right frontal activity [18, 19]. On the other hand, greater left hemisphere activity (lower alpha power) is associated with positive emotion, whereas greater right hemisphere activity is associated with negative emotion [20, 21].
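For concreteness, a minimal sketch of an FAA score for one electrode pair follows, using the common log-difference convention; this formula is an assumption for illustration, not necessarily the exact computation used in the present study.

```python
import numpy as np

def faa_score(alpha_power_left: float, alpha_power_right: float) -> float:
    """FAA = ln(right alpha power) - ln(left alpha power).

    Since alpha power is inversely related to cortical activation,
    FAA > 0 implies relatively greater left frontal activation, which
    the motivational and valence models link to approach tendencies
    and positive affect."""
    return float(np.log(alpha_power_right) - np.log(alpha_power_left))

# Hypothetical band-power values (arbitrary units) for the F3/F4 pair:
print(faa_score(alpha_power_left=4.2, alpha_power_right=5.1))  # > 0
```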

3 Evaluation of the Proposed Video Image Investigation System

In this section, we conducted a user study to evaluate the naturalness and the efficiency of the system interaction. Only a few comparable features (e.g., sketch-based annotation vs. paper-based annotation) between the proposed interactive system and the traditional method were investigated in the user study.

3.1 Participants

Thirty healthy volunteers took part in the experiment. Due to incomplete EEG data recordings, two participants were excluded from further analysis, leaving a sample of 28 participants (13 males and 15 females) with an average age of 24 years (range = 20–29, standard deviation = 2.3). All had normal or corrected-to-normal eyesight and no history of neurological or psychiatric disorders.

3.2 Task Description

Participants were required to watch thirteen video clips and detect a target suspect in each clip, which would eventually generate a road map of the target suspect from his first appearance in the videos to the time and place of the case. Because the target suspect appeared in every video clip, and more than once in some clips, participants had to watch each clip from beginning to end (jumping forward was not allowed). Normal interactive operations with a video player (e.g., selecting and opening a video clip, play, pause, jumping backward) were identical for the two groups. Participants in the experimental group used the proposed video image investigation system (Fig. 2a), while those in the control group used traditional screenshots and paper-based annotations (Fig. 2b).

Fig. 2.

(a) Interaction with the proposed video image investigation system. When a target appeared in the video (e.g., a black car), participants in the experimental group used the mouse to draw a circle around the target. This operation automatically generated a screenshot and a pop-up window that allowed the editing of annotation information. (b) Traditional screenshots and paper-based annotations. When a target appeared in the video, participants in the control group instead paused the video to save a screenshot and wrote down all annotation information in a notebook.

3.3 Procedure

Upon arrival, participants completed an informed consent form, followed by a questionnaire regarding their demographic information. After the setup of an EEG cap, they sat in front of a standard desktop computer (3.40 GHz, Intel Core i7 processor) with a 22-in. monitor in the lab. The viewing distance to the monitor was approximately 65 cm.

Participants first went through a practice session to become familiar with the interactive investigation system (experimental group) or with traditional screenshots and paper-based annotations (control group). After that, participants were instructed to sit quietly and watch a blank screen with their eyes open for 5 min (i.e., the baseline condition). Before the formal test, participants were informed about the task and video materials. They were encouraged to finish the task as accurately and quickly as possible. The whole experiment lasted about 2 h, and participants received ¥100 as reimbursement.

3.4 EEG Acquisition

We used a 64-electrode Neuroscan cap to record participants' brain activity while they watched the video clips. All electrodes were placed according to the international 10–20 electrode placement standard. The reference electrodes were placed on the left and right mastoids. Horizontal and vertical EOGs were recorded with electrodes placed 10 mm from the outer canthi of both eyes and above and below the left eye. Electrical impedance at each electrode site was reduced to less than 5 kOhms. In addition to the two reference electrodes, 28 electrodes over the frontal and parietal areas (FP1, FPZ, FP2, AF3, AF4, F5, F3, F1, FZ, F2, F4, F6, FC5, FC3, FC1, FCZ, FC2, FC4, FC6, P7, P5, P3, P1, PZ, P2, P4, P6, P8) were used. The sampling rate was 1024 Hz.

The raw EEG data were digitally filtered with a 0.1–50 Hz bandpass filter. All trials were visually inspected, and trials with excessive peak-to-peak deflections or bursts of electromyographic activity were excluded from further analyses. For ocular artifact correction, the experimenter inspected a sample of eye blinks for each participant and determined a threshold that captured the majority of them. Blinks within the threshold were segmented into epochs, and epochs that contained more than one blink or deviated from the typical blink shape were rejected. A spatial singular value decomposition was then performed to create a linear derivation file of the ocular artifacts, which approximated the topographies of the components to be removed from the raw EEG data. The average of M1 and M2 was used as the reference. The clean, re-referenced EEG data were transformed into the frequency domain by short-term fast Fourier transformation with a 2-s Hanning window. Power values within the theta band (4–8 Hz) and alpha band (8–13 Hz) were averaged for the resting and task conditions for each participant; a sketch of this spectral step is given below.
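The spectral step can be sketched as follows, assuming SciPy and NumPy; scipy.signal.welch with a Hann window and 2-s segments stands in for the short-term FFT, while the artifact rejection and SVD-based ocular correction described above are omitted.

```python
import numpy as np
from scipy.signal import welch

FS = 1024  # sampling rate (Hz), as used in the recordings

def band_powers(channel, fs=FS, window_s=2.0):
    """Average theta (4-8 Hz) and alpha (8-13 Hz) power of one
    cleaned, re-referenced EEG channel (1-D array)."""
    freqs, psd = welch(channel, fs=fs, window="hann",
                       nperseg=int(window_s * fs))
    theta = psd[(freqs >= 4) & (freqs < 8)].mean()
    alpha = psd[(freqs >= 8) & (freqs <= 13)].mean()
    return theta, alpha

# Example on synthetic data: a 10-Hz (alpha-band) oscillation plus noise.
t = np.arange(0, 60, 1 / FS)
channel = np.sin(2 * np.pi * 10 * t) + 0.5 * np.random.randn(t.size)
theta, alpha = band_powers(channel)
print(f"theta = {theta:.4f}, alpha = {alpha:.4f}")  # alpha should dominate
```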

4 Results

4.1 Mental Workload

For each participant, the band power during the experimental task was baseline-corrected by subtracting the band power at rest, and the differences were normalized into the range [0, 1]. A one-way analysis of variance (ANOVA) was performed with group (2 levels: experimental vs. control) as a between-subjects variable. The dependent variables were the theta band power at the frontal lobe and the alpha band power at the parietal lobe; the analysis is sketched below.
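A sketch of this between-subjects comparison, assuming SciPy; the values below are random placeholders for the per-participant task-minus-rest band powers, not the study's data.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# Hypothetical frontal theta (task minus rest) values, one per participant.
experimental_raw = rng.normal(0.8, 0.3, size=14)
control_raw = rng.normal(1.2, 0.3, size=14)

# Normalize the pooled differences into [0, 1], then split by group.
pooled = np.concatenate([experimental_raw, control_raw])
pooled = (pooled - pooled.min()) / (pooled.max() - pooled.min())
experimental, control = pooled[:14], pooled[14:]

f_stat, p_value = f_oneway(experimental, control)
print(f"F(1, 26) = {f_stat:.3f}, p = {p_value:.3f}")
```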

As shown in Fig. 3, significant group differences in theta band power were found at the F4 site (F(1,26) = 4.774, p = 0.038, η2 = 0.155), the F5 site (F(1,26) = 6.126, p = 0.020, η2 = 0.191), and the FC6 site (F(1,26) = 5.351, p = 0.029, η2 = 0.171). A significant group difference was also found for the average theta band power over the frontal areas (F(1,26) = 4.251, p = 0.049, η2 = 0.141). The frontal theta band power of the experimental group was significantly lower than that of the control group (see Table 1) at these electrode sites. However, no significant difference was obtained for the alpha band power at the frontal or parietal areas. These results are consistent with previous findings on mental workload, indicating that theta band power over the frontal areas was sensitive to the rest-task difference in mental workload. Participants using the proposed interactive system may have spent fewer cognitive resources to complete the video image investigation task than those using the traditional paper-based method.

Fig. 3.

Comparison of the theta band power at the frontal areas between the two groups of participants (error bars indicate ±1 standard error).

Table 1. Means and standard errors of the theta band power for the two groups of participants.

4.2 Emotional Experience

A repeated measures analysis of variance (ANOVA) was performed with hemisphere (2 levels: left vs. right) as a within-subjects variable and group (2 levels: experimental vs. control) as a between-subjects variable. Frontal alpha asymmetry (FAA) was assessed using the alpha band power at six pairs of electrode sites over the frontal areas (FP1/FP2, AF3/AF4, F1/F2, F3/F4, F5/F6, F7/F8) as the dependent variable. Significant interactions were followed up with simple effect analyses, in which the difference in alpha band power between the left and right hemispheres was assessed for each group of participants; a sketch of this mixed-design analysis is given below.
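The mixed design (hemisphere within subjects, group between subjects) can be sketched as follows, assuming the pingouin package (pg.mixed_anova) on a long-format table; the data below are random placeholders shaped to mimic the direction of the reported effect, not the study's recordings.

```python
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(0)
rows = []
for group in ("experimental", "control"):
    for s in range(14):  # hypothetical participants per group
        for hemisphere in ("left", "right"):
            # Give the experimental group higher right-hemisphere alpha,
            # mimicking the direction of the reported interaction.
            bump = 0.5 if (group == "experimental"
                           and hemisphere == "right") else 0.0
            rows.append({"subject": f"{group}_{s}",
                         "group": group,
                         "hemisphere": hemisphere,
                         "alpha": rng.normal(4.0 + bump, 0.5)})
df = pd.DataFrame(rows)

# Mixed ANOVA: hemisphere within subjects, group between subjects.
aov = pg.mixed_anova(data=df, dv="alpha", within="hemisphere",
                     subject="subject", between="group")
print(aov[["Source", "F", "p-unc", "np2"]])
```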

The hemisphere × group interaction was significant for the F1/F2 pair (F(1, 25) = 6.982, p = 0.014, η2 = 0.206) and the F3/F4 pair (F(1, 25) = 5.519, p = 0.027, η2 = 0.17). Simple effect analysis showed significant differences in FAA between the left and right hemispheres at F1/F2 (p = 0.019) and F3/F4 (p = 0.03) in the experimental group (see Fig. 4). Specifically, the alpha band power at the right hemisphere was significantly larger than that at the left hemisphere in the experimental group, while there was no significant difference in the control group (see Table 2). There was no significant main effect of either hemisphere or group. Because alpha power is inversely related to cortical activation, these results indicated that participants in the experimental group showed relatively greater activation of the left than the right frontal hemisphere. Given that left hemisphere activity correlates with positive affect and approach motivation, participants using the proposed interactive system experienced more positive emotions, or stronger approach motivation, during the process of investigation than those using the traditional paper-based method.

Fig. 4.

Comparison of the alpha band power between the left and right frontal lobes for the two groups of participants (error bars indicate ±1 standard error).

Table 2. Means and standard errors of the alpha band power for the two groups of participants.

4.3 Performance Data

Time to completion (TTC) was calculated to reflect task performance. A one-way ANOVA was performed with group as a between-subjects variable. As shown in Fig. 5, there was a significant difference in TTC between the experimental and control groups (F(1, 28) = 17.391, p < 0.001, η2 = 0.383). Participants using the proposed interactive system spent less time (mean = 43.033 min, standard error = 0.972 min) to complete the video image investigation task than those using the traditional paper-based method (mean = 59.3 min, standard error = 3.778 min).

Fig. 5.

Comparison of the TTC between two groups of participants (error bars indicate ±1 standard error).

5 Discussion

This study designed an interactive system based on a mental model of video image investigation. A user study was conducted to evaluate the naturalness of the newly designed system in terms of work efficiency, mental workload, and emotional experience. Through the analysis of EEG data, we found that the proposed interactive system led to higher work efficiency, lower mental workload, and better emotional or motivational states compared with the traditional paper-based method. Specifically, theta band power over the frontal area increased when high mental demands were imposed on the participant, while no significant group difference was found in alpha band power. Previous studies showed similar results in the fields of traffic monitoring and power plant control centers [22, 23], in which higher theta band power and lower alpha band activity were observed under high mental workload. The relationship between EEG band power and multiple mental activities (e.g., visual search, monitoring, attention) may explain the current findings [24].

Analysis of the frontal alpha asymmetry showed a significant increase of alpha activity in the right hemisphere for participants using the interactive system. According to the valence model, relatively greater left hemisphere activity (lower left alpha power) is associated with positive emotions, while relatively greater right hemisphere activity is associated with negative emotions. The current FAA results therefore indicated that participants using the interactive system experienced better emotional states. Compared with the traditional paper-based method, participants using the interactive system felt more positive emotions, or were more motivated to approach the system [25]. One possible reason is that the new interactive system makes it easy to add annotations for each target, with hyperlinks that help organize different cases and clues into an integrated whole.

In future work, a field study involving actual video analysts will offer more powerful results and insights for improving the proposed interactive system. Professional video analysts, with their experience and skills in video image investigation, may use the interactive system according to their own preferences. Moreover, portable EEG devices with real-time recognition of mental workload and emotional states will benefit the evaluation.