
1 Introduction

Gaze-based control of user interfaces has been proposed and evaluated in numerous contributions addressing various application domains. Researchers have investigated gaze input for common desktop interaction tasks such as object selection, eye typing, or password entry [1,2,3], for zooming maps or windows [4, 5], for foveated video streaming [6, 7], and for remote control of PTZ cameras in surveillance [8] or teleoperation [9].

All implementations make use of gaze as a natural pointing »device«, as gaze is typically directed to the region of visual interest within the environment. Even though gaze evolved for perception, it has been shown that it can also be utilized for information input. In particular, gaze input is an alternative in situations where manual input is not possible, e.g., due to motor impairment [2], or in hands-busy and attention-switching situations [9]. Moreover, gaze input has proved to be a beneficial alternative for interaction in dynamic scenes, e.g., for selecting moving targets in full-motion video [10], where manual input can be exhausting and challenging.

Recently, the eye tracking manufacturer Tobii started to make eye tracking and gaze interaction suitable for another application domain involving interaction in dynamic scenes: the mass market of computer gaming. They provide the low-cost eye tracking device Tobii »4C« for $149 (€159) [11]. Navigation in the scene of computer games using a first-person perspective is one of the proposed gaze input methods [12]. If the user directs their gaze, e.g., to the right corner of the current scene, the image section changes, with the right corner subsequently becoming the next scene center. Thus, the visual focus of interest is brought to the scene center without any manual intervention. Similar interaction models have been proposed before by several authors investigating gaze input for computer gaming, e.g., [13, 14].

A similar kind of interaction is required when controlling a camera in a video surveillance task. Due to the rich visual input, this task can be very exhausting for the human operator, particularly if the camera is mounted on a moving platform. Hence, any reduction of workload through less demanding human-computer interaction is welcome, as it frees cognitive capacities for the actual surveillance task. A frequently occurring task is keeping track of a moving object, e.g., a person. If the object moves out of the currently displayed image section, the human operator must redirect the camera field of view. Gaze-based control of the camera appears compelling here: the camera stays focused on the object simply by looking at it. That way, the observer's visual attention can remain on the (primary) surveillance task while the (secondary) interaction task is accomplished effortlessly at the same time.

In order to find out whether such gaze interaction is appropriate and convenient, an experimental system was implemented that simulates the control of a virtual camera as navigation in 360° video imagery. The system was evaluated in a user study with 28 participants, comparing gaze control with manual control in the task of visually tracking a moving person.

2 Experimental System

The experimental system was implemented as a Java application that plays 360° video data recorded by the 360fly camera [15]. Figure 1 shows a video frame captured at an altitude of 30 m. For presentation to the user, the raw video data is rectified first; in the next step, an image section of 125° × 70° (width × height) of the rectified 360° video data is displayed on the user interface (Fig. 2). In related work, Boehm et al. [16] introduce a similar system displaying an image section of a 185° fisheye camera.
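The rectification and viewport extraction are not detailed in this paper; purely as an illustration, the following sketch shows one common way a 125° × 70° perspective section could be sampled from an equirectangular (rectified) frame. The class and method names (ViewportExtractor, extractViewport) as well as the pinhole-projection approach are our own assumptions, not the actual implementation.

```java
import java.awt.image.BufferedImage;

/** Minimal sketch (assumed, not the original code): sample a perspective viewport
 *  of 125° x 70° from an equirectangular frame for a given pan/tilt direction. */
public final class ViewportExtractor {

    private static final double H_FOV = Math.toRadians(125); // section width
    private static final double V_FOV = Math.toRadians(70);  // section height

    /** pan/tilt give the viewing direction of the virtual camera in radians. */
    public static BufferedImage extractViewport(BufferedImage equirect,
                                                double pan, double tilt,
                                                int outW, int outH) {
        BufferedImage out = new BufferedImage(outW, outH, BufferedImage.TYPE_INT_RGB);
        double planeW = 2 * Math.tan(H_FOV / 2);
        double planeH = 2 * Math.tan(V_FOV / 2);
        for (int y = 0; y < outH; y++) {
            for (int x = 0; x < outW; x++) {
                // ray through the image plane of a pinhole camera looking along +Z
                double px = (x / (double) (outW - 1) - 0.5) * planeW;
                double py = (0.5 - y / (double) (outH - 1)) * planeH;
                double pz = 1.0;
                // rotate the ray by tilt (around X) and pan (around Y)
                double ry  = py * Math.cos(tilt) - pz * Math.sin(tilt);
                double rz1 = py * Math.sin(tilt) + pz * Math.cos(tilt);
                double rx  = px * Math.cos(pan) + rz1 * Math.sin(pan);
                double rz  = -px * Math.sin(pan) + rz1 * Math.cos(pan);
                // spherical coordinates -> pixel position in the equirectangular frame
                double lon = Math.atan2(rx, rz);                 // [-pi, pi]
                double lat = Math.atan2(ry, Math.hypot(rx, rz)); // [-pi/2, pi/2]
                int sx = (int) ((lon / (2 * Math.PI) + 0.5) * (equirect.getWidth() - 1));
                int sy = (int) ((0.5 - lat / Math.PI) * (equirect.getHeight() - 1));
                out.setRGB(x, y, equirect.getRGB(sx, sy));
            }
        }
        return out;
    }
}
```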

Fig. 1.
figure 1

Video data captured by 360fly at an altitude of 30 m.

Fig. 2.
figure 2

Experimental system: image section displayed full-screen on a 14-inch laptop, equipped with a Tobii 4C eye-tracking device for gaze input and a standard computer mouse providing the manual input alternative.

Gaze interaction is performed using the Tobii 4C eye tracker [11]. The tracker provides gaze data in different modes [17]; in our system, the »lightly filtered« mode is used, and the data passes an additional low-pass filter before being processed in the application. Figure 3 shows the underlying gaze interaction model for navigation in the scene. When the gaze position is located within the center region (white), the displayed image section remains the same and the human operator can calmly inspect that central region. When the gaze is located outside the center region (blue), the image section is re-centered on this gaze position. The farther the gaze is directed away from the center towards the edges or corners, the faster the image section is centered on the new gaze position; similar models have been proposed before for remote camera control in surveillance [8] and teleoperation [9]. The repositioning speed is calculated from the squared Euclidean distance between the current gaze position and the screen center. The maximum speed for image section repositioning (reached when looking at the edges) is 3° per frame at a frame rate of 60 Hz.

Fig. 3.
figure 3

The gaze interaction model visualizing the activation dynamics on the screen: gaze positions in the center region (white) have no effect. Gaze positions outside the center region (blue) re-center the displayed image section on that gaze position; the closer a gaze position is to the edges/corners (darker blue), the faster the image section is re-centered. (Color figure online)
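As a minimal sketch of the speed law described above, the following code maps a gaze position to a re-centering step per frame: no movement inside a central dead zone, a step proportional to the squared Euclidean distance from the screen center outside of it, and a cap of 3° per frame (at 60 Hz). The class name, the exponential smoothing used as the additional low-pass filter, and the dead-zone radius are assumptions for illustration only.

```java
/** Minimal sketch of the gaze-to-pan speed model (assumed names and parameters). */
public final class GazePanController {

    private static final double MAX_STEP_DEG = 3.0; // max re-centering speed: 3° per frame (60 Hz)
    private static final double DEAD_ZONE = 0.15;   // assumed radius of the central "calm" region
                                                    // (normalized screen coordinates)
    private static final double SMOOTHING = 0.3;    // assumed low-pass factor for the gaze signal

    private double smoothX = 0.5, smoothY = 0.5;    // low-pass filtered gaze (normalized, 0..1)

    /**
     * @param gazeX raw ("lightly filtered") horizontal gaze position, normalized to [0,1]
     * @param gazeY raw ("lightly filtered") vertical gaze position, normalized to [0,1]
     * @return pan/tilt step in degrees for this frame, {dPan, dTilt}
     */
    public double[] update(double gazeX, double gazeY) {
        // additional low-pass filtering (simple exponential smoothing)
        smoothX += SMOOTHING * (gazeX - smoothX);
        smoothY += SMOOTHING * (gazeY - smoothY);

        // offset of the smoothed gaze from the screen center
        double dx = smoothX - 0.5;
        double dy = smoothY - 0.5;
        double dist = Math.hypot(dx, dy);
        if (dist < DEAD_ZONE) {
            return new double[] {0.0, 0.0};         // inside the center region: no movement
        }

        // speed grows with the squared Euclidean distance from the center,
        // reaching MAX_STEP_DEG at the screen edge (dist ~ 0.5)
        double speed = MAX_STEP_DEG * Math.min(1.0, (dist * dist) / 0.25);
        return new double[] { speed * dx / dist, speed * dy / dist };
    }
}
```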

The experimental system also allows repositioning of the image section by manual interaction using a computer mouse (Fig. 2). The user selects the image position of visual interest by pressing the left mouse button and then »drags« it, with the button held down, to the desired new position, for example the screen center.
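A corresponding sketch of the drag interaction (again with assumed names and without the rendering code) could look as follows: the pan/tilt of the virtual camera is shifted by the drag distance converted to degrees, so the grabbed image position follows the cursor.

```java
import java.awt.event.MouseAdapter;
import java.awt.event.MouseEvent;

/** Minimal sketch of drag-to-reposition with the left mouse button (assumed names). */
public class DragPanListener extends MouseAdapter {

    private final double degreesPerPixel; // e.g. horizontal FOV divided by viewport width
    private int lastX, lastY;
    private double pan, tilt;             // current viewing direction in degrees

    public DragPanListener(double degreesPerPixel) {
        this.degreesPerPixel = degreesPerPixel;
    }

    @Override
    public void mousePressed(MouseEvent e) {
        lastX = e.getX();                 // remember the grabbed image position
        lastY = e.getY();
    }

    @Override
    public void mouseDragged(MouseEvent e) {
        // shift the image section so that the grabbed position follows the cursor
        pan  -= (e.getX() - lastX) * degreesPerPixel;
        tilt += (e.getY() - lastY) * degreesPerPixel;
        lastX = e.getX();
        lastY = e.getY();
        // the display would be re-rendered here with the new pan/tilt
    }
}
```

In a Swing-based application, such a listener would be registered on the video panel with both addMouseListener and addMouseMotionListener.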

3 Methodology

A pilot study was conducted to gain first insights into the subjective workload of gaze-based (virtual) camera control. 28 subjects (25 male, 3 female; 18 expert video analysts, 10 students and colleagues) performed the experimental task »Keep track of a person« using two different 3-min video sequences. For one sequence, the test task instruction was »Keep track of the person wearing the black jacket«, for the other »Keep track of the person wearing the red jacket« (Fig. 4). The video material was captured at an altitude of 30 m using a 360fly camera mounted on a 3DR Solo drone [18]. The subjects sat at a distance of about 60 cm from the monitor (Fig. 5); the target persons therefore covered about 0.3° × 0.3° of visual angle on screen.
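For reference, the corresponding on-screen size follows from \( s = 2d\tan(\theta/2) \); with \( d \approx 60\,\text{cm} \) and \( \theta = 0.3^{\circ} \), this yields \( s \approx 2 \cdot 60\,\text{cm} \cdot \tan(0.15^{\circ}) \approx 0.3\,\text{cm} \), i.e., the target persons measured only about 3 mm on screen.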

Fig. 4.
figure 4

Screenshot of a test task with a target person. (Color figure online)

Fig. 5.
figure 5

Experimental setup.

To ensure that subjects would have to reposition the scene in order to keep track of the target person, the actors had been instructed to vary their motion trajectory and speed during video recording; thus, they at times moved straight ahead, at times unpredictably, and sometimes briefly disappeared when walking under a tree. Furthermore, the drone, and therefore also the camera, carried out various motion patterns, such as following an actor's trajectory, crossing an actor's trajectory, orbiting around the actors, or rotating at a stationary position. After performing the two test tasks, the subjects answered the NASA-TLX [19, 20] questionnaire in the »Raw TLX« version, which eliminates the weighting process.

For better interpretability of the NASA-TLX results for gaze input, the experimental design also included performing the two test tasks with mouse input and assessing it with the NASA-TLX as well. Half of the subjects performed the test tasks with gaze input first, the other half with mouse input first. Data recording for the 10 non-expert subjects was carried out in our lab; data recording for the 18 expert video analysts was carried out at two locations of the German armed forces.

The procedure was as follows. Subjects were introduced to the experimental task but kept naïve as to the purpose of the investigation. Then, they performed the test tasks under the two interaction conditions one after another. In the gaze input condition, subjects started by performing the eye-tracker calibration provided by the Tobii software, which requires fixating 7 calibration points; the calibration procedure was repeated until the offset between each fixated point and the corresponding estimated gaze position was less than 1° of visual angle. Then, subjects received a different 3-min video sequence for training on the experimental task with that interaction technique. After that, subjects performed the two test tasks, immediately followed by rating their subjective workload using the NASA-TLX questionnaire. The mouse input condition was carried out with the same three steps of training task, test tasks, and NASA-TLX rating. Finally, subjects were asked for their preferred interaction technique. The total duration of a session was about 30 min.

4 Results

The NASA-TLX results show that gaze input was rated as imposing less workload both overall and in every single TLX category. Results are provided as descriptive statistics (means with one standard deviation) in Fig. 6 for all 28 subjects and in Fig. 7 for the expert video analysts only (N = 18). Of those 18 experts, ten had much current practice in video surveillance and were therefore analyzed separately as well; results are shown in Fig. 8.

Fig. 6.
figure 6

Subjective workload with gaze input and mouse input, for all subjects.

Fig. 7.
figure 7

Subjective workload with gaze input and mouse input, for subjects with expertise in video analysis.

Fig. 8.
figure 8

Subjective workload with gaze input and mouse input, for subjects with expertise in video analysis and much current practice in video surveillance.

The NASA-TLX score is low for both interaction techniques, but it is significantly lower for gaze input: a Wilcoxon signed-rank test for paired samples \( (\alpha = 0.05) \) revealed significant differences with p < 0.001 for N = 28 and p < 0.05 for N = 18; the result for the experts with much current practice in video surveillance (N = 10) is not significant (p = 0.153).
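For illustration only, such a paired comparison of per-subject TLX scores could be computed with the Apache Commons Math library as sketched below; the score arrays are placeholders, not the study data.

```java
import org.apache.commons.math3.stat.inference.WilcoxonSignedRankTest;

/** Sketch: paired comparison of TLX scores (placeholder data, not the study data). */
public class TlxComparison {
    public static void main(String[] args) {
        // hypothetical per-subject overall TLX scores for the two conditions
        double[] gaze  = {12, 18,  9, 22, 15, 11, 20, 14};
        double[] mouse = {25, 30, 14, 35, 28, 19, 33, 21};

        WilcoxonSignedRankTest test = new WilcoxonSignedRankTest();
        // exact p-value computation is feasible for small paired samples
        double p = test.wilcoxonSignedRankTest(gaze, mouse, true);
        System.out.printf("Wilcoxon signed-rank test: p = %.4f (alpha = 0.05)%n", p);
    }
}
```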

Analysis of the six subscales revealed further significant differences for all subjects (N = 28): mental demand (p < 0.05), temporal demand (p < 0.01), performance (p < 0.01), effort (p < 0.05), and frustration (p < 0.001). Subscale analysis for the expert video analysts (N = 18) and for the experts with much current practice in video surveillance (N = 10) still shows a significant difference between gaze and mouse input for frustration (p < 0.05), despite the small sample sizes.

For mouse control, it can be observed that subjective workload depends on video analysis expertise and current practice: the more expertise and practice, the lower the subjective workload (which is why the NASA-TLX score difference between gaze and mouse is no longer significant for this group, as reported above). For gaze control, however, subjective workload is very low for all subjects, independent of expertise. Hence, at least for the control of a virtual camera, gaze input appears to be the more appropriate and convenient method.

Asked for their preference, 25 subjects preferred gaze input, 3 preferred mouse input (N = 18 experts: 15 preferred gaze input, 3 mouse input; N = 10 experts with much current practice in video surveillance: 10 preferred gaze input, 1 mouse input).

5 Conclusion

A pilot study was conducted to find out whether gaze input could be an appropriate input technique for camera control (panning and tilting) without any manual intervention. 28 subjects (18 expert video analysts from the German armed forces and 10 non-experts in video analysis) participated in the user study. Each subject performed the experimental task of tracking a target person using both gaze input and mouse input for navigating a virtual camera implemented on the basis of 360° video imagery. The NASA-TLX results showed that subjects rated both interaction conditions as imposing rather little workload; however, gaze input was rated as imposing significantly lower workload than mouse input. Hence, gaze input showed its potential to provide effortless interaction for this application, as it has for many other applications before.

Recently, the experimental system has been refactored and now allows live navigation in 360fly imagery in addition to navigation in recorded 360fly video data. Future work will address gaze control of a real sensor, and user testing will show how workload turns out under such conditions, with interaction latencies due to the necessary gimbal movements. Furthermore, future user studies will include more complex test tasks, such as observing more than one target object, as well as test tasks of longer duration.