1 Introduction

Eye tracking metrics have been found to be useful indicators of visual attention and cognitive workload in numerous application areas, including reading and language comprehension [1], driving [2], individual differences [3], gaming devices [4], and medical applications [5]. Eye tracking devices (eye trackers) are used to collect measurements such as pupil dilation, gaze locations and eye-closing patterns. Recent technical advances in video sensors and miniaturized computing power have resulted in cost-effective, mass-produced eye tracking devices; thus, several low-cost eye trackers have become available to researchers. However, the effectiveness of these low-cost devices for studying human behavior remains under investigation [6,7,8,9,10,11,12,13] and is the subject of this paper. Specifically, we examine a low-cost eye tracker, the Gazepoint GP3 (cost \(\approx \) $500), and objectively evaluate its ability to differentiate pupil dilation metrics under different cognitive loads and luminance conditions. To our knowledge, this is one of the first studies reporting the effectiveness of the Gazepoint GP3 in capturing pupillary data.

Several pupillary metrics have been proposed as useful indices of cognitive context [14,15,16]. Of those, we employ two widely accepted metrics in this paper: one computed in the time domain and the other in the frequency domain. Using data collected by the Gazepoint GP3 eye tracking device, a time domain measure, the task evoked pupillary response (TEPR) [17], and a recently published frequency domain measure, the pupillary power spectral density (PSD) [18], are computed and evaluated as indicators of mental workload under different luminance conditions. It is well established that pupil diameter is affected by both mental workload and luminance conditions [19,20,21,22,23,24]. Therefore, the objective of our experiment is to verify whether the Gazepoint system can be used to study the impact of these two factors on pupil diameter in studies involving cognitive context analysis.

To this end, we conducted a digit span task [19] experiment under different luminance conditions, as explained in Sect. 2. The rest of the paper is organized as follows: data collection and analysis methods are described in Sects. 2 and 3, respectively; the results of the classification analysis are presented and discussed in Sect. 4; and the paper is concluded in Sect. 5.

2 Experiment

2.1 Subjects

Twenty participants ranging in age from 22 to 29 years (\(M = 23.9, SD = 2.41\)) voluntarily participated in the experiment, which was conducted by researchers from the Naval Research Laboratory (NRL) at the Naval Aerospace Medical Institute (NAMI).

2.2 Apparatus

All eye tracking data were collected using the Gazepoint GP3 system. The system was calibrated for each user according to the Gazepoint Application Program Interface (API) manual [25]. The GP3 records pupillary data, specifically the pupil size in pixels for each eye and a corresponding binary quality factor (valid/invalid), at 60 samples/s.

2.3 Task

A visual digit span task (also known as a memory span task), a common technique for assessing working memory capacity, was employed to assess the pupillary response of the participants to mental workload. In this task, participants are presented with a series of numbers and are then asked to recall the digits in the order in which they saw them. Longer series of numbers present more of a challenge for working memory, while shorter series are expected to be easier.

A luminance change task was employed to assess the pupillary response of the participants to screen luminance. While completing the digit span task, participants fixated on a monitor whose background luminance varied (black, gray, and white).

2.4 Procedure

As mentioned in the previous section, participants engaged in a digit span task. Each participant was given four sets of digits, of sizes 3, 5, 7 and 9, under three different screen luminance conditions (black, gray and white). The experiment used a within-subject (i.e., repeated measures) design in which each participant completed all digit span set sizes (randomly ordered and exhaustive) three times for each of the three background colors. Thus, each participant completed a total of 36 (= 4 set sizes \(\times \) 3 colors \(\times \) 3 times) trials. Participants were told to focus on a central fixation cross (a “+” sign \(\sim \)50 pixels tall and wide) that was offset from the background color (80 brighter for the black and gray backgrounds, and 80 darker for the white background). The digits were then presented sequentially at \(\sim \)1 s per number. Following each number set (e.g., “2, 6, 1, 8, 4”), there was a pause of \(\approx \)3 s before a numeric keypad appeared on the monitor. The pupillary measures from this time segment, known as the encoding phase of memory, are analyzed here. Participants then used the mouse to enter the string of numbers (“2, 6, 1, 8, 4”) by clicking on the corresponding keypad numbers in order. The keypad was used to ensure that participants continued to fixate on the screen while they were making a response. When satisfied, the participants clicked the submit button. Participants were not given feedback on their response accuracy. The total time to complete the digit span task ranged from 10 to 15 min, depending on the participant’s response times.
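As an illustration of the trial structure only, the following Python sketch enumerates the 36 trials of one participant. The names (`build_schedule`, `SET_SIZES`, and so on) are hypothetical, and the fixed block order of the backgrounds and the uniformly random digits are assumptions; the paper states only that the set sizes were randomly ordered and exhaustive.

```python
import random

SET_SIZES = (3, 5, 7, 9)
BACKGROUNDS = ("black", "gray", "white")
REPETITIONS = 3


def build_schedule(seed=None):
    """Return one participant's trials as a list of dicts (4 x 3 x 3 = 36)."""
    rng = random.Random(seed)
    schedule = []
    for background in BACKGROUNDS:  # assumption: backgrounds run in fixed blocks
        for repetition in range(1, REPETITIONS + 1):
            sizes = list(SET_SIZES)
            rng.shuffle(sizes)  # set sizes randomly ordered, each used once
            for size in sizes:
                # assumption: digits drawn uniformly from 0-9
                digits = [rng.randint(0, 9) for _ in range(size)]
                schedule.append({
                    "background": background,
                    "repetition": repetition,
                    "set_size": size,
                    "digits": digits,
                })
    return schedule


schedule = build_schedule(seed=1)
print(len(schedule))  # 36
```

Each (background, set size) pair appears exactly three times, which matches the repeated measures design described above.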

3 Data Analysis

The Gazepoint GP3 collects the following pupillary data: the pupil size in pixels for each eye, with a corresponding binary quality factor (valid/invalid), at 60 samples/s; and a unitless scale factor for each pupil, which equals 1 at the calibration depth, is less than 1 when the user is closer to the eye tracker, and is greater than 1 when the user is farther away. Only data from the encoding time segment are analyzed in this work, since human factors researchers have established that the maximum pupil dilation occurs during the encoding of the stimulus materials in short term memory recall tasks [26, 27].

3.1 Data Preprocessing

For time-domain analysis (TEPR), the poor quality samples (quality factor = 0) of the pupil size signals were marked as missing values (NaN in MATLAB® [28]). Pupil size data from the eye with fewer missing observations [29] were used for analysis. A “clean-up” function was employed to remove all data below the 4th percentile and above the 98th percentile, in order to remove any sudden dips or peaks in the pupil size signal. Then, a Hampel filter (of order 6) [30] was applied to remove outliers, and a linear interpolator was used to recover missing values. Figure 1a shows an example of raw and filtered data signals.
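The time-domain steps above can be sketched as follows. This is a minimal Python illustration using only the standard library, not the MATLAB code used in the study. All function names are hypothetical. The Hampel half-window of 3 samples (one reading of “order 6”) and the 3-sigma threshold are assumptions, as is holding the nearest valid value at the signal edges during interpolation.

```python
import math
import statistics


def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in 0..100) of a sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def mark_invalid(sizes, valid):
    """Replace samples whose quality factor is 0 with NaN."""
    return [s if v else math.nan for s, v in zip(sizes, valid)]


def clip_percentiles(x, low=4, high=98):
    """Set samples outside the [low, high] percentile band to NaN."""
    good = sorted(v for v in x if not math.isnan(v))
    if not good:
        return list(x)
    lo_v, hi_v = percentile(good, low), percentile(good, high)
    return [v if (not math.isnan(v) and lo_v <= v <= hi_v) else math.nan
            for v in x]


def hampel(x, half_window=3, n_sigmas=3.0):
    """Replace samples far from the local median with that median."""
    out = list(x)
    for i, v in enumerate(x):
        if math.isnan(v):
            continue
        window = [w for w in x[max(0, i - half_window): i + half_window + 1]
                  if not math.isnan(w)]
        med = statistics.median(window)
        mad = statistics.median(abs(w - med) for w in window)
        if abs(v - med) > n_sigmas * 1.4826 * mad:
            out[i] = med
    return out


def interpolate_nans(x):
    """Fill NaN runs linearly; edge runs take the nearest valid value."""
    idx = [i for i, v in enumerate(x) if not math.isnan(v)]
    if not idx:
        raise ValueError("no valid samples to interpolate from")
    out = list(x)
    for i in range(idx[0]):
        out[i] = x[idx[0]]
    for i in range(idx[-1] + 1, len(x)):
        out[i] = x[idx[-1]]
    for a, b in zip(idx, idx[1:]):
        for i in range(a + 1, b):
            out[i] = x[a] + (x[b] - x[a]) * (i - a) / (b - a)
    return out


def preprocess(left, left_valid, right, right_valid):
    """Pick the eye with fewer missing samples, then clean and interpolate."""
    l = mark_invalid(left, left_valid)
    r = mark_invalid(right, right_valid)
    n_missing = lambda s: sum(math.isnan(v) for v in s)
    x = l if n_missing(l) <= n_missing(r) else r
    return interpolate_nans(hampel(clip_percentiles(x)))
```

With a large share of samples missing, the interpolation step determines much of the output signal, so it is worth inspecting the result against the raw trace.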

For frequency-domain analysis (PSD), the linear trend in the above preprocessed signals was removed using the detrend function in MATLAB®. The resulting signals were then passed through a zero-phase lowpass Butterworth filter with a cutoff frequency \(f_c = 4\) Hz using the filtfilt function, since most pupillary activity falls in the frequency range of 0–4 Hz [31]. Figure 1b shows an example of detrended and filtered data signals.
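For readers without MATLAB, the detrending and zero-phase filtering can be approximated with the Python sketch below. It is an illustration under stated assumptions, not the study's implementation. The filter order is not given in the text, so a second-order Butterworth design (via the bilinear transform) is assumed. The forward-backward pass starts from zero initial conditions and omits the edge padding that MATLAB's filtfilt applies, so samples near the ends of the signal will differ.

```python
import math


def detrend_linear(x):
    """Subtract the least-squares straight line from x."""
    n = len(x)
    t_mean = (n - 1) / 2.0
    x_mean = sum(x) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - x_mean) for t, v in enumerate(x))
    slope = sxy / sxx
    return [v - (x_mean + slope * (t - t_mean)) for t, v in enumerate(x)]


def butter2_lowpass(fc, fs):
    """Coefficients (b, a) of a 2nd-order Butterworth lowpass filter."""
    k = math.tan(math.pi * fc / fs)  # pre-warped cutoff
    norm = 1.0 + math.sqrt(2.0) * k + k * k
    b0 = k * k / norm
    b = (b0, 2.0 * b0, b0)
    a = (1.0,
         2.0 * (k * k - 1.0) / norm,
         (1.0 - math.sqrt(2.0) * k + k * k) / norm)
    return b, a


def iir_filter(b, a, x):
    """Second-order direct-form I filter with zero initial conditions."""
    y = []
    for n, xn in enumerate(x):
        acc = b[0] * xn
        if n >= 1:
            acc += b[1] * x[n - 1] - a[1] * y[n - 1]
        if n >= 2:
            acc += b[2] * x[n - 2] - a[2] * y[n - 2]
        y.append(acc)
    return y


def zero_phase(b, a, x):
    """Filter forward, then backward, so the phase shifts cancel."""
    forward = iir_filter(b, a, x)
    backward = iir_filter(b, a, forward[::-1])
    return backward[::-1]


# Example: 60 samples/s signal with a slow drift and a 20 Hz component
fs = 60.0
signal = [0.01 * n + math.sin(2 * math.pi * 20.0 * n / fs) for n in range(600)]
b, a = butter2_lowpass(fc=4.0, fs=fs)
filtered = zero_phase(b, a, detrend_linear(signal))
```

Because the filter is applied twice, the overall magnitude response is the square of the single-pass response, so components well above 4 Hz are attenuated more strongly than a single pass would suggest.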

Fig. 1. Pupil size signal preprocessing

3.2 Data Analysis

Task Evoked Pupillary Response (TEPR): To evaluate the ability of the eye tracker to capture the changes in pupil diameter caused by changes in mental workload, we analyzed the data of set sizes 3 (labeled EASY), 5 (labeled MEDIUM) and 7 (labeled HARD) only. Set size 9 was excluded from the analysis because recall performance dropped to 65% (i.e., participants remembered only 65% of the 9 numbers) and variability between participants increased, suggesting that it was either too difficult for some participants or that some participants gave up. For classification purposes, the median values of the pupil size in the encoding phase (TEPR) were computed for each person, set size, background color and trial (e.g., the pupil size of person 13 for set size 3 on a black background in the first trial) over a sliding window of 30 samples with an overlap of 25 samples (\(\approx \)83% overlap). A simple cut-point grouping into binary classes was implemented on the corresponding pairs of moving-median filtered signals for set sizes 3 (EASY) vs. 7 (HARD), 3 (EASY) vs. 5 (MEDIUM) and 5 (MEDIUM) vs. 7 (HARD). The Receiver Operating Characteristic (ROC) curves [32] were drawn by varying the cut-point from the minimum value of the two signals to their maximum value, in steps of 0.01 pixels.
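The sliding-window median and the cut-point sweep can be illustrated with the following Python sketch. The function names are hypothetical. Three choices are assumptions: samples at or above the cut-point are labeled as the higher workload class, one extra cut-point is added above the maximum so that the curve reaches the origin, and the accuracy is taken at the best cut-point. The text does not state how accuracy was derived from the ROC curves.

```python
import math
import statistics


def sliding_median(x, window=30, overlap=25):
    """Median over windows of `window` samples that advance by window - overlap."""
    step = window - overlap
    return [statistics.median(x[i:i + window])
            for i in range(0, len(x) - window + 1, step)]


def roc_sweep(low_load, high_load, step=0.01):
    """Sweep a cut-point from the joint minimum to the joint maximum.

    Returns the (false positive rate, true positive rate) points and the
    best accuracy seen over all cut-points.
    """
    lo = min(min(low_load), min(high_load))
    hi = max(max(low_load), max(high_load))
    n_cuts = int(math.floor((hi - lo) / step)) + 1
    # assumption: one extra cut above the maximum closes the curve at (0, 0)
    cuts = [lo + k * step for k in range(n_cuts)] + [hi + step]
    points, best_acc = [], 0.0
    total = len(low_load) + len(high_load)
    for c in cuts:
        tp = sum(v >= c for v in high_load)  # high load called high
        fp = sum(v >= c for v in low_load)   # low load called high
        tn = len(low_load) - fp
        points.append((fp / len(low_load), tp / len(high_load)))
        best_acc = max(best_acc, (tp + tn) / total)
    return points, best_acc


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
```

In use, `sliding_median` would be applied to the encoding-phase signal of each condition, and the two resulting sequences (for example, set size 3 and set size 7 on the same background) passed to `roc_sweep`.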

Power Spectral Density (PSD): The PSD of the pupil diameter signals was computed for each person using Welch’s method with segments of 50 samples and 50% overlap [18]. Each segment was windowed with a Hamming window. Only the ‘encoding’ phase was considered when computing the PSD for the memory tasks of set size 3 (EASY), set size 5 (MEDIUM) and set size 7 (HARD). The PSD presented here is the average over 20 participants \(\times \) 3 trials, i.e., over a total of 60 trials for each background luminance color.
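A minimal Python sketch of Welch's method with these settings (50-sample segments, 50% overlap, Hamming window, 60 samples/s) is given below for illustration. It uses a direct discrete Fourier transform from the standard library rather than MATLAB's routines. The one-sided density scaling and the absence of per-segment detrending are assumptions, since the text does not specify them.

```python
import cmath
import math


def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def welch_psd(x, fs=60.0, nperseg=50, overlap=0.5):
    """One-sided Welch PSD estimate; returns (frequencies in Hz, PSD values)."""
    step = int(nperseg * (1.0 - overlap))
    starts = range(0, len(x) - nperseg + 1, step)
    if not starts:
        raise ValueError("signal is shorter than one segment")
    w = hamming(nperseg)
    scale = fs * sum(v * v for v in w)  # density scaling
    n_bins = nperseg // 2 + 1
    psd = [0.0] * n_bins
    for s in starts:
        seg = [x[s + i] * w[i] for i in range(nperseg)]
        for k in range(n_bins):
            dft = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / nperseg)
                      for n in range(nperseg))
            p = abs(dft) ** 2 / scale
            if 0 < k < nperseg / 2:
                p *= 2.0  # fold negative frequencies into the one-sided PSD
            psd[k] += p
    freqs = [k * fs / nperseg for k in range(n_bins)]
    return freqs, [p / len(starts) for p in psd]


# Example: a 2.4 Hz sinusoid sampled at 60 samples/s for 5 s
x = [math.sin(2 * math.pi * 2.4 * n / 60.0) for n in range(300)]
freqs, psd = welch_psd(x)
peak_hz = freqs[psd.index(max(psd))]
```

With 50-sample segments at 60 samples/s the frequency resolution is 1.2 Hz, which is coarse relative to the 0.1–0.5 Hz band discussed in Sect. 4. The per-trial PSDs would then be averaged across participants and trials, as described above.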

4 Results and Discussion

At the preprocessing stage, an average of 37% of the data was missing due to poor quality recordings. Figure 2 shows the boxplots of average pupil diameters across the different background luminance and workload conditions. The average pupil diameter on a black background is larger than that on the gray background which, in turn, is larger than that on the white background; this pattern agrees with earlier pupillary light reflex studies and supports the GP3’s capability to capture light-sensitive pupillary readings. Figure 2 also shows the differences in average pupil diameter for different workload tasks within the same background condition: the average pupil diameter for set size 3 is lower than that for set size 7 under all three luminance conditions. However, the pupil diameter for set size 5 is not clearly larger than that for set size 3, nor clearly smaller than that for set size 7, under the black and gray background conditions.

Fig. 2. Boxplot of average pupil diameters under different backgrounds and mental workloads

To further analyze the differences in TEPRs corresponding to the different set sizes, we plotted the ROC curves from the classification described in Sect. 3. An example set of ROC curves for one person is shown in Figs. 3, 4 and 5. For this particular example, Fig. 3 shows 100% accuracy in classifying pupil size signals of set size 3 vs. 7 for all three background conditions, whereas Fig. 4 shows 68% accuracy for set size 3 vs. 5 under the gray background and Fig. 5 shows 78% accuracy for set size 5 vs. 7 under the white background. Table 1 gives the average classification accuracy values over all participants and over all 3 repeated trials. The minimum average classification accuracy is approximately 80%, which psychologists consider a significant value for detecting human cognitive context.

Fig. 3. ROC curves from classification of TEPRs between set size 3 and 7

Fig. 4. ROC curves from classification of TEPRs between set size 3 and 5

Fig. 5. ROC curves from classification of TEPRs between set size 5 and 7

Table 1. Average accuracies in TEPR classification

Figure 6 shows the results of the PSD analysis, where Figs. 6a–c correspond to the black, gray and white background conditions, respectively. The results agree with earlier studies only for the comparison of set size 3 against set sizes 5 and 7. They do not conform to the finding that the average PSD in the frequency ranges of 0.1–0.5 Hz and 1.6–3.5 Hz increases with cognitive workload, since the average PSD for set size 5 is greater than that for set size 7. This could be due to the recovery of lost data points with a linear interpolator, or to similar spectral behavior of the pupils during set sizes 5 and 7. Also, to our knowledge, no detailed mechanism relating pupil control to the PSD has yet been established. Future research will integrate the PSD metrics into classification studies in an attempt to validate the findings of Peysakhovich et al. [18] and Nakayama and Shimizu [31].

Fig. 6. Power spectral density under different workload conditions

5 Summary and Conclusion

In this paper, we evaluated the performance of the Gazepoint GP3, a low-cost eye tracker, using pupillary metrics that have already been tested and used by human factors researchers: TEPRs and PSD. We collected pupil size data from 20 volunteers engaged in a visual digit span task. First, a preprocessing routine was employed to filter out outliers from the data for time domain analysis, and low pass filtering was performed prior to frequency domain analysis. Then, TEPRs and PSDs were computed and studied for different digit set sizes. The classification performance was computed in the form of receiver operating characteristic (ROC) curves, and the results show both the applicability and the limitations of low-cost eye tracking devices for cognitive workload research.

The results indicate that the Gazepoint GP3 is an easy-to-use and inexpensive tool that can be utilized in psychological studies involving pupil diameter data. The classification results indicate that the eye tracker performs well in classifying mental workloads under different background luminance conditions; however, it is not a robust tool for frequency domain analysis, which may be attributable to the linear interpolation of poor quality readings. Researchers with budget constraints who are interested in incorporating pupillary measures of cognitive workload now have access to a reliable, inexpensive eye tracker. However, they should keep in mind that the GP3 is limited to collecting pupil diameter data for tasks that use a single screen and is vulnerable to the loss of chunks of data. Finally, we believe that low-cost eye trackers are of great value to researchers from all disciplines who are trying to incorporate human factors aspects into their systems.