1 Introduction

The goal of using machine learning on data generated by neural sensors is to predict or identify a user’s mental state in real time. The roadmap to achieving that goal usually involves conducting controlled experiments in which subjects are exposed to a series of control and treatment conditions. Depending on the experimental setup, the subject’s mental state under the treatment condition can be identified through self-report measures, through the nature of the task and its expected effect on mental state, or through the subject’s task performance. The resulting labels are then used as the target variables for supervised machine learning algorithms. Each of these approaches has drawbacks in terms of validity and reliability, which may lead us to train the algorithms on incorrectly labeled data (Hoskin 2012).

In an ideal world, the problems of mislabeled data would average out over large data sets – machine learning is successfully applied in many spaces where a high volume of data is available to train the models. However, in the applied neuroscience space the number of cases is usually very small and the dimensionality of the data is very large, which can easily lead to overfitted models. This is one of the reasons why developing models that work across subjects, experimental conditions, and/or treatments is very difficult.

To sum up the challenge in a single sentence, we are trying to build predictive models on unreliably tagged data under the curse of dimensionality.
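
To make the dimensionality half of this challenge concrete, the following minimal sketch (our own illustration; the subject count, feature count, and use of scikit-learn are assumptions, not taken from any cited study) shows how a classifier trained on a small, high-dimensional data set can fit pure noise perfectly while generalizing no better than chance.

```python
# Illustrative sketch (values are our own, not from any cited study): with
# few samples and many features, a classifier can fit pure noise perfectly
# on the training set while generalizing no better than chance.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_subjects, n_features = 20, 500               # small n, large p (hypothetical)
X = rng.normal(size=(n_subjects, n_features))  # random "features": no signal
y = np.tile([0, 1], n_subjects // 2)           # balanced labels, unrelated to X

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))             # typically 1.0
print("cross-validated accuracy:",
      cross_val_score(clf, X, y, cv=5).mean())           # near chance (~0.5)
```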

There are valid reasons for undertaking this effort. These efforts will enable future systems that adapt to our mental, physical, and emotional state in real time, helping us make better decisions, gain deeper insights, and solve bigger problems, from medical diagnoses to adaptive military technology (Gateau et al. 2015; Girouard 2010; Naseer and Hong 2015; Marx et al. 2015). Developing such systems will involve integrating voluminous data from multiple sensors, a task for which machine learning is especially well suited. This paper addresses one of the handful of challenges associated with building adaptive systems – identifying the ground truth to build upon when developing models and systems.

The remainder of the paper is structured as follows: first, a brief description of neurophysiological sensors is provided; then, an overview of the approaches to labeling data for algorithm-training purposes is presented, along with a discussion of the validity and reliability problems associated with each approach. The paper concludes with a discussion of potential solutions and directions for future research.

2 Generating Data

Neurophysiological sensors rely on different physical mechanisms to measure activity in the brain. For example, electroencephalography (EEG) measures the electrical activity generated by neurons firing within the brain, functional near-infrared spectroscopy (fNIRS) measures blood flow to and from areas of the brain driven by the activation and deactivation of specific regions, and functional magnetic resonance imaging (fMRI) tracks blood flow in a manner similar to fNIRS.

The main point to consider when thinking about labeling neurophysiological data for machine learning purposes is that the sensor generates one row of data every time it samples. For example, an fNIRS device can be set to sample at 10 Hz, generating approximately 36,000 rows of data for a one-hour experiment. Each row contains some number of data points associated with the channels in the device, which we can call features. Supervised learning algorithms use those features and derivative features to build a model that predicts a label. Most algorithms perform this model creation and evaluation by passing over the data repeatedly, iteratively refining the weights given to each of the features until the algorithm satisfies its optimization criteria. The labeling process involves estimating the subject’s mental or emotional state at each of those points in time and assigning a category code (the label, or “class label”) to that point or interval of time.
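
As a purely illustrative picture of this data layout, the sketch below builds one row per sample for a hypothetical 10 Hz, one-hour recording and attaches an interval label to each row; the channel count, condition names, and interval boundaries are our own assumptions for the example.

```python
# Minimal sketch of turning a sampled recording into labeled rows.
# Channel count, sampling rate, and condition names are hypothetical.
import numpy as np
import pandas as pd

fs = 10            # sampling rate in Hz (e.g., a typical fNIRS device)
duration_s = 3600  # a one-hour session -> 36,000 rows
n_channels = 16    # hypothetical number of optical channels

t = np.arange(duration_s * fs) / fs                     # timestamp per sample
signals = np.random.randn(len(t), n_channels)           # placeholder features

# Interval labels taken from the protocol: (start_s, end_s, class_label).
intervals = [(0, 1800, "rest"), (1800, 3600, "task")]

labels = np.empty(len(t), dtype=object)
for start, end, name in intervals:
    labels[(t >= start) & (t < end)] = name

df = pd.DataFrame(signals, columns=[f"ch{i}" for i in range(n_channels)])
df.insert(0, "time_s", t)
df["label"] = labels                                    # one row per sample
print(df.shape)   # (36000, 18): 36,000 rows = 16 features + time + label
```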

The labeling process results in at least a few errors in the boundary regions between state shifts, because identifying the exact point in time when a cognitive state changes is not currently possible. Furthermore, we argue in this paper that the labels in general are not entirely reliable due to limitations of the available approaches, and as a result, the trained models are not reliable. Using incorrectly labeled data to train a supervised algorithm is analogous to teaching a child to add by giving her a set of worked addition problems whose answers are correct only some of the time, then expecting the child to know how to add when given a new set of problems.
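
One possible way to limit the boundary problem (our own illustration, not a method proposed in this paper) is to discard samples that fall within a transition window around each label change; the five-second window below is an arbitrary assumption.

```python
# One possible mitigation (our own illustration, not a method from the paper):
# discard samples that fall within a transition window around each label
# change, since the exact moment of the state shift is unknown.
import numpy as np

def mask_transition_samples(t, labels, window_s=5.0):
    """Return a boolean mask that is False near label boundaries."""
    labels = np.asarray(labels)
    keep = np.ones(len(labels), dtype=bool)
    change_idx = np.nonzero(labels[1:] != labels[:-1])[0] + 1
    for idx in change_idx:
        keep &= np.abs(t - t[idx]) > window_s
    return keep

# Example: 10 Hz timestamps with a single label change at t = 30 s.
t = np.arange(600) / 10.0
labels = np.where(t < 30, "low_load", "high_load")
keep = mask_transition_samples(t, labels)
print(keep.sum(), "of", len(t), "samples retained")   # 499 of 600
```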

Before going into detail to justify our argument, we start with a brief description of the approaches researchers have used to determine those labels and the justifications for those choices (Fairclough and Gilleade 2014; Noah et al. 2015; Liu et al. 2015).

3 Approaches to Labeling Data

We identified three approaches to labeling the data – response-based, task-based, and performance-based. Here we will refer to the label as the “ground truth” – what the researcher believes to be the best approximation of the subject’s mental or emotional state. Each approach has strengths and weaknesses, and none appears to be clearly superior.

3.1 Response-Based Labeling

Response-based labeling uses the subject’s subjective interpretation of their own mental state as the ground truth. For example, researchers have used the Self-Assessment Manikin (Bradley and Lang 1994; Balconi et al. 2015; Bandara et al. In press) and the NASA Task Load Index (NASA-TLX). Fundamentally, this boils down to asking the subject: were you upset, overloaded, angry, sad, etc.? This requires a certain amount of self-awareness on the part of the subjects, a fair degree of honesty (Paulhus and Vazire 2009), and a good recollection of how they felt over a period of time without succumbing to recency bias (Sackett and Larson Jr. 1990; Morrison et al. 2014).

Each of the instruments listed above is considered a well-validated gross measure of emotional or cognitive state. However, sensors sample anywhere between 1 and 1000 times per second, which means the subject’s state needs to be accurately labeled for each of those sampling intervals. For example, fNIRS focuses on the hemodynamics of the brain; most devices sample somewhere in the vicinity of 10 Hz, tracking regional changes in blood flow that unfold over a 6–8 s window. Yet we use a subject’s best estimate of their mental state over a 16–30 s window, hoping that their most recent impression does not prompt them to ignore the state that prevailed for most of that window. Researchers have noted repeatedly that self-reports do not have guaranteed accuracy, with some suggesting that a best practice is to triangulate them with other known, validated measures (Liu et al. 2015; Rusnock et al. 2015).
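
A quick back-of-the-envelope calculation makes the mismatch concrete; the sampling rate and window lengths below are representative values drawn from the ranges quoted above, not figures from any particular study.

```python
# Back-of-the-envelope illustration of the granularity mismatch described
# above, using representative values from the ranges quoted in the text.
fs = 10                    # fNIRS sampling rate in Hz
hemodynamic_window_s = 7   # roughly the 6-8 s hemodynamic response
report_window_s = 30       # upper end of the 16-30 s self-report window

samples_per_report = fs * report_window_s             # 300 rows share one label
responses_per_report = report_window_s / hemodynamic_window_s
print(f"one self-report labels {samples_per_report} rows, spanning "
      f"~{responses_per_report:.1f} hemodynamic response windows")
```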

Response-based labeling has certain advantages – it can be used to triangulate task-based labeling (see below), or to explore meaningful concepts using protocols that are known to reliably induce cognitive or emotional responses. An additional advantage is that it may make it easier to connect the machine learning body of literature to other HCI literature that still relies heavily on self-report measures (Rek et al. 2013; Lottridge 2009; Olson and Kellogg 2014).

3.2 Task-Based Labeling

Task-based labeling involves using tasks that are known or expected to elicit certain mental or emotional states reliably (Ang et al. 2012; An et al. 2013). Task-based labeling is not always reliable, however, even for well-established tasks. For example, researchers developed and tested a game that was designed to have multiple levels of difficulty and was thus expected to provoke different levels of engagement; during their experiment, two subjects did not notice the difference in difficulty levels while seven did (Girouard et al. 2009). Any attempt to build models using fixed channels as inputs, where the channels are expected to map to specific areas of the brain, faces challenges as well: in some studies, handedness influences cerebral blood flow on certain tasks (Cuzzocreo et al. 2009). Finally, task-based labeling is built on the assumption that participants are actually engaged in the task.

Task-based labeling has certain advantages, the most notable being that it avoids the limitations of response-based labeling described above. There are two additional benefits – (1) it allows researchers to more accurately track expected changes in cognitive state, because those expected changes can be synced to changes in the task; and (2) researchers do not have to interrupt the flow of the experiment to ask the subject to rate his or her experience.
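
As an illustration of benefit (1), the sketch below assigns labels by syncing them to a hypothetical task schedule exported from the experiment software; the onsets, condition names, and sampling rate are assumptions for the example only.

```python
# Minimal sketch of task-based labeling (illustrative only; the task schedule,
# condition names, and sampling rate are hypothetical): labels come from the
# experiment's timeline rather than from the subject's self-report.
import numpy as np

fs = 10                              # sampling rate in Hz
t = np.arange(0, 600, 1.0 / fs)      # ten minutes of samples

# Task schedule exported from the experiment software: (onset_s, condition).
events = [(0, "rest"), (60, "easy_level"), (240, "hard_level"), (480, "rest")]

labels = np.empty(len(t), dtype=object)
for (onset, condition), nxt in zip(events, events[1:] + [(t[-1] + 1, None)]):
    labels[(t >= onset) & (t < nxt[0])] = condition

# Because labels are synced to task onsets, there is no need to interrupt
# the subject mid-experiment.
print(dict(zip(*np.unique(labels, return_counts=True))))
```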

3.3 Performance-Based Labeling

Performance-based labeling involves establishing ground truth by measuring the subject’s performance on a specific task. The general chain of assumptions appears to be that (a) the task relies on known cognitive processes, (b) performance on the task requires effort, and (c) successful performance is correlated with activation while failure is correlated with a lack of activation. An example of performance-based labeling can be found in James et al. (2010), where the researchers estimated the cognitive burden generated by learning a visual-motor task by measuring the distance from the cursor to the target on the screen.
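
A minimal sketch of this style of labeling, loosely in the spirit of the cursor-tracking example, might look like the following; the error signal, window length, and threshold are placeholder assumptions of our own, not values from that study.

```python
# Sketch of performance-based labeling loosely in the spirit of the
# cursor-tracking example above; the error signal, window length, and
# threshold below are placeholder assumptions, not values from that study.
import numpy as np

fs = 10
t = np.arange(0, 120, 1.0 / fs)        # two minutes of samples at 10 Hz

# Placeholder cursor-to-target distance (pixels) at each sample.
distance_px = 80 + 40 * np.sin(t / 10.0) + np.random.randn(len(t)) * 5

# Label each 5-second window by its median tracking error:
# small error -> "high_performance", large error -> "low_performance".
window = 5 * fs
labels = np.empty(len(t), dtype=object)
for start in range(0, len(t), window):
    sl = slice(start, start + window)
    err = np.median(distance_px[sl])
    labels[sl] = "high_performance" if err < 80 else "low_performance"

print(np.unique(labels, return_counts=True))
```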

Performance-based labeling avoids the pitfalls of self-report measures in that it offers fine temporal granularity and does not require subjects to estimate their own state. It also avoids some of the limitations of task-based measures, most notably the concern that subjects may or may not be engaged in the task. Performance-based labeling does not, however, address the difficulty of accurately localizing activation in individuals, although determining the subject’s handedness appears to account for a large portion of behavioral lateralization (Lawlor-Savage and Goghari 2014).

4 Conclusion

In this paper we presented a provisional taxonomy of approaches for determining the ground truth of emotional or cognitive states in experiments that apply machine learning to neurophysiological data. Each approach has strengths and weaknesses, and researchers can either determine that those limitations are acceptable for their purposes or employ triangulation procedures to improve their confidence in the measures. We are not arguing that researchers should always use triangulation (although it would be beneficial); instead, we would like to start a discussion on how to estimate the upper boundary of accuracy achievable by these algorithms, given that the models were trained on data with a low n and labels that were only partially accurate.
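
One possible starting point for that discussion is a simplified bound of our own, assuming symmetric and independent label noise on a binary classification problem: if a fraction of the evaluation labels is wrong, even a perfect model cannot appear perfect when scored against them.

```python
# Simplified illustration (our own, assuming symmetric and independent label
# noise on a binary problem): if a fraction eps of the evaluation labels is
# wrong, the accuracy measurable against those labels is capped well below 1.
def measured_accuracy(true_accuracy: float, label_error_rate: float) -> float:
    """Accuracy measured against noisy binary labels.

    A correct prediction is only credited when the label is also correct,
    and an incorrect prediction is credited when the label happens to be
    flipped as well.
    """
    eps = label_error_rate
    return true_accuracy * (1 - eps) + (1 - true_accuracy) * eps

# Even a perfect model appears to top out below 100%:
for eps in (0.05, 0.10, 0.20):
    print(f"label error {eps:.0%} -> apparent accuracy ceiling "
          f"{measured_accuracy(1.0, eps):.0%}")
```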