
1 Introduction

Gerjets et al. [1] describe optimum learning conditions as those that deliver learning at the appropriate level and pace for the learner. To tailor teaching material accordingly, it is first necessary to determine the difficulty level of the material as perceived by the learner. However, assessment measures such as counting correct versus incorrect exam responses may not be a good indicator of students' understanding. Learners often misjudge their own level of understanding, which leads to incorrect tailoring of the material, pace and so on.

It therefore becomes necessary to use measures that can correctly predict the difficulty level. Subjective and dual-task procedures can serve this purpose, and can produce less noisy data and promising results, but they are likely to interrupt, and possibly annoy, subjects during the experiments [2].

Electroencephalography (EEG) is a suitable approach for unobtrusive and continuous measurement of task difficulty [3]: it captures the brain's response to the learning material presented and therefore offers a direct measure of the task difficulty level (TDL). Furthermore, EEG is non-invasive, portable and relatively cheap compared with other measures of brain activity such as functional magnetic resonance imaging (fMRI).

Klimesch [4] proposed using the event-related desynchronisation (ERD) feature extracted from EEG as a measure of task difficulty. ERD measures the extent to which neuron populations no longer oscillate synchronously when processing a given task [5]. Band energies in specific EEG bands such as delta, alpha and beta over frontal areas of the brain have also been used to predict memory load [6,7,8]. Here, we set out to use more channels to cover more areas of the brain and to combine inter-hemispheric asymmetry ratio (ASR) features [9] as an additional measure of cognitive load. We also use subjective measurement with the NASA TLX index [10].

The band energies, ERD and ASR features are used individually and in combination with six different classifiers: Quadratic Discriminant Analysis (QDA), Support Vector Machine (SVM), Naïve Bayes (NB), k-Nearest Neighbour (KNN), neural network (NN) and random forest decision tree (TREE), to classify the programming mental task as either easy or difficult. We also employ a classifier confidence approach to further increase prediction performance. The Java programming language was used here as it is popular in Computer Science programmes throughout the world, but any programming language could have been used instead.

2 Methodology

2.1 Experimental Paradigm

Nine subjects were recruited from a pool of postgraduate students at the School of Computing, University of Kent, who had at least six months of Java experience or had taken a Java programming module as part of their postgraduate course. Of the nine subjects, seven were male and two female. Subjects' ages ranged between 20 and 37 years (mean = 26±3.74). However, data from two male subjects could not be used, as they did not complete a baseline task that was necessary to compute the ERD features (discussed later).

Ethical approval was obtained from the University of Kent Sciences Research Ethics Committee; subjects signed a voluntary consent form and were paid £15 each. The subjects were briefed on the tasks, and the experiment was designed such that the subjects would understand the given program and perform the code execution mentally. Subjects had to give the final output of the program code as their answer; this method was chosen to avoid inductive bias. All code was written in Java. Initially, a total of 20 Java programs were developed across three categories (spatial relation, visual object grouping, mathematical execution), each at two different TDL (easy, difficult). From these, six Java programs deemed easy or difficult by questionnaire respondents were selected (three for the easy and three for the difficult category).

The easy and difficult TDL were pre-determined using questionnaire responses from 15 subjects who were not involved in the EEG data collection. These volunteers (age: 28.8±4.63; 9 males and 6 females; all unaffiliated with the University of Kent) had sufficient Java experience, being either currently working in or proficient in Java, with a mean experience of 30.53±3.56 months. This good Java experience ensured a reliable 'ground truth' for assigning task difficulty levels; there was no statistical difference in age range between these volunteers and those in the EEG-based study. These subjects completed a questionnaire rating the time spent and the task difficulty level for each task. The difficulty rating ranged from 1 to 10 (where 1 is a very easy task and 10 is impossible to solve mentally). Only questionnaires with correct answers to the questions were considered. The task categories to be solved were:

  • Spatial relation tasks that tested subjects' spatial reasoning skills, such as visualising the shape of objects mentally. For example, visualising two rectangle objects mentally from their x- and y-axis coordinates, width and height, and deciding whether the two rectangles overlap.

  • Visual object grouping tasks that utilised subjects' working memory to correctly recall a swapped, mapped or sorted group of shape objects. For example, given a number of shape objects mapped to variables and grouped in an array in a different order, the subject had to map each variable name to the correct shape object and output those objects in order.

  • Mathematical execution tasks where the subject had to perform arithmetic calculations mentally. For example, the subject had to compute the mean of an array of integers.

Prior to performing the tasks, subjects were asked to relax for one minute (EEG was also collected during this time as a baseline). Table 1 shows the GUI steps in collecting the EEG data; steps 3 and 4 were repeated until all six programs had been shown (in random order). Figure 1 shows an example of the task screen.

Table 1. GUI sequence for the experiment
Fig. 1. Task screen.

This GUI not only serves as a front-end but also communicates with the EEG collection device via a COM port (emulated serial port) by sending different marker values for different user activities, such as the relax and task execution states. Table 2 gives the marker types and the values sent to the EEG device during the experiment. This information can be used to segment the EEG into the different tasks.

Table 2. Marker values sent by GUI to EEG device

The working of the GUI was demonstrated to the subjects, who were asked to perform practice tasks in order to familiarise themselves with the tool. Before the experiment started, subjects sat comfortably. They were discouraged from making physical movements (e.g. blinking where possible, excessive swallowing or any hand gestures) during the task, and were asked to focus on the presented task while solving the program code. Figures 2, 3 and 4 show examples of the tested Java codes.

Fig. 2. An example of the tested spatial relation Java code.

Fig. 3. An example of the tested visual object grouping Java code.

Fig. 4. An example of the tested mathematical Java code.

2.2 NASA TLX Survey

After solving each task, the subjects were instructed to fill in a paper-based NASA TLX rating sheet based on their perception of the task difficulty level. The NASA TLX index is a six-dimensional subjective measurement method developed by NASA to measure cognitive load [10]. The six sub-scales are mental demand, physical demand, temporal demand, performance, effort and frustration level. The workload is evaluated in two steps for each task: first, subjects rate each sub-scale on a range from 0–100 (divided into 20 equal intervals); second, sub-scale weights are created by forming the 15 possible pairs from the six dimensions, with subjects choosing from each pair the dimension contributing more to the workload.

Here, after marking the six dimension ratings, the subjects were instructed to circle, for each pair presented as described above, the dimension that contributed most to the task. The overall Weighted Workload Score (WWS) is computed from the subjects' ratings and the weights that contribute to the cognitive workload. This procedure follows the usage of the NASA TLX index form in the study by Fritz et al. [11].
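As a concrete illustration, the weighted workload computation can be sketched as follows (a minimal Python sketch; the ratings and weights shown are invented for illustration, not taken from the study):

```python
def weighted_workload(ratings, weights):
    """Overall NASA TLX Weighted Workload Score (WWS).
    ratings: sub-scale -> rating on the 0-100 scale;
    weights: sub-scale -> number of times it was circled across the 15 pairs."""
    assert sum(weights.values()) == 15, "each of the 15 pairs contributes one tally"
    return sum(ratings[d] * weights[d] for d in ratings) / 15.0

# invented example values for one task
ratings = {"mental": 80, "physical": 10, "temporal": 60,
           "performance": 40, "effort": 70, "frustration": 50}
weights = {"mental": 5, "physical": 0, "temporal": 3,
           "performance": 2, "effort": 4, "frustration": 1}
wws = weighted_workload(ratings, weights)  # 66.0 on the 0-100 scale
```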

2.3 EEG Data

The EEG data was obtained from a 14-channel Emotiv Epoc wireless EEG device (configuration as shown in Fig. 5) sampled at 128 Hz. During the experiment, the signal strength was continually checked and adjusted, using saline solution, to ensure all the electrodes had good contact with the scalp.

Fig. 5. Emotiv electrode locations.

The EEG data was segmented into one-second lengths. Elliptic IIR filters were used to filter the segmented EEG signals into the delta (1–4 Hz), theta (4–8 Hz), alpha (8–12 Hz), beta (12–30 Hz) and gamma (30–50 Hz) bands [12], and feature extraction was performed on these segments. Eighty such segments were obtained for each task, giving 480 patterns from the six tasks altogether (easy and difficult tasks from three categories).
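The segmentation and band filtering step can be sketched as below (Python with SciPy; the filter order and ripple settings are assumptions, as the paper does not report them, and the data here is synthetic):

```python
import numpy as np
from scipy.signal import ellip, filtfilt

FS = 128  # Emotiv Epoc sampling rate (Hz)
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 12),
         "beta": (12, 30), "gamma": (30, 50)}

def bandpass(x, lo, hi, fs=FS, order=4, rp=0.5, rs=40):
    # elliptic IIR band-pass; order/ripple values are illustrative assumptions
    b, a = ellip(order, rp, rs, [lo / (fs / 2), hi / (fs / 2)], btype="bandpass")
    return filtfilt(b, a, x)  # zero-phase filtering along the last axis

# split a continuous recording into 1-s segments, then filter one band
eeg = np.random.randn(14, FS * 80)           # 14 channels, 80 s of synthetic data
segments = eeg.reshape(14, -1, FS)           # (channels, segments, samples)
alpha = bandpass(segments, *BANDS["alpha"])  # alpha-band version of every segment
```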

2.4 EEG Analysis

ERD was computed by band pass filtering the EEG signal within the specified frequency band and percentage band power change was computed between the relaxed state and task execution state using (1):

$$ ERD_{b} = \left( {BE_{r} - BEtask_{b} } \right) / BE_{r} $$
(1)

where band energy during resting was computed using

$$ BE_{r} = \sum\limits_{i = 1}^{n} {\left( {x - \overline{x} } \right)^{2} } $$
(2)

and band energy during task using

$$ BEtask_{b} = \sum\limits_{i = 1}^{n} {\left( {x - \overline{x} } \right)^{2} } $$
(3)

where x is the band-pass filtered EEG data from a channel of length n, taken from either the rest or the task execution state, and \( \overline{x} \) is the mean of that channel. Given 14 channels and 5 bands, there were 70 ERD features for each one-second EEG segment.
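Under the definitions above, the ERD computation can be sketched as follows (a minimal Python sketch; the toy signals are invented to make the arithmetic visible):

```python
import numpy as np

def band_energy(x):
    # sum of squared deviations from the mean, as in (2) and (3)
    return np.sum((x - np.mean(x)) ** 2)

def erd(rest_band, task_band):
    # relative change in band energy between rest and task, as in (1)
    be_rest = band_energy(rest_band)
    return (be_rest - band_energy(task_band)) / be_rest

# toy example: task-state energy dropping to a quarter of the rest value
rest = np.array([0.0, 2.0, 0.0, 2.0])  # energy 4 about its mean of 1
task = np.array([0.0, 1.0, 0.0, 1.0])  # energy 1 about its mean of 0.5
value = erd(rest, task)                # (4 - 1) / 4 = 0.75
```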

The ASR of each spectral band was computed using (4), as in [9]:

$$ ASR_{b} = \left( {BE_{left} - BE_{right} } \right)/\left( {BE_{left} + BE_{right} } \right) $$
(4)

where ASR is the asymmetry ratio between the left and right hemispheres, BEleft is the spectral energy from a channel in the left hemisphere (computed using (3)) and BEright is the spectral energy from the opposite channel in the right hemisphere. Since there were 14 channels (7 in each hemisphere) and 5 spectral bands, ASR gave a total of 35 features.

In addition, band energies (EN) for each channel in the five bands were computed using (3), giving 70 features. Finally, all the available features were combined, giving the all-feature (AF) set of 175 features.
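Putting (1)–(4) together, the assembly of one segment's 175-dimensional feature vector can be sketched as below (Python; the channel ordering and the left/right pairing by index are illustrative assumptions, not the actual Epoc montage):

```python
import numpy as np

def band_energy(x):
    # sum of squared deviations from the mean, as in (2) and (3)
    return np.sum((x - np.mean(x)) ** 2)

def feature_vector(rest_bands, task_bands):
    """rest_bands, task_bands: dicts mapping band name -> (14, n) band-filtered
    arrays for one segment. Channels 0-6 vs 7-13 as left/right is an assumption."""
    en, erd, asr = [], [], []
    for band in task_bands:
        for ch in range(14):
            be_r = band_energy(rest_bands[band][ch])
            be_t = band_energy(task_bands[band][ch])
            en.append(be_t)                        # EN: 14 x 5 = 70 features
            erd.append((be_r - be_t) / be_r)       # ERD: 14 x 5 = 70 features
        for l, r in zip(range(7), range(7, 14)):   # ASR: 7 x 5 = 35 features
            bl = band_energy(task_bands[band][l])
            br = band_energy(task_bands[band][r])
            asr.append((bl - br) / (bl + br))
    return np.array(en + erd + asr)                # AF: 175 features in total

# synthetic 1-s, 14-channel segments for the five bands
rng = np.random.default_rng(0)
bands = ["delta", "theta", "alpha", "beta", "gamma"]
rest = {b: rng.standard_normal((14, 128)) for b in bands}
task = {b: rng.standard_normal((14, 128)) for b in bands}
fv = feature_vector(rest, task)
```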

2.5 Classification

These features were used with six different classifiers: QDA, SVM, NB, KNN, NN and TREE. For KNN, Euclidean distance was used, whereas for QDA, the covariance matrices could vary among classes. The TREE approach used an ensemble of 100 decision trees. For NN, the two output layer node values were set to either [1 0] or [0 1], with 10 hidden units (size chosen arbitrarily), and the network was trained using Matlab's trainlm. For the rest, the default classifier parameters of Matlab's fitcsvm, fitcnb, fitcensemble, fitcdiscr, patternnet and fitcknn were used [13]. The easy and difficult TDL were predicted using randomly split 40-fold cross-validation.
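The study's classification was done in Matlab; as a rough stand-in, the randomly split cross-validation with a Euclidean nearest-neighbour classifier can be sketched in Python as follows (the split fraction and synthetic data are assumptions for illustration):

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k=1):
    # Euclidean k-nearest-neighbour majority vote (binary labels 0/1)
    d = np.linalg.norm(test_X[:, None, :] - train_X[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return (train_y[nearest].mean(axis=1) >= 0.5).astype(int)

def random_split_cv(X, y, folds=40, test_frac=0.1, seed=0):
    # repeated random train/test splits, a simple stand-in for the
    # randomly split 40-fold cross-validation described above
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(folds):
        idx = rng.permutation(len(y))
        n_test = max(1, int(test_frac * len(y)))
        test, train = idx[:n_test], idx[n_test:]
        pred = knn_predict(X[train], y[train], X[test])
        accs.append(float((pred == y[test]).mean()))
    return float(np.mean(accs))

# synthetic two-class data with well-separated clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.standard_normal((40, 5)), rng.standard_normal((40, 5)) + 5.0])
y = np.array([0] * 40 + [1] * 40)
acc = random_split_cv(X, y)
```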

Classifier Confidence.

The classifier confidence (CC) approach used here worked by examining the outputs of the classifier for the test data. From the results, NN gave the best performance for most subjects, so only the outputs of this classifier were used; likewise, the combined feature set gave the best performance for the majority of subjects, so these features were used. The two classifier outputs for each test pattern were checked, and the predicted class was treated as confident only if the two outputs differed by at least 0.1. With perfect classification, the outputs would differ by 1, since one output would have a value of 1 and the other a value of 0; hence a 10% threshold of 0.1 is a reasonable starting point, though this value will need to be tuned experimentally in future work. It should be noted that some data are discarded when the classification outputs fall below the confidence threshold. Figure 6 shows the flow of the experimental design.
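The CC step itself reduces to a comparison of the two output-node activations; a minimal sketch (the output values shown are invented):

```python
import numpy as np

def confident_predictions(outputs, threshold=0.1):
    """outputs: (n, 2) array of the two NN output-node activations per test
    pattern. Returns predicted labels and a mask of 'confident' patterns;
    patterns below the threshold are discarded from the accuracy computation."""
    pred = outputs.argmax(axis=1)
    keep = np.abs(outputs[:, 0] - outputs[:, 1]) >= threshold
    return pred, keep

outs = np.array([[0.90, 0.10],   # confident 'easy'
                 [0.52, 0.48],   # difference 0.04 < 0.1: discarded
                 [0.30, 0.70]])  # confident 'difficult'
pred, keep = confident_predictions(outs)
```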

Fig. 6. Experimental flow design.

3 Results and Discussion

Figure 7 shows the overall WWS from NASA TLX for the different task difficulty levels. A non-parametric Kruskal-Wallis test (used as normality was not assumed) showed a significant difference between TDL (p < 0.01). Comparing each sub-scale (refer to Table 3), there were significant differences (sign rank tests, p < 0.01) between TDL for mental demand, temporal demand, frustration and effort. Performance and physical demand did not show any difference. The latter is not surprising, since no physical effort is required in the tasks, though it is somewhat surprising that there was no difference in the performance measure. This clearly indicates the necessity of measures such as EEG, as subjects were unable to differentiate the levels of performance required to complete the tasks.

Fig. 7. Boxplot of overall NASA TLX index mean weighted workload for different task difficulty levels.

Table 3. NASA TLX – subscale

A Kruskal-Wallis test showed a statistically significant difference in EEG features between the easy and difficult tasks (p < 0.05). Table 4 shows the classification results for EN, ERD, ASR and the combined features for the six different classifiers for subject 1.

Table 4. Subject 1 results

Similarly, Tables 5, 6, 7, 8, 9 and 10 show the results for the rest of the subjects. To decide on the best classifier, all the features were combined and a statistical test revealed a significant difference between the classifier performances (p < 0.05). The mean rank comparison showed that the NN classifier gave the best overall performance; it was also the best for five of the seven subjects.

Table 5. Subject 2 results
Table 6. Subject 3 results
Table 7. Subject 4 results
Table 8. Subject 5 results
Table 9. Subject 6 results
Table 10. Subject 7 results

Next, using the NN classification results (as NN gave the best overall performance), a significant difference was found in classification accuracy between the different feature extraction approaches, H(3) = 26.33, p = 8.12e–6. The mean rank values (EN: 581.03, ERD: 576.88, ASR: 478.44, AF: 605.66) showed that the EN and ERD features carried more discriminatory information than ASR for separating the two mental tasks, with the combination of all features giving the best results. Using all the features also gave the best accuracy for six of the seven subjects, with ERD giving the best accuracy for the remaining subject.

Using the CC approach yielded a further improvement in classification performance. As NN gave the best performance, this classifier was used with the best-performing all-feature combination. Figure 8 shows the performance for the seven subjects; performances were higher when CC was used. The improvements were statistically significant for all subjects (sign rank test, p < 0.05) except subject 6. This is as expected, since only the more confident classification outputs are used (the experiment revealed that about 10% of patterns were dropped).

Fig. 8. Classification (%) comparing the improvement with confidence approach (blue: with confidence, red: without confidence). (Color figure online)

Table 11 shows the average response time (i.e. the time taken to complete the tasks). As expected, the difficult tasks took longer to complete than the easy tasks.

Table 11. Average completion time (secs) for different task levels.

This research was limited by significant noise arising from the experimental procedure, with some subjects verbalising, flicking pens, nodding, etc. Eye blinks occurred in the EEG data, as shown in Fig. 9 (the example is from one subject, but similar artifacts were observed for the others). While these could have been removed at the pre-processing stage (for example using independent component analysis), we chose not to, in order to simulate actual classroom settings where it will be difficult to force students to adhere to strict no-movement instructions.

Fig. 9. EEG segment with artifacts.

4 Conclusion

Both NASA TLX and task completion time showed significant differences between TDL. NASA TLX has previously been used as a non-physiological measure to discriminate cognitive load across different programming languages [14]. However, given the lack of statistical difference in the performance sub-scale of the TLX, we can infer that it is difficult for subjects to estimate the TDL themselves, showing the need for measures that assess it directly.

In this report, we have shown that it is possible to differentiate the task difficulty of Java programming code using EEG signals. Though the subject pool is small and the performance needs improvement for real-life implementation, the method shows sufficient promise to be studied further. The combination of the proposed ASR with the ERD and EN features improves classification performance, and among the tested classifiers, NN gave the best performance. The CC approach further improved the performance, giving a maximum accuracy of 87.05%. Proper feature selection and tuning of classifier parameters could further improve the accuracy.

In conclusion, the findings here will hopefully pave the way for future research studies on tailoring learning material with appropriate level of difficulty, which will be especially useful for those with independent learning plans.