1 Introduction

Brain-computer interfaces (BCIs) provide a communication channel through which a user can control an external device using brain activity alone. They can serve as a rehabilitation tool for patients with severe neuromuscular disabilities [7] and support a range of other applications, including neural prostheses, Virtual Reality (VR), and internet access. Among the various neuroimaging techniques, electroencephalography (EEG) is one of the non-invasive methods most widely exploited in BCI experiments, and within EEG-based BCIs, event-related desynchronization (ERD), visually evoked potentials (VEP), slow cortical potentials (SCP), and P300 evoked potentials are the most widely used signals.

In accordance with the topographic patterns of brain rhythm modulations, feature extraction with the Common Spatial Patterns (CSP) algorithm [17] provides subject-specific, discriminative spatial filters. However, CSP has limitations: it is sensitive to the frequency band of the underlying neural activity, so in practice the band is selected manually or set to a broad band, and it tends to overfit when a large number of channels is used. A poorly chosen channel configuration thus causes both the spatial filter and the classifier to overfit. A simultaneous optimization of the spatial and spectral filters is therefore highly desirable in BCI studies.

In recent years, motor imagery (MI) based BCI has proven to be an independent system with high classification accuracy. Most MI-based BCIs use brain oscillations in the mu (8–12 Hz) and beta (13–26 Hz) rhythms, which exhibit localized event-related desynchronization (ERD) [16] corresponding to the respective MI states (such as right-hand or right-foot movement). In addition, the readiness potential (RP) [18], a slow negative event-related potential that appears before a movement is initiated, can be used as a BCI input to predict future movements. The RP is divided into an early RP, a slow negative potential beginning about 1.5 s before the action, immediately followed by a late RP occurring about 500 ms before the movement. In MI-based BCI, combining feature vectors derived from ERD and RP [5] has been shown to boost classification performance significantly.

Several sophisticated CSP-based algorithms have been reported in the BCI literature; a brief review is given here. Various methods have been proposed to avoid overfitting and to select optimal frequency bands for CSP. Regularized CSP (RCSP) [13] adds regularization information to the CSP learning procedure to counter overfitting. The Common Spatio-Spectral Pattern (CSSP) [11] extends CSP with a time-delayed sample; owing to its limited flexibility, the Common Sparse Spectral-Spatial Pattern (CSSSP) [6] was later presented, which replaces the single time-delay parameter with a full FIR filter. Since these methods are computationally expensive, the Spectrally-weighted Common Spatial Pattern (SPEC-CSP) [19] was designed, which alternately optimizes the temporal filter in the frequency domain and the spatial filter within an iterative process. To improve on SPEC-CSP, Iterative Spatio-Spectral Pattern Learning (ISSPL) [22] was proposed, which does not rely on statistical assumptions and optimizes all temporal filters under a common optimization framework.

Despite numerous studies and advanced algorithms, extracting optimal spatial-spectral filters remains a challenge in BCI research, particularly when BCIs are to serve as rehabilitation tools for disabled subjects. The spatial and spectral parameters associated with a BCI paradigm are usually fixed to defaults in EEG analysis without further examination, which degrades practical performance because of individual variability across subjects. To address this issue, a CSSBP [12] with combined feature vectors is designed for BCI paradigms: combining features that reflect different physiological phenomena, such as the readiness potential (RP) and event-related desynchronization (ERD), can make a BCI more robust against artifacts from non-Central Nervous System (CNS) activity such as eye blinks (EOG) and muscle movements (EMG) [5]. The EEG signal is first divided into several sub-bands using a band-pass filter; the channels and frequency bands are then modeled as preconditions, and a stochastic gradient boosting heuristic is used to train base learners under these preconditions. The effectiveness and robustness of the designed algorithm with feature combination are evaluated on the widely used benchmark dataset BCI competition IV (IIa). The remainder of the paper is organized as follows: Sect. 2 details the design of the proposed boosting algorithm, Sect. 3 presents performance comparison results, and Sect. 4 concludes.

2 Proposed Algorithm

This section details the combination model of CSSBP (common spatial spectral boosting pattern) with feature combination, covering both the problem formulation and the learning algorithm. The model consists of five stages: data preprocessing, which includes multiple spectral filtering (decomposing the signal into several sub-bands with a band-pass filter) and spatial filtering; feature extraction using common spatial patterns (CSP); feature combination; training of the weak classifiers; and pattern recognition with the resulting combination model. The architecture of the designed algorithm is shown in Fig. 1. The EEG data is first spatially filtered and band-pass filtered under multiple spatial-spectral preconditions.

Fig. 1. Block diagram of the proposed boosting pattern.

Afterwards, the CSP algorithm is applied to extract features from the EEG training dataset and these feature vectors are combined; the weak classifiers \( \{ f_{m} \}_{m = 1}^{M} \) are then trained and combined into a weighted combination model. Lastly, a new test sample \( \hat{x} \) is classified using this combination model.
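For concreteness, a minimal sketch of the CSP step is given below (in Python with NumPy/SciPy; all function and variable names are illustrative assumptions, not part of the original implementation). CSP solves a generalized eigenvalue problem between the two class-covariance matrices, and the features are the log-variances of the most discriminative spatial components.

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(trials_a, trials_b, n_pairs=2):
    """Compute CSP spatial filters from two classes of EEG trials.

    trials_a, trials_b: arrays of shape (n_trials, n_channels, n_samples).
    Returns W of shape (2*n_pairs, n_channels), the filters that maximize
    the variance ratio between the two classes.
    """
    def mean_cov(trials):
        covs = [x @ x.T / np.trace(x @ x.T) for x in trials]
        return np.mean(covs, axis=0)        # average normalized covariance

    ca, cb = mean_cov(trials_a), mean_cov(trials_b)
    # Generalized eigenproblem: ca w = lambda (ca + cb) w
    vals, vecs = eigh(ca, ca + cb)
    order = np.argsort(vals)[::-1]          # sort eigenvalues descending
    vecs = vecs[:, order]
    # Keep filters from both ends of the spectrum (most discriminative)
    return np.hstack([vecs[:, :n_pairs], vecs[:, -n_pairs:]]).T

def csp_features(trial, W):
    """Normalized log-variance features of one spatially filtered trial."""
    z = W @ trial                            # (2*n_pairs, n_samples)
    v = z.var(axis=1)
    return np.log(v / v.sum())
```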

2.1 Problem Design

In BCI studies, the two main design choices are the channel configuration and the frequency band, which are usually predefined as defaults for EEG analysis. Predefining these settings without deliberation leads to poor performance in real scenarios because EEG patterns vary across subjects. An efficient and robust configuration is therefore desirable for practical applications.

To model this problem, denote the training dataset as \( E_{train} = \{(x_{i}, y_{i})\}_{i = 1}^{N} \), where \( x_{i} \) is the ith sample and \( y_{i} \) its corresponding label. The aim is to find a subset ω ⊂ ν of the set ν of all admissible preconditions that generates a combination model F, built from sub-models trained under the conditions \( \vartheta_{m} \in \omega \), minimizing the misclassification rate on the training dataset \( E_{train} \):

$$ \omega = \arg\min_{\omega} \frac{1}{N}\left|\left\{\, i : F\left( x_{i}; \omega \right) \ne y_{i} \,\right\}_{i = 1}^{N}\right| $$
(1)

In the remainder of this section, two analogous sub-problems are modeled in detail, and an adaptive boosting algorithm is then designed to solve them.

Spatial Channel and Frequency Band Selection.

For channel selection, the aim is to select an optimal channel set S (S ⊂ U), where U is the universal set of all possible channel subsets of the channel set C, each subset \( U_{m} \in U \) satisfying \( |U_{m}| \le |C| \) (here |.| denotes the size of a set), such that an optimal combination classifier F is produced on the training data by combining base classifiers learned under different channel-set preconditions. Therefore, we get,

$$ F\left( {E_{train}; S} \right) = \sum\nolimits_{{S_{m} \in S}} {\alpha_{m} f_{m } \left( {E_{train} ;S_{m} } \right)} $$
(2)

where F is the optimal combination model, \( f_{m} \) is the mth sub-model learned under channel-set precondition \( S_{m} \), \( E_{train} \) is the training dataset, and \( \alpha_{m} \) is the combination coefficient. The original EEG \( E_{i} \) is multiplied by the obtained spatial filter, yielding a projection of \( E_{i} \) onto the channel set \( S_{m} \); this constitutes the channel selection. In the simulation work, 21 channels were used, giving the universal channel set C = (CP6, CP4, CP2, C6, C4, C2, FC6, FC4, FC2, CPZ, CZ, FCZ, CP1, CP3, CP5, C1, C3, C5, FC1, FC3, FC5), where each element denotes an electrode channel.
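The paper does not spell out how the channel subsets \( U_{m} \) are enumerated beyond the sliding-window strategy mentioned in the conclusion; a plausible sketch is given below, where the window width and step are illustrative assumptions.

```python
def channel_windows(channels, width=7, step=2):
    """Enumerate candidate channel subsets U_m by sliding a window of
    `width` electrodes over the ordered channel list in steps of `step`."""
    return [tuple(channels[i:i + width])
            for i in range(0, len(channels) - width + 1, step)]

C = ["CP6", "CP4", "CP2", "C6", "C4", "C2", "FC6", "FC4", "FC2",
     "CPZ", "CZ", "FCZ", "CP1", "CP3", "CP5", "C1", "C3", "C5",
     "FC1", "FC3", "FC5"]
for subset in channel_windows(C):
    print(subset)
```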

For frequency band selection, the spectrum G is simplified as a closed interval whose elements are integer points (in Hz). G is split into various sub-bands as given in [12, 14], forming D, the universal set of all possible sub-bands. The objective of frequency band selection is then to obtain an optimal band set B (B ⊂ D) such that an optimal combination classifier is produced on the training data.

$$ F\left( {E_{train} ;B} \right) = \sum\nolimits_{{B_{m} \in B}} {\alpha_{m} f_{m } \left( {E_{train} ;B_{m} } \right)} $$
(3)

where \( f_{m} \) is the mth weak classifier learned on sub-band \( B_{m} \). In the simulation study, a fifth-order zero-phase forward/reverse FIR filter was used to filter the raw EEG signal \( E_{i} \) into the sub-bands \( B_{m} \).
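A minimal sketch of this spectral preconditioning with SciPy is shown below. Note the paper specifies a fifth-order zero-phase forward/reverse FIR filter; this sketch uses a longer FIR designed with `firwin` so the pass-band is usable, and the tap count, band edges, and sampling rate are assumptions.

```python
import numpy as np
from scipy.signal import firwin, filtfilt

def subband_filter(eeg, band, fs=250.0, numtaps=101):
    """Zero-phase (forward/reverse) FIR band-pass filtering of one trial.

    eeg: array (n_channels, n_samples); band: (low, high) edges in Hz.
    fs defaults to the 250 Hz sampling rate of BCI competition IV (IIa).
    """
    taps = firwin(numtaps, band, fs=fs, pass_zero=False)
    return filtfilt(taps, [1.0], eeg, axis=-1)   # forward/reverse pass

def apply_precondition(eeg, channel_idx, band, fs=250.0):
    """Filter one trial under a spatial-spectral precondition (S_m, B_m):
    project onto the channel subset, then band-pass into the sub-band."""
    return subband_filter(eeg[channel_idx, :], band, fs=fs)
```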

2.2 Model Learning Algorithm

Here, the channel-selection and frequency-selection models are combined into a two-tuple \( \vartheta_{m} = (S_{m}, B_{m}) \) denoting a spatial-spectral precondition, and ν denotes the universal set of all such spatial-spectral preconditions. The combination function can then be computed as

$$ F\left( {E_{train} ;\upvartheta} \right) = \sum\nolimits_{{\upvartheta_{\text{m}} \in\upvartheta}} {\alpha_{m} f_{m } \left( {E_{train} ;\upvartheta_{\text{m}} } \right)} $$
(4)

Hence, for each spatial-spectral precondition \( \vartheta_{m} \in \vartheta \), the training dataset \( E_{train} \) is filtered under \( \vartheta_{m} \). CSP features are extracted from the filtered training data, and features of distinct physiological nature are combined using the PROB method [1]. Denote the N features by random variables \( X_{i}, i = 1, \ldots, N \), with class label \( Y \in \{\pm 1\} \). For each feature i, an optimal classifier \( f_{i} \) is defined on the single-feature space \( D_{i} \), minimizing the misclassification rate, and \( g_{i,y} \) denotes the class-conditional density of \( X_{i} \) given \( Y = y \) for y = +1 or −1. Let f be the optimal classifier on the combined feature space \( D = (D_{1}, D_{2}, \ldots, D_{N}) \), and let X be the combined random variable \( X = (X_{1}, X_{2}, \ldots, X_{N}) \) with class-conditional densities \( g_{y} \). Under the assumption of equal class priors, for \( x = (x_{1}, x_{2}, \ldots, x_{N}) \in D \),

$$ f_{i}\left( x_{i}; \gamma(\vartheta_{i}) \right) = 1 \;\leftrightarrow\; \hat{f}_{i}\left( x_{i}; \gamma(\vartheta_{i}) \right) := \log\left( \frac{g_{i,1}\left( x_{i} \right)}{g_{i,-1}\left( x_{i} \right)} \right) > 0 $$
(5)

where γ is the model parameter determined by \( \vartheta_{i} \) and \( E_{train} \). Incorporating independence between the features into the above equation yields the optimal decision function

$$ f\left( x; \gamma(\vartheta) \right) = 1 \;\leftrightarrow\; \hat{f}\left( x; \gamma(\vartheta) \right) = \sum\nolimits_{i = 1}^{N} \hat{f}_{i}\left( x_{i}; \gamma(\vartheta_{i}) \right) > 0 $$
(6)

Here the assumption is that, for each class, the features are Gaussian with equal covariance, i.e., \( X_{i} \mid Y = y \sim N\left( \mu_{i,y}, \Sigma_{i} \right) \). With \( w_{i} := \Sigma_{i}^{-1}(\mu_{i,1} - \mu_{i,-1}) \), the classifier becomes

$$ f\left( x; \gamma(\vartheta) \right) = 1 \;\leftrightarrow\; \hat{f}\left( x; \gamma(\vartheta) \right) = \sum\nolimits_{i = 1}^{N} \left[ w_{i}^{T} x_{i} - \tfrac{1}{2}\left( \mu_{i,1} + \mu_{i,-1} \right)^{T} w_{i} \right] > 0 $$
(7)
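The decision rule in Eqs. (5)–(7) amounts to summing per-feature LDA discriminants. A minimal NumPy sketch under the stated Gaussian equal-covariance assumption is given below; class names and the per-block organization (e.g., one block of ERD features and one of RP features per trial) are our assumptions.

```python
import numpy as np

class ProbCombiner:
    """PROB-style combination (Eqs. (5)-(7)): per-feature-block Gaussian
    classifiers whose log-likelihood ratios are summed, assuming
    independence between blocks (e.g. ERD and RP feature vectors)."""

    def fit(self, blocks, y):
        # blocks: list of arrays, each (n_trials, d_i); y in {+1, -1}
        self.params = []
        for X in blocks:
            mu_p = X[y == 1].mean(axis=0)
            mu_n = X[y == -1].mean(axis=0)
            # pooled within-class covariance (equal-covariance assumption)
            Xc = np.vstack([X[y == 1] - mu_p, X[y == -1] - mu_n])
            cov = Xc.T @ Xc / (len(X) - 2) + 1e-6 * np.eye(X.shape[1])
            w = np.linalg.solve(cov, mu_p - mu_n)        # w_i in Eq. (7)
            b = -0.5 * (mu_p + mu_n) @ w                 # threshold term
            self.params.append((w, b))
        return self

    def decision(self, blocks):
        # Sum of per-block discriminants, Eqs. (6)-(7)
        return sum(X @ w + b for X, (w, b) in zip(blocks, self.params))

    def predict(self, blocks):
        return np.where(self.decision(blocks) > 0, 1, -1)
```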

The obtained weak classifier can then be written as \( f_{m}\left( E_{train}; \vartheta_{m} \right) \) and is trained using the boosting algorithm. The classification error defined earlier can thus be formulated as

$$ \{ \alpha, \vartheta \}_{0}^{M} = \arg\min_{\{ \alpha, \vartheta \}_{0}^{M}} \sum\nolimits_{i = 1}^{N} L\left( y_{i}, \sum\nolimits_{m = 0}^{M} \alpha_{m} f_{m}\left( x_{i}; \gamma(\vartheta_{m}) \right) \right) $$
(8)

A greedy approach [8] is used to solve (8), detailed below:

$$ F\left( E_{train}, \gamma, \{ \alpha, \vartheta \}_{0}^{M} \right) = \sum\nolimits_{m = 0}^{M - 1} \alpha_{m} f_{m}\left( E_{train}; \gamma\left( \vartheta_{m} \right) \right) + \alpha_{M} f_{M}\left( E_{train}; \gamma\left( \vartheta_{M} \right) \right) $$
(9)

Transforming Eq. (9) into a simple recursion formula, we get

$$ F_{m} \left( {E_{train} } \right) = F_{m - 1} \left( {E_{train} } \right) + \alpha_{m} f_{m } \left( {E_{train} ;\upgamma\left( {\upvartheta_{\text{m}} } \right)} \right) $$
(10)

Suppose \( F_{m - 1}\left( E_{train} \right) \) is known; then \( f_{m} \) and \( \alpha_{m} \) can be determined by

$$ F_{m}\left( E_{train} \right) = F_{m - 1}\left( E_{train} \right) + \arg\min_{f} \sum\nolimits_{i = 1}^{N} L\left( y_{i}, F_{m - 1}\left( x_{i} \right) + \alpha_{m} f_{m}\left( x_{i}; \gamma(\vartheta_{m}) \right) \right) $$
(11)

The problem in (11) is solved using steepest gradient descent [9], with the pseudo-residuals given by

$$ r_{\pi(i)m} = - \nabla_{F} L\left( y_{\pi(i)}, F(x_{\pi(i)}) \right) = - \left[ \frac{\partial L\left( y_{\pi(i)}, F(x_{\pi(i)}) \right)}{\partial F(x_{\pi(i)})} \right]_{F(x_{\pi(i)}) = F_{m - 1}(x_{\pi(i)})} $$
(12)

Here, \( \{\pi(i)\}_{i = 1}^{\hat{N}} \) denotes the first \( \hat{N} \) elements of a random permutation of \( \{i\}_{i = 1}^{N} \). A new set \( \{(x_{\pi(i)}, r_{\pi(i)m})\}_{i = 1}^{\hat{N}} \), which defines a stochastic approximation of the best steepest-descent step direction, is then produced and used to learn \( \gamma(\vartheta_{m}) \):

$$ \gamma(\vartheta_{m}) = \arg\min_{\gamma, \rho} \sum\nolimits_{i = 1}^{\hat{N}} \left[ r_{\pi(i)m} - \rho f\left( x_{\pi(i)}; \gamma_{m}(\vartheta_{m}) \right) \right]^{2} $$
(13)

The combination coefficient \( \alpha_{m} \) is then obtained with \( \gamma_{m}(\vartheta_{m}) \) as

$$ \alpha_{m} = \arg min_{\alpha } \sum\nolimits_{i = 1}^{N} {L\left( {y_{i} ,\left[ {F_{m - 1} \left( {x_{i} } \right) + \alpha f_{m } \left( {x_{i} ;\upgamma\left( {\upvartheta_{\text{m}} } \right)} \right)} \right]} \right)} $$
(14)

Here, each weak classifier \( f_{m} \) is trained on a random subset \( \{\pi(i)\}_{i = 1}^{\hat{N}} \) drawn without replacement from the full training dataset. This random subset, rather than the full sample, is used to fit the base learner as in Eq. (13), and the model update for the current iteration is computed using Eq. (14). During the iterations, a self-adjusting training data pool P is maintained in the background, as detailed in Algorithm 1: the number of copies of each incorrectly classified sample is computed from the local classification error, and these copies are added to the training data pool.

2.3 Algorithm 1: Architecture of Proposed Boosting Algorithm

Input: The EEG training dataset \( \{x_{i}, y_{i}\}_{i = 1}^{N} \), the squared-error loss function L(y, x), the number of weak learners M, and the set ν of all preconditions.

Output: The optimal combination classifier F, the weak learners \( \{f_{m}\}_{m = 1}^{M} \), their weights \( \{\alpha_{m}\}_{m = 1}^{M} \), and the preconditions \( \{\vartheta_{m}\}_{m = 1}^{M} \) under which the weak learners were trained.

  (1) Input \( (x_{i}, y_{i})_{i = 1}^{N} \) and \( \vartheta \) into a CSP-based classifier, extract features, and combine the feature vectors to generate the family of weak learners.

  (2) Initialize the training data pool \( P_{0} = E_{train} = \{x_{i}, y_{i}\}_{i = 1}^{N} \) and \( F_{0}(E_{train}) = \arg\min_{\alpha} \sum\nolimits_{i = 1}^{N} L(y_{i}, \alpha) \).

  (3) for m = 1 to M:

  (4) Generate a random permutation \( \{\pi(i)\}_{i = 1}^{|P_{m - 1}|} = \text{randperm}(i)_{i = 1}^{|P_{m - 1}|} \).

  (5) Select the first \( \hat{N} \) elements \( \{\pi(i)\}_{i = 1}^{\hat{N}} \) as \( (x_{i}, y_{i})_{i = 1}^{\hat{N}} \) from \( P_{m - 1} \).

  (6) Use these \( \hat{N} \) elements to optimize the new learner \( f_{m}\left( E_{train}; \gamma(\vartheta_{m}) \right) \) as defined in Eq. (13).

  (7) Optimize \( \alpha_{m} \) as defined in Eq. (14).

  (8) Update \( P_{m} \) using the following steps:

    A. Use the current local optimal classifier \( F_{m} \) to split the original training set \( E_{train} = (x_{i}, y_{i})_{i = 1}^{N} \) into two parts, \( T_{True} = \{x_{i}, y_{i}\}_{i: y_{i} = F_{m}(x_{i})} \) and \( T_{False} = \{x_{i}, y_{i}\}_{i: y_{i} \ne F_{m}(x_{i})} \), and re-adjust the training data pool:

    B. for each \( (x_{i}, y_{i}) \in T_{False} \) do

    C. Select out all matching \( (x_{i}, y_{i}) \in P_{m - 1} \) as \( \{x_{n(k)}, y_{n(k)}\}_{k = 1}^{K} \).

    D. Copy \( \{x_{n(k)}, y_{n(k)}\}_{k = 1}^{K} \) d times (d ≥ 1), so that in total (d + 1)K duplicated samples are obtained.

    E. Return these (d + 1)K samples into \( P_{m - 1} \) to obtain the adjusted pool \( P_{m} \), and set

$$ F_{m}\left( E_{train} \right) = F_{m - 1}\left( E_{train} \right) + \alpha_{m} f_{m}\left( E_{train}; \gamma(\vartheta_{m}) \right) $$

    F. end for

  (9) end for

  (10) For each \( f_{m}\left( E_{train}; \gamma(\vartheta_{m}) \right) \), use the mapping \( F \leftrightarrow \vartheta \) to obtain its corresponding precondition \( \vartheta_{m} \).

  (11) Return F, \( \{f_{m}\}_{m = 1}^{M} \), \( \{\alpha_{m}\}_{m = 1}^{M} \), and \( \{\vartheta_{m}\}_{m = 1}^{M} \).
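Read as code, Algorithm 1 is a stochastic gradient boosting loop with a self-adjusting sample pool. A condensed Python sketch under squared-error loss is given below; the `fit_weak` callback, the exhaustive precondition search, and the line-search for \( \alpha_{m} \) are simplifying assumptions, not the authors' exact implementation.

```python
import numpy as np

def cssbp_boost(X, y, preconditions, fit_weak, M=180, frac=0.9, eps=0.05):
    """Condensed sketch of Algorithm 1 under squared-error loss.

    X: (N, ...) array of trials; y: NumPy array of labels in {+1, -1};
    preconditions: list of (S_m, B_m) tuples;
    fit_weak(X_sub, r_sub, cond) -> weak learner f with f(X) -> scores.
    """
    N = len(y)
    n_hat = int(frac * N)
    pool = list(range(N))                     # training data pool P_0
    F = np.full(N, y.mean())                  # F_0 = argmin_a sum_i L(y_i, a)
    learners, alphas, conds = [], [], []

    for m in range(M):
        perm = np.random.permutation(pool)    # random permutation of P_{m-1}
        sub = perm[:n_hat]                    # first N_hat elements
        r = y[sub] - F[sub]                   # pseudo-residuals, Eq. (12)

        # Search preconditions for the best least-squares fit, Eq. (13)
        best = None
        for cond in preconditions:
            f = fit_weak(X[sub], r, cond)
            err = np.sum((r - f(X[sub])) ** 2)
            if best is None or err < best[0]:
                best = (err, f, cond)
        _, f, cond = best

        # Line search for alpha_m under squared error, Eq. (14)
        fx = f(X)
        alpha = float((y - F) @ fx / (fx @ fx + 1e-12))
        F = F + alpha * fx                    # recursion, Eq. (10)

        # Pool adjustment: duplicate misclassified samples d times, Eq. (15)
        wrong = [i for i in range(N) if np.sign(F[i]) != y[i]]
        e = max(len(wrong) / N, 1e-12)        # local classification error
        d = max(1, int((1 - e) / (e + eps)))
        pool = pool + wrong * d               # (d+1) copies end up in P_m

        learners.append(f); alphas.append(alpha); conds.append(cond)

    return F, learners, alphas, conds
```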

The number of iterations M is determined with an early-stopping strategy [23] to avoid overfitting. Taking \( \hat{N} = N \) introduces no randomness; the smaller the fraction \( \hat{N}/N \), the more overall randomness is incorporated into the process. In this work, \( \hat{N}/N = 0.9 \) gave comparably satisfactory performance. When adjusting P, the number of copies d of the incorrectly classified samples is computed from the local classification error \( e = \frac{|T_{False}|}{N} \) as

$$ d = \max\left( 1, \left\lfloor \frac{1 - e}{e + \epsilon} \right\rfloor \right) $$
(15)

Here, the parameter \( \epsilon \) is called the accommodation coefficient. The error e is always less than 0.5 and decreases over the iterations, so that samples incorrectly classified by already strong learners receive large weights.
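To illustrate the behaviour of Eq. (15) (with the ϵ = 0.05 used in Sect. 3), the number of copies grows as the local error shrinks:

```python
import math

for e in (0.4, 0.2, 0.1, 0.05):
    d = max(1, math.floor((1 - e) / (e + 0.05)))  # Eq. (15)
    print(f"e={e:.2f} -> d={d}")
# e=0.40 -> d=1, e=0.20 -> d=3, e=0.10 -> d=6, e=0.05 -> d=9
```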

3 Results

The robustness of the designed algorithm was assessed on the BCI competition IV (IIa) dataset [2]. FastICA was employed to remove artifacts caused by eye and muscle movements [15]. To compare performance and efficiency, Regularized CSP (RCSP) [13] was used as the competing feature-extraction method; its model parameter λ was chosen on the training set using a hold-out validation procedure. For the four-class motor imagery classification task, the one-versus-rest (OVR) strategy [21] was employed for CSP. The PROB method [1], which incorporates independence between the ERD and LRP features, was utilized for feature combination. Since additional features do not necessarily improve training accuracy, feature selection was performed using the Fisher score (the variant \( J = \frac{\|\mu_{+} - \mu_{-}\|^{2}}{\sigma_{+} + \sigma_{-}} \)) [10], which measures the discriminative power of each individual feature in the feature vector; the features with the largest Fisher scores were selected as the most discriminative. Linear Discriminant Analysis (LDA) [4], which minimizes the expected misclassification risk, was used for classification.
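A sketch of this selection step is given below; it computes the score feature-wise, and reading \( \sigma_{\pm} \) as the class variances is our assumption about the exact variant.

```python
import numpy as np

def fisher_scores(X, y):
    """Per-feature Fisher score J = (mu+ - mu-)^2 / (sigma+ + sigma-)."""
    mu_p, mu_n = X[y == 1].mean(axis=0), X[y == -1].mean(axis=0)
    s_p, s_n = X[y == 1].var(axis=0), X[y == -1].var(axis=0)
    return (mu_p - mu_n) ** 2 / (s_p + s_n + 1e-12)

def select_top_k(X, y, k):
    """Indices of the k most discriminative features by Fisher score."""
    return np.argsort(fisher_scores(X, y))[::-1][:k]
```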

Using [20], the optimal channels for the four MI movements (left hand, right hand, foot, and tongue) were CP4, Cz, FC2, and C1, respectively. The 2-D topoplot maps of peak amplitudes of the boosting-based CSSP filtered EEG in each electrode for subject S1 are shown in Fig. 2.

Fig. 2. 2-D topoplot maps of peak amplitude of the boosting-based CSSP filtered EEG in each channel for subject S1 in the BCI competition IV (IIa) dataset.

To compute the spatial weight for each channel, the quantitative vector \( L = \sum\nolimits_{S_{i} \in S} \alpha_{i} S_{i} \) [17] was used, where \( S_{i} \) are the channel sets and \( \alpha_{i} \) their weights. The spectral weights were computed as given in [12] and projected onto the frequency bands; the temporal information was also obtained and visualized. The training data are preprocessed under each spatial-spectral precondition \( \vartheta_{m} \in \vartheta \), yielding a new dataset on which CSP spatial filtering is performed to obtain the spatial patterns. The first two CSP components are projected back onto the signal space, yielding the CSP-filtered signal \( E_{m} \). The peak amplitude \( P_{mC_{i}} \) is computed from \( E_{m} \) for each channel \( C_{i} \in C \) and then averaged over all preconditions \( \vartheta_{m} \in \vartheta \) as \( P_{C_{i}} = \frac{1}{|\vartheta|} \sum\nolimits_{\vartheta_{m} \in \vartheta} \alpha_{m} P_{mC_{i}} \), where \( \alpha_{m} \) is the weight of the mth condition; the result is visualized as a 2-D topoplot map. The topoplots show that left-hand and right-hand movements produced activation over the right and left hemispheres respectively, foot movement activated the central cortical area, and tongue movement showed activation in the motor cortex region.
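The channel-wise averaging just described reduces to a weighted mean over preconditions; a one-function sketch follows, where the array shapes are assumptions.

```python
import numpy as np

def weighted_peak_amplitude(peaks, alphas):
    """P_Ci = (1/|theta|) * sum_m alpha_m * P_mCi, computed per channel.

    peaks: (M, n_channels) array of per-precondition peak amplitudes P_mCi;
    alphas: (M,) array of the corresponding condition weights alpha_m.
    """
    peaks, alphas = np.asarray(peaks), np.asarray(alphas)
    return (alphas[:, None] * peaks).sum(axis=0) / len(alphas)
```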

The classification results on the test dataset for the proposed method and the competing Regularized CSP (RCSP) are as follows. For all subjects, the maximum number of boosting iterations M was set to 180, determined by the early-stopping strategy to avoid overfitting, and ϵ was set to 0.05. The Cohen's kappa values [3] for all 9 subjects in the BCI IV (IIa) dataset are shown in Fig. 3. CSSBP outperformed the RCSP algorithm and achieved the highest average Cohen's kappa value. The kappa values also show that combining feature vectors in the RCSP algorithm significantly improved kappa for all subjects except S4 and S6.

Fig. 3. Cohen's kappa values for all 9 subjects in the BCI IV (IIa) dataset, where A is RCSP, B is RCSP with combined feature vectors, C is boosting-based CSSP (CSSBP), and D is boosting-based CSSP (CSSBP) with combined feature vectors.

The proposed method improved the kappa values further over these algorithms; moreover, CSSBP with combined feature vectors outperformed CSSBP with single features. Statistical analysis was performed using IBM SPSS ver. 23 with a Mann-Whitney U test, which showed a significant difference between the designed method and the comparison methods; in all cases the designed method was superior at a significance level of p < 0.05, as shown in Fig. 4.

Fig. 4. Boxplots of RCSP and the boosting approach, where A is RCSP, B is RCSP with combined feature vectors, C is CSSBP, and D is CSSBP with combined feature vectors, for the BCI IV (IIa) dataset (p < 0.05).

4 Conclusion

In this work, a boosting-based common spatial-spectral pattern (CSSBP) algorithm with feature combination has been designed for multichannel EEG classification. The channel and frequency configurations are divided into multiple spatial-spectral preconditions using a sliding-window strategy, and weak learners are trained under these preconditions with a boosting approach. The goal is to select the channel groups and frequency bands that contribute most to the neural activity of interest. The results show that CSSBP clearly outperformed the comparison method. In addition, combining the widely used ERD and readiness potential (RP) feature vectors significantly improved classification performance over single-feature CSSBP and increased robustness.

The PROB method, which incorporates independence between the ERD and LRP features, enhanced performance and can also be used to better explore the neurophysiological mechanisms underlying brain activity. Combining features of different brain tasks in a feedback environment, where the subject is adapting to the feedback scenario, may make the learning process complex and time-consuming; this should be investigated further in future online BCI experiments.