1 Introduction

Phishing is a type of social engineering attack in which attackers create fake websites with a look and feel similar to the real ones and lure users to these websites with the intention of stealing their private credentials (e.g., passwords, credit card information, and social security numbers) for malicious purposes. Because phishing attacks are a major threat to cybersecurity, many studies have been conducted to understand why users are susceptible to phishing attacks [10, 34, 35, 41], and to design automated detection mechanisms, e.g., by utilizing image processing [39], URL processing [7, 37], or blacklisting [36]. Recently, Neupane et al. [24,25,26] introduced a new detection methodology based on the differences in neural activity levels when users visit real and phishing websites. In this paper, we advance this line of work by introducing tensor decomposition to represent brain-computer interface data related to phishing detection.

With the emergence of the Brain-Computer Interface (BCI), electroencephalography (EEG) devices have become commercially available and are widely used in the gaming, meditation, and entertainment sectors. In this study, we therefore used an EEG-based BCI device to collect the neural activity of users performing a phishing detection task. EEG data are often analyzed with methodologies such as time-series analysis, power spectral analysis, and matrix decomposition, which consider either the temporal or the spatial aspect of the data. In this study, by contrast, we take advantage of the multi-dimensional structure of the EEG data and perform tensor analysis, which takes into account spatial, temporal, and spectral information, to understand the neural activity related to phishing detection and to extract related features.

In this paper, we show that the tensor representation of the EEG data helps to better understand which brain areas are activated during the phishing detection task. We also show that the tensor decomposition of the EEG data reduces the dimension of the feature vector and achieves higher accuracy than the state-of-the-art feature extraction methodologies used in previous research [25].

Our Contributions: In this paper, we learn tensor representations of brain data related to the phishing detection task. Our contributions are three-fold:

  • We show that the multi-way nature of tensors is a powerful tool for the analysis and discovery of the underlying hidden patterns in neural data. To the best of our knowledge, this is the first study which employs the tensor representations to understand human performance in security tasks.

  • We perform a comprehensive tensor analysis of the neural data and identify the level of activation in the channels or brain areas related to the users’ decision making process with respect to the real and the fake websites based on the latent factors extracted.

  • We extract features relevant to real and fake websites, perform cross-validation using different machine learning algorithms, and show that tensor-based representations achieve an accuracy above 94% consistently across all classifiers. We also reduce the dimension of the feature vector by keeping only the features related to the highly activated channels, and show that the dimension-reduced feature vector achieves better accuracy (97%).

The tensor representations of the data collected in our study provided several interesting insights and results. We observed that the users have higher component values for the channels located in the right frontal and parietal areas, which means that these areas were highly activated during the phishing detection task. These areas have been found to be involved in decision-making, working memory, and memory recall. Higher activation in these areas suggests that the users were trying hard to infer the legitimacy of the websites and may have been recalling the properties of the websites from memory. The results of our study are consistent with the findings of previous phishing detection studies [25, 26]. Unlike these studies, our study demonstrates a tool for obtaining the active brain areas or channels involved in the phishing detection task without performing multiple statistical comparisons. In addition, our methodology effectively derives more predictive features from these channels to build a highly accurate machine-learning-based automated phishing detection mechanism.

2 Data Collection Experiments

In this section, we describe details on data collection and preprocessing.

2.1 Data Collection

The motivation of our study is to learn tensor-based representations from BCI-measured data for a phishing detection task in which users had to identify the phishing websites presented to them. We designed and developed a phishing detection experiment that measured neural activity while users viewed real and fake websites. We designed our phishing detection study in line with prior studies [10, 24,25,26]. Our phishing websites were created by obfuscating the URL, either by inserting an extra similar-looking string into the URL or by replacing certain characters of the legitimate URL. The visual appearance of the fake websites was kept intact and similar to that of the real websites. We designed our fake webpages based on the samples of phishing websites and URLs available at PhishTank [32] and OpenPhish [28]. We chose twenty websites from the list of the top 100 popular websites ranked by Alexa [3] and created fake versions of 17 of them by applying the URL obfuscation methodology. We also used the real versions of these 17 websites in the study. We collected data in multiple sessions and followed the EEG experiment design of prior studies [21, 40]. In each session of the experiment, the participants were presented with 34 webpages in total (17 real and 17 fake).

We recruited fifteen healthy computer science students after obtaining Institutional Review Board (IRB) approval and gave each of them a $10 Amazon gift card for participating in our study. We had ten (66.67%) male participants and five (33.33%) female participants, with an age range of 20–32 years. The participants were instructed to look at the webpage on the screen and respond by pressing a ‘Yes’/‘No’ button using a computer mouse. We used a commercially available, lightweight EEG headset [1] to simulate a near real-world browsing experience. The EmotivPro software package was used to collect raw EEG data. We presented the same set of (randomized) trials to all participants. All participants performed the same tasks in four different sessions. There was a break of approximately 5 min between two consecutive sessions. We collected the data for all sessions on the same day and in the same room. In total, we have 2040 responses (15 participants \(\times \) 4 sessions \(\times \) 34 events per session). We discarded 187 wrong responses and considered only the remaining 1853 responses in our analysis.

2.2 Data Preprocessing

EEG signals can be contaminated by eye blinks, eyeball movement, breathing, heartbeats, and muscle movement. These artifacts can overwhelm the neural signals and may eventually degrade the performance of the classifiers. We therefore preprocess the data to reduce the noise before modeling the data for tensor decomposition. The electrooculogram (EOG) produced by eye movements and the electromyogram (EMG) produced by muscle movement are the common noise sources contaminating the EEG data. We used the AAR (Automatic Artifact Removal) toolbox [13] to remove both EOG and EMG artifacts [17]. After removing the EOG and EMG artifacts, the EEG data were band-pass filtered with an eighth-order Butterworth filter with a pass-band of 3 to 60 Hz to remove other high-frequency noise. The band-pass filter keeps signals within the specified frequency range and rejects the rest. The electrical activity in the brain is generated by billions of neurons, and the raw EEG signals we collected using the sensors of the Emotiv Epoc+ device contain a mixture of sources. We therefore applied Independent Component Analysis (ICA) [16], a powerful technique for separating independent sources linearly mixed in several sensors, to segregate the electrical signals related to each sensor. Our EEG data preprocessing methodology is similar to the process reported in [23].

3 Problem Formulation and Proposed Data Analysis

Tensor decomposition is useful for capturing the underlying structure of the analyzed data. In this experiment, tensor decomposition is applied to the EEG brain data measured for a phishing detection task.

One of the most popular tensor decompositions is the so-called PARAFAC decomposition [14]. In PARAFAC, following an Alternating Least Squares (ALS) method, we decompose the tensor into 3 factor matrices. The PARAFAC decomposition expresses the tensor as a sum of rank-one component tensors. Therefore, for a 3-mode tensor \(X \in R^{I \times J \times K}\), the decomposition is

$$\begin{aligned} X = \sum _{r=1}^{R} a_r \circ b_r \circ c_r \end{aligned}$$
(1)
Fig. 1. PARAFAC decomposition with 3 factor matrices (Time, Channel and Event). The Event matrix (blue colored) is used as features. (Color figure online)

Here, R is a positive integer and \(a_r \in R^{I}\), \(b_r \in R^{J}\) and \(c_r \in R^{K}\) are the factor vectors; stacking them as columns over \(r = 1, \ldots , R\) gives one factor matrix per mode. Figure 1 shows a graphical representation of the PARAFAC decomposition. However, the PARAFAC model assumes that the observations for a set of variables are naturally aligned. Since this is not guaranteed in our phishing experiments, we switched to the PARAFAC2 model, which is a variation of the PARAFAC model.
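As a minimal NumPy sketch of Eq. 1, the snippet below rebuilds a small 3-mode tensor from three factor matrices and checks that this equals the explicit sum of rank-one outer products. The sizes, the random factors, and the helper name `cp_reconstruct` are illustrative assumptions, not part of our pipeline.

```python
import numpy as np

def cp_reconstruct(A, B, C):
    """Rebuild an I x J x K tensor from A (I x R), B (J x R), C (K x R)
    as the sum over r of the outer products a_r o b_r o c_r (Eq. 1)."""
    return np.einsum("ir,jr,kr->ijk", A, B, C)

rng = np.random.default_rng(0)
I, J, K, R = 5, 4, 3, 2          # illustrative sizes only
A = rng.standard_normal((I, R))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))

X = cp_reconstruct(A, B, C)

# The same tensor, built literally as a sum of R rank-one terms.
X_loop = np.zeros((I, J, K))
for r in range(R):
    X_loop += np.multiply.outer(np.multiply.outer(A[:, r], B[:, r]), C[:, r])
```

Note that every slice of `X` has the same shape, which is exactly the alignment assumption discussed above.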

The feature matrix has dimension 68 \(\times \) N, where 68 is the number of events and N is the number of components (features). We selected different numbers of features in our experiment to test which number of features trains a better model.

3.1 PARAFAC2 Decomposition

In real-life applications, a common problem is that the dataset is not completely aligned in all modes. This situation occurs in different problems, for example in clinical records of different patients, where patients had different health problems and the duration of treatment varied accordingly [31]. Another example is the participants’ response records for phishing detection, where each participant took a variable amount of time to decide whether the website presented was real or phishing. In these examples, the number of samples per participant does not align naturally. The traditional models (e.g., PARAFAC and Tucker) assume that the data are completely aligned. Moreover, if further preprocessing is applied to make the data completely aligned, the result may no longer faithfully represent the data [15, 38]. Therefore, to model unaligned data, the traditional tensor models need to be changed. The PARAFAC2 model is designed to handle such data.

The PARAFAC2 model is a more flexible version of the PARAFAC model. It also follows the uniqueness property of PARAFAC. The only difference is the way it computes the factor matrices: it keeps the same factor matrix in one mode while allowing the factor matrix in another mode to vary across slices. Suppose the dataset contains data for K subjects. For each of these subjects (1, 2,..., K), there are J variables across which \(I_{k}\) observations are recorded. The number of observations \(I_{k}\) need not be the same for all subjects. The PARAFAC2 decomposition can be expressed as

$$\begin{aligned} X_{k} \approx U_{k} S_{k} V^{T} \end{aligned}$$
(2)

This is the slice-wise counterpart of Eq. 1: it represents only the frontal slices \(X_{k}\) of the input tensor X. Here, for subject k and rank R, \(U_{k}\) is the factor matrix in the first mode with dimension \(I_{k}\) \(\times \) R, \(S_{k}\) is a diagonal matrix with dimension R \(\times \) R, and V is the factor matrix with dimension J \(\times \) R. \(S_{k}\) is the k-th frontal slice of S, where S has dimension R \(\times \) R \(\times \) K, and \(S_{k} = diag(W(k, :))\), where W is the K \(\times \) R factor matrix of the third mode. Figure 2 shows the PARAFAC2 decomposition.

Fig. 2. PARAFAC2 decomposition of a 3-mode tensor.
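To make the shapes in Eq. 2 concrete, the following NumPy sketch builds synthetic slices \(X_{k}\) with different numbers of rows \(I_{k}\) from a shared V and per-slice \(U_{k}\) and \(S_{k}\). The slice lengths and random factors are hypothetical, and the sketch does not enforce the additional PARAFAC2 constraint on the \(U_{k}\); it only illustrates how an irregular collection of slices fits the model.

```python
import numpy as np

rng = np.random.default_rng(1)
J, R = 14, 3                      # 14 variables (channels) and rank 3, for illustration
lengths = [120, 95, 210]          # hypothetical I_k: observations recorded per slice
K = len(lengths)

V = rng.standard_normal((J, R))   # factor matrix shared by all slices
W = rng.standard_normal((K, R))   # one row of weights per slice
U = [rng.standard_normal((I_k, R)) for I_k in lengths]  # per-slice first-mode factors

# Eq. 2: X_k = U_k S_k V^T with S_k = diag(W(k, :)).
X = [U[k] @ np.diag(W[k]) @ V.T for k in range(K)]

shapes = [x.shape for x in X]     # every slice has J columns but its own row count
```

Because the slices have different numbers of rows, they are kept in a Python list instead of a single 3-dimensional array.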

PARAFAC2 can naturally handle both sparse and dense data [18]. However, this has held only for a small number of subjects [6]. The SPARTan algorithm is used for PARAFAC2 decomposition when the dataset is large and sparse [31].

3.2 Formulating Our Problem Using PARAFAC2

To apply a tensor decomposition method, we first need to form the tensor. We form the initial tensor from the phishing detection brain data of all participants. The tensor for this experiment has three modes: time \(\times \) channel \(\times \) events.

In this experiment, the participants were free to take as much time as necessary to decide whether the current website was phishing or not. Since the participants were not required to decide within a particular time frame, different participants took a variable amount of time for each event. Therefore, it is not possible to form a regular tensor or to apply a general tensor decomposition algorithm.

To solve this problem, we use the PARAFAC2 model in this experiment. The SPARTan [31] algorithm is used to compute the PARAFAC2 decomposition. This algorithm uses the Matricized-Tensor-Times-Khatri-Rao-Product (MTTKRP) kernel. The major benefit of SPARTan is that it can handle large and sparse datasets. Moreover, it is more scalable and faster than existing PARAFAC2 decomposition algorithms.

3.3 Phishing Detection and Tensor

In this project, each participant was shown real and phishing websites, and the EEG signal was captured during that time. The participants were free to take as much time as required to decide whether the website was real or not. Therefore, the observations for a set of variables do not align properly, and the PARAFAC2 model is used to meaningfully align the data.

To create the PARAFAC2 model, the EEG data of all users for both real and phishing websites were merged. The 3-mode tensor was then formed as Time \(\times \) Channel \(\times \) Events. The events mode includes both the real and the phishing websites. The tensor formed from this dataset therefore consists of 1853 events, 14 channels (variables), and a maximum of 3753 observations (time in seconds). Figure 3 shows the PARAFAC2 model of the phishing experiment.

Fig. 3. PARAFAC2 model representing the brain EEG data across different events.

The 3 factor matrices obtained from the decomposition are U, V, and W. These factor matrices represent the Time, Channel, and Events modes, respectively. In this experiment, we analyzed the V and W factor matrices to see, respectively, which channels capture high activity of brain regions and which components distinguish between real and phishing events.

In the SPARTan algorithm [31], a modified version of the Matricized-Tensor-Times-Khatri-Rao-Product (MTTKRP) kernel is used. It computes an intermediate matrix that is required in the PARAFAC2 decomposition algorithm. For a PARAFAC2 model, if our factor matrices are H, V, and W with dimensions R \(\times \) R, J \(\times \) R, and K \(\times \) R, respectively, then the MTTKRP for mode 1, with respect to K, is computed as

$$\begin{aligned} M^{(1)} = Y_{(1)} (W \odot V) \end{aligned}$$
(3)

Here, \(Y_{(1)}\) is the mode-1 matricization of the tensor Y with frontal slices \(Y_{k}\), and \(\odot \) is the Khatri-Rao product. The computation is then parallelized by computing the matrix multiplication as the sum of outer products for each block of \((W \odot V)\). An efficient way to compute the specialized MTTKRP is to first compute \(Y_{k}V\) for each slice and then take the Hadamard product of each row of this intermediate result with W(k, :). Since \(Y_{k}\) is column sparse, this avoids redundant operations. For this project, we computed the factor matrices in the Channel and Events modes using the above method.
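The equivalence between the explicit form of Eq. 3 and the slice-wise computation can be checked with a small dense NumPy sketch. This is not the SPARTan implementation, which exploits the column sparsity of \(Y_{k}\); the sizes and random inputs are illustrative assumptions, and the unfolding convention (the column index of \(Y_{(1)}\) runs over j fastest, then k) is chosen to match the row ordering of \(W \odot V\).

```python
import numpy as np

rng = np.random.default_rng(2)
R, J, K = 3, 14, 6                       # illustrative sizes
Y = rng.standard_normal((R, J, K))       # frontal slices Y[:, :, k] are R x J
V = rng.standard_normal((J, R))
W = rng.standard_normal((K, R))

# Explicit form of Eq. 3: mode-1 unfolding times the Khatri-Rao product.
Y1 = Y.transpose(0, 2, 1).reshape(R, K * J)                     # column index k*J + j
khatri_rao = (W[:, None, :] * V[None, :, :]).reshape(K * J, R)  # row index k*J + j
M_explicit = Y1 @ khatri_rao

# Slice-wise form: each row of (Y_k V) times W(k, :) element-wise, summed over k.
M_slices = sum((Y[:, :, k] @ V) * W[k] for k in range(K))
```

The slice-wise form never materializes the (K·J) \(\times \) R Khatri-Rao product, which is what makes it attractive when K is large.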

Brain Data vs Tensor Rank. In exploratory data mining problems, it is important to assess the quality of the results. To ensure a good-quality decomposition, it is important to select the right number of components as the rank of the decomposition. In this experiment, we used the AutoTen [29] algorithm to assess the performance of the decomposition with different ranks.

The application of the AutoTen algorithm is not straightforward for the phishing experiment, since the observations for a set of variables do not align properly. Therefore, a number of additional operations are performed to bring the tensor of the whole dataset into a naturally aligned form. In Eq. (2), if we decompose \(U_{k}\) as \(Q_{k} H\), then we can rewrite Eq. 2 as

$$\begin{aligned} X_{k} \approx Q_{k} H S_{k} V^{T} \end{aligned}$$
(4)

Here, \(Q_{k}\) has dimension \(I_{k}\,\times \,R\) and orthonormal columns, and H has dimension R \(\times \) R. If both sides of the above equation are multiplied on the left by \(Q_{k}^{T}\), then, since \(Q_{k}^{T} Q_{k}\) is the identity, we get

$$\begin{aligned} Q_{k}^{T} X_{k} \approx Q_{k}^{T} Q_{k} H S_{k} V^{T} = H S_{k} V^{T} \end{aligned}$$
(5)

Therefore, we can write,

$$\begin{aligned} Y_{k} \approx H S_{k} V^{T} \end{aligned}$$
(6)

Here, \(Y_{k} = Q_{k}^{T} X_{k}\) is the matrix product of \(Q_{k}^{T}\) and \(X_{k}\), with dimension R \(\times \) J for every k. The above equation now has the same form as the PARAFAC decomposition, with consistent dimensions in all modes. Stacking the slices \(Y_{k}\) gives a regular tensor Y, which is used as input to the AutoTen algorithm. The AutoTen algorithm was run with a maximum rank of 20, and rank 3 was found to give the best decomposition quality. Therefore, rank 3 is used for the PARAFAC2 decomposition with SPARTan.
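The alignment step in Eqs. 4 to 6 can be illustrated with a short NumPy sketch on synthetic, noise-free data. The slice lengths and random factors are hypothetical, and the orthonormal \(Q_{k}\) are generated with a reduced QR factorization purely for illustration; in practice they are estimated by the decomposition algorithm.

```python
import numpy as np

rng = np.random.default_rng(3)
J, R = 14, 3
lengths = [50, 80, 65]                   # hypothetical I_k values
K = len(lengths)

H = rng.standard_normal((R, R))
V = rng.standard_normal((J, R))
W = rng.standard_normal((K, R))

Y_slices = []
for k, I_k in enumerate(lengths):
    # Reduced QR gives an I_k x R matrix with orthonormal columns.
    Q_k, _ = np.linalg.qr(rng.standard_normal((I_k, R)))
    S_k = np.diag(W[k])
    X_k = Q_k @ H @ S_k @ V.T            # Eq. 4, exact because no noise is added
    Y_slices.append(Q_k.T @ X_k)         # Eq. 5: Q_k^T X_k, always R x J

Y = np.stack(Y_slices, axis=2)           # regular R x J x K tensor
```

Although the \(X_{k}\) have 50, 80, and 65 rows, every \(Y_{k}\) is R \(\times \) J, so the stacked `Y` is an ordinary tensor that a rank-selection method for regular tensors can consume.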

4 Classification Performance

In this section, we discuss our classification performance in distinguishing real and phishing pages based on neural data. We merge all the data across all sessions and all users. We extracted features from the brain data using tensor decomposition with rank 3, as determined by our modification of AutoTen discussed in Sect. 3.3. We then applied different types of machine learning algorithms to distinguish real and fake websites based on the brain data and checked their performance. We tested the Bayesian-type BayesNet (BN); the function-type Logistic Regression and MultilayerPerceptron; the rule-type JRip and DecisionTable; the lazy-type KStar and IB1; and the tree-type J48, RandomTree, Logistic Model Tree (LMT), and RandomForest (RF). We present the best classifier of each type (BayesNet, Logistic Regression, JRip, IB1, and RandomForest). We use 15-fold cross-validation because our dataset contains data from 15 users. The dataset is divided into 15 subsets; in each fold, 14 subsets form the training set and the remaining subset forms the test set.
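The fold structure can be sketched as follows in NumPy. The sketch assumes a shuffled split of the 1853 retained responses into 15 nearly equal folds; how samples were assigned to folds in our tooling is not specified here, and the function name `k_fold_indices` and the seed are illustrative.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    order = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(order, n_folds)
    for i, test_idx in enumerate(folds):
        # Training set: every fold except the i-th one.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=1853, n_folds=15))
```

Each sample appears in exactly one test fold, so averaging the per-fold scores uses every response once for evaluation.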

We tested our model using several metrics: accuracy, precision, recall, F1 score and Area Under the Curve (AUC). We compared our classification performance in two different cases.

  • All Channels: In this setting, we consider the data from all 14 channels as feature vectors.

  • Top 6 Channels: In this setting, we consider only the data from the top 6 highly activated channels as feature vectors. A detailed discussion can be found in Sect. 5.

Table 1. Classification Performance: In this table, we present the classification results of the five classifiers. Here, we have classification results for two scenarios. One for considering all channels for features extraction and another for considering only top 6 channels based on their activation. We have highlighted the accuracy of the best performing classifier in grey.

A summary of classification performance for the different metrics (Accuracy, Recall, Precision, and F-measure) can be found in Table 1. When considering all channels, the logistic regression algorithm gives 94% accuracy. When considering the top 6 highly activated channels, the Random Forest algorithm gives 97% accuracy. This improves on the prior study, which reported 76% accuracy for a phishing detection model built using neural signals when the participants were asked to identify real and fake websites under fNIRS scanning [25].
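For reference, the metrics reported in Table 1 follow the standard definitions based on confusion-matrix counts, sketched below in Python. The counts in the example call are hypothetical and are not results from our experiments.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)            # fraction of predicted positives that are correct
    recall = tp / (tp + fn)               # fraction of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts, for illustration only.
m = binary_metrics(tp=90, fp=5, tn=95, fn=10)
```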

We also validated our classification performance by plotting the ROC curve in Fig. 4 for the Random Forest algorithm, which gives the best accuracy among all the algorithms. In an ideal scenario, the AUC would be 100%. The baseline for AUC is 50%, which can be achieved through purely random guessing. Our model achieved 97.32% AUC when considering data from all channels and 99.22% when considering data from only the top 6 highly activated channels. With the False Positive Rate kept below 1%, the True Positive Rate is 79.04% for all channels and 94.91% for the top 6 channels. Reducing the number of channels thus gives better phishing detection accuracy.

Fig. 4. ROC curve for all channels vs top 6 channels using the Random Forest algorithm. The TPR is 79.04% for all channels and 94.91% for the top 6 channels when the FPR is <1%.
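Reading a true positive rate off the ROC curve at a bounded false positive rate can be sketched as below. The function name `tpr_at_fpr`, the labels, and the scores are illustrative assumptions; the sketch scans every observed score as a decision threshold and keeps the best TPR whose FPR does not exceed the bound.

```python
import numpy as np

def tpr_at_fpr(y_true, scores, max_fpr=0.01):
    """Highest true positive rate reachable while the false positive rate stays <= max_fpr."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    best = 0.0
    for t in np.unique(scores):
        pred = scores >= t                               # classify as positive at threshold t
        fpr = np.sum(pred & ~y_true) / np.sum(~y_true)
        tpr = np.sum(pred & y_true) / np.sum(y_true)
        if fpr <= max_fpr:
            best = max(best, float(tpr))
    return best

# Hypothetical labels and classifier scores, for illustration only.
y_demo = [1, 1, 1, 1, 0, 0, 0, 0]
s_demo = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.1, 0.05]
strict = tpr_at_fpr(y_demo, s_demo, max_fpr=0.0)
relaxed = tpr_at_fpr(y_demo, s_demo, max_fpr=0.25)
```

Tightening the FPR bound can only lower the achievable TPR, which is why the operating point at FPR <1% is a stricter summary than the AUC alone.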

5 Discussion

In this section, we explain why we obtain good accuracy in classifying real and fake websites using brain data. We highlight two key points. First, we show that certain brain areas are highly activated during the phishing detection task. Second, we show that there is a statistically significant difference between the real and fake components.

5.1 Phishing Detection vs Brain Areas

In this section, we provide concise neuroscientific insight into the brain data measured for phishing detection. We discuss the relationship between brain activity and the phishing detection task. In our experiments, we collected brain data from the human scalp using a commercially available non-invasive brain-computer interface device. The data we collected using the Emotiv Epoc+ device come from fourteen sensors (AF3, F7, F3, FC5, T7, P7, O1, O2, P8, T8, FC6, F4, F8, AF4), as shown in Fig. 5. These sensors are placed on different regions according to the International 10–20 system. Two sensors positioned above the participant’s ears (CMS/DRL) are used as references. The sensor locations and the functionality of each region are given below:

  • Frontal Lobe, located at the front of the brain and associated with reasoning, attention, short-term memory, planning, and expressive language. The sensors placed in this area are AF3, F7, F3, FC5, FC6, F4, F8, and AF4.

  • Parietal Lobe, located in the middle section of the brain and associated with perception, making sense of the world, and arithmetic. The sensors P7 and P8 belong to this area.

  • Occipital Lobe, located in the back portion of the brain and associated with vision. The sensors from this location are O1 and O2.

  • Temporal Lobe, located on the bottom section of the brain and associated with sensory input processing, language comprehension, and visual memory retention. The sensors of this location are T7 and T8.

Based on the factor analysis in the channel dimension, we observed that mostly frontal lobe and parietal lobe sensors (AF3, F3, FC5, F7, P7, and P8) are highly activated during the phishing detection task. In Fig. 5(a), we present the channel activity based on the channel factor data. Here, we consider all phishing detection events and obtain the factor matrix in the channel dimension using rank 3. We use the first component to draw this graph. We found the same subset of channels when considering the second and third components. In Fig. 5(b), we show the corresponding brain mapping for the phishing detection task. Deeper red indicates higher brain activity during the phishing detection task. Our findings are aligned with the prior fMRI [26] and fNIRS [25] studies.

Fig. 5. (a) Channel activity after applying the SPARTan decomposition to the tensor. The channel data for the first component are plotted to determine which channels have high activity. (b) The corresponding brain region activation. (Color figure online)
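Selecting the most active channels from the channel factor matrix can be sketched as follows. This is a minimal illustration under stated assumptions: channels are ranked by the absolute value of their loading on one component of V, the helper name `top_channels` is hypothetical, and the loadings are random placeholders, not the values plotted in Fig. 5.

```python
import numpy as np

channels = ["AF3", "F7", "F3", "FC5", "T7", "P7", "O1",
            "O2", "P8", "T8", "FC6", "F4", "F8", "AF4"]

def top_channels(V, names, component=0, n_top=6):
    """Rank channels by the magnitude of their loading on one component of V (J x R)."""
    order = np.argsort(-np.abs(V[:, component]), kind="stable")
    return [names[i] for i in order[:n_top]]

# Placeholder loadings, NOT the values from this study.
rng = np.random.default_rng(4)
V_demo = rng.random((len(channels), 3))
selected = top_channels(V_demo, channels)
```

Repeating the call with `component=1` and `component=2` gives the rankings for the other two components, which can then be compared with the first.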

5.2 Statistical Analysis: Real vs Fake Events

In this subsection, we present the statistical analysis of the components obtained from the tensor analysis. First, we performed the Kolmogorov-Smirnov (KS) test to determine the statistical distribution of the first component values of the real and fake factor matrices. The KS test showed that the distribution of the real and fake samples was non-normal (p < .0005). We then applied the Wilcoxon Signed-Rank Test, a non-parametric test comparing two sets of scores that come from the same participants, to measure the difference between the real and fake components. We observed a statistically significant difference between the real and fake components (Z \(=\) 6.8, p < .0005).

5.3 Feature Space Reduction

One of the primary applications of our study is reducing the dimension of the feature vector by keeping the features related to the highly activated frontal and parietal channels. We observed that the prediction accuracy of the machine learning model trained on the features belonging to the top 6 highly activated channels was better than that of the models trained on features related to all channels. Our model achieved 97% accuracy with the reduced feature vector. From the ROC curve in Fig. 4, we can see that the true positive rate increases from 79% to 94% when we use the reduced feature vector in classification while keeping the false positive rate <1%.

6 Related Works

Phishing attacks usually come in different forms or structures. In the case of a phishing website, the front-end structure of the website or the URL is changed in a way that is sometimes difficult to distinguish from the real website. A number of tools consider different features to detect a phishing website automatically. However, different studies show that these tools should consider the behavioral aspect of the user as well [11]. In different experiments, participants were tested on their ability to identify the features of a website, for example by evaluating the website URL, identifying icons or logos, and drawing on past web experience. It has been found that participants who know about phishing are less likely to fall for a phishing website.

To make users aware of phishing websites, proper education on this topic is required. Several works discuss how to identify phishing websites from URLs [22]. These works show that, by looking at the lexical and host-based (IP address, domain name, etc.) features of the URL, it can be easily determined whether the website is phishing or not. In this work, the accuracy obtained in classifying phishing and real webpages is 95–99%. Furthermore, it has been found that, if appropriate education is provided, users become more effective at avoiding phishing websites [4]. It has also been studied which types of browser phishing warnings work better for users, and active warnings were found to outperform passive ones [12].

Apart from understanding user behavior while browsing the internet, it is also possible to prevent phishing by tracking the hacker’s behavior. A hybrid feature selection method has been applied to capture the phishing attacker’s behavior from the email header [2], achieving an accuracy of 94%. In these methods, both the content of the email header and its behavioral basis are considered for feature selection.

Automated Phishing Detection Method: To detect phishing websites automatically, the pattern of the URL is considered as the primary signal, and with the aid of machine learning algorithms it can protect the user from a phishing attack. However, these models do not perform well because of the limited number of features. Moreover, a method based on domain top-page similarity is also used for phishing detection [33], with a maximum AUC of 93%.

There are a few more automated phishing detection systems that use density-based spatial clustering techniques to distinguish phishing and real websites [20], with an accuracy of 91.44%. Linear classifiers are also used for the phishing detection problem, and a phishing domain ontology is also used for this task [42]. The content of a webpage is analyzed and, based on its linguistic features, an accuracy of 97% is achieved.

Tensor Decomposition and Phishing Detection: Tensors are useful for representing and visualizing EEG brain data. They provide a compact representation of brain network data. Moreover, tensor decomposition is useful for capturing the underlying structure of brain data. In Cichocki et al. [8], a brain-computer interface system is used in which tensor decomposition is applied to EEG signals. Tensor decomposition has already been applied for feature extraction in different problems involving EEG data. In P300-based BCIs, tensor decomposition is used to extract hidden features because of its multi-linear structure [27]. Unlike general Event-Related Potential (ERP) based BCI approaches, tensors can consider both temporal and spatial structure for feature extraction instead of only temporal structure, which yields better accuracy [8, 9]. Tensor decomposition has also been used for the classification of mild and severe Alzheimer’s disease using brain EEG data [19].

Tensor decomposition has been used for brain data analysis as well. GEBM is an algorithm that models brain activity effectively [30]. SEMIBAT is a semi-supervised brain network analysis approach based on constrained tensor factorization [5]. Its optimization objective is solved using the Alternating Direction Method of Multipliers (ADMM) framework. The SEMIBAT method showed a 31.60% improvement over plain tensor factorization on the graph classification problem for EEG brain networks.

Tensor decomposition methods have been applied to a variety of problems related to the analysis of brain signals. However, the idea of applying tensor decomposition methods in an automated system whose main task is to classify phishing and real websites based on brain EEG data is novel. In our case, we achieved a classification accuracy for real and phishing websites as high as 97% using neural signatures.

7 Conclusion

In this paper, we show that the tensor representation of brain data helps to better understand brain activation during the phishing detection task. In this scheme, because the tensor represents the channel, time, and event modes jointly, different characteristics of the EEG signals can be presented simultaneously. We observed that the right frontal and parietal areas are highly activated in participants during the phishing website detection task. These areas are involved in decision making, reasoning, and attention. We use the AutoTen algorithm to measure the quality of the result and to choose a proper rank for the decomposition. We reduce the dimension of the feature vectors and achieve a maximum classification accuracy of 97% when considering only the data from sensors in highly activated brain areas. Our results show that the proposed methodology can be used in the cybersecurity domain for detecting phishing attacks using human brain data.