1 Introduction

Cybersecurity (CS) studies the processes and technologies that protect computers, programs, networks, and data from attacks, unauthorized access or modification, and destruction. Cyber attacks can be characterized as actions that attempt to bypass the security mechanisms of computer systems [31]. Cyber attack detection aims to identify individuals who try to use a computer system without authorization, or who have access to the system but abuse their privileges [10]. Most attacks can be grouped into Denial of Service attacks, Remote to Local attacks, User to Root attacks, and Probing [34].

Due to the increasing number of cyber attack incidents, CS has become a critical concern for every Internet user. Well-trained hackers render traditional online security protection methods, such as firewalls or virus detection software, far less effective. Meanwhile, user behavior analysis (UBA) has emerged as a new way to detect online attacks and to perform real-time analysis based on user behavior and actions.

For example, email is a main channel for spreading viruses and Trojan horses. Attackers widely distribute emails containing worm programs to infect computers and networks. These emails usually have attractive subject lines that draw users’ attention and lure them into opening the messages. For example, the widely spread Verona virus, also known as the Romeo & Juliet virus, uses words such as “I love you” or “sorry”. If a user is easily attracted by these words and opens such emails from unknown senders, the user is likely to be infected and to pose a security threat to their system. Cautious users with good online behavior may instead check the sender’s address and judge whether an email is safe to open; in many cases they simply delete these emails and blacklist the sender address.

Meanwhile, big data analysis has been widely used in commercial applications such as product recommendation, where UBA is used to identify target customers for certain products. Recently, UBA has also gained traction in CS [26]. Compared with traditional attack detection methods, which detect the actions of a specific attack or the presence of virus software or Trojan horses, UBA focuses on identifying abnormal user actions relative to a user’s usual, normal activities. Warnings can then be generated and countermeasures implemented. In general, it is difficult for an attacker to imitate the original user’s behavior. UBA can also detect insider attacks, for example, an employee who illegally transfers company data to a competitor for profit.

Since UBA allows its users to implement preventive measures instead of merely detecting attacks, CS companies have started adding UBA to their products and services. However, to the best of our knowledge, most UBA approaches focus on finding vulnerable user operations and do not provide explanations or reasoning for their findings. In this paper, we propose a novel approach based on the concept of transparent learning, in which a prediction of attacks can be accompanied by explanations. As a result, an organization can become aware of its weaknesses and better prepare for proactive attack defense or reactive responses.

The rest of this paper is organized as follows. Section 2 summarizes related work and Sect. 3 presents our proposed approach. Section 4 describes an example to illustrate our approach and finally Sect. 5 concludes this paper.

2 Related Work

2.1 Machine Learning

There is a large body of work on using different machine learning (ML) methods and models for detecting abnormal operations and predicting potential attacks in CS. For example, artificial neural networks were used in Cannady [11] to classify user operations into different categories of user misuse. Lippmann and Cunningham [21] proposed an anomaly detection system using keyword selection via artificial neural networks. Bivens et al. [8] presented a complete intrusion detection system including the following stages: preprocessing, clustering of normal traffic, normalization, artificial neural network training, and artificial neural network decision. Jemili et al. [15] suggested a framework using Bayesian network classifiers with nine features from the KDD 1999 data for anomaly detection and attack type recognition. Kruegel et al. [17] used a Bayesian network to classify events for OS calls. In Li et al. [20], an SVM classifier with an RBF kernel was used to classify the KDD 1999 dataset into predefined categories (Denial of Service, Probe or Scan, User to Root, Remote to Local, and normal). Amiri et al. [2] used a least-squares SVM to achieve faster performance on the same dataset. Other ML models, such as Hidden Markov Models [5] and Naïve Bayes classifiers [35], have also been popular in CS, e.g., [3, 16, 25].

2.2 Rule Mining

Association rule mining was introduced by Agrawal et al. [1] for discovering frequently co-occurring items in supermarket data. Brahmi [9] applied the method to capture relationships between TCP/IP parameters and attack types. Zhengbing et al. [36] proposed a novel algorithm based on the signature Apriori algorithm [14] to find new attack signatures from existing ones. Decision trees (DT) such as ID3 [29] and C4.5 [30] are widely used for rule mining. Snort [24] is a well-known open-source tool using the signature-based approach, and Kruegel and Toth [18] used a DT to replace the misuse detection engine of Snort. Exposure [6, 7] is a system that uses the Weka J48 DT (an implementation of C4.5) as the classifier to detect domains involved in malicious activities. Inductive learning is a bottom-up approach that generates rules and theories from specific observations. Several ML algorithms, such as DT, are inductive, but when researchers refer to inductive learning, they usually mean Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [12] and the quasi-optimal (AQ) algorithm [22]. Lee et al. [19] developed a framework using several ML techniques, such as inductive learning, association rules, and sequential pattern mining, to detect user misuse.

2.3 User Behavior Analysis

Researchers in [4] used UBA in CS. In [26], clustering algorithms such as Expectation Maximization, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and k-means have been used for UBA. A combination of these methods has also been discussed in [33]. Commercial products/services such as Lancope, Splunk, and Solera have started including UBA in their offerings.

3 Our Approach

Unlike most existing UBA approaches, we follow the concept of transparent learning and use two independent modules that work together. First, a semi-supervised learning module rates the security risk of each user; the rating is a tuple of three scores that cover three different aspects, namely constancy, accuracy, and consistency. Second, a rule mining module identifies hidden patterns between historical user operations and an attack, and is used to explain why a user is rated at a particular security level.

3.1 Transparent Learning

Transparent learning is an ML concept that aims at the transparency of ML models, algorithms, and results. An ideal transparent learning technique is one that [32]:

  • Produces models that a typical user can read, understand and modify.

  • Uses algorithms that a typical user can understand and influence.

  • Allows a user to incorporate domain knowledge when generating the models.

The main advantage of transparent learning is its interpretability. This is particularly important in CS, especially for understanding the reasons behind a potential attack prediction. Most existing UBA systems can find potential vulnerabilities but are unable to provide reasoning or explanations, because they are based on clustering or outlier detection, and security experts may not understand why or how a prediction is made. To address this issue, we propose a transparent learning model containing two modules, as shown in Fig. 1.

Fig. 1. The two main modules

3.2 User Security Rating Module

Based on each user’s daily online activities and manner of using the computer, the user is given a security rating. This rating indicates the probability that the user may pose a threat to the system. User behavior reflects one’s personality and knowledge, and can be analysed from different aspects. Inspired by [13], we consider the following aspects when determining the rating (a minimal sketch of how such a rating might be computed is given after the list):

 

Constancy – how long the user has continuously maintained good online behavior. It indicates that the user has a good understanding of CS and/or the security policy of the company, is aware of potential online attacks, and cautiously protects themselves from such attacks.

Accuracy – how frequently the user makes security-related mistakes during online activities, and how accurately the user completes certain tasks. This reflects the user’s proficiency; an experienced user is likely to receive a better security rating than a beginner.

Consistency – whether the user’s online behavior follows consistent usage patterns. Consistent behavior is easier to track, and its associated risk is easier to minimize.
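As a minimal sketch (our own illustration; the weighting scheme and input fields below are assumptions, not taken from the paper), a three-score rating tuple could be derived from simple per-user statistics such as the length of incident-free behavior, the rate of security-related mistakes, and the variability of daily activity:

```python
from dataclasses import dataclass

@dataclass
class SecurityRating:
    constancy: float    # normalized length of incident-free behavior
    accuracy: float     # 1 - rate of security-related mistakes
    consistency: float  # how stable the user's daily activity pattern is

def rate_user(days_without_incident, mistakes, total_actions, daily_action_counts):
    """Toy rating: all three scores are scaled to [0, 1] (illustrative only)."""
    constancy = min(days_without_incident / 365.0, 1.0)
    accuracy = 1.0 - (mistakes / max(total_actions, 1))
    mean = sum(daily_action_counts) / len(daily_action_counts)
    variance = sum((c - mean) ** 2 for c in daily_action_counts) / len(daily_action_counts)
    consistency = 1.0 / (1.0 + variance / max(mean, 1e-9))
    return SecurityRating(constancy, accuracy, consistency)

print(rate_user(days_without_incident=120, mistakes=3,
                total_actions=500, daily_action_counts=[40, 42, 38, 45, 41]))
```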

 

3.3 Rule Mining Module

Our rule mining module is based on an inductive learning algorithm called GOLEM [23]. The algorithm works as follows. First, it generates least general generalization (LGG) [27, 28] clauses from pairs of positive examples. Then it picks the clause that covers the maximum number of positive examples and reduces it with respect to the negative examples. Finally, it adds the reduced clause as a sub-rule to the final clause and marks the positive examples it covers. This process repeats until all positive examples are covered.
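To make the covering loop concrete, the following is a simplified sketch in Python (not the authors’ implementation); the helpers lgg, reduce_clause, and covers are assumed placeholders for the LGG construction, negative-example-based reduction, and coverage test of a real ILP system.

```python
from itertools import combinations

def golem(positives, negatives, lgg, reduce_clause, covers):
    """Simplified GOLEM-style covering loop (illustrative sketch only).

    lgg(e1, e2): returns a clause generalizing two examples.
    reduce_clause(clause, negatives): drops literals while keeping the
        clause consistent with the negative examples.
    covers(clause, example): True if the clause covers the example.
    """
    rule = []                      # final rule: a list of sub-rules (clauses)
    uncovered = list(positives)    # positive examples not yet covered

    while uncovered:
        # 1. Build candidate LGG clauses from pairs of uncovered positives.
        candidates = [lgg(e1, e2) for e1, e2 in combinations(uncovered, 2)]
        if not candidates:
            candidates = uncovered[:]  # fall back to single examples

        # 2. Pick the candidate covering the most positive examples.
        best = max(candidates,
                   key=lambda c: sum(covers(c, p) for p in positives))

        # 3. Reduce it against the negative examples.
        best = reduce_clause(best, negatives)

        # 4. Add it as a sub-rule and mark the positives it covers.
        rule.append(best)
        uncovered = [p for p in uncovered if not covers(best, p)]

    return rule
```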

GOLEM is a useful algorithm for finding hidden rules that link a set of facts to an outcome. However, in UBA, action frequency also needs to be considered, because frequent user actions should carry more weight than less frequent ones.

Algorithm 1. UBF-GOLEM

Therefore, we extend the original GOLEM algorithm into User Behavior Frequency GOLEM (UBF-GOLEM). It is based on the notion of User Behavior Frequency Coverage (UBFC), the product of the coverage of a specific operation and the frequency of that operation’s category. We then change the LGG selection criterion from coverage-based to UBFC-based. The final algorithm is shown in Algorithm 1, and an example of how GOLEM and UBF-GOLEM work is described in Sect. 4.
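As a rough sketch of the change relative to the GOLEM loop sketched above (again using assumed helpers: op_category maps a clause to its operation category and freq gives that category’s relative frequency), only the clause-selection step is modified:

```python
def ubfc(clause, positives, covers, op_category, freq):
    """UBFC = (number of positives covered) x (frequency of the clause's
    operation category). Illustrative sketch only."""
    coverage = sum(covers(clause, p) for p in positives)
    return coverage * freq[op_category(clause)]

# In the covering loop sketched earlier, the selection step
#     best = max(candidates, key=lambda c: sum(covers(c, p) for p in positives))
# is replaced by
#     best = max(candidates, key=lambda c: ubfc(c, positives, covers,
#                                               op_category, freq))
```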

UBF-GOLEM is able to find the hidden rules behind attack actions. For each attack, we determine the most relevant operations related to it. After that, for attack prediction, vulnerable operations can be identified by comparing user operations with the generated rules, for example as sketched below.
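A minimal sketch of this matching step (our own illustration; the rule and fact representation is an assumption, not taken from the paper) could look like:

```python
def vulnerable_operations(user_facts, rule):
    """Return the user's operations that match any learned sub-rule.
    Illustrative only: facts and sub-rule conditions are represented
    as (predicate, value) pairs."""
    return {fact for fact in user_facts
            for sub_rule in rule
            if fact in sub_rule}

# Example: flag a user who visited the domain named in a learned sub-rule.
rule = [{("website_domain", "virus.com")}]           # learned sub-rules
user = {("website_domain", "virus.com"), ("send_email", "colleague")}
print(vulnerable_operations(user, rule))              # {('website_domain', 'virus.com')}
```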

4 An Illustrating Example

This section presents a simplified example to illustrate how the two modules work. Assume that a worm virus spreads from a domain called “virus.com”. We first prepare the encoded raw data to generate the user security rating via a semi-supervised learning method.

4.1 Training the User Security Rating Module

Raw data from users’ online activities, such as log-on data, device connection data, file transfer data, HTTP access data, and email data, are collected. A sample of these data is shown in Fig. 2.

Fig. 2. Raw data sample

Data Encoding. Each user is given a rating that contains three scores, corresponding to the three aspects (constancy, accuracy, and consistency) described in Sect. 3.2. Based on these scores, we can train a classifier to rate and rank users according to their likelihood and risk of being attacked.

Training. After the collected data are encoded into multi-dimensional numeric features, we cluster them using an unsupervised algorithm such as k-means or Expectation Maximization. We then manually select and label some data for supervised learning of a user’s security risk. Typical supervised ML algorithms, such as SVM, artificial neural networks, or DT, can be used to train a classifier, which is then used to predict the remaining unlabeled data.
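As an illustration, the following sketch (our own, not from the paper; the feature values, labeling rule, and number of clusters are assumptions) uses scikit-learn to cluster encoded user features with k-means, label a small subset, train an SVM on that subset, and predict risk for the remaining users.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

# Hypothetical encoded features: one row per user,
# columns = (constancy, accuracy, consistency) scores.
rng = np.random.default_rng(0)
X = rng.random((200, 3))

# 1. Unsupervised step: group users into coarse behavior clusters.
#    The clusters can guide which users to inspect when labeling.
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# 2. Manually label a small subset (simulated here with a toy rule:
#    users with low constancy and accuracy are marked high risk = 1).
labeled_idx = rng.choice(len(X), size=30, replace=False)
y_labeled = (X[labeled_idx, :2].mean(axis=1) < 0.4).astype(int)

# 3. Supervised step: train a classifier on the labeled subset.
clf = SVC(kernel="rbf", probability=True).fit(X[labeled_idx], y_labeled)

# 4. Predict a security risk score for every remaining user.
unlabeled_idx = np.setdiff1d(np.arange(len(X)), labeled_idx)
risk = clf.predict_proba(X[unlabeled_idx])[:, 1]
print(risk[:5])  # estimated probability that each user is high risk
```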

4.2 Using Inductive Learning

In the rule mining module, we classify user operations into different types. The checkpoints for each action type are shown in Table 1.

Example Data. For each user who has been attacked, we encode their data as discussed above. A snapshot of some positive examples is listed below:

(Listing: positive examples)
Table 1. Data encoding

Similarly, the following is a snapshot of some negative examples, taken from the actions of users who were not attacked:

(Listing: negative examples)

And finally, the following is a snapshot of some user actions for which we need to predict whether they are safe:

(Listing: user actions to be classified)
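Since the original listings are omitted here, the following purely hypothetical illustration shows one way such encoded examples could be represented; only the website_domain predicate is taken from the paper, and the other predicate names, users, and values are invented for illustration.

```python
# Purely hypothetical encoding of user-behavior examples.
# Each example is the set of operation facts observed for a user.
# Only website_domain appears in the paper; the rest is invented.

positive_examples = [  # users that were attacked
    {("website_domain", "virus.com"), ("open_attachment", "unknown_sender")},
    {("website_domain", "virus.com"), ("get_file", "external_drive")},
]

negative_examples = [  # users that were not attacked
    {("website_domain", "intranet.local"), ("send_email", "colleague")},
    {("get_file", "file_server")},
]

test_examples = [      # users whose safety we want to predict
    {("website_domain", "virus.com"), ("send_email", "colleague")},
    {("website_domain", "intranet.local")},
]
```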

Rules Generated by GOLEM. Based on the example data listed above, the sub-rules generated by GOLEM are shown below; the coverage of each clause is also given.

(Listing: sub-rules generated by GOLEM, with coverage)
Table 2. GOLEM result

The rule generated for the attack is:

(Listing: rule generated by GOLEM for the attack)

Applying the generated rule to the test user actions gives the results shown in Table 2.

Rules Generated by UBF-GOLEM. Similarly, we apply UBF-GOLEM to the example data. There are 77 actions across all positive and negative examples, and the top three most frequent operations are: access website (16), send email (10), and get file (10). The generated rules are as follows:

(Listing: rules generated by UBF-GOLEM)

Furthermore, the rule generated for the attack is:

(Listing: rule generated by UBF-GOLEM for the attack)

Finally, we can use the rules generated by UBF-GOLEM to determine the vulnerable actions of each user, as shown in Table 3.

Table 3. UBF-GOLEM result

UBF-GOLEM vs GOLEM. Using UBFC instead of simple coverage for clause selection has two advantages. First, it incorporates weights: frequent operations should be more important because they have more chances of being attacked. Second, in GOLEM, if two clauses have the same coverage, the algorithm simply chooses the first one; with UBFC, clauses with the same positive coverage are ordered by behavior frequency, which leads to a more meaningful selection criterion.

In the example above, the rules generated by GOLEM ignore the potentially important sub-rule [website_domain(“virus.com”)]. This is because, for each example covered by more than one clause, only the clause with the best coverage is selected: [website_domain(“virus.com”)] is the second best in the first two rounds, but it is discarded because all of the examples it covers have already been marked as covered by other sub-rules. In contrast, [website_domain(“virus.com”)] is the most important sub-rule generated by UBF-GOLEM, because website access is the most frequent category of user operations, which makes its UBFC value larger than those of the other sub-rules.
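To make the comparison concrete, the following toy calculation (our own illustration; the coverage counts and the email_attachment clause are invented, and only the operation frequencies 16, 10, and 10 out of 77 come from the example) shows how a clause with lower raw coverage can still win under UBFC when it belongs to a frequent operation category.

```python
# Operation-category frequencies from the example: 77 actions in total,
# with access-website = 16, send-email = 10, get-file = 10.
TOTAL_ACTIONS = 77
category_freq = {"access_website": 16 / TOTAL_ACTIONS,
                 "send_email": 10 / TOTAL_ACTIONS,
                 "get_file": 10 / TOTAL_ACTIONS}

# Hypothetical candidate sub-rules with invented positive-coverage counts.
candidates = [
    {"clause": 'website_domain("virus.com")', "category": "access_website", "coverage": 3},
    {"clause": 'email_attachment("exe")',     "category": "send_email",     "coverage": 4},
]

def ubfc(c):
    """User Behavior Frequency Coverage = coverage x category frequency."""
    return c["coverage"] * category_freq[c["category"]]

# Plain GOLEM would pick the clause with the highest raw coverage,
# while UBF-GOLEM picks the clause with the highest UBFC.
best_by_coverage = max(candidates, key=lambda c: c["coverage"])
best_by_ubfc = max(candidates, key=ubfc)
print(best_by_coverage["clause"])   # email_attachment("exe")
print(best_by_ubfc["clause"])       # website_domain("virus.com") with these counts

for c in candidates:
    print(c["clause"], round(ubfc(c), 3))
```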

4.3 Discussions

Compared with other CS systems, our proposal focuses on supporting reasoning and on determining the relationships between an attack and user operations. Our approach identifies the relationship between a threat and the user actions that may cause it. Based on the frequency and variety of these operations, companies may consider adjusting their security policies accordingly. In our two-module approach, the user security rating module gives each user a security rating, and the rule mining module determines potentially vulnerable operations performed by each user. This indicates which individuals or user groups, and which operations, need attention. When an attack happens, it will also be easier to locate and faster to respond to.

5 Conclusions

User behavior is useful for predicting potential attacks based on vulnerable user actions. Much work to date has focused on finding vulnerable operations rather than providing reasoning or explanations for its findings. In this paper, we have presented a transparent learning approach for UBA to address this issue. A user rating system determines the security level of each user, with explanations of potential attacks based on his or her vulnerable actions. A detailed example has been presented to illustrate how the approach works. We believe that, with justifiable reasoning from our proposed approach, an organization can become aware of the weaknesses of its current system and better prepare for proactive attack defense or reactive responses.