Abstract
Protecting data and detecting and preventing insider threats are significant steps that organizations should take to strengthen their internal security. Data loss prevention (DLP) is an emerging mechanism that organizations use to detect and block unauthorized data transfers. Existing DLP approaches, however, face several practical challenges that limit their effectiveness. In this chapter, we present a new DLP approach that extracts and analyzes the semantics of document content and addresses many of these challenges. We introduce the notion of a document semantic signature (DSS) as a summarized representation of a document's semantics. Experiments on a public dataset show that the semantic signature can be used to detect data leaks, with encouraging results: an average false positive rate (FPR) of 6.71% and an average detection rate (DR) of 84.47%.
Abbreviations
- BM: Boyer–Moore algorithm
- CBSD: Component-based software development
- CF: Concept vector file
- CM: Concept map
- CS: Cosine similarity
- CT: Concept tree
- DCT: Document concept tree
- DL: Ontology description logics
- DLP: Data loss prevention
- DR: Detection rate
- DSS: Document semantic signature
- FDR: False discovery rate
- FIBO: Financial Industry Business Ontology
- FNR: False negative rate
- FPR: False positive rate
- IDF: Inverse document frequency
- IDS: Intrusion detection systems
- KB: Knowledge base
- KDE: Kernel density estimation
- NIDS: Network-based intrusion detection system
- NTAC: National Threat Assessment Center
- OWL: Ontology web language
- RDF: Resource description framework
- RNCVM: Relevancy nodes-based concept vector model
- SIDD: Sensitive information dissemination detection
- SVM: Support vector machines
- SW: Smith–Waterman algorithm
- TF: Term frequency
- TF-IDF: Term frequency inverse document frequency
References
E. Kowalski, D. Cappelli, A. Moore, U.S. Secret Service and CERT/SEI Insider Threat Study: Illicit Cyber Activity in the Information Technology and Telecommunications Sector (Carnegie Mellon Software Engineering Institute, Pittsburgh, 2008)
D.L. Costa, M.L. Collins, S.J. Perl, et al., An Ontology for Insider Threat Indicators: Development and Applications (Carnegie Mellon University, Software Engineering Institute, Pittsburgh, 2014)
M. Kandias, A. Mylonas, N. Virvilis, et al., An insider threat prediction model, in International Conference on Trust, Privacy and Security in Digital Business, (Springer, Cham, 2010), pp. 26–37
A.W. Udoeyop, Cyber Profiling for Insider Threat Detection. Master’s Thesis, University of Tennessee (2010)
Y. Liu, C. Corbett, K. Chiang, et al., SIDD: a framework for detecting sensitive data exfiltration by an insider attack, in 42nd Hawaii International Conference on System Sciences (HICSS'09) (IEEE, 2009), pp. 1–10
H. Ragavan, Insider threat mitigation models based on thresholds and dependencies (University of Arkansas, Fayetteville, 2012)
P. Raman, H.G. Kayacık, A. Somayaji, Understanding data leak prevention, in 6th Annual Symposium on Information Assurance (ASIA’11) (2011), pp. 27–3
S. Liu, R. Kuhn, Data loss prevention. IT Professional 12(2), 10–13 (2010)
M. Hart, P. Manadhata, R. Johnson, Text classification for data loss prevention, in PETS 2011, ed. by S. Fischer-Hübner, N. Hopper. LNCS, vol. 6794 (2011), pp. 18–37
V. Stamati-Koromina, C. Ilioudis, R. Overill, et al., Insider threats in corporate environments: a case study for data leakage prevention, in Proceedings of the Fifth Balkan Conference in Informatics, (ACM, New York, 2012), pp. 271–274
Y. Canbay, H. Yazici, S. Sagiroglu, A Turkish language based data leakage prevention system, in Digital Forensic and Security (ISDFS), 5th International Symposium (IEEE, April 2017), pp. 1–6
S. Vodithala, S. Pabboju, A keyword ontology for retrieval of software components. Int. J. Control Theory Appl. 10(19), 177–182 (2017)
M. Fernández, I. Cantador, V. López, et al., Semantically enhanced information retrieval: an ontology-based approach. Web Semant. Sci. Serv. Agents World Wide Web 9(4), 434–452 (2011)
K. Doing-Harris, Y. Livnat, S. Meystre, Automated concept and relationship extraction for the semi-automated ontology management (SEAM) system. J. Biomed. Semant. 6, 15 (2015)
H.Z. Liu, H. Bao, D. Xu, Concept vector for similarity measurement based on hierarchical domain structure. Comput. Inform. 30(5), 881–900 (2012)
C. Corley, R. Mihalcea, Measuring the semantic similarity of texts, in Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment (Association for Computational Linguistics, 2003), pp. 13–18
Onix, Onix Text Retrieval Toolkit API Reference (2017), http://www.lextek.com/manuals/onix/stopwords1.html, Accessed 14 Nov 2017
B. Klimt, Y. Yang, The Enron Corpus: a new dataset for email classification research, in Machine learning, ECML 2004, (Springer, Berlin, 2004), pp. 217–226
FIBO, Financial Industry Business Ontology (2017), https://www.edmcouncil.org/financialbusiness. Accessed 20 Oct 2017
Business Balls (2017), http://www.businessballs.com/business-thesaurus.htm. Accessed 19 Oct 2017
Enron Email Dataset (2017), http://www-2.cs.cmu.edu/~enron/. Accessed 20 Oct 2017
A. Mahajan, S. Sharma, The malicious insiders threat in the cloud. Int. J. Eng. Res. Gen. Sci. 3(2), 245–256 (2015)
Appendix: Examples
In this section, we illustrate the DSS model through practical examples. Table 7.7 shows the concepts from the partial FIBO ontology graph in Fig. 7.3, along with their node labels and depths.
Figure 7.4 shows a sample document, specifically an e-mail from the Enron e-mail dataset [18, 21]; we refer to it as the reference (or sensitive) document d_1.
The extracted concept file, document concept tree, and document semantic signature from the above e-mail sample are shown in Figs. 7.5 and 7.6.
To illustrate the matching process, assume that the reference set contains three sensitive documents (including the e-mail sample given above, d_1): M = (d_1, d_2, d_3).
Using the above approach, we can generate the reference signature SS(M) = {SS(d_1), SS(d_2), SS(d_3)}.
Assume that we have two monitored documents, CF1 and CF2, that need to be matched against the reference signature. Figure 7.7 shows the matching process of the monitored document CF1 against the reference signature SS(M).
This figure shows how each concept vector in the monitored document CF1 is compared against the semantic signatures of all sensitive documents. Whenever a monitored concept vector matches a document's semantic signature, the corresponding frequency is incremented by one and stored in the frequency matrix. The same steps are then repeated for every concept vector in the monitored document against the remaining documents' semantic signatures. For example, CF1 has 9 matched concepts in SS(d_1), 4 matched concepts in SS(d_2), and 31 matched concepts in SS(d_3).
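The counting step described above can be sketched in Python. This is only a minimal illustration, not the authors' implementation: it assumes that each concept vector can be reduced to a hashable label and each semantic signature to a set of such labels. The names `match_frequencies`, `monitored_cf`, and `reference`, as well as the toy concept labels, are hypothetical.

```python
from typing import Dict, Hashable, Iterable, Set


def match_frequencies(
    monitored: Iterable[Hashable],
    reference: Dict[str, Set[Hashable]],
) -> Dict[str, int]:
    """Count, per reference signature, how many monitored concepts match it."""
    counts = {doc_id: 0 for doc_id in reference}
    for concept in monitored:
        for doc_id, signature in reference.items():
            if concept in signature:
                # One match found: increment this document's frequency by one.
                counts[doc_id] += 1
    return counts


# Toy data (hypothetical labels, not taken from FIBO or the Enron corpus).
reference = {
    "d1": {"contract", "loan", "equity"},
    "d2": {"loan", "bond"},
    "d3": {"contract", "equity", "swap", "option"},
}
monitored_cf = ["contract", "equity", "swap", "invoice"]

frequencies = match_frequencies(monitored_cf, reference)
print(frequencies)
```

Each entry of the returned dictionary plays the role of one cell in the frequency matrix for a single monitored file.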
The two matrices below represent the matching frequencies for the two monitored documents CF1 and CF2.
The corresponding frequency-percentage matrices for the monitored concept vector files CF1 and CF2 show that, for CF1, the highest frequency percentage is 91.18% against SS(d_3) and the lowest is 22.22% against SS(d_2). For the second monitored file, CF2, the highest frequency percentage is 20% against SS(d_1) and the lowest is 11.11% against SS(d_2).
In addition, the Jaccard index is calculated for both monitored documents. The highest Jaccard index for CF1 is 79.49% against SS(d_3), while the highest Jaccard index for CF2 is 42.86% against SS(d_1).
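As a rough illustration of the two measures, the sketch below computes a frequency percentage and the standard Jaccard index (intersection size over union size) on sets of concept identifiers. Several things here are assumptions rather than facts from the chapter: the frequency percentage is taken to be the number of matches divided by the size of the sensitive document's signature, and the set sizes (36 monitored concepts, 34 signature concepts) are hypothetical values chosen only so that 31 shared concepts give the percentages reported for CF1 against SS(d_3). The CF2 figures are not reproduced, since this excerpt does not give the underlying counts.

```python
def frequency_percentage(monitored: set, signature: set) -> float:
    """Share of the signature's concepts found in the monitored file (assumed definition)."""
    if not signature:
        return 0.0
    return 100.0 * len(monitored & signature) / len(signature)


def jaccard_percentage(monitored: set, signature: set) -> float:
    """Standard Jaccard index |A ∩ B| / |A ∪ B|, expressed as a percentage."""
    union = monitored | signature
    if not union:
        return 0.0
    return 100.0 * len(monitored & signature) / len(union)


# Hypothetical integer concept ids: 36 monitored concepts, 34 signature
# concepts, 31 of them shared (ids 5..35). These sizes are assumptions.
cf1 = set(range(0, 36))
ss_d3 = set(range(5, 39))

freq = round(frequency_percentage(cf1, ss_d3), 2)
jacc = round(jaccard_percentage(cf1, ss_d3), 2)
print(freq, jacc)
```

Under these assumed sizes the shared count is 31, the signature size is 34, and the union size is 39, which is how the two percentages arise.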
Based on these measures, our model classifies CF1 as a suspicious file: its frequency percentage F(CF1) = 91.18% against SS(d_3) and its Jaccard index J(CF1) = 79.49% against SS(d_3) both exceed the 60% threshold.
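The decision rule in this example can be written as a short Python check. This is a sketch under stated assumptions: the function name `is_suspicious` is hypothetical, and requiring both measures to exceed the threshold (a logical AND) is one reading of the example above, not something this appendix states explicitly. The percentages are the values reported in the text.

```python
def is_suspicious(
    frequency_pct: float,
    jaccard_pct: float,
    threshold_pct: float = 60.0,
) -> bool:
    """Flag a monitored file when both measures exceed the threshold (assumed AND rule)."""
    return frequency_pct > threshold_pct and jaccard_pct > threshold_pct


# Best-matching values reported in the text for each monitored file.
cf1_flag = is_suspicious(frequency_pct=91.18, jaccard_pct=79.49)
cf2_flag = is_suspicious(frequency_pct=20.0, jaccard_pct=42.86)
print(cf1_flag, cf2_flag)
```

With the reported values, CF1 is flagged, while CF2's best frequency percentage and best Jaccard index both fall below 60%, so this rule would not flag it.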
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Alhindi, H., Traore, I., Woungang, I. (2019). Data Loss Prevention Using Document Semantic Signature. In: Woungang, I., Dhurandher, S. (eds) 2nd International Conference on Wireless Intelligent and Distributed Environment for Communication. WIDECOM 2018. Lecture Notes on Data Engineering and Communications Technologies, vol 27. Springer, Cham. https://doi.org/10.1007/978-3-030-11437-4_7
DOI: https://doi.org/10.1007/978-3-030-11437-4_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-11436-7
Online ISBN: 978-3-030-11437-4
eBook Packages: Engineering, Engineering (R0)