Data Loss Prevention Using Document Semantic Signature

Alhindi, Hanan; Traore, Issa; Woungang, Isaac

doi:10.1007/978-3-030-11437-4_7

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 27)

Included in the following conference series:

International Conference on Wireless Intelligent and Distributed Environment for Communication

Abstract

Data protection and insider threat detection and prevention are significant steps that organizations should take to enhance their internal security. Data loss prevention (DLP) is an emerging mechanism that organizations currently use to detect and block unauthorized data transfers. Existing DLP approaches, however, face several practical challenges that limit their effectiveness. In this chapter, we present a new DLP approach that addresses many of these challenges by extracting and analyzing the semantics of document content. We introduce the notion of a document semantic signature as a summarized representation of a document's semantics. Experiments on a public dataset show that the semantic signature can be used to detect data leaks, with very encouraging results: on average, a false positive rate (FPR) of 6.71% and a detection rate (DR) of 84.47%.

Abbreviations

BM:: Boyer–Moore algorithm
CBSD:: Component-based software development
CF:: Concept vector file
CM:: Concept map
CS:: Cosine similarity
CT:: Concept tree
DCT:: Document concept tree
DL:: Ontology description logics
DLP:: Data loss prevention
DR:: Detection rate
DSS:: Document semantic signature
FDR:: False discovery rate
FIBO:: Financial Industry Business Ontology
FNR:: False negative rate
FPR:: False positive rate
IDF:: Inverse document frequency
IDS:: Intrusion detection systems
KB:: Knowledge base
KDE:: Kernel density estimation
NIDS:: Network-based intrusion detection system
NTAC:: National Threat Assessment Center
OWL:: Web Ontology Language
RDF:: Resource description framework
RNCVM:: Relevancy nodes-based concept vector model
SIDD:: Sensitive information dissemination detection
SVM:: Support vector machines
SW:: Smith–Waterman algorithm
TF:: Term frequency
TF-IDF:: Term frequency inverse document frequency

References

1. E. Kowalski, D. Cappelli, A. Moore, U.S. Secret Service and CERT/SEI Insider Threat Study: Illicit Cyber Activity in the Information Technology and Telecommunications Sector (Carnegie Mellon Software Engineering Institute, Pittsburgh, 2008)
2. D.L. Costa, M.L. Collins, S.J. Perl, et al., An Ontology for Insider Threat Indicators Development and Applications (Software Engineering Institute, Carnegie Mellon University, Pittsburgh, 2014)
3. M. Kandias, A. Mylonas, N. Virvilis, et al., An insider threat prediction model, in International Conference on Trust, Privacy and Security in Digital Business (Springer, Cham, 2010), pp. 26–37
4. A.W. Udoeyop, Cyber Profiling for Insider Threat Detection. Master’s Thesis, University of Tennessee (2010)
5. Y. Liu, C. Corbett, K. Chiang, et al., SIDD: a framework for detecting sensitive data exfiltration by an insider attack, in 42nd Hawaii International Conference on System Sciences (HICSS’09) (IEEE, 2009), pp. 1–10
6. H. Ragavan, Insider threat mitigation models based on thresholds and dependencies (University of Arkansas, Fayetteville, 2012)
7. P. Raman, H.G. Kayacık, A. Somayaji, Understanding data leak prevention, in 6th Annual Symposium on Information Assurance (ASIA’11) (2011), pp. 27–3
8. S. Liu, R. Kuhn, Data loss prevention. IT Professional 12(2), 10–13 (2010)
9. M. Hart, P. Manadhata, R. Johnson, Text classification for data loss prevention, in PETS 2011, ed. by S. Fischer-Hübner, N. Hopper. LNCS, vol. 6794 (2011), pp. 18–37
10. V. Stamati-Koromina, C. Ilioudis, R. Overill, et al., Insider threats in corporate environments: a case study for data leakage prevention, in Proceedings of the Fifth Balkan Conference in Informatics (ACM, New York, 2012), pp. 271–274
11. Y. Canbay, H. Yazici, S. Sagiroglu, A Turkish language based data leakage prevention system, in 5th International Symposium on Digital Forensic and Security (ISDFS) (IEEE, April 2017), pp. 1–6
12. S. Vodithala, S. Pabboju, A keyword ontology for retrieval of software components. Int. J. Control Theory Appl. 10(19), 177–182 (2017)
13. M. Fernández, I. Cantador, V. López, et al., Semantically enhanced information retrieval: an ontology-based approach. Web Semant. Sci. Serv. Agents World Wide Web 9(4), 434–452 (2011)
14. K. Doing-Harris, Y. Livnat, S. Meystre, Automated concept and relationship extraction for the semi-automated ontology management (SEAM) system. J. Biomed. Semant. 6, 15 (2015)
15. H.Z. Liu, H. Bao, D. Xu, Concept vector for similarity measurement based on hierarchical domain structure. Comput. Inform. 30(5), 881–900 (2012)
16. C. Corley, R. Mihalcea, Measuring the semantic similarity of texts, in Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment (Association for Computational Linguistics, 2003), pp. 13–18
17. Onix, Onix Text Retrieval Toolkit API Reference (2017), http://www.lextek.com/manuals/onix/stopwords1.html. Accessed 14 Nov 2017
18. B. Klimt, Y. Yang, The Enron Corpus: a new dataset for email classification research, in Machine Learning: ECML 2004 (Springer, Berlin, 2004), pp. 217–226
19. FIBO, Financial Industry Business Ontology (2017), https://www.edmcouncil.org/financialbusiness. Accessed 20 Oct 2017
20. Business Balls (2017), http://www.businessballs.com/business-thesaurus.htm. Accessed 19 Oct 2017
21. Enron Email Dataset (2017), http://www-2.cs.cmu.edu/~enron/. Accessed 20 Oct 2017
22. A. Mahajan, S. Sharma, The malicious insiders threat in the cloud. Int. J. Eng. Res. Gen. Sci. 3(2), 245–256 (2015)

Author information

Authors and Affiliations

Electrical and Computer Engineering Department, University of Victoria, Victoria, BC, Canada
Hanan Alhindi & Issa Traore
Department of Computer Science, Ryerson University, Toronto, ON, Canada
Isaac Woungang

Corresponding author

Correspondence to Hanan Alhindi.

Editor information

Editors and Affiliations

Department of Computer Science, Ryerson University, Toronto, ON, Canada
Isaac Woungang
CAITFS, Department of Information Technology, Netaji Subhas University of Technology, New Delhi, India
Sanjay Kumar Dhurandher

Appendix: Examples

In this section, we illustrate the DSS model through practical examples. Table 7.7 shows the concepts from the partial FIBO ontology graph in Fig. 7.3, along with their node labels and depths.

Table 7.7 Partial FIBO ontology concepts, label, and depth

Figure 7.4 shows a sample document, specifically an e-mail from the Enron e-mail dataset [18, 21]; we refer to it as the reference (or sensitive) document d₁.

The extracted concept file, document concept tree, and document semantic signature from the above e-mail sample are shown in Figs. 7.5 and 7.6.

To illustrate the matching process, assume that the reference set contains three sensitive documents (including the e-mail sample given above, d₁): M = (d₁, d₂, d₃).

Using the above approach, we can generate the reference signature SS(M) = {SS(d₁), SS(d₂), SS(d₃)}.

Assume that we have two monitored documents, CF₁ and CF₂, that need to be matched against the reference signature. Figure 7.7 shows the matching process of the monitored document CF₁ against the reference signature SS(M).

This figure shows how each concept vector in the monitored document CF₁ is compared against the semantic signatures of all sensitive documents. Whenever a monitored concept vector matches a document’s semantic signature, the corresponding frequency is incremented by one and saved to the frequency matrix. The same steps are then repeated for every concept vector in the monitored document against the remaining documents’ semantic signatures. For example, CF₁ has nine matched concepts in SS(d₁), four in SS(d₂), and 31 in SS(d₃).

The two matrices below represent the matching frequencies for the two monitored documents CF₁ and CF₂.

$$ \begin{array}{cccc} & \mathrm{SS}(d_1) & \mathrm{SS}(d_2) & \mathrm{SS}(d_3) \\ \mathrm{CF}_1 & 9 & 4 & 31 \end{array} $$

$$ \begin{array}{cccc} & \mathrm{SS}(d_1) & \mathrm{SS}(d_2) & \mathrm{SS}(d_3) \\ \mathrm{CF}_2 & 3 & 2 & 6 \end{array} $$
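
The counting step that produces one row of such a frequency matrix can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes each concept vector can be reduced to a hashable label and that a "match" means exact membership in a signature set. The function name `match_frequencies` and the toy concept labels are hypothetical and are not taken from FIBO or the Enron sample.

```python
from typing import Dict, Hashable, Iterable, Set


def match_frequencies(
    monitored: Iterable[Hashable],
    signatures: Dict[str, Set[Hashable]],
) -> Dict[str, int]:
    """Count, per reference signature, how many monitored concept vectors match it."""
    counts = {name: 0 for name in signatures}
    for vector in monitored:
        for name, signature in signatures.items():
            # Assumed match rule: the monitored vector appears in the signature.
            if vector in signature:
                counts[name] += 1
    return counts


# Made-up reference signatures and monitored concept file, for illustration only.
signatures = {
    "d1": {"loan", "contract", "party"},
    "d2": {"contract", "currency"},
    "d3": {"loan", "party", "currency", "account"},
}
monitored_cf = ["loan", "party", "account", "weather"]

row = match_frequencies(monitored_cf, signatures)
print(row)
```

Each monitored document yields one such row; stacking the rows for several monitored documents gives the frequency matrix.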

The two matrices below show the frequency percentages for the monitored concept vector files CF₁ and CF₂. For CF₁, the highest frequency percentage is 91.18% in SS(d₃), while the lowest is 22.22% in SS(d₂). For CF₂, the highest frequency percentage is 20% in SS(d₁), while the lowest is 11.11% in SS(d₂).

$$ \begin{array}{cccc} & \mathrm{SS}(d_1) & \mathrm{SS}(d_2) & \mathrm{SS}(d_3) \\ F(\mathrm{CF}_1) & 60\% & 22.22\% & 91.18\% \end{array} $$

$$ \begin{array}{cccc} & \mathrm{SS}(d_1) & \mathrm{SS}(d_2) & \mathrm{SS}(d_3) \\ F(\mathrm{CF}_2) & 20\% & 11.11\% & 17.65\% \end{array} $$
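
The reported percentages are consistent with dividing each match count by the number of concepts in the corresponding signature (for example, 9/15 = 60%). The sketch below reproduces them under that reading. The signature sizes 15, 18, and 34 are inferred from the reported values and are not stated in the text, and `frequency_percentage` is a hypothetical helper name.

```python
def frequency_percentage(matches: int, signature_size: int) -> float:
    """Share of a signature's concepts that the monitored document matched, in percent."""
    return 100.0 * matches / signature_size


# Assumption: signature sizes inferred from the reported percentages.
SIGNATURE_SIZES = {"d1": 15, "d2": 18, "d3": 34}

# Match counts taken from the frequency matrices in the text.
MATCHES = {
    "CF1": {"d1": 9, "d2": 4, "d3": 31},
    "CF2": {"d1": 3, "d2": 2, "d3": 6},
}

frequencies = {
    cf: {
        doc: round(frequency_percentage(count, SIGNATURE_SIZES[doc]), 2)
        for doc, count in row.items()
    }
    for cf, row in MATCHES.items()
}

for cf, row in frequencies.items():
    print(cf, row)
```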

In addition, the Jaccard index is calculated for both monitored documents. The two matrices below show that the highest Jaccard index for CF₁ is 79.49% in SS(d₃), while the highest Jaccard index for CF₂ is 42.86% in SS(d₁).

$$ \begin{array}{cccc} & \mathrm{SS}(d_1) & \mathrm{SS}(d_2) & \mathrm{SS}(d_3) \\ J(\mathrm{CF}_1) & 21.43\% & 8\% & 79.49\% \end{array} $$

$$ \begin{array}{cccc} & \mathrm{SS}(d_1) & \mathrm{SS}(d_2) & \mathrm{SS}(d_3) \\ J(\mathrm{CF}_2) & 42.86\% & 8\% & 16.22\% \end{array} $$
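
As a rough check, the CF₁ row can be reproduced with the standard set-based Jaccard index, |A ∩ B| / |A ∪ B|, taking the match count as the intersection size. This is a sketch under assumptions: the signature sizes (15, 18, 34) and the size of CF₁ (36 concepts) are inferred from the reported figures and are not stated in the text, and `jaccard_percentage` is a hypothetical helper name. The same inference does not yield a single consistent size for CF₂, so only CF₁ is reproduced here.

```python
def jaccard_percentage(matches: int, monitored_size: int, signature_size: int) -> float:
    """Jaccard index in percent, with |A ∪ B| = |A| + |B| - |A ∩ B|."""
    union = monitored_size + signature_size - matches
    return 100.0 * matches / union


# Assumptions: sizes inferred from the reported percentages.
SIGNATURE_SIZES = {"d1": 15, "d2": 18, "d3": 34}
CF1_SIZE = 36

# Match counts for CF1 taken from the frequency matrix in the text.
CF1_MATCHES = {"d1": 9, "d2": 4, "d3": 31}

jaccard_cf1 = {
    doc: round(jaccard_percentage(count, CF1_SIZE, SIGNATURE_SIZES[doc]), 2)
    for doc, count in CF1_MATCHES.items()
}
print(jaccard_cf1)
```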

Based on the measures above, our model classifies CF₁ as a suspicious file: its frequency F(CF₁) = 91.18% against SS(d₃) exceeds the threshold value of 60%, and its Jaccard index J(CF₁) = 79.49% against SS(d₃) also exceeds the threshold value of 60%.
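
The decision step can be sketched as a simple threshold test. The sketch assumes that a monitored file is flagged when, for at least one reference signature, both the frequency percentage and the Jaccard index exceed their thresholds; combining the two conditions with a logical AND is one reading of the example, and `is_suspicious` is a hypothetical name. The input values are the ones reported in the matrices.

```python
from typing import Dict

FREQUENCY_THRESHOLD = 60.0  # percent, as given in the example
JACCARD_THRESHOLD = 60.0    # percent, as given in the example


def is_suspicious(
    frequency: Dict[str, float],
    jaccard: Dict[str, float],
    freq_threshold: float = FREQUENCY_THRESHOLD,
    jaccard_threshold: float = JACCARD_THRESHOLD,
) -> bool:
    """Flag the file if some signature exceeds both thresholds (assumed AND rule)."""
    return any(
        frequency[doc] > freq_threshold and jaccard[doc] > jaccard_threshold
        for doc in frequency
    )


# Values as reported in the frequency-percentage and Jaccard matrices.
f_cf1 = {"d1": 60.0, "d2": 22.22, "d3": 91.18}
j_cf1 = {"d1": 21.43, "d2": 8.0, "d3": 79.49}
f_cf2 = {"d1": 20.0, "d2": 11.11, "d3": 17.65}
j_cf2 = {"d1": 42.86, "d2": 8.0, "d3": 16.22}

print("CF1 suspicious:", is_suspicious(f_cf1, j_cf1))
print("CF2 suspicious:", is_suspicious(f_cf2, j_cf2))
```

Under this reading, CF₁ is flagged because of SS(d₃), while none of the CF₂ values exceed either threshold.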

About this paper

Cite this paper

Alhindi, H., Traore, I., Woungang, I. (2019). Data Loss Prevention Using Document Semantic Signature. In: Woungang, I., Dhurandher, S. (eds) 2nd International Conference on Wireless Intelligent and Distributed Environment for Communication. WIDECOM 2018. Lecture Notes on Data Engineering and Communications Technologies, vol 27. Springer, Cham. https://doi.org/10.1007/978-3-030-11437-4_7

DOI: https://doi.org/10.1007/978-3-030-11437-4_7
Published: 28 March 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-11436-7
Online ISBN: 978-3-030-11437-4
eBook Packages: Engineering, Engineering (R0)
