Abstract

Protecting data and detecting and preventing insider threats are important steps that organizations should take to strengthen their internal security. Data loss prevention (DLP) is an emerging mechanism that organizations use to detect and block unauthorized data transfers. Existing DLP approaches, however, face several practical challenges that limit their effectiveness. In this chapter, we present a new DLP approach that extracts and analyzes the semantics of document content and addresses many of these challenges. We introduce the notion of a document semantic signature (DSS), a summarized representation of a document's semantics. Experiments on a public dataset show that the semantic signature can be used to detect data leaks, yielding encouraging results: on average, a false positive rate (FPR) of 6.71% and a detection rate (DR) of 84.47%.

Abbreviations

BM: Boyer–Moore algorithm
CBSD: Component-based software development
CF: Concept vector file
CM: Concept map
CS: Cosine similarity
CT: Concept tree
DCT: Document concept tree
DL: Ontology description logics
DLP: Data loss prevention
DR: Detection rate
DSS: Document semantic signature
FDR: False discovery rate
FIBO: Financial Industry Business Ontology
FNR: False negative rate
FPR: False positive rate
IDF: Inverse document frequency
IDS: Intrusion detection systems
KB: Knowledge base
KDE: Kernel density estimation
NIDS: Network-based intrusion detection system
NTAC: National Threat Assessment Center
OWL: Web Ontology Language
RDF: Resource description framework
RNCVM: Relevancy nodes-based concept vector model
SIDD: Sensitive information dissemination detection
SVM: Support vector machines
SW: Smith–Waterman algorithm
TF: Term frequency
TF-IDF: Term frequency inverse document frequency

References

  1. E. Kowalski, D. Cappelli, A. Moore, U.S. Secret Service and CERT/SEI Insider Threat Study: Illicit Cyber Activity in the Information Technology and Telecommunications Sector (Carnegie Mellon Software Engineering Institute, Pittsburgh, 2008)

  2. D.L. Costa, M.L. Collins, S.J. Perl, et al., An Ontology for Insider Threat Indicators Development and Applications (Carnegie Mellon University, Software Engineering Institute, Pittsburgh, 2014)

  3. M. Kandias, A. Mylonas, N. Virvilis, et al., An insider threat prediction model, in International Conference on Trust, Privacy and Security in Digital Business (Springer, Cham, 2010), pp. 26–37

  4. A.W. Udoeyop, Cyber Profiling for Insider Threat Detection. Master’s Thesis, University of Tennessee (2010)

  5. Y. Liu, C. Corbett, K. Chiang, et al., SIDD: A framework for detecting sensitive data exfiltration by an insider attack, in 42nd Hawaii International Conference on System Sciences (HICSS’09) (IEEE, 2009), pp. 1–10

  6. H. Ragavan, Insider threat mitigation models based on thresholds and dependencies (University of Arkansas, Fayetteville, 2012)

  7. P. Raman, H.G. Kayacık, A. Somayaji, Understanding data leak prevention, in 6th Annual Symposium on Information Assurance (ASIA’11) (2011), pp. 27–3

  8. S. Liu, R. Kuhn, Data loss prevention. IT Professional 12(2), 10–13 (2010)

  9. M. Hart, P. Manadhata, R. Johnson, Text classification for data loss prevention, in PETS 2011, ed. by S. Fischer-Hübner, N. Hopper. LNCS, vol. 6794 (2011), pp. 18–37

  10. V. Stamati-Koromina, C. Ilioudis, R. Overill, et al., Insider threats in corporate environments: a case study for data leakage prevention, in Proceedings of the Fifth Balkan Conference in Informatics (ACM, New York, 2012), pp. 271–274

  11. Y. Canbay, H. Yazici, S. Sagiroglu, A Turkish language based data leakage prevention system, in 5th International Symposium on Digital Forensic and Security (ISDFS) (IEEE, April 2017), pp. 1–6

  12. S. Vodithala, S. Pabboju, A keyword ontology for retrieval of software components. Int. J. Control Theory Appl. 10(19), 177–182 (2017)

  13. M. Fernández, I. Cantador, V. López, et al., Semantically enhanced information retrieval: an ontology-based approach. Web Semant. Sci. Serv. Agents World Wide Web 9(4), 434–452 (2011)

  14. K. Doing-Harris, Y. Livnat, S. Meystre, Automated concept and relationship extraction for the semi-automated ontology management (SEAM) system. J. Biomed. Semant. 6, 15 (2015)

  15. H.Z. Liu, H. Bao, D. Xu, Concept vector for similarity measurement based on hierarchical domain structure. Comput. Inform. 30(5), 881–900 (2012)

  16. C. Corley, R. Mihalcea, Measuring the semantic similarity of texts, in Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment (Association for Computational Linguistics, 2003), pp. 13–18

  17. Onix, Onix Text Retrieval Toolkit API Reference (2017), http://www.lextek.com/manuals/onix/stopwords1.html. Accessed 14 Nov 2017

  18. B. Klimt, Y. Yang, The Enron Corpus: a new dataset for email classification research, in Machine Learning: ECML 2004 (Springer, Berlin, 2004), pp. 217–226

  19. FIBO, Financial Industry Business Ontology (2017), https://www.edmcouncil.org/financialbusiness. Accessed 20 Oct 2017

  20. Business Balls (2017), http://www.businessballs.com/business-thesaurus.htm. Accessed 19 Oct 2017

  21. Enron Email Dataset (2017), http://www-2.cs.cmu.edu/~enron/. Accessed 20 Oct 2017

  22. A. Mahajan, S. Sharma, The malicious insiders threat in the cloud. Int. J. Eng. Res. Gen. Sci. 3(2), 245–256 (2015)

Corresponding author

Correspondence to Hanan Alhindi.

Appendix: Examples

In this section, we illustrate the DSS model through practical examples. Table 7.7 shows the concepts from the partial FIBO ontology graph in Fig. 7.3, along with their node labels and depths.

Table 7.7 Partial FIBO ontology concepts, labels, and depths

Fig. 7.3 Partial representation of the FIBO ontology tree

Figure 7.4 shows a sample document, specifically an e-mail from the Enron e-mail dataset [18, 21]; we refer to it as the reference (or sensitive) document d1.

Fig. 7.4 Sample document from the Enron e-mail dataset

The extracted concept file, document concept tree, and document semantic signature from the above e-mail sample are shown in Figs. 7.5 and 7.6.

Fig. 7.5 Extracting concept tree and generating semantic signature for reference document d1

Fig. 7.6 Document semantic signature SS(d1)

To illustrate the matching process, assume that the reference set contains three sensitive documents, including the e-mail sample given above (d1): M = (d1, d2, d3).

Using the above approach, we can generate the reference signature SS(M) = {SS(d1), SS(d2), SS(d3)}.

Assume that we have two monitored documents, CF1 and CF2, that need to be matched against the reference signature. Figure 7.7 shows the matching process of the monitored document CF1 against the reference signature SS(M).

Fig. 7.7 Matching process of monitored document CF1 against the reference signature

Figure 7.7 shows how each concept vector in the monitored document CF1 is compared against the semantic signatures of all sensitive documents. Whenever a monitored concept vector matches a document's semantic signature, the corresponding frequency is incremented by one and stored in the frequency matrix. The same steps are then repeated for every concept vector in the monitored document against the remaining documents' semantic signatures. For example, CF1 has nine matched concepts in SS(d1), four matched concepts in SS(d2), and 31 matched concepts in SS(d3).
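The counting step described above can be sketched in Python. This is a minimal illustration, not the chapter's implementation: it assumes each concept vector can be reduced to a hashable label and that a "match" is plain membership in a signature's set of labels. The function name `match_counts` and the toy concept labels are hypothetical.

```python
from typing import Hashable, Iterable, List, Set


def match_counts(monitored: Iterable[Hashable],
                 signatures: List[Set[Hashable]]) -> List[int]:
    """Return one row of the frequency matrix for a monitored document.

    Assumption: a monitored concept vector "matches" a signature when it is
    a member of that signature's set; the chapter's matching rule may differ.
    """
    counts = [0] * len(signatures)
    for vector in monitored:                      # each concept vector in the CF
        for j, signature in enumerate(signatures):
            if vector in signature:               # match against SS(d_j)
                counts[j] += 1                    # increment the frequency by one
    return counts


# Toy data with hypothetical concept labels (not taken from FIBO or Enron).
reference = [{"loan", "equity"}, {"bond"}, {"loan", "bond", "swap"}]
row = match_counts(["loan", "swap", "contract"], reference)
print(row)  # [1, 0, 2]
```

Each monitored document yields one such row, with one column per reference signature, which is the shape of the frequency matrices used in the rest of this example.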

The two matrices below represent the matching frequencies for the two monitored documents CF1 and CF2.

$$ \begin{array}{cccc} & \mathrm{SS}(d_1) & \mathrm{SS}(d_2) & \mathrm{SS}(d_3) \\ \mathrm{CF}_1 & 9 & 4 & 31 \end{array} $$
$$ \begin{array}{cccc} & \mathrm{SS}(d_1) & \mathrm{SS}(d_2) & \mathrm{SS}(d_3) \\ \mathrm{CF}_2 & 3 & 2 & 6 \end{array} $$

The two matrices below give the frequency percentages for the monitored concept vector files CF1 and CF2. For CF1, the highest frequency percentage is 91.18% against SS(d3), while the lowest is 22.22% against SS(d2). For the second monitored file, CF2, the highest frequency percentage is 20% against SS(d1), while the lowest is 11.11% against SS(d2).

$$ \begin{array}{cccc} & \mathrm{SS}(d_1) & \mathrm{SS}(d_2) & \mathrm{SS}(d_3) \\ F(\mathrm{CF}_1) & 60\% & 22.22\% & 91.18\% \end{array} $$
$$ \begin{array}{cccc} & \mathrm{SS}(d_1) & \mathrm{SS}(d_2) & \mathrm{SS}(d_3) \\ F(\mathrm{CF}_2) & 20\% & 11.11\% & 17.65\% \end{array} $$
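The reported percentages are consistent with dividing each match count by the number of concepts in the corresponding signature. The short check below illustrates this under one assumption: the signature sizes (15, 18, and 34) are not stated in this excerpt and were back-computed from the reported values. The helper name `frequency_percentage` is hypothetical.

```python
def frequency_percentage(matches: int, signature_size: int) -> float:
    """Share of a signature's concepts matched by the monitored file, in percent."""
    return 100.0 * matches / signature_size


# ASSUMPTION: signature sizes inferred from the reported percentages,
# not given in the text.
SIGNATURE_SIZE = {"d1": 15, "d2": 18, "d3": 34}

# Match counts taken from the frequency matrices above.
MATCHES = {
    "CF1": {"d1": 9, "d2": 4, "d3": 31},
    "CF2": {"d1": 3, "d2": 2, "d3": 6},
}

percentages = {
    cf: {d: round(frequency_percentage(m, SIGNATURE_SIZE[d]), 2)
         for d, m in row.items()}
    for cf, row in MATCHES.items()
}
print(percentages["CF1"])  # {'d1': 60.0, 'd2': 22.22, 'd3': 91.18}
print(percentages["CF2"])  # {'d1': 20.0, 'd2': 11.11, 'd3': 17.65}
```

Under these assumed sizes, all six reported frequency percentages are reproduced from the match counts.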

In addition, the Jaccard index is calculated below for both monitored documents. The two matrices below show that the highest Jaccard index for CF1 is 79.49% in SS(d3), while the highest Jaccard index for CF2 is 42.86% in SS(d1).

$$ \begin{array}{cccc} & \mathrm{SS}(d_1) & \mathrm{SS}(d_2) & \mathrm{SS}(d_3) \\ J(\mathrm{CF}_1) & 21.43\% & 8\% & 79.49\% \end{array} $$
$$ \begin{array}{cccc} & \mathrm{SS}(d_1) & \mathrm{SS}(d_2) & \mathrm{SS}(d_3) \\ J(\mathrm{CF}_2) & 42.86\% & 8\% & 16.22\% \end{array} $$
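As a rough illustration for CF1 only, the standard Jaccard index of two sets A and B is |A ∩ B| / |A ∪ B|, where |A ∪ B| = |A| + |B| − |A ∩ B|. The sketch below applies that textbook definition to the CF1 match counts. The set sizes are assumptions back-computed from the reported CF1 values (36 concepts in CF1; 15, 18, and 34 in the three signatures) and are not stated in this excerpt; the chapter's exact computation may differ, and `jaccard_percentage` is a hypothetical helper name.

```python
def jaccard_percentage(matches: int, monitored_size: int,
                       signature_size: int) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| in percent, computed from set sizes."""
    union = monitored_size + signature_size - matches
    return 100.0 * matches / union


# ASSUMPTIONS: sizes inferred from the reported CF1 values, not from the text.
CF1_SIZE = 36
SIGNATURE_SIZE = {"d1": 15, "d2": 18, "d3": 34}

# Match counts for CF1 from the frequency matrix.
CF1_MATCHES = {"d1": 9, "d2": 4, "d3": 31}

jaccard_cf1 = {
    d: round(jaccard_percentage(m, CF1_SIZE, SIGNATURE_SIZE[d]), 2)
    for d, m in CF1_MATCHES.items()
}
print(jaccard_cf1)  # {'d1': 21.43, 'd2': 8.0, 'd3': 79.49}
```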

Based on these measures, our model classifies CF1 as a suspicious file: its frequency percentage against SS(d3), F(CF1) = 91.18%, is higher than the threshold value of 60%, and its Jaccard index against SS(d3), J(CF1) = 79.49%, is also higher than the threshold value of 60%.
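The decision step can be sketched as follows. This is an assumed reading of the rule, not a confirmed one: a monitored file is flagged when both its frequency percentage and its Jaccard index against the same reference signature are strictly above the threshold. The function name `flagging_signature` is hypothetical; the input values are the ones reported in the matrices above.

```python
from typing import Dict, Optional

THRESHOLD = 60.0  # percent; the threshold value quoted in the text


def flagging_signature(freq: Dict[str, float], jacc: Dict[str, float],
                       threshold: float = THRESHOLD) -> Optional[str]:
    """Return the first signature for which both measures exceed the threshold.

    ASSUMPTION: both measures must be strictly above the threshold for the
    same signature; returns None when no signature qualifies.
    """
    for doc in freq:
        if freq[doc] > threshold and jacc[doc] > threshold:
            return doc
    return None


# Values as reported in the frequency-percentage and Jaccard matrices.
F_CF1 = {"d1": 60.0, "d2": 22.22, "d3": 91.18}
J_CF1 = {"d1": 21.43, "d2": 8.0, "d3": 79.49}
F_CF2 = {"d1": 20.0, "d2": 11.11, "d3": 17.65}
J_CF2 = {"d1": 42.86, "d2": 8.0, "d3": 16.22}

print(flagging_signature(F_CF1, J_CF1))  # d3
print(flagging_signature(F_CF2, J_CF2))  # None
```

Under this reading, CF1 is flagged because of SS(d3), while none of the CF2 values exceeds 60%, so CF2 is not flagged.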

© 2019 Springer Nature Switzerland AG

Alhindi, H., Traore, I., Woungang, I. (2019). Data Loss Prevention Using Document Semantic Signature. In: Woungang, I., Dhurandher, S. (eds) 2nd International Conference on Wireless Intelligent and Distributed Environment for Communication. WIDECOM 2018. Lecture Notes on Data Engineering and Communications Technologies, vol 27. Springer, Cham. https://doi.org/10.1007/978-3-030-11437-4_7

  • DOI: https://doi.org/10.1007/978-3-030-11437-4_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-11436-7

  • Online ISBN: 978-3-030-11437-4

  • eBook Packages: Engineering, Engineering (R0)
