Confidential Terms Detection Using Language Modeling Technique in Data Leakage Prevention

Subhashini, Peneti; Rani, B. Padmaja

doi:10.1007/978-81-322-2526-3_29

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 381)

Abstract

Detecting confidential documents is a key activity in data leakage prevention. Once a document is marked as confidential, leakage of data from that document can be prevented. Confidential terms are significant terms that indicate confidential content in a document. This paper presents a method for detecting confidential terms using a language model with Dirichlet prior smoothing. Clusters are generated from the training documents, both confidential and nonconfidential. A language model is created separately for the confidential and the nonconfidential documents. The nonconfidential language model of a cluster is expanded using similar clusters, which helps to identify confidential content in nonconfidential documents. Smoothing assigns a nonzero probability to unseen words and improves the accuracy of the language model.
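To make the smoothing step concrete, the following is a minimal sketch of a unigram language model with Dirichlet prior smoothing, using the standard estimate p(w | M) = (c(w, M) + mu * p(w | C)) / (|M| + mu), where C is the whole collection. The abstract does not say how terms are scored, so the log-ratio score, the function names, the default mu, and the toy documents below are all assumptions made for illustration. The clustering and cluster-expansion steps are omitted.

```python
import math
from collections import Counter


def dirichlet_prob(term, counts, length, coll_counts, coll_len, mu=2000.0):
    """Dirichlet-smoothed probability of `term` under one language model.

    `counts` and `length` describe the model (term counts and total tokens);
    `coll_counts` and `coll_len` describe the background collection.
    """
    p_coll = coll_counts.get(term, 0) / coll_len
    return (counts.get(term, 0) + mu * p_coll) / (length + mu)


def confidential_term_scores(conf_docs, nonconf_docs, mu=2000.0):
    """Score each term by log P(term | confidential) / P(term | nonconfidential).

    The log-ratio is an illustrative choice, not the scoring rule of the paper.
    Documents are lists of already-tokenized terms.
    """
    conf = Counter(t for doc in conf_docs for t in doc)
    nonconf = Counter(t for doc in nonconf_docs for t in doc)
    coll = conf + nonconf  # background collection model
    conf_len = sum(conf.values())
    nonconf_len = sum(nonconf.values())
    coll_len = sum(coll.values())

    scores = {}
    for term in coll:
        p_c = dirichlet_prob(term, conf, conf_len, coll, coll_len, mu)
        p_n = dirichlet_prob(term, nonconf, nonconf_len, coll, coll_len, mu)
        scores[term] = math.log(p_c / p_n)
    return scores


# Hypothetical toy data, not taken from the paper.
conf_docs = [["merger", "plan", "budget"], ["merger", "salary"]]
nonconf_docs = [["lunch", "plan"], ["meeting", "lunch"]]
scores = confidential_term_scores(conf_docs, nonconf_docs, mu=10.0)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:+.3f}")
```

In this sketch, a term that never occurs in the confidential documents still receives a nonzero probability from the collection model, so the ratio is always defined. Terms with a clearly positive score are candidates for confidential terms under these assumptions.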

Author information

Authors and Affiliations

Computer Science Engineering, Jawaharlal Nehru Technological University, Hyderabad, Telangana, 500085, India
Peneti Subhashini & B. Padmaja Rani

Corresponding author

Correspondence to Peneti Subhashini.

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Anil Neerukonda Institute of Technology and Sciences, Visakhapatnam, India
Suresh Chandra Satapathy
Department of CSE, CMR Technical Campus, Hyderabad, India
K. Srujan Raju
Computer Science & Engineering, Kalyani University, Nadia, West Bengal, India
Jyotsna Kumar Mandal
Electronics and Communication, Shri Ramswaroop Memorial Group of Professional Colleges, Lucknow, Uttar Pradesh, India
Vikrant Bhateja

About this paper

Cite this paper

Subhashini, P., Rani, B.P. (2016). Confidential Terms Detection Using Language Modeling Technique in Data Leakage Prevention. In: Satapathy, S., Raju, K., Mandal, J., Bhateja, V. (eds) Proceedings of the Second International Conference on Computer and Communication Technologies. Advances in Intelligent Systems and Computing, vol 381. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2526-3_29

DOI: https://doi.org/10.1007/978-81-322-2526-3_29
Published: 11 September 2015
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-2525-6
Online ISBN: 978-81-322-2526-3
eBook Packages: Engineering, Engineering (R0)