Abstract
Detecting confidential documents is a key activity in data leakage prevention. Once a document is marked as confidential, leakage of data from that document can be prevented. Confidential terms are significant terms that indicate confidential content in a document. This paper presents a method for detecting confidential terms using a language model with Dirichlet prior smoothing. Clusters are generated from the training dataset documents (both confidential and nonconfidential). A separate language model is created for the confidential and for the nonconfidential documents. The nonconfidential language model of a cluster is then expanded using similar clusters, which helps to identify confidential content in nonconfidential documents. Smoothing assigns a nonzero probability to unseen words and improves the accuracy of the language model.
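As a rough illustration of the smoothing idea only, the sketch below builds two unigram language models with the standard Dirichlet prior estimate p(w) = (c(w) + mu * p_bg(w)) / (N + mu), one over confidential tokens and one over nonconfidential tokens, and ranks terms by the log ratio of the two probabilities. The log-ratio score, the add-one background model, the value of mu, and all function names are assumptions made for this example and are not taken from the paper; the clustering and cluster-expansion steps are omitted.

```python
import math
from collections import Counter


def background_model(tokens):
    """Collection model with add-one smoothing (an assumption of this sketch),
    so that words absent from the whole collection still get nonzero mass."""
    counts = Counter(tokens)
    total = len(tokens)
    vocab = len(counts)

    def prob(word):
        return (counts[word] + 1) / (total + vocab + 1)

    return prob


def dirichlet_lm(tokens, background, mu=2000.0):
    """Unigram model with Dirichlet prior smoothing:
    p(w) = (c(w) + mu * p_bg(w)) / (N + mu)."""
    counts = Counter(tokens)
    total = len(tokens)

    def prob(word):
        return (counts[word] + mu * background(word)) / (total + mu)

    return prob


def confidential_term_scores(conf_docs, nonconf_docs, mu=2000.0):
    """Score each term seen in confidential documents by
    log(p_conf(w) / p_nonconf(w)); higher means more indicative.
    The scoring rule is hypothetical, chosen only for illustration."""
    conf_tokens = [w for doc in conf_docs for w in doc]
    non_tokens = [w for doc in nonconf_docs for w in doc]
    bg = background_model(conf_tokens + non_tokens)
    p_conf = dirichlet_lm(conf_tokens, bg, mu)
    p_non = dirichlet_lm(non_tokens, bg, mu)
    return {w: math.log(p_conf(w) / p_non(w)) for w in set(conf_tokens)}


# Toy, made-up documents (already tokenized).
conf_docs = [["merger", "price", "report"], ["merger", "budget"]]
nonconf_docs = [["weather", "report"], ["lunch", "menu", "report"]]

scores = confidential_term_scores(conf_docs, nonconf_docs)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy data, "merger" never occurs in the nonconfidential documents, yet smoothing still gives it a nonzero nonconfidential probability, so the log ratio stays finite. A term such as "report", which is more frequent in the nonconfidential documents, receives a negative score.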
Copyright information
© 2016 Springer India
About this paper
Cite this paper
Subhashini, P., Rani, B.P. (2016). Confidential Terms Detection Using Language Modeling Technique in Data Leakage Prevention. In: Satapathy, S., Raju, K., Mandal, J., Bhateja, V. (eds) Proceedings of the Second International Conference on Computer and Communication Technologies. Advances in Intelligent Systems and Computing, vol 381. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2526-3_29
DOI: https://doi.org/10.1007/978-81-322-2526-3_29
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-2525-6
Online ISBN: 978-81-322-2526-3
eBook Packages: Engineering, Engineering (R0)