Confidential Terms Detection Using Language Modeling Technique in Data Leakage Prevention

  • Conference paper
Proceedings of the Second International Conference on Computer and Communication Technologies

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 381)

Abstract

Detecting confidential documents is a key activity in data leakage prevention. Once a document is marked as confidential, leakage of its content can be prevented. Confidential terms are significant terms that indicate confidential content in a document. This paper presents a method for detecting confidential terms using a language model with Dirichlet prior smoothing. Clusters are generated from the training documents (confidential and nonconfidential). Separate language models are created for the confidential and nonconfidential documents. The nonconfidential language model of a cluster is expanded using similar clusters, which helps to identify confidential content in nonconfidential documents. Smoothing assigns a nonzero probability to unseen words and improves the accuracy of the language model.
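
As a rough illustration of the smoothing step described in the abstract, the sketch below estimates Dirichlet-smoothed unigram models for a confidential and a nonconfidential token collection and ranks terms by the log ratio of the two probabilities. The abstract does not give the paper's scoring rule, so the log-ratio score, the function names (`dirichlet_lm`, `confidential_term_scores`), the add-one floor on the background model, and the value of `mu` are all assumptions made for this example. Clustering and cluster expansion are not shown.

```python
import math
from collections import Counter


def dirichlet_lm(tokens, background_counts, mu=2000.0):
    """Return p(w) under a unigram model with Dirichlet prior smoothing.

    p(w) = (c(w) + mu * p_bg(w)) / (N + mu), where c(w) is the count of w
    in `tokens`, N is the number of tokens, and p_bg is a background model.
    """
    counts = Counter(tokens)
    n_tokens = sum(counts.values())
    bg_total = sum(background_counts.values())
    # Assumption: add-one floor on the background model (one extra slot for
    # unknown words) so that even words never seen anywhere get p > 0.
    bg_vocab = len(background_counts) + 1

    def prob(word):
        p_bg = (background_counts.get(word, 0) + 1) / (bg_total + bg_vocab)
        return (counts.get(word, 0) + mu * p_bg) / (n_tokens + mu)

    return prob


def confidential_term_scores(conf_tokens, nonconf_tokens, mu=2000.0):
    """Score each confidential-side term by log p_conf(w) / p_nonconf(w).

    Hypothetical scoring rule: a positive score means the term is more
    likely under the confidential model than under the nonconfidential one.
    """
    background = Counter(conf_tokens) + Counter(nonconf_tokens)
    p_conf = dirichlet_lm(conf_tokens, background, mu)
    p_non = dirichlet_lm(nonconf_tokens, background, mu)
    return {w: math.log(p_conf(w) / p_non(w)) for w in set(conf_tokens)}


# Toy data, invented for illustration only.
conf = "merger price secret merger report".split()
nonconf = "lunch meeting report lunch schedule".split()
scores = confidential_term_scores(conf, nonconf, mu=10.0)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy example a term that occurs only on the confidential side receives a positive score, while a term that is equally frequent on both sides scores close to zero. A small `mu` is used because the toy collections are tiny; values in the hundreds or thousands are more usual for real document collections.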

Author information

Corresponding author

Correspondence to Peneti Subhashini.

Copyright information

© 2016 Springer India

About this paper

Cite this paper

Subhashini, P., Rani, B.P. (2016). Confidential Terms Detection Using Language Modeling Technique in Data Leakage Prevention. In: Satapathy, S., Raju, K., Mandal, J., Bhateja, V. (eds) Proceedings of the Second International Conference on Computer and Communication Technologies. Advances in Intelligent Systems and Computing, vol 381. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2526-3_29

  • DOI: https://doi.org/10.1007/978-81-322-2526-3_29

  • Publisher Name: Springer, New Delhi

  • Print ISBN: 978-81-322-2525-6

  • Online ISBN: 978-81-322-2526-3

  • eBook Packages: Engineering, Engineering (R0)
