Text Classification for Data Loss Prevention

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 6794)

Abstract

Businesses, governments, and individuals leak confidential information, both accidentally and maliciously, at tremendous cost in money, privacy, national security, and reputation. Several security software vendors now offer “data loss prevention” (DLP) solutions that use simple algorithms, such as keyword lists and hashing, which are too coarse to capture the features that make sensitive documents secret. In this paper, we present automatic text classification algorithms for classifying enterprise documents as either sensitive or non-sensitive. We also introduce a novel training strategy, supplement and adjust, to create a classifier that has a low false discovery rate, even when presented with documents unrelated to the enterprise. We evaluated our algorithm on several corpora that we assembled from confidential documents published on WikiLeaks and other archives. Our classifier had a false negative rate of less than 3.0% and a false discovery rate of less than 1.0% on all our tests (i.e., in a real deployment, the classifier can identify more than 97% of information leaks while raising at most 1 false alarm in every 100 alarms).
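
The two rates quoted in the abstract can be made concrete with a small calculation. The sketch below assumes the standard definitions (false negative rate = FN / (FN + TP), false discovery rate = FP / (FP + TP)); the class name, function names, and all counts are hypothetical and are not taken from the paper's experiments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts for a binary sensitive / non-sensitive classifier."""
    true_positives: int   # sensitive documents flagged as sensitive
    false_positives: int  # non-sensitive documents flagged as sensitive
    false_negatives: int  # sensitive documents that were missed
    true_negatives: int   # non-sensitive documents correctly passed


def false_negative_rate(c: ConfusionCounts) -> float:
    """FN / (FN + TP): share of sensitive documents that slip through."""
    sensitive = c.false_negatives + c.true_positives
    return c.false_negatives / sensitive if sensitive else 0.0


def false_discovery_rate(c: ConfusionCounts) -> float:
    """FP / (FP + TP): share of raised alarms that are false."""
    alarms = c.false_positives + c.true_positives
    return c.false_positives / alarms if alarms else 0.0


# Hypothetical counts chosen for illustration only.
counts = ConfusionCounts(true_positives=980, false_positives=5,
                         false_negatives=20, true_negatives=8995)
fnr = false_negative_rate(counts)   # 20 / 1000
fdr = false_discovery_rate(counts)  # 5 / 985
print(f"FNR = {fnr:.4f}, FDR = {fdr:.4f}")
```

Note that the false discovery rate is measured over raised alarms, not over all non-sensitive documents, so it stays meaningful even when non-sensitive documents vastly outnumber sensitive ones.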

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hart, M., Manadhata, P., Johnson, R. (2011). Text Classification for Data Loss Prevention. In: Fischer-Hübner, S., Hopper, N. (eds) Privacy Enhancing Technologies. PETS 2011. Lecture Notes in Computer Science, vol 6794. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22263-4_2

  • DOI: https://doi.org/10.1007/978-3-642-22263-4_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-22262-7

  • Online ISBN: 978-3-642-22263-4

  • eBook Packages: Computer Science, Computer Science (R0)
