Text Classification for Data Loss Prevention

Hart, Michael; Manadhata, Pratyusa; Johnson, Rob

doi:10.1007/978-3-642-22263-4_2

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 6794)

Abstract

Businesses, governments, and individuals leak confidential information, both accidentally and maliciously, at tremendous cost in money, privacy, national security, and reputation. Several security software vendors now offer “data loss prevention” (DLP) solutions that use simple algorithms, such as keyword lists and hashing, which are too coarse to capture the features that make sensitive documents secret. In this paper, we present automatic text classification algorithms for classifying enterprise documents as either sensitive or non-sensitive. We also introduce a novel training strategy, supplement and adjust, to create a classifier that has a low false discovery rate, even when presented with documents unrelated to the enterprise. We evaluated our algorithm on several corpora that we assembled from confidential documents published on WikiLeaks and other archives. Our classifier had a false negative rate of less than 3.0% and a false discovery rate of less than 1.0% on all our tests (i.e., in a real deployment, the classifier can identify more than 97% of information leaks while raising a false alarm at most 1 time in 100).
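
The two rates quoted in the abstract can be made concrete with a small worked example. The sketch below is not the authors' code; it uses the standard definitions (false negative rate = missed sensitive documents / all sensitive documents; false discovery rate = false alarms / all alarms), and the function name and the toy counts are assumptions chosen only for illustration.

```python
from typing import Sequence, Tuple


def leak_detection_rates(
    is_sensitive: Sequence[bool], flagged: Sequence[bool]
) -> Tuple[float, float]:
    """Return (false_negative_rate, false_discovery_rate).

    is_sensitive[i] is the ground truth for document i;
    flagged[i] is True when the classifier raised an alarm for it.
    """
    pairs = list(zip(is_sensitive, flagged))
    true_pos = sum(1 for truth, alarm in pairs if truth and alarm)
    false_neg = sum(1 for truth, alarm in pairs if truth and not alarm)
    false_pos = sum(1 for truth, alarm in pairs if not truth and alarm)

    # FNR: share of sensitive documents that slipped through unflagged.
    sensitive_total = true_pos + false_neg
    fnr = false_neg / sensitive_total if sensitive_total else 0.0

    # FDR: share of raised alarms that were about non-sensitive documents.
    alarms_total = true_pos + false_pos
    fdr = false_pos / alarms_total if alarms_total else 0.0
    return fnr, fdr


# Hypothetical test set: 200 sensitive documents (196 caught, 4 missed)
# and 900 non-sensitive documents (1 flagged by mistake).
truth = [True] * 200 + [False] * 900
alarms = [True] * 196 + [False] * 4 + [True] * 1 + [False] * 899

fnr, fdr = leak_detection_rates(truth, alarms)
print(f"FNR={fnr:.3f} FDR={fdr:.3f}")  # prints "FNR=0.020 FDR=0.005"
```

With these made-up counts the classifier would sit inside both bounds reported in the abstract (FNR below 3.0%, FDR below 1.0%). Note that the false discovery rate is measured against the alarms raised, not against the number of non-sensitive documents, which is why it stays informative when most traffic is non-sensitive.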

Author information

Authors and Affiliations

Computer Science Department, Stony Brook University, USA
Michael Hart & Rob Johnson
HP Labs, USA
Pratyusa Manadhata

Editor information

Editors and Affiliations

Karlstad University, Sweden
Simone Fischer-Hübner
Department of Computer Science and Engineering, 200 Union Street SE, University of Minnesota, 55455, Minneapolis, MN, USA
Nicholas Hopper

About this paper

Cite this paper

Hart, M., Manadhata, P., Johnson, R. (2011). Text Classification for Data Loss Prevention. In: Fischer-Hübner, S., Hopper, N. (eds) Privacy Enhancing Technologies. PETS 2011. Lecture Notes in Computer Science, vol 6794. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22263-4_2

DOI: https://doi.org/10.1007/978-3-642-22263-4_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22262-7
Online ISBN: 978-3-642-22263-4
eBook Packages: Computer Science, Computer Science (R0)
