Summary
Sensitive Unclassified information is defined as any unclassified information that may have adverse consequences for government facilities. In this chapter, we explore the use of categorization techniques and information extraction to discover this kind of information in scanned documents.
We show that the combined use of a K-Dependence Bayesian categorization engine and a semi-automated review application reduces the number of man-hours required to redact sensitive unclassified information by nearly 95%. We also discuss, and provide statistics on, how OCR errors can affect information extraction tasks.
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Taghva, K. (2009). Identification of Sensitive Unclassified Information. In: Argamon, S., Howard, N. (eds) Computational Methods for Counterterrorism. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01141-2_6
DOI: https://doi.org/10.1007/978-3-642-01141-2_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01140-5
Online ISBN: 978-3-642-01141-2
eBook Packages: Computer Science, Computer Science (R0)