Identification of Sensitive Unclassified Information

Summary

Sensitive unclassified information is defined as any unclassified information that may cause adverse consequences for government facilities. In this chapter, we explore the use of categorization techniques and information extraction to discover this kind of information in scanned documents.

We show that the combined use of a K-Dependence Bayesian categorization engine and a semi-automated review application reduces the number of man-hours required to redact sensitive unclassified information by nearly 95%. We also discuss, and provide statistics on, how OCR errors can affect information extraction tasks.
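To make the point about OCR errors concrete, here is a minimal sketch of how a single misread character can hide a sensitive string from a pattern-based extractor. It is not the chapter's method: the identifier format, the confusion table, and the function names `extract` and `extract_with_normalization` are all assumptions chosen for illustration.

```python
import re

# Hypothetical pattern for identifiers shaped like ddd-dd-dddd (assumption, for illustration only).
ID_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# A few letter-for-digit OCR confusions; illustrative, not exhaustive.
OCR_CONFUSIONS = str.maketrans({"O": "0", "l": "1", "I": "1", "S": "5", "B": "8"})

# Same shape as ID_PATTERN, but also accepting the confusable letters above.
CANDIDATE = re.compile(r"\b[\dOlISB]{3}-[\dOlISB]{2}-[\dOlISB]{4}\b")


def extract(text):
    """Return every substring of text that matches ID_PATTERN exactly."""
    return ID_PATTERN.findall(text)


def extract_with_normalization(text):
    """Map confusable letters to digits inside candidate tokens, then extract."""
    repaired = CANDIDATE.sub(lambda m: m.group(0).translate(OCR_CONFUSIONS), text)
    return extract(repaired)


clean = "Employee ID 123-45-6789 on file."
noisy = "Employee ID l23-4S-67B9 on file."  # simulated OCR output of the same line

print(extract(clean))                      # exact pattern finds the identifier
print(extract(noisy))                      # exact pattern misses the corrupted identifier
print(extract_with_normalization(noisy))   # normalization recovers it
```

Normalizing only inside tokens that already have the expected shape keeps ordinary words untouched, at the cost of missing errors that change a token's length or punctuation.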

Author information

Correspondence to Kazem Taghva.

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Taghva, K. (2009). Identification of Sensitive Unclassified Information. In: Argamon, S., Howard, N. (eds) Computational Methods for Counterterrorism. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01141-2_6

  • DOI: https://doi.org/10.1007/978-3-642-01141-2_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-01140-5

  • Online ISBN: 978-3-642-01141-2

  • eBook Packages: Computer Science (R0)