Enhancing Sensitivity Classification with Semantic Features Using Word Embeddings

  • Conference paper
Advances in Information Retrieval (ECIR 2017)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10193)

Abstract

Government documents must be reviewed to identify any sensitive information they may contain before they can be released to the public. However, traditional paper-based sensitivity review processes are not practical for reviewing born-digital documents. There is therefore a timely need for automatic sensitivity classification techniques to assist the digital sensitivity review process. Sensitivity, however, is typically a product of the relations between combinations of terms, such as who said what about whom, which makes automatic sensitivity classification a difficult task. Vector representations of terms, such as word embeddings, have been shown to be effective at encoding latent term features that preserve semantic relations between terms, which can also be beneficial to sensitivity classification. In this work, we present a thorough evaluation of the effectiveness of semantic word embedding features, along with term and grammatical features, for sensitivity classification. On a test collection of government documents containing real sensitivities, we show that extending text classification with semantic features and additional term n-grams results in significant improvements in classification effectiveness, correctly classifying 9.99% more sensitive documents compared to the text classification baseline.
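
As a rough illustration of what extending term features with semantic features can look like, the sketch below concatenates term n-gram counts with an averaged word embedding vector for one document. It is a minimal sketch under stated assumptions, not the method evaluated in the paper: the toy embedding values, the small vocabulary, the choice of averaging as the composition function, and the helper names (term_ngrams, mean_embedding, document_features) are all hypothetical.

```python
from collections import Counter

# Toy 3-dimensional embeddings; the values are made up for illustration.
# A real setup would load pre-trained vectors instead.
TOY_EMBEDDINGS = {
    "minister": [0.9, 0.1, 0.0],
    "criticised": [0.2, 0.8, 0.1],
    "ambassador": [0.8, 0.2, 0.1],
}

# Hypothetical feature vocabulary of term unigrams and bigrams.
VOCABULARY = ["minister", "ambassador", "minister criticised"]


def term_ngrams(tokens, n_max=2):
    """Return all term n-grams of length 1..n_max as space-joined strings."""
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def mean_embedding(tokens, embeddings, dim):
    """Average the vectors of the tokens that have an embedding.

    Averaging is an assumed composition function; other choices are possible.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def document_features(text, vocabulary, embeddings, dim):
    """Concatenate n-gram counts over the vocabulary with the mean embedding."""
    tokens = text.lower().split()
    counts = Counter(term_ngrams(tokens))
    term_part = [float(counts[g]) for g in vocabulary]
    return term_part + mean_embedding(tokens, embeddings, dim)


features = document_features(
    "The minister criticised the ambassador", VOCABULARY, TOY_EMBEDDINGS, 3
)
```

The resulting vector could then be passed to any standard text classifier. The bigram "minister criticised" hints at how n-grams capture part of the "who said what about whom" relation that single terms miss.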

Notes

  1. http://www.legislation.gov.uk/ukpga/2000/36/contents.

  2. http://www.foia.gov.

  3. 14 of the 24 FOIA exemptions apply to documents that are to be archived for public access.

  4. https://code.google.com/archive/p/word2vec/.

  5. http://nlp.stanford.edu/projects/glove/.

  6. https://code.google.com/archive/p/word2vec/.

  7. http://nlp.stanford.edu/projects/glove/.

  8. http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/.

  9. http://scikit-learn.org/.

Acknowledgements

The authors are thankful to the Foreign & Commonwealth Office and The National Archives of the UK for their support of this work.

Author information

Corresponding author

Correspondence to Graham McDonald.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

McDonald, G., Macdonald, C., Ounis, I. (2017). Enhancing Sensitivity Classification with Semantic Features Using Word Embeddings. In: Jose, J., et al. Advances in Information Retrieval. ECIR 2017. Lecture Notes in Computer Science, vol 10193. Springer, Cham. https://doi.org/10.1007/978-3-319-56608-5_35

  • DOI: https://doi.org/10.1007/978-3-319-56608-5_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-56607-8

  • Online ISBN: 978-3-319-56608-5

  • eBook Packages: Computer Science, Computer Science (R0)
