Active Learning Strategies for Technology Assisted Sensitivity Review

  • Graham McDonald
  • Craig Macdonald
  • Iadh Ounis
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10772)


Government documents must be reviewed to identify and protect any sensitive information, such as personal information, before they can be released to the public. However, in the era of digital government documents, such as e-mail, traditional sensitivity review procedures are no longer practical, for example due to the sheer volume of documents to be reviewed. Therefore, there is a need for new technology-assisted review protocols that integrate automatic sensitivity classification into the sensitivity review process. To effectively assist sensitivity review, such assistive technologies must incorporate reviewer feedback, so that sensitivity classifiers can quickly learn and adapt to the sensitivities within a collection when the types of sensitivity are not known a priori. In this work, we present a thorough evaluation of active learning strategies for sensitivity review. Moreover, we present an active learning strategy that integrates reviewer feedback, from sensitive text annotations, to identify features of sensitivity that enable us to learn an effective sensitivity classifier (0.7 Balanced Accuracy) using significantly less reviewer effort, according to the sign test (p < 0.01). This approach also reduces by 51% the number of documents that must be reviewed to achieve the same level of classification accuracy, compared to when the approach is deployed without annotation features.
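The core loop the abstract describes, a classifier iteratively retrained on reviewer feedback, with the next document to review chosen by the model's uncertainty, can be sketched as follows. This is a minimal illustration of pool-based uncertainty sampling (one of the active learning strategies the paper evaluates), not the authors' actual system: the Naive Bayes model, the token lists, and all function names here are illustrative assumptions, and the paper's annotation-feature strategy is not reproduced.

```python
import math
from collections import Counter

def train_nb(labeled):
    """Fit a multinomial Naive Bayes model from (tokens, label) pairs,
    where label 1 = sensitive and label 0 = non-sensitive."""
    counts = {0: Counter(), 1: Counter()}
    docs = {0: 0, 1: 0}
    for tokens, label in labeled:
        counts[label].update(tokens)
        docs[label] += 1
    vocab = set(counts[0]) | set(counts[1])
    return counts, docs, vocab

def posterior_sensitive(model, tokens):
    """P(sensitive | tokens) under the fitted model, with Laplace smoothing."""
    counts, docs, vocab = model
    total = docs[0] + docs[1]
    logp = {}
    for c in (0, 1):
        lp = math.log((docs[c] + 1) / (total + 2))      # smoothed class prior
        denom = sum(counts[c].values()) + len(vocab)    # smoothed normaliser
        for t in tokens:
            lp += math.log((counts[c][t] + 1) / denom)
        logp[c] = lp
    m = max(logp.values())
    z = sum(math.exp(v - m) for v in logp.values())
    return math.exp(logp[1] - m) / z

def most_uncertain(model, pool):
    """Uncertainty sampling: pick the unreviewed document whose predicted
    probability of being sensitive is closest to 0.5."""
    return min(range(len(pool)),
               key=lambda i: abs(posterior_sensitive(model, pool[i]) - 0.5))

# Toy example: two reviewed documents seed the model; the strategy then
# selects, from the unreviewed pool, the document the model is least sure
# about, which the reviewer labels next, and the model is retrained.
labeled = [(["passport", "number", "disclosed"], 1),
           (["meeting", "agenda", "minutes"], 0)]
pool = [["passport", "disclosed"],
        ["meeting", "minutes"],
        ["passport", "agenda"]]
model = train_nb(labeled)
print(most_uncertain(model, pool))  # → 2 (one sensitive + one benign term)
```

In a full protocol this select-review-retrain cycle repeats until the classifier is reliable enough; the paper's contribution is to enrich the features at each retraining step with the reviewer's sensitive-text annotations, rather than labels alone.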



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. University of Glasgow, Glasgow, UK
