Abstract
Government documents must be reviewed to identify and protect any sensitive information, such as personal information, before they can be released to the public. However, in the era of digital government documents, such as e-mail, traditional sensitivity review procedures are no longer practical, for example due to the volume of documents to be reviewed. Therefore, new technology-assisted review protocols are needed to integrate automatic sensitivity classification into the sensitivity review process. Moreover, to effectively assist sensitivity review, such assistive technologies must incorporate reviewer feedback, so that sensitivity classifiers can quickly learn and adapt to the sensitivities within a collection when the types of sensitivity are not known a priori. In this work, we present a thorough evaluation of active learning strategies for sensitivity review. Furthermore, we present an active learning strategy that integrates reviewer feedback, in the form of sensitive text annotations, to identify features of sensitivity that enable us to learn an effective sensitivity classifier (0.7 Balanced Accuracy) using significantly less reviewer effort, according to the sign test (\(p<0.01\)). This approach also results in a 51% reduction in the number of documents that must be reviewed to achieve the same level of classification accuracy, compared to deploying the approach without annotation features.
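Balanced Accuracy, the evaluation measure quoted above, is the mean of per-class recall, which makes it robust to the heavy class skew typical of sensitivity review collections (a trivial majority-class classifier scores 0.5 on a binary task). A minimal sketch of the measure (the function name is ours, not the paper's):

```python
def balanced_accuracy(y_true, y_pred):
    """Balanced Accuracy: the unweighted mean of per-class recall."""
    recalls = []
    for c in set(y_true):
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        total = sum(1 for t in y_true if t == c)
        recalls.append(correct / total)
    return sum(recalls) / len(recalls)

# A skewed collection where plain accuracy (0.9) would be misleading:
y_true = ["sensitive"] * 2 + ["not-sensitive"] * 8
y_pred = ["sensitive", "not-sensitive"] + ["not-sensitive"] * 8
print(balanced_accuracy(y_true, y_pred))  # 0.75, i.e. (1/2 + 8/8) / 2
```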
Notes
- 1.
- 2. In active learning parlance, “query” usually refers to membership queries, i.e., the system poses queries in the form of instances to be reviewed. In this work, we use query in the IR sense, i.e., a textual passage used to retrieve relevant documents from an IR system. For membership queries, we say that the system suggests documents to be reviewed.
- 3. In practice, this means that we randomly down-sample the classifier’s training data to loosely match the class frequencies. In preliminary experiments, this led to uniform improvements of \(\sim \)+0.4 Balanced Accuracy across all tested approaches, after all documents had been reviewed.
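The down-sampling described in note 3 can be sketched as follows. This is an illustrative implementation under our own assumptions (function name, seeding, and exact matching of class sizes are ours; the note only specifies random down-sampling to loosely match class frequencies):

```python
import random

def downsample(docs, labels, seed=0):
    """Randomly down-sample each class to the minority-class size,
    so the training data loosely matches the class frequencies."""
    rng = random.Random(seed)
    by_class = {}
    for doc, label in zip(docs, labels):
        by_class.setdefault(label, []).append(doc)
    n_min = min(len(ds) for ds in by_class.values())
    balanced = []
    for label, ds in by_class.items():
        # Sample without replacement down to the minority-class size.
        for doc in rng.sample(ds, n_min):
            balanced.append((doc, label))
    rng.shuffle(balanced)
    return balanced

# 2 sensitive vs. 8 non-sensitive documents -> 2 of each remain.
train = downsample([f"doc{i}" for i in range(10)],
                   ["sensitive"] * 2 + ["not-sensitive"] * 8)
```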
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
Cite this paper
McDonald, G., Macdonald, C., Ounis, I. (2018). Active Learning Strategies for Technology Assisted Sensitivity Review. In: Pasi, G., Piwowarski, B., Azzopardi, L., Hanbury, A. (eds.) Advances in Information Retrieval. ECIR 2018. Lecture Notes in Computer Science, vol. 10772. Springer, Cham. https://doi.org/10.1007/978-3-319-76941-7_33
Print ISBN: 978-3-319-76940-0
Online ISBN: 978-3-319-76941-7