Environment Systems and Decisions, Volume 39, Issue 3, pp 269–280

Active learning in automated text classification: a case study exploring bias in predicted model performance metrics

  • Arun Varghese
  • Tao Hong
  • Chelsea Hunter
  • George Agyeman-Badu
  • Michelle Cawley


Machine learning has emerged as a cost-effective innovation to support systematic literature reviews in human health risk assessments and other contexts. Supervised machine learning approaches rely on a training dataset, a relatively small set of documents with human-annotated labels indicating their topic, to build models that automatically classify a larger set of unclassified documents. “Active” machine learning has been proposed as an approach that limits the cost of creating a training dataset by interactively and sequentially focusing training on only the most informative documents. We simulate active learning using a dataset of approximately 7000 abstracts from the scientific literature related to the chemical arsenic. The dataset was previously annotated by subject matter experts with regard to relevance to two topics relating to toxicology and risk assessment. We examine the performance of alternative sampling approaches to sequentially expanding the training dataset, specifically uncertainty-based sampling and probability-based sampling. We find that while such active learning methods can potentially reduce training dataset size compared to random sampling, predictions of model performance in active learning are likely to suffer from statistical bias that negates the method’s potential benefits. We discuss approaches for compensating for the bias resulting from skewed sampling, and the extent to which such compensation is possible. We propose a useful role for active learning in contexts in which the accuracy of model performance metrics is not critical and/or where it is beneficial to rapidly create a class-balanced training dataset.
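The sampling loop the abstract describes, iteratively expanding the training set with the most informative documents, can be sketched as uncertainty-based sampling: at each round, the current model scores the unlabeled pool and the document whose predicted relevance probability lies closest to 0.5 is queried for a human label. The following is a minimal illustrative sketch, not the authors' implementation; the toy corpus, model choice, seed-set size, and number of rounds are all assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus standing in for the ~7000 arsenic abstracts:
# 20 "relevant" and 20 "irrelevant" documents with distinct vocabularies.
relevant = [f"arsenic exposure toxicity risk assessment study {i}" for i in range(20)]
irrelevant = [f"unrelated polymer materials engineering report {i}" for i in range(20)]
docs = relevant + irrelevant
labels = np.array([1] * 20 + [0] * 20)

X = TfidfVectorizer().fit_transform(docs)

# Seed training set chosen to contain both classes (annotated up front).
labeled = [0, 1, 2, 20, 21, 22]
pool = [i for i in range(len(docs)) if i not in labeled]

model = LogisticRegression()
for _ in range(5):  # five sequential labeling rounds
    model.fit(X[labeled], labels[labeled])
    proba = model.predict_proba(X[pool])[:, 1]
    # Uncertainty sampling: query the pool document closest to p = 0.5.
    query = pool[int(np.argmin(np.abs(proba - 0.5)))]
    labeled.append(query)  # stands in for the human annotation step
    pool.remove(query)

# After 5 rounds the training set has grown from 6 to 11 documents.
```

Because queried documents are, by construction, those the model is least sure about, the labeled set this loop produces is not a random sample of the pool, which is the root of the performance-metric bias the abstract discusses.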


Keywords: Literature review · Systematic review · Automated document classification · Machine learning · Active learning · Natural language processing



The development of the methods presented here was fully supported by ICF. The results presented here were generated for the purposes of this paper alone. We thank Gregory Carter for review and helpful comments.

Supplementary material

10669_2019_9717_MOESM1_ESM.docx (69 kb)
The supplementary data include 18 tables that correspond to the results generated in the simulations summarized as trends in Figs. 2–5. In the interests of brevity, these tables present simulation results only up to the point where the actual omission fraction of relevant documents is less than the required threshold of 0.05. Each table is supplied with a proposed interpretation of apparent trends in the context of the theoretical discussions in Section 2.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. ICF, Durham, USA
  2. Health Sciences Library, Clinical Academic and Research Engagement, University of North Carolina, Chapel Hill, USA
