Reverse Active Learning for Optimising Information Extraction Training Production

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7691)

Abstract

Noisy corpora such as clinical texts typically contain large numbers of misspelt words, abbreviations and acronyms, and the training data needed for supervised learning also exhibit many ambiguous and irregular language usages. These are two frequent kinds of noise that can degrade the overall performance of a machine learning process; the first kind is usually filtered out by a proofreading process. This paper proposes an algorithm to deal with the noisy training data problem, a method we call reverse active learning, which improves the performance of supervised machine learning on clinical corpora. Reverse active learning is shown to produce state-of-the-art supervised learning results on the i2b2 clinical corpus, and it offers a means of improving all processing strategies in clinical language processing.
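The core idea of reverse active learning can be sketched as follows: where classic active learning iteratively *adds* the examples a model is least certain about, reverse active learning iteratively *removes* the training examples the current model is least confident about, treating them as likely label noise. The sketch below is only illustrative: the nearest-centroid classifier, the margin-based confidence score, and the per-round removal budget are assumptions for the sake of a self-contained example, not the paper's actual setup (which applies the idea to CRF/SVM models on clinical text).

```python
def centroids(data):
    """Mean feature vector per class label."""
    sums, counts = {}, {}
    for x, y in data:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def confidence(x, y, cents):
    """Margin between the distance to the nearest other-class centroid
    and the distance to the own-class centroid. A negative margin means
    the point sits closer to another class: a suspect label."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    own = dist(x, cents[y])
    others = min(dist(x, c) for lbl, c in cents.items() if lbl != y)
    return others - own

def reverse_active_learning(data, rounds=3, drop_per_round=1):
    """Iteratively retrain and discard the lowest-confidence examples."""
    for _ in range(rounds):
        cents = centroids(data)
        scored = sorted(data, key=lambda xy: confidence(xy[0], xy[1], cents))
        data = scored[drop_per_round:]  # drop the most suspect examples
    return data
```

For instance, a two-class toy set with one mislabelled point (a point deep in class `b` territory tagged `a`) would see that point removed in the first round, since its margin is the most negative in the training set.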




Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Nguyen, D., Patrick, J. (2012). Reverse Active Learning for Optimising Information Extraction Training Production. In: Thielscher, M., Zhang, D. (eds) AI 2012: Advances in Artificial Intelligence. AI 2012. Lecture Notes in Computer Science(), vol 7691. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35101-3_38

  • DOI: https://doi.org/10.1007/978-3-642-35101-3_38

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-35100-6

  • Online ISBN: 978-3-642-35101-3

  • eBook Packages: Computer Science; Computer Science (R0)
