Comparison of Documents Classification Techniques to Classify Medical Reports

Saad, F. H.; de la Iglesia, B.; Bell, G. D.

doi:10.1007/11731139_34

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3918)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

This paper addresses a real-world problem: the classification of text documents in the medical domain. Of the many approaches to text classification, we use a partially supervised one and argue that it is effective and computationally efficient for real-world problems. The approach follows a two-step strategy that reduces the effort required to label documents for classification. Only a small set of positive documents is labeled initially; the first step then labels further documents automatically. The second step builds the actual text classifier. Several methods have been proposed for each step. We conduct a comprehensive evaluation of various combinations of these methods, comparing their performance on real-world medical documents. The results show that building the classifier with EM-based methods yields better results than using SVM. We also show experimentally that careful selection of a subset of features to represent the documents can improve the performance of the classifiers.
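
The two-step strategy can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the specific methods used in each step, so the sketch uses a plain multinomial naive Bayes model for both steps (no EM iterations and no SVM). The function names and the toy tokenized reports are hypothetical.

```python
import math
from collections import Counter


def train_nb(pos_docs, neg_docs):
    """Fit a two-class multinomial naive Bayes model with Laplace smoothing."""
    vocab = {w for d in pos_docs + neg_docs for w in d}
    total = len(pos_docs) + len(neg_docs)
    model = {"vocab": vocab}
    for label, docs in (("pos", pos_docs), ("neg", neg_docs)):
        counts = Counter(w for d in docs for w in d)
        model[label] = {
            "prior": math.log(len(docs) / total),
            "counts": counts,
            "denom": sum(counts.values()) + len(vocab),
        }
    return model


def log_odds(model, doc):
    """Return log P(pos | doc) - log P(neg | doc); a value > 0 favours positive."""
    score = model["pos"]["prior"] - model["neg"]["prior"]
    for w in doc:
        if w not in model["vocab"]:
            continue  # ignore words never seen in training
        for label, sign in (("pos", 1), ("neg", -1)):
            c = model[label]
            score += sign * math.log((c["counts"][w] + 1) / c["denom"])
    return score


def two_step_pu(positives, unlabeled):
    # Step 1 (assumed variant): treat all unlabeled documents as negative,
    # then keep those the rough model scores as negative ("reliable negatives").
    rough = train_nb(positives, unlabeled)
    reliable_neg = [d for d in unlabeled if log_odds(rough, d) < 0]
    if not reliable_neg:
        raise ValueError("step 1 found no reliable negative documents")
    # Step 2: build the final classifier from positives and reliable negatives.
    return train_nb(positives, reliable_neg), reliable_neg


# Hypothetical toy data: each document is a list of tokens.
positives = [["polyp", "colon", "biopsy"], ["polyp", "removed", "colon"]]
unlabeled = [
    ["polyp", "biopsy", "colon"],
    ["normal", "stomach", "mucosa"],
    ["normal", "oesophagus", "mucosa"],
    ["stomach", "ulcer", "normal"],
]
model, reliable_neg = two_step_pu(positives, unlabeled)
```

A new document is then classified by the sign of `log_odds(model, doc)`. In a fuller version, step 2 is where an EM-based classifier or an SVM would replace the single naive Bayes fit used here.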

Author information

Authors and Affiliations

School of Computing Sciences, University of East Anglia, Norwich, NR4 7TJ, UK
F. H. Saad, B. de la Iglesia & G. D. Bell

Editor information

Editors and Affiliations

Nanyang Technological University, Singapore
Wee-Keong Ng
Institute of Industrial Science, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, 153-8505, Tokyo, Japan
Masaru Kitsuregawa
School of Computer Science and Technology, Heilongjiang University, China
Jianzhong Li
School of Computer Engineering, Nanyang Technological University, 639798, Singapore, Singapore
Kuiyu Chang

About this paper

Cite this paper

Saad, F.H., de la Iglesia, B., Bell, G.D. (2006). Comparison of Documents Classification Techniques to Classify Medical Reports. In: Ng, WK., Kitsuregawa, M., Li, J., Chang, K. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2006. Lecture Notes in Computer Science, vol 3918. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11731139_34

DOI: https://doi.org/10.1007/11731139_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33206-0
Online ISBN: 978-3-540-33207-7
eBook Packages: Computer Science, Computer Science (R0)
