Comparison of Documents Classification Techniques to Classify Medical Reports

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2006)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3918)

Abstract

This paper addresses a real-world problem: the classification of text documents in the medical domain. Of the many approaches to classifying text documents, we use a partially supervised one and argue that it is effective and computationally efficient for real-world problems. The approach uses a two-step strategy to reduce the effort required to label documents for classification. Only a small set of positive documents is labeled by hand; the first step then labels further documents automatically, and the second step builds the actual text classifier. Several methods have been proposed for each step. We conduct a comprehensive evaluation of various combinations of these methods, comparing their performance on real-world medical documents. The results show that building the classifier with EM-based methods yields better results than using SVM. We also show experimentally that carefully selecting a subset of features to represent the documents can improve classifier performance.
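The two-step strategy can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which methods are used in each step, so the sketch assumes a simple centroid comparison (loosely in the spirit of Rocchio) for step one and a plain multinomial Naive Bayes classifier, without EM iterations, for step two. The toy documents and all function names are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def centroid(docs):
    # Summed term counts over a set of documents (an unnormalised prototype).
    total = Counter()
    for doc in docs:
        total.update(tokenize(doc))
    return total

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def reliable_negatives(positive, unlabeled):
    # Step 1 (assumed method): an unlabeled document counts as a reliable
    # negative if it is closer to the unlabeled centroid than to the
    # positive centroid.
    pos_c, unl_c = centroid(positive), centroid(unlabeled)
    result = []
    for doc in unlabeled:
        vec = Counter(tokenize(doc))
        if cosine(vec, unl_c) > cosine(vec, pos_c):
            result.append(doc)
    return result

def train_nb(pos_docs, neg_docs):
    # Step 2 (assumed method): multinomial Naive Bayes on the labeled
    # positives and the automatically labeled negatives.
    # Both document lists must be non-empty.
    counts = {1: centroid(pos_docs), 0: centroid(neg_docs)}
    vocab = set(counts[1]) | set(counts[0])
    n_docs = len(pos_docs) + len(neg_docs)
    priors = {1: len(pos_docs) / n_docs, 0: len(neg_docs) / n_docs}
    return counts, vocab, priors

def predict_nb(model, text):
    counts, vocab, priors = model
    scores = {}
    for label in (0, 1):
        total = sum(counts[label].values())
        score = math.log(priors[label])
        for tok in tokenize(text):
            if tok in vocab:
                # Laplace (add-one) smoothing.
                score += math.log((counts[label][tok] + 1) / (total + len(vocab)))
        scores[label] = score
    return 1 if scores[1] > scores[0] else 0

# Hypothetical toy data: a few hand-labeled positives and an unlabeled pool.
positive = [
    "colonoscopy polyp removed caecum reached",
    "colonoscopy normal caecum reached biopsy taken",
]
unlabeled = [
    "colonoscopy polyp biopsy caecum",
    "chest xray clear lungs",
    "knee xray fracture seen",
    "chest xray lungs normal",
]

reliable_neg = reliable_negatives(positive, unlabeled)
model = train_nb(positive, reliable_neg)
```

In this toy run only the three X-ray reports are selected as reliable negatives, so no negative document had to be labeled by hand. An EM-based variant would go on to re-estimate the classifier iteratively over the remaining unlabeled documents, which this sketch omits.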

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Saad, F.H., de la Iglesia, B., Bell, G.D. (2006). Comparison of Documents Classification Techniques to Classify Medical Reports. In: Ng, WK., Kitsuregawa, M., Li, J., Chang, K. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2006. Lecture Notes in Computer Science, vol 3918. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11731139_34

  • DOI: https://doi.org/10.1007/11731139_34

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-33206-0

  • Online ISBN: 978-3-540-33207-7

  • eBook Packages: Computer Science, Computer Science (R0)
