On Feature Weighting and Selection for Medical Document Classification

  • Bekir ParlakEmail author
  • Alper Kursat Uysal
Conference paper
Part of the Studies in Computational Intelligence book series (SCI, volume 718)


Medical document classification is still one of the popular research problems inside text classification domain. In this study, the impact of feature selection and feature weighting on medical document classification is analyzed using two datasets containing MEDLINE documents. The performances of two different feature selection methods namely Gini index and distinguishing feature selector and two different term weighting methods namely term frequency (TF) and term frequency-inverse document frequency (TF-IDF) are analyzed using two pattern classifiers. These pattern classifiers are Bayesian network and C4.5 decision tree. As this study deals with single-label classification, a subset of documents inside OHSUMED and a self-constructed dataset is used for assessment of these methods. Due to having low amount of documents for some categories in self-compiled dataset, only documents belonging to 10 different disease categories are used in the experiments for both datasets. Experimental results show that the better result is obtained with combination of distinguishing feature selector, TF feature weighting, and Bayesian network classifier.


Text classification Medical documents Disease classification MeSH 



This work was supported by Anadolu University, Fund of Scientific Research Projects under grant number 1503F136.


  1. 1.
    Uysal, A.K., Gunal, S.: A novel probabilistic feature selection method for text classification. Knowl.-Based Syst. 36, 226–235 (2012)CrossRefGoogle Scholar
  2. 2.
    Idris, I., Selamat, A., Nguyen, N.T., Omatu, S., Krejcar, O., Kuca, K., Penhaker, M.: A combined negative selection algorithm—particle swarm optimization for an email spam detection system. Eng. Appl. Artif. Intell. 39, 33–44 (2015)CrossRefGoogle Scholar
  3. 3.
    Zhang, C., Wu, X., Niu, Z., Ding, W.: Authorship identification from unstructured texts. Knowl.-Based Syst. 66, 99–111 (2014)CrossRefGoogle Scholar
  4. 4.
    Ozel, S.A.: A Web page classification system based on a genetic algorithm using tagged-terms as features. Expert Syst. Appl. 38(4), 3407–3415 (2011)MathSciNetCrossRefGoogle Scholar
  5. 5.
    Agarwal, B., Mittal, N.: Prominent Feature Extraction for Sentiment Analysis, pp. 21–45. Springer (2016)Google Scholar
  6. 6.
    Pak, M.Y., Gunal, S.: Sentiment classification based on domain prediction. Elektronika ir Elektrotechnika 22(2), 96–99 (2016)CrossRefGoogle Scholar
  7. 7.
    Garla, V., Taylor, C., Brandt, C.: Semi-supervised clinical text classification with Laplacian SVMs: an application to cancer case management. J. Biomed. Inform. 46(5), 869–875 (2013)CrossRefGoogle Scholar
  8. 8.
    Yetisgen-Yildiz, M., Pratt, W.: The effect of feature representation on MEDLINE document classification. In: AMIA Annual Symposium Proceedings, p. 849. American Medical Informatics Association (2005)Google Scholar
  9. 9.
    Yepes, A.J.J., Plaza, L., Carrillo-de-Albornoz, J., Mork, J.G., Aronson, A.R.: Feature engineering for MEDLINE citation categorization with MeSH. BMC Bioinform. 16(1), 1 (2015)CrossRefGoogle Scholar
  10. 10.
  11. 11.
    Pubmed []. Accessed 2015
  12. 12.
    Rak, R., Kurgan, L.A., Reformat, M.: Multilabel associative classification categorization of MEDLINE articles into MeSH keywords. IEEE Eng. Med. Biol. Mag. 26(2), 47 (2007)CrossRefGoogle Scholar
  13. 13.
    Spat, S., Cadonna, B., Rakovac, I., Gutl, C., Leitner, H., Stark, G., Beck, P.: Multi-label text classification of German language medical documents. In: Proceedings of the 12th World Congress on Health (Medical) Informatics; Building Sustainable Health Systems, p. 2343 (2007)Google Scholar
  14. 14.
    Camous, F., Blott, S., Smeaton, A.F.: Ontology-based MEDLINE document classification. In: Bioinformatics Research and Development, pp. 439–452. Springer Berlin Heidelberg (2007)Google Scholar
  15. 15.
    Poulter, G.L., Rubin, D.L., Altman, R.B.: Seoighe, C.: MScanner: a classifier for retrieving medline citations. BMC Bioinform. 9(1), 108 (2008)CrossRefGoogle Scholar
  16. 16.
    Yi, K., Beheshti, J.: A hidden Markov model-based text classification of medical documents. J. Inf. Sci. (2008)Google Scholar
  17. 17.
    Frunza, O., Inkpen, D., Matwin, S., Klement, W., O’blenis, P.: Exploiting the systematic review protocol for classification of medical abstracts. Artif. Intell. Med. 51(1), 17–25 (2011)Google Scholar
  18. 18.
    Dollah, R.B., Aono, M.: Ontology based approach for classifying biomedical text abstracts. Int. J. Data Engi. (IJDE), 2(1), 1–15 (2011)Google Scholar
  19. 19.
    Albitar, S., Espinasse, B., Fournier, S.: Semantic enrichments in text supervised classification: application to medical domain. In: The Twenty-Seventh International Flairs Conference (2014)Google Scholar
  20. 20.
    Uysal, A.K., Gunal, S.: Text classification using genetic algorithm oriented latent semantic features. Expert Syst. Appl. 41(13), 5938–5947 (2014)CrossRefGoogle Scholar
  21. 21.
    Parlak, B., Uysal, A. K.: Classification of medical documents according to diseases. In: 23th IEEE Signal Processing and Communications Applications Conference (SIU), pp. 1635–1638 (2015)Google Scholar
  22. 22.
    Rais, M., Lachkar, A.: Evaluation of disambiguation strategies on biomedical text categorization. In: International Conference on Bioinformatics and Biomedical Engineering, pp. 790–801. Springer International Publishing (2016)Google Scholar
  23. 23.
    Baker, S., Silins, I., Guo, Y., Ali, I., Högberg, J., Stenius, U., Korhonen, A.: Automatic semantic classification of scientific literature according to the hallmarks of cancer. Bioinformatics 32(3), 432–440 (2016)CrossRefGoogle Scholar
  24. 24.
    Morid, M.A., Fiszman, M., Raja, K., Jonnalagadda, S.R., Del Fiol, G.: Classification of clinically useful sentences in clinical evidence resources. J. Biomed. Inform. 60, 14–22 (2016)CrossRefGoogle Scholar
  25. 25.
    Parlak, B., Uysal, A.K.: The impact of feature selection on medical document classification. In: 11th Iberian Conference on Information Systems and Technologies (CISTI), pp. 1–5 (2016)Google Scholar
  26. 26.
    Pakhomov, S.V., Buntrock, J.D., Chute, C.G.: Automating the assignment of diagnosis codes to patient encounters using example-based and machine learning techniques. J. Am. Med. Inform. Assoc. 13(5), 516–525 (2006)CrossRefGoogle Scholar
  27. 27.
    Van Der Zwaan, J., Sang, E.T.K., de Rijke, M.: An experiment in automatic classification of pathological reports. In: Artificial Intelligence in Medicine, pp. 207–216. Springer, Berlin Heidelberg (2007)Google Scholar
  28. 28.
    Waraporn, P., Meesad, P., Clayton, G.: Ontology-supported processing of clinical text using medical knowledge integration for multi-label classification of diagnosis coding (2010). arXiv:1004.1230
  29. 29.
    Boytcheva, S.: Automatic matching of ICD-10 codes to diagnoses in discharge letters. In: Proceedings of the Workshop on Biomedical Natural Language Processing, pp. 11–18. Hissar, Bulgaria (2011)Google Scholar
  30. 30.
    Ceylan, N.M., Alpkocak, A., Esatoglu, A.E.: Tıbbi Kayıtlara ICD-10 Hastalık Kodlarının Atanmasına Yardımcı Akıllı Bir Sistem (2012)Google Scholar
  31. 31.
    Arifoglu, D., Deniz, O., Alecakır, K., Yondem, M.: CodeMagic: semi-automatic assignment of ICD-10-AM codes to patient records. In: Information Sciences and Systems 2014, pp. 259–268. Springer International Publishing (2014)Google Scholar
  32. 32.
    Uysal, A.K., Gunal, S., Ergin, S., Gunal, E.S.: Detection of SMS spam messages on mobile phones. In: 20th IEEE Signal Processing and Communications Applications Conference (SIU), pp. 1–4 (2012)Google Scholar
  33. 33.
    Manning, C.D., Raghavan, P., Schutze, H.: Introduction to Information Retrieval Cambridge University Press, New York, USA (2008)Google Scholar
  34. 34.
    Porter, M.F.: An algorithm for suffix stripping. Program 14, 130–137 (1980)CrossRefGoogle Scholar
  35. 35.
    Shang, W., Huang, H., Zhu, H., Lin, Y., Qu, Y., Wang, Z.: A novel feature selection algorithm for text categorization. Expert Syst. Appl. 33(1), 1–5 (2007)CrossRefGoogle Scholar
  36. 36.
    Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA Data Mining Software: An Update. SIGKDD Explor. 11(1) (2009)Google Scholar
  37. 37.
    Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, Jim Gray (ed.). Morgan Kaufmann Publishers, San Fransisco (2005)Google Scholar
  38. 38.
    Goutte, C., Gaussier, E.: A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In: Proceedings of the Europe Conference Information Retrieval Research, pp. 345–359 (2005)Google Scholar
  39. 39.
    Rocha, A., Rocha, B.: Adopting nursing health record standards. Inform. Health Soc. Care 39(1), 1–14 (2014)MathSciNetCrossRefGoogle Scholar

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. 1.Department of Computer EngineeringAnadolu UniversityEskisehirTurkey

Personalised recommendations