Supervised Term Weights for Biomedical Text Classification: Improvements in Nearest Centroid Computation

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 9874)

Abstract

Keeping biomedical literature databases accessible has led to the development of text classification systems that assist human indexers by recommending thematic categories for biomedical articles. These systems use machine learning methods to learn the association between document terms and predefined categories. The accuracy of a text classification method depends on the metric used to assign a weight to each term. Weighting metrics are classified as supervised or unsupervised according to whether they use prior information on the number of documents belonging to each category. In this paper, we propose two supervised weighting metrics (One-way Klosgen and Loevinger), both of which improve the quality of biomedical document classification. We also show that, by using moment generating function centroids as an alternative to the traditional arithmetic average centroids, a nearest centroid classifier with the Loevinger metric performs significantly better than SVM on a biomedical text classification task.
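To make the ingredients named in the abstract concrete, the following is a minimal sketch of a nearest centroid classifier driven by a supervised term weight. It is not the authors' implementation. It assumes the form of Loevinger's measure commonly given in the rule-interestingness literature, (P(c|t) − P(c)) / (1 − P(c)), which may differ from the definition used in the paper. It also uses plain arithmetic average centroids rather than the moment generating function centroids the paper proposes. Clipping negative weights to zero, weighting a test document separately for each category, and all function names and the toy corpus are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def loevinger(n_both, n_term, n_cat, n_docs):
    """Loevinger's measure for the rule 'term present => document in category'.

    Assumed form: (P(cat | term) - P(cat)) / (1 - P(cat)).
    """
    p_cat = n_cat / n_docs
    if n_term == 0 or p_cat >= 1.0:
        return 0.0
    return (n_both / n_term - p_cat) / (1.0 - p_cat)


def train(docs, labels):
    """Return {category: (term_weights, centroid)} from tokenized documents."""
    n_docs = len(docs)
    cat_size = Counter(labels)
    doc_freq = Counter()                 # documents containing each term
    cat_doc_freq = defaultdict(Counter)  # same count, restricted to one category
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            cat_doc_freq[label][term] += 1

    model = {}
    for cat, size in cat_size.items():
        # Assumption: negative associations are clipped to zero.
        weights = {
            term: max(0.0, loevinger(cat_doc_freq[cat][term], df, size, n_docs))
            for term, df in doc_freq.items()
        }
        # Arithmetic average of the weighted vectors of the category's documents.
        centroid = defaultdict(float)
        for tokens, label in zip(docs, labels):
            if label == cat:
                for term, tf in Counter(tokens).items():
                    centroid[term] += tf * weights[term] / size
        model[cat] = (weights, dict(centroid))
    return model


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(t, 0.0) for t, x in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * \
        math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm > 0 else 0.0


def classify(model, tokens):
    """Assign the category whose centroid is most similar to the document."""
    tf = Counter(tokens)

    def score(cat):
        weights, centroid = model[cat]
        # Assumption: the test document is re-weighted with each category's weights.
        vec = {t: n * weights.get(t, 0.0) for t, n in tf.items()}
        return cosine(vec, centroid)

    return max(model, key=score)


# Toy corpus, invented for illustration only.
docs = [["heart", "attack", "patient"], ["heart", "surgery"],
        ["tumor", "cell", "patient"], ["tumor", "growth"]]
labels = ["cardio", "cardio", "onco", "onco"]
model = train(docs, labels)
print(classify(model, ["heart", "patient"]))
```

In this toy corpus the term "patient" occurs equally often in both categories, so its assumed Loevinger weight is zero and it does not influence the decision, whereas a category-exclusive term such as "heart" receives the maximum weight of one.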

Notes

  1. Table 1 contains the abbreviations used in this paper.

Corresponding author

Correspondence to Saïd Abdeddaïm.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Haddoud, M., Mokhtari, A., Lecroq, T., Abdeddaïm, S. (2016). Supervised Term Weights for Biomedical Text Classification: Improvements in Nearest Centroid Computation. In: Angelini, C., Rancoita, P., Rovetta, S. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2015. Lecture Notes in Computer Science, vol 9874. Springer, Cham. https://doi.org/10.1007/978-3-319-44332-4_8

  • DOI: https://doi.org/10.1007/978-3-319-44332-4_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-44331-7

  • Online ISBN: 978-3-319-44332-4

  • eBook Packages: Computer Science, Computer Science (R0)
