Supervised Term Weights for Biomedical Text Classification: Improvements in Nearest Centroid Computation

Haddoud, Mounia; Mokhtari, Aïcha; Lecroq, Thierry; Abdeddaïm, Saïd

doi:10.1007/978-3-319-44332-4_8

Conference paper
First Online: 31 July 2016

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 9874)

Abstract

Keeping biomedical literature databases accessible has led to the development of text classification systems that assist human indexers by recommending thematic categories for biomedical articles. These systems use machine learning methods to learn the association between document terms and predefined categories. The accuracy of a text classification method depends on the metric used to assign a weight to each term. Weighting metrics are classified as supervised or unsupervised according to whether they use prior information on the number of documents belonging to each category. In this paper, we propose two supervised weighting metrics (One-way Klosgen and Loevinger), both of which improve the quality of biomedical document classification. We also show that, by using moment generating function centroids as an alternative to the traditional arithmetic average centroids, a nearest centroid classifier with the Loevinger metric performs significantly better than SVM on a biomedical text classification task.
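
As a rough illustration of the ideas named in the abstract, the following Python sketch combines a supervised term weight with a nearest centroid classifier. It rests on several assumptions that do not come from the paper: the Loevinger measure is taken in its common rule-interestingness form, (P(c|t) - P(c)) / (1 - P(c)); negative weights are clamped to zero; centroids are plain arithmetic averages (the moment generating function centroids are not defined in the abstract, so they are not attempted here); and similarity is cosine. The toy documents and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def loevinger(p_c_given_t, p_c):
    """Loevinger measure in an assumed common form: (P(c|t) - P(c)) / (1 - P(c))."""
    if p_c >= 1.0:
        return 0.0
    return (p_c_given_t - p_c) / (1.0 - p_c)


def train(docs, labels):
    """Return per-category term weights and arithmetic-average centroids."""
    n = len(docs)
    n_c = Counter(labels)            # number of documents per category
    n_t = Counter()                  # number of documents containing each term
    n_tc = defaultdict(Counter)      # per category: documents containing each term
    for tokens, c in zip(docs, labels):
        for t in set(tokens):
            n_t[t] += 1
            n_tc[c][t] += 1

    # Supervised weight of term t for category c; negatives clamped to 0 (assumption).
    weights = {
        c: {t: max(loevinger(n_tc[c][t] / n_t[t], n_c[c] / n), 0.0) for t in n_t}
        for c in n_c
    }

    # Centroid of c: mean of the weighted term-frequency vectors of its documents.
    centroids = {}
    for c in n_c:
        acc = Counter()
        for tokens, label in zip(docs, labels):
            if label == c:
                for t, tf in Counter(tokens).items():
                    acc[t] += tf * weights[c][t]
        centroids[c] = {t: v / n_c[c] for t, v in acc.items() if v > 0}
    return weights, centroids


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(t, 0.0) for t, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(tokens, weights, centroids):
    """Assign the category whose centroid is most similar to the weighted document."""
    tf = Counter(tokens)
    scores = {}
    for c, centroid in centroids.items():
        vec = {t: f * weights[c].get(t, 0.0) for t, f in tf.items()}
        scores[c] = cosine(vec, centroid)
    return max(scores, key=scores.get)


# Hypothetical toy corpus, for illustration only.
docs = [
    ["gene", "protein", "cell"],
    ["gene", "mutation"],
    ["heart", "surgery", "patient"],
    ["patient", "heart", "valve"],
]
labels = ["genetics", "genetics", "cardiology", "cardiology"]

weights, centroids = train(docs, labels)
print(classify(["gene", "cell", "patient"], weights, centroids))
print(classify(["heart", "valve"], weights, centroids))
```

Because the weight depends on how documents are distributed over categories, each document is re-weighted once per candidate category before being compared with that category's centroid. An unsupervised weight such as tf-idf would instead produce a single vector per document.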

Notes

1. Table 1 contains the abbreviations used in this paper.

Author information

Authors and Affiliations

Laboratoire d’Informatique, du Traitement de l’Information et des Systèmes (LITIS), Université de Rouen, 76821, Mont-Saint-Aignan Cedex, France
Mounia Haddoud, Thierry Lecroq & Saïd Abdeddaïm
Recherche en Informatique Intelligente, Mathématiques et Applications (RIIMA), USTHB, BP 32, El-Alia, Bab-ezzouar, 16111, Algiers, Algeria
Mounia Haddoud & Aïcha Mokhtari

Corresponding author

Correspondence to Saïd Abdeddaïm.

Editor information

Editors and Affiliations

CNR, Istituto per le Applicazioni del Calcolo, Naples, Italy
Claudia Angelini
Center for Statistics in the Biomedical Sciences, Vita-Salute San Raffaele University, Milano, Italy
Paola MV Rancoita
DIBRIS, University of Genoa, Genova, Italy
Stefano Rovetta

About this paper

Cite this paper

Haddoud, M., Mokhtari, A., Lecroq, T., Abdeddaïm, S. (2016). Supervised Term Weights for Biomedical Text Classification: Improvements in Nearest Centroid Computation. In: Angelini, C., Rancoita, P., Rovetta, S. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2015. Lecture Notes in Computer Science, vol 9874. Springer, Cham. https://doi.org/10.1007/978-3-319-44332-4_8

DOI: https://doi.org/10.1007/978-3-319-44332-4_8
Published: 31 July 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44331-7
Online ISBN: 978-3-319-44332-4
eBook Packages: Computer Science, Computer Science (R0)
