Text Categorization by a Machine-Learning-Based Term Selection

  • Javier Fernández
  • Elena Montañés
  • Irene Díaz
  • José Ranilla
  • Elías F. Combarro
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3180)


Term selection is one of the main tasks in Information Retrieval and Text Categorization. It has traditionally been carried out by statistical methods based on how frequently words appear in the documents. This paper presents a method for extracting the relevant words of a document by taking their linguistic information into account. These relevant words are obtained by a Machine Learning algorithm that takes manually selected words as its training set. Text Categorization is then performed with Support Vector Machines on the lexica obtained by this technique. The results are compared with one of the most widely used term selection methods (based only on statistical information); the new method performs better and has the additional advantage of selecting the filtering level automatically.
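To make the notion of a statistical, frequency-based term selection with a manually chosen filtering level concrete, the following is a minimal sketch in Python. It is not the method or the baseline evaluated in the paper: the scoring measure (summed tf-idf), the function name `select_terms`, the whitespace tokenizer, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def select_terms(docs, filtering_level):
    """Rank terms by summed tf-idf and drop the lowest-scoring fraction.

    filtering_level is the fraction of the vocabulary to discard
    (hypothetical parameter name; it must be chosen by hand here).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Sum the tf-idf score of each term over all documents.
    scores = Counter()
    for tokens in tokenized:
        tf = Counter(tokens)
        for term, count in tf.items():
            scores[term] += (count / len(tokens)) * math.log(n_docs / df[term])

    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    keep = max(1, round(len(ranked) * (1 - filtering_level)))
    return ranked[:keep]


docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell",
]
print(select_terms(docs, 0.5))
```

In this sketch a word such as "the", which occurs in every document, receives a score of zero and is filtered out first. The filtering level has to be supplied by the user, which is the step the abstract says the proposed method selects automatically.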





Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Javier Fernández (1)
  • Elena Montañés (1)
  • Irene Díaz (1)
  • José Ranilla (1)
  • Elías F. Combarro (1)

  1. Artificial Intelligence Center, University of Oviedo, Spain
