Feature subset selection in text-learning

  • Dunja Mladenić
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1398)


This paper describes several known and some new methods for feature subset selection on large text data. An experimental comparison on real-world data collected from Web users shows that the characteristics of both the problem domain and the machine learning algorithm should be considered when a feature scoring measure is selected. Our problem domain consists of hyperlinks given in the form of small documents represented as word vectors. In our learning experiments a naive Bayesian classifier was used on the text data. The best performance was achieved by feature selection methods based on the feature scoring measure called Odds ratio, which is known from information retrieval.
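The Odds ratio measure mentioned in the abstract scores a word w by log[P(w|pos)(1 − P(w|neg)) / ((1 − P(w|pos))P(w|neg))], favoring words frequent in the positive class and rare in the negative one. The sketch below is illustrative only and not the paper's implementation; the document-frequency estimator with Laplace smoothing and the function names are assumptions.

```python
import math

def odds_ratio(pos_docs, neg_docs, smoothing=1.0):
    """Score each word by the Odds ratio measure from information retrieval:
    log[ P(w|pos) * (1 - P(w|neg)) / ((1 - P(w|pos)) * P(w|neg)) ].

    Probabilities are document frequencies with Laplace smoothing (an
    assumption; the paper does not fix the estimator in this abstract).
    Each document is a set of words.
    """
    vocab = set().union(*pos_docs, *neg_docs)
    scores = {}
    for w in vocab:
        p_pos = (sum(w in d for d in pos_docs) + smoothing) / (len(pos_docs) + 2 * smoothing)
        p_neg = (sum(w in d for d in neg_docs) + smoothing) / (len(neg_docs) + 2 * smoothing)
        scores[w] = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    return scores

def select_features(scores, k):
    """Keep the k highest-scoring words as the selected feature subset."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Words characteristic of the positive class get positive scores and words characteristic of the negative class get negative ones, so truncating the ranked list yields the feature subset used for learning.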


Keywords: Information Gain · Problem Domain · Feature Subset · Feature Selection Method · Feature Subset Selection



Copyright information

© Springer-Verlag Berlin Heidelberg 1998

Authors and Affiliations

  • Dunja Mladenić
    1. Department for Intelligent Systems, J. Stefan Institute, Ljubljana, Slovenia