Unsupervised Feature Selection for Text Data

  • Nirmalie Wiratunga
  • Rob Lothian
  • Stewart Massie
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4106)


Feature selection for unsupervised tasks is particularly challenging, especially for text data. The growth of online documents and email communication creates a need for tools that can operate without user supervision. In this paper we present novel feature selection techniques that address this need. A distributional similarity measure from information theory is applied to measure feature utility. This utility informs the search for both representative and diverse features in two complementary ways: Cluster partitions the entire feature space and then selects one feature to represent each cluster, while Greedy grows the feature subset one greedily chosen feature at a time. In particular, we found that Greedy's local search is suited to learning smaller feature subset sizes, while Cluster is able to improve the global quality of larger feature sets. Experiments with four email data sets show significant improvement in retrieval accuracy with nearest-neighbour-based search methods compared to an existing frequency-based method. Importantly, both Greedy and Cluster make significant progress towards the upper-bound performance set by a standard supervised feature selection method.
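The paper's exact utility measure and seeding rule are not reproduced in this abstract, but the Greedy variant it describes can be sketched as follows. This is an illustrative reading, not the authors' implementation: `skew_divergence` (a distributional divergence from the information-theory literature) stands in for the paper's similarity measure, and the choice of feature 0 as the seed is an assumption made purely for the example.

```python
import math

def kl_divergence(p, q):
    """KL(p || q); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def skew_divergence(p, q, alpha=0.99):
    """Skew divergence: KL(p || alpha*q + (1 - alpha)*p).
    Mixing a little of p into q keeps the divergence finite when q has zeros."""
    mix = [alpha * qi + (1 - alpha) * pi for pi, qi in zip(p, q)]
    return kl_divergence(p, mix)

def greedy_select(dists, k):
    """Grow a feature subset of size k, one feature at a time.

    dists[f] is feature f's distribution (e.g. over the documents or
    co-occurring words it appears in). Starting from feature 0 (an
    arbitrary seed -- the paper's seeding rule is not given here),
    each step adds the unselected feature farthest from everything
    already chosen, favouring diversity in the subset.
    """
    selected = [0]
    while len(selected) < min(k, len(dists)):
        best, best_score = None, -1.0
        for f in range(len(dists)):
            if f in selected:
                continue
            # utility = distance to the nearest already-selected feature
            score = min(skew_divergence(dists[f], dists[s]) for s in selected)
            if score > best_score:
                best, best_score = f, score
        selected.append(best)
    return selected

# Toy example: features 0 and 1 have near-identical distributions,
# feature 2 is distinct, so it is picked before feature 1.
dists = [[0.9, 0.1], [0.85, 0.15], [0.1, 0.9]]
print(greedy_select(dists, 2))  # [0, 2]
```

The Cluster variant would instead partition all feature distributions (e.g. by clustering under the same divergence) and keep one representative per cluster, trading Greedy's incremental local search for a single global pass over the feature space.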


Keywords: Feature Selection · Feature Subset · Feature Selection Method · Feature Selection Technique · Greedy Search





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Nirmalie Wiratunga¹
  • Rob Lothian¹
  • Stewart Massie¹
  1. School of Computing, The Robert Gordon University, Aberdeen, Scotland, UK
