Threshold Selection for Classification with Skewed Class Distribution

  • Xiaofeng He
  • Rong Zhang
  • Aoying Zhou
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7901)

Abstract

In document classification, threshold selection has received little attention, particularly in binary classification, where it is largely dismissed as a trivial post-processing step. Webpage classification, however, involves a huge number of webpages, usually with a highly imbalanced class distribution. Because of budget constraints, the threshold must be estimated reliably from only a small set of human-judged webpages. A good threshold selection criterion is also needed when the class distribution is highly imbalanced and positives are very sparse in the sample set. These challenges make threshold selection a non-trivial task for webpage classification. In this paper, we propose a novel, cost-efficient threshold selection method for binary webpage classification with a highly imbalanced class distribution. We construct a small sample set by applying stratified sampling to the webpages. The human-judged samples are then expanded to reflect the true class distribution of the webpage population. Experimental results show that the false positive rate leads to a more stable threshold estimate.
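The abstract does not spell out the procedure, so the following Python sketch is only one plausible reading of it, not the authors' algorithm. It assumes that strata are equal-width bins of a classifier score in [0, 1], that "expanding" the judged samples means weighting each one by the number of pages it represents in its stratum, and that the threshold is the smallest score whose estimated false positive rate stays under a chosen target. All function names, parameters, and the synthetic data are hypothetical.

```python
import random

def stratified_sample(scores, n_strata, per_stratum, rng):
    """Bin pages by score into equal-width strata (an assumption) and draw
    up to per_stratum pages from each. Returns (index, weight) pairs, where
    weight = stratum size / pages drawn, i.e. the pages each sample stands for."""
    strata = [[] for _ in range(n_strata)]
    for i, s in enumerate(scores):
        k = min(int(s * n_strata), n_strata - 1)
        strata[k].append(i)
    sample = []
    for members in strata:
        if not members:
            continue
        chosen = rng.sample(members, min(per_stratum, len(members)))
        weight = len(members) / len(chosen)
        sample.extend((i, weight) for i in chosen)
    return sample

def weighted_fpr(sample, scores, labels, threshold):
    """Estimated population false positive rate: weighted negatives scored
    at or above the threshold, divided by all weighted negatives."""
    fp = sum(w for i, w in sample if labels[i] == 0 and scores[i] >= threshold)
    neg = sum(w for i, w in sample if labels[i] == 0)
    return fp / neg if neg else 0.0

def select_threshold(sample, scores, labels, target_fpr):
    """Smallest sampled score whose estimated FPR does not exceed target_fpr.
    Returns infinity (predict no positives) if no sampled score qualifies."""
    for t in sorted({scores[i] for i, _ in sample}):
        if weighted_fpr(sample, scores, labels, t) <= target_fpr:
            return t
    return float("inf")

# Synthetic population: 1000 pages, 2% positives concentrated at high scores.
# Only the labels of sampled pages are read, standing in for human judgments.
rng = random.Random(0)
scores = [i / 1000 for i in range(1000)]
labels = [1 if s >= 0.98 else 0 for s in scores]
sample = stratified_sample(scores, n_strata=10, per_stratum=20, rng=rng)
threshold = select_threshold(sample, scores, labels, target_fpr=0.01)
print(f"judged pages: {len(sample)}, selected threshold: {threshold:.3f}")
```

In this sketch the weights of the judged pages sum to the population size, which is what lets a 200-page sample stand in for all 1000 pages when the false positive rate is estimated.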

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Xiaofeng He
    • 1
  • Rong Zhang
    • 1
  • Aoying Zhou
    • 1
  1. Software Engineering Institute, East China Normal University, Shanghai, China
