Automated Retraining Methods for Document Classification and Their Parameter Tuning

  • Conference paper
Web Information Systems Engineering – WISE 2005 (WISE 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3806)

Abstract

This paper addresses semi-supervised classification of document collections using retraining (also called self-training). One possible application is focused Web crawling, which may start with very few manually selected training documents and can then be enhanced by automatically adding initially unlabeled Web pages that were classified as positive to the training set. By itself, such an approach is not robust, and it raises tuning problems for parameters such as the number of documents selected, the number of retraining iterations, and the ratio of positively to negatively classified samples used for retraining. The paper develops methods for tuning these parameters automatically, based on predicting the leave-one-out error of a retrained classifier and on preventing the classifier from being diluted by the selection of too many or too weak documents for retraining. Our experiments with three different datasets confirm the practical viability of the approach.
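
The retraining loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the classifier here is a toy nearest-centroid model over term frequencies, chosen only so the example runs without external libraries, and the function names (`self_train`, `score`) and the parameters `iterations`, `n_pos`, and `n_neg` are assumptions made for illustration. The three parameters correspond loosely to the tuning knobs named above (number of iterations, number of selected documents, and the positive/negative ratio); they are fixed by hand here rather than tuned automatically via a leave-one-out error estimate.

```python
from collections import Counter
import math

def vectorize(text):
    """Term-frequency vector for a whitespace-tokenized document."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(docs):
    """Summed term frequencies of a class (toy stand-in for a trained model)."""
    total = Counter()
    for d in docs:
        total.update(vectorize(d))
    return total

def score(doc, pos, neg):
    """Positive value: closer to the positive class; negative: closer to the negative class."""
    v = vectorize(doc)
    return cosine(v, centroid(pos)) - cosine(v, centroid(neg))

def self_train(pos, neg, unlabeled, iterations=2, n_pos=1, n_neg=1):
    """Hypothetical self-training loop: per iteration, move the n_pos most
    confidently positive and n_neg most confidently negative unlabeled
    documents into the training sets, then re-score with the enlarged sets."""
    pos, neg, pool = list(pos), list(neg), list(unlabeled)
    for _ in range(iterations):
        scored = sorted(((score(d, pos, neg), d) for d in pool), reverse=True)
        new_pos = [d for s, d in scored[:n_pos] if s > 0]
        new_neg = [d for s, d in scored[-n_neg:] if s < 0] if n_neg else []
        if not new_pos and not new_neg:
            break  # nothing is classified with any confidence; stop early
        pos += new_pos
        neg += new_neg
        for d in new_pos + new_neg:
            pool.remove(d)
    return pos, neg, pool

pos, neg, pool = self_train(
    ["python code compiler"],
    ["football match goal"],
    ["python compiler error", "goal scored in football", "weather today"],
)
```

In this toy run, the document that shares no terms with either class receives a score of zero and stays in the unlabeled pool, which hints at why selecting too many or weakly classified documents per iteration could dilute the classifier.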

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Siersdorfer, S., Weikum, G. (2005). Automated Retraining Methods for Document Classification and Their Parameter Tuning. In: Ngu, A.H.H., Kitsuregawa, M., Neuhold, E.J., Chung, JY., Sheng, Q.Z. (eds) Web Information Systems Engineering – WISE 2005. WISE 2005. Lecture Notes in Computer Science, vol 3806. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11581062_38

  • DOI: https://doi.org/10.1007/11581062_38

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-30017-5

  • Online ISBN: 978-3-540-32286-3

  • eBook Packages: Computer Science, Computer Science (R0)
