Automated Retraining Methods for Document Classification and Their Parameter Tuning

Siersdorfer, Stefan; Weikum, Gerhard

doi:10.1007/11581062_38

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3806)

Included in the following conference series:

International Conference on Web Information Systems Engineering

Abstract

This paper addresses the problem of semi-supervised classification on document collections using retraining (also called self-training). A possible application is focused Web crawling, which may start with very few manually selected training documents but can be enhanced by automatically adding initially unlabeled Web pages that were classified as positive and using them for retraining. By itself, such an approach is not robust and faces tuning problems regarding parameters such as the number of selected documents, the number of retraining iterations, and the ratio of positively to negatively classified samples used for retraining. The paper develops methods for automatically tuning these parameters, based on predicting the leave-one-out error of a retrained classifier and on preventing the classifier from being diluted by the selection of too many or too weak documents for retraining. Our experiments with three different datasets confirm the practical viability of the approach.
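To make the retraining loop and its three parameters concrete, the following is a minimal sketch and not the authors' method. It assumes a simple cosine nearest-centroid classifier over bag-of-words vectors, chosen only to keep the example dependency-free, and it computes the leave-one-out error directly instead of predicting it as the paper proposes. The names `self_train`, `loo_error`, `n_pos`, `n_neg`, and `max_iter` are hypothetical and the documents are toy data.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

def score(doc, pos, neg):
    """Similarity to the positive centroid minus similarity to the negative one."""
    return cosine(doc, sum(pos, Counter())) - cosine(doc, sum(neg, Counter()))

def loo_error(pos, neg):
    """Leave-one-out error: classify each training doc with itself held out."""
    errors = 0
    for i, d in enumerate(pos):
        if score(d, pos[:i] + pos[i + 1:], neg) <= 0:
            errors += 1
    for i, d in enumerate(neg):
        if score(d, pos, neg[:i] + neg[i + 1:]) >= 0:
            errors += 1
    return errors / (len(pos) + len(neg))

def self_train(pos, neg, unlabeled, n_pos=1, n_neg=1, max_iter=5):
    """Add the most confident unlabeled docs per iteration (assumed stopping rule:
    stop when nothing confident is left or the leave-one-out error would rise)."""
    pos, neg, pool = list(pos), list(neg), list(unlabeled)
    for _ in range(max_iter):
        scored = sorted(((score(d, pos, neg), i) for i, d in enumerate(pool)),
                        reverse=True)
        add_pos = [i for s, i in scored[:n_pos] if s > 0]        # strongest positives
        add_neg = [i for s, i in scored[::-1][:n_neg] if s < 0]  # strongest negatives
        if not add_pos and not add_neg:
            break  # only weak (zero-score) documents remain
        cand_pos = pos + [pool[i] for i in add_pos]
        cand_neg = neg + [pool[i] for i in add_neg]
        if loo_error(cand_pos, cand_neg) > loo_error(pos, neg):
            break  # retraining would dilute the classifier
        pos, neg = cand_pos, cand_neg
        used = set(add_pos) | set(add_neg)
        pool = [d for i, d in enumerate(pool) if i not in used]
    return pos, neg

# Toy data: two seed documents per class plus an unlabeled pool.
seed_pos = [vectorize(t) for t in ("web crawler index search pages",
                                   "search engine crawler links")]
seed_neg = [vectorize(t) for t in ("football match goal team",
                                   "team player season goal")]
unlabeled = [vectorize(t) for t in ("crawler search index links web",
                                    "goal football player match",
                                    "engine pages index",
                                    "season team football",
                                    "weather forecast rain")]
pos_out, neg_out = self_train(seed_pos, seed_neg, unlabeled)
```

In this sketch, `n_pos` and `n_neg` stand for the number of selected documents and the positive-to-negative ratio, and `max_iter` bounds the number of retraining iterations. The document that shares no terms with either class scores zero and is never added, which illustrates the idea of leaving weak documents out of retraining.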

Author information

Authors and Affiliations

Max-Planck-Institute for Computer Science, Germany
Stefan Siersdorfer & Gerhard Weikum

Editor information

Editors and Affiliations

Texas State University, San Marcos, TX, USA
Anne H. H. Ngu
Institute of Industrial Science, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, 153-8505, Tokyo, Japan
Masaru Kitsuregawa
University of Vienna, Vienna, Austria
Erich J. Neuhold
IBM Research Division, Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598, USA
Jen-Yao Chung
School of Computer Science and Engineering, University of New South Wales, NSW 2052, Sydney, Australia
Quan Z. Sheng

About this paper

Cite this paper

Siersdorfer, S., Weikum, G. (2005). Automated Retraining Methods for Document Classification and Their Parameter Tuning. In: Ngu, A.H.H., Kitsuregawa, M., Neuhold, E.J., Chung, JY., Sheng, Q.Z. (eds) Web Information Systems Engineering – WISE 2005. WISE 2005. Lecture Notes in Computer Science, vol 3806. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11581062_38

DOI: https://doi.org/10.1007/11581062_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30017-5
Online ISBN: 978-3-540-32286-3
eBook Packages: Computer Science, Computer Science (R0)
