Semi-supervised Document Classification with a Mislabeling Error Model

Krithara, Anastasia; Amini, Massih R.; Renders, Jean-Michel; Goutte, Cyril

doi:10.1007/978-3-540-78646-7_34

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4956)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

This paper investigates a new extension of the Probabilistic Latent Semantic Analysis (PLSA) model [6] for text classification where the training set is only partially labeled. The proposed approach iteratively labels the unlabeled documents and estimates the probabilities of its own labeling errors. These error probabilities are then taken into account when estimating the new model parameters before the next round. Our approach outperforms an earlier semi-supervised extension of PLSA, introduced in [9], which is based on the use of fake labels, while retaining that method's simplicity and ability to solve multiclass problems. In addition, it gives valuable information about the most uncertain and difficult classes to label. We perform experiments on the 20Newsgroups, WebKB and Reuters document collections and show the effectiveness of our approach compared with two other semi-supervised algorithms applied to these text classification problems.
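
The loop described in the abstract (label the unlabeled documents, estimate how often those labels are wrong, then re-estimate the model with the errors taken into account) can be illustrated with a deliberately simplified sketch. It is not the PLSA-based model of the paper, whose equations are not given on this page: it swaps in a multinomial naive Bayes classifier and a soft confusion matrix as the error model. The function names, the `beta` matrix, the smoothing constants `alpha` and `eps`, and the update rules are all assumptions made for illustration.

```python
import numpy as np


def _posterior(X, prior, log_pw):
    # Class posterior P(y | x) under a multinomial naive Bayes model.
    log_post = X @ log_pw.T + np.log(prior)
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)


def fit_with_label_error_model(X_lab, y_lab, X_unl, n_classes,
                               n_rounds=10, alpha=1.0, eps=1e-2):
    """Illustrative only: naive Bayes self-training with a label-error matrix.

    This is a hypothetical stand-in, not the method of the paper.
    X_lab, X_unl are document-term count matrices; y_lab holds integer labels.
    """
    X = np.vstack([X_lab, X_unl])
    R_lab = np.eye(n_classes)[y_lab]                # trusted one-hot labels
    R_unl = np.zeros((X_unl.shape[0], n_classes))   # no information at first
    for _ in range(n_rounds):
        # Re-estimate smoothed class priors and per-class word distributions.
        R = np.vstack([R_lab, R_unl])
        prior = (R.sum(axis=0) + alpha) / (R.sum() + alpha * n_classes)
        counts = R.T @ X + alpha
        log_pw = np.log(counts / counts.sum(axis=1, keepdims=True))

        # Label the unlabeled documents with the current model.
        P = _posterior(X_unl, prior, log_pw)
        y_tilde = P.argmax(axis=1)

        # Soft confusion matrix: beta[k, j] ~ P(assigned label k | true class j).
        beta = np.eye(n_classes)[y_tilde].T @ P + eps
        beta /= beta.sum(axis=0, keepdims=True)

        # Reweight each document's class posterior by the error model;
        # these weights feed the parameter estimates of the next round.
        R_unl = P * beta[y_tilde]
        R_unl /= R_unl.sum(axis=1, keepdims=True)
    return {"prior": prior, "log_pw": log_pw, "beta": beta}


def predict(model, X):
    # Most probable class for each row of the count matrix X.
    return _posterior(X, model["prior"], model["log_pw"]).argmax(axis=1)
```

In this sketch, the columns of `beta` play the role of per-class mislabeling probabilities: a column with substantial off-diagonal mass would flag a class that the current model tends to label incorrectly, which is the kind of diagnostic the abstract mentions.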

References

1. Amini, M.R., Gallinari, P.: The use of unlabeled data to improve supervised learning for text summarization. In: SIGIR, pp. 105–112 (2002)
2. Amini, M.R., Gallinari, P.: Semi-supervised learning with an explicit label-error model for misclassified data. In: Proceedings of the 18th IJCAI, pp. 555–560 (2003)
3. Blum, A., Mitchell, T.M.: Combining labeled and unlabeled data with co-training. In: COLT 1998, pp. 92–100 (1998)
4. Collins, M., Singer, Y.: Unsupervised models for named entity classification. In: EMNLP/VLC (1999)
5. Gaussier, E., Goutte, C.: Learning from partially labelled data - with confidence. In: Learning from Partially Classified Training Data - Proceedings of the ICML 2005 workshop (2005)
6. Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd ACM SIGIR, pp. 50–57 (1999)
7. Joachims, T.: Transductive inference for text classification using support vector machines. In: Proceedings of ICML 1999, 16th International Conference on Machine Learning, pp. 200–209. Morgan Kaufmann Publishers, San Francisco (1999)
8. Joachims, T.: Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In: Proceedings of the European Conference on Machine Learning (1998)
9. Krithara, A., Goutte, C., Renders, J.M., Amini, M.R.: Reducing the annotation burden in text classification. In: Proceedings of the 1st International Conference on Multidisciplinary Information Sciences and Technologies (InSciT 2006), Merida, Spain (October 2006)
10. Lewis, D., Ringuette, M.: A comparison of two learning algorithms for text categorization. In: Proceedings of SDAIR 1994, 3rd Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, US, pp. 81–93 (1994)
11. McLernon, B., Kushmerick, N.: Transductive pattern learning for information extraction. In: Proc. Workshop Adaptive Text Extraction and Mining (2006), Conf. European Association for Computational Linguistics
12. Miller, D.J., Uyar, H.S.: A mixture of experts classifier with learning based on both labelled and unlabeled data. In: Proc. of NIPS (1997)
13. Nigam, K., McCallum, A.K., Thrun, S., Mitchell, T.M.: Text classification from labeled and unlabeled documents using EM. Machine Learning 39(2/3), 103–134 (2000)
14. Saul, L., Pereira, F.: Aggregate and mixed-order Markov models for statistical language processing. In: Proc of 2nd ICEMNLP (1997)
15. Si, L., Callan, J.: A semi-supervised learning method to merge search engine results. ACM Transactions on Information Systems 24(4), 457–491 (2003)
16. Slonim, N., Friedman, N., Tishby, N.: Unsupervised Document Classification Using Sequential Information Maximization. In: SIGIR, pp. 129–136 (2002)
17. Zhang, T.: The value of unlabeled data for classification problems. In: ICML (2000)

Author information

Authors and Affiliations

Xerox Research Centre Europe, chemin de Maupertuis, F-38240, Meylan, France
Anastasia Krithara & Jean-Michel Renders
University Pierre et Marie Curie, 104, avenue du President Kennedy, 75016, Paris, France
Massih R. Amini
National Research Council Canada, 283, boulevard Alexandre-Taché, Gatineau, QC J8X 3X7, Canada
Cyril Goutte

Editor information

Craig Macdonald, Iadh Ounis, Vassilis Plachouras, Ian Ruthven, Ryen W. White

Cite this paper

Krithara, A., Amini, M.R., Renders, JM., Goutte, C. (2008). Semi-supervised Document Classification with a Mislabeling Error Model. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds) Advances in Information Retrieval. ECIR 2008. Lecture Notes in Computer Science, vol 4956. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78646-7_34

DOI: https://doi.org/10.1007/978-3-540-78646-7_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78645-0
Online ISBN: 978-3-540-78646-7
eBook Packages: Computer Science, Computer Science (R0)
