Topics and Label Propagation: Best of Both Worlds for Weakly Supervised Text Classification

Pawar, Sachin; Ramrakhiyani, Nitin; Hingmire, Swapnil; Palshikar, Girish K.

doi:10.1007/978-3-319-75487-1_35

Sachin Pawar^14,15,
Nitin Ramrakhiyani¹⁴,
Swapnil Hingmire^14,16 &
…
Girish K. Palshikar¹⁴

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 9624))

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

1140 Accesses

Abstract

We propose a Label Propagation based algorithm for weakly supervised text classification. We construct a graph where each document is represented by a node and edge weights represent similarities among the documents. Additionally, we discover underlying topics using Latent Dirichlet Allocation (LDA) and enrich the document graph by including the topics in the form of additional nodes. The edge weights between a topic and a text document represent level of “affinity” between them. Our approach does not require document level labelling, instead it expects manual labels only for topic nodes. This significantly minimizes the level of supervision needed as only a few topics are observed to be enough for achieving sufficiently high accuracy. The Label Propagation Algorithm is employed on this enriched graph to propagate labels among the nodes. Our approach combines the advantages of Label Propagation (through document-document similarities) and Topic Modelling (for minimal but smart supervision). We demonstrate the effectiveness of our approach on various datasets and compare with state-of-the-art weakly supervised text classification approaches.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

References

Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
MATH Google Scholar
Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Proceedings of 11th Annual Conference on Computational Learning Theory, pp. 92–100 (1998)
Google Scholar
Chaney, A.J.B., Blei, D.M.: Visualizing topic models. In: ICWSM (2012)
Google Scholar
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B 39(1), 1–38 (1977)
MathSciNet MATH Google Scholar
Druck, G., Mann, G., McCallum, A.: Learning from labeled features using generalized expectation criteria. In: SIGIR, pp. 595–602 (2008)
Google Scholar
Druck, G., Settles, B., McCallum, A.: Active learning by labeling features. In: EMNLP, pp. 81–90 (2009)
Google Scholar
Godbole, S., Harpale, A., Sarawagi, S., Chakrabarti, S.: Document classification through interactive supervision of document and term labels. In: PKDD, pp. 185–196 (2004)
Google Scholar
Grandvalet, Y., Bengio, Y.: Semi-supervised learning by entropy minimization. In: NIPS (2004)
Google Scholar
Griffiths, T.L., Steyvers, M.: Finding scientific topics. PNAS 101(Suppl. 1), 5228–5235 (2004)
Article Google Scholar
Heinrich, G.: Parameter estimation for text analysis. Technical report, University of Leipzig (2008)
Google Scholar
Hingmire, S., Chakraborti, S.: Topic labeled text classification: a weakly supervised approach. In: SIGIR, pp. 385–394. ACM (2014)
Google Scholar
Hingmire, S., Chougule, S., Palshikar, G.K., Chakraborti, S.: Document classification by topic labeling. In: SIGIR, pp. 877–880. ACM (2013)
Google Scholar
Huang, A.: Similarity measures for text document clustering. In: Proceedings of 6th New Zealand Computer Science Research Student Conference (NZCSRSC 2008), pp. 49–56 (2008)
Google Scholar
Joachims, T.: Transductive inference for text classification using support vector machines. In: ICML, pp. 200–209 (1999)
Google Scholar
Liu, B., Li, X., Lee, W.S., Yu, P.S.: Text classification by labeling words. In: Proceedings of 19th National Conference on Artificial Intelligence, pp. 425–430 (2004)
Google Scholar
Nigam, K., McCallum, A.K., Thrun, S., Mitchell, T.: Text classification from labeled and unlabeled documents using EM. Mach. Learn. - Special issue on Information Retrieval 39(2-3), 103-134 (2000)
Google Scholar
Raghavan, H., Madani, O., Jones, R.: Active learning with feedback on features and instances. JMLR 7, 1655–1686 (2006)
MathSciNet MATH Google Scholar
Razavi, A.H., Inkpen, D., Brusilovsky, D., Bogouslavski, L.: General topic annotation in social networks: a latent Dirichlet allocation approach. In: Zaïane, O.R., Zilles, S. (eds.) AI 2013. LNCS (LNAI), vol. 7884, pp. 293–300. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-38457-8_29
Chapter Google Scholar
Schapire, R.E., Rochery, M., Rahim, M.G., Gupta, N.K.: Incorporating prior knowledge into boosting. In: ICML, pp. 538–545 (2002)
Google Scholar
Settles, B.: Active learning literature survey. Computer Sciences Technical report 1648, University of Wisconsin–Madison (2009)
Google Scholar
Subramanya, A., Bilmes, J.: Soft-supervised learning for text classification. In: EMNLP, pp. 1090–1099. Association for Computational Linguistics (2008)
Google Scholar
Wang, F., Zhang, C.: Label propagation through linear neighborhoods. IEEE Trans. Knowl. Data Eng. 20(1), 55–67 (2008)
Article Google Scholar
Wu, X., Srihari, R.: Incorporating prior knowledge with weighted margin support vector machines. In: Proceedings of 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 326–333 (2004)
Google Scholar
Zhu, X., Ghahramani, Z.: Learning from labeled and unlabeled data with label propagation. Technical report, Citeseer (2002)
Google Scholar
Zhu, X., Ghahramani, Z.: Learning from labeled and unlabeled data with label propagation. Technical report, Carnegie Mellon University (2002)
Google Scholar

Download references

Author information

Authors and Affiliations

TCS Research, Tata Consultancy Services, Pune, 411013, India
Sachin Pawar, Nitin Ramrakhiyani, Swapnil Hingmire & Girish K. Palshikar
Department of CSE, Indian Institute of Technology Bombay, Mumbai, 400076, India
Sachin Pawar
Department of CSE, Indian Institute of Technology Madras, Chennai, 600036, India
Swapnil Hingmire

Authors

Sachin Pawar
View author publications
You can also search for this author in PubMed Google Scholar
Nitin Ramrakhiyani
View author publications
You can also search for this author in PubMed Google Scholar
Swapnil Hingmire
View author publications
You can also search for this author in PubMed Google Scholar
Girish K. Palshikar
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Nitin Ramrakhiyani .

Editor information

Editors and Affiliations

CIC, Instituto Politécnico Nacional, Mexico City, Mexico
Alexander Gelbukh

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Pawar, S., Ramrakhiyani, N., Hingmire, S., Palshikar, G.K. (2018). Topics and Label Propagation: Best of Both Worlds for Weakly Supervised Text Classification. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science(), vol 9624. Springer, Cham. https://doi.org/10.1007/978-3-319-75487-1_35

Download citation

DOI: https://doi.org/10.1007/978-3-319-75487-1_35
Published: 21 March 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75486-4
Online ISBN: 978-3-319-75487-1
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics