A Neural Architecture for Multi-label Text Classification

  • Conference paper
  • In: Intelligent Systems and Applications (IntelliSys 2018)

Abstract

We propose a novel supervised approach for multi-label text classification, based on a neural network architecture consisting of a single encoder and multiple classifier heads. Our method predicts which subset of possible tags best matches an input text. It uses computational resources efficiently, exploiting dependencies between tags by encoding an input text into a compact representation that is then passed to multiple classifier heads. We test our architecture on a Twitter hashtag prediction task, comparing it to a baseline model with multiple feedforward networks and a baseline with multiple recurrent neural networks using GRU cells. We show that our approach achieves significantly better performance than baselines with an equivalent number of parameters.
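To make the design concrete, here is a minimal sketch of the single-encoder, multi-head layout described above, written with the Keras API of TensorFlow (the paper cites TensorFlow [47] and Adam [46]). The GRU encoder choice, layer sizes, vocabulary size, and tag count in this sketch are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of a single-encoder / multi-classifier-head model, assuming a
# GRU text encoder and one sigmoid head per tag. All sizes are illustrative.
import tensorflow as tf

VOCAB_SIZE = 20_000  # assumed vocabulary size
EMBED_DIM = 128      # assumed embedding width
ENC_DIM = 256        # width of the shared compact representation
NUM_TAGS = 50        # assumed number of candidate tags

tokens = tf.keras.Input(shape=(None,), dtype="int32")        # token ids
embedded = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(tokens)
encoding = tf.keras.layers.GRU(ENC_DIM)(embedded)            # shared encoding

# Each head is a small binary classifier reading the same shared encoding, so
# dependencies between tags are captured through the common representation.
heads = [
    tf.keras.layers.Dense(1, activation="sigmoid", name=f"tag_{i}")(encoding)
    for i in range(NUM_TAGS)
]

model = tf.keras.Model(inputs=tokens, outputs=heads)
# One binary cross-entropy term per head, summed into the overall loss.
model.compile(optimizer="adam", loss="binary_crossentropy")
```

Because every head reads the same encoding, the encoder parameters are shared across all tags; this sharing is the source of the parameter efficiency claimed above relative to per-tag baselines.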

Notes

  1. Contact the first author.

  2. Other network designs can also be used in the LEMIC architecture, such as ones based on a convolutional neural network (CNN).

  3. We show that LEMIC outperforms the baseline for both the feedforward variant and the RNN variant.

  4. Deep learning has of course also proven extremely useful in domains other than NLP, such as vision [22, 23, 24] or control and reinforcement learning [25, 26].

  5. An encoder/decoder design for a network has been used in many architectures in the past [15, 30, 39, 40]. Similarly, other designs form an intentional information bottleneck to obtain a concise description of an input [31, 38, 40, 41, 42]. The novel aspect of our design lies in the specific architecture we use for the encoder and classifiers and the way we use them to minimize the loss in predicting the relevant labels.

  6. We note that it is simple to modify our architecture to place more emphasis on some of the classifiers by tweaking the loss structure. For instance, one may multiply some classifier head losses by larger constants than others in the overall loss function (see the sketch after these notes). However, we see no reason to do this for the specific Twitter dataset used in our evaluation.

  7. In the unrealistic scenario where one has no “budget” constraints (i.e., unlimited memory and compute time), one can build an isolated classifier for each tag, possibly using a different architecture for each tag, and achieve better accuracy.
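As an illustration of the weighting mentioned in note 6, here is a hypothetical continuation of the sketch above; with the Keras API, per-head constants can be supplied through the loss_weights argument of compile. The five-fold weight on the first head is an arbitrary choice, not a value from the paper.

```python
# Hypothetical per-head loss weighting (continuing the earlier sketch):
# heads with a larger constant contribute more to the overall training loss.
weights = [1.0] * NUM_TAGS
weights[0] = 5.0  # arbitrary example: emphasize the first tag five-fold

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              loss_weights=weights)  # per-head constants in the total loss
```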

References

  1. Matsuo, Y., Ishizuka, M.: Keyword extraction from a single document using word co-occurrence statistical information. Int. J. Artif. Intell. Tools 13(01), 157–169 (2004)

  2. Rose, S., Engel, D., Cramer, N., Cowley, W.: Automatic keyword extraction from individual documents. In: Text Mining, pp. 1–20 (2010)

  3. Zhang, K., Xu, H., Tang, J., Li, J.: Keyword extraction using support vector machine. In: International Conference on Web-Age Information Management, pp. 85–96. Springer (2006)

  4. Volkova, S., Bachrach, Y., Armstrong, M., Sharma, V.: Inferring latent user properties from texts published in social media. In: AAAI, pp. 4296–4297 (2015)

  5. Volkova, S., Bachrach, Y.: On predicting sociodemographic traits and emotions from communications in social networks and their implications to online self-disclosure. Cyberpsychol. Behav. Soc. Netw. 18(12), 726–736 (2015)

  6. Lewenberg, Y., Bachrach, Y., Volkova, S.: Using emotions to predict user interest areas in online social networks. In: DSAA, pp. 1–10. IEEE (2015)

  7. Hulth, A.: Improved automatic keyword extraction given more linguistic knowledge. In: Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pp. 216–223. Association for Computational Linguistics (2003)

  8. Ercan, G., Cicekli, I.: Using lexical chains for keyword extraction. Inf. Process. Manage. 43(6), 1705–1714 (2007)

  9. Mazzia, A., Juett, J.: Suggesting hashtags on Twitter. EECS 545: Machine Learning, Computer Science and Engineering, University of Michigan (2009)

  10. Xiao, F., Noro, T., Tokuda, T.: News-topic oriented hashtag recommendation in Twitter based on characteristic co-occurrence word detection. In: International Conference on Web Engineering, pp. 16–30. Springer (2012)

  11. Li, T., Wu, Y., Zhang, Y.: Twitter hash tag prediction algorithm. In: ICOMP (2011)

  12. Kywe, S.M., Hoang, T.-A., Lim, E.-P., Zhu, F.: On recommending hashtags in Twitter networks. In: International Conference on Social Informatics, pp. 337–350. Springer (2012)

  13. Godin, F., Slavkovikj, V., De Neve, W., Schrauwen, B., Van de Walle, R.: Using topic models for Twitter hashtag recommendation. In: WWW, pp. 593–596. ACM (2013)

  14. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. JMLR 3, 993–1022 (2003)

  15. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)

  16. Glorot, X., Bordes, A., Bengio, Y.: Domain adaptation for large-scale sentiment classification: a deep learning approach. In: ICML 2011, pp. 513–520 (2011)

  17. Preoţiuc-Pietro, D., Volkova, S., Lampos, V., Bachrach, Y., Aletras, N.: Studying user income through language, behaviour and affect in social media. PLoS ONE 10(9), e0138717 (2015)

  18. Bachrach, Y., Gregorič, A.Ž., Coope, S., Tovell, E., Maksak, B., Rodriguez, J., McMurtie, C., Bordbar, M.: An attention mechanism for neural answer selection using a combined global and local view. In: Proceedings of ICTAI 2017. IEEE (2017)

  19. Gregorič, A.Ž., Bachrach, Y., Minkovsky, P., Coope, S., Maksak, B.: Neural named entity recognition using a self-attention mechanism. In: Proceedings of ICTAI 2017. IEEE (2017)

  20. Kandasamy, K., Bachrach, Y., Tomioka, R., Tarlow, D., Carter, D.: Batch policy gradient methods for improving neural conversation models. arXiv preprint arXiv:1702.03334 (2017)

  21. Serban, I.V., Sankar, C., Germain, M., Zhang, S., Lin, Z., Subramanian, S., Kim, T., Pieper, M., Chandar, S., Ke, N.R., et al.: A deep reinforcement learning chatbot. arXiv preprint arXiv:1709.02349 (2017)

  22. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)

  23. Lewenberg, Y., Bachrach, Y., Shankar, S., Criminisi, A.: Predicting personal traits from facial images using convolutional neural networks augmented with facial landmark information. In: IJCAI, pp. 1676–1682 (2016)

  24. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

  25. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)

  26. Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. In: AAAI, pp. 2094–2100 (2016)

  27. Sarkar, K., Nasipuri, M., Ghose, S.: A new approach to keyphrase extraction using neural networks. arXiv preprint arXiv:1004.3274 (2010)

  28. Kumarika, B.T., Dias, N.: Smart web content bookmarking with ANN based key phrase extraction algorithm. In: 2014 International Conference on Advances in ICT for Emerging Regions (ICTer), pp. 228–234. IEEE (2014)

  29. Dhingra, B., Zhou, Z., Fitzpatrick, D., Muehl, M., Cohen, W.W.: Tweet2vec: character-based distributed representations for social media. arXiv preprint arXiv:1605.03481 (2016)

  30. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems, pp. 3104–3112 (2014)

  31. Baldi, P.: Autoencoders, unsupervised learning, and deep architectures. In: Proceedings of ICML Workshop on Unsupervised and Transfer Learning, vol. 27, pp. 37–50 (2012)

  32. Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 24(5), 530–536 (1978)

  33. Bell, T.C., Cleary, J.G., Witten, I.H.: Text Compression. Prentice-Hall, Inc., Englewood Cliffs (1990)

  34. Fenwick, P.M.: The Burrows-Wheeler transform for block sorting text compression: principles and improvements. Comput. J. 39(9), 731–740 (1996)

  35. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), 2323–2326 (2000)

  36. Jolliffe, I.: Principal Component Analysis. Wiley Online Library (2002)

  37. Zhang, Z.-Y., Zha, H.-Y.: Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. J. Shanghai Univ. (Engl. Ed.) 8(4), 406–424 (2004)

  38. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)

  39. Cho, K., Van Merriënboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259 (2014)

  40. Kramer, M.A.: Nonlinear principal component analysis using autoassociative neural networks. AIChE J. 37(2), 233–243 (1991)

  41. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.-A.: Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103. ACM (2008)

  42. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.-A.: Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, 3371–3408 (2010)

  43. Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014)

  44. Botha, J.A., Pitler, E., Ma, J., Bakalov, A., Salcianu, A., Weiss, D., McDonald, R., Petrov, S.: Natural language processing with small feed-forward networks. arXiv preprint arXiv:1708.00214 (2017)

  45. Demuth, H.B., Beale, M.H., De Jess, O., Hagan, M.T.: Neural Network Design. Martin Hagan (2014)

  46. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  47. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al.: TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)

  48. Hanley, J.A., McNeil, B.J.: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1), 29–36 (1982)

  49. Bradley, A.P.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 30(7), 1145–1159 (1997)

Author information

Correspondence to Sam Coope.

Copyright information

© 2019 Springer Nature Switzerland AG

Cite this paper

Coope, S. et al. (2019). A Neural Architecture for Multi-label Text Classification. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Intelligent Systems and Applications. IntelliSys 2018. Advances in Intelligent Systems and Computing, vol 868. Springer, Cham. https://doi.org/10.1007/978-3-030-01054-6_49
