A Neural Architecture for Multi-label Text Classification

  • Conference paper
  • In: Intelligent Systems and Applications (IntelliSys 2018)

Abstract

We propose a novel supervised approach for multi-label text classification, based on a neural network architecture consisting of a single encoder and multiple classifier heads. Our method predicts which subset of possible tags best matches an input text. It uses computational resources efficiently, exploiting dependencies between tags by encoding an input text into a compact representation that is then passed to multiple classifier heads. We test our architecture on a Twitter hashtag prediction task, comparing it to a baseline model with multiple feedforward networks and a baseline with multiple recurrent neural networks using GRU cells. We show that our approach achieves significantly better performance than baselines with an equivalent number of parameters.
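To make the design concrete, here is a minimal sketch of the single-encoder, multi-head layout described above, written with the Keras API of TensorFlow (the paper cites TensorFlow [47] and Adam [46]). The GRU encoder choice, layer sizes, vocabulary size, and tag count in this sketch are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of a single-encoder / multi-classifier-head model, assuming a
# GRU text encoder and one sigmoid head per tag. All sizes are illustrative.
import tensorflow as tf

VOCAB_SIZE = 20_000  # assumed vocabulary size
EMBED_DIM = 128      # assumed embedding width
ENC_DIM = 256        # width of the shared compact representation
NUM_TAGS = 50        # assumed number of candidate tags

tokens = tf.keras.Input(shape=(None,), dtype="int32")        # token ids
embedded = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(tokens)
encoding = tf.keras.layers.GRU(ENC_DIM)(embedded)            # shared encoding

# Each head is a small binary classifier reading the same shared encoding, so
# dependencies between tags are captured through the common representation.
heads = [
    tf.keras.layers.Dense(1, activation="sigmoid", name=f"tag_{i}")(encoding)
    for i in range(NUM_TAGS)
]

model = tf.keras.Model(inputs=tokens, outputs=heads)
# One binary cross-entropy term per head, summed into the overall loss.
model.compile(optimizer="adam", loss="binary_crossentropy")
```

Because every head reads the same encoding, the encoder parameters are shared across all tags; this sharing is the source of the parameter efficiency claimed above relative to per-tag baselines.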

Notes

  1. Contact the first author.

  2. Other network designs can also be used in the LEMIC architecture, such as ones based on a convolutional neural network (CNN).

  3. We show that LEMIC outperforms the baseline for both the feedforward variant and the RNN variant.

  4. Deep learning has of course also proven extremely useful in domains other than NLP, such as vision [22, 23, 24] or control and reinforcement learning [25, 26].

  5. An encoder/decoder design for a network has been used in many architectures in the past [15, 30, 39, 40]. Similarly, other designs form an intentional information bottleneck to obtain a concise description of an input [31, 38, 40, 41, 42]. The novel aspect of our design lies in the specific architecture we use for the encoder and classifiers and the way we use them to minimize the loss in predicting the relevant labels.

  6. We note that it is simple to modify our architecture to place more emphasis on some of the classifiers by tweaking the loss structure. For instance, one may multiply some classifier head losses by larger constants than others in the overall loss function (see the sketch after these notes). However, we see no reason to do this for the specific Twitter dataset used in our evaluation.

  7. In the unrealistic scenario where one has no “budget” constraints (i.e., unlimited memory and compute time), one can build an isolated classifier for each tag, possibly using a different architecture for each tag, and achieve better accuracy.
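As an illustration of the weighting mentioned in note 6, here is a hypothetical continuation of the sketch above; with the Keras API, per-head constants can be supplied through the loss_weights argument of compile. The five-fold weight on the first head is an arbitrary choice, not a value from the paper.

```python
# Hypothetical per-head loss weighting (continuing the earlier sketch):
# heads with a larger constant contribute more to the overall training loss.
weights = [1.0] * NUM_TAGS
weights[0] = 5.0  # arbitrary example: emphasize the first tag five-fold

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              loss_weights=weights)  # per-head constants in the total loss
```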

References

  1. Matsuo, Y., Ishizuka, M.: Keyword extraction from a single document using word co-occurrence statistical information. Int. J. Artif. Intell. Tools 13(01), 157–169 (2004)

  2. Rose, S., Engel, D., Cramer, N., Cowley, W.: Automatic keyword extraction from individual documents. In: Text Mining, pp. 1–20 (2010)

  3. Zhang, K., Xu, H., Tang, J., Li, J.: Keyword extraction using support vector machine. In: International Conference on Web-Age Information Management, pp. 85–96. Springer (2006)

  4. Volkova, S., Bachrach, Y., Armstrong, M., Sharma, V.: Inferring latent user properties from texts published in social media. In: AAAI, pp. 4296–4297 (2015)

  5. Volkova, S., Bachrach, Y.: On predicting sociodemographic traits and emotions from communications in social networks and their implications to online self-disclosure. Cyberpsychol. Behav. Soc. Netw. 18(12), 726–736 (2015)

  6. Lewenberg, Y., Bachrach, Y., Volkova, S.: Using emotions to predict user interest areas in online social networks. In: DSAA, pp. 1–10. IEEE (2015)

  7. Hulth, A.: Improved automatic keyword extraction given more linguistic knowledge. In: Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pp. 216–223. Association for Computational Linguistics (2003)

  8. Ercan, G., Cicekli, I.: Using lexical chains for keyword extraction. Inf. Process. Manage. 43(6), 1705–1714 (2007)

  9. Mazzia, A., Juett, J.: Suggesting hashtags on Twitter. EECS 545: Machine Learning, Computer Science and Engineering, University of Michigan (2009)

  10. Xiao, F., Noro, T., Tokuda, T.: News-topic oriented hashtag recommendation in Twitter based on characteristic co-occurrence word detection. In: International Conference on Web Engineering, pp. 16–30. Springer (2012)

  11. Li, T., Wu, Y., Zhang, Y.: Twitter hash tag prediction algorithm. In: ICOMP (2011)

  12. Kywe, S.M., Hoang, T.-A., Lim, E.-P., Zhu, F.: On recommending hashtags in Twitter networks. In: International Conference on Social Informatics, pp. 337–350. Springer (2012)

  13. Godin, F., Slavkovikj, V., De Neve, W., Schrauwen, B., Van de Walle, R.: Using topic models for Twitter hashtag recommendation. In: WWW, pp. 593–596. ACM (2013)

  14. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. JMLR 3, 993–1022 (2003)

  15. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)

  16. Glorot, X., Bordes, A., Bengio, Y.: Domain adaptation for large-scale sentiment classification: a deep learning approach. In: ICML 2011, pp. 513–520 (2011)

  17. Preoţiuc-Pietro, D., Volkova, S., Lampos, V., Bachrach, Y., Aletras, N.: Studying user income through language, behaviour and affect in social media. PLoS ONE 10(9), e0138717 (2015)

  18. Bachrach, Y., Gregorič, A.Ž., Coope, S., Tovell, E., Maksak, B., Rodriguez, J., McMurtie, C., Bordbar, M.: An attention mechanism for neural answer selection using a combined global and local view. In: Proceedings of ICTAI 2017. IEEE (2017)

  19. Gregorič, A.Ž., Bachrach, Y., Minkovsky, P., Coope, S., Maksak, B.: Neural named entity recognition using a self-attention mechanism. In: Proceedings of ICTAI 2017. IEEE (2017)

  20. Kandasamy, K., Bachrach, Y., Tomioka, R., Tarlow, D., Carter, D.: Batch policy gradient methods for improving neural conversation models. arXiv preprint arXiv:1702.03334 (2017)

  21. Serban, I.V., Sankar, C., Germain, M., Zhang, S., Lin, Z., Subramanian, S., Kim, T., Pieper, M., Chandar, S., Ke, N.R., et al.: A deep reinforcement learning chatbot. arXiv preprint arXiv:1709.02349 (2017)

  22. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)

  23. Lewenberg, Y., Bachrach, Y., Shankar, S., Criminisi, A.: Predicting personal traits from facial images using convolutional neural networks augmented with facial landmark information. In: IJCAI, pp. 1676–1682 (2016)

  24. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

  25. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)

  26. Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. In: AAAI, pp. 2094–2100 (2016)

  27. Sarkar, K., Nasipuri, M., Ghose, S.: A new approach to keyphrase extraction using neural networks. arXiv preprint arXiv:1004.3274 (2010)

  28. Kumarika, B.T., Dias, N.: Smart web content bookmarking with ANN based key phrase extraction algorithm. In: 2014 International Conference on Advances in ICT for Emerging Regions (ICTer), pp. 228–234. IEEE (2014)

  29. Dhingra, B., Zhou, Z., Fitzpatrick, D., Muehl, M., Cohen, W.W.: Tweet2vec: character-based distributed representations for social media. arXiv preprint arXiv:1605.03481 (2016)

  30. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems, pp. 3104–3112 (2014)

  31. Baldi, P.: Autoencoders, unsupervised learning, and deep architectures. In: Proceedings of ICML Workshop on Unsupervised and Transfer Learning, vol. 27, pp. 37–50 (2012)

  32. Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 24(5), 530–536 (1978)

  33. Bell, T.C., Cleary, J.G., Witten, I.H.: Text Compression. Prentice-Hall, Inc., Englewood Cliffs (1990)

  34. Fenwick, P.M.: The Burrows-Wheeler transform for block sorting text compression: principles and improvements. Comput. J. 39(9), 731–740 (1996)

  35. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), 2323–2326 (2000)

  36. Jolliffe, I.: Principal Component Analysis. Wiley Online Library (2002)

  37. Zhang, Z.-Y., Zha, H.-Y.: Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. J. Shanghai Univ. (Engl. Ed.) 8(4), 406–424 (2004)

  38. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)

  39. Cho, K., Van Merriënboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259 (2014)

  40. Kramer, M.A.: Nonlinear principal component analysis using autoassociative neural networks. AIChE J. 37(2), 233–243 (1991)

  41. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.-A.: Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103. ACM (2008)

  42. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.-A.: Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, 3371–3408 (2010)

  43. Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014)

  44. Botha, J.A., Pitler, E., Ma, J., Bakalov, A., Salcianu, A., Weiss, D., McDonald, R., Petrov, S.: Natural language processing with small feed-forward networks. arXiv preprint arXiv:1708.00214 (2017)

  45. Demuth, H.B., Beale, M.H., De Jess, O., Hagan, M.T.: Neural Network Design. Martin Hagan (2014)

  46. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  47. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al.: TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)

  48. Hanley, J.A., McNeil, B.J.: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1), 29–36 (1982)

  49. Bradley, A.P.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 30(7), 1145–1159 (1997)

Author information

Correspondence to Sam Coope.

Copyright information

© 2019 Springer Nature Switzerland AG

Cite this paper

Coope, S. et al. (2019). A Neural Architecture for Multi-label Text Classification. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Intelligent Systems and Applications. IntelliSys 2018. Advances in Intelligent Systems and Computing, vol 868. Springer, Cham. https://doi.org/10.1007/978-3-030-01054-6_49
