Abstract
We propose a novel supervised approach for multi-label text classification, based on a neural network architecture consisting of a single encoder and multiple classifier heads. Our method predicts which subset of possible tags best matches an input text. It uses computational resources efficiently and exploits dependencies between tags by encoding an input text into a compact representation, which is then passed to multiple classifier heads. We test our architecture on a Twitter hashtag prediction task, comparing it to a baseline model with multiple feedforward networks and a baseline model with multiple recurrent neural networks with GRU cells. We show that our approach achieves significantly better performance than baselines with an equivalent number of parameters.
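The shared-encoder, multi-head design described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the layer sizes, the single-dense-layer encoder, and all variable names are illustrative assumptions. The point is the data flow: the input is encoded once into a compact representation, and every tag-specific head reads that same representation.

```python
import math
import random

random.seed(0)

# Hypothetical sizes: bag-of-words input, compact encoding, tag heads.
VOCAB, ENC_DIM, NUM_TAGS = 50, 8, 5

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

W_enc = rand_matrix(VOCAB, ENC_DIM)       # shared encoder weights
W_heads = rand_matrix(NUM_TAGS, ENC_DIM)  # one small classifier head per tag

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_tags(x):
    """Encode the input once, then score every tag from that encoding."""
    # Compact representation shared by all classifier heads.
    h = [math.tanh(sum(x[i] * W_enc[i][j] for i in range(VOCAB)))
         for j in range(ENC_DIM)]
    # Each head reads the same encoding, so dependencies between tags
    # can be captured in the shared representation rather than being
    # relearned separately per tag.
    return [sigmoid(sum(h[j] * W_heads[k][j] for j in range(ENC_DIM)))
            for k in range(NUM_TAGS)]

x = [random.random() for _ in range(VOCAB)]  # toy bag-of-words input
probs = predict_tags(x)                      # one probability per tag
```

Because the encoder is evaluated only once per input, the per-tag cost is just one small head, which is where the efficiency claim comes from.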
Notes
- 1.
Contact the first author.
- 2.
Other network designs can also be used in the LEMIC architecture, such as ones based on a convolutional neural network (CNN).
- 3.
We show that LEMIC outperforms the baseline both when examining a feedforward variant and an RNN variant.
- 4.
- 5.
An encoder/decoder design for a network has been used in many architectures in the past [15, 30, 39, 40].
Similarly, other designs form an intentional information bottleneck to obtain a concise description of an input [31, 38, 40, 41, 42]. The novel aspect of our design lies in the specific architecture we use for the encoder and classifiers and the way we use them to minimize the loss in predicting the relevant labels.
- 6.
We note that it is simple to modify our architecture to place more emphasis on some of the classifiers by tweaking the loss structure. For instance, one may multiply some classifier head losses by different constants than others in the overall loss function. However, we see no reason to do this for the specific Twitter dataset we have used for the evaluation here.
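The loss-weighting idea in this note can be sketched as follows. This is an illustrative example, not the paper's code: the binary cross-entropy per head, the three-head setup, and the weight values are all assumptions; only the idea of scaling each head's loss by its own constant comes from the note.

```python
import math

def bce(p, y):
    """Binary cross-entropy for a single classifier head."""
    eps = 1e-12  # guard against log(0)
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def weighted_multihead_loss(probs, targets, weights):
    """Sum of per-head losses, each scaled by its own constant."""
    return sum(w * bce(p, y) for p, y, w in zip(probs, targets, weights))

# Three hypothetical heads; the second tag is emphasized twice as much.
probs   = [0.9, 0.2, 0.6]
targets = [1,   0,   1]
weights = [1.0, 2.0, 1.0]
loss = weighted_multihead_loss(probs, targets, weights)
```

With all weights equal to 1 this reduces to the plain sum of per-head losses used in the unweighted architecture.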
- 7.
In the unrealistic scenario where one has no “budget” constraints (i.e., infinite memory and compute time), one can build an isolated classifier for each tag, possibly using a different architecture for each such tag, and achieve better accuracy.
References
Matsuo, Y., Ishizuka, M.: Keyword extraction from a single document using word co-occurrence statistical information. Int. J. Artif. Intell. Tools 13(01), 157–169 (2004)
Rose, S., Engel, D., Cramer, N., Cowley, W.: Automatic keyword extraction from individual documents. In: Text Mining, pp. 1–20 (2010)
Zhang, K., Xu, H., Tang, J., Li, J.: Keyword extraction using support vector machine. In: International Conference on Web-Age Information Management, pp. 85–96. Springer (2006)
Volkova, S., Bachrach, Y., Armstrong, M., Sharma, V.: Inferring latent user properties from texts published in social media. In: AAAI, pp. 4296–4297 (2015)
Volkova, S., Bachrach, Y.: On predicting sociodemographic traits and emotions from communications in social networks and their implications to online self-disclosure. Cyberpsychol. Behav. Soc. Netw. 18(12), 726–736 (2015)
Lewenberg, Y., Bachrach, Y., Volkova, S.: Using emotions to predict user interest areas in online social networks. In: DSAA, pp. 1–10. IEEE (2015)
Hulth, A.: Improved automatic keyword extraction given more linguistic knowledge. In: Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pp. 216–223. Association for Computational Linguistics (2003)
Ercan, G., Cicekli, I.: Using lexical chains for keyword extraction. Inf. Process. Manage. 43(6), 1705–1714 (2007)
Mazzia, A., Juett, J.: Suggesting hashtags on Twitter. In: EECS 545 Machine Learning, Computer Science and Engineering, University of Michigan (2009)
Xiao, F., Noro, T., Tokuda, T.: News-topic oriented hashtag recommendation in Twitter based on characteristic co-occurrence word detection. In: International Conference on Web Engineering, pp. 16–30. Springer (2012)
Li, T., Wu, Y., Zhang, Y.: Twitter hash tag prediction algorithm. In: ICOMP (2011)
Kywe, S.M., Hoang, T.-A., Lim, E.-P., Zhu, F.: On recommending hashtags in Twitter networks. In: International Conference on Social Informatics, pp. 337–350. Springer (2012)
Godin, F., Slavkovikj, V., De Neve, W., Schrauwen, B., Van de Walle, R.: Using topic models for Twitter hashtag recommendation. In: WWW, pp. 593–596. ACM (2013)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. JMLR 3, 993–1022 (2003)
Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
Glorot, X., Bordes, A., Bengio, Y.: Domain adaptation for large-scale sentiment classification: a deep learning approach. In: ICML 2011, pp. 513–520 (2011)
Preoţiuc-Pietro, D., Volkova, S., Lampos, V., Bachrach, Y., Aletras, N.: Studying user income through language, behaviour and affect in social media. PloS one 10(9), e0138717 (2015)
Bachrach, Y., Gregorič, A.Ž., Coope, S., Tovell, E., Maksak, B., Rodriguez, J., McMurtie, C., Bordbar, M.: An attention mechanism for neural answer selection using a combined global and local view. In: Proceedings of ICTAI 2017. IEEE (2017)
Gregorič, A.Ž., Bachrach, Y., Minkovsky, P., Coope, S., Maksak, B.: Neural named entity recognition using a self-attention mechanism. In: Proceedings of ICTAI 2017. IEEE (2017)
Kandasamy, K., Bachrach, Y., Tomioka, R., Tarlow, D., Carter, D.: Batch policy gradient methods for improving neural conversation models. arXiv preprint arXiv:1702.03334 (2017)
Serban, I.V., Sankar, C., Germain, M., Zhang, S., Lin, Z., Subramanian, S., Kim, T., Pieper, M., Chandar, S., Ke, N.R., et al.: A deep reinforcement learning chatbot. arXiv preprint arXiv:1709.02349 (2017)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
Lewenberg, Y., Bachrach, Y., Shankar, S., Criminisi, A.: Predicting personal traits from facial images using convolutional neural networks augmented with facial landmark information. In: IJCAI, pp. 1676–1682 (2016)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)
Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. In: AAAI, pp. 2094–2100 (2016)
Sarkar, K., Nasipuri, M., Ghose, S.: A new approach to keyphrase extraction using neural networks. arXiv preprint arXiv:1004.3274 (2010)
Kumarika, B.T., Dias, N.: Smart web content bookmarking with ANN based key phrase extraction algorithm. In: 2014 International Conference on Advances in ICT for Emerging Regions (ICTer), pp. 228–234. IEEE (2014)
Dhingra, B., Zhou, Z., Fitzpatrick, D., Muehl, M., Cohen, W.W.: Tweet2Vec: character-based distributed representations for social media. arXiv preprint arXiv:1605.03481 (2016)
Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems, pp. 3104–3112 (2014)
Baldi, P.: Autoencoders, unsupervised learning, and deep architectures. In: Proceedings of ICML Workshop on Unsupervised and Transfer Learning, vol. 27, pp. 37–50 (2012)
Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 24(5), 530–536 (1978)
Bell, T.C., Cleary, J.G., Witten, I.H.: Text Compression. Prentice-Hall, Inc., Englewood Cliffs (1990)
Fenwick, P.M.: The Burrows-Wheeler transform for block sorting text compression: principles and improvements. Comput. J. 39(9), 731–740 (1996)
Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), 2323–2326 (2000)
Jolliffe, I.: Principal Component Analysis. Wiley Online Library (2002)
Zhang, Z.-Y., Zha, H.-Y.: Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. J. Shanghai Univ. (Engl. Ed.) 8(4), 406–424 (2004)
Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)
Cho, K., Van Merriënboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259 (2014)
Kramer, M.A.: Nonlinear principal component analysis using autoassociative neural networks. AIChE J. 37(2), 233–243 (1991)
Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.-A.: Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103. ACM (2008)
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.-A.: Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, 3371–3408 (2010)
Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014)
Botha, J.A., Pitler, E., Ma, J., Bakalov, A., Salcianu, A., Weiss, D., McDonald, R., Petrov, S.: Natural language processing with small feed-forward networks. arXiv preprint arXiv:1708.00214 (2017)
Demuth, H.B., Beale, M.H., De Jess, O., Hagan, M.T.: Neural Network Design. Martin Hagan (2014)
Kingma, D., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al.: Tensorflow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)
Hanley, J.A., McNeil, B.J.: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1), 29–36 (1982)
Bradley, A.P.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 30(7), 1145–1159 (1997)
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Coope, S. et al. (2019). A Neural Architecture for Multi-label Text Classification. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Intelligent Systems and Applications. IntelliSys 2018. Advances in Intelligent Systems and Computing, vol 868. Springer, Cham. https://doi.org/10.1007/978-3-030-01054-6_49
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01053-9
Online ISBN: 978-3-030-01054-6
eBook Packages: Intelligent Technologies and Robotics