On the Design and Tuning of Machine Learning Models for Language Toxicity Classification in Online Platforms

Rybinski, Maciej; Miller, William; Del Ser, Javier; Bilbao, Miren Nekane; Aldana-Montes, José F.

doi:10.1007/978-3-319-99626-4_29

Part of the book series: Studies in Computational Intelligence (SCI, volume 798)

Included in the following conference series:

International Symposium on Intelligent and Distributed Computing

Abstract

One of the most concerning consequences of the lack of supervision in online platforms is their exploitation by misbehaving users, who deliver offensive (toxic) messages while remaining anonymous. Given the huge volumes of data handled by these platforms, detecting toxicity in exchanged comments and messages has naturally called for machine learning models that automate the task. In the last few years, Deep Learning models and related techniques have played a major role in this regard owing to their superior modeling capabilities, which have made them the prevailing choice in the related literature. By addressing a toxicity classification problem over a real dataset, this work aims to shed light on two aspects of this dominance of Deep Learning models: (1) an empirical assessment of their predictive gains with respect to traditional Shallow Learning models; and (2) the impact of different text embedding methods and data augmentation techniques on this classification task. Our findings reveal that, in our case study, non-optimized Shallow and Deep Learning models already attain very competitive accuracy scores, leaving only a narrow margin of improvement for fine-grained refinement of the models or the addition of data augmentation techniques.

Notes

1. See e.g. http://mlwave.com/kaggle-ensembling-guide/.
2. http://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge.
3. Frequent in terms of total term frequency in the entire collection of training documents.
4. https://www.kaggle.com/eashish/bidirectional-gru-with-convolution.
5. https://www.kaggle.com/chongjiujjin/capsule-net-with-gru.
6. https://github.com/PavelOstyakov/toxic/tree/master/tools.
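
Note 3 ranks terms by their total term frequency over the whole training collection, which is not the same as document frequency. The following is a minimal sketch of that distinction, not the chapter's implementation: the helper names `top_terms_by_collection_frequency` and `top_terms_by_document_frequency` are hypothetical, and the lowercase whitespace tokenization is an assumption, since the preprocessing used by the authors is not described here.

```python
from collections import Counter


def top_terms_by_collection_frequency(documents, k):
    """Return the k terms with the highest total count across all documents.

    Hypothetical helper: tokenization is a plain lowercase whitespace split,
    which is an assumption and not necessarily the chapter's preprocessing.
    """
    counts = Counter()
    for doc in documents:
        # Every occurrence counts, so a term repeated inside one document
        # contributes more than once (unlike document frequency).
        counts.update(doc.lower().split())
    # Sort by descending count, then alphabetically, for a deterministic order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def top_terms_by_document_frequency(documents, k):
    """Contrast case: count each term at most once per document."""
    counts = Counter()
    for doc in documents:
        counts.update(set(doc.lower().split()))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Toy collection, invented purely for illustration.
docs = ["spam spam spam eggs", "eggs ham", "ham eggs"]
print(top_terms_by_collection_frequency(docs, 2))
print(top_terms_by_document_frequency(docs, 2))
```

In the toy collection, a term repeated many times within a single document ranks highly under total term frequency but not under document frequency, which is why the two criteria can select different vocabularies.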

Acknowledgements

The work of Maciej Rybinski has been partially funded by grant TIN2017-86049-R (Ministerio de Economía, Industria y Competitividad, Spain). Javier Del Ser also thanks the Basque Government for its funding support through the EMAITEK program.

Author information

Authors and Affiliations

University of Málaga, 29071, Málaga, Spain
Maciej Rybinski & José F. Aldana-Montes
Anami Precision, San Sebastián, Spain
William Miller
TECNALIA, Bizkaia, Spain
Javier Del Ser
Basque Center for Applied Mathematics (BCAM), Bizkaia, Spain
Javier Del Ser
University of the Basque Country (UPV/EHU), 48013, Bilbao, Spain
Javier Del Ser & Miren Nekane Bilbao

Corresponding author

Correspondence to Maciej Rybinski.

Editor information

Editors and Affiliations

Basque Center for Applied Mathematics (BCAM), TECNALIA, University of the Basque Country (UPV/EHU), Derio, Bizkaia, Spain
Javier Del Ser
TECNALIA, Derio, Bizkaia, Spain
Eneko Osaba
University of the Basque Country (UPV/EHU), Bilbao, Bizkaia, Spain
Miren Nekane Bilbao
Computer Science Department, University of Las Palmas de Gran Canaria, Las Palmas, Spain
Javier J. Sanchez-Medina
OpenIoT Research Unit, FBK CREATE-NET, Trento, Italy
Massimo Vecchio
Department of Design Engineering and Mathematics, Middlesex University, London, UK
Xin-She Yang

Cite this paper

Rybinski, M., Miller, W., Del Ser, J., Bilbao, M.N., Aldana-Montes, J.F. (2018). On the Design and Tuning of Machine Learning Models for Language Toxicity Classification in Online Platforms. In: Del Ser, J., Osaba, E., Bilbao, M., Sanchez-Medina, J., Vecchio, M., Yang, XS. (eds) Intelligent Distributed Computing XII. IDC 2018. Studies in Computational Intelligence, vol 798. Springer, Cham. https://doi.org/10.1007/978-3-319-99626-4_29

DOI: https://doi.org/10.1007/978-3-319-99626-4_29
Published: 15 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99625-7
Online ISBN: 978-3-319-99626-4
eBook Packages: Engineering, Engineering (R0)
