Abstract
The spread of misinformation on social media has become a prevalent societal problem and a cause of many kinds of social unrest. Curtailing it is of great importance, and machine learning has shown significant promise for doing so. However, applying machine learning to this problem raises two main challenges. First, although misinformation is far too prevalent in absolute terms, it represents only a small proportion of all the posts seen on social media. Second, labeling the massive amount of data needed to train a useful classifier is impractical. To address these challenges, we propose a simple semi-supervised learning framework for dealing with extreme class imbalance. Its advantage over other approaches is that it uses actual rather than simulated data to inflate the minority class. We tested our framework on two sets of Covid-related Twitter data and obtained a significant improvement in F1-measure in extremely imbalanced scenarios, compared with simple classical and deep-learning data generation methods such as SMOTE, ADASYN, and GAN-based data generation.
Notes
- 3. The bootstrap technique is implemented by repeating the previously described sampling and testing procedure 100 times and using the results of these experiments to estimate the true F1, precision, and recall values and their standard deviations.
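The bootstrap note above can be illustrated with a minimal sketch. It is not the authors' implementation: the sampling and testing procedure itself is not given here, so it is represented by a hypothetical callable `run_once`, and the toy labels, predictions, and function names are assumptions made purely for illustration. The sketch only shows the aggregation step: run the procedure 100 times, then report the mean and standard deviation of precision, recall, and F1.

```python
import random
import statistics


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive (minority) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def bootstrap_estimate(run_once, n_rounds=100, seed=0):
    """Repeat run_once n_rounds times; return {metric: (mean, std)}."""
    rng = random.Random(seed)
    rounds = [run_once(rng) for _ in range(n_rounds)]
    names = ("precision", "recall", "f1")
    return {
        name: (statistics.mean(values), statistics.stdev(values))
        for name, values in zip(names, zip(*rounds))
    }


# Hypothetical toy data standing in for one evaluation set:
# 10 minority and 90 majority items with fixed, made-up predictions.
Y_TRUE = [1] * 10 + [0] * 90
Y_PRED = [1] * 8 + [0] * 2 + [1] * 5 + [0] * 85


def toy_round(rng):
    # Assumed stand-in for one sampling-and-testing round:
    # resample the evaluation set with replacement, then score it.
    idx = [rng.randrange(len(Y_TRUE)) for _ in range(len(Y_TRUE))]
    return precision_recall_f1([Y_TRUE[i] for i in idx],
                               [Y_PRED[i] for i in idx])


estimates = bootstrap_estimate(toy_round, n_rounds=100, seed=0)
for name, (mean, std) in estimates.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

In a real experiment, `run_once` would presumably draw a fresh imbalanced sample, train the classifier, and test it; only the aggregation over the 100 rounds is meant to mirror the note.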
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, Y., Boukouvalas, Z., Japkowicz, N. (2021). A Semi-supervised Framework for Misinformation Detection. In: Soares, C., Torgo, L. (eds) Discovery Science. DS 2021. Lecture Notes in Computer Science, vol 12986. Springer, Cham. https://doi.org/10.1007/978-3-030-88942-5_5
DOI: https://doi.org/10.1007/978-3-030-88942-5_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-88941-8
Online ISBN: 978-3-030-88942-5
eBook Packages: Computer Science, Computer Science (R0)