Abstract
The use of disinformation and purposefully biased reportage to sway public opinion has become a serious concern. We present a new dataset related to the Ukrainian Crisis of 2014–2015 that other researchers can use to train, test, and compare bias detection algorithms. The dataset contains 4,538 English-language articles about the crisis (1.7M words in total) from 227 news sources in 43 countries, including Ukraine. We manually classified the bias of each article as pro-Russian, pro-Western, or Neutral, and aligned each article with a master timeline of 17 major events. When trained on the whole dataset, a simple baseline SVM classifier using doc2vec embeddings as features achieves an \(F_{1}\) score of 0.86. This performance is deceptively high, however, because (1) the model is almost completely unable to correctly classify articles published in Ukraine (\(F_{1}\) of 0.07), and (2) the model performs nearly as well when trained on unrelated geopolitics articles written by the same publishers and tested on the dataset. As other researchers have pointed out, these results suggest that models of this type learn journalistic styles rather than actually model bias. More sophisticated approaches will therefore be necessary for true bias detection and classification, and this dataset can serve as an incisive test of new approaches.
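The gap between the overall score and the score on articles published in Ukraine can be illustrated by computing \(F_{1}\) separately for each publication country. The following is a minimal sketch, not the authors' evaluation code: it assumes macro-averaged \(F_{1}\) (the abstract does not say which averaging was used), the function names are hypothetical, and the records are invented toy data rather than entries from the dataset.

```python
from collections import defaultdict

# The three bias labels named in the abstract.
LABELS = ("pro-Russian", "pro-Western", "Neutral")


def f1_for_label(gold, pred, label):
    """F1 for one label: 2*TP / (2*TP + FP + FN), or 0.0 if undefined."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-label F1 over labels seen in gold or pred.

    Macro averaging is an assumption made for this sketch.
    """
    present = [lab for lab in labels if lab in gold or lab in pred]
    if not present:
        return 0.0
    return sum(f1_for_label(gold, pred, lab) for lab in present) / len(present)


def macro_f1_by_group(records):
    """records: iterable of (group, gold_label, predicted_label) tuples."""
    by_group = defaultdict(lambda: ([], []))
    for group, gold, pred in records:
        by_group[group][0].append(gold)
        by_group[group][1].append(pred)
    return {g: macro_f1(gold, pred) for g, (gold, pred) in by_group.items()}


# Invented toy predictions (NOT from the dataset): the classifier is right
# for two source countries and wrong for every article from the third.
records = [
    ("Russia", "pro-Russian", "pro-Russian"),
    ("Russia", "pro-Russian", "pro-Russian"),
    ("USA", "pro-Western", "pro-Western"),
    ("USA", "Neutral", "Neutral"),
    ("Ukraine", "pro-Western", "pro-Russian"),
    ("Ukraine", "Neutral", "pro-Russian"),
]

gold_all = [gold for _, gold, _ in records]
pred_all = [pred for _, _, pred in records]
overall = macro_f1(gold_all, pred_all)
scores = macro_f1_by_group(records)

print("overall:", round(overall, 2))
for group, score in sorted(scores.items()):
    print(group, round(score, 2))
```

Reporting the score per publication country alongside the aggregate makes this kind of failure visible, since an aggregate score can stay moderate or high while one group is misclassified entirely.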
Notes
- 1. As classified by the news media site AllSides, https://www.allsides.com/unbiased-balanced-news.
- 5. The second author is an undergraduate researcher majoring in International Relations and specializing in Russia.
- 6. Sputnik uses the word “Topics” to refer to their article categories, though these serve the same organizing purpose as Wikipedia’s events.
Acknowledgements
This work was supported by the Office of Naval Research (ONR) under grant number N00014-17-1-2983.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Cremisini, A., Aguilar, D., Finlayson, M.A. (2019). A Challenging Dataset for Bias Detection: The Case of the Crisis in the Ukraine. In: Thomson, R., Bisgin, H., Dancy, C., Hyder, A. (eds) Social, Cultural, and Behavioral Modeling. SBP-BRiMS 2019. Lecture Notes in Computer Science, vol 11549. Springer, Cham. https://doi.org/10.1007/978-3-030-21741-9_18
DOI: https://doi.org/10.1007/978-3-030-21741-9_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-21740-2
Online ISBN: 978-3-030-21741-9
eBook Packages: Computer Science (R0)