Concepts in Topics. Using Word Embeddings to Leverage the Outcomes of Topic Modeling for the Exploration of Digitized Archival Collections

Coeckelbergs, Mathias; Van Hooland, Seth

doi:10.1007/978-3-030-50072-6_4

Part of the book series: Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering (LNICST, volume 319)

Included in the following conference series:

International Conference on Data and Information in Online Environments

Abstract

Within the Digital Humanities, unsupervised machine learning techniques such as topic modeling have attracted considerable attention in recent years as a way to explore large volumes of unstructured textual data. Although topic modeling is useful for capturing recurring themes across document sets that lack metadata, the literature has consistently highlighted the interpretation of topics as problematic. This paper proposes a novel method based on word embeddings to facilitate the interpretation of the terms that constitute a topic, making it possible to discern different concepts within a topic automatically. To demonstrate this method, the paper uses the “Cabinet Papers” held and digitised by The National Archives (TNA) of the United Kingdom (UK). After a discussion of our results, based on coherence measures, we detail how these results can be interpreted linguistically.
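The abstract does not spell out the procedure, so the following is only a minimal sketch of one way the idea could work: the top terms of a single topic are grouped into candidate concepts whenever their embedding vectors are close in cosine similarity. The function names, the similarity threshold of 0.8, and the three-dimensional toy vectors are all assumptions made for illustration; they are not taken from the paper, and real embeddings would be trained on the corpus and have far more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_topic_terms(terms, vectors, threshold=0.8):
    """Group topic terms into candidate concepts (hypothetical helper).

    A term joins every existing group that contains at least one term
    whose cosine similarity to it reaches the threshold; groups linked
    through the new term are merged (single-link grouping).
    """
    groups = []
    for term in terms:
        linked = [
            g for g in groups
            if any(cosine(vectors[term], vectors[t]) >= threshold for t in g)
        ]
        unlinked = [g for g in groups if g not in linked]
        merged = [t for g in linked for t in g] + [term]
        groups = unlinked + [merged]
    # Report groups in order of their earliest term in the topic's term list.
    return sorted(groups, key=lambda g: min(terms.index(t) for t in g))


# Toy data (assumed): six top terms of one topic with invented 3-d vectors.
topic_terms = ["budget", "army", "tax", "defence", "treasury", "troops"]
toy_vectors = {
    "budget": [0.90, 0.10, 0.00],
    "tax": [0.80, 0.20, 0.10],
    "treasury": [0.85, 0.15, 0.05],
    "army": [0.10, 0.90, 0.10],
    "defence": [0.05, 0.95, 0.00],
    "troops": [0.15, 0.85, 0.10],
}

concepts = group_topic_terms(topic_terms, toy_vectors)
print(concepts)
```

With these invented vectors the topic separates into a fiscal group and a military group. The threshold is a free parameter in this sketch: a lower value merges more terms into fewer concepts, a higher value splits them further.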

Notes

1. https://www.nationalarchives.gov.uk/cabinetpapers/

Author information

Authors and Affiliations

Université libre de Bruxelles, Brussels, Belgium
Mathias Coeckelbergs & Seth Van Hooland

Corresponding author

Correspondence to Mathias Coeckelbergs.

Editor information

Editors and Affiliations

Universidade de São Paulo, São Paulo, Brazil
Rogério Mugnaini

Cite this paper

Coeckelbergs, M., Van Hooland, S. (2020). Concepts in Topics. Using Word Embeddings to Leverage the Outcomes of Topic Modeling for the Exploration of Digitized Archival Collections. In: Mugnaini, R. (eds) Data and Information in Online Environments. DIONE 2020. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 319. Springer, Cham. https://doi.org/10.1007/978-3-030-50072-6_4

DOI: https://doi.org/10.1007/978-3-030-50072-6_4
Published: 16 June 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-50071-9
Online ISBN: 978-3-030-50072-6
eBook Packages: Computer Science, Computer Science (R0)
