Concepts in Topics. Using Word Embeddings to Leverage the Outcomes of Topic Modeling for the Exploration of Digitized Archival Collections

  • Mathias Coeckelbergs
  • Seth Van Hooland
Conference paper
Part of the Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering book series (LNICST, volume 319)


Within the field of Digital Humanities, unsupervised machine learning techniques such as topic modeling have gained considerable attention in recent years as a means of exploring vast volumes of unstructured textual data. Although this technique is useful for capturing recurring themes across document sets that have no metadata, the interpretation of topics has been consistently highlighted in the literature as problematic. This paper proposes a novel method based on word embeddings to facilitate the interpretation of the terms that constitute a topic, making it possible to discern different concepts automatically within a topic. To demonstrate this method, the paper uses the “Cabinet Papers” held and digitised by The National Archives (TNA) of the United Kingdom (UK). After a discussion of our results, based on coherence measures, we provide details of how these results can be interpreted linguistically.
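As a rough illustration of the general idea, and not the authors' implementation, the sketch below groups the top terms of a single topic by the cosine similarity of their word vectors, so that each group can be read as one candidate concept. Everything in it is an assumption made for illustration: the three-dimensional vectors in TOY_EMBEDDINGS are invented rather than trained on the Cabinet Papers, the example terms are hypothetical, and the greedy grouping rule and the 0.8 threshold are arbitrary choices that the abstract does not specify.

```python
import math

# Hypothetical toy embeddings: invented 3-d vectors, NOT trained on any corpus.
TOY_EMBEDDINGS = {
    "treaty": [0.9, 0.1, 0.0],
    "negotiation": [0.8, 0.2, 0.1],
    "embassy": [0.85, 0.15, 0.05],
    "budget": [0.1, 0.9, 0.0],
    "taxation": [0.05, 0.95, 0.1],
    "expenditure": [0.15, 0.85, 0.05],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def group_topic_terms(terms, embeddings, threshold=0.8):
    """Greedy single-pass grouping (an assumed rule, not the paper's).

    A term joins the first group whose seed (first) term is at least
    `threshold` similar to it; otherwise it starts a new group.
    """
    groups = []
    for term in terms:
        for group in groups:
            if cosine(embeddings[term], embeddings[group[0]]) >= threshold:
                group.append(term)
                break
        else:
            # No existing group was similar enough: open a new one.
            groups.append([term])
    return groups


# Hypothetical top terms of one topic that mixes two concepts.
topic_terms = ["treaty", "budget", "negotiation", "taxation", "embassy", "expenditure"]
concept_groups = group_topic_terms(topic_terms, TOY_EMBEDDINGS)
for group in concept_groups:
    print(group)
```

With real data, the toy dictionary would be replaced by vectors trained on the collection or taken from a pretrained model, and the grouping rule by whichever clustering procedure the full paper describes.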


Topic modeling · Word embeddings · Document classification · Information retrieval



Copyright information

© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2020

Authors and Affiliations

  1. Université libre de Bruxelles, Brussels, Belgium
