Slicing and Dicing a Newspaper Corpus for Historical Ecology Research

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11313)


Historical newspapers are a novel source of information for historical ecologists studying the interactions between humans and animals through time and space. Newspaper archives are particularly interesting to analyse because of their breadth and depth. However, the size and occasional noisiness of such archives also bring difficulties, as manual analysis is impossible. In this paper, we present experiments and results on automatic query expansion and categorisation for the perception of animal species between 1800 and 1940. For query expansion and to support the manual annotation process, we used lexicons. For the categorisation, we trained a Support Vector Machine model. Our results indicate that we can distinguish newspaper articles that are about animal species from those that are not with an F\(_{1}\) of 0.92, and that we can subcategorise the different types of newspaper articles on animals with an F\(_{1}\) of up to 0.84.
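As a rough illustration of two ideas mentioned in the abstract, the sketch below shows lexicon-based query expansion and how an F\(_{1}\) score is computed from gold and predicted labels. It is not the authors' implementation: the lexicon entries, the function names `expand_query` and `f1_score`, and the toy labels are all hypothetical and chosen only for the example.

```python
from typing import Dict, List, Sequence

# Hypothetical lexicon mapping a query term to spelling variants.
# The entries are invented placeholders, not taken from the paper's lexicons.
LEXICON: Dict[str, List[str]] = {
    "wolf": ["wolf", "wolven", "wolff"],
    "otter": ["otter", "otters"],
}


def expand_query(term: str, lexicon: Dict[str, List[str]]) -> List[str]:
    """Return the term plus its lexicon variants, deduplicated, order preserved."""
    seen = set()
    expanded = []
    for variant in [term] + lexicon.get(term, []):
        key = variant.lower()
        if key not in seen:
            seen.add(key)
            expanded.append(variant)
    return expanded


def f1_score(gold: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # true positives
    fp = sum(1 for g, p in pairs if not g and p)    # false positives
    fn = sum(1 for g, p in pairs if g and not p)    # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy usage: expand a query term and score four made-up predictions.
terms = expand_query("wolf", LEXICON)
score = f1_score([True, True, False, False], [True, False, True, False])
```

In a setting like the one described, the expanded terms would be used to retrieve candidate articles, and the F\(_{1}\) score would be computed on a manually annotated sample.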


Natural language processing, Lexicology, Humanities, Historical ecology, Digital libraries



The research for this paper was made possible by the CLARIAH-CORE project financed by NWO. We thank the Dutch National Library for providing access to their newspaper corpus.


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. DHLab, KNAW Humanities Cluster, Amsterdam, Netherlands
  2. Instituut voor de Nederlandse Taal, Leiden, Netherlands
  3. Radboud University Nijmegen, Nijmegen, Netherlands
