Creating and Analyzing Literary Corpora

  • Michael Percillier
Part of the Multimedia Systems and Applications book series (MMSA)


Using a study of non-standardized linguistic features in literary texts as a working example, the chapter describes the creation of a digital corpus from printed source texts, as well as its subsequent annotation and analysis. The sections detailing the process of corpus creation take readers through the steps of document scanning, Optical Character Recognition, proofreading, and conversion of plain text to XML, while offering advice on best practices and overviews of existing tools. The presented corpus annotation method introduces the programming language Python as a tool for automated basic annotation, and showcases methods for facilitating thorough manual annotation. The data analysis covers both qualitative analysis, facilitated by CSS styling of XML data, and quantitative analysis, performed with the statistical software package R and illustrated with a number of sample analyses.
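To make the step of converting plain text to XML more concrete, the following is a minimal sketch in Python using only the standard library. It is not the chapter's own script: the function name `text_to_xml`, the element and attribute names (`text`, `p`, `n`, `words`), and the assumption that paragraphs are separated by blank lines are all hypothetical choices made for illustration.

```python
import re
import xml.etree.ElementTree as ET


def text_to_xml(plain_text, title):
    """Wrap blank-line-separated paragraphs of plain text in XML elements.

    The element and attribute names used here (<text>, <p>, n, words) are
    hypothetical and are not taken from the chapter.
    """
    root = ET.Element("text", {"title": title})
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", plain_text) if p.strip()]
    for number, paragraph in enumerate(paragraphs, start=1):
        # Collapse line breaks left over from the printed page layout.
        content = " ".join(paragraph.split())
        # Store a running paragraph number and a whitespace-based word count.
        attributes = {"n": str(number), "words": str(len(content.split()))}
        element = ET.SubElement(root, "p", attributes)
        element.text = content
    return root


sample = "First paragraph,\nbroken over two lines.\n\nSecond paragraph."
xml_root = text_to_xml(sample, "Sample")
xml_string = ET.tostring(xml_root, encoding="unicode")
print(xml_string)
```

Building the tree with `xml.etree.ElementTree` rather than concatenating strings means that characters such as `&` and `<` in the source text are escaped automatically when the tree is serialized. The word count here is a plain whitespace split, which is only a rough approximation for real corpus work.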


Keywords: Data Frame; Optical Character Recognition; Word Count; Quotation Mark; Python Script



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. University of Mannheim, Mannheim, Germany
