Linguistic Annotation in/for Corpus Linguistics

Handbook of Linguistic Annotation

Abstract

This article surveys linguistic annotation in corpora and corpus linguistics. We first define the concept of ‘corpus’ as a radial category and then, in Sect. 2, discuss a variety of kinds of information for which corpora are annotated and that are exploited in contemporary corpus linguistics. Section 3 then exemplifies many current formats of annotation with an eye to highlighting both the diversity of formats currently available and the emergence of XML annotation as, for now, the most widespread form of annotation. Section 4 summarizes and concludes with desiderata for future developments.

Notes

  1. A reviewer points out that most corpora are in English and are thus by default Unicode-compliant, since English orthographic characters use the ASCII subset of Unicode.

  2. A reviewer points out that entry of IPA characters is still difficult on some computers, although software like IPA Palette (http://www.blugs.com/IPA/) makes this task easier than it used to be.

Author information

Corresponding author

Correspondence to Stefan Th. Gries.

Copyright information

© 2017 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Gries, S.T., Berez, A.L. (2017). Linguistic Annotation in/for Corpus Linguistics. In: Ide, N., Pustejovsky, J. (eds) Handbook of Linguistic Annotation. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-0881-2_15

  • DOI: https://doi.org/10.1007/978-94-024-0881-2_15

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-94-024-0879-9

  • Online ISBN: 978-94-024-0881-2

  • eBook Packages: Social Sciences, Social Sciences (R0)
