Designing Annotation Schemes: From Model to Representation

Chapter in: Handbook of Linguistic Annotation

Abstract

The physical formats used to represent linguistic data and its annotations have evolved over the past four decades, accommodating different needs and perspectives and incorporating advances in data representation generally. This chapter surveys representation formats, covering the issues that arise in representing different data types together with current state-of-the-art solutions, in order to guide readers in choosing a representation format or formats.

Notes

  1. http://nlp.stanford.edu/software/tagger.shtml.

  2. http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/.

  3. http://www.coli.uni-saarland.de/~thorsten/tnt/.

  4. Several initiatives have focused on reusability of language data from the late 1980s onward; see chapter “Community Standards for Linguistically-Annotated Resources” in this volume for a fuller history of standards efforts in the field.

  5. Note that the Hypertext Markup Language (HTML) is an application of SGML/XML, in that it uses the SGML/XML meta-format to define specific tag names and document structure for use in creating web pages.

  6. www.tei-c.org/.

  7. http://www.ilc.cnr.it/EAGLES/browse.html.

  8. http://www-nlpir.nist.gov/related_projects/tipster/trec.htm.

  9. http://gate.ac.uk.

  10. http://groups.inf.ed.ac.uk/nxt/index.shtml.

  11. Originally called “remote markup”; see http://www.cs.vassar.edu/CES/CES1-5.html.

  12. An ad hoc mechanism to connect annotations on different graphs was later introduced into the AG model to accommodate hierarchical relations.

  13. http://www.ukp.tu-darmstadt.de/research/current-projects/dkpro/.

  14. http://www.conll.org.

  15. In addition, to solve the well-known problem of representing alternative tokenizations over the same data, segmentation into smaller units that may be combined to form differing tokenizations has been proposed [16, 30].

  16. http://www.natcorp.ox.ac.uk.

  17. An extreme example of hybrid standoff is the format used in PropBank (http://verbs.colorado.edu/~mpalmer/projects/ace.html), which uses the syntactic structure itself as the addressing scheme, attaching semantic role annotations to nodes in the syntax trees defined in the original Penn Treebank.

  18. In GrAF, each annotation is associated with a node in the annotation graph.

  19. Note that the decision to represent annotation layers in this fashion does not automatically lead to the distribution of layers across separate data files. While the MMAX2 model and others (see Sect. 5) indeed use one file per layer, other approaches such as that of the model underlying the Serengeti tool [21] prefer combining all information into a single file, which begins with the token layer and then lists the various standoff annotation layers.

  20. http://childes.psy.cmu.edu.

  21. Taken from http://www.ling.hawaii.edu/ldtc/website/syllabus/sp06/LehmannGlossing.pdf.

  22. https://tla.mpi.nl/tools/tla-tools/elan/.

  23. http://emu.sourceforge.net.

  24. http://www.fon.hum.uva.nl/praat/.

  25. http://www.exmaralda.org/en.

  26. http://www.anvil-software.org.

  27. www.mpi.nl/corpus/manuals/manual-elan.pdf.

  28. http://catalog.ldc.upenn.edu/LDC93S1.

  29. See Sect. 4.2.2.

  30. See Sect. 4.2.1.

  31. http://www.w3.org/RDF/.

  32. http://www.uml.org.

  33. The nature of the referring pointer used may depend on the medium. For text, references to beginning and ending offsets (“virtual nodes” between characters) of a text span are standard.

  34. See [31] for more detailed information on the GrAF resource header.

  35. http://graf.anc.org/gcs.

  36. http://linkeddata.org.

  37. http://www.w3.org/RDF/.

  38. http://json-ld.org.

  39. http://www.w3.org/TR/sparql11-query/.

  40. http://www.w3.org/TR/owl-ref/.

  41. For a more detailed description of NIF, see chapter “Community Standards for Linguistically-Annotated Resources”, Sect. 9 in this volume.

  42. http://nlp2rdf.org/nif-1-0/.

  43. http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core.

  44. The development of an OWL/DL version of FrameNet has been announced on the FrameNet site.

  45. See also chapter “Community Standards for Linguistically-Annotated Resources”, Sect. 6, in this volume.

  46. See also chapter “Community Standards for Linguistically-Annotated Resources”, Sect. 9, in this volume.

  47. http://www.w3.org/TR/2013/REC-sparql11-query-20130321/propertypaths.
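
Notes 15 and 33 describe two related ideas: addressing text spans by offsets that fall between characters, and building alternative tokenizations from a shared set of smaller segments. The following Python sketch illustrates both under simplifying assumptions. The `Span` class, the `merge` helper, the sample string, and its segmentation are hypothetical names and data chosen for illustration; they are not taken from GrAF or from any tool cited here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A standoff region over the primary text (hypothetical class).

    Offsets count the "virtual nodes" between characters, so a span
    covers primary[start:end] and never modifies the text itself.
    """
    start: int  # node before the first character of the span
    end: int    # node after the last character of the span

    def text(self, primary: str) -> str:
        return primary[self.start:self.end]


def merge(segments: list[Span], groups: list[list[int]]) -> list[Span]:
    """Build one tokenization by joining runs of adjacent minimal segments.

    Each group lists the indices of the segments that form one token.
    """
    tokens = []
    for group in groups:
        first, last = segments[group[0]], segments[group[-1]]
        tokens.append(Span(first.start, last.end))
    return tokens


primary = "New York-based"

# Minimal segments: an assumed segmentation, fine-grained enough that
# both tokenizations below can be expressed as groupings of it.
segments = [Span(0, 3), Span(4, 8), Span(8, 9), Span(9, 14)]

# Tokenization A keeps "York-based" together; B splits at the hyphen.
tokens_a = merge(segments, [[0], [1, 2, 3]])
tokens_b = merge(segments, [[0], [1], [2], [3]])

texts_a = [t.text(primary) for t in tokens_a]
texts_b = [t.text(primary) for t in tokens_b]
print(texts_a)
print(texts_b)
```

Because both tokenizations point into the same primary text through the same minimal segments, annotations attached to either one can be compared by offset without altering the source data.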

References

  1. Baker, C., Fellbaum, C.: WordNet and FrameNet as complementary resources for annotation. In: Proceedings of the Third Linguistic Annotation Workshop (LAW III), pp. 125–129. Suntec, Singapore (2009)

  2. Bański, P., Przepiórkowski, A.: Stand-off TEI annotation: the case of the National Corpus of Polish. In: Proceedings of the Third Linguistic Annotation Workshop (LAW III), pp. 64–67. Suntec, Singapore (2009)

  3. Bird, S., Liberman, M.: A formal framework for linguistic annotation. Speech Commun. 33(1–2), 23–60 (2001)

  4. Bow, C., Hughes, B., Bird, S.: Towards a general model of interlinear text. In: Proceedings of EMELD workshop, pp. 11–13 (2003)

  5. Bradshaw, J., Burridge, K., Clyne, M.: The Monash Corpus of Spoken Australian English. In: Proceedings of the 2008 Conference of the Australian Linguistics Society, pp. 2123/7099 (2009)

  6. Brants, T., Skut, W., Krenn, B.: Tagging grammatical functions. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-97). Providence, RI (1997)

  7. Brants, S., Dipper, S., Eisenberg, P., Hansen-Schirra, S., König, E., Lezius, W., Rohrer, C., Smith, G., Uszkoreit, H.: TIGER: linguistic interpretation of a German corpus. Res. Lang. Comput. 2(4), 597–620 (2004)

  8. Bray, T., Paoli, J., Sperberg-McQueen, C.M. (eds.): Extensible Markup Language (XML) Version 1.0. W3C Recommendation. http://www.w3.org/TR/1998/REC-xml-19980210 (1998)

  9. Carletta, J., Evert, S., Heid, U., Kilgour, J.: The NITE XML Toolkit: data model and query. Lang. Res. Eval. J. (LREJ) 39(4), 313–334 (2005)

  10. Charniak, E.: A Maximum-entropy-inspired Parser. In: Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pp. 132–139 (2000)

  11. Chen, P.P.S.: The entity-relationship model: toward a unified view of data. ACM Trans. Database Syst. 1(1), 9–36 (1976)

  12. Chiarcos, C.: A generic formalism to represent linguistic corpora in RDF and OWL/DL. In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC-2012) (2012)

  13. Chiarcos, C.: Grounding an ontology of linguistic annotations in the data category registry. In: Workshop on Language Resource and Language Technology Standards (LR & LTS), held in Conjunction with LREC 2010. Valletta, Malta (2010)

  14. Chiarcos, C.: An ontology of linguistic annotations. LDV Forum 23(1), 1–16 (2008)

  15. Chiarcos, C., Dipper, S., Götze, M., Leser, U., Lüdeling, A., Ritz, J., Stede, M.: A flexible framework for integrating annotations from different tools and tagsets. TAL (Traitement automatique des langues) 49(2), 217–246 (2008)

  16. Chiarcos, C., Ritz, J., Stede, M.: By all these lovely tokens. Merging conflicting tokenizations. Lang. Res. Eval. 46(1), 53–74 (2012)

  17. Church, K.W.: A stochastic parts program and noun phrase parser for unrestricted text. In: ANLC ’88: Proceedings of the Second Conference on Applied Natural Language Processing (1988)

  18. Collins, M.: Head-driven statistical models for natural language parsing. Comput. Linguist. 29(4), 589–637 (2003)

  19. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: a framework and graphical development environment for robust NLP tools and applications. In: ACL. http://www.aclweb.org/anthology/P02-1022 (2002). doi:10.3115/1073083.1073112

  20. DeRose, S.J.: Grammatical category disambiguation by statistical optimization. Comput. Linguist. 14(1), 31–39 (1988)

  21. Diewald, N., Stührenberg, M., Garbar, A., Goecke, D.: Serengeti - Webbasierte Annotation semantischer Relationen. J. Lang. Technol. Comput. Linguist. 23(2), 74–93 (2008)

  22. Eckart, K., Riester, A., Schweitzer, K.: A discourse information radio news database for linguistic analysis. In: Nordhoff, S., Hellmann, S., Chiarcos, C. (eds.) Linked data in Linguistics. Springer (2012)

  23. Farrar, S., Langendoen, D.T.: An OWL-DL implementation of GOLD: an ontology for the semantic web. In: Witt, A., Metzing, D. (eds.) Linguistic Modeling of Information and Markup Languages. Springer, Dordrecht (2010)

  24. Ferrucci, D., Lally, A.: UIMA: An architectural approach to unstructured information processing in the corporate research environment. Nat. Lang. Eng. 10(3/4), 327–348 (2004)

  25. Goodwin, C., Heritage, J.: Conversation analysis. Ann. Rev. Anthropol. 19, 283–307 (1990)

  26. Grishman, R. (ed.): Tipster Text Architecture Design. http://www-nlpir.nist.gov/related_projects/tipster/ (1998)

  27. Grishman, R., Sundheim, B.: Message understanding conference - 6: A brief history. In: Proceedings of the International Conference on Computational Linguistics, pp. 466–471 (1996)

  28. Ide, N.: Corpus encoding standard: SGML guidelines for encoding linguistic corpora. In: Proceedings of the First International Language Resources and Evaluation Conference (LREC), pp. 463–70 (1998)

  29. Ide, N., Pustejovsky, J.: What does interoperability mean, anyway? Toward an operational definition of interoperability for language technology. In: Proceedings of the Second International Conference on Global Interoperability for Language Resources (ICGL 2010) (2010)

  30. Ide, N., Suderman, K.: GrAF: a graph-based format for linguistic annotations. In: Proceedings of the Linguistic Annotation Workshop (LAW), pp. 1–8. Prague (2007)

  31. Ide, N., Suderman, K.: The linguistic annotation framework: a standard for annotation interchange and merging. Lang. Res. Eval. 48(3), 395–418 (2014)

  32. Ide, N., Romary, L., de la Clergerie, E.: International standard for a linguistic annotation framework. In: Proceedings of HLT-NAACL’03 Workshop on the Software Engineering and Architecture of Language Technology, pp. 25–30. Edmonton, Canada (2003)

  33. Ide, N., Bonhomme, P., Romary, L.: XCES: An XML-based Standard for Linguistic Corpora. In: Proceedings of the Second International Language Resources and Evaluation Conference (LREC), pp. 825–830 (2000)

  34. Ide, N., Suderman, K., Simms, B.: ANC2Go: A Web application for customized corpus creation. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC) (2010)

  35. Ide, N., Baker, C., Fellbaum, C., Passonneau, R.: The Manually Annotated Sub-Corpus: A Community Resource For and By the People. In: Proceedings of the ACL 2010 Conference Short Papers, pp. 68–73. Uppsala, Sweden (2010)

  36. ISO 8879:1986: Information processing – Text and Office Systems – Standard Generalized Markup Language (SGML). International Organization for Standardization (1986)

  37. ISO 24612:2012: Language resource management – Linguistic Annotation Framework (LAF). International Organization for Standardization (2012)

  38. Kemps-Snijders, M., Windhouwer, M., Wittenburg, P., Wright, S.: ISOcat: remodelling metadata for language resources. Int. J. Metadata Semant. Ontol. 4(4), 261–276 (2009)

  39. Mann, W., Thompson, S.: Rhetorical structure theory: towards a functional theory of text organization. TEXT 8, 243–281 (1988)

  40. Marcus, M.P., Santorini, B., Marcinkiewicz, M.A.: Building a Large Annotated Corpus of English: The Penn Treebank. Comput. Linguist. 19(2), 313–330 (1993)

  41. Moro, A., Navigli, R., Tucci, F.M., Passonneau, R.J.: Annotating the MASC corpus with BabelNet. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC) (2014)

  42. Müller, C., Strube, M.: Multi-level annotation of linguistic data with MMAX2. In: Braun, S., Kohn, K., Mukherjee, J. (eds.): Corpus Technology and Language Pedagogy. New Resources, New Tools, New Methods, pp. 197–214. Frankfurt: Peter Lang (2006)

  43. Schmidt, T.: Visualising linguistic annotation as interlinear text. Sonderforschungsbereich 538 (2003)

  44. Schmidt, T., Duncan, S., Ehmer, O., Hoyt, J., Kipp, M., Loehr, D., Magnusson, M., Rose, T., Sloetjes, H.: An exchange format for multimodal annotations. In: Multimodal Corpora, pp. 207–221. Springer (2009)

  45. Schmidt, T., Elenius, K., Trilsbeek, P.: Multimedia corpora (Media encoding and annotation). Draft submitted to CLARIN WG 5.7. as input to CLARIN deliverable D5.C-3 Interoperability and Standards (2010)

  46. Wightman, C., Price, P., Pierrehumbert, J., Hirschberg, J.: ToBI: a standard for labeling English prosody. In: Proceedings of the 1992 International Conference on Spoken Language Processing, ICSLP, pp. 12–16 (1992)

  47. Windhouwer, M., Wright, S.: Linking to linguistic data categories in ISOcat. In: Chiarcos, C., Nordhoff, S., Hellmann, S. (eds.) Linked data in Linguistics, pp. 99–107. Springer, Heidelberg (2012)

  48. Wittenburg, P., Lenkiewicz, P., Auer, E., Lenkiewicz, A., Gebre, B.G., Drude, S.: AV processing in eHumanities: a paradigm shift. In: Digital Humanities 2012 Conference, vol. 2 (2012)

  49. Zeldes, A., Ritz, J., Lüdeling, A., Chiarcos, C.: ANNIS: a search tool for multi-layer annotated corpora. In: Proceedings of Corpus Linguistics 2009. Liverpool, UK (2009)

  50. Zipser, F., Romary, L.: A model oriented approach to the mapping of annotation formats using standards. In: Proceedings of the Workshop on Language Resource and Language Technology Standards, pp. 7–18 (2010)

Author information

Correspondence to Nancy Ide.

Copyright information

© 2017 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Ide, N., Chiarcos, C., Stede, M., Cassidy, S. (2017). Designing Annotation Schemes: From Model to Representation. In: Ide, N., Pustejovsky, J. (eds) Handbook of Linguistic Annotation. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-0881-2_3

  • DOI: https://doi.org/10.1007/978-94-024-0881-2_3

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-94-024-0879-9

  • Online ISBN: 978-94-024-0881-2

  • eBook Packages: Social Sciences (R0)