Designing Annotation Schemes: From Model to Representation

Ide, Nancy; Chiarcos, Christian; Stede, Manfred; Cassidy, Steve

doi:10.1007/978-94-024-0881-2_3

Abstract

The physical formats used to represent linguistic data and its annotations have evolved over the past four decades, accommodating different needs and perspectives as well as incorporating advances in data representation generally. This chapter provides an overview of representation formats with the aim of surveying the relevant issues for representing different data types together with current state-of-the-art solutions, in order to provide sufficient information to guide others in the choice of a representation format or formats.

Notes

1. http://nlp.stanford.edu/software/tagger.shtml.
2. http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/.
3. http://www.coli.uni-saarland.de/~thorsten/tnt/.
4. Several initiatives have focused on reusability of language data from the late 1980s onward; see chapter “Community Standards for Linguistically-Annotated Resources” in this volume for a fuller history of standards efforts in the field.
5. Note that the Hypertext Markup Language (HTML) is an application of SGML/XML, in that it uses the SGML/XML meta-format to define specific tag names and document structure for use in creating web pages.
6. www.tei-c.org/.
7. http://www.ilc.cnr.it/EAGLES/browse.html.
8. http://www-nlpir.nist.gov/related_projects/tipster/trec.htm.
9. http://gate.ac.uk.
10. http://groups.inf.ed.ac.uk/nxt/index.shtml.
11. Originally called “remote markup”; see http://www.cs.vassar.edu/CES/CES1-5.html.
12. An ad hoc mechanism to connect annotations on different graphs was later introduced into the AG model to accommodate hierarchical relations.
13. http://www.ukp.tu-darmstadt.de/research/current-projects/dkpro/.
14. http://www.conll.org.
15. In addition, to solve the well-known problem of representing alternative tokenizations over the same data, segmentation into smaller units that may be combined to form differing tokenizations has been proposed [16, 30].
16. http://www.natcorp.ox.ac.uk.
17. An extreme example of hybrid standoff is the format used in PropBank (http://verbs.colorado.edu/~mpalmer/projects/ace.html), which uses the syntactic structure to attach semantic role annotations to nodes in the syntax trees defined in the original Penn Treebank.
18. In GrAF, each annotation is associated with a node in the annotation graph.
19. Note that the decision to represent annotation layers in this fashion does not automatically lead to the distribution of layers across separate data files. While the MMAX2 model and others (see Sect. 5) indeed use one file per layer, other approaches such as that of the model underlying the Serengeti tool [21] prefer combining all information into a single file, which begins with the token layer and then lists the various standoff annotation layers.
20. http://childes.psy.cmu.edu.
21. Taken from http://www.ling.hawaii.edu/ldtc/website/syllabus/sp06/LehmannGlossing.pdf.
22. https://tla.mpi.nl/tools/tla-tools/elan/.
23. http://emu.sourceforge.net.
24. http://www.fon.hum.uva.nl/praat/.
25. http://www.exmaralda.org/en.
26. http://www.anvil-software.org.
27. www.mpi.nl/corpus/manuals/manual-elan.pdf.
28. http://catalog.ldc.upenn.edu/LDC93S1.
29. See Sect. 4.2.2.
30. See Sect. 4.2.1.
31. http://www.w3.org/RDF/.
32. http://www.uml.org.
33. The nature of the referring pointer used may depend on the medium. For text, references to beginning and ending offsets (“virtual nodes” between characters) of a text span are standard.
34. See [31] for more detailed information on the GrAF resource header.
35. http://graf.anc.org/gcs.
36. http://linkeddata.org.
37. http://www.w3.org/RDF/.
38. http://json-ld.org.
39. http://www.w3.org/TR/sparql11-query/.
40. http://www.w3.org/TR/owl-ref/.
41. For a more detailed description of NIF, see chapter “Community Standards for Linguistically-Annotated Resources”, Sect. 9 in this volume.
42. http://nlp2rdf.org/nif-1-0/.
43. http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core.
44. The development of an OWL/DL version of FrameNet has been announced on the FrameNet site.
45. See also chapter “Community Standards for Linguistically-Annotated Resources”, Sect. 6, in this volume.
46. See also chapter “Community Standards for Linguistically-Annotated Resources”, Sect. 9, in this volume.
47. http://www.w3.org/TR/2013/REC-sparql11-query-20130321/propertypaths.

References

1. Baker, C., Fellbaum, C.: WordNet and FrameNet as complementary resources for annotation. In: Third Linguistic Annotation Workshop (LAW-2009), pp. 125–129. Suntec, Singapore (2009)
2. Banski, P., Przepiórkowski, A.: Stand-off TEI annotation: the case of the National Corpus of Polish. In: Proceedings of the Third Linguistic Annotation Workshop (LAW III), pp. 64–67. Suntec, Singapore (2009)
3. Bird, S., Liberman, M.: A formal framework for linguistic annotation. Speech Commun. 33(1–2), 23–60 (2001)
4. Bow, C., Hughes, B., Bird, S.: Towards a general model of interlinear text. In: Proceedings of EMELD workshop, pp. 11–13 (2003)
5. Bradshaw, J., Burridge, K., Clyne, M.: The Monash Corpus of Spoken Australian English. In: Proceedings of the 2008 Conference of the Australian Linguistics Society, pp. 2123/7099 (2009)
6. Brants, T., Skut, W., Krenn, B.: Tagging grammatical functions. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-97). Providence, RI (1997)
7. Brants, S., Dipper, S., Eisenberg, P., Hansen-Schirra, S., König, E., Lezius, W., Rohrer, C., Smith, G., Uszkoreit, H.: TIGER: linguistic interpretation of a German corpus. Res. Lang. Comput. 2(4), 597–620 (2004)
8. Bray, T., Paoli, J., Sperberg-McQueen, C.M. (eds.): Extensible Markup Language (XML) Version 1.0. W3C Recommendation. http://www.w3.org/TR/1998/REC-xml-19980210 (1998)
9. Carletta, J., Evert, S., Heid, U., Kilgour, J.: The NITE XML Toolkit: data model and query. Lang. Res. Eval. J. (LREJ) 39(4), 313–334 (2005)
10. Charniak, E.: A maximum-entropy-inspired parser. In: Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pp. 132–139 (2000)
11. Chen, P.P.S.: The entity-relationship model–toward a unified view of data. ACM Trans. Database Syst. 1(1), 9–36 (1976)
12. Chiarcos, C.: A generic formalism to represent linguistic corpora in RDF and OWL/DL. In: 8th International Conference on Language Resources and Evaluation (LREC-2012) (accepted)
13. Chiarcos, C.: Grounding an ontology of linguistic annotations in the data category registry. In: Workshop on Language Resource and Language Technology Standards (LR & LTS), held in Conjunction with LREC 2010. Valletta, Malta (2010)
14. Chiarcos, C.: An ontology of linguistic annotations. LDV Forum 23(1), 1–16 (2008)
15. Chiarcos, C., Dipper, S., Götze, M., Leser, U., Lüdeling, A., Ritz, J., Stede, M.: A flexible framework for integrating annotations from different tools and tagsets. TAL (Traitement automatique des langues) 49(2), 217–246 (2008)
16. Chiarcos, C., Ritz, J., Stede, M.: By all these lovely tokens. Merging conflicting tokenizations. Lang. Res. Eval. 46(1), 53–74 (2012)
17. Church, K.W.: A stochastic parts program and noun phrase parser for unrestricted text. In: ANLC ’88: Proceedings of the Second Conference on Applied Natural Language Processing (1988)
18. Collins, M.: Head-driven statistical models for natural language parsing. Comput. Linguist. 29(4), 589–637 (2003)
19. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: a framework and graphical development environment for robust NLP tools and applications. In: ACL. http://www.aclweb.org/anthology/P02-1022 (2002). doi:10.3115/1073083.1073112
20. DeRose, S.J.: Grammatical category disambiguation by statistical optimization. Comput. Linguist. 14(1), 31–39 (1988)
21. Diewald, N., Stührenberg, M., Garbar, A., Goecke, D.: Serengeti - Webbasierte Annotation semantischer Relationen. J. Lang. Technol. Comput. Linguist. 23(2), 74–93 (2008)
22. Eckart, K., Riester, A., Schweitzer, K.: A discourse information radio news database for linguistic analysis. In: Chiarcos, C., Nordhoff, S., Hellmann, S. (eds.) Linked Data in Linguistics. Springer (2012)
23. Farrar, S., Langendoen, D.T.: An OWL-DL implementation of GOLD: an ontology for the semantic web. In: Witt, A., Metzing, D. (eds.) Linguistic Modeling of Information and Markup Languages. Springer, Dordrecht (2010)
24. Ferrucci, D., Lally, A.: UIMA: an architectural approach to unstructured information processing in the corporate research environment. Nat. Lang. Eng. 10(3/4), 327–348 (2004)
25. Goodwin, C., Heritage, J.: Conversation analysis. Ann. Rev. Anthropol. 19, 283–307 (1990)
26. Grishman, R. (ed.): Tipster Text Architecture Design. http://www-nlpir.nist.gov/related_projects/tipster/ (1998)
27. Grishman, R., Sundheim, B.: Message understanding conference - 6: a brief history. In: Proceedings of the International Conference on Computational Linguistics, pp. 466–471 (1996)
28. Ide, N.: Corpus encoding standard: SGML guidelines for encoding linguistic corpora. In: Proceedings of the First International Language Resources and Evaluation Conference (LREC), pp. 463–70 (1998)
29. Ide, N., Pustejovsky, J.: What does interoperability mean, anyway? Toward an operational definition of interoperability for language technology. In: Proceedings of the Second International Conference on Global Interoperability for Language Resources (ICGL 2010) (2010)
30. Ide, N., Suderman, K.: GrAF: a graph-based format for linguistic annotations. In: Proceedings of the Linguistic Annotation Workshop (LAW), pp. 1–8. Prague (2007)
31. Ide, N., Suderman, K.: The linguistic annotation framework: a standard for annotation interchange and merging. Lang. Res. Eval. 48(3), 395–418 (2014)
32. Ide, N., Romary, L., de la Clergerie, E.: International standard for a linguistic annotation framework. In: Proceedings of HLT-NAACL’03 Workshop on the Software Engineering and Architecture of Language Technology, pp. 25–30. Edmonton, Canada (2003)
33. Ide, N., Bonhomme, P., Romary, L.: XCES: an XML-based standard for linguistic corpora. In: Proceedings of the Second International Language Resources and Evaluation Conference (LREC), pp. 825–830 (2000)
34. Ide, N., Suderman, K., Simms, B.: ANC2Go: a web application for customized corpus creation. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC) (2010)
35. Ide, N., Baker, C., Fellbaum, C., Passonneau, R.: The manually annotated sub-corpus: a community resource for and by the people. In: Proceedings of the ACL 2010 Conference Short Papers, pp. 68–73. Uppsala, Sweden (2010)
36. ISO 8879:1986: Information processing – Text and Office Systems – Standard Generalized Markup Language (SGML). International Organization for Standardization (1986)
37. ISO 24612:2012: Language resource management – Linguistic Annotation Framework (LAF). International Organization for Standardization (2012)
38. Kemps-Snijders, M., Windhouwer, M., Wittenburg, P., Wright, S.: ISOcat: remodelling metadata for language resources. Int. J. Metadata Semant. Ontol. 4(4), 261–276 (2009)
39. Mann, W., Thompson, S.: Rhetorical structure theory: towards a functional theory of text organization. TEXT 8, 243–281 (1988)
40. Marcus, M.P., Santorini, B., Marcinkiewicz, M.A.: Building a large annotated corpus of English: the Penn Treebank. Comput. Linguist. 19(2), 313–330 (1993)
41. Moro, A., Navigli, R., Tucci, F.M., Passonneau, R.J.: Annotating the MASC corpus with BabelNet. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC) (2014)
42. Müller, C., Strube, M.: Multi-level annotation of linguistic data with MMAX2. In: Braun, S., Kohn, K., Mukherjee, J. (eds.) Corpus Technology and Language Pedagogy: New Resources, New Tools, New Methods, pp. 197–214. Peter Lang, Frankfurt (2006)
43. Schmidt, T.: Visualising linguistic annotation as interlinear text. Sonderforschungsbereich 538 (2003)
44. Schmidt, T., Duncan, S., Ehmer, O., Hoyt, J., Kipp, M., Loehr, D., Magnusson, M., Rose, T., Sloetjes, H.: An exchange format for multimodal annotations. In: Multimodal Corpora, pp. 207–221. Springer (2009)
45. Schmidt, T., Elenius, K., Trilsbeek, P.: Multimedia corpora (Media encoding and annotation). Draft submitted to CLARIN WG 5.7. as input to CLARIN deliverable D5.C-3 Interoperability and Standards (2010)
46. Wightman, C., Price, P., Pierrehumbert, J., Hirschberg, J.: ToBI: a standard for labeling English prosody. In: Proceedings of the 1992 International Conference on Spoken Language Processing, ICSLP, pp. 12–16 (1992)
47. Windhouwer, M., Wright, S.: Linking to linguistic data categories in ISOcat. In: Chiarcos, C., Nordhoff, S., Hellmann, S. (eds.) Linked Data in Linguistics, pp. 99–107. Springer, Heidelberg (2012)
48. Wittenburg, P., Lenkiewicz, P., Auer, E., Lenkiewicz, A., Gebre, B.G., Drude, S.: AV processing in eHumanities - a paradigm shift. In: Digital Humanities 2012 Conference, vol. 2 (2012)
49. Zeldes, A., Ritz, J., Lüdeling, A., Chiarcos, C.: ANNIS: a search tool for multi-layer annotated corpora. In: Proceedings of Corpus Linguistics 2009. Liverpool, UK (2009)
50. Zipser, F., Romary, L.: A model oriented approach to the mapping of annotation formats using standards. In: Proceedings of the Workshop on Language Resource and Language Technology Standards, pp. 7–18 (2010)

Author information

Authors and Affiliations

Department of Computer Science, Vassar College, Poughkeepsie, NY, USA
Nancy Ide
Institute for Computer Science, Johann Wolfgang Goethe Universität, Frankfurt am Main, Germany
Christian Chiarcos
UFS Cognitive Science, University of Potsdam, Potsdam, Germany
Manfred Stede
Department of Computing, Macquarie University, Sydney, NSW, Australia
Steve Cassidy

Corresponding author

Correspondence to Nancy Ide.

Editor information

Editors and Affiliations

Department of Computer Science, Vassar College, Poughkeepsie, New York, USA
Nancy Ide
Department of Computer Science, Volen Center for Complex Systems, Brandeis University, Waltham, Massachusetts, USA
James Pustejovsky

About this chapter

Cite this chapter

Ide, N., Chiarcos, C., Stede, M., Cassidy, S. (2017). Designing Annotation Schemes: From Model to Representation. In: Ide, N., Pustejovsky, J. (eds) Handbook of Linguistic Annotation. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-0881-2_3

DOI: https://doi.org/10.1007/978-94-024-0881-2_3
Published: 17 June 2017
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-024-0879-9
Online ISBN: 978-94-024-0881-2
eBook Packages: Social Sciences (R0)