Abstract
The physical formats used to represent linguistic data and its annotations have evolved over the past four decades, accommodating different needs and perspectives and incorporating general advances in data representation. This chapter surveys representation formats: it reviews the issues that arise in representing different data types, together with current state-of-the-art solutions, so that readers have enough information to choose a suitable representation format or formats.
Notes
- 4. Several initiatives have focused on reusability of language data from the late 1980s onward; see chapter “Community Standards for Linguistically-Annotated Resources” in this volume for a fuller history of standards efforts in the field.
- 5. Note that the Hypertext Markup Language (HTML) is an application of SGML/XML, in that it uses the SGML/XML meta-format to define specific tag names and document structure for use in creating web pages.
- 11. Originally called “remote markup”; see http://www.cs.vassar.edu/CES/CES1-5.html.
- 12. An ad hoc mechanism to connect annotations on different graphs was later introduced into the AG model to accommodate hierarchical relations.
- 17. An extreme example of hybrid standoff is the format used in PropBank (http://verbs.colorado.edu/~mpalmer/projects/ace.html), which uses the syntactic structure to attach semantic role annotations to nodes in the syntax trees defined in the original Penn Treebank.
- 18. In GrAF, each annotation is associated with a node in the annotation graph.
- 19. Note that the decision to represent annotation layers in this fashion does not automatically lead to the distribution of layers across separate data files. While the MMAX2 model and others (see Sect. 5) do use one file per layer, other approaches, such as the model underlying the Serengeti tool [21], combine all information into a single file, which begins with the token layer and then lists the various standoff annotation layers.
- 29. See Sect. 4.2.2.
- 30. See Sect. 4.2.1.
- 33. The nature of the referring pointer used may depend on the medium. For text, references to beginning and ending offsets (“virtual nodes” between characters) of a text span are standard.
- 34. See [31] for more detailed information on the GrAF resource header.
- 41. For a more detailed description of NIF, see chapter “Community Standards for Linguistically-Annotated Resources”, Sect. 9, in this volume.
- 44. The development of an OWL/DL version of FrameNet has been announced on the FrameNet site.
- 45. See also chapter “Community Standards for Linguistically-Annotated Resources”, Sect. 6, in this volume.
- 46. See also chapter “Community Standards for Linguistically-Annotated Resources”, Sect. 9, in this volume.
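The offset-based pointers described in the note on referring pointers can be illustrated with a minimal sketch. It assumes plain-text primary data and treats each offset as a "virtual node" between characters; the names `Span` and `covered_text`, the sample sentence, and the tag labels are hypothetical and are not taken from any format discussed in the chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A standoff annotation: it points into the text instead of wrapping it."""
    start: int  # virtual node before the first covered character
    end: int    # virtual node after the last covered character
    label: str  # hypothetical annotation value, e.g. a part-of-speech tag


def covered_text(primary: str, span: Span) -> str:
    """Resolve a span against the unmodified primary data."""
    if not 0 <= span.start <= span.end <= len(primary):
        raise ValueError(f"span {span} lies outside the primary data")
    return primary[span.start:span.end]


# The primary data is never edited; annotations live alongside it.
primary = "Dogs bark."
annotations = [Span(0, 4, "NNS"), Span(5, 9, "VBP"), Span(9, 10, ".")]

for span in annotations:
    print(span.label, repr(covered_text(primary, span)))
```

Because the annotations only reference offsets, several layers with differing tokenizations could in principle point into the same primary text without modifying it.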
References
Baker, C., Fellbaum, C.: WordNet and FrameNet as complementary resources for annotation. In: Proceedings of the Third Linguistic Annotation Workshop (LAW-2009), pp. 125–129. Suntec, Singapore (2009)
Banski, P., Przepiórkowski, A.: Stand-off TEI annotation: the case of the National Corpus of Polish. In: Proceedings of the Third Linguistic Annotation Workshop (LAW III), pp. 64–67. Suntec, Singapore (2009)
Bird, S., Liberman, M.: A formal framework for linguistic annotation. Speech Commun. 33(1–2), 23–60 (2001)
Bow, C., Hughes, B., Bird, S.: Towards a general model of interlinear text. In: Proceedings of EMELD workshop, pp. 11–13 (2003)
Bradshaw, J., Burridge, K., Clyne, M.: The Monash Corpus of Spoken Australian English. In: Proceedings of the 2008 Conference of the Australian Linguistics Society, pp. 2123/7099 (2009)
Brants, T., Skut, W., Krenn, B.: Tagging grammatical functions. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-97). Providence, RI (1997)
Brants, S., Dipper, S., Eisenberg, P., Hansen-Schirra, S., König, E., Lezius, W., Rohrer, C., Smith, G., Uszkoreit, H.: TIGER: linguistic interpretation of a German corpus. Res. Lang. Comput. 2(4), 597–620 (2004)
Bray, T., Paoli, J., Sperberg-McQueen, C.M. (eds.): Extensible Markup Language (XML) Version 1.0. W3C Recommendation. http://www.w3.org/TR/1998/REC-xml-19980210 (1998)
Carletta, J., Evert, S., Heid, U., Kilgour, J.: The NITE XML Toolkit: data model and query. Lang. Res. Eval. J. (LREJ) 39(4), 313–334 (2005)
Charniak, E.: A Maximum-entropy-inspired Parser. In: Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pp. 132–139 (2000)
Chen, P.P.S.: The entity-relationship model: toward a unified view of data. ACM Trans. Database Syst. 1(1), 9–36 (1976)
Chiarcos, C.: A generic formalism to represent linguistic corpora in RDF and OWL/DL. In: 8th International Conference on Language Resources and Evaluation (LREC-2012) (accepted)
Chiarcos, C.: Grounding an ontology of linguistic annotations in the data category registry. In: Workshop on Language Resource and Language Technology Standards (LR & LTS), held in Conjunction with LREC 2010. Valletta, Malta (2010)
Chiarcos, C.: An ontology of linguistic annotations. LDV Forum 23(1), 1–16 (2008)
Chiarcos, C., Dipper, S., Götze, M., Leser, U., Lüdeling, A., Ritz, J., Stede, M.: A flexible framework for integrating annotations from different tools and tagsets. TAL (Traitement automatique des langues) 49(2), 217–246 (2008)
Chiarcos, C., Ritz, J., Stede, M.: By all these lovely tokens. Merging conflicting tokenizations. Lang. Res. Eval. 46(1), 53–74 (2012)
Church, K.W.: A stochastic parts program and noun phrase parser for unrestricted text. In: ANLC ’88: Proceedings of the Second Conference on Applied Natural Language Processing (1988)
Collins, M.: Head-driven statistical models for natural language parsing. Comput. Linguist. 29(4), 589–637 (2003)
Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: a framework and graphical development environment for robust NLP tools and applications. In: ACL. http://www.aclweb.org/anthology/P02-1022 (2002). doi:10.3115/1073083.1073112
DeRose, S.J.: Grammatical category disambiguation by statistical optimization. Comput. Linguist. 14(1), 31–39 (1988)
Diewald, N., Stührenberg, M., Garbar, A., Goecke, D.: Serengeti - Webbasierte Annotation semantischer Relationen. J. Lang. Technol. Comput. Linguist. 23(2), 74–93 (2008)
Eckart, K., Riester, A., Schweitzer, K.: A discourse information radio news database for linguistic analysis. In: Nordhoff, S., Hellmann, S., Chiarcos, C. (eds.) Linked Data in Linguistics. Springer (2012)
Farrar, S., Langendoen, D.T.: An OWL-DL implementation of GOLD: an ontology for the semantic web. In: Witt, A., Metzing, D. (eds.) Linguistic Modeling of Information and Markup Languages. Springer, Dordrecht (2010)
Ferrucci, D., Lally, A.: UIMA: An architectural approach to unstructured information processing in the corporate research environment. Nat. Lang. Eng. 10(3/4), 327–348 (2004)
Goodwin, C., Heritage, J.: Conversation analysis. Ann. Rev. Anthropol. 19, 283–307 (1990)
Grishman, R. (ed.): Tipster Text Architecture Design. http://www-nlpir.nist.gov/related_projects/tipster/ (1998)
Grishman, R., Sundheim, B.: Message understanding conference - 6: A brief history. In: Proceedings of the International Conference on Computational Linguistics, pp. 466–471 (1996)
Ide, N.: Corpus encoding standard: SGML guidelines for encoding linguistic corpora. In: Proceedings of the First International Language Resources and Evaluation Conference (LREC), pp. 463–470 (1998)
Ide, N., Pustejovsky, J.: What does interoperability mean, anyway? Toward an operational definition of interoperability for language technology. In: Proceedings of the Second International Conference on Global Interoperability for Language Resources (ICGL 2010) (2010)
Ide, N., Suderman, K.: GrAF: a graph-based format for linguistic annotations. In: Proceedings of the Linguistic Annotation Workshop (LAW), pp. 1–8. Prague (2007)
Ide, N., Suderman, K.: The linguistic annotation framework: a standard for annotation interchange and merging. Lang. Res. Eval. 48(3), 395–418 (2014)
Ide, N., Romary, L., de la Clergerie, E.: International standard for a linguistic annotation framework. In: Proceedings of HLT-NAACL’03 Workshop on the Software Engineering and Architecture of Language Technology, pp. 25–30. Edmonton, Canada (2003)
Ide, N., Bonhomme, P., Romary, L.: XCES: An XML-based Standard for Linguistic Corpora. In: Proceedings of the Second International Language Resources and Evaluation Conference (LREC), pp. 825–830 (2000)
Ide, N., Suderman, K., Simms, B.: ANC2Go: A Web application for customized corpus creation. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC) (2010)
Ide, N., Baker, C., Fellbaum, C., Passonneau, R.: The Manually Annotated Sub-Corpus: A Community Resource For and By the People. In: Proceedings of the ACL 2010 Conference Short Papers, pp. 68–73. Uppsala, Sweden (2010)
ISO 8879:1986: Information processing – Text and Office Systems – Standard Generalized Markup Language (SGML). International Organization for Standardization (1986)
ISO 24612:2012: Language resource management – Linguistic Annotation Framework (LAF). International Organization for Standardization (2012)
Kemps-Snijders, M., Windhouwer, M., Wittenburg, P., Wright, S.: ISOcat: remodelling metadata for language resources. Int. J. Metadata Semant. Ontol. 4(4), 261–276 (2009)
Mann, W., Thompson, S.: Rhetorical structure theory: towards a functional theory of text organization. TEXT 8, 243–281 (1988)
Marcus, M.P., Santorini, B., Marcinkiewicz, M.A.: Building a Large Annotated Corpus of English: The Penn Treebank. Comput. Linguist. 19(2), 313–330 (1993)
Moro, A., Navigli, R., Tucci, F.M., Passonneau, R.J.: Annotating the MASC corpus with BabelNet. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC) (2014)
Müller, C., Strube, M.: Multi-level annotation of linguistic data with MMAX2. In: Braun, S., Kohn, K., Mukherjee, J. (eds.): Corpus Technology and Language Pedagogy. New Resources, New Tools, New Methods, pp. 197–214. Frankfurt: Peter Lang (2006)
Schmidt, T.: Visualising linguistic annotation as interlinear text. Sonderforschungsbereich 538 (2003)
Schmidt, T., Duncan, S., Ehmer, O., Hoyt, J., Kipp, M., Loehr, D., Magnusson, M., Rose, T., Sloetjes, H.: An exchange format for multimodal annotations. In: Multimodal Corpora, pp. 207–221. Springer (2009)
Schmidt, T., Elenius, K., Trilsbeek, P.: Multimedia corpora (Media encoding and annotation). Draft submitted to CLARIN WG 5.7. as input to CLARIN deliverable D5.C-3 Interoperability and Standards (2010)
Wightman, C., Price, P., Pierrehumbert, J., Hirschberg, J.: ToBI: a standard for labeling English prosody. In: Proceedings of the 1992 International Conference on Spoken Language Processing, ICSLP, pp. 12–16 (1992)
Windhouwer, M., Wright, S.: Linking to linguistic data categories in ISOcat. In: Chiarcos, C., Nordhoff, S., Hellmann, S. (eds.) Linked Data in Linguistics, pp. 99–107. Springer, Heidelberg (2012)
Wittenburg, P., Lenkiewicz, P., Auer, E., Lenkiewicz, A., Gebre, B.G., Drude, S.: AV processing in eHumanities: a paradigm shift. In: Digital Humanities 2012 Conference, vol. 2 (2012)
Zeldes, A., Ritz, J., Lüdeling, A., Chiarcos, C.: ANNIS: a search tool for multi-layer annotated corpora. In: Proceedings of Corpus Linguistics 2009. Liverpool, UK (2009)
Zipser, F., Romary, L.: A model oriented approach to the mapping of annotation formats using standards. In: Proceedings of the Workshop on Language Resource and Language Technology Standards, pp. 7–18 (2010)
© 2017 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Ide, N., Chiarcos, C., Stede, M., Cassidy, S. (2017). Designing Annotation Schemes: From Model to Representation. In: Ide, N., Pustejovsky, J. (eds) Handbook of Linguistic Annotation. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-0881-2_3
DOI: https://doi.org/10.1007/978-94-024-0881-2_3
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-024-0879-9
Online ISBN: 978-94-024-0881-2
eBook Packages: Social Sciences (R0)