OntoNotes: Large Scale Multi-Layer, Multi-Lingual, Distributed Annotation

Chapter in: Handbook of Linguistic Annotation

Abstract

The OntoNotes project has annotated a large corpus comprising various genres in three languages with syntax, predicate-argument structure, word senses, named entities, and within-document coreference. An important goal of the project was to ensure that each layer of annotation had reasonably high inter-annotator agreement (approximately 90%). The multiple layers of annotation were developed asynchronously across multiple annotation sites. In this case study, we focus on the mechanics of the annotation process rather than on the annotations themselves. We first describe the data representation challenges and present the representation that was developed. We then discuss the requirements for managing the data logistics and, finally, describe some particular challenges pertaining to specific annotation layers.


Notes

  1. http://ontonotes.org.

  2. Arabic was a pilot effort limited to the newswire genre.

  3. Given changing project priorities and various other constraints, not all text in the OntoNotes corpus was annotated with all the layers of annotation.

  4. Some entities/events constitute sub-parts of the relatively flat NP structure in the Treebank, and must be defined over word spans rather than corresponding to tree nodes.

  5. Token exceptions that were not split at hyphens are listed in the following guidelines document: http://ontonotes.org/documents/guidelines/treebank/.

  6. Including fresh Treebanking based on the latest Treebank guidelines, and Treebanking revisions.

  7. http://callisto.mitre.org/.

  8. Atlas Interchange Format.

  9. ACE pilot format.

  10. Or a group of changes, as they might make several in one edit cycle.


Acknowledgements

We gratefully acknowledge the support of the Defense Advanced Research Projects Agency (DARPA/IPTO) under the GALE program, DARPA/CMO Contract No. HR0011-06-C-0022.

Author information

Correspondence to Sameer Pradhan.

Defining, Tracking and Reporting Annotation Errors

This section details the error code scheme used to track errors in the proposition and sense layers, making clear the many distinctions that proved important to track and giving an example of how such complexities can be handled. Table 6 shows the structure of the component sub-fields, and Table 7 lists the specific codes with their meanings and severity levels.

Table 6 Nomenclature of the error codes
Table 7 Descriptions of error codes used to identify problems in proposition annotation—including ones for the TB/PB merge cases
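For concreteness, the sketch below shows one way such an inventory of codes could be held in code: each five-digit code is paired with a description and a severity expressed as the rough average number of minutes needed to correct it (the measure discussed in item 6 below). Only codes 00000 and 14051 are taken from this chapter; the remaining entry, the field names, and all severity values are illustrative assumptions rather than the actual contents of Tables 6 and 7.

    # A hedged sketch of how the error-code inventory could be represented.
    # Only codes 00000 and 14051 come from the text; the other entry and
    # all severity values are illustrative placeholders, not Table 7 data.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ErrorCode:
        code: str               # five-digit code, e.g. "14051"
        description: str
        severity_minutes: int   # rough average minutes needed to correct

    ERROR_CODES = {
        "00000": ErrorCode("00000", "No error; ready to release", 0),
        "14051": ErrorCode("14051", "Successfully copied to newer trees", 0),
        "19999": ErrorCode("19999", "Placeholder: hypothetical proposition error", 10),
    }

    def lookup(code: str) -> ErrorCode:
        """Return the record for a code; unknown codes default to severity 1."""
        return ERROR_CODES.get(code, ErrorCode(code, "Unknown code", 1))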

For the sense and proposition data, we periodically tracked the status of coverage errors by generating an HTML report. Figure 4 shows a screenshot of the report for the proposition layer.
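The report itself was a static page rebuilt from the database counts. As a minimal sketch of that idea (not the actual OntoNotes build code), the snippet below renders one table row per sub-corpus from the summary columns described next; the counts in the usage example are invented purely for illustration.

    # Minimal sketch: render per-corpus summary counts as an HTML table.
    # This is not the actual OntoNotes report generator; the column names
    # follow the report description below, and the counts are invented.
    from html import escape

    COLUMNS = ["Corpus", "Total Taggable", "Ready to Release",
               "Completed Annotation", "Awaiting Correction"]

    def rows_to_html(rows):
        """rows: list of dicts keyed by the column names above."""
        header = "".join(f"<th>{escape(c)}</th>" for c in COLUMNS)
        body = "\n".join(
            "<tr>" + "".join(f"<td>{escape(str(row[c]))}</td>" for c in COLUMNS) + "</tr>"
            for row in rows
        )
        return f"<table>\n<tr>{header}</tr>\n{body}\n</table>"

    # Invented counts, purely for illustration.
    print(rows_to_html([{"Corpus": "example-subcorpus",
                         "Total Taggable": 1000,
                         "Ready to Release": 850,
                         "Completed Annotation": 900,
                         "Awaiting Correction": 50}]))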

The following describes the information presented in the table:

  1. Corpus—This is the sub-corpus that the row represents. It shows intermediate names used for various subcorpora.

  2. Total Taggable—The total number of taggable instances for that particular corpus. For the web data, this was exactly the number of pointers in the repository, since annotation of the web data started after this procedure was put in place. For older data such as the WSJ, blank pointers representing the holes in annotation may not yet have been added, so this number tends to be somewhat greater than (or equal to) the number of tagged instances in the respective .prop files in the repository.

  3. Ready to Release—The annotated instances that have passed the consistency checks in the automatic database build process and could be considered ready for release if an OntoNotes release were prepared at the time the report was built; only "gold" or "adjudicated" instances are counted in this category. It also includes instances that, according to the knowledge built into the API, we believe were correct to begin with or that had been successfully mapped automatically onto newer versions of the trees. This category was used to track coverage.

  4. Completed Annotation—The total number of instances that have been adjudicated.

  5. Awaiting Correction—The total number of instances flagged by the build process as having at least one error or warning.

  6. Awaiting Correction – Details by Error Code—From a reporting point of view, it is not very important to know how many instances have been flagged with which types of errors, but from an operational perspective this information can be very useful. This was therefore a "hidden" column that could be shown or re-hidden using the buttons at the bottom of the table. When shown, it lists every error code encountered in loading the data and, under each error code, the instances identified for each corpus.

    The list of error codes is a subset of the potential error codes for the data; that is, if a particular error type was not encountered anywhere in the data for a particular language, then there is no column for that error code here. This implicitly means that no instances were identified for that error code, not that the error code does not exist.

    Because one instance can have multiple errors, the numbers in the error code columns do not sum to the total number of instances awaiting correction. To mitigate that, we also generated a "normalized" version of the instance counts per error code (a sketch appears after this list). The normalization uses a "severity" measure that we tentatively assigned to each error, roughly proportional to the average number of minutes required to correct an error of that type. If an instance was tagged with multiple errors, the error requiring the most minutes was assigned to that instance, and only that error code column was incremented for it; no other error code column received an increment for that instance. This ensures that the totals under the error code columns sum to the total number of instances awaiting correction. One downside is that someone has to decide what the exact "severity" values are.

    Another case that had to be considered is error code 14051, which represents a successful copy to newer trees. It is not really an error code; it falls into the category of "we expect to do a random check on these instances, and once that looks satisfactory, change them all to error code 00000". Therefore, to give a realistic estimate of the release coverage, we count this one category, in addition to the instances with a 00000 error code, as covered.

    To make it easy to match an error code with its description, the error codes in the column headings are hyperlinked; clicking one highlights the corresponding error code and its description in the list below.

Fig. 4 Screen capture of the proposition report table seen after a successful release build. Only some of the error columns are shown owing to space limitations.

  7. Partially Annotated—Since not all the data in the pipeline will have been adjudicated at the end of each month, there are also columns representing the data at intermediate stages in the pipeline—single-annotated, double-annotated, pre-gold, etc. This column represents the total data in these categories. All the instances in these categories are processed through the build, but only the ones that have been adjudicated completely are merged.
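The severity-based normalization described in item 6, together with the coverage count from item 3, amounts to a simple bookkeeping pass over the flagged instances. The sketch below illustrates the idea under stated assumptions: severity_minutes stands in for the Table 7 severity values, and the instance-to-error-codes mapping is a hypothetical stand-in for the state maintained by the database build process.

    # Sketch of the severity-based normalization (item 6) and the release
    # coverage count (item 3).  Severity values and the instance mapping
    # are assumptions; the real logic lives in the database build process.

    NO_ERROR = "00000"             # clean instance
    COPIED_TO_NEW_TREES = "14051"  # successful copy to newer trees; counted
                                   # as covered pending a random spot check

    def normalize(instances, severity_minutes):
        """instances: mapping of instance id -> list of error codes on it.

        Each flagged instance is charged only to its most severe code, so
        the per-code counts sum exactly to the number awaiting correction.
        """
        per_code, awaiting, covered = {}, 0, 0
        for codes in instances.values():
            if all(c in (NO_ERROR, COPIED_TO_NEW_TREES) for c in codes):
                covered += 1   # contributes to the release-coverage estimate
                continue
            awaiting += 1
            worst = max(codes, key=lambda c: severity_minutes.get(c, 1))
            per_code[worst] = per_code.get(worst, 0) + 1
        assert sum(per_code.values()) == awaiting
        return per_code, awaiting, covered

The final assertion captures the property the normalization was designed to guarantee: the per-error-code columns add up to the Awaiting Correction total.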


Copyright information

© 2017 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Pradhan, S., Ramshaw, L. (2017). OntoNotes: Large Scale Multi-Layer, Multi-Lingual, Distributed Annotation. In: Ide, N., Pustejovsky, J. (eds) Handbook of Linguistic Annotation. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-0881-2_20

  • DOI: https://doi.org/10.1007/978-94-024-0881-2_20

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-94-024-0879-9

  • Online ISBN: 978-94-024-0881-2
