Abstract
We introduce CoNLL-RDF, a direct rendering of the CoNLL format in RDF, accompanied by a formatter whose output mimics CoNLL’s original TSV-style layout. CoNLL-RDF represents a middle ground: it accounts for the needs of NLP specialists (easy to read, easy to parse, close to conventional representations), and it also facilitates LLOD integration by applying off-the-shelf Semantic Web technology to CoNLL corpora and annotations. The CoNLL-RDF infrastructure is published as open source. We also provide SPARQL update scripts for selected use cases as described in this paper.
Notes
- 1.
As summarized by Mark Johnson in his ACL-IJCNLP 2012 keynote on the future of computational linguistics, “[s]tandard data formats (...) I’m not sure these are important: if someone can use a parser, they can probably also write a Python wrapper” [12, slide 8].
- 6.
For the sake of processability, we use only a minimal fragment of NIF. We adopt neither its full semantic model nor its URI formation constraints; yet, it is possible to transform CoNLL-RDF to NIF using SPARQL update and to provide NIF-compliant URIs if information about the original spacing (which is not preserved in CoNLL) is provided externally.
- 8.
It should be noted that addressing elements in a ragged array requires great care, as it is not guaranteed that a given index exists for every sentence, e.g., in the case of SRL annotations, whose number of columns differs from sentence to sentence. In this regard, hash maps are more permissive.
- 9.
CoNLL-RDF provides a clear representation of annotation layers: CoNLL has been described as a ‘hybrid standoff format’ [11] in the sense that every column represents a self-contained annotation layer that refers to a common segmentation (tokens).
- 14.
NLTK provides numerous corpus readers specialized for different CoNLL variants, cf. http://www.nltk.org/howto/corpus.html.
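The remark in note 8 on ragged arrays versus hash maps can be illustrated with a minimal Python sketch. It is not part of the CoNLL-RDF infrastructure; the column labels (`SRL1`, `SRL2`), the two toy sentences, and all function names are assumptions made up for this illustration.

```python
# Hypothetical column labels: ID, WORD, POS, then one SRL column per predicate.
COLUMNS = ["ID", "WORD", "POS", "SRL1", "SRL2"]

# Toy data (invented): the first sentence has two predicates and therefore
# two SRL columns, the second has one predicate and only one SRL column.
sent_two_predicates = [
    ["1", "Mary", "NNP", "A0", "A0"],
    ["2", "wants", "VBZ", "V", "_"],
    ["3", "to", "TO", "A1", "_"],
    ["4", "sleep", "VB", "A1", "V"],
]
sent_one_predicate = [
    ["1", "John", "NNP", "A0"],
    ["2", "sleeps", "VBZ", "V"],
]


def column_from_array(row, index):
    """Positional access: raises IndexError when the row is too short."""
    return row[index]


def column_from_map(row, name):
    """Key-based access: a missing column yields None instead of an error."""
    # zip() stops at the shorter sequence, so absent columns get no key.
    return dict(zip(COLUMNS, row)).get(name)


if __name__ == "__main__":
    for sentence in (sent_two_predicates, sent_one_predicate):
        for row in sentence:
            # Safe for both sentences, although the second lacks SRL2.
            print(row[1], column_from_map(row, "SRL2"))
```

With positional access, `column_from_array(row, 4)` works for the first sentence but fails for the second, so every lookup needs an explicit length check; the map-based lookup tolerates the missing column.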
References
Beckett, D., Berners-Lee, T., Prud’hommeaux, E., Carothers, G.: RDF 1.1 Turtle. (2014). https://www.w3.org/TR/turtle/
Brants, S., Hansen, S.: Developments in the TIGER annotation scheme and their realization in the corpus. In: LREC (2002)
Chiarcos, C., Sukhareva, M.: OLiA - Ontologies of Linguistic Annotation. Semant. Web J. 518, 379–386 (2015)
Chiarcos, C., Fäth, C., Renner-Westermann, H., Abromeit, F., Dimitrova, V.: Lin|gu|is|tik: building the linguist’s pathway to bibliographies, libraries, language resources and linked open data. In: Calzolari, N., Choukri, K., Declerck, T., Goggi, S., Grobelnik, M., Maegaard, B., Mariani, J., Mazo, H., Moreno, A., Odijk, J., Piperidis, S. (eds.) Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), European Language Resources Association (ELRA), Paris, France (May 2016)
Chiarcos, C., McCrae, J., Cimiano, P., Fellbaum, C.: Towards open data for linguistics: linguistic linked data. In: Oltramari, A., Vossen, P., Qin, L., Hovy, E. (eds.) New Trends of Research in Ontologies and Lexical Resources, pp. 7–25. Springer, Heidelberg (2013)
Chiarcos, C., Nordhoff, S., Hellmann, S.: Linked Data in Linguistics. Springer, Heidelberg (2012)
Cimiano, P., McCrae, J., Buitelaar, P.: Lexicon model for ontologies (2016). https://www.w3.org/2016/05/ontolex/
Das, S., Sundara, S., Cyganiak, R.: R2RML: RDB to RDF mapping language (2012). https://www.w3.org/TR/r2rml
Declerck, T., Buitelaar, P., Wunner, T., McCrae, J., Montiel-Ponsoda, E., de Cea, A.: Lemon: an ontology-lexicon model for the multilingual semantic web. In: W3C Workshop: The Multilingual Web - Where Are We? Madrid, Spain, October 2010
Hellmann, S., Lehmann, J., Auer, S., Brümmer, M.: Integrating NLP using linked data. In: Proceedings of 12th International Semantic Web Conference, 21–25 October 2013, Sydney, Australia (2013). http://persistence.uni-leipzig.org/nlp2rdf/
Ide, N., Chiarcos, C., Stede, M., Cassidy, S.: Designing annotation schemes: from model to representation. In: Ide, N., Pustejovsky, J. (eds.) Handbook of Linguistic Annotation: Text, Speech, and Language Technology. Springer, Dordrecht (2017, in press)
Johnson, M.: Computational linguistics. Where do we go from here? Invited talk at the 50th Annual Meeting of the Association for Computational Linguistics (ACL-IJCNLP 2012), Jeju, Korea (2012). http://web.science.mq.edu.au/~mjohnson/papers/Johnson12next50.pdf. Accessed 13 July 2016
Lezius, W., Biesinger, H., Gerstenberger, C.: TigerXML quick reference guide (2002)
Nivre, J., Agić, Ž., Ahrenberg, L., et al.: Universal dependencies 1.4 (2016). http://hdl.handle.net/11234/1-1827
Sanderson, R., Ciccarese, P., Van de Sompel, H.: Open annotation data model (2013). http://www.openannotation.org/spec/core
Sanderson, R., Ciccarese, P., Young, B.: Web annotation data model (2017). https://www.w3.org/TR/annotation-model
Sérasset, G.: DBnary: Wiktionary as a lemon-based multilingual lexical resource in RDF. Semant. Web J. 648 (2014). http://kaiko.getalp.org/about-dbnary/
Acknowledgments
The research of Christian Chiarcos was supported by the BMBF-funded Research Group ‘Linked Open Dictionaries (LiODi)’ (2015–2020). The research of Christian Fäth was conducted in the context of the DFG-funded projects ‘Virtuelle Fachbibliothek’ (2015–2016) and ‘Fachinformationsdienst Linguistik’ (2017–2019).
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Chiarcos, C., Fäth, C. (2017). CoNLL-RDF: Linked Corpora Done in an NLP-Friendly Way. In: Gracia, J., Bond, F., McCrae, J., Buitelaar, P., Chiarcos, C., Hellmann, S. (eds) Language, Data, and Knowledge. LDK 2017. Lecture Notes in Computer Science, vol 10318. Springer, Cham. https://doi.org/10.1007/978-3-319-59888-8_6
DOI: https://doi.org/10.1007/978-3-319-59888-8_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59887-1
Online ISBN: 978-3-319-59888-8
eBook Packages: Computer Science, Computer Science (R0)