Linguistic Categories

  • Philipp Cimiano
  • Christian Chiarcos
  • John P. McCrae
  • Jorge Gracia
Chapter

Abstract

The (re-)usability of NLP tools and language resources has long been recognized as a key challenge in the language resource and NLP communities. Reuse of resources, however, requires a minimum level of interoperability. In this chapter, we focus on conceptual interoperability, i.e., the harmonization of different annotation schemas by means of terminology repositories. Beyond that, we give special attention to language identifiers, as these can be provided in different ways in an RDF context: either by reference to a concept in a terminology repository or by means of language tags.
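
The two ways of providing a language identifier mentioned above can be illustrated with a minimal sketch that writes both as N-Triples statements. This is only an illustration under stated assumptions: the `example.org` subject URI and the helper function names are hypothetical, and the use of `dct:language` with a Lexvo URI as the "concept in a terminology repository" is one possible choice that the abstract itself does not name.

```python
# Sketch: two ways to state the language of a resource in RDF (N-Triples syntax).
# The example.org subject URI and both helper functions are hypothetical.

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
DCT_LANGUAGE = "http://purl.org/dc/terms/language"


def tagged_literal_triple(subject: str, predicate: str, text: str, lang_tag: str) -> str:
    """Serialize a triple whose object is a language-tagged string literal.

    Only backslashes and double quotes are escaped here, which is enough
    for this short example but not for arbitrary text.
    """
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject}> <{predicate}> "{escaped}"@{lang_tag} .'


def language_uri_triple(subject: str, language_uri: str) -> str:
    """Serialize a triple that points to a language concept identified by a URI."""
    return f"<{subject}> <{DCT_LANGUAGE}> <{language_uri}> ."


entry = "http://example.org/lexicon/cat"  # hypothetical resource

# Option 1: the language is carried by a language tag on the literal.
by_tag = tagged_literal_triple(entry, RDFS_LABEL, "cat", "en")

# Option 2: the language is a resource in an external repository (assumed: Lexvo).
by_uri = language_uri_triple(entry, "http://lexvo.org/id/iso639-3/eng")

print(by_tag)
print(by_uri)
```

In the first statement the language is a property of the string itself, while in the second it is a separate resource that can carry further descriptions and links of its own.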

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Semantic Computing Group, Bielefeld University, Bielefeld, Germany
  2. Angewandte Computerlinguistik, Goethe-University, Frankfurt am Main, Germany
  3. Insight Centre for Data Analytics, National University of Ireland, Galway, Ireland
  4. Aragon Institute of Engineering Research (I3A), University of Zaragoza, Zaragoza, Spain
