
PDTSC 2.0 - Spoken Corpus with Rich Multi-layer Structural Annotation

  • Marie Mikulová
  • Jiří Mírovský
  • Anja Nedoluzhko
  • Petr Pajas
  • Jan Štěpánek
  • Jan Hajič
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10415)

Abstract

We present a richly annotated spoken language resource, the Prague Dependency Treebank of Spoken Czech 2.0, whose primary purpose is to serve speech-related NLP tasks. The treebank features several novel annotation schemas close to the audio and transcript, while its morphological, syntactic and semantic annotation corresponds to that of the family of Prague Dependency Treebanks; it can thus also be used for linguistic studies, including comparative studies of text and speech. Its most distinctive feature is the approach to syntactic annotation. Unlike similar corpora such as Treebank-3 [8], it does not attempt to impose syntactic structure directly on the input; instead, it includes an additional layer that edits the literal transcript into fluent Czech while keeping the original transcript explicitly aligned with the edited version. This allows the morphological, syntactic and semantic annotation to be deterministically and fully mapped back to the transcript and audio. It opens new possibilities for modeling morphology, syntax and semantics in spoken language, either on the original transcript with mapped annotation or on the new layer after (automatic) editing. The corpus is publicly and freely available.
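The idea of mapping annotation back through an explicit alignment can be illustrated with a minimal sketch. This is not the PDTSC 2.0 data format or tooling: the `EditedToken` class, the `source_ids` field, the `project_to_transcript` function and the English toy sentence are all hypothetical, chosen only to show how annotation attached to an edited layer could be projected onto the literal transcript when each edited token records the transcript tokens it was derived from.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EditedToken:
    """A token of the edited (fluent) layer; hypothetical structure."""
    form: str              # surface form after editing
    lemma: str             # annotation attached to the edited layer
    source_ids: List[int]  # indices of the transcript tokens it is aligned to


def project_to_transcript(
    transcript: List[str], edited: List[EditedToken]
) -> Dict[int, List[str]]:
    """Project lemmas from the edited layer back onto transcript positions."""
    projected: Dict[int, List[str]] = {i: [] for i in range(len(transcript))}
    for token in edited:
        for i in token.source_ids:
            projected[i].append(token.lemma)
    return projected


# Toy example (invented): a filler, a repetition and a non-standard form.
transcript = ["well", "I", "I", "goed", "home"]
edited = [
    EditedToken("I", "I", [1, 2]),       # the repeated word is merged into one token
    EditedToken("went", "go", [3]),      # the non-standard form is corrected
    EditedToken("home", "home", [4]),
]

projection = project_to_transcript(transcript, edited)
for i, word in enumerate(transcript):
    print(i, word, projection[i])
```

In this sketch a transcript token that the editing step dropped (here the filler "well") simply receives no projected annotation; how the actual corpus represents such tokens is not specified in the abstract.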

Keywords

Speech, Spoken corpus, Syntax, Semantics, Coreference, Treebank, Annotation

Notes

Acknowledgments

The research reported in the paper was supported by the Czech Science Foundation under the projects GA16-05394S and GA17-12624S. This work has also been supported by the LINDAT/CLARIN project of Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071).

References

  1. Fitzgerald, E., Jelinek, F.: Linguistic resources for reconstructing spontaneous speech text. In: Proceedings of the 6th LREC, Marrakech, Morocco (2008)
  2. Gerdes, K., Kahane, S., Lacheret, A., Truong, A., Pietrandrea, P.: Intonosyntactic data structures: the Rhapsodie Treebank of spoken French. In: Proceedings of the 6th Linguistic Annotation Workshop, Jeju, Korea, pp. 85–94. ACL (2012)
  3. Hajič, J., Cinková, S., Mikulová, M., Pajas, P., Ptáček, J., Toman, J., Urešová, Z.: PDTSL: an annotated resource for speech reconstruction. In: Proceedings of the 2008 IEEE Workshop on Spoken Language Technology, Goa, India, pp. 93–96 (2008)
  4. Hajič, J., Hajičová, E., Mikulová, M., Mírovský, J.: Prague dependency treebank. In: Handbook on Linguistic Annotation, Volume II, pp. 555–594. Springer, Dordrecht (2017)
  5. Hajič, J., Panevová, J., Hajičová, E., Sgall, P., Pajas, P., Štěpánek, J., Havelka, J., Mikulová, M., Žabokrtský, Z., Ševčíková-Razímová, M., Urešová, Z.: Prague Dependency Treebank 2.0 (LDC2006T01) (2006)
  6. Hajič, J., Panevová, J., Urešová, Z., Bémová, A., Kolářová, V., Pajas, P.: PDT-VALLEX: creating a large-coverage valency lexicon for treebank annotation. In: Proceedings of the 2nd Treebanks and Linguistic Theories Workshop, pp. 57–68. Vaxjo University Press, Vaxjo (2003)
  7. Hinrichs, E.W., Bartels, J., Kawata, Y., Kordoni, V., Telljohann, H.: The verbmobil treebanks. In: KONVENS, pp. 107–112 (2000)
  8. Marcus, M., Santorini, B., Marcinkiewicz, M.A., Taylor, A.: Penn Treebank-3. Linguistic Data Consortium, LDC99T42, University of Pennsylvania (1999)
  9. Mikulová, M.: Rekonstrukce standardizovaného textu z mluvené řeči v Pražském závislostním korpusu mluvené češtiny. Manuál pro anotátory. Technical report ÚFAL TR-2008-38 (2008)
  10. Mikulová, M.: Annotation on the tectogrammatical level. Additions to annotation manual (with respect to PDTSC and PCEDT). Technical report ÚFAL TR-2013-52 (2014)
  11. Mikulová, M., Bémová, A., Hajič, J., Hajičová, E., Havelka, J., Kolářová, V., Kučová, L., Lopatková, M., Pajas, P., Panevová, J., Razímová, M., Sgall, P., Štěpánek, J., Urešová, Z., Veselá, K., Žabokrtský, Z.: Annotation on the tectogrammatical level in the Prague Dependency Treebank. Annotation manual. Technical report 30, Prague, Czech Republic (2006)
  12. Mikulová, M., Štěpánek, J.: Annotation quality checking and its implications for design of Treebank (in Building the Prague Czech-English Dependency Treebank). In: Proceedings of 8th Treebanks and Linguistic Theories Workshop, Milano, Italy, pp. 137–148 (2009)
  13. Mikulová, M., Štěpánek, J.: Ways of evaluation of the annotators in building the Prague Czech-English Dependency Treebank. In: Proceedings of the 7th LREC, Valletta, Malta, pp. 1836–1839 (2010)
  14. Mikulová, M., Štěpánek, J., Urešová, Z.: Liší se mluvené a psané texty ve valenci? Korpus “gramatika” axiologie 8, 36–46 (2013)
  15. Nedoluzhko, A., Mírovský, J.: Annotators’ certainty and disagreements in coreference and bridging annotation in Prague Dependency Treebank. In: Proceedings of the 2nd International Conference on Dependency Linguistics, Prague, Czech Republic, pp. 236–243 (2013)
  16. Pajas, P., Štěpánek, J.: Recent advances in a feature-rich framework for treebank annotation. In: Proceedings of the 22nd International Conference on Computational Linguistics, Manchester, UK, vol. 2, pp. 673–680 (2008)
  17. Panevová, J.: On verbal frames in functional generative description. Prague Bull. Math. Linguist. 22, 3–40 (1974)
  18. Sagae, K., MacWhinney, B., Lavie, A.: Adding syntactic annotations to transcripts of parent-child dialogs. In: Proceedings of the 4th LREC, Lisbon, Portugal (2004)
  19. Schuurman, I., Goedertier, W., Hoekstra, H., Oostdijk, N., Piepenbrock, R., Schouppe, M.: Linguistic annotation of the spoken Dutch corpus: if we had to do it all over again. In: Proceedings of the 4th LREC, Lisbon, Portugal (2004)
  20. Sgall, P., Hajičová, E., Panevová, J.: The Meaning of the Sentence and Its Semantic and Pragmatic Aspects. Academia/Reidel Publishing Company, Prague/Dordrecht (1986)
  21. Urešová, Z.: Building the PDT-VALLEX valency lexicon. In: Proceedings of the 5th Corpus Linguistics Conference, pp. 1–18. University of Liverpool, Liverpool (2012)

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Marie Mikulová (1), corresponding author
  • Jiří Mírovský (1)
  • Anja Nedoluzhko (1)
  • Petr Pajas (1)
  • Jan Štěpánek (1)
  • Jan Hajič (1)

  1. Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Charles University, Prague, Czech Republic
