Building a Discourse-Tagged Corpus in the Framework of Rhetorical Structure Theory

  • Lynn Carlson
  • Daniel Marcu
  • Mary Ellen Okurowski
Part of the Text, Speech and Language Technology book series (TLTB, volume 22)


We describe our experience in developing a discourse-annotated corpus for community-wide use. Working in the framework of Rhetorical Structure Theory, we were able to create a large annotated resource with very high consistency, using a well-defined methodology and protocol. This resource is made publicly available through the Linguistic Data Consortium to enable researchers to develop empirically grounded, discourse-specific applications.

Key words

discourse, corpus annotation, rhetorical structure





Copyright information

© Springer Science+Business Media Dordrecht 2003

Authors and Affiliations

  • Lynn Carlson (1)
  • Daniel Marcu (2)
  • Mary Ellen Okurowski (1)

  1. U.S. Department of Defense, University of Southern California, USA
  2. Information Sciences Institute, University of Southern California, USA
