Analyzing Spoken and Written Discourse: A Role for Natural Language Processing Tools

  • Scott A. Crossley
  • Kristopher Kyle


This chapter gives a general overview of research methods used to analyze spoken and written discourse. It then focuses on how natural language processing (NLP) tools that measure lexical, syntactic, rhetorical, and cohesion features of text can be applied to such discourse. It reviews how NLP tools have been used in previous discourse studies, introduces freely available tools, describes the output these tools produce, and outlines statistical methods for analyzing and interpreting that output.


Keywords: Spoken discourse; Written discourse; Natural language processing; Second language acquisition



Copyright information

© The Author(s) 2018

Authors and Affiliations

  1. Department of Applied Linguistics & ESL, Georgia State University, Atlanta, USA
  2. Department of Second Language Studies, University of Hawai’i at Manoa, Honolulu, USA
