Applying Natural Language Processing and Hierarchical Machine Learning Approaches to Text Difficulty Classification


For decades, educators have relied on readability metrics that tend to oversimplify the dimensions of text difficulty. This study examines the potential of applying advanced artificial intelligence methods to the educational problem of assessing text difficulty. Hierarchical machine learning and natural language processing (NLP) are combined to predict the difficulty of practice texts used in iSTART, an intelligent tutoring system for reading comprehension. Human raters estimated the difficulty level of 262 texts across two text sets (Set A and Set B) in the iSTART library. NLP tools were used to identify linguistic features predictive of text difficulty, and these indices were submitted to both flat and hierarchical machine learning algorithms. Results indicated that including NLP indices and machine learning increased accuracy by more than 10% compared to classic readability metrics (e.g., Flesch-Kincaid Grade Level). Further, hierarchical classification outperformed non-hierarchical (flat) classification for Set B (72%) and the combined Set A + B (65%), whereas the flat approach performed slightly better than the hierarchical approach for Set A (79%). These findings demonstrate the importance of considering deeper features of language related to text difficulty, as well as the potential utility of hierarchical machine learning approaches in developing meaningful text difficulty classification.
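To make the flat versus hierarchical distinction concrete, the following is a minimal sketch rather than the study's implementation. It assumes four hypothetical difficulty levels grouped into two hypothetical bands ("easier" and "harder"), two made-up linguistic indices per text, and a toy nearest-centroid classifier standing in for whatever learning algorithm is used at each node. None of these names, values, or groupings come from the article.

```python
from collections import defaultdict
from math import dist


class NearestCentroid:
    """Toy stand-in classifier: predicts the label whose class mean is closest."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for features, label in zip(X, y):
            groups[label].append(features)
        # One mean feature vector (centroid) per label.
        self.centroids = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self

    def predict_one(self, features):
        return min(self.centroids, key=lambda lab: dist(features, self.centroids[lab]))


class TwoStageClassifier:
    """Hierarchical scheme: pick a coarse band first, then a level inside that band."""

    def __init__(self, band_of):
        # band_of maps each fine-grained level to a coarse band (hypothetical grouping).
        self.band_of = band_of

    def fit(self, X, y):
        bands = [self.band_of[label] for label in y]
        # Root node: trained on all texts, but only on the coarse band labels.
        self.root = NearestCentroid().fit(X, bands)
        # Leaf nodes: one classifier per band, trained only on that band's texts.
        self.leaves = {}
        for band in set(bands):
            rows = [(x, lab) for x, lab in zip(X, y) if self.band_of[lab] == band]
            self.leaves[band] = NearestCentroid().fit(
                [x for x, _ in rows], [lab for _, lab in rows]
            )
        return self

    def predict_one(self, features):
        band = self.root.predict_one(features)
        return self.leaves[band].predict_one(features)


# Made-up data: [mean sentence length, mean word familiarity] per text.
X = [[8, 0.9], [9, 0.8], [13, 0.7], [14, 0.6],
     [19, 0.5], [20, 0.4], [25, 0.3], [26, 0.2]]
y = [1, 1, 2, 2, 3, 3, 4, 4]  # hypothetical difficulty levels

flat = NearestCentroid().fit(X, y)  # one classifier over all levels
hier = TwoStageClassifier({1: "easier", 2: "easier", 3: "harder", 4: "harder"}).fit(X, y)

print(flat.predict_one([12, 0.7]), hier.predict_one([12, 0.7]))
```

The flat model chooses among all levels in a single step, while the two-stage model only has to separate levels within the band chosen at the root. The trade-off is that a wrong band decision at the root cannot be corrected at the leaf.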


  1.

  2. For more on NLP, see McNamara et al. (2018). For a more thorough discussion of Coh-Metrix, see Graesser et al. (2004) and McNamara et al. (2014).

  3. We were unable to locate downloadable software or corpora associated with these studies, so we could not compare our algorithms to those used in them. Such a comparison was not the purpose of this study, and its absence does not affect the validity of the previous studies.


  1. Allen, L. K., Jacovina, M. E., & McNamara, D. S. (2016). Cohesive features of deep text comprehension processes. In J. Trueswell, A. Papafragou, D. Grodner, & D. Mirman (Eds.), Proceedings of the 38th Annual Meeting of the Cognitive Science Society in Philadelphia, PA (pp. 2681–2686). Austin, TX: Cognitive Science Society.

  2. Allen, L. K., Snow, E. L., & McNamara, D. S. (2015). Are you reading my mind? Modeling students' reading comprehension skills with natural language processing techniques. In J. Baron, G. Lynch, N. Maziarz, P. Blikstein, A. Merceron, & G. Siemens (Eds.), Proceedings of the 5th International Learning Analytics & Knowledge Conference (LAK'15) (pp. 246–254). Poughkeepsie, NY: ACM.

  3. Aggarwal, C. C., & Zhai, C. (2012). A survey of text classification algorithms. In C. Aggarwal & C. Zhai (Eds.), Mining text data. Boston, MA: Springer.

  4. Baayen, R. H., Piepenbrock, R., & Gulikers, L. (1995). The CELEX lexical database (release 2). Distributed by Linguistic Data Consortium, University of Pennsylvania.

  5. Babbar, R., Partalas, I., Gaussier, E., & Amini, M. R. (2013). On flat versus hierarchical classification in large-scale taxonomies. In Advances in Neural Information Processing Systems. 1824–1832.

  6. Balyan, R., McCarthy, K. S., & McNamara, D. S. (2017). Combining machine learning and natural language processing to assess literary text comprehension. In X. Hu, T. Barnes, A. Hershkovitz, & L. Paquette (Eds.), Proceedings of the 10th International Conference on Educational Data Mining (EDM) (pp. 244–249). Wuhan: International Educational Data Mining Society.

  7. Balyan, R., McCarthy, K. S., & McNamara, D. S. (2018). Comparing machine learning classification approaches for predicting expository text difficulty. In Proceedings of the 31st Annual Florida Artificial Intelligence Research Society International Conference (FLAIRS). AAAI Press.

  8. Begeny, J. C., & Greene, D. J. (2014). Can readability formulas be used to successfully gauge difficulty of reading materials? Psychology in the Schools, 51(2), 198–215.

  9. Benjamin, R. (2012). Reconstructing readability: Recent developments and recommendations in the analysis of text difficulty. Educational Psychology Review, 24(1), 63–88.

  10. Bjork, R. A. (1994). Memory and metamemory considerations in the training of human beings. In J. Metcalfe & A. Shimamura (Eds.), Metacognition: Knowing about knowing (pp. 185–205). Cambridge, MA: MIT Press.

  11. Bormuth, J. R. (1966). Readability: A new approach. Reading Research Quarterly, 1, 79–132.

  12. Bormuth, J. R. (1969). Development of readability analysis (Final Report, Project No. 7-0052, Contract No. OEC-3-7-070052-0326). Retrieved from ERIC database. (ED029166).

  13. Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.

  14. Brunato, D., De Mattei, L., Dell’Orletta, F., Iavarone, B., & Venturi, G. (2018). Is this sentence difficult? Do you agree? In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 2690–2699).

  15. Caruana, R., & Niculescu-Mizil, A. (2006, June). An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd International Conference on Machine Learning (pp. 161–168). ACM.

  16. Casasent, D., & Wang, Y.-C. F. (2005). A hierarchical classifier using new support vector machine for automatic target recognition. Neural Networks, 18(5–6), 541–548.

  17. Cerri, R., Barros, R. C., & de Carvalho, A. C. (2015, July). Hierarchical classification of gene ontology-based protein functions with neural networks. In 2015 International Joint Conference on Neural Networks (IJCNN) (pp. 1–8). IEEE.

  18. Cesa-Bianchi, N., Gentile, C., & Zaniboni, L. (2006). Incremental algorithms for hierarchical classification. Journal of Machine Learning Research, 7(Jan), 31–54.

  19. Chall, J. S. (1988). The beginning years. In B. L. Zakaluk & S. J. Samuels (Eds.), Readability: Its past, present, and future. Newark, DE: International Reading Association.

  20. Collins-Thompson, K. (2014). Computational assessment of text readability: A survey of current and future research. ITL - International Journal of Applied Linguistics, 165(2), 97–135.

  21. Coltheart, M. (1981). The MRC psycholinguistic database. The Quarterly Journal of Experimental Psychology, 33(4), 497–505.

  22. Crossley, S. A., Allen, D., & McNamara, D. S. (2012). Text simplification and comprehensible input: A case for an intuitive approach. Language Teaching Research, 16, 89–108.

  23. Crossley, S. A., Allen, L. K., Snow, E. L., & McNamara, D. S. (2016a). Incorporating learning characteristics into automatic essay scoring models: What individual differences and linguistic features tell us about writing quality. Journal of Educational Data Mining, 8(2), 1–19.

  24. Crossley, S. A., Kyle, K., & Dascalu, M. (2018). The tool for the automatic analysis of cohesion 2.0: Integrating semantic similarity and text overlap. Behavior Research Methods, 1–14.

  25. Crossley, S. A., Kyle, K., & McNamara, D. S. (2016b). The tool for the automatic analysis of text cohesion (TAACO): Automatic assessment of local, global, and text cohesion. Behavior Research Methods, 48(4), 1227–1237.

  26. Crossley, S. A., & McNamara, D. S. (2009). Computationally assessing lexical differences in second language writing. Journal of Second Language Writing, 17, 119–135.

  27. Dale, E., & Chall, J. S. (1948). A formula for predicting readability. Educational Research Bulletin, 27(1), 11–28.

  28. Dimitrovski, I., Kocev, D., Loskovska, S., & Džeroski, S. (2011). Hierarchical annotation of medical images. Pattern Recognition, 44(10–11), 2436–2449.

  29. Dufty, D. F., Graesser, A. C., Louwerse, M., & McNamara, D. S. (2006). Assigning grade level to textbooks: Is it just readability? In R. Sun & N. Miyake (Eds.), Proceedings of the 28th Annual Conference of the Cognitive Science Society (pp. 1251–1256). Austin, TX: Cognitive Science Society.

  30. Dumais, S. T., Platt, J., Heckerman, D., & Sahami, M. (1998). Inductive learning algorithms and representations for text categorization. In Proceedings of the seventh international conference on Information and knowledge management (Bethesda, Maryland, USA, November 02–07, 1998). CIKM’98. ACM, New York, NY, 148–155.

  31. Duran, N., Bellissens, C., Taylor, R., & McNamara, D. S. (2007). Quantifying text difficulty with automated indices of cohesion and semantics. In D. S. McNamara & G. Trafton (Eds.), Proceedings of the 29th Annual Meeting of the Cognitive Science Society (pp. 233–238). Austin, TX: Cognitive Science Society.

  32. Feng, L., Jansche, M., Huenerfauth, M., & Elhadad, N. (2010, August). A comparison of features for automatic readability assessment. In Proceedings of the 23rd international conference on computational linguistics: Posters, 276–284. Association for Computational Linguistics.

  33. Flesch, R. F. (1948). A new readability yardstick. Journal of Applied Psychology, 32(3), 221–233.

  34. François, T., & Miltsakaki, E. (2012). Do NLP and machine learning improve traditional readability formulas? In Proceedings of the First Workshop on Predicting and Improving Text Readability for target reader populations, pages 49–57, Montreal, Canada, Association for Computational Linguistics.

  35. Freund, Y., & Schapire, R. E. (1996, July). Experiments with a new boosting algorithm. In ICML (Vol. 96, pp. 148–156).

  36. Fry, E. (2002). Readability versus leveling. Reading Teacher, 56(3), 286–291.

  37. Fuchs, E., Niehaus, I., & Stoletzki, A. (2014). Das Schulbuch in der Forschung. Analysen und Empfehlungen für die Bildungspraxis. Göttingen: V&R unipress.

  38. Gee, J. P. (2004). An introduction to discourse analysis: Theory and method. Routledge.

  39. George-Nektarios, T. (2013). Weka classifiers summary. Athens University of Economics and Business Intracom-Telecom, Athens.

  40. Gilhooly, K. J., & Logie, R. H. (1980). Age-of-acquisition, imagery, concreteness, familiarity, and ambiguity measures for 1,944 words. Behavior Research Methods & Instrumentation, 12(4), 395–427.

  41. Graesser, A. C., McNamara, D. S., & Kulikowich, J. M. (2011). Coh-Metrix: Providing multilevel analyses of text characteristics. Educational Researcher, 40, 223–234.

  42. Graesser, A. C., McNamara, D. S., Louwerse, M., & Cai, Z. (2004). Coh-Metrix: Analysis of text on cohesion and language. Behavior Research Methods, Instruments, and Computers, 36, 193–202.

  43. Gunning, R. (1969). The fog index after twenty years. Journal of Business Communication, 6(2), 3–13.

  44. Hartmann, J., Huppertz, J., Schamp, C., & Heitmann, M. (2019). Comparing automated text classification methods. International Journal of Research in Marketing, 36(1), 20–38.

  45. Heilman, M., Collins-Thompson, K., & Eskenazi, M. (2008). An analysis of statistical models and features for reading difficulty prediction. In Proceedings of the Third Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics, Columbus, OH, USA, 71–79.

  46. Ho, T. K. (1995). Random decision forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition (Montreal, QC, August 14–15, 1995). ICDAR’95, IEEE Computer Society, Washington, DC, USA, 278–282.

  47. Jackson, G. T., & McNamara, D. S. (2013). Motivation and performance in a game-based intelligent tutoring system. Journal of Educational Psychology, 105, 1036–1049.

  48. Jiang, Z., Gu, Q., Yin, Y., & Chen, D. (2018, August). Enriching word embeddings with domain knowledge for readability assessment. In Proceedings of the 27th International Conference on Computational Linguistics, 366–378.

  49. Johnson, A. M., McCarthy, K. S., Kopp, K. J., Perret, C. A., & McNamara, D. S. (2017). Adaptive reading and writing instruction in iSTART and W-Pal. In Proceedings of the 30th Florida Artificial Intelligence Research Society International Conference (FLAIRS). AAAI Press.

  50. Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Proceedings of 10th European Conference on Machine Learning (April 21–23). ECML’98. Springer-Verlag, London, UK, 137–142.

  51. Kate, R. J., Luo, X., Patwardhan, S., Franz, M., Florian, R., Mooney, R. J., Roukos, S., & Welty, C. (2010). Learning to predict readability using diverse linguistic features. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10). Association for Computational Linguistics, Stroudsburg, PA, USA, 546–554.

  52. Kincaid, J. P., Fishburne, R. P., Rogers, R. L., & Chissom, B. S. (1975). Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease formula) for Navy enlisted personnel. Research Branch Report 8–75, Millington, TN: Naval Technical Training, U. S. Naval Air Station, Memphis, TN.

  53. Klare, G. R. (1974). Assessing readability. Reading Research Quarterly, 10, 62–102.

  54. Klare, G. R. (1984). Readability. In P. D. Pearson, R. Barr, M. L. Kamil, P. Mosenthal, & R. Dykstra (Eds.), Handbook of reading research (pp. 681–744). New York: Longman.

  55. Kotani, K., Yoshimi, T., & Isahara, H. (2011). A machine learning approach to measurement of text readability for EFL learners using various linguistic features. US-China Education Review B, 6, 767–777.

  56. Kowsari, K., Jafari Meimandi, K., Heidarysafa, M., Mendu, S., Barnes, L., & Brown, D. (2019). Text classification algorithms: A survey. Information, 10(4), 150.

  57. Krogh, A., & Vedelsby, J. (1994). Neural network ensembles, cross validation, and active learning. In Proceedings of 7th International Conference on Neural Information Processing Systems (Denver, Colorado). NIPS’94. MIT Press, Cambridge, MA, USA, 231–238.

  58. Kumar, S., Ghosh, J., & Crawford, M. M. (2002). Hierarchical fusion of multiple classifiers for hyperspectral data analysis. Pattern Analysis and Applications, Spl. Issue on Fusion of Multiple Classifiers, 5(2), 210–220.

  59. Kumar, S., & Ghosh, J. (1999). GAMLS: A generalized framework for associative modular learning systems. In Proceedings of SPIE conference on applications and science of computational intelligence II, SPIE proceedings, Orlando, FL, 3722, 24–35.

  60. Kyle, K. (2016). Measuring syntactic development in L2 writing: Fine grained indices of syntactic complexity and usage-based indices of syntactic sophistication (Doctoral Dissertation). Retrieved from

  61. Kyle, K., & Crossley, S. A. (2015). Automatically assessing lexical sophistication: Indices, tools, findings, and application. TESOL Quarterly, 49(4), 757–786.

  62. Kyle, K., Crossley, S. A., & Berger, C. (2018). The tool for the analysis of lexical sophistication version 2.0. Behavior Research Methods, 50(3), 1030–1046.

  63. Lennon, C., & Burdick, H. (2004). The lexile framework as an approach for reading measurement and success. (electronic publication on

  64. Lieberman, M. G., & Morris, J. D. (2014). The precise effect of multicollinearity on classification prediction. Multiple Linear Regression Viewpoints, 40(1), 5–10.

  65. Malvern, D. D., Richards, B. J., Chipere, N., & Durán, P. (2004). Lexical diversity and language development: Quantification and assessment. Houndmills: Palgrave Macmillan.

  66. Martínez, A. M., & Kak, A. C. (2001). PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2), 228–233.

  67. Mayne, A., & Perry, R. (2009, March). Hierarchically classifying documents with multiple labels. In 2009 IEEE Symposium on Computational Intelligence and Data Mining (pp. 133–139). IEEE.

  68. McCallum, A., & Nigam, K. (1998). A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, Tech. Rep. WS-98-05, AAAI Press.

  69. McCarthy, K. S., Watanabe, M., Dai, J., & McNamara, D. S. (in press). Personalized learning in iSTART: Past modifications and future design. Journal of Research on Technology in Education.

  70. McCarthy, P. M. (2005). An assessment of the range and usefulness of lexical diversity measures and the potential of the measure of textual, lexical diversity (MTLD). Dissertation Abstracts International, 66, UMI No. 3199485.

  71. McCarthy, P. M., & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42, 381–392.

  72. McNamara, D. S., Allen, L. K., McCarthy, S., & Balyan, R. (2018). NLP: Getting computers to understand discourse. In Deep Comprehension (pp. 224–236). Routledge.

  73. McNamara, D. S., Crossley, S. A., & Roscoe, R. D. (2013). Natural language processing in an intelligent writing strategy tutoring system. Behavior Research Methods, 45, 499–515.

  74. McNamara, D. S., Crossley, S. A., Roscoe, R. D., Allen, L. K., & Dai, J. (2015). A hierarchical classification approach to automated essay scoring. Assessing Writing, 23, 35–59.

  75. McNamara, D. S., Graesser, A. C., McCarthy, P., & Cai, Z. (2014). Automated evaluation of text and discourse with Coh-Metrix. Cambridge: Cambridge University Press.

  76. McNamara, D. S., Graesser, A. C., & Louwerse, M. M. (2012). Sources of text difficulty: Across genres and grades. In J. P. Sabatini, E. Albro, & T. O'Reilly (Eds.), Measuring up: Advances in how we assess reading ability (pp. 89–116). Lanham, MD: R&L Education.

  77. McNamara, D. S., Kintsch, E., Songer, N. B., & Kintsch, W. (1996). Are good texts always better? Interactions of text coherence, background knowledge, and levels of understanding in learning from text. Cognition and Instruction, 14, 1–43.

  78. McNamara, D. S., Levinstein, I. B., & Boonthum, C. (2004). iSTART: Interactive strategy trainer for active reading and thinking. Behavior Research Methods, Instruments, and Computers, 36, 222–233.

  79. Mika, S., Rätsch, G., Weston, J., Schölkopf, B., & Müller, K. R. (1999, August). Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop (Cat. No. 98TH8468) (pp. 41–48). IEEE.

  80. Millis, K., Magliano, J. P., Wiemer-Hastings, K., Todaro, S., & McNamara, D. S. (2007). Assessing and improving comprehension with latent semantic analysis. In T. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of latent semantic analysis (pp. 207–225). Mahwah, NJ: Erlbaum.

  81. National Governors Association Center for Best Practices. (2010). Common Core State Standards. Washington, DC: National Governors Association Center for Best Practices, Council of Chief State School Officers.

  82. Ozuru, Y., Dempsey, K., Sayroo, J., & McNamara, D. S. (2005). Effect of text cohesion on comprehension of biology texts. In B. G. Bara, L. Barsalou, & M. Bucciarelli (Eds.), Proceedings of the 27th annual conference of the cognitive science society (pp. 1696–1701). Mahwah, NJ: Erlbaum.

  83. Paivio, A., Yuille, J. C., & Madigan, S. A. (1968). Concreteness, imagery, and meaningfulness values for 925 nouns. Journal of Experimental Psychology, 76(1, Pt. 2), 1–25.

  84. Perfetti, C. A., Landi, N., & Oakhill, J. (2005). The acquisition of reading comprehension skill. In M. J. Snowling & C. Hulme (Eds.), The science of reading: A handbook (pp. 227–247). Oxford: Blackwell.

  85. Perret, C. A., Johnson, A. M., McCarthy, K. S., Guerrero, T. A., & McNamara, D. S. (2017). StairStepper: An adaptive remedial iSTART module. In Proceedings of the 18th International Conference on Artificial Intelligence in Education (AIED), Wuhan, China: Springer.

  86. Pilán, I., Vajjala, S., & Volodina, E. (2016). A readable read: Automatic assessment of language learning materials based on linguistic complexity. International Journal of Computational Linguistics and Applications, 7, 143–159.

  87. Pilán, I., Volodina, E., & Johansson, R. (2014). Rule-based and machine learning approaches for second language sentence-level readability. In Proceedings of the ninth workshop on innovative use of NLP for building educational applications, Baltimore, Maryland USA, 174–184.

  88. Pitler, E., & Nenkova, A. (2008, October). Revisiting readability: A unified framework for predicting text quality. In Proceedings of the conference on empirical methods in natural language processing, 186–195. Association for Computational Linguistics.

  89. Rojas, R. (1996). Neural networks: A systematic introduction. Berlin: Springer-Verlag.

  90. Salsbury, T., Crossley, S. A., & McNamara, D. S. (2011). Psycholinguistic word information in second language oral discourse. Second Language Research, 27, 343–360.

  91. Schapire, R. E., & Singer, Y. (1999). BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2–3), 135–168.

  92. Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.

  93. Schwarm, S. E., & Ostendorf, M. (2005). Reading level assessment using support vector machines and statistical language models. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (pp. 523–530). Association for Computational Linguistics.

  94. Schwenker, F. (2000). Hierarchical support vector machines for multiclass pattern recognition. In Proceedings of 4th KES, Brighton, UK, 2, 561–565.

  95. Si, L., & Callan, J. (2001, October). A statistical model for scientific readability. In Proceedings of the Tenth International Conference on Information and Knowledge Management (pp. 574–576). ACM.

  96. Snow, E. L., Jacovina, M. E., Jackson, G. T., & McNamara, D. S. (2016). iSTART-2: A reading comprehension and strategy instruction tutor. In D. S. McNamara & S. A. Crossley (Eds.), Adaptive educational technologies for literacy instruction (pp. 104–121). New York, NY: Taylor and Francis, Routledge.

  97. Stenner, A. J., Horabin, I., Smith, D. R., & Smith, M. (1988). The lexile framework. Durham, NC: MetaMetrics.

  98. Sun, A., & Lim, E. P. (2001). Hierarchical text classification and evaluation. In Proceedings of the IEEE International Conference on Data Mining (ICDM 2001), San Jose, CA, USA, 29 November–2 December 2001 (pp. 521–528).

  99. Sung, Y. T., Chen, J. L., Cha, J. H., Tseng, H. C., Chang, T. H., & Chang, K. E. (2015). Constructing and validating readability models: The method of integrating multilevel linguistic features with machine learning. Behavior Research Methods, 47(2), 340–354.

  100. Tanaka-Ishii, K., Tezuka, S., & Terada, H. (2010). Sorting by readability. Computational Linguistics, 36(2), 203–227.

  101. Toglia, M. P., & Battig, W. F. (1978). Handbook of semantic word norms. Lawrence Erlbaum.

  102. Triguero, I., & Vens, C. (2016). Labelling strategies for hierarchical multi-label classification techniques. Pattern Recognition, 56, 170–183.

  103. Vajjala, S., & Meurers, D. (2012, June). On improving the accuracy of readability classification using insights from second language acquisition. In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP (pp. 163–173). Association for Computational Linguistics.

  104. van Dijk, T. A. (1985). Semantic discourse analysis. In T. van Dijk (Ed.), Handbook of discourse analysis (Vol. 2, pp. 103–136). London: Academic Press.

  105. Vygotsky, L. (1978). Mind in society: The development of higher psychological processes. (M. Cole, V. John-Steiner, S. Scribner, & E. Souberman, Trans.). Cambridge, MA: Harvard University Press.

  106. Wang, Y.-C. F., & Casasent, D. (2009). A support vector hierarchical method for multi-class classification and rejection. In Proceedings of International Joint Conference on Neural Networks, Atlanta, Georgia, USA, June 14–19, 3281–3288.

  107. Witten, I. H., Frank, E., Trigg, L. E., Hall, M. A., Holmes, G., & Cunningham, S. J. (1999). Weka: Practical machine learning tools and techniques with Java implementations.

  108. Zhang, G. P. (2000). Neural networks for classification: A survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 30(4), 451–462.

  109. Zimek, A., Buchwald, F., Frank, E., & Kramer, S. (2008). A study of hierarchical and flat classification of proteins. IEEE Transactions on Computational Biology and Bioinformatics, 7(3), 563–571.

  110. Zipf, G. K. (1949). Human behavior and the principle of least effort. Reading, MA: Addison-Wesley.



The authors would like to recognize the support of the Institute of Education Sciences, U.S. Department of Education, through Grants R305A180261, R305A190050 and R305A180144, and the Office of Naval Research, through Grant N000141712300, to Arizona State University. The opinions expressed are those of the authors and do not represent views of the Institute, the U.S. Department of Education, or the Office of Naval Research.

Author information



Corresponding author

Correspondence to Renu Balyan.


Appendix A

Table 8 The evaluation Metrics (Set A) using FKGL
Table 9 The evaluation Metrics (Set B) using FKGL
Table 10 The evaluation Metrics (Set A + Set B) using FKGL
Table 11 The evaluation Metrics for Set A using FKGL+
Table 12 The evaluation Metrics for Set B using FKGL+
Table 13 The evaluation Metrics for Set A + Set B using FKGL+

Appendix B

Table 14 Machine Learning algorithms used in the study (Weka 3.8.1)

Appendix C

Table 15 Maximum and Minimum un-normalized values for Linguistic indices (Set A)
Table 16 Maximum and Minimum un-normalized values for Linguistic indices (Set B)
Table 17 Maximum and Minimum un-normalized values for Linguistic indices (Set A + Set B)

Appendix D

Table 18 Text Difficulty Level-wise Performance metrics
Table 19 Confusion Matrix for Set A
Table 20 Confusion Matrix for Set B
Table 21 Confusion Matrix for the Combined Set (A + B)

About this article

Cite this article

Balyan, R., McCarthy, K.S. & McNamara, D.S. Applying Natural Language Processing and Hierarchical Machine Learning Approaches to Text Difficulty Classification. Int J Artif Intell Educ (2020).


  • Text difficulty
  • Machine learning
  • Hierarchical classification
  • Natural language processing