Measuring Content Complexity of Technical Texts: Machine Learning Experiments

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11626)

Abstract

Classifying texts by their content complexity is important for applications like adaptive foreign language reading recommender systems and information retrieval. The goal of this paper is to propose a computational model of technical texts’ content complexity based on three criteria: knowledge depth, required knowledge, and content focus. To implement this model, 28 features of content and lexical complexity were extracted from 1702 texts of three types: general blogs, science journalistic texts and research papers. The machine learning experiments showed that content features alone can provide high classification accuracy.
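The feature-based classification summarized above can be illustrated with a minimal sketch. It is not the paper's implementation: the two features (Guiraud's lexical diversity index and mean word length), the function names `extract_features`, `train_centroids`, and `classify`, and the nearest-centroid classifier are all assumptions chosen for brevity, standing in for the 28 features and the learning algorithms used in the experiments.

```python
import math
import re
from collections import defaultdict


def extract_features(text):
    """Return a small illustrative feature vector (hypothetical, not the paper's 28 features)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return [0.0, 0.0]
    types = set(tokens)
    # Guiraud's index: number of distinct words divided by sqrt(number of words).
    guiraud = len(types) / math.sqrt(len(tokens))
    mean_word_length = sum(len(t) for t in tokens) / len(tokens)
    return [guiraud, mean_word_length]


def train_centroids(labeled_texts):
    """Average the feature vectors of each class (nearest-centroid training)."""
    sums = {}
    counts = defaultdict(int)
    for text, label in labeled_texts:
        feats = extract_features(text)
        if label not in sums:
            sums[label] = [0.0] * len(feats)
        sums[label] = [s + f for s, f in zip(sums[label], feats)]
        counts[label] += 1
    return {label: [s / counts[label] for s in vec] for label, vec in sums.items()}


def classify(text, centroids):
    """Assign the label whose centroid is closest in Euclidean distance."""
    feats = extract_features(text)
    return min(centroids, key=lambda label: math.dist(feats, centroids[label]))
```

In use, `train_centroids` would receive `(text, label)` pairs with labels such as `"blog"`, `"science journalism"`, and `"research paper"`, and `classify` would return one of those labels for an unseen text. A realistic setup would scale the features and use a stronger learner; the sketch only shows the shape of the pipeline.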

Notes

  1. https://www.kaggle.com/rtatman/blog-authorship-corpus.

  2. CORE (COnnecting REpositories) is an aggregation of papers from open access journals: https://www.jisc.ac.uk/core.

  3. Based on the shortest path that connects the senses and the maximum depth of the hierarchy in which the senses occur.

  4. http://websites.psychology.uwa.edu.au/school/MRCDatabase/uwa_mrc.htm.

Author information

Correspondence to M. Zakaria Kurdi.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Kurdi, M.Z. (2019). Measuring Content Complexity of Technical Texts: Machine Learning Experiments. In: Isotani, S., Millán, E., Ogan, A., Hastings, P., McLaren, B., Luckin, R. (eds) Artificial Intelligence in Education. AIED 2019. Lecture Notes in Computer Science, vol 11626. Springer, Cham. https://doi.org/10.1007/978-3-030-23207-8_28

  • DOI: https://doi.org/10.1007/978-3-030-23207-8_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-23206-1

  • Online ISBN: 978-3-030-23207-8

  • eBook Packages: Computer Science, Computer Science (R0)
