Size Does Not Matter. Frequency Does. A Study of Features for Measuring Lexical Complexity

  • Conference paper
Advances in Artificial Intelligence -- IBERAMIA 2014 (IBERAMIA 2014)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8864)

Abstract

Lexical simplification aims to replace complex words with simpler synonyms or semantically close words. A first step in this task is deciding which words are complex and need to be replaced. Although this decision is highly subjective and far from trivial, linguists agree on what makes a word more difficult to read and understand. Cues such as the length of a word or its frequency in the language are accepted as informative for determining its complexity. In this work, we study the effectiveness of those cues by using them in a classification task that separates words into simple and complex. Interestingly, our results show that word length is not important, while corpus frequency alone is enough to correctly classify a large proportion of the test cases (F-measure over 80%).
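
To make the classification setup concrete, here is a minimal Python sketch of a frequency-only classifier scored with the F-measure. It is not the authors' implementation: the toy corpus, the word labels, the threshold of 2, and the function names `classify` and `f_measure` are all assumptions made for illustration, and the scores it produces say nothing about the paper's results.

```python
from collections import Counter


def classify(word, freq, threshold):
    """Label a word 'complex' when its corpus count falls below the threshold."""
    # Counter returns 0 for unseen words, so they are treated as complex.
    return "complex" if freq[word] < threshold else "simple"


def f_measure(gold, predicted, positive="complex"):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy corpus and gold labels, invented for this sketch only.
corpus = (
    "the cat sat on the mat the dog saw the cat and the dog ran "
    "to the mat a feline is a cat"
).split()
freq = Counter(corpus)

words = ["cat", "dog", "mat", "feline", "ubiquitous"]
gold = ["simple", "simple", "simple", "complex", "complex"]

# Assumed threshold: words seen fewer than 2 times count as complex.
predicted = [classify(w, freq, threshold=2) for w in words]
score = f_measure(gold, predicted)
```

A length-based baseline would follow the same pattern, comparing `len(word)` against a cutoff instead of the corpus count, which makes it easy to contrast the two cues on the same labelled words.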


Author information

Correspondence to Rodrigo Wilkens or Alessandro Dalla Vecchia.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Wilkens, R., Vecchia, A.D., Boito, M.Z., Padró, M., Villavicencio, A. (2014). Size Does Not Matter. Frequency Does. A Study of Features for Measuring Lexical Complexity. In: Bazzan, A., Pichara, K. (eds) Advances in Artificial Intelligence -- IBERAMIA 2014. IBERAMIA 2014. Lecture Notes in Computer Science, vol 8864. Springer, Cham. https://doi.org/10.1007/978-3-319-12027-0_11

  • DOI: https://doi.org/10.1007/978-3-319-12027-0_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12026-3

  • Online ISBN: 978-3-319-12027-0

  • eBook Packages: Computer Science, Computer Science (R0)
