Size Does Not Matter. Frequency Does. A Study of Features for Measuring Lexical Complexity

Wilkens, Rodrigo; Vecchia, Alessandro Dalla; Boito, Marcely Zanon; Padró, Muntsa; Villavicencio, Aline

doi:10.1007/978-3-319-12027-0_11

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8864)

Included in the following conference series:

Ibero-American Conference on Artificial Intelligence

Abstract

Lexical simplification aims to replace complex words with simpler synonyms or semantically close words. A first step in this task is to decide which words are complex and need to be replaced. Although this is a very subjective and nontrivial task, linguists agree on what makes a word more difficult to read and understand. Cues such as the length of a word or its frequency in the language are accepted as informative for determining its complexity. In this work, we study the effectiveness of those cues by using them in a classification task that separates words into simple and complex. Interestingly, our results show that word length is not important, while corpus frequency alone is enough to correctly classify a large proportion of the test cases (F-measure over 80%).
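To make the setup in the abstract concrete, the following is a minimal sketch of a single-feature simple/complex classifier scored with F-measure. It does not reproduce the paper's experiment: the word list, the frequency values, both thresholds, and all function names are assumptions invented for illustration, and the toy data was deliberately chosen so that the contrast between the two cues is visible.

```python
from typing import List, Tuple

# Hypothetical toy data: (word, corpus frequency per million, is_complex).
# The frequencies and labels are invented for illustration only;
# they are NOT taken from the paper or from any real corpus.
TOY_WORDS: List[Tuple[str, float, bool]] = [
    ("house", 500.0, False),
    ("water", 450.0, False),
    ("beautiful", 120.0, False),
    ("understand", 150.0, False),
    ("ubiquitous", 2.0, True),
    ("ire", 1.0, True),
    ("obfuscate", 0.5, True),
    ("apt", 3.0, True),
]


def f_measure(gold: List[bool], predicted: List[bool]) -> float:
    """Balanced F-measure (F1) for the 'complex' class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def score_frequency_cue(threshold: float = 10.0) -> float:
    """Label a word complex when its frequency is below an assumed threshold."""
    gold = [is_complex for _, _, is_complex in TOY_WORDS]
    predicted = [freq < threshold for _, freq, _ in TOY_WORDS]
    return f_measure(gold, predicted)


def score_length_cue(threshold: int = 7) -> float:
    """Label a word complex when it is longer than an assumed threshold."""
    gold = [is_complex for _, _, is_complex in TOY_WORDS]
    predicted = [len(word) > threshold for word, _, _ in TOY_WORDS]
    return f_measure(gold, predicted)


if __name__ == "__main__":
    print("frequency cue F-measure:", score_frequency_cue())
    print("length cue F-measure:", score_length_cue())
```

On this hand-built list the length cue mislabels short rare words and long common ones, while the frequency cue does not. That outcome follows from how the toy data was constructed and says nothing about real corpora; a real evaluation would need labelled data and frequencies from an actual corpus.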

Author information

Authors and Affiliations

Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, Brazil
Rodrigo Wilkens, Alessandro Dalla Vecchia, Marcely Zanon Boito, Muntsa Padró & Aline Villavicencio

Corresponding authors

Correspondence to Rodrigo Wilkens or Alessandro Dalla Vecchia.

Editor information

Editors and Affiliations

Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
Ana L.C. Bazzan
Pontificia Universidad Católica (PUC), Santiago de Chile, Chile
Karim Pichara

About this paper

Cite this paper

Wilkens, R., Vecchia, A.D., Boito, M.Z., Padró, M., Villavicencio, A. (2014). Size Does Not Matter. Frequency Does. A Study of Features for Measuring Lexical Complexity. In: Bazzan, A., Pichara, K. (eds) Advances in Artificial Intelligence -- IBERAMIA 2014. IBERAMIA 2014. Lecture Notes in Computer Science, vol 8864. Springer, Cham. https://doi.org/10.1007/978-3-319-12027-0_11

DOI: https://doi.org/10.1007/978-3-319-12027-0_11
Published: 12 November 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12026-3
Online ISBN: 978-3-319-12027-0
eBook Packages: Computer Science, Computer Science (R0)
