Abstract
Lexical simplification aims to replace complex words with simpler synonyms or semantically close words. A first step in this task is deciding which words are complex and need to be replaced. Although this decision is highly subjective and far from trivial, there is agreement among linguists on what makes a word more difficult to read and understand. Cues such as the length of a word or its frequency in the language are accepted as informative for determining its complexity. In this work, we study the effectiveness of these cues by using them in a classification task that separates words into simple and complex. Interestingly, our results show that word length is not important, while corpus frequency alone is enough to correctly classify a large proportion of the test cases (F-measure over 80%).
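To make the comparison between the two cues concrete, the following is a minimal sketch of a simple/complex word classifier that uses one feature at a time and is scored with the F-measure. It is not the authors' experimental setup: the word list, the frequency counts, the single-threshold classifier, and the names `TOY_WORDS`, `f_measure`, and `best_threshold` are all assumptions introduced for illustration, and the threshold is tuned and scored on the same toy data.

```python
import math

# Hypothetical toy data, NOT taken from the paper:
# (word, made-up corpus frequency, gold label "is complex").
TOY_WORDS = [
    ("house", 50000, False),
    ("water", 42000, False),
    ("friend", 30000, False),
    ("information", 28000, False),
    ("ubiquitous", 120, True),
    ("ephemeral", 90, True),
    ("moot", 150, True),
    ("obfuscate", 15, True),
]


def f_measure(gold, predicted):
    """Harmonic mean of precision and recall for the 'complex' class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(values, gold):
    """Try every observed value as a cut-off (complex if value >= cut-off)
    and return the best (F-measure, cut-off) pair on this data."""
    best_score, best_cut = 0.0, None
    for cut in sorted(set(values)):
        predicted = [v >= cut for v in values]
        score = f_measure(gold, predicted)
        if score > best_score:
            best_score, best_cut = score, cut
    return best_score, best_cut


gold = [is_complex for _, _, is_complex in TOY_WORDS]

# Feature 1: word length in characters (longer is assumed harder).
lengths = [len(word) for word, _, _ in TOY_WORDS]
length_f, length_cut = best_threshold(lengths, gold)

# Feature 2: rarity, i.e. negative log corpus frequency (rarer is assumed harder).
rarity = [-math.log(freq + 1) for _, freq, _ in TOY_WORDS]
freq_f, freq_cut = best_threshold(rarity, gold)

print(f"length only:    F = {length_f:.2f}")
print(f"frequency only: F = {freq_f:.2f}")
```

In this toy data a short rare word ("moot") and a long common word ("information") are what trip up the length cue, while the frequency cue is unaffected by them. The numbers the sketch prints describe only the invented data and say nothing about the results reported in the paper.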
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Wilkens, R., Vecchia, A.D., Boito, M.Z., Padró, M., Villavicencio, A. (2014). Size Does Not Matter. Frequency Does. A Study of Features for Measuring Lexical Complexity. In: Bazzan, A., Pichara, K. (eds) Advances in Artificial Intelligence -- IBERAMIA 2014. IBERAMIA 2014. Lecture Notes in Computer Science, vol 8864. Springer, Cham. https://doi.org/10.1007/978-3-319-12027-0_11
DOI: https://doi.org/10.1007/978-3-319-12027-0_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12026-3
Online ISBN: 978-3-319-12027-0
eBook Packages: Computer Science, Computer Science (R0)