On the Assessment of Text Corpora

Pinto, David; Rosso, Paolo; Jiménez-Salazar, Héctor

doi:10.1007/978-3-642-12550-8_23

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5723)

Included in the following conference series:

International Conference on Application of Natural Language to Information Systems

Abstract

Classifier-independent measures are important for assessing the quality of corpora. In this paper we present supervised and unsupervised measures for analysing several data collections with respect to the following features: domain broadness, shortness, class imbalance, and stylometry. We found that the investigated assessment measures may make it possible to evaluate the quality of gold standards. Moreover, they could also help classification systems make strategic decisions when tackling specific text collections.

This research work was partially supported by the CICYT TIN2006-15265-C06 project.

Author information

Authors and Affiliations

Faculty of Computer Science, B. Autonomous University of Puebla, Mexico
David Pinto
Natural Language Engineering Lab. - ELiRF, Universidad Politécnica de Valencia, Spain
Paolo Rosso
Department of Information Technologies, Autonomous Metropolitan University, Mexico
Héctor Jiménez-Salazar

Editor information

Editors and Affiliations

Institut für Computertechnologie, Technische Universität Wien, A-1040, Wien, Austria
Helmut Horacek
CNAM- Laboratoire Cédric, 292 Rue St. Martin, 75141, Paris Cedex 03, France
Elisabeth Métais
Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante, Campus de San Vincente del Raspeig, Apdo 99, 03080, Alicante, Spain
Rafael Muñoz
Dept. of Computational Linguistics, Saarland University, Germany
Magdalena Wolska

Cite this paper

Pinto, D., Rosso, P., Jiménez-Salazar, H. (2010). On the Assessment of Text Corpora. In: Horacek, H., Métais, E., Muñoz, R., Wolska, M. (eds) Natural Language Processing and Information Systems. NLDB 2009. Lecture Notes in Computer Science, vol 5723. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12550-8_23

DOI: https://doi.org/10.1007/978-3-642-12550-8_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12549-2
Online ISBN: 978-3-642-12550-8
eBook Packages: Computer Science, Computer Science (R0)
