On the Assessment of Text Corpora

  • Conference paper
Natural Language Processing and Information Systems (NLDB 2009)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5723)

Abstract

Classifier-independent measures are important for assessing the quality of corpora. In this paper we present supervised and unsupervised measures for analysing several data collections with respect to the following features: domain broadness, shortness, class imbalance, and stylometry. We found that the investigated assessment measures may make it possible to evaluate the quality of gold standards. Moreover, they could help classification systems make strategic decisions when tackling specific text collections.
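
The abstract names the corpus features but does not give the formulas behind them. As a rough illustration of what classifier-independent measures can look like, the sketch below computes three simple corpus statistics in Python. The function names and the specific formulas (standard deviation of class proportions around the uniform share, mean token count, type-token ratio) are assumptions chosen for illustration and are not necessarily the measures defined in the paper.

```python
from collections import Counter
from math import sqrt


def class_imbalance(labels):
    """Standard deviation of class proportions around the uniform share 1/K.

    Illustrative assumption: 0.0 means perfectly balanced classes;
    larger values mean a more skewed class distribution.
    """
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    expected = 1.0 / k
    return sqrt(sum((c / n - expected) ** 2 for c in counts.values()) / k)


def average_length(docs):
    """Mean number of whitespace-separated tokens per document.

    Used here as a crude proxy for shortness (assumes a non-empty corpus).
    """
    return sum(len(d.split()) for d in docs) / len(docs)


def type_token_ratio(docs):
    """Distinct tokens divided by total tokens over the whole corpus.

    A basic vocabulary-richness statistic (assumes at least one token).
    """
    tokens = [t.lower() for d in docs for t in d.split()]
    return len(set(tokens)) / len(tokens)
```

All three functions need only the texts (and, for the first, the gold labels), so they can be computed before any classifier is trained.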

This research work was partially supported by the CICYT TIN2006-15265-C06 project.

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Pinto, D., Rosso, P., Jiménez-Salazar, H. (2010). On the Assessment of Text Corpora. In: Horacek, H., Métais, E., Muñoz, R., Wolska, M. (eds) Natural Language Processing and Information Systems. NLDB 2009. Lecture Notes in Computer Science, vol 5723. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12550-8_23

  • DOI: https://doi.org/10.1007/978-3-642-12550-8_23

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-12549-2

  • Online ISBN: 978-3-642-12550-8

  • eBook Packages: Computer Science, Computer Science (R0)
