Evaluating the Performance of Text Mining Systems on Real-world Press Archives

  • Conference paper
From Data and Information Analysis to Knowledge Engineering

Abstract

We investigate the performance of text mining systems for annotating press articles in two real-world press archives. Seven commercial systems are tested, which recover the categories of a document as well as named entities and catchphrases. Using cross-validation, we evaluate the precision-recall characteristic. Depending on the depth of the category tree, a breakeven of 39–79% is achieved. For one corpus, 45% of the documents can be classified automatically based on the system’s confidence estimates. A usability experiment confirms the formal evaluation results. It turns out that, with respect to some features, human annotators perform worse than the text mining systems. This is a convincing argument for using text mining systems to support the indexing of large document collections.
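
The breakeven figure quoted above is the point on the precision-recall curve where precision equals recall. The preview does not show how the authors computed it, so the following is only a minimal sketch under one common convention: rank the documents by classifier score and cut the ranking at the number of truly relevant documents, where precision and recall coincide by construction. The function name `breakeven_point` and the toy scores are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def breakeven_point(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Precision-recall breakeven for a single category.

    scores: classifier confidence per document (higher = more likely relevant).
    labels: 1 if the document truly belongs to the category, else 0.

    Assumption: the cut-off is placed after the top n_pos documents, where
    n_pos is the number of relevant documents. At that cut-off
    precision = TP / n_pos and recall = TP / n_pos, so the two are equal.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("breakeven is undefined without relevant documents")

    # Rank documents from most to least confident (stable for tied scores).
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    # Count relevant documents among the top n_pos of the ranking.
    true_positives = sum(label for _, label in ranked[:n_pos])
    return true_positives / n_pos


if __name__ == "__main__":
    # Hypothetical toy data: six documents, three of them relevant.
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
    toy_labels = [1, 0, 1, 1, 0, 0]
    print(f"breakeven = {breakeven_point(toy_scores, toy_labels):.3f}")
```

In a cross-validation setting, one would compute this value on each held-out fold and average the results; how the paper aggregates over folds and categories is not stated in the abstract.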

Copyright information

© 2006 Springer Berlin · Heidelberg

About this paper

Cite this paper

Paaß, G., de Vries, H. (2006). Evaluating the Performance of Text Mining Systems on Real-world Press Archives. In: Spiliopoulou, M., Kruse, R., Borgelt, C., Nürnberger, A., Gaul, W. (eds) From Data and Information Analysis to Knowledge Engineering. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-31314-1_50
