Evaluating the Performance of Text Mining Systems on Real-world Press Archives

Paaß, Gerhard; de Vries, Hugo

doi:10.1007/3-540-31314-1_50

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Abstract

We investigate the performance of text mining systems for annotating press articles in two real-world press archives. Seven commercial systems are tested, which recover the categories of a document as well as named entities and catchphrases. Using cross-validation, we evaluate the precision-recall characteristic. Depending on the depth of the category tree, a breakeven of 39–79% is achieved. For one corpus, 45% of the documents can be classified automatically, based on the system’s confidence estimates. The formal evaluation results are confirmed in a usability experiment. It turns out that, with respect to some features, human annotators perform worse than the text mining systems. This is a convincing argument for using text mining systems to support the indexing of large document collections.
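
The abstract reports results as a precision-recall breakeven but does not say how the value was computed. The sketch below illustrates one common definition only and is not the authors' procedure: documents are ranked by classifier score for a single category, and the ranking is cut at the number of truly relevant documents, where precision and recall coincide. The function name `breakeven_point` and the sample scores are hypothetical.

```python
from typing import Sequence


def breakeven_point(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Precision-recall breakeven for a single category.

    Documents are ranked by descending score and the ranking is cut at
    k = number of truly relevant documents. At that cutoff precision
    (TP / k) and recall (TP / relevant) have the same denominator, so
    the two measures are equal.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    relevant = sum(1 for label in labels if label)
    if relevant == 0:
        raise ValueError("category has no relevant documents")
    # sorted() is stable, so tied scores keep their original order.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = sum(1 for _, label in ranked[:relevant] if label)
    return true_positives / relevant


# Hypothetical scores for five documents, three of which are relevant.
example_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
example_labels = [True, False, True, True, False]
example_breakeven = breakeven_point(example_scores, example_labels)
```

In this made-up example, two of the three top-ranked documents are relevant, so the breakeven is 2/3. A cross-validated figure would typically average such values over held-out folds and categories; that averaging step is an assumption here as well.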

Author information

Authors and Affiliations

Fraunhofer Institute for Autonomous Intelligent Systems, St. Augustin, Germany
Gerhard Paaß
Macquarie University, Sydney, Australia
Hugo de Vries

Editor information

Editors and Affiliations

Institut für Technische und Betriebliche Informationssysteme, Otto-von-Guericke-Universität Magdeburg, Universitätsplatz 2, 39106, Magdeburg, Germany
Myra Spiliopoulou
Institut für Wissens- und Sprachverarbeitung, Otto-von-Guericke-Universität Magdeburg, Universitätsplatz 2, 39106, Magdeburg, Germany
Rudolf Kruse, Christian Borgelt & Andreas Nürnberger
Institut für Entscheidungstheorie und Unternehmensforschung, Universität Karlsruhe (TH), 76128, Karlsruhe
Wolfgang Gaul

About this paper

Cite this paper

Paaß, G., de Vries, H. (2006). Evaluating the Performance of Text Mining Systems on Real-world Press Archives. In: Spiliopoulou, M., Kruse, R., Borgelt, C., Nürnberger, A., Gaul, W. (eds) From Data and Information Analysis to Knowledge Engineering. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-31314-1_50

DOI: https://doi.org/10.1007/3-540-31314-1_50
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31313-7
Online ISBN: 978-3-540-31314-4
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
