Abstract
We investigate the performance of text mining systems for annotating press articles in two real-world press archives. Seven commercial systems are tested, which recover the categories of a document as well as named entities and catchphrases. Using cross-validation, we evaluate the precision-recall characteristic. Depending on the depth of the category tree, a precision-recall breakeven of 39–79% is achieved. For one corpus, 45% of the documents can be classified automatically, based on the system’s confidence estimates. A usability experiment confirms the formal evaluation results. It turns out that, with respect to some features, human annotators perform worse than the text mining systems. This is a convincing argument for using text mining systems to support the indexing of large document collections.
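As an illustration of the breakeven measure mentioned above, the following minimal sketch computes an approximate precision-recall breakeven for a single category from ranked classifier scores. It is not the evaluation code used in the study: the function name `breakeven_point`, the input format (one score and one 0/1 label per document), and the choice to take the rank cutoff where precision and recall are closest are all assumptions made for this example.

```python
from typing import Sequence


def breakeven_point(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Approximate precision-recall breakeven for one category.

    scores: classifier confidence per document (higher = more likely relevant).
    labels: 1 if the document truly belongs to the category, else 0.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive document is required")

    # Rank documents from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    best_gap = float("inf")
    best_value = 0.0
    true_pos = 0
    # Treat every rank k as a cutoff: the top k documents are assigned
    # to the category, the rest are rejected.
    for k, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        precision = true_pos / k
        recall = true_pos / total_pos
        gap = abs(precision - recall)
        # Keep the cutoff where precision and recall are closest.
        if gap < best_gap:
            best_gap = gap
            best_value = (precision + recall) / 2
    return best_value


if __name__ == "__main__":
    example_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
    example_labels = [1, 1, 0, 1, 0, 0]
    print(round(breakeven_point(example_scores, example_labels), 3))
```

In a cross-validation setting, one would typically compute this value on each held-out fold and average the results; that aggregation step is omitted here.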
Copyright information
© 2006 Springer Berlin · Heidelberg
About this paper
Cite this paper
Paaß, G., de Vries, H. (2006). Evaluating the Performance of Text Mining Systems on Real-world Press Archives. In: Spiliopoulou, M., Kruse, R., Borgelt, C., Nürnberger, A., Gaul, W. (eds) From Data and Information Analysis to Knowledge Engineering. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-31314-1_50
DOI: https://doi.org/10.1007/3-540-31314-1_50
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31313-7
Online ISBN: 978-3-540-31314-4
eBook Packages: Mathematics and Statistics (R0)