Abstract
The New York Times Annotated Corpus contains over 1.5 million manually tagged articles. It could become a useful resource for evaluating document clustering algorithms. Since the documents were labeled over a period of twenty years, it is argued that the classification may contain errors due to possible disagreement between experts and the need to add tags over time. This paper presents an approach to improving the classification quality that uses the assigned tags as a starting point.
It is assumed that each tag can be described by a set of features. These features are selected based on the mutual information between the tag and the stems occurring in documents labeled with it. An algorithm is presented for reassigning tags when a document does not contain the features of its labels. Experiments were performed on about ninety thousand articles published by the New York Times in 2005. The results of applying the algorithm to the collection are discussed.
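The feature selection step can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the code below assumes the standard mutual information between two binary events ("document carries the tag" and "document contains the stem"), as commonly used for text feature selection. The function names `mutual_information` and `top_features`, the input layout, and the cutoff `k` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def mutual_information(docs, tag, stem):
    """Mutual information (in bits) between 'document has tag' and
    'document contains stem'.

    docs: list of (set_of_tags, set_of_stems) pairs (assumed layout).
    """
    n = len(docs)
    # Joint counts over the four (has_tag, has_stem) combinations;
    # combinations that never occur are simply absent from the Counter.
    counts = Counter((tag in tags, stem in stems) for tags, stems in docs)
    mi = 0.0
    for (has_tag, has_stem), n_joint in counts.items():
        # Marginal counts for this tag value and this stem value.
        n_tag = sum(c for (t, _), c in counts.items() if t == has_tag)
        n_stem = sum(c for (_, s), c in counts.items() if s == has_stem)
        mi += (n_joint / n) * math.log2(n * n_joint / (n_tag * n_stem))
    return mi


def top_features(docs, tag, k):
    """Return the k stems with the highest mutual information with `tag`,
    considering only stems that occur in documents labeled with the tag."""
    candidates = set().union(*(stems for tags, stems in docs if tag in tags))
    # Sort alphabetically first so that ties are broken deterministically.
    return sorted(sorted(candidates),
                  key=lambda s: mutual_information(docs, tag, s),
                  reverse=True)[:k]


# Toy collection: each document is (tags, stems).
docs = [
    ({"sports"}, {"game", "team"}),
    ({"sports"}, {"game", "score"}),
    ({"politics"}, {"vote", "team"}),
    ({"politics"}, {"vote", "senat"}),
]
print(top_features(docs, "sports", 1))
```

In this toy collection the stem "game" occurs in exactly the documents tagged "sports", so it carries the most information about that tag, while "team" occurs equally often with and without the tag and carries none. A document whose stems contain none of the selected features of its tag would then be a candidate for reassignment; how the paper performs that step is not specified in the abstract.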
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mozzherina, E. (2013). An Approach to Improving the Classification of the New York Times Annotated Corpus. In: Klinov, P., Mouromtsev, D. (eds) Knowledge Engineering and the Semantic Web. KESW 2013. Communications in Computer and Information Science, vol 394. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41360-5_7
DOI: https://doi.org/10.1007/978-3-642-41360-5_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41359-9
Online ISBN: 978-3-642-41360-5
eBook Packages: Computer Science, Computer Science (R0)