Abstract
This article addresses the problem of automatic classification of textual content. We present selected methods for generating document representations and evaluate them in classification tasks. The experiments were performed on Wikipedia articles, which were automatically classified into the categories assigned to them by Wikipedia editors.
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Westa, M., Szymański, J., Krawczyk, H. (2012). Text Classifiers for Automatic Articles Categorization. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds) Artificial Intelligence and Soft Computing. ICAISC 2012. Lecture Notes in Computer Science, vol 7268. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29350-4_24
DOI: https://doi.org/10.1007/978-3-642-29350-4_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29349-8
Online ISBN: 978-3-642-29350-4
eBook Packages: Computer Science (R0)