Abstract
Text classification assigns documents to categories, typically by topic, place, or readability. For classification by topic, Singular Value Decomposition (SVD) is a well-known method. For classification by readability, the Flesch Reading Ease index computes the readability level of a document (e.g., easy, medium, advanced). In this paper, we propose combining SVD with either Cosine Similarity or Aggregated Similarity Matrices to categorize documents by both readability and topic. We experimentally compare the two proposed methods against the Flesch Reading Ease index and the vector-based cosine similarity method on a synthetic data set and a real one (Reuters-21578). Both proposed methods clearly outperform these baselines.
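For orientation, the Flesch Reading Ease baseline mentioned above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the experiments. It uses the standard formula, 206.835 − 1.015 × (words per sentence) − 84.6 × (syllables per word). The syllable counter is a rough vowel-group heuristic, and the function names and the cut-offs that map a score to easy, medium, or advanced are assumptions made here for illustration.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count vowel groups, minus a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Standard Flesch Reading Ease score; higher means easier to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def readability_level(score: float, easy_min: float = 60.0, medium_min: float = 30.0) -> str:
    """Map a score to a level; the two cut-offs are hypothetical, not from the paper."""
    if score >= easy_min:
        return "easy"
    if score >= medium_min:
        return "medium"
    return "advanced"


if __name__ == "__main__":
    sample = "The cat sat on the mat."
    score = flesch_reading_ease(sample)
    print(round(score, 3), readability_level(score))
```

Because the score depends only on sentence length and syllable counts, it ignores vocabulary and topic entirely, which is why it serves as a baseline for the readability task only.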
This work has been partially funded by the Greek GSRT (project number 10TUR/4-3-3) and the Turkish TUBITAK (project number 109E282) national agencies as part of the Greek-Turkish 2011-2012 bilateral scientific cooperation.
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Symeonidis, P., Kehayov, I., Manolopoulos, Y. (2012). Text Classification by Aggregation of SVD Eigenvectors. In: Morzy, T., Härder, T., Wrembel, R. (eds) Advances in Databases and Information Systems. ADBIS 2012. Lecture Notes in Computer Science, vol 7503. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33074-2_29
DOI: https://doi.org/10.1007/978-3-642-33074-2_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33073-5
Online ISBN: 978-3-642-33074-2
eBook Packages: Computer Science, Computer Science (R0)