Abstract
In our earlier work we introduced the multiaspect text categorization (MTC) task, which has its roots in the practical problem of managing document collections at many, if not all, commercial companies and, above all, public institutions. It is a well-defined general problem that boils down to classifying textual documents at two levels: first, to a general category, and second, to a specific sequence of documents within that category. While the former task may be handled with standard text categorization techniques, the latter is more challenging, primarily because of the limited number of training documents. On the other hand, it is assumed that some natural logic, resulting for instance from rules and regulations, governs the succession of documents within a sequence, and that this logic can be exploited when deciding which sequence a new document should be assigned to. We have studied the MTC problem in a number of papers and proposed some solutions to it. Here we propose a new solution based on support vector machines (SVMs), which are known to be a very effective technique for various classification tasks. We consider the application of SVMs in a specific context, determined by the characteristics of the MTC problem and by the specific data set used for the experiments. The use of SVMs has required a new, more sophisticated representation of the documents and their sequences, which has made it possible to obtain promising results in computational experiments. Moreover, the proposed approach is flexible and may be considerably modified and extended to cover many possible versions of the problem.
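The two-level structure of the task can be illustrated with a minimal Python sketch. It is not the SVM-based method of the chapter: as a deliberately simple stand-in it uses bag-of-words vectors and nearest-centroid matching by cosine similarity, and it ignores the ordering of documents within a sequence. The function names, the category and sequence labels, and the toy documents are all hypothetical and serve only to show how a new document is first assigned to a category and then to a sequence within that category.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    """Lowercased term-frequency vector of a document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroids(labelled_docs):
    """Sum the term vectors of all documents sharing a label.

    Sums are used instead of means because cosine similarity
    does not depend on the length of the vectors.
    """
    sums = defaultdict(Counter)
    for label, text in labelled_docs:
        sums[label].update(bag_of_words(text))
    return sums


def classify_two_level(new_doc, collection):
    """Assign new_doc to a (category, sequence) pair.

    collection is a list of (category, sequence_id, text) triples.
    """
    vec = bag_of_words(new_doc)
    # Level 1: choose the general category.
    cat_centroids = centroids((cat, text) for cat, _, text in collection)
    category = max(cat_centroids, key=lambda c: cosine(vec, cat_centroids[c]))
    # Level 2: choose a sequence, looking only inside that category.
    seq_centroids = centroids(
        (seq, text) for cat, seq, text in collection if cat == category
    )
    sequence = max(seq_centroids, key=lambda s: cosine(vec, seq_centroids[s]))
    return category, sequence


# Hypothetical toy collection: two categories, three sequences (cases).
collection = [
    ("permits", "case-1", "building permit application garage"),
    ("permits", "case-1", "building permit decision garage approved"),
    ("permits", "case-2", "building permit application warehouse"),
    ("taxes", "case-3", "tax return correction invoice"),
    ("taxes", "case-3", "tax return reminder invoice overdue"),
]

result = classify_two_level(
    "appeal against building permit decision garage", collection
)
print(result)
```

The second level sees only a handful of documents per sequence, which is the scarcity of training data the abstract points to. In a fuller treatment each level would presumably use a trained classifier such as an SVM, together with a representation that captures the order of documents in a sequence.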
Acknowledgments
This work is supported by the National Science Centre (contract no. UMO-2011/01/B/ST6/06908).
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Zadrożny, S., Kacprzyk, J., Gajewski, M. (2016). A New Approach to the Multiaspect Text Categorization by Using the Support Vector Machines. In: Trė, G., Grzegorzewski, P., Kacprzyk, J., Owsiński, J., Penczek, W., Zadrożny, S. (eds) Challenging Problems and Solutions in Intelligent Systems. Studies in Computational Intelligence, vol 634. Springer, Cham. https://doi.org/10.1007/978-3-319-30165-5_13
DOI: https://doi.org/10.1007/978-3-319-30165-5_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-30164-8
Online ISBN: 978-3-319-30165-5
eBook Packages: Engineering, Engineering (R0)