A New Approach to the Multiaspect Text Categorization by Using the Support Vector Machines

Zadrożny, Sławomir; Kacprzyk, Janusz; Gajewski, Marek

doi:10.1007/978-3-319-30165-5_13

Part of the book series: Studies in Computational Intelligence ((SCI,volume 634))

Abstract

In our earlier work we introduced the multiaspect text categorization (MTC) task, which is rooted in the practical problem of managing document collections at many, if not all, commercial companies and, above all, public institutions. It is a well-defined general problem that boils down to classifying textual documents at two levels: first, to a general category and, second, to a specific sequence of documents within that category. While the former task may be handled with standard text categorization techniques, the latter is more challenging, primarily because few training documents are available. On the other hand, we assume that some natural logic, resulting for instance from rules and regulations, governs the succession of documents within a sequence, and that this logic can be exploited when assigning a new document to the proper sequence. We have studied the MTC problem in a number of papers and proposed several solutions to it. Here we propose a new solution based on support vector machines (SVMs), which are known to be very effective for a variety of classification tasks. We consider the application of SVMs in a specific context, determined by the characteristics of the MTC problem and by the specific data set used for experimentation. Using SVMs has required a new, more sophisticated representation of the documents and their sequences, which has made it possible to obtain promising results in computational experiments. Moreover, the proposed approach is flexible and may be considerably modified and extended to cover many possible versions of the problem.
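
To make the two-level structure of the task concrete, the following is a minimal sketch in Python. It is not the authors' method or document representation: it assumes a plain bag-of-words vector, a linear SVM without a bias term trained by subgradient descent on the hinge loss, and one-vs-rest voting. The toy documents, the category and sequence labels, and all function names are hypothetical. The sketch also ignores the order of documents within a sequence, which the abstract names as a source of exploitable information.

```python
from collections import Counter

# Hypothetical toy collection: (text, category, sequence id).
DOCS = [
    ("grant application for robotics project", "grants", "g1"),
    ("review of robotics grant application", "grants", "g1"),
    ("grant application for biology project", "grants", "g2"),
    ("review of biology grant application", "grants", "g2"),
    ("tender notice for laptops purchase", "procurement", "p1"),
    ("offer for laptops tender", "procurement", "p1"),
    ("tender notice for furniture purchase", "procurement", "p2"),
    ("offer for furniture tender", "procurement", "p2"),
]

def vectorize(text):
    """Bag-of-words term counts (an assumed, deliberately simple representation)."""
    return Counter(text.lower().split())

def score(w, x):
    """Linear decision value w . x for sparse dict vectors."""
    return sum(w.get(term, 0.0) * count for term, count in x.items())

def train_svm(xs, ys, epochs=50, eta=0.1, lam=0.01):
    """Binary linear SVM (no bias): subgradient descent on the L2-regularized hinge loss."""
    w = {}
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            violated = y * score(w, x) < 1      # margin check before the update
            for term in w:                      # shrinkage from the regularizer
                w[term] *= 1 - eta * lam
            if violated:                        # hinge-loss subgradient step
                for term, count in x.items():
                    w[term] = w.get(term, 0.0) + eta * y * count
    return w

def train_one_vs_rest(texts, labels):
    """One binary SVM per label; the label with the highest score wins."""
    xs = [vectorize(t) for t in texts]
    return {lab: train_svm(xs, [1 if l == lab else -1 for l in labels])
            for lab in sorted(set(labels))}

def predict(models, text):
    x = vectorize(text)
    return max(models, key=lambda lab: score(models[lab], x))

def fit_two_level(docs):
    """Level 1: category classifier. Level 2: one sequence classifier per category."""
    category_model = train_one_vs_rest([t for t, _, _ in docs],
                                       [c for _, c, _ in docs])
    sequence_models = {}
    for category in category_model:
        subset = [(t, s) for t, c, s in docs if c == category]
        sequence_models[category] = train_one_vs_rest([t for t, _ in subset],
                                                      [s for _, s in subset])
    return category_model, sequence_models

def classify(model, text):
    """Assign a new document to a category, then to a sequence within it."""
    category_model, sequence_models = model
    category = predict(category_model, text)
    return category, predict(sequence_models[category], text)

model = fit_two_level(DOCS)
print(classify(model, "decision on robotics funding"))
```

The second level is trained on only the documents of one category, so each sequence classifier sees very few examples. This mirrors the difficulty the abstract attributes to the sequence-level task.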

Acknowledgments

This work is supported by the National Science Centre (contract no. UMO-2011/01/B/ST6/06908).

Author information

Authors and Affiliations

Systems Research Institute, Polish Academy of Sciences, ul. Newelska 6, 01-447, Warszawa, Poland
Sławomir Zadrożny, Janusz Kacprzyk & Marek Gajewski

Corresponding author

Correspondence to Sławomir Zadrożny.

Editor information

Editors and Affiliations

Dept. of Telecommunication & Inform proc, Ghent University, Gent, Belgium
Guy De Tré
Faculty of Maths and Information Science, Warsaw University of Technology, Warszawa, Poland
Przemysław Grzegorzewski
Polish Academy of Sciences, Systems Research Institute, Warszawa, Poland
Janusz Kacprzyk
Polish Academy of Sciences, Systems Research Institute, Warszawa, Poland
Jan W. Owsiński
Polish Academy of Sciences, Institute of Computer Science, Warszawa, Poland
Wojciech Penczek
Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Sławomir Zadrożny

About this chapter

Cite this chapter

Zadrożny, S., Kacprzyk, J., Gajewski, M. (2016). A New Approach to the Multiaspect Text Categorization by Using the Support Vector Machines. In: De Tré, G., Grzegorzewski, P., Kacprzyk, J., Owsiński, J., Penczek, W., Zadrożny, S. (eds) Challenging Problems and Solutions in Intelligent Systems. Studies in Computational Intelligence, vol 634. Springer, Cham. https://doi.org/10.1007/978-3-319-30165-5_13

DOI: https://doi.org/10.1007/978-3-319-30165-5_13
Published: 26 March 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-30164-8
Online ISBN: 978-3-319-30165-5
eBook Packages: Engineering, Engineering (R0)
