A New Approach to the Multiaspect Text Categorization by Using the Support Vector Machines

  • Chapter
Challenging Problems and Solutions in Intelligent Systems

Part of the book series: Studies in Computational Intelligence ((SCI,volume 634))

Abstract

In our earlier work we introduced the multiaspect text categorization (MTC) task, which has its roots in the practical problem of managing document collections at many, if not all, commercial companies and, above all, public institutions. Specifically, it is a well-defined general problem that boils down to classifying textual documents at two levels: first, to a general category, and second, to a specific sequence of documents within that category. While the former task may be dealt with using standard text categorization techniques, the latter is more challenging, primarily because of the limited number of training documents. On the other hand, it is assumed that there is some natural logic, resulting for instance from rules and regulations, behind the succession of documents within the sequences, and that this logic can be exploited when assigning a new document to a proper sequence. We have studied the MTC problem in a number of papers and proposed some solutions to it. Here we propose a new solution based on support vector machines (SVMs), which are known to be a very effective technique for various classification tasks. We consider the application of SVMs in a specific context, determined by the characteristics of the MTC problem and by the specific data set used for the experiments. The use of SVMs has required a new, more sophisticated representation of the documents and their sequences, which has made it possible to obtain promising results in computational experiments. Moreover, the proposed approach is flexible and may be considerably modified and extended to cover many possible versions of the problem.



Acknowledgments

This work is supported by the National Science Centre (contract no. UMO-2011/01/B/ST6/06908).

Author information

Corresponding author

Correspondence to Sławomir Zadrożny.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Zadrożny, S., Kacprzyk, J., Gajewski, M. (2016). A New Approach to the Multiaspect Text Categorization by Using the Support Vector Machines. In: De Tré, G., Grzegorzewski, P., Kacprzyk, J., Owsiński, J., Penczek, W., Zadrożny, S. (eds) Challenging Problems and Solutions in Intelligent Systems. Studies in Computational Intelligence, vol 634. Springer, Cham. https://doi.org/10.1007/978-3-319-30165-5_13

  • DOI: https://doi.org/10.1007/978-3-319-30165-5_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-30164-8

  • Online ISBN: 978-3-319-30165-5

  • eBook Packages: Engineering, Engineering (R0)
