
A Solution of the Multiaspect Text Categorization Problem by a Hybrid HMM and LDA Based Technique

  • Sławomir Zadrożny
  • Janusz Kacprzyk
  • Marek Gajewski
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 610)

Abstract

In our previous work we introduced the novel concept of the multiaspect text categorization (MTC) task, a special, extended form of the text categorization (TC) problem that is widely studied in information retrieval. The essence of the MTC problem is the classification of documents on two levels: first, on a more or less standard level of thematic categories, and then on the level of document sequences, which is much less studied in the literature. The latter stage of classification, which is far more challenging, is the main focus of this paper. A promising way of attacking it requires modeling the connections between the documents that form sequences. To solve this problem we propose a novel hybrid approach that combines a well-known technique for modeling sequences, Hidden Markov Models (HMM), with the Latent Dirichlet Allocation (LDA) technique for advanced document representation. We present the details of the proposed approach as well as the results of some computational experiments.

Keywords

Multiaspect text categorization · Sequences of documents · HMM · LDA

Notes

Acknowledgments

This work is supported by the National Science Centre under contracts no. UMO-2011/01/B/ST6/06908 and UMO-2012/05/B/ST6/03068.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Sławomir Zadrożny (1), corresponding author
  • Janusz Kacprzyk (1)
  • Marek Gajewski (1)
  1. Systems Research Institute, Polish Academy of Sciences, Warszawa, Poland
