A Solution of the Multiaspect Text Categorization Problem by a Hybrid HMM and LDA Based Technique

Zadrożny, Sławomir; Kacprzyk, Janusz; Gajewski, Marek

doi:10.1007/978-3-319-40596-4_19

Part of the book series: Communications in Computer and Information Science (CCIS, volume 610)

Included in the following conference series:

International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems

Abstract

In our previous work we introduced the multiaspect text categorization (MTC) task, a special, extended form of the text categorization (TC) problem that is widely studied in information retrieval. The essence of the MTC problem is the classification of documents on two levels: first, on the more or less standard level of thematic categories, and then on the level of document sequences, which is much less studied in the literature. The latter stage of classification, which is by far the more challenging, is the main focus of this paper. A promising way of attacking it requires some kind of modeling of the connections between the documents forming sequences. To solve this problem we propose a novel hybrid approach that combines a well-known technique for modeling sequences, the Hidden Markov Model (HMM), with the Latent Dirichlet Allocation (LDA) technique for advanced document representation. We present the details of the proposed approach as well as the results of some computational experiments.
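The hybrid idea can be pictured with a small sketch. This is not the authors' implementation, only an illustration under stated assumptions: each document is already represented by an LDA topic distribution, the dominant topic of a document serves as the observation symbol, each document sequence has its own discrete HMM, and a new document is assigned to the sequence whose HMM gives the highest conditional log-probability for its symbol. All names (`forward_log_likelihood`, `dominant_topic`, `best_case`, `case_A`, `case_B`) and all parameter values are hypothetical; in practice the HMM parameters would be estimated from data rather than set by hand.

```python
import math


def forward_log_likelihood(obs, start, trans, emit):
    """Log-likelihood of a symbol sequence under a discrete HMM (scaled forward algorithm)."""
    if not obs:
        return 0.0
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate the normalized state beliefs one step, then weight by the emission.
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik


def dominant_topic(topic_dist):
    """Index of the most probable topic in an LDA-style topic distribution."""
    return max(range(len(topic_dist)), key=lambda k: topic_dist[k])


def best_case(new_doc, cases, hmms):
    """Score each sequence by log P(new symbol | symbols seen so far) under its HMM."""
    symbol = dominant_topic(new_doc)
    scores = {}
    for name, docs in cases.items():
        seq = [dominant_topic(d) for d in docs]
        start, trans, emit = hmms[name]
        scores[name] = (
            forward_log_likelihood(seq + [symbol], start, trans, emit)
            - forward_log_likelihood(seq, start, trans, emit)
        )
    return max(scores, key=scores.get), scores


# Hypothetical topic distributions (two topics) for documents already in each sequence.
cases = {
    "case_A": [[0.8, 0.2], [0.7, 0.3]],
    "case_B": [[0.2, 0.8], [0.1, 0.9]],
}
# Hypothetical hand-set HMMs: (start probabilities, transition matrix, emission matrix).
sticky = [[0.8, 0.2], [0.2, 0.8]]
hmms = {
    "case_A": ([0.5, 0.5], sticky, [[0.9, 0.1], [0.2, 0.8]]),
    "case_B": ([0.5, 0.5], sticky, [[0.8, 0.2], [0.1, 0.9]]),
}

best, scores = best_case([0.9, 0.1], cases, hmms)
print(best, scores)
```

Reducing each document to its dominant topic keeps the emission model discrete and the sketch short; working with the full topic distribution would need a continuous emission model instead.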

Notes

1. To shorten the notation we will denote a topic in the same way as the distribution over words that defines it.
2. To simplify the notation we denote this vector as \(d\), i.e., in the same way as the document \(d\in D\).
3. All text processing considered in this paper is carried out separately for each category \(c\in C\); this will not be explicitly mentioned again. Moreover, when we refer to the collection of documents, we have in mind its subset comprising the documents belonging to one category.

Acknowledgments

This work is supported by the National Science Centre under contracts no. UMO-2011/01/B/ST6/06908 and UMO-2012/05/B/ST6/03068.

Author information

Authors and Affiliations

Systems Research Institute, Polish Academy of Sciences, ul. Newelska 6, 01-447, Warszawa, Poland
Sławomir Zadrożny, Janusz Kacprzyk & Marek Gajewski

Corresponding author

Correspondence to Sławomir Zadrożny.

Editor information

Editors and Affiliations

INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
Joao Paulo Carvalho
LIP6, Université Pierre et Marie Curie, Paris, France
Marie-Jeanne Lesot
School of Industrial Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands
Uzay Kaymak
IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
Susana Vieira
LIP6, Université Pierre et Marie Curie, CNRS, Paris, France
Bernadette Bouchon-Meunier
Machine Intelligence Institute, Iona College, New Rochelle, New York, USA
Ronald R. Yager

About this paper

Cite this paper

Zadrożny, S., Kacprzyk, J., Gajewski, M. (2016). A Solution of the Multiaspect Text Categorization Problem by a Hybrid HMM and LDA Based Technique. In: Carvalho, J., Lesot, MJ., Kaymak, U., Vieira, S., Bouchon-Meunier, B., Yager, R. (eds) Information Processing and Management of Uncertainty in Knowledge-Based Systems. IPMU 2016. Communications in Computer and Information Science, vol 610. Springer, Cham. https://doi.org/10.1007/978-3-319-40596-4_19

DOI: https://doi.org/10.1007/978-3-319-40596-4_19
Published: 11 June 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-40595-7
Online ISBN: 978-3-319-40596-4
eBook Packages: Computer Science, Computer Science (R0)