Clustering Textual Data by Latent Dirichlet Allocation: Applications and Extensions to Hierarchical Data
Latent Dirichlet Allocation (LDA) is a generative probabilistic model that can be used to describe and analyse textual data. We extend the basic LDA model to search and classify a large set of administrative documents, taking into account the clear hierarchical structure of the textual data. This can be considered a general approach to the analysis of short texts that are semantically linked to larger texts. Some preliminary empirical evidence that supports the proposed model is presented.
Keywords: Latent Dirichlet Allocation; Latent Semantic Analysis; Probabilistic Latent Semantic Analysis; Latent Dirichlet Allocation Model; Probabilistic Generative Model