Clustering Textual Data by Latent Dirichlet Allocation: Applications and Extensions to Hierarchical Data

  • Matteo Dimai
  • Nicola Torelli
Conference paper
Part of the Studies in Classification, Data Analysis, and Knowledge Organization book series (STUDIES CLASS)


Latent Dirichlet Allocation (LDA) is a generative probabilistic model that can be used to describe and analyse textual data. We extend the basic LDA model to search and classify a large set of administrative documents, taking into account the clear hierarchical structure of the textual data. The extension can be viewed as a general approach to the analysis of short texts that are semantically linked to larger texts. Some preliminary empirical evidence supporting the proposed model is presented.
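
To illustrate what "generative probabilistic model" means here, the following is a minimal sketch of the basic LDA generative process as it is commonly described (Blei et al., 2003): each topic is a distribution over words, each document draws its own topic proportions, and each word is produced by first drawing a topic and then a word from that topic. It does not reproduce the hierarchical extension proposed in the paper. The function names, the symmetric hyperparameters `alpha` and `beta`, and the toy vocabulary are illustrative assumptions.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(n_docs, doc_length, n_topics, vocab,
                    alpha=0.5, beta=0.5, seed=0):
    """Simulate documents from the basic LDA generative process.

    alpha and beta are assumed symmetric Dirichlet hyperparameters for
    the per-document topic proportions and the per-topic word
    distributions, respectively.
    """
    rng = random.Random(seed)
    # One distribution over the vocabulary for each topic.
    topics = [sample_dirichlet([beta] * len(vocab), rng)
              for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # Topic proportions specific to this document.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_length):
            # Draw a topic for this word position, then a word from it.
            z = rng.choices(range(n_topics), weights=theta)[0]
            w = rng.choices(vocab, weights=topics[z])[0]
            words.append(w)
        corpus.append((theta, words))
    return topics, corpus


if __name__ == "__main__":
    # Hypothetical toy vocabulary, chosen only for illustration.
    vocab = ["budget", "tax", "school", "teacher", "road", "bridge"]
    topics, corpus = generate_corpus(n_docs=3, doc_length=8,
                                     n_topics=2, vocab=vocab)
    for theta, words in corpus:
        print([round(p, 2) for p in theta], " ".join(words))
```

Fitting LDA to real text runs this process in reverse: given only the observed words, inference recovers plausible topic distributions and per-document topic proportions, which can then be used to group or search documents.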


Latent Dirichlet Allocation; Latent Semantic Analysis; Probabilistic Latent Semantic Analysis; Latent Dirichlet Allocation Model; Probabilistic Generative Model
These keywords were added by machine and not by the authors.



Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  1. Dept. of Economics and Statistics, University of Trieste, Trieste, Italy
