Abstract
Latent Dirichlet Allocation (LDA) is perhaps the most famous topic model, employed in many different contexts in Computer Science. The wide success of LDA is due to its effectiveness in dealing with large datasets, its competitive performance on several tasks (e.g., classification, clustering), and the interpretability of the solutions it provides. Learning LDA from training data usually requires iterative optimization techniques such as Expectation-Maximization, for which the choice of a good initialization is of crucial importance to reach an optimal solution. However, even though some clever solutions have been proposed, in practical applications this issue is typically disregarded, and the usual choice is to resort to random initialization.
In this paper we address the problem of initializing the LDA model with two novel strategies: the key idea is to perform repeated learning by employing a topic splitting/pruning strategy, such that each learning phase is initialized with an informative starting point derived from the previous phase.
The performance of the proposed splitting and pruning strategies has been assessed from a twofold perspective: i) the log-likelihood of the learned model (both on the training set and on a held-out set); ii) the coherence of the learned topics. The evaluation has been carried out on five different datasets, taken from heterogeneous contexts in the literature, showing promising results.
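The repeated-learning idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify how a topic is chosen for splitting or pruning, so everything below is an assumption made for illustration: the function names, the `fit` callable (a stand-in for any LDA trainer that accepts initial topic-word distributions and returns the learned topics together with a per-topic usage score), the rule of pruning the least-used topic, and the rule of splitting the most-used one.

```python
import random


def normalize(row):
    """Scale a list of non-negative weights so that it sums to 1."""
    total = sum(row)
    return [v / total for v in row]


def split_topic(topics, index, rng, noise=0.1):
    """Replace topic `index` with two perturbed copies (K -> K + 1 topics).

    Assumption: a small multiplicative perturbation is enough to let the
    next learning phase pull the two copies apart.
    """
    parent = topics[index]
    children = [
        normalize([p * (1.0 + noise * rng.random()) for p in parent])
        for _ in range(2)
    ]
    return topics[:index] + children + topics[index + 1:]


def prune_topic(topics, usage):
    """Drop the topic with the smallest usage score (K -> K - 1 topics)."""
    weakest = min(range(len(topics)), key=lambda k: usage[k])
    return [t for k, t in enumerate(topics) if k != weakest]


def learn_by_pruning(fit, init_topics, target_k):
    """Start with too many topics; prune one per phase until target_k remain."""
    topics, usage = fit(init_topics)
    while len(topics) > target_k:
        # Each phase is initialized from the result of the previous phase.
        topics, usage = fit(prune_topic(topics, usage))
    return topics


def learn_by_splitting(fit, init_topics, target_k, rng):
    """Start with too few topics; split one per phase until target_k exist."""
    topics, usage = fit(init_topics)
    while len(topics) < target_k:
        # Hypothetical criterion: split the most heavily used topic.
        widest = max(range(len(topics)), key=lambda k: usage[k])
        topics, usage = fit(split_topic(topics, widest, rng))
    return topics
```

In this sketch only the very first phase would need a random (or otherwise uninformed) initialization; every later call to `fit` starts from topics that have already been fitted to the data, which is the property the abstract emphasizes. The selection criteria actually used in the paper may differ from the ones assumed here.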
Notes
1. More details on the dataset, called PsychoFlickr, can be found in [10].
2. We employed the public Matlab LDA implementation available at http://lear.inrialpes.fr/~verbeek/software.php.
References
Asuncion, H., Asuncion, A., Taylor, R.: Software traceability with topic modeling. In: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering, ICSE 2010, vol. 1, pp. 95–104 (2010)
Bhattacharjee, A., Richards, W., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E., Lander, E., Wong, W., Johnson, B., Golub, T., Sugarbaker, D., Meyerson, M.: Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. 98(24), 13790–13795 (2001)
Bicego, M., Lovato, P., Perina, A., Fasoli, M., Delledonne, M., Pezzotti, M., Polverari, A., Murino, V.: Investigating topic models’ capabilities in expression microarray data classification. IEEE/ACM Trans. Comput. Biol. Bioinform. 9(6), 1831–1836 (2012)
Bicego, M., Murino, V., Figueiredo, M.: A sequential pruning strategy for the selection of the number of states in hidden Markov models. Pattern Recogn. Lett. 24(9), 1395–1407 (2003)
Blei, D.: Probabilistic topic models. Commun. ACM 55(4), 77–84 (2012)
Blei, D., Ng, A., Jordan, M.: Latent Dirichlet Allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Boley, D.: Principal direction divisive partitioning. Data Mining Knowl. Disc. 2(4), 325–344 (1998)
Bosch, A., Zisserman, A., Muñoz, X.: Scene classification via pLSA. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 517–530. Springer, Heidelberg (2006)
Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J., Blei, D.: Reading tea leaves: how humans interpret topic models. Adv. Neural Inf. Process. Syst. 22, 288–296 (2009)
Cristani, M., Vinciarelli, A., Segalin, C., Perina, A.: Unveiling the multimedia unconscious: Implicit cognitive processes and multimedia content analysis. In: Proceedings of the 21st ACM International Conference on Multimedia, pp. 213–222 (2013)
Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. Ser. B (Methodol.) 39, 1–38 (1977)
Dhillon, I., Mallela, S., Kumar, R.: A divisive information theoretic feature clustering algorithm for text classification. J. Mach. Learn. Res. 3, 1265–1287 (2003)
Elidan, G., Friedman, N.: The information bottleneck EM algorithm. In: Proceedings of the Uncertainty in Artificial Intelligence, pp. 200–208 (2002)
Farahat, A., Chen, F.: Improving probabilistic latent semantic analysis with principal component analysis. In: EACL (2006)
Fayyad, U., Reina, C., Bradley, P.: Initialization of iterative refinement clustering algorithms. In: Knowledge Discovery and Data Mining, pp. 194–198 (1998)
Figueiredo, M.A.T., Leitão, J.M.N., Jain, A.K.: On fitting mixture models. In: Hancock, E.R., Pelillo, M. (eds.) EMMCVPR 1999. LNCS, vol. 1654, pp. 54–69. Springer, Heidelberg (1999)
Frey, B., Jojic, N.: A comparison of algorithms for inference and learning in probabilistic graphical models. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1–25 (2005)
Girolami, M., Kabán, A.: On an equivalence between PLSI and LDA. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 433–434 (2003)
Hazen, T.: Direct and latent modeling techniques for computing spoken document similarity. In: 2010 IEEE Spoken Language Technology Workshop (SLT), pp. 366–371 (2010)
Hofmann, T.: Unsupervised learning by probabilistic latent semantic analysis. Mach. Learn. 42(1–2), 177–196 (2001)
Lienou, M., Maitre, H., Datcu, M.: Semantic annotation of satellite images using Latent Dirichlet Allocation. IEEE Geosci. Remote Sens. Lett. 7(1), 28–32 (2010)
Mimno, D., Wallach, H., Talley, E., Leenders, M., McCallum, A.: Optimizing semantic coherence in topic models. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 262–272 (2011)
Perina, A., Kim, D., Turski, A., Jojic, N.: Skim-reading thousands of documents in one minute: data indexing and visualization for multifarious search. In: Workshop on Interactive Data Exploration and Analytics, IDEA 2014 at KDD (2014)
Perina, A., Lovato, P., Jojic, N.: Bags of words models of epitope sets: HIV viral load regression with counting grids. In: Proceedings of International Pacific Symposium on Biocomputing (PSB), pp. 288–299 (2014)
Quinn, K., Monroe, B., Colaresi, M., Crespin, M., Radev, D.: How to analyze political attention with minimal assumptions and costs. Am. J. Polit. Sci. 54(1), 209–228 (2010)
Roberts, S., Husmeier, D., Rezek, I., Penny, W.: Bayesian approaches to Gaussian mixture modeling. IEEE Trans. Pattern Anal. Mach. Intell. 20(11), 1133–1142 (1998)
Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York (1986)
Segalin, C., Perina, A., Cristani, M.: Personal aesthetics for soft biometrics: a generative multi-resolution approach. In: Proceedings of the 16th International Conference on Multimodal Interaction, pp. 180–187 (2014)
Shivashankar, S., Srivathsan, S., Ravindran, B., Tendulkar, A.: Multi-view methods for protein structure comparison using Latent Dirichlet Allocation. Bioinformatics 27(13), i61–i68 (2011)
Smaragdis, P., Shashanka, M., Raj, B.: Topic models for audio mixture analysis. In: NIPS Workshop on Applications for Topic Models: Text and Beyond, pp. 1–4 (2009)
Steinbach, M., Karypis, G., Kumar, V.: A comparison of document clustering techniques. In: KDD Workshop on Text Mining, vol. 400, pp. 525–526 (2000)
Stevens, K., Kegelmeyer, P., Andrzejewski, D., Buttler, D.: Exploring topic coherence over many models and many topics. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 952–961 (2012)
Ueda, N., Nakano, R.: Deterministic annealing EM algorithm. Neural Netw. 11(2), 271–282 (1998)
Wang, C., Blei, D., Li, F.F.: Simultaneous image classification and annotation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1903–1910 (2009)
Wu, C.: On the convergence properties of the EM algorithm. Ann. Stat. 11(1), 95–103 (1983)
Yang, S., Long, B., Smola, A., Sadagopan, N., Zheng, Z., Zha, H.: Like like alike: joint friendship and interest propagation in social networks. In: Proceedings of the 20th International Conference on World Wide Web (WWW), WWW 2011, pp. 537–546 (2011)
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Lovato, P., Bicego, M., Murino, V., Perina, A. (2015). Robust Initialization for Learning Latent Dirichlet Allocation. In: Feragen, A., Pelillo, M., Loog, M. (eds) Similarity-Based Pattern Recognition. SIMBAD 2015. Lecture Notes in Computer Science, vol 9370. Springer, Cham. https://doi.org/10.1007/978-3-319-24261-3_10
DOI: https://doi.org/10.1007/978-3-319-24261-3_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24260-6
Online ISBN: 978-3-319-24261-3
eBook Packages: Computer Science, Computer Science (R0)