Empirical Evaluation of Inference Technique for Topic Models

  • Pooja Kherwa
  • Poonam Bansal
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 713)

Abstract

Topic modelling is a technique for inferring themes and topics from a large collection of documents, and latent Dirichlet allocation (LDA) is the most widely used technique in the topic modelling literature. LDA is a generative model that produces documents from multinomial distributions; run in reverse, its parameters are estimated to recover the topics and themes underlying a collection of unstructured documents. Many approximate posterior inference algorithms exist for topic models, and the two dominant inference techniques for LDA are variational expectation maximization (VEM) and Gibbs sampling. In this paper, we evaluate the performance of VEM and Gibbs sampling on the Associated Press data set and the Accepted Papers data set by fitting topic models with LDA, using perplexity and entropy as the evaluation metrics. We found that for a large data set such as the Associated Press data set, with 2000 documents, variational inference is the better inference technique, whereas for a small data set such as Accepted Papers, Gibbs sampling is the better choice. A further advantage of Gibbs sampling is that, by running a Markov chain, it avoids getting trapped in local minima, while variational inference provides fast, deterministic solutions.
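A minimal sketch of the evaluation described above, assuming Python with the gensim library. The tiny corpus, topic count, and training settings below are illustrative stand-ins for the Associated Press and Accepted Papers data sets, and gensim's online variational Bayes stands in for the VEM inference evaluated in the paper; this is not the authors' own experimental code.

    # Sketch: fit LDA with variational inference and report the paper's two
    # metrics, perplexity and entropy. Assumptions: gensim is installed; the
    # toy corpus below replaces the Associated Press / Accepted Papers data.
    import math

    from gensim import corpora
    from gensim.models import LdaModel

    docs = [
        "oil prices rose sharply in world markets".split(),
        "the senate passed a new trade bill today".split(),
        "researchers submitted papers on machine learning".split(),
        "topic models infer latent themes from documents".split(),
    ]

    dictionary = corpora.Dictionary(docs)               # word <-> id mapping
    corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors

    # gensim's LdaModel performs (online) variational Bayes inference,
    # a close relative of the VEM technique evaluated in the paper.
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=2, passes=10, random_state=0)

    # log_perplexity returns the per-word likelihood bound; gensim itself
    # reports perplexity as 2 ** (-bound), where lower is a better fit.
    bound = lda.log_perplexity(corpus)
    print("perplexity:", 2 ** (-bound))

    # Entropy of a document's inferred topic distribution (second metric):
    # lower entropy means the document is concentrated on fewer topics.
    topic_dist = lda.get_document_topics(corpus[0], minimum_probability=0)
    entropy = -sum(p * math.log(p) for _, p in topic_dist if p > 0)
    print("entropy of doc 0:", entropy)

For the Gibbs-sampling side of the comparison, a collapsed Gibbs sampler (for example, the `lda` package on PyPI) could be fitted to the same bag-of-words data and scored with the same two metrics.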

Keywords

Topic model · Latent Dirichlet allocation · Inference · Gibbs sampling · Multinomial distribution

Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. Maharaja Surajmal Institute of Technology, New Delhi, India