Abstract
The massive volume of digital text, and its delivery as a stream, pose challenges for traditional inference algorithms. Recent advances in stochastic inference have made it feasible to learn topic models from very large collections of documents. In this paper, however, we point out that many existing approaches are prone to overfitting on extremely large or unbounded datasets, and that the risk of overfitting is particularly high in streaming environments. This finding suggests using regularization in stochastic inference. We then propose a novel stochastic algorithm for learning latent Dirichlet allocation (LDA) that applies regularization when updating the global parameters and uses sparse Gibbs sampling for local inference. We study the performance of our algorithm on two massive datasets and demonstrate that it surpasses existing algorithms in several respects.
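To make the kind of update the abstract describes more concrete, below is a minimal, illustrative sketch in Python/NumPy. It is not the authors' algorithm: the hyperparameters (`K`, `V`, `eta`, `alpha`, `rho`, `reg`, `D`), the plain (non-sparse) Gibbs sweep used for the local step, and the ridge-like shrinkage toward the prior standing in for the paper's regularizer are all assumptions made purely for illustration.

```python
# Illustrative sketch (NOT the authors' exact algorithm) of one stochastic
# step for LDA: (i) a few Gibbs sweeps over per-document (local) variables,
# (ii) a stochastic update of the global topic-word parameters with a simple
# shrinkage-style regularizer. All constants below are assumed for the demo.
import numpy as np

rng = np.random.default_rng(0)

K, V = 10, 1000          # number of topics, vocabulary size (assumed)
eta, alpha = 0.01, 0.1   # Dirichlet priors on topics and doc-topic proportions
D = 100000               # assumed corpus size used to rescale minibatch statistics

lam = rng.gamma(1.0, 1.0, size=(K, V)) + eta   # global topic-word parameters

def expected_beta(lam):
    """Point estimate of topic-word distributions from the global parameters."""
    return lam / lam.sum(axis=1, keepdims=True)

def local_gibbs(doc_words, beta, n_sweeps=5):
    """Collapsed Gibbs sampling over token-topic assignments for one document.
    Returns the (topic, word) count matrix contributed by this document."""
    z = rng.integers(K, size=len(doc_words))
    n_k = np.bincount(z, minlength=K).astype(float)      # doc-topic counts
    for _ in range(n_sweeps):
        for i, w in enumerate(doc_words):
            n_k[z[i]] -= 1.0
            p = (n_k + alpha) * beta[:, w]               # unnormalized conditional
            z[i] = rng.choice(K, p=p / p.sum())
            n_k[z[i]] += 1.0
    stats = np.zeros((K, V))
    for i, w in enumerate(doc_words):
        stats[z[i], w] += 1.0
    return stats

def global_update(lam, minibatch, rho=0.05, reg=0.1):
    """Stochastic update of the global parameters, followed by an assumed
    ridge-like shrinkage toward the prior acting as the regularizer."""
    beta = expected_beta(lam)
    stats = sum(local_gibbs(doc, beta) for doc in minibatch)
    noisy = eta + (D / len(minibatch)) * stats           # rescaled minibatch estimate
    lam_new = (1.0 - rho) * lam + rho * noisy            # stochastic (online) step
    return (lam_new + reg * eta) / (1.0 + reg)           # shrink toward the prior

# One streaming step on a random minibatch of short synthetic "documents".
minibatch = [rng.integers(V, size=rng.integers(20, 50)) for _ in range(8)]
lam = global_update(lam, minibatch)
print(expected_beta(lam)[0, :5])
```

The sketch stores dense count matrices for clarity; the paper's sparse Gibbs sampling instead exploits the fact that each document touches only a few topics and words, which is what makes the local step cheap at scale.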
Notes
1. SSI was taken from http://www.cs.princeton.edu/~blei/downloads/onlineldavb.tar.
2. The data were retrieved from http://archive.ics.uci.edu/ml/datasets/.
Acknowledgments
This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under Grant Number 102.05-2014.28 and by the Air Force Office of Scientific Research (AFOSR), Asian Office of Aerospace Research & Development (AOARD), and US Army International Technology Center, Pacific (ITC-PAC) under Award Number FA2386-15-1-4011. Khoat Than is also funded by Vietnam Institute for Advanced Study in Mathematics (VIASM).
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Doan, T., Than, K. (2017). Sparse Stochastic Inference with Regularization. In: Kim, J., Shim, K., Cao, L., Lee, J.G., Lin, X., Moon, Y.S. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2017. Lecture Notes in Computer Science, vol 10234. Springer, Cham. https://doi.org/10.1007/978-3-319-57454-7_35
DOI: https://doi.org/10.1007/978-3-319-57454-7_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-57453-0
Online ISBN: 978-3-319-57454-7