Abstract
The massive volume of digital text, and its delivery as a stream, pose challenges for traditional inference algorithms. Recent advances in stochastic inference have made it feasible to learn topic models from very large collections of documents. In this paper, however, we point out that many existing approaches are prone to overfitting on extremely large or unbounded datasets, and that the risk of overfitting is particularly high in streaming environments. This finding suggests using regularization in stochastic inference. We then propose a novel stochastic algorithm for learning latent Dirichlet allocation (LDA) that applies regularization when updating the global parameters and uses sparse Gibbs sampling for local inference. We study the performance of our algorithm on two massive datasets and demonstrate that it surpasses existing algorithms in several respects.
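To make the kind of update the abstract describes more concrete, below is a minimal, illustrative sketch in Python/NumPy. It is not the authors' algorithm: the hyperparameters (`K`, `V`, `eta`, `alpha`, `rho`, `reg`, `D`), the plain (non-sparse) Gibbs sweep used for the local step, and the ridge-like shrinkage toward the prior standing in for the paper's regularizer are all assumptions made purely for illustration.

```python
# Illustrative sketch (NOT the authors' exact algorithm) of one stochastic
# step for LDA: (i) a few Gibbs sweeps over per-document (local) variables,
# (ii) a stochastic update of the global topic-word parameters with a simple
# shrinkage-style regularizer. All constants below are assumed for the demo.
import numpy as np

rng = np.random.default_rng(0)

K, V = 10, 1000          # number of topics, vocabulary size (assumed)
eta, alpha = 0.01, 0.1   # Dirichlet priors on topics and doc-topic proportions
D = 100000               # assumed corpus size used to rescale minibatch statistics

lam = rng.gamma(1.0, 1.0, size=(K, V)) + eta   # global topic-word parameters

def expected_beta(lam):
    """Point estimate of topic-word distributions from the global parameters."""
    return lam / lam.sum(axis=1, keepdims=True)

def local_gibbs(doc_words, beta, n_sweeps=5):
    """Collapsed Gibbs sampling over token-topic assignments for one document.
    Returns the (topic, word) count matrix contributed by this document."""
    z = rng.integers(K, size=len(doc_words))
    n_k = np.bincount(z, minlength=K).astype(float)      # doc-topic counts
    for _ in range(n_sweeps):
        for i, w in enumerate(doc_words):
            n_k[z[i]] -= 1.0
            p = (n_k + alpha) * beta[:, w]               # unnormalized conditional
            z[i] = rng.choice(K, p=p / p.sum())
            n_k[z[i]] += 1.0
    stats = np.zeros((K, V))
    for i, w in enumerate(doc_words):
        stats[z[i], w] += 1.0
    return stats

def global_update(lam, minibatch, rho=0.05, reg=0.1):
    """Stochastic update of the global parameters, followed by an assumed
    ridge-like shrinkage toward the prior acting as the regularizer."""
    beta = expected_beta(lam)
    stats = sum(local_gibbs(doc, beta) for doc in minibatch)
    noisy = eta + (D / len(minibatch)) * stats           # rescaled minibatch estimate
    lam_new = (1.0 - rho) * lam + rho * noisy            # stochastic (online) step
    return (lam_new + reg * eta) / (1.0 + reg)           # shrink toward the prior

# One streaming step on a random minibatch of short synthetic "documents".
minibatch = [rng.integers(V, size=rng.integers(20, 50)) for _ in range(8)]
lam = global_update(lam, minibatch)
print(expected_beta(lam)[0, :5])
```

The sketch stores dense count matrices for clarity; the paper's sparse Gibbs sampling instead exploits the fact that each document touches only a few topics and words, which is what makes the local step cheap at scale.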
Notes
1. SSI was taken from http://www.cs.princeton.edu/~blei/downloads/onlineldavb.tar.
2. The data were retrieved from http://archive.ics.uci.edu/ml/datasets/.
Acknowledgments
This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under Grant Number 102.05-2014.28 and by the Air Force Office of Scientific Research (AFOSR), Asian Office of Aerospace Research & Development (AOARD), and US Army International Technology Center, Pacific (ITC-PAC) under Award Number FA2386-15-1-4011. Khoat Than is also funded by Vietnam Institute for Advanced Study in Mathematics (VIASM).
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Doan, T., Than, K. (2017). Sparse Stochastic Inference with Regularization. In: Kim, J., Shim, K., Cao, L., Lee, J.G., Lin, X., Moon, Y.S. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2017. Lecture Notes in Computer Science, vol 10234. Springer, Cham. https://doi.org/10.1007/978-3-319-57454-7_35
DOI: https://doi.org/10.1007/978-3-319-57454-7_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-57453-0
Online ISBN: 978-3-319-57454-7