Topic modeling for large-scale text data

Abstract

This paper develops a novel online algorithm, moving average stochastic variational inference (MASVI), which reuses the natural-gradient estimates obtained in previous iterations to smooth out the noisy natural gradient of the current iteration. We analyze the convergence properties of the proposed algorithm and conduct experiments on two large-scale collections that contain millions of documents. The results show that, compared with stochastic variational inference (SVI) and stochastic gradient Riemannian Langevin dynamics (SGRLD), our algorithm converges faster and achieves better performance.
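To make the abstract's core idea concrete, the sketch below smooths noisy natural-gradient updates with a moving average over recent iterations, using the online-LDA update form of Hoffman et al. (2010) as the base. This is a minimal illustration under assumed names and simulated statistics (K, V, WINDOW, true_stats are all placeholders), not the paper's actual implementation or notation.

```python
# Illustrative sketch only: smoothing noisy natural gradients with a
# moving average over recent iterations, in the spirit of MASVI applied
# to online LDA (Hoffman et al., 2010). The mini-batch statistics are
# simulated; a real implementation would compute them in a local E-step.
import numpy as np
from collections import deque

rng = np.random.default_rng(0)

K, V = 10, 1000                  # number of topics, vocabulary size
ETA = 0.01                       # Dirichlet prior on topics
WINDOW = 5                       # how many past gradients to average

lam = rng.gamma(100.0, 0.01, size=(K, V))        # variational topic parameters
true_stats = rng.gamma(1.0, 100.0, size=(K, V))  # stand-in for E-step statistics

recent_grads = deque(maxlen=WINDOW)              # last WINDOW gradient estimates

for t in range(1, 101):
    # Noisy natural gradient in the online-LDA form
    # (eta + scaled mini-batch statistics - lambda), plus simulated noise.
    noise = rng.normal(0.0, 25.0, size=(K, V))
    nat_grad = ETA + true_stats + noise - lam

    # Moving average over the recent gradients damps the stochastic noise.
    recent_grads.append(nat_grad)
    smoothed = np.mean(np.stack(recent_grads), axis=0)

    rho = (t + 1.0) ** -0.7                      # decaying step size, as in SVI
    lam = np.maximum(lam + rho * smoothed, 1e-10)  # keep parameters positive

print("final mean topic parameter:", lam.mean())
```

Averaging over a short window reduces the variance of each update at the cost of a small bias toward older gradients, which is the trade-off the abstract's convergence analysis concerns.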

Key words

Latent Dirichlet allocation (LDA); Topic modeling; Online learning; Moving average

CLC number

TP391.1 

References

  1. Amari, S., 1998. Natural gradient works efficiently in learning. Neur. Comput., 10(2):251–276. [doi:10.1162/089976698300017746]
  2. Andrieu, C., de Freitas, N., Doucet, A., et al., 2003. An introduction to MCMC for machine learning. Mach. Learn., 50(1–2):5–43. [doi:10.1023/A:1020281327116]
  3. Blatt, D., Hero, A.O., Gauchman, H., 2007. A convergent incremental gradient method with a constant step size. SIAM J. Optim., 18(1):29–51. [doi:10.1137/040615961]
  4. Blei, D.M., 2012. Probabilistic topic models. Commun. ACM, 55(4):77–84. [doi:10.1145/2133806.2133826]
  5. Blei, D.M., Ng, A.Y., Jordan, M.I., 2003. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022.
  6. Canini, K.R., Shi, L., Griffiths, T.L., 2009. Online inference of topics with latent Dirichlet allocation. J. Mach. Learn. Res., 5(2):65–72.
  7. Griffiths, T.L., Steyvers, M., 2004. Finding scientific topics. PNAS, 101(suppl 1):5228–5235. [doi:10.1073/pnas.0307752101]
  8. Hoffman, M., Bach, F.R., Blei, D.M., 2010. Online learning for latent Dirichlet allocation. Advances in Neural Information Processing Systems, p.856–864.
  9. Hoffman, M., Blei, D.M., Wang, C., et al., 2013. Stochastic variational inference. J. Mach. Learn. Res., 14(1):1303–1347.
  10. Liu, Z., Zhang, Y., Chang, E.Y., et al., 2011. PLDA+: parallel latent Dirichlet allocation with data placement and pipeline processing. ACM Trans. Intell. Syst. Technol., 2(3), Article 26.
  11. Newman, D., Asuncion, A., Smyth, P., et al., 2009. Distributed algorithms for topic models. J. Mach. Learn. Res., 10:1801–1828.
  12. Ouyang, J., Lu, Y., Li, X., 2014. Momentum online LDA for large-scale datasets. Proc. 21st European Conf. on Artificial Intelligence, p.1075–1076.
  13. Patterson, S., Teh, Y.W., 2013. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. Advances in Neural Information Processing Systems, p.3102–3110.
  14. Ranganath, R., Wang, C., Blei, D.M., et al., 2013. An adaptive learning rate for stochastic variational inference. J. Mach. Learn. Res., 28(2):298–306.
  15. Schaul, T., Zhang, S., LeCun, Y., 2013. No more pesky learning rates. arXiv preprint, arXiv:1206.1106v2.
  16. Song, X., Lin, C.Y., Tseng, B.L., et al., 2005. Modeling and predicting personal information dissemination behavior. Proc. 11th ACM SIGKDD Int. Conf. on Knowledge Discovery in Data Mining, p.479–488. [doi:10.1145/1081870.1081925]
  17. Tadić, V.B., 2009. Convergence rate of stochastic gradient search in the case of multiple and non-isolated minima. arXiv preprint, arXiv:0904.4229v2.
  18. Teh, Y.W., Newman, D., Welling, M., 2007. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. Advances in Neural Information Processing Systems, p.1353–1360.
  19. Wang, C., Chen, X., Smola, A.J., et al., 2013. Variance reduction for stochastic gradient optimization. Advances in Neural Information Processing Systems, p.181–189.
  20. Wang, Y., Bai, H., Stanton, M., et al., 2009. PLDA: parallel latent Dirichlet allocation for large-scale applications. Proc. 5th Int. Conf. on Algorithmic Aspects in Information and Management, p.301–314. [doi:10.1007/978-3-642-02158-9_26]
  21. Yan, F., Xu, N., Qi, Y., 2009. Parallel inference for latent Dirichlet allocation on graphics processing units. Advances in Neural Information Processing Systems, p.2134–2142.
  22. Ye, Y., Gong, S., Liu, C., et al., 2013. Online belief propagation algorithm for probabilistic latent semantic analysis. Front. Comput. Sci., 7(5):526–535. [doi:10.1007/s11704-013-2360-7]

Copyright information

© Journal of Zhejiang University Science Editorial Office and Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  1. College of Computer Science and Technology, Jilin University, Changchun, China
  2. MOE Key Laboratory of Symbolic Computation and Knowledge Engineering, Jilin University, Changchun, China