
Sparse Stochastic Inference with Regularization

  • Conference paper
  • In: Advances in Knowledge Discovery and Data Mining (PAKDD 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10234)


Abstract

The massive amount of digital text, and its delivery in a streaming manner, poses challenges for traditional inference algorithms. Recently, advances in stochastic inference algorithms have made it feasible to learn topic models from very large collections of documents. In this paper, however, we point out that many existing approaches are prone to overfitting on extremely large or even infinite datasets. The risk of overfitting is particularly high in streaming environments. This finding suggests using regularization in stochastic inference. We then propose a novel stochastic algorithm for learning latent Dirichlet allocation that applies regularization when updating the global parameters and uses sparse Gibbs sampling for local inference. We study the performance of our algorithm on two massive datasets and demonstrate that it surpasses existing algorithms in several respects.
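To make the proposed update concrete, below is a minimal sketch of the scheme described above; it is not the authors' implementation. Gibbs sampling infers each document's topic assignments (the sparsity tricks that make the local step fast are omitted for brevity), and the resulting statistics drive a regularized stochastic update of the global topic-word parameters. The hyperparameter values, the penalty strength reg, and the helpers gibbs_local and stochastic_update are all hypothetical.

```python
# Minimal sketch (not the authors' code): local inference by Gibbs sampling,
# then a regularized stochastic step on the global topic-word parameters.
import numpy as np

rng = np.random.default_rng(0)

K, V = 10, 1000         # topics, vocabulary size (toy values)
alpha, eta = 0.1, 0.01  # Dirichlet hyperparameters
reg = 1e-3              # assumed regularization strength
D = 100_000             # assumed corpus size, used to scale mini-batch statistics

lam = rng.gamma(100.0, 0.01, size=(K, V))  # global variational parameters

def gibbs_local(doc, lam, n_sweeps=20):
    """Resample topic assignments z for one document (array of word ids).
    The sparse bucketing that speeds this step up is omitted here."""
    phi = lam / lam.sum(axis=1, keepdims=True)  # point estimate of the topics
    z = rng.integers(K, size=len(doc))
    counts = np.bincount(z, minlength=K).astype(float)
    for _ in range(n_sweeps):
        for i, w in enumerate(doc):
            counts[z[i]] -= 1
            p = (counts + alpha) * phi[:, w]    # unnormalized local posterior
            z[i] = rng.choice(K, p=p / p.sum())
            counts[z[i]] += 1
    return z

def stochastic_update(batch, lam, rho):
    """One regularized stochastic step on the global parameters."""
    stats = np.zeros_like(lam)
    for doc in batch:
        for zi, w in zip(gibbs_local(doc, lam), doc):
            stats[zi, w] += 1.0
    target = eta + (D / len(batch)) * stats     # corpus-scaled estimate
    # The penalty shrinks lam toward the prior, guarding against overfitting.
    return (1.0 - rho) * lam + rho * (target - reg * (lam - eta))

# Usage: stream mini-batches with a decaying step size rho_t = (t + tau)^-kappa.
batch = [rng.integers(V, size=50) for _ in range(8)]  # toy mini-batch
for t in range(5):
    lam = stochastic_update(batch, lam, rho=(t + 10.0) ** -0.7)
```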


Notes

  1. SSI was taken from http://www.cs.princeton.edu/~blei/downloads/onlineldavb.tar.

  2. The data were retrieved from http://archive.ics.uci.edu/ml/datasets/.


Acknowledgments

This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under Grant Number 102.05-2014.28 and by the Air Force Office of Scientific Research (AFOSR), Asian Office of Aerospace Research & Development (AOARD), and US Army International Technology Center, Pacific (ITC-PAC) under Award Number FA2386-15-1-4011. Khoat Than is also funded by Vietnam Institute for Advanced Study in Mathematics (VIASM).

Author information


Corresponding author

Correspondence to Khoat Than.



Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Doan, T., Than, K. (2017). Sparse Stochastic Inference with Regularization. In: Kim, J., Shim, K., Cao, L., Lee, J.G., Lin, X., Moon, Y.S. (eds.) Advances in Knowledge Discovery and Data Mining. PAKDD 2017. Lecture Notes in Computer Science (LNAI), vol. 10234. Springer, Cham. https://doi.org/10.1007/978-3-319-57454-7_35


  • DOI: https://doi.org/10.1007/978-3-319-57454-7_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-57453-0

  • Online ISBN: 978-3-319-57454-7

