Online Subset Topic Modeling for Interactive Documents Exploration

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10827)

Abstract

Data exploration over text databases is an important problem. In an exploration scenario, users want to find something useful without knowing in advance exactly what they are looking for. Exploration is therefore often labor-intensive: users must review the overview (or detailed) results of ad hoc queries and continuously adjust those queries (e.g., by zooming or filtering). Probabilistic topic models are often adopted to provide an overview of a given text collection, since they can discover the underlying thematic structure of unstructured text. However, training a topic model on a selected document collection is time-consuming, and continuous query adjustment triggers frequent retraining, which wastes a large amount of time and makes this approach unsuitable for online exploration. To remedy this problem, this paper presents STMS, an algorithm for efficiently constructing topic structures over document subsets. STMS accelerates subset modeling by leveraging global precomputation and applying an efficient sampling-based inference algorithm. Experiments on real-world datasets show that STMS achieves orders-of-magnitude speed-ups over a standard topic model while remaining comparable in modeling quality.
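The abstract does not spell out how STMS works, so the following is not the authors' algorithm. It is only a minimal sketch of the general idea of reusing global precomputation: a topic model is assumed to have been trained once on the full corpus, and a topic overview for any query-selected subset is then obtained by aggregating the stored token-topic assignments instead of retraining. The names `GLOBAL_ASSIGNMENTS` and `subset_topic_summary`, and the toy data, are hypothetical.

```python
from collections import Counter

# Hypothetical precomputed state (an assumption, not from the paper):
# for each document, the topic assigned to each token by a topic model
# trained once, offline, on the whole corpus.
GLOBAL_ASSIGNMENTS = {
    "d1": [("price", 0), ("market", 0), ("gas", 1)],
    "d2": [("gas", 1), ("pipeline", 1), ("market", 0)],
    "d3": [("meeting", 2), ("schedule", 2), ("price", 0)],
}


def subset_topic_summary(doc_ids, assignments, num_topics, top_n=2):
    """Summarize topics for a document subset without retraining.

    Returns one (topic_id, share_of_tokens, top_words) tuple per topic,
    computed purely by counting the precomputed assignments.
    """
    topic_totals = [0] * num_topics
    topic_words = [Counter() for _ in range(num_topics)]

    # One pass over the subset's tokens: cost is linear in subset size.
    for doc_id in doc_ids:
        for word, topic in assignments[doc_id]:
            topic_totals[topic] += 1
            topic_words[topic][word] += 1

    total = sum(topic_totals)
    summary = []
    for k in range(num_topics):
        share = topic_totals[k] / total if total else 0.0
        top_words = [w for w, _ in topic_words[k].most_common(top_n)]
        summary.append((k, share, top_words))
    return summary


if __name__ == "__main__":
    # A user query selects documents d1 and d2; the overview is built
    # from stored counts only.
    for topic_id, share, words in subset_topic_summary(
        ["d1", "d2"], GLOBAL_ASSIGNMENTS, num_topics=3
    ):
        print(topic_id, round(share, 2), words)
```

A pure aggregation like this keeps the global topics fixed; the abstract indicates that STMS additionally applies a sampling-based inference step, which this sketch does not attempt to reproduce.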

This paper is supported by NSFC (No. 61732004) and the Shanghai Innovation Action Project (Grant No. 16DZ1100200).

Notes

  1. https://snap.stanford.edu/data/email-Enron.html

  2. https://en.wikipedia.org/wiki/Wikipedia:Database_download

Author information

Correspondence to Yinan Jing, Zhenying He or Xiaoyang Sean Wang.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Li, L. et al. (2018). Online Subset Topic Modeling for Interactive Documents Exploration. In: Pei, J., Manolopoulos, Y., Sadiq, S., Li, J. (eds) Database Systems for Advanced Applications. DASFAA 2018. Lecture Notes in Computer Science, vol 10827. Springer, Cham. https://doi.org/10.1007/978-3-319-91452-7_59

  • DOI: https://doi.org/10.1007/978-3-319-91452-7_59

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-91451-0

  • Online ISBN: 978-3-319-91452-7

  • eBook Packages: Computer Science, Computer Science (R0)
