Probabilistic Explicit Topic Modeling Using Wikipedia

Conference paper
In: Language Processing and Knowledge in the Web

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8105)

Abstract

Despite the popular use of Latent Dirichlet Allocation (LDA) for automatic discovery of latent topics in document corpora, such topics lack connections to relevant knowledge sources such as Wikipedia, and they can be difficult to interpret due to the lack of meaningful topic labels. Furthermore, the topic analysis suffers from a lack of identifiability between topics not only across independently analyzed corpora but also across distinct runs of the algorithm on the same corpus. This paper introduces two methods for probabilistic explicit topic modeling that address these issues: Latent Dirichlet Allocation with Static Topic-Word Distributions (LDA-STWD) and Explicit Dirichlet Allocation (EDA). Both methods estimate topic-word distributions a priori from Wikipedia articles, with each article corresponding to one topic and the article title serving as a topic label. LDA-STWD and EDA thereby overcome the nonidentifiability, isolation, and uninterpretability of LDA output. We assess their effectiveness by means of crowd-sourced user studies on two tasks: topic label generation and document label generation. We find that LDA-STWD improves substantially upon the performance of the state of the art on the document labeling task, and that both methods otherwise perform on par with a state-of-the-art post hoc method.
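The core idea of the abstract can be illustrated with a small sketch: instead of learning topic-word distributions from the corpus, fix one distribution per Wikipedia article (smoothed word frequencies) and use the article title as the topic label. The code below is an illustrative unigram approximation of that idea, not the paper's actual LDA-STWD/EDA inference procedure; the function names and the smoothing scheme are assumptions for the example.

```python
import math
from collections import Counter

def topic_word_distributions(articles, beta=0.01):
    """Estimate one *fixed* topic-word distribution per Wikipedia article.

    articles: dict mapping article title -> article text.
    Returns (vocab, dict mapping title -> {word: smoothed probability}).
    Illustrative only: real systems would tokenize and filter properly.
    """
    vocab = sorted({w for text in articles.values() for w in text.lower().split()})
    dists = {}
    for title, text in articles.items():
        counts = Counter(text.lower().split())
        total = sum(counts.values()) + beta * len(vocab)
        # Additive (Dirichlet-style) smoothing so every vocab word has mass.
        dists[title] = {w: (counts[w] + beta) / total for w in vocab}
    return vocab, dists

def label_document(doc, dists):
    """Label a document with the title of the highest-scoring topic under
    a unigram log-likelihood score (a stand-in for full posterior inference)."""
    words = doc.lower().split()
    best_title, best_score = None, float("-inf")
    for title, dist in dists.items():
        score = sum(math.log(dist[w]) for w in words if w in dist)
        if score > best_score:
            best_title, best_score = title, score
    return best_title
```

Because the distributions are fixed in advance, the same topics (and labels) are recovered on any corpus and any run, which is the identifiability property the abstract emphasizes.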




Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hansen, J.A., Ringger, E.K., Seppi, K.D. (2013). Probabilistic Explicit Topic Modeling Using Wikipedia. In: Gurevych, I., Biemann, C., Zesch, T. (eds) Language Processing and Knowledge in the Web. Lecture Notes in Computer Science(), vol 8105. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40722-2_7

  • Print ISBN: 978-3-642-40721-5

  • Online ISBN: 978-3-642-40722-2
