Abstract
Despite the popular use of Latent Dirichlet Allocation (LDA) for automatic discovery of latent topics in document corpora, such topics lack connections with relevant knowledge sources such as Wikipedia, and they can be difficult to interpret due to the lack of meaningful topic labels. Furthermore, the topics are not identifiable, neither across independently analyzed corpora nor across distinct runs of the algorithm on the same corpus. This paper introduces two methods for probabilistic explicit topic modeling that address these issues: Latent Dirichlet Allocation with Static Topic-Word Distributions (LDA-STWD) and Explicit Dirichlet Allocation (EDA). Both methods estimate topic-word distributions a priori from Wikipedia articles, with each article corresponding to one topic and the article title serving as the topic label. LDA-STWD and EDA thereby overcome the nonidentifiability, isolation, and uninterpretability of LDA output. We assess their effectiveness by means of crowd-sourced user studies on two tasks: topic label generation and document label generation. We find that LDA-STWD substantially improves upon the state of the art on the document labeling task, and that both methods otherwise perform on par with a state-of-the-art post hoc method.
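The core idea of estimating topic-word distributions a priori from Wikipedia can be sketched in a few lines: each article becomes one explicit topic, its title serves as the topic label, and its (smoothed) relative word frequencies serve as the topic-word distribution. The sketch below is illustrative only and is not the paper's implementation; the smoothing pseudocount `beta` and the tiny toy "articles" are assumptions for the example.

```python
from collections import Counter

def estimate_topic_word_distributions(articles, beta=0.01):
    """Build one explicit topic per Wikipedia article.

    articles: dict mapping article title -> list of tokens.
    The title is the topic label; the smoothed relative word
    frequencies are the static topic-word distribution.
    beta is a hypothetical smoothing pseudocount (an assumption,
    not a value from the paper).
    """
    vocab = sorted({w for tokens in articles.values() for w in tokens})
    topics = {}
    for title, tokens in articles.items():
        counts = Counter(tokens)
        total = len(tokens) + beta * len(vocab)
        # Smoothed probability for every vocabulary word; sums to 1.
        topics[title] = {w: (counts[w] + beta) / total for w in vocab}
    return topics

# Toy corpus standing in for real Wikipedia article text.
articles = {
    "Topic model": "topic model latent corpus topic".split(),
    "Wikipedia": "wikipedia article encyclopedia article".split(),
}
phi = estimate_topic_word_distributions(articles)
```

Because these distributions are fixed before inference, two runs (or two corpora) share the same labeled topics, which is what restores identifiability.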
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
Cite this paper
Hansen, J.A., Ringger, E.K., Seppi, K.D. (2013). Probabilistic Explicit Topic Modeling Using Wikipedia. In: Gurevych, I., Biemann, C., Zesch, T. (eds) Language Processing and Knowledge in the Web. Lecture Notes in Computer Science(), vol 8105. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40722-2_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40721-5
Online ISBN: 978-3-642-40722-2
eBook Packages: Computer Science (R0)