LDA-Based Topic Modeling in Labeling Blog Posts with Wikipedia Entries

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7234)

Abstract

Given a search query, most existing search engines simply return a ranked list of results. Such a result set, however, often mixes documents on a variety of different contents. To help users quickly overview how those contents are distributed, this paper proposes a framework for labeling blog posts with Wikipedia entries through topic modeling based on LDA (latent Dirichlet allocation). More specifically, it applies an LDA-based document model to the labeling task. One of the most important advantages of this model is that the collected Wikipedia entries and their LDA parameters depend heavily on the distribution of keywords across all the blog posts in the search results. This dependence is what allows the search results to be overviewed quickly through the LDA-based topic distribution. In the evaluation, we also show that the LDA-based document retrieval scheme outperforms our previous approach.
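The abstract does not spell out the scoring procedure, so the following is only a minimal sketch of how an LDA-based document model might be used for labeling, under stated assumptions. It assumes the topic-word distributions P(w|z) and the per-post topic mixtures P(z|d) have already been estimated, and it scores each Wikipedia entry by the likelihood of its terms under P(w|d) = Σ_z P(w|z) P(z|d). The function names, the toy vocabulary, and the toy probabilities are all hypothetical, and the smoothing with a conventional language model that retrieval-oriented LDA document models usually add is left out for brevity.

```python
import math

def lda_word_prob(word, doc_topics, topic_words):
    """P(w|d) = sum over topics z of P(w|z) * P(z|d)."""
    return sum(p_z * topic_words[z].get(word, 0.0)
               for z, p_z in enumerate(doc_topics))

def entry_log_likelihood(entry_terms, doc_topics, topic_words, floor=1e-12):
    """Log-likelihood of a Wikipedia entry's terms under one post's LDA model.

    The floor only avoids log(0) for unseen words; it is an illustrative
    stand-in for proper smoothing, not part of the paper's method.
    """
    return sum(math.log(max(lda_word_prob(w, doc_topics, topic_words), floor))
               for w in entry_terms)

def label_post(doc_topics, topic_words, entries):
    """Return the name of the entry whose terms are most likely for this post."""
    return max(entries,
               key=lambda name: entry_log_likelihood(entries[name],
                                                     doc_topics, topic_words))

# Hypothetical toy model: two topics, each a distribution over words.
topic_words = [
    {"election": 0.5, "vote": 0.4, "goal": 0.1},
    {"goal": 0.6, "league": 0.3, "vote": 0.1},
]
# Hypothetical Wikipedia entries, each represented by a few terms.
entries = {"Election": ["election", "vote"], "Football": ["goal", "league"]}
# Hypothetical per-post topic mixtures P(z|d).
post_topics = {"post_a": [0.9, 0.1], "post_b": [0.2, 0.8]}

labels = {post: label_post(theta, topic_words, entries)
          for post, theta in post_topics.items()}
print(labels)
```

Because every post is scored against the same topic model, the labels reflect how keywords are distributed over the whole result set rather than over any single post, which is the property the abstract highlights.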


Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Yokomoto, D. et al. (2012). LDA-Based Topic Modeling in Labeling Blog Posts with Wikipedia Entries. In: Wang, H., et al. Web Technologies and Applications. APWeb 2012. Lecture Notes in Computer Science, vol 7234. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29426-6_15

  • DOI: https://doi.org/10.1007/978-3-642-29426-6_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-29425-9

  • Online ISBN: 978-3-642-29426-6

  • eBook Packages: Computer Science, Computer Science (R0)
