Skip to main content

Extracting Multilingual Topics from Unaligned Comparable Corpora

  • Conference paper
Advances in Information Retrieval (ECIR 2010)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 5993))

Included in the following conference series:

Abstract

Topic models have been studied extensively in the context of monolingual corpora. Though there are some attempts to mine topical structure from cross-lingual corpora, they require clues about document alignments. In this paper we present a generative model called JointLDA which uses a bilingual dictionary to mine multilingual topics from an unaligned corpus. Experiments conducted on different data sets confirm our conjecture that jointly modeling the cross-lingual corpora offers several advantages compared to individual monolingual models. Since the JointLDA model merges related topics in different languages into a single multilingual topic: a) it can fit the data with relatively fewer topics. b) it has the ability to predict related words from a language different than that of the given document. In fact it has better predictive power compared to the bag-of-word based translation model leaving the possibility for JointLDA to be preferred over bag-of-word model for Cross-Lingual IR applications. We also found that the monolingual models learnt while optimizing the cross-lingual copora are more effective than the corresponding LDA models.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Xu, J., Weischedel, R., Nguyen, C.: Evaluating a probabilistic model for cross-lingual information retrieval. In: SIGIR 2001, pp. 105–110. ACM, New York (2001)

    Chapter  Google Scholar 

  2. Bel, N., Koster, C.H.A., Villegas, M.: Cross-lingual text categorization. In: Koch, T., Sølvberg, I.T. (eds.) ECDL 2003. LNCS, vol. 2769, pp. 126–139. Springer, Heidelberg (2003)

    Google Scholar 

  3. Blei, D.M., Lafferty, J.D.: A correlated topic model of science. Annals of Applied Statistics, 17–35 (August 2007)

    Google Scholar 

  4. Blei, D.M., Lafferty, J.: Topic models. Text Mining: Theory and Applications. Taylor and Francis, Abington (2009)

    Google Scholar 

  5. Steyvers, M., Griffiths, T.: Probabilistic topic models. Latent Semantic Analysis: A Road to Meaning (2005)

    Google Scholar 

  6. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. Journal of Maching Learning Research 3, 993–1022 (2003)

    Article  MATH  Google Scholar 

  7. Koehn, P.: Europarl: A parallel corpus for statistical machine translation. In: MT Summit (2005)

    Google Scholar 

  8. Griffiths, T.L., Steyvers, M.: Finding scientific topics. Proceedings of National Academy of Sciences USA 101(suppl. 1), 5228–5235 (2004)

    Article  Google Scholar 

  9. Wang, X., Zhai, C., Hu, X., Sproat, R.: Mining correlated bursty topic patterns from coordinated text streams. In: KDD 2007: Proceedings of the 13th ACM SIGKDD, pp. 784–793. ACM, New York (2007)

    Google Scholar 

  10. Dumais, S.T., Landauer, T.K., Littman, M.L.: Automatic cross-linguistic information retrieval using latent semantic indexing. In: Working Notes of the Workshop on Cross-Linguistic Information Retrieval, SIGIR, Zurich, Switzerland, pp. 16–23. ACM, New York (1996)

    Google Scholar 

  11. Blei, D.M., Jordan, M.I.: Modeling annotated data. In: SIGIR 2003, pp. 127–134. ACM, New York (2003)

    Chapter  Google Scholar 

  12. Ni, X., Sun, J.T., Hu, J., Chen, Z.: Mining multilingual topics from wikipedia. In: 18th International World Wide Web Conference, April 2009, pp. 1155–1155 (2009)

    Google Scholar 

  13. Koehn, P., Knight, K.: Learning a translation lexicon from monolingual corpora. In: Proceedings of the ACL 2002 workshop on Unsupervised lexical acquisition, Morristown, NJ, USA, pp. 9–16. Association for Computational Linguistics (2002)

    Google Scholar 

  14. Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Computational Linguistics 29(1), 19–51 (2003)

    Article  Google Scholar 

  15. Zhai, C., Lafferty, J.: A study of smoothing methods for language models applied to ad hoc information retrieval. In: SIGIR 2001, pp. 334–342. ACM Press, New York (2001)

    Chapter  Google Scholar 

  16. Boyd-Graber, J., Blei, D.M.: Multilingual topic models for unaligned text. In: Uncertainty in Artificial Intelligence (2009)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Jagarlamudi, J., Daumé, H. (2010). Extracting Multilingual Topics from Unaligned Comparable Corpora. In: Gurrin, C., et al. Advances in Information Retrieval. ECIR 2010. Lecture Notes in Computer Science, vol 5993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12275-0_39

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-12275-0_39

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-12274-3

  • Online ISBN: 978-3-642-12275-0

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics