Knowledge Transfer across Multilingual Corpora via Latent Topics

De Smet, Wim; Tang, Jie; Moens, Marie-Francine

doi:10.1007/978-3-642-20841-6_45

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6634)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

This paper explores bridging the content of two different languages via latent topics. Specifically, we propose a unified probabilistic model to simultaneously model latent topics from bilingual corpora that discuss comparable content and use the topics as features in a cross-lingual, dictionary-less text categorization task. Experimental results on multilingual Wikipedia data show that the proposed topic model effectively discovers the topic information from the bilingual corpora, and the learned topics successfully transfer classification knowledge to other languages, for which no labeled training data are available.
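
The pipeline the abstract describes (learn shared topics from comparable document pairs, then use per-document topic proportions as language-independent features) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the model, so the sketch assumes an LDA-style model in which each document pair shares one topic mixture while each language keeps its own topic-word counts. The function names, hyperparameters, the English/Dutch toy corpus, and the nearest-centroid classifier are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict

def train_bilingual_topics(pairs, num_topics, iters=100, alpha=0.5, beta=0.1, seed=0):
    """Collapsed Gibbs sampling; each (source, target) pair shares one topic mixture."""
    rng = random.Random(seed)
    vocab = [{w for p in pairs for w in p[lang]} for lang in (0, 1)]
    doc_topic = [[0] * num_topics for _ in pairs]  # per pair, both languages pooled
    topic_word = [[defaultdict(int) for _ in range(num_topics)] for _ in (0, 1)]
    topic_total = [[0] * num_topics for _ in (0, 1)]
    # Random initial topic assignment for every token in every language.
    assign = [[[rng.randrange(num_topics) for _ in p[lang]] for lang in (0, 1)]
              for p in pairs]

    def update(d, lang, w, k, delta):
        doc_topic[d][k] += delta
        topic_word[lang][k][w] += delta
        topic_total[lang][k] += delta

    def phi(lang, k, w):
        # Smoothed probability of word w under topic k in the given language.
        return ((topic_word[lang][k][w] + beta)
                / (topic_total[lang][k] + len(vocab[lang]) * beta))

    for d, p in enumerate(pairs):
        for lang in (0, 1):
            for w, k in zip(p[lang], assign[d][lang]):
                update(d, lang, w, k, +1)

    for _ in range(iters):
        for d, p in enumerate(pairs):
            for lang in (0, 1):
                for i, w in enumerate(p[lang]):
                    update(d, lang, w, assign[d][lang][i], -1)
                    weights = [(doc_topic[d][k] + alpha) * phi(lang, k, w)
                               for k in range(num_topics)]
                    k_new = rng.choices(range(num_topics), weights=weights)[0]
                    assign[d][lang][i] = k_new
                    update(d, lang, w, k_new, +1)

    return {"phi": phi, "vocab": vocab, "num_topics": num_topics, "alpha": alpha}

def infer_topics(model, tokens, lang, steps=50):
    """Estimate topic proportions for one monolingual document (topics held fixed)."""
    K, alpha = model["num_topics"], model["alpha"]
    known = [w for w in tokens if w in model["vocab"][lang]]  # skip unseen words
    theta = [1.0 / K] * K
    for _ in range(steps):
        counts = [alpha] * K
        for w in known:
            resp = [theta[k] * model["phi"](lang, k, w) for k in range(K)]
            total = sum(resp)
            for k in range(K):
                counts[k] += resp[k] / total
        norm = sum(counts)
        theta = [c / norm for c in counts]
    return theta

def nearest_centroid(train_vectors, train_labels, vector):
    """Return the label whose mean training vector is closest to `vector`."""
    sums = {}
    for v, y in zip(train_vectors, train_labels):
        acc, n = sums.get(y, ([0.0] * len(v), 0))
        sums[y] = ([a + b for a, b in zip(acc, v)], n + 1)

    def dist(y):
        acc, n = sums[y]
        return sum((a / n - b) ** 2 for a, b in zip(acc, vector))

    return min(sums, key=dist)

# Hypothetical toy corpus: English/Dutch pairs on comparable content.
pairs = [
    ("goal match team goal".split(), "doelpunt wedstrijd ploeg doelpunt".split()),
    ("team match player".split(), "ploeg wedstrijd speler".split()),
    ("election vote party vote".split(), "verkiezing stem partij stem".split()),
    ("party election minister".split(), "partij verkiezing minister".split()),
]
labels = ["sport", "sport", "politics", "politics"]  # labels exist for English only

model = train_bilingual_topics(pairs, num_topics=2)
train_vecs = [infer_topics(model, src, 0) for src, _ in pairs]
test_vec = infer_topics(model, "wedstrijd speler doelpunt".split(), 1)
prediction = nearest_centroid(train_vecs, labels, test_vec)
print(test_vec, prediction)
```

No bilingual dictionary appears anywhere in the sketch: the only cross-lingual signal is that paired documents share one topic mixture, which is what lets a classifier trained on English topic vectors be applied to a Dutch document. On a corpus this small the sampled topics may not separate cleanly, so the printed prediction should be read as a demonstration of the data flow rather than as a result.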

Author information

Authors and Affiliations

K.U. Leuven, Leuven, Belgium
Wim De Smet & Marie-Francine Moens
Tsinghua University, Beijing, China
Jie Tang

Editor information

Editors and Affiliations

Shenzhen Institutes of Advanced Technology (SIAT), Chinese Academy of Sciences, 518055, Shenzhen, China
Joshua Zhexue Huang
Faculty of Engineering and Information Technology, Center for Quantum Computation and Intelligent Systems, Data Sciences and Knowledge Discovery Lab, University of Technology Sydney, NSW 2007, Sydney, Australia
Longbing Cao
Department of Computer Science and Engineering, University of Minnesota, MN 55455, Minneapolis, USA
Jaideep Srivastava

About this paper

Cite this paper

De Smet, W., Tang, J., Moens, M.-F. (2011). Knowledge Transfer across Multilingual Corpora via Latent Topics. In: Huang, J.Z., Cao, L., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2011. Lecture Notes in Computer Science, vol 6634. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20841-6_45

DOI: https://doi.org/10.1007/978-3-642-20841-6_45
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20840-9
Online ISBN: 978-3-642-20841-6
eBook Packages: Computer Science, Computer Science (R0)
