Translation Induction on Indian Language Corpora Using Translingual Themes from Other Languages

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9041)

Abstract

Identifying translations from comparable corpora is a well-known problem with several applications, e.g. dictionary creation in resource-scarce languages. Scarcity of high quality corpora, especially in Indian languages, makes this problem hard, e.g. state-of-the-art techniques achieve a mean reciprocal rank (MRR) of 0.66 for English-Italian, and a mere 0.187 for Telugu-Kannada. There exist comparable corpora in many Indian languages with other “auxiliary” languages. We observe that translations have many topically related words in common in the auxiliary language. To model this, we define the notion of a translingual theme, a set of topically related words from auxiliary language corpora, and present a probabilistic framework for translation induction. Extensive experiments on 35 comparable corpora using English and French as auxiliary languages show that this approach can yield dramatic improvements in performance (e.g. MRR improves by 124% to 0.419 for Telugu-Kannada). A user study on WikiTSu, a system for cross-lingual Wikipedia title suggestion that uses our approach, shows a 20% improvement in the quality of titles suggested.
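The mean reciprocal rank (MRR) figures quoted above can be made concrete with a small sketch. The code below is an illustration only, not the authors' evaluation code: it assumes each source word has exactly one gold translation and a ranked candidate list, and the function names and placeholder words (`src_a`, `tgt_x`, and so on) are hypothetical. It also checks the arithmetic behind the reported relative improvement from 0.187 to 0.419.

```python
def mean_reciprocal_rank(ranked_candidates, gold):
    """Average of 1/rank of the gold translation over all source words.

    ranked_candidates: dict mapping a source word to its ranked list of
        candidate translations (best first).
    gold: dict mapping a source word to its single correct translation
        (a simplifying assumption made for this sketch).
    A source word whose gold translation is absent from its candidate
    list contributes 0 to the sum.
    """
    if not ranked_candidates:
        return 0.0
    total = 0.0
    for word, candidates in ranked_candidates.items():
        target = gold.get(word)
        if target in candidates:
            rank = candidates.index(target) + 1  # ranks start at 1
            total += 1.0 / rank
    return total / len(ranked_candidates)


def relative_improvement_percent(baseline, improved):
    """Relative change from baseline to improved, in percent."""
    return 100.0 * (improved - baseline) / baseline


# Toy data with placeholder words (hypothetical, not from the paper).
candidates = {
    "src_a": ["tgt_x", "tgt_y"],   # gold at rank 1 -> 1/1
    "src_b": ["tgt_y", "tgt_z"],   # gold at rank 2 -> 1/2
    "src_c": ["tgt_x", "tgt_y"],   # gold missing   -> 0
}
gold = {"src_a": "tgt_x", "src_b": "tgt_z", "src_c": "tgt_w"}

mrr = mean_reciprocal_rank(candidates, gold)  # (1 + 0.5 + 0) / 3 = 0.5
gain = relative_improvement_percent(0.187, 0.419)  # about 124 percent
```

Under this definition, an MRR of 0.187 corresponds roughly to the correct translation appearing around fifth place on average, while 0.419 corresponds to it appearing between second and third place.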

Author information

Corresponding author

Correspondence to Goutham Tholpadi.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Tholpadi, G., Bhattacharyya, C., Shevade, S. (2015). Translation Induction on Indian Language Corpora Using Translingual Themes from Other Languages. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2015. Lecture Notes in Computer Science, vol 9041. Springer, Cham. https://doi.org/10.1007/978-3-319-18111-0_38

  • DOI: https://doi.org/10.1007/978-3-319-18111-0_38

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-18110-3

  • Online ISBN: 978-3-319-18111-0

  • eBook Packages: Computer Science, Computer Science (R0)