Translation Induction on Indian Language Corpora Using Translingual Themes from Other Languages

Tholpadi, Goutham; Bhattacharyya, Chiranjib; Shevade, Shirish

doi:10.1007/978-3-319-18111-0_38

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9041)

Abstract

Identifying translations from comparable corpora is a well-known problem with several applications, e.g., dictionary creation in resource-scarce languages. Scarcity of high-quality corpora, especially in Indian languages, makes this problem hard: state-of-the-art techniques achieve a mean reciprocal rank (MRR) of 0.66 for English-Italian, but a mere 0.187 for Telugu-Kannada. However, many Indian languages have comparable corpora with other “auxiliary” languages. We observe that a word and its translation share many topically related words in the auxiliary language. To model this, we define the notion of a translingual theme, a set of topically related words from auxiliary language corpora, and present a probabilistic framework for translation induction. Extensive experiments on 35 comparable corpora using English and French as auxiliary languages show that this approach can yield dramatic improvements in performance (e.g., MRR improves by 124% to 0.419 for Telugu-Kannada). A user study on WikiTSu, a system for cross-lingual Wikipedia title suggestion that uses our approach, shows a 20% improvement in the quality of the suggested titles.
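The abstract reports results as mean reciprocal rank. As a rough illustration, the sketch below computes MRR under its standard definition (the average, over source words, of 1/rank of the first correct translation in a ranked candidate list) and checks the relative improvement quoted for Telugu-Kannada. The function names and the toy word lists are hypothetical placeholders; the paper's exact evaluation protocol is not described on this page.

```python
from typing import Dict, List, Set


def mean_reciprocal_rank(ranked: Dict[str, List[str]],
                         gold: Dict[str, Set[str]]) -> float:
    """Average of 1/rank of the first correct candidate per source word.

    A source word with no correct candidate in its list contributes 0.
    """
    if not ranked:
        return 0.0
    total = 0.0
    for word, candidates in ranked.items():
        correct = gold.get(word, set())
        for rank, candidate in enumerate(candidates, start=1):
            if candidate in correct:
                total += 1.0 / rank
                break
    return total / len(ranked)


def relative_improvement(baseline: float, improved: float) -> float:
    """Percentage gain of `improved` over `baseline`."""
    return 100.0 * (improved - baseline) / baseline


# Toy data (hypothetical placeholders, not taken from the paper).
ranked_candidates = {
    "src_1": ["tgt_a", "tgt_b"],   # correct at rank 1 -> 1.0
    "src_2": ["tgt_c", "tgt_d"],   # correct at rank 2 -> 0.5
    "src_3": ["tgt_e", "tgt_f"],   # no correct candidate -> 0.0
}
gold_translations = {
    "src_1": {"tgt_a"},
    "src_2": {"tgt_d"},
    "src_3": {"tgt_z"},
}

toy_mrr = mean_reciprocal_rank(ranked_candidates, gold_translations)
gain = relative_improvement(0.187, 0.419)  # MRR values quoted in the abstract
print(f"toy MRR = {toy_mrr:.3f}, Telugu-Kannada gain = {gain:.0f}%")
```

On the toy data the MRR is (1 + 1/2 + 0) / 3 = 0.5, and the relative gain from 0.187 to 0.419 comes out at about 124%, consistent with the figure in the abstract.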

Author information

Authors and Affiliations

Computer Science and Automation, Indian Institute of Science, Bangalore, 560012, India
Goutham Tholpadi, Chiranjib Bhattacharyya & Shirish Shevade

Corresponding author

Correspondence to Goutham Tholpadi.

Editor information

Editors and Affiliations

Centro de Investigación en Computación, Instituto Politécnico Nacional, Mexico DF, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Tholpadi, G., Bhattacharyya, C., Shevade, S. (2015). Translation Induction on Indian Language Corpora Using Translingual Themes from Other Languages. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2015. Lecture Notes in Computer Science, vol 9041. Springer, Cham. https://doi.org/10.1007/978-3-319-18111-0_38

DOI: https://doi.org/10.1007/978-3-319-18111-0_38
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18110-3
Online ISBN: 978-3-319-18111-0
eBook Packages: Computer Science, Computer Science (R0)
