Abstract
Identifying translations from comparable corpora is a well-known problem with several applications, such as dictionary creation for resource-scarce languages. The scarcity of high-quality corpora, especially in Indian languages, makes this problem hard: state-of-the-art techniques achieve a mean reciprocal rank (MRR) of 0.66 for English-Italian but only 0.187 for Telugu-Kannada. However, many Indian languages have comparable corpora paired with other “auxiliary” languages. We observe that translations have many topically related words in common in the auxiliary language. To model this, we define the notion of a translingual theme, a set of topically related words from auxiliary language corpora, and present a probabilistic framework for translation induction. Extensive experiments on 35 comparable corpora using English and French as auxiliary languages show that this approach can yield dramatic improvements in performance (e.g. MRR improves by 124% to 0.419 for Telugu-Kannada). A user study on WikiTSu, a system for cross-lingual Wikipedia title suggestion that uses our approach, shows a 20% improvement in the quality of the suggested titles.
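The abstract reports results as mean reciprocal rank (MRR). As a rough illustration of that metric only, and not of the authors' evaluation code, the sketch below computes MRR over ranked candidate translations and checks the relative improvement quoted for Telugu-Kannada. The function name, the data layout, and the placeholder words are assumptions made for this example.

```python
from typing import Dict, List, Set


def mean_reciprocal_rank(ranked: Dict[str, List[str]],
                         gold: Dict[str, Set[str]]) -> float:
    """Average, over source words, of 1/rank of the first correct candidate.

    A source word with no correct candidate in its list contributes 0.
    """
    if not ranked:
        return 0.0
    total = 0.0
    for source, candidates in ranked.items():
        correct = gold.get(source, set())
        for rank, candidate in enumerate(candidates, start=1):
            if candidate in correct:
                total += 1.0 / rank
                break
    return total / len(ranked)


# Hypothetical toy data: placeholder tokens, not real Telugu or Kannada words.
ranked_candidates = {
    "src1": ["a", "b", "c"],  # correct at rank 1 -> contributes 1
    "src2": ["d", "e"],       # correct at rank 2 -> contributes 1/2
    "src3": ["f", "g"],       # no correct candidate -> contributes 0
}
gold_translations = {"src1": {"a"}, "src2": {"e"}, "src3": {"z"}}

toy_mrr = mean_reciprocal_rank(ranked_candidates, gold_translations)

# Relative improvement implied by the two Telugu-Kannada figures in the abstract.
baseline_mrr, improved_mrr = 0.187, 0.419
relative_gain_percent = 100.0 * (improved_mrr - baseline_mrr) / baseline_mrr

print(f"toy MRR = {toy_mrr:.3f}")
print(f"relative gain = {relative_gain_percent:.0f}%")
```

On the toy data the three source words contribute 1, 1/2, and 0, so the MRR is 0.5. The two reported figures, 0.187 and 0.419, are consistent with the stated improvement of about 124%.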
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Tholpadi, G., Bhattacharyya, C., Shevade, S. (2015). Translation Induction on Indian Language Corpora Using Translingual Themes from Other Languages. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2015. Lecture Notes in Computer Science, vol 9041. Springer, Cham. https://doi.org/10.1007/978-3-319-18111-0_38
DOI: https://doi.org/10.1007/978-3-319-18111-0_38
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18110-3
Online ISBN: 978-3-319-18111-0
eBook Packages: Computer Science (R0)