Multilingual Projections

Bhattacharyya, Pushpak

doi:10.1007/978-3-319-08043-7_11

Chapter
First Online: 12 November 2014

Part of the book series: Text, Speech and Language Technology (TLTB, volume 48)

Abstract

Languages of the world, though different, share structures and vocabulary. Today’s NLP depends crucially on annotation, which is costly: it needs expertise, money and time. Most languages in the world fall far behind English when it comes to annotated resources. Because annotation is costly, there has been a worldwide effort to leverage multilinguality in the development and use of annotated corpora. The key idea is to project annotation from one language to another and use it there. This means that parameters learnt from the annotated corpus of one language are used in the NLP of another language. We illustrate multilingual projection through the case study of word sense disambiguation (WSD), whose goal is to obtain the correct meaning of a word in context. The correct meaning is usually denoted by an appropriate sense id from a sense repository, usually a wordnet. In this paper we show how two languages can help each other in their WSD, even when neither language has any sense-marked corpus. The two specific languages chosen are Hindi and Marathi. The sense repository is the IndoWordnet, which is a linked structure of wordnets of 19 major Indian languages from the Indo-Aryan, Dravidian and Sino-Tibetan families. These wordnets have been created by following the expansion approach from the Hindi wordnet. The WSD algorithm is reminiscent of expectation maximization: the sense distribution of either language is estimated through the mediation of the sense distribution of the other language, in an iterative fashion. The WSD accuracy arrived at is better than any state-of-the-art accuracy for all-words, general-purpose unsupervised WSD.
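
The iterative estimation described in the abstract can be illustrated with a small sketch. This is not the chapter's algorithm, only a simplified illustration under stated assumptions: both languages share synset ids (as in a linked wordnet), each word's sense distribution is re-estimated from the count-weighted sense distributions of the other language's words that share a synset with it, and all words, synset ids and counts below are hypothetical toy data.

```python
from collections import defaultdict

# Hypothetical toy data: word -> synset ids it can express, and raw corpus counts.
SENSES_A = {"a_bank": ("FIN", "RIVER"), "a_edge": ("EDGE",)}
SENSES_B = {"b_fin": ("FIN",), "b_river": ("RIVER", "EDGE")}
COUNTS_A = {"a_bank": 10, "a_edge": 5}
COUNTS_B = {"b_fin": 8, "b_river": 2}


def invert(senses):
    """Map each synset id to the words of one language that can express it."""
    index = defaultdict(list)
    for word, synsets in senses.items():
        for synset in synsets:
            index[synset].append(word)
    return index


def uniform(senses):
    """Start with every sense of a word equally likely (no sense-marked corpus)."""
    return {w: {s: 1.0 / len(ss) for s in ss} for w, ss in senses.items()}


def update(senses_self, p_other, counts_other, index_other):
    """Re-estimate P(sense | word) for one language from the other language."""
    new = {}
    for word, synsets in senses_self.items():
        # Score of a sense: count-weighted probability that the other
        # language's words linked to this synset carry that same sense.
        scores = {
            s: sum(p_other[v][s] * counts_other.get(v, 0)
                   for v in index_other.get(s, ()))
            for s in synsets
        }
        total = sum(scores.values())
        if total > 0:
            new[word] = {s: score / total for s, score in scores.items()}
        else:
            # No evidence from the other language: fall back to uniform.
            new[word] = {s: 1.0 / len(synsets) for s in synsets}
    return new


def estimate(senses_a, senses_b, counts_a, counts_b, iterations=10):
    """Alternate the two updates, each language mediating the other's estimate."""
    p_a, p_b = uniform(senses_a), uniform(senses_b)
    index_a, index_b = invert(senses_a), invert(senses_b)
    for _ in range(iterations):
        p_a = update(senses_a, p_b, counts_b, index_b)
        p_b = update(senses_b, p_a, counts_a, index_a)
    return p_a, p_b
```

Calling `estimate(SENSES_A, SENSES_B, COUNTS_A, COUNTS_B)` returns one sense distribution per word for each language. In the toy data the ambiguous word `a_bank` shifts towards the sense whose cross-linked word is frequent and unambiguous, which is the intuition behind letting one untagged language inform the other.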

Notes

1. http://en.wikipedia.org/wiki/Ethnologue
2. http://www.cfilt.iitb.ac.in/wordnet/webhwn
3. https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
4. The PP “with a telescope” can get attached to either “saw” (‘I have the telescope’) or “the boy” (‘the boy has the telescope’).
5. http://www.cfilt.iitb.ac.in/wsd/annotated_corpus
6. http://wordnet.princeton.edu/
7. http://globalwordnet.org/
8. http://globalwordnet.org/global-wordnet-grid/
9. http://www.cfilt.iitb.ac.in/indowordnet/
10. http://wordnetweb.princeton.edu/perl/webwn
11. http://www.cfilt.iitb.ac.in/wsd/annotated_corpus
12. http://babelnet.org/

Author information

Authors and Affiliations

Department of Computer Science and Engineering, IIT Bombay, Bombay, India
Pushpak Bhattacharyya

Corresponding author

Correspondence to Pushpak Bhattacharyya.

Editor information

Editors and Affiliations

CNRS-LIF, UMR 7279, Aix-Marseille University, Marseille, France
Núria Gala
CNRS-LIF, UMR 7279, Aix-Marseille University and University of Mainz, Marseille, France
Reinhard Rapp
CNRS-LIF, UMR 7279, Aix-Marseille University, Marseille, France
Gemma Bel-Enguix

About this chapter

Cite this chapter

Bhattacharyya, P. (2015). Multilingual Projections. In: Gala, N., Rapp, R., Bel-Enguix, G. (eds) Language Production, Cognition, and the Lexicon. Text, Speech and Language Technology, vol 48. Springer, Cham. https://doi.org/10.1007/978-3-319-08043-7_11

DOI: https://doi.org/10.1007/978-3-319-08043-7_11
Published: 12 November 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08042-0
Online ISBN: 978-3-319-08043-7
eBook Packages: Computer Science, Computer Science (R0)