Exploiting Hidden Meanings: Using Bilingual Text for Monolingual Annotation

Resnik, Philip

doi:10.1007/978-3-540-24630-5_35

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2945)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

The last decade has taught computational linguists that high performance on broad-coverage natural language processing tasks is best obtained using supervised learning techniques, which require annotation of large quantities of training data. But annotated text is hard to obtain. Some have emphasized making the most out of limited amounts of annotation. Others have argued that we should focus on simpler learning algorithms and find ways to exploit much larger quantities of text, though those efforts have tended to focus on linguistically shallow problems. In this paper, I describe my efforts to exploit larger quantities of data while still focusing on linguistically deeper problems such as parsing and word sense disambiguation. The trick, I argue, is to take advantage of the shared meaning hidden between the lines of sentences in parallel translation.

Author information

Authors and Affiliations

Department of Linguistics and Institute for Advanced Computer Studies, University of Maryland, College Park, Maryland, 20742, USA
Philip Resnik

Editor information

Editors and Affiliations

National Polytechnic Institute, Center for Computing Research, 07738, Mexico City, México
Alexander Gelbukh

About this paper

Cite this paper

Resnik, P. (2004). Exploiting Hidden Meanings: Using Bilingual Text for Monolingual Annotation. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2004. Lecture Notes in Computer Science, vol 2945. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24630-5_35

DOI: https://doi.org/10.1007/978-3-540-24630-5_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21006-1
Online ISBN: 978-3-540-24630-5
eBook Packages: Springer Book Archive
