Abstract
The last decade has taught computational linguists that high performance on broad-coverage natural language processing tasks is best obtained using supervised learning techniques, which require annotation of large quantities of training data. But annotated text is hard to obtain. Some have emphasized making the most out of limited amounts of annotation. Others have argued that we should focus on simpler learning algorithms and find ways to exploit much larger quantities of text, though those efforts have tended to focus on linguistically shallow problems. In this paper, I describe my efforts to exploit larger quantities of data while still focusing on linguistically deeper problems such as parsing and word sense disambiguation. The trick, I argue, is to take advantage of the shared meaning hidden between the lines of sentences in parallel translation.
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Resnik, P. (2004). Exploiting Hidden Meanings: Using Bilingual Text for Monolingual Annotation. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2004. Lecture Notes in Computer Science, vol 2945. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24630-5_35
DOI: https://doi.org/10.1007/978-3-540-24630-5_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21006-1
Online ISBN: 978-3-540-24630-5
eBook Packages: Springer Book Archive