Exploiting Hidden Meanings: Using Bilingual Text for Monolingual Annotation

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2004)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2945))

Abstract

The last decade has taught computational linguists that high performance on broad-coverage natural language processing tasks is best obtained using supervised learning techniques, which require annotation of large quantities of training data. But annotated text is hard to obtain. Some have emphasized making the most out of limited amounts of annotation. Others have argued that we should focus on simpler learning algorithms and find ways to exploit much larger quantities of text, though those efforts have tended to focus on linguistically shallow problems. In this paper, I describe my efforts to exploit larger quantities of data while still focusing on linguistically deeper problems such as parsing and word sense disambiguation. The trick, I argue, is to take advantage of the shared meaning hidden between the lines of sentences in parallel translation.

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Resnik, P. (2004). Exploiting Hidden Meanings: Using Bilingual Text for Monolingual Annotation. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2004. Lecture Notes in Computer Science, vol 2945. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24630-5_35

  • DOI: https://doi.org/10.1007/978-3-540-24630-5_35

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-21006-1

  • Online ISBN: 978-3-540-24630-5

  • eBook Packages: Springer Book Archive
