Part of the book series: Linguistica Computazionale (LICO, volume 9)

Abstract

Electronic access to large collections of texts and their translations provides a new resource for language analysis and translation studies. Empirical and statistical methods offer the means to organize the data and to develop alternative models that lead to a better understanding of how we use language. From a practical point of view, they provide a basis for improving the performance of NLP systems. A prerequisite for this work is the availability of machine-readable texts in an appropriate format. This paper presents current initiatives to acquire and prepare the textual resources needed for corpus-based work and reviews methods currently under development to exploit the data.

Editor information

Antonio Zampolli, Nicoletta Calzolari, Martha Palmer

Copyright information

© 1994 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Armstrong-Warwick, S. (1994). Acquisition and Exploitation of Textual Resources for NLP. In: Zampolli, A., Calzolari, N., Palmer, M. (eds) Current Issues in Computational Linguistics: In Honour of Don Walker. Linguistica Computazionale, vol 9. Springer, Dordrecht. https://doi.org/10.1007/978-0-585-35958-8_23

  • DOI: https://doi.org/10.1007/978-0-585-35958-8_23

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-0-7923-2998-5

  • Online ISBN: 978-0-585-35958-8

  • eBook Packages: Springer Book Archive
