Acquisition and Exploitation of Textual Resources for NLP

Armstrong-Warwick, Susan

doi:10.1007/978-0-585-35958-8_23

Part of the book series: Linguistica Computazionale ((LICO,volume 9))

Abstract

Electronic access to large collections of texts and their translations provides a new resource for language analysis and translation studies. Empirical and statistical methods offer the means to organize the data and to develop alternative models that improve our understanding of language use. From a practical point of view, they provide a basis for improving the performance of NLP systems. A prerequisite for this work is the availability of machine-readable texts in an appropriate format. This paper presents current initiatives to acquire and prepare the necessary textual resources for corpus-based work and reviews methods under development to exploit the data.

Author information

Authors and Affiliations

ISSCO, University of Geneva, Switzerland
Susan Armstrong-Warwick

Editor information

Antonio Zampolli, Nicoletta Calzolari, Martha Palmer

About this chapter

Cite this chapter

Armstrong-Warwick, S. (1994). Acquisition and Exploitation of Textual Resources for NLP. In: Zampolli, A., Calzolari, N., Palmer, M. (eds) Current Issues in Computational Linguistics: In Honour of Don Walker. Linguistica Computazionale, vol 9. Springer, Dordrecht. https://doi.org/10.1007/978-0-585-35958-8_23

DOI: https://doi.org/10.1007/978-0-585-35958-8_23
Publisher Name: Springer, Dordrecht
Print ISBN: 978-0-7923-2998-5
Online ISBN: 978-0-585-35958-8
eBook Packages: Springer Book Archive
