Abstract
Electronic access to large collections of texts and their translations provides a new resource for language analysis and translation studies. Empirical and statistical methods offer the means to organize the data and to develop alternative models that improve our understanding of how language is used. From a practical point of view, they provide a basis for progress in the performance of NLP systems. A prerequisite for this work is the availability of machine-readable texts in an appropriate format. This paper presents current initiatives to acquire and prepare the necessary textual resources for corpus-based work and reviews current methods under development to exploit the data.
© 1994 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Armstrong-Warwick, S. (1994). Acquisition and Exploitation of Textual Resources for NLP. In: Zampolli, A., Calzolari, N., Palmer, M. (eds) Current Issues in Computational Linguistics: In Honour of Don Walker. Linguistica Computazionale, vol 9. Springer, Dordrecht. https://doi.org/10.1007/978-0-585-35958-8_23
DOI: https://doi.org/10.1007/978-0-585-35958-8_23
Publisher Name: Springer, Dordrecht
Print ISBN: 978-0-7923-2998-5
Online ISBN: 978-0-585-35958-8
eBook Packages: Springer Book Archive