Acquisition and Exploitation of Textual Resources for NLP

  • Susan Armstrong-Warwick
Chapter
Part of the Linguistica Computazionale book series (LICO, volume 9)

Abstract

Electronic access to large collections of texts and their translations provides a new resource for language analysis and translation studies. Empirical and statistical methods offer the means to organize the data and to develop alternative models aimed at a better understanding of how we use language. From a practical point of view, they provide a basis for progress in the performance of NLP systems. A prerequisite for this work is the availability of machine-readable texts in an appropriate format. This paper presents current initiatives to acquire and prepare the necessary textual resources for corpus-based work and reviews methods currently under development to exploit the data.

Keywords

Noun Phrase; Machine Translation; Statistical Machine Translation; Computational Linguistics; Parallel Corpus
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer Science+Business Media Dordrecht 1994

Authors and Affiliations

  • Susan Armstrong-Warwick (1)
  1. ISSCO, University of Geneva, Switzerland
