Research on Language and Computation, Volume 3, Issue 2–3, pp. 247–279

Treebank-Based Acquisition of Multilingual Unification Grammar Resources

  • Aoife Cahill
  • Michael Burke
  • Martin Forst
  • Ruth O’Donovan
  • Christian Rohrer
  • Josef van Genabith
  • Andy Way
Article

Abstract

Deep unification- (constraint-)based grammars are usually hand-crafted. Scaling such grammars from fragments to unrestricted text is time-consuming and expensive. This problem can be exacerbated in multilingual broad-coverage grammar development scenarios. Cahill et al. (2002, 2004) and O’Donovan et al. (2004) present an automatic f-structure annotation-based methodology to acquire broad-coverage, deep, Lexical-Functional Grammar (LFG) resources for English from the Penn-II Treebank. In this paper we show how this model can be adapted to a multilingual grammar development scenario to induce robust, wide-coverage, PCFG-based LFG approximations for German from the TIGER Treebank. We show how the architecture of LFG, in particular the distinction between c-structure and f-structure representations, facilitates multilingual, treebank-based unification grammar induction, allowing us to cross-linguistically reuse the lexical extraction and parsing modules from O’Donovan et al. (2004) and Cahill et al. (2004), respectively. We evaluate our grammars against the PARC 700 Dependency Bank (King et al., 2003), against dependency structures for 2000 held-out sentences from the TIGER Corpus as well as against a hand-crafted dependency gold standard for 100 TIGER trees. Currently, our resources achieve 81.79% f-score against the PARC 700, a 2.19% improvement over the best result reported for a hand-crafted grammar in Kaplan et al. (2004), 74.6% against the 2000 held-out TIGER dependency structures and 71.08% against the 100-sentence TIGER gold standard, with substantially improved coverage compared to hand-crafted resources. We have since applied our methodology to induce wide-coverage LFG resources for Chinese (Burke et al., 2004b) from the Penn Chinese Treebank (Xue et al., 2002) and for Spanish from the CAST3LB Treebank (Civit, 2003).
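The f-scores quoted above are computed over dependency structures. As a minimal sketch of how such a score is typically obtained, the following assumes each analysis is flattened into (relation, head, dependent) triples and compares a parser's triples against a gold standard. The function name `prf` and the example triples are hypothetical and are not taken from the paper or from the PARC 700 or TIGER evaluation software.

```python
from collections import Counter


def prf(gold, predicted):
    """Return (precision, recall, f-score) over multisets of dependency triples."""
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Multiset intersection: a triple counts as matched at most as often
    # as it occurs in both the gold and the predicted analysis.
    matched = sum((gold_counts & pred_counts).values())
    precision = matched / sum(pred_counts.values()) if pred_counts else 0.0
    recall = matched / sum(gold_counts.values()) if gold_counts else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical (relation, head, dependent) triples for a single sentence.
gold = [
    ("subj", "sleep", "dog"),
    ("det", "dog", "the"),
    ("tense", "sleep", "past"),
]
predicted = [
    ("subj", "sleep", "dog"),
    ("det", "dog", "the"),
    ("tense", "sleep", "pres"),
    ("adjunct", "sleep", "soundly"),
]

p, r, f = prf(gold, predicted)
print(f"precision={p:.4f} recall={r:.4f} f-score={f:.4f}")
```

In this toy case two of the four predicted triples are correct and two of the three gold triples are recovered, so the f-score is the harmonic mean of 1/2 and 2/3. Corpus-level figures such as those in the abstract would normally be computed by pooling the triple counts over all sentences before taking precision and recall, which is an assumption here about the usual practice rather than a description of the authors' tooling.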

Keywords

  • Semantic Form
  • Computational Linguistics
  • Annotation Algorithm
  • Subcategorisation Frame
  • Grammar Development
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. Abeillé A. (ed.) (2003) Treebanks, Building and Using Parsed Corpora. Kluwer, Dordrecht.
  2. Bender E., Flickinger D., Oepen S. (2002) The Grammar Matrix: An Open-Source Starter-Kit for the Rapid Development of Cross-Linguistically Consistent Broad-Coverage Precision Grammars. In Carroll J., Oostdijk N., Sutcliffe R. (eds), Proceedings of the Workshop on Grammar Engineering and Evaluation at the 19th International Conference on Computational Linguistics. Taipei, Taiwan, pp. 8–14.
  3. Bouma G., van Noord G., Malouf R. (2000) Alpino: Wide-coverage Computational Analysis of Dutch. In Daelemans W., Sima’an K., Veenstra J., Zavrel J. (eds), Computational Linguistics in The Netherlands 2000. Rodopi, Amsterdam, pp. 45–59.
  4. Brants T., Dipper S., Hansen S., Lezius W., Smith G. (2002) The TIGER Treebank. In Hinrichs E., Simov K. (eds), Proceedings of the First Workshop on Treebanks and Linguistic Theories (TLT’02). Sozopol, Bulgaria, pp. 24–41.
  5. Bresnan J. (2001) Lexical-Functional Syntax. Blackwell, Oxford.
  6. Burke M., Cahill A., O’Donovan R., van Genabith J., Way A. (2004a) The Evaluation of an Automatic Annotation Algorithm against the PARC 700 Dependency Bank. In Proceedings of the Ninth International Conference on LFG. Christchurch, New Zealand, pp. 101–121.
  7. Burke M., Lam O., Chan R., Cahill A., O’Donovan R., Bodomo A., van Genabith J., Way A. (2004b) Treebank-Based Acquisition of a Chinese Lexical-Functional Grammar. In Proceedings of the 18th Pacific Asia Conference on Language, Information and Computation. Tokyo, Japan, pp. 161–172.
  8. Butt M., Dyvik H., King T.H., Masuichi H., Rohrer C. (2002) The Parallel Grammar Project. In Proceedings of COLING 2002, Workshop on Grammar Engineering and Evaluation. Taipei, Taiwan, pp. 1–7.
  9. Butt M., King T.H., Niño M.E., Segond F. (1999) A Grammar Writer’s Cookbook. CSLI Publications, Stanford, CA.
  10. Cahill A., Burke M., O’Donovan R., van Genabith J., Way A. (2004) Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics. Barcelona, Spain, pp. 320–327.
  11. Cahill A., McCarthy M., van Genabith J., Way A. (2002) Parsing with PCFGs and Automatic F-Structure Annotation. In Butt M., King T.H. (eds), Proceedings of the Seventh International Conference on LFG. CSLI Publications, Stanford, CA, pp. 76–95.
  12. Cahill A., McCarthy M., van Genabith J., Way A. (2003) Quasi-Logical Forms for the Penn Treebank. In Bunt H., van der Sluis I., Morante R. (eds), Proceedings of the Fifth International Workshop on Computational Semantics, IWCS-05. Tilburg, The Netherlands, pp. 55–71.
  13. Charniak E. (2000) A Maximum Entropy Inspired Parser. In Proceedings of the First Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2000). Seattle, WA, pp. 132–139.
  14. Civit M. (2003) Criterios de etiquetación y desambiguación morfosintáctica de corpus en español. Ph.D. thesis, Universitat de Barcelona, Spain.
  15. Collins M. (1999) Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania, Philadelphia, PA.
  16. Crouch R., Kaplan R., King T.H., Riezler S. (2002) A comparison of evaluation metrics for a broad coverage parser. In Proceedings of the LREC Workshop: Beyond PARSEVAL – Towards Improved Evaluation Measures for Parsing Systems. Las Palmas, Canary Islands, Spain, pp. 67–74.
  17. Dalrymple M. (2001) Lexical-Functional Grammar. Academic Press, San Diego, CA; London.
  18. Dipper S. (2003) Implementing and Documenting Large-scale Grammars – German LFG. Ph.D. thesis, IMS, University of Stuttgart. Arbeitspapiere des Instituts für Maschinelle Sprachverarbeitung (AIMS), Volume 9, Number 1.
  19. Flickinger D. (2000) On Building a More Efficient Grammar by Exploiting Types. Natural Language Engineering 6, 15–28.
  20. Forst M. (2003a) Treebank Conversion – Creating an f-structure bank from the TIGER Corpus. In Butt M., King T.H. (eds), Proceedings of the Eighth International Conference on LFG. CSLI Publications, Stanford, CA, pp. 205–216.
  21. Forst M. (2003b) Treebank Conversion – establishing a test suite for a broad-coverage LFG from the TIGER treebank. In Proceedings of the EACL Workshop on Linguistically Interpreted Corpora (LINC’03). Budapest, Hungary, pp. 25–32.
  22. Frank A., Sadler L., van Genabith J., Way A. (2003) From Treebank Resources to LFG F-Structures. In Abeillé A. (ed.), Treebanks: Building and Using Syntactically Annotated Corpora. Kluwer Academic Publishers, Dordrecht/Boston/London, The Netherlands, pp. 367–389.
  23. Gamon M., Lozano C., Pinkham J., Reutter T. (1997) Practical Experience with Grammar Sharing in Multilingual NLP. In Burstein J., Leacock C. (eds), Proceedings of the Workshop From Research to Commercial Applications: Making NLP Work in Practice, 35th Annual Meeting of the Association for Computational Linguistics and the 8th Conference of the European Chapter of the Association for Computational Linguistics (ACL-EACL’97). Madrid, Spain, pp. 49–56.
  24. Hemphill C.T., Godfrey J.J., Doddington G.R. (1990) The ATIS spoken language systems pilot corpus. In Proceedings of a workshop on Speech and natural language. Morgan Kaufmann Publishers Inc., Hidden Valley, PA, pp. 96–101.
  25. Hockenmaier J. (2003) Parsing with Generative models of Predicate-Argument Structure. In Proceedings of the 41st Annual Conference of the Association for Computational Linguistics. Sapporo, Japan, pp. 359–366.
  26. Hockenmaier J., Steedman M. (2002) Generative Models for Statistical Parsing with Combinatory Categorial Grammar. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, PA, pp. 335–342.
  27. Johnson M. (1999) PCFG models of linguistic tree representations. Computational Linguistics 24, 613–632.
  28. Kaplan R., Bresnan J. (1982) Lexical Functional Grammar, a Formal System for Grammatical Representation. In Bresnan J. (ed.), The Mental Representation of Grammatical Relations. MIT Press, Cambridge, MA, pp. 173–281.
  29. Kaplan R., Riezler S., King T.H., Maxwell J.T., Vasserman A., Crouch R. (2004) Speed and Accuracy in Shallow and Deep Stochastic Parsing. In Proceedings of the Human Language Technology Conference and the 4th Annual Meeting of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL’04). Boston, MA, pp. 97–104.
  30. Kaplan R., Zaenen A. (1989) Long-distance Dependencies, Constituent Structure and Functional Uncertainty. In Baltin M., Kroch A. (eds), Alternative Conceptions of Phrase Structure. Chicago University Press, Chicago, pp. 17–42. Reprinted in Dalrymple M. et al. (eds), Formal Issues in Lexical-Functional Grammar. CSLI Publications, 1995.
  31. Kaplan R.M., Netter K., Wedekind J., Zaenen A. (1989) Translation by structural correspondences. In Proceedings of the 4th Meeting of the European Chapter of the Association for Computational Linguistics. UMIST Manchester, UK, pp. 272–281.
  32. Kim R., Dalrymple M., Kaplan R., King T.H., Masuichi H., Ohkuma T. (2003) Multilingual Grammar Development via Grammar Porting. In Proceedings of the Workshop on Ideas and Strategies for Multilingual Grammar Development, ESSLLI 2003. Vienna, Austria, pp. 49–56.
  33. King T.H., Crouch R., Riezler S., Dalrymple M., Kaplan R. (2003) The PARC700 dependency bank. In Proceedings of the EACL03: 4th International Workshop on Linguistically Interpreted Corpora (LINC-03). Budapest, Hungary, pp. 1–8.
  34. Macleod C., Meyers A., Grishman R. (1994) The COMLEX Syntax Project: The First Year. In Proceedings of the ARPA Workshop on Human Language Technology. Princeton, NJ, pp. 669–703.
  35. Magerman D. (1994) Natural Language Parsing as Statistical Pattern Recognition. Ph.D. thesis, Department of Computer Science, Stanford University, CA.
  36. Marcus M., Kim G., Marcinkiewicz M.A., MacIntyre R., Bies A., Ferguson M., Katz K., Schasberger B. (1994) The Penn Treebank: Annotating Predicate Argument Structure. In Proceedings of the ARPA Workshop on Human Language Technology. Princeton, NJ, pp. 110–115.
  37. Masuichi H., Okuma T. (2003) Japanese Parser on the basis of the Lexical-Functional Grammar Formalism and its Evaluation. Journal of Natural Language Processing 10, 79–109.
  38. Miyao Y., Ninomiya T., Tsujii J. (2003) Probabilistic modeling of argument structures including non-local dependencies. In Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP). Borovets, Bulgaria, pp. 285–291.
  39. Miyao Y., Ninomiya T., Tsujii J. (2004) Corpus-oriented Grammar Development for Acquiring a Head-driven Phrase Structure Grammar from the Penn Treebank. In Proceedings of The First International Joint Conference on Natural Language Processing (IJCNLP-04). Hainan Island, China, pp. 390–397.
  40. Müller S., Kasper W. (2000) HPSG Analysis of German. In Verbmobil: Foundations of Speech-to-Speech Translation. Springer-Verlag, Artificial Intelligence, Berlin, Heidelberg, New York, pp. 238–253.
  41. O’Donovan R., Burke M., Cahill A., van Genabith J., Way A. (2004) Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II Treebank. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics. Barcelona, Spain, pp. 368–375.
  42. Pollard C., Sag I. (1994) Head-driven Phrase Structure Grammar. CSLI Publications, Stanford, CA.
  43. Riezler S., King T., Kaplan R., Crouch R., Maxwell J.T., Johnson M. (2002) Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques. In Proceedings of the 40th Annual Conference of the Association for Computational Linguistics (ACL-02). Philadelphia, PA, pp. 271–278.
  44. Schmid H. (2004) Efficient Parsing of Highly Ambiguous Context-Free Grammars with Bit Vectors. In Proceedings of the 18th International Conference on Computational Linguistics (COLING 2004). Geneva, Switzerland, pp. 162–168.
  45. Siegel M., Bender E. (2002) Efficient deep processing of Japanese. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002). Taipei, Taiwan, pp. 31–38.
  46. van Genabith J., Crouch R. (1996) Direct and Underspecified Interpretations of LFG f-Structures. In 16th International Conference on Computational Linguistics (COLING 96). Copenhagen, Denmark, pp. 262–267.
  47. van Genabith J., Crouch R. (1997) How to Glue a Donkey to an f-Structure or Porting a Dynamic Meaning Representation Language into LFG’s Linear Logic Based Glue Language Semantics. In Bunt H., Muskens R. (eds), Computing Meaning, volume 1, Studies in Linguistics and Philosophy, volume 73. Kluwer Academic Press, Dordrecht, Boston and London, pp. 129–148.
  48. van Genabith J., Way A., Sadler L. (1999) Data-driven Compilation of LFG Semantic Forms. In Proceedings of the EACL Workshop on Linguistically Interpreted Corpora (LINC-99). Bergen, Norway, pp. 69–76.
  49. Xue N., Chiou F.-D., Palmer M. (2002) Building a Large-Scale Annotated Chinese Corpus. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002). Taipei, Taiwan.

Copyright information

© Springer 2005

Authors and Affiliations

  • Aoife Cahill (1)
  • Michael Burke (1, 2)
  • Martin Forst (2)
  • Ruth O’Donovan (1)
  • Christian Rohrer (3)
  • Josef van Genabith (1, 2)
  • Andy Way (1, 2)

  1. National Centre for Language Technology, School of Computing, Dublin City University, Dublin 9, Ireland
  2. Centre for Advanced Studies, IBM, Dublin, Ireland
  3. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, Stuttgart, Germany
