Abstract
Although state-of-the-art parsers for natural language are lexicalized, it was recently shown that an accurate unlexicalized parser for the Penn tree-bank can simply be read off a manually refined tree-bank. Lexicalized parsers often suffer from sparse data, whereas manual mark-up is costly and largely based on individual linguistic intuition. Thus, across domains, languages, and tree-bank annotations, a fundamental question arises: Is it possible to automatically induce an accurate parser from a tree-bank without resorting to full lexicalization? In this paper, we show how to induce a probabilistic parser with latent head information from simple linguistic principles. Our parser achieves 85.1% (LP/LR F1), which is as good as the performance of early lexicalized parsers. This is remarkable since the induction of probabilistic grammars is in general a hard task.
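For readers unfamiliar with the LP/LR F1 figure quoted above, the following is a minimal sketch of how labeled precision (LP), labeled recall (LR), and their harmonic mean F1 are commonly computed over labeled constituent spans. The function name `labeled_f1`, the `(label, start, end)` span encoding, and the toy sentence are illustrative assumptions, not taken from the paper; standard evaluation tools aggregate counts over a whole test corpus and apply further conventions (such as ignoring punctuation) that are omitted here.

```python
from collections import Counter


def labeled_f1(gold, predicted):
    """Return (LP, LR, F1) for one sentence.

    gold, predicted: lists of (label, start, end) constituents.
    The span encoding is an assumption made for this sketch.
    """
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Multiset intersection: a constituent matches only if label and span agree.
    matched = sum((gold_counts & pred_counts).values())
    lp = matched / len(predicted) if predicted else 0.0
    lr = matched / len(gold) if gold else 0.0
    f1 = 2 * lp * lr / (lp + lr) if lp + lr > 0 else 0.0
    return lp, lr, f1


# Hypothetical toy example: the parser mislabels one of four constituents.
gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
pred = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
lp, lr, f1 = labeled_f1(gold, pred)
print(lp, lr, f1)
```

In this toy case three of the four predicted constituents match the gold tree, so LP, LR, and F1 all equal 0.75.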
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Prescher, D. (2005). Inducing Head-Driven PCFGs with Latent Heads: Refining a Tree-Bank Grammar for Parsing. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds) Machine Learning: ECML 2005. ECML 2005. Lecture Notes in Computer Science, vol 3720. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11564096_30
DOI: https://doi.org/10.1007/11564096_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29243-2
Online ISBN: 978-3-540-31692-3
eBook Packages: Computer Science, Computer Science (R0)