Abstract
This paper presents preliminary results on data-driven lemmatisation for Italian. Our system is based on a joint lemmatisation and part-of-speech tagging model and relies on an architecture that has already proven successful for French. Besides an intrinsic evaluation of this task, we measure its usefulness and adequacy by using the system's output as input to a parser. This approach achieves state-of-the-art parsing accuracy on unlabeled text with no gold information supplied (an F1 score of 83.70% in a 10-fold cross-validation setting) and without requiring any prior knowledge of the language. This shows that our methodology is well suited to wide-coverage parsing of Italian.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Seddah, D., Le Roux, J., Sagot, B. (2013). Data Driven Lemmatization and Parsing of Italian. In: Magnini, B., Cutugno, F., Falcone, M., Pianta, E. (eds) Evaluation of Natural Language and Speech Tools for Italian. EVALITA 2012. Lecture Notes in Computer Science, vol 7689. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35828-9_27
DOI: https://doi.org/10.1007/978-3-642-35828-9_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35827-2
Online ISBN: 978-3-642-35828-9
eBook Packages: Computer Science (R0)