Data Driven Lemmatization and Parsing of Italian

Seddah, Djamé; Le Roux, Joseph; Sagot, Benoît

doi:10.1007/978-3-642-35828-9_27


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7689)

Included in the following conference series:

International Workshop on Evaluation of Natural Language and Speech Tools for Italian

Abstract

This paper presents preliminary results on data-driven lemmatisation for Italian. Based on a joint lemmatisation and part-of-speech tagging model, our system relies on an architecture that has already proved successful for French. Besides an intrinsic evaluation of this task, we measure its usefulness and adequacy by using our system's output as input to a parser. This approach achieves state-of-the-art parsing accuracy on unlabeled text without any gold information supplied (an F₁ score of 83.70% in a 10-fold cross-validation setting) and without requiring any prior knowledge of the language. This shows that our methodology is well suited to wide-coverage parsing of Italian.
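
As an illustration of the evaluation setting mentioned in the abstract, the following Python sketch shows how a labeled-bracket F₁ score can be computed and how a corpus can be split into 10 folds. It is a generic sketch, not the authors' evaluation code: the span representation `(label, start, end)`, the function names `bracket_f1` and `k_fold_indices`, the round-robin fold assignment, and the toy trees are all assumptions made for illustration, and this page does not state which scoring tool the paper used.

```python
from collections import Counter


def bracket_f1(gold, predicted):
    """Labeled-bracket F1 between two lists of (label, start, end) spans."""
    gold_counts, pred_counts = Counter(gold), Counter(predicted)
    # Multiset intersection: a span counts as matched at most as often
    # as it occurs in both analyses.
    matched = sum((gold_counts & pred_counts).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(pred_counts.values())
    recall = matched / sum(gold_counts.values())
    return 2 * precision * recall / (precision + recall)


def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists; item i is held out in fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test


# Hypothetical toy example: one gold tree and one predicted tree,
# each flattened to its labeled spans. The last span differs in label.
gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
pred = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
score = bracket_f1(gold, pred)

# Hypothetical corpus of 25 sentences split into 10 folds.
folds = list(k_fold_indices(25, k=10))
```

In a cross-validation setting, a model would be trained on each `train` split and scored on the matching `test` split, with the scores then aggregated over the 10 folds.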

Author information

Authors and Affiliations

Université Paris–Sorbonne, Paris, France
Djamé Seddah
Alpage, INRIA & Université Paris–Diderot (UMR-I 001), Paris, France
Djamé Seddah & Benoît Sagot
LIPN, Université Paris–Nord & CNRS (UMR 7030), Villetaneuse, France
Joseph Le Roux

Editor information

Editors and Affiliations

Fondazione Bruno Kessler, Via Sommarive 18, 38123, Povo, TN, Italy
Bernardo Magnini
University of Naples, Via Cinthia, 80126, Napoli, NA, Italy
Francesco Cutugno
Fondazione Ugo Bordoni, Viale del Policlinico, 161, Roma, Italy
Mauro Falcone
CELCT, Via alla Cascata, 38123, Povo, TN, Italy
Emanuele Pianta

About this paper

Cite this paper

Seddah, D., Le Roux, J., Sagot, B. (2013). Data Driven Lemmatization and Parsing of Italian. In: Magnini, B., Cutugno, F., Falcone, M., Pianta, E. (eds) Evaluation of Natural Language and Speech Tools for Italian. EVALITA 2012. Lecture Notes in Computer Science, vol 7689. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35828-9_27

DOI: https://doi.org/10.1007/978-3-642-35828-9_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35827-2
Online ISBN: 978-3-642-35828-9
eBook Packages: Computer Science, Computer Science (R0)
