Data Driven Lemmatization and Parsing of Italian

  • Conference paper
Evaluation of Natural Language and Speech Tools for Italian (EVALITA 2012)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7689)

Abstract

This paper presents preliminary results on data-driven lemmatisation for Italian. Based on a joint lemmatisation and part-of-speech tagging model, our system relies on an architecture that has already proved successful for French. Besides an intrinsic evaluation of this task, we measure its usefulness and adequacy by using our system's output as input to a parser. This approach achieves state-of-the-art parsing accuracy on unlabeled text without any gold information supplied (an F1 score of 83.70% in a 10-fold cross-validation setting) and without requiring any prior knowledge of the language. This shows that our methodology is well suited to wide-coverage parsing of Italian.
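The abstract reports an F1 score obtained under 10-fold cross-validation but does not spell out the scoring procedure. As a rough illustration only, the sketch below assumes a PARSEVAL-style F1 over labeled constituent spans and contiguous, near-equal folds; the function names `bracket_f1` and `k_fold_splits` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def bracket_f1(gold, pred):
    """F1 over labeled spans (label, start, end), compared as multisets.

    Assumption: constituents are scored PARSEVAL-style; the paper's exact
    scorer and its settings are not given in the abstract.
    """
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    # Multiset intersection: a span counts as matched at most as often
    # as it occurs in both the gold and the predicted tree.
    matched = sum((gold_counts & pred_counts).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(pred_counts.values())
    recall = matched / sum(gold_counts.values())
    return 2 * precision * recall / (precision + recall)


def k_fold_splits(n_items, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds.

    Assumption: folds are contiguous and differ in size by at most one item.
    """
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size
```

In such a setup, each fold's held-out sentences would be parsed by a model trained on the other nine folds, and the resulting spans scored against the gold trees; whether scores are averaged per fold or pooled over all sentences is a further choice the abstract leaves open.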



Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Seddah, D., Le Roux, J., Sagot, B. (2013). Data Driven Lemmatization and Parsing of Italian. In: Magnini, B., Cutugno, F., Falcone, M., Pianta, E. (eds) Evaluation of Natural Language and Speech Tools for Italian. EVALITA 2012. Lecture Notes in Computer Science (LNAI), vol 7689. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35828-9_27

  • DOI: https://doi.org/10.1007/978-3-642-35828-9_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-35827-2

  • Online ISBN: 978-3-642-35828-9

  • eBook Packages: Computer Science, Computer Science (R0)
