Using Wiktionary to Build an Italian Part-of-Speech Tagger

  • Conference paper
Natural Language Processing and Information Systems (NLDB 2014)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8455)

Abstract

While there has been a lot of progress in Natural Language Processing (NLP), many basic resources are still missing for many languages, including Italian, especially resources that are free for both research and commercial use. One of these basic resources is a Part-of-Speech tagger, a first processing step in many NLP applications. We describe a weakly-supervised, fast, free and reasonably accurate part-of-speech tagger for the Italian language, created by mining words and their part-of-speech tags from Wiktionary. We have integrated the tagger in Pattern, a freely available Python toolkit. We believe that our approach is general enough to be applied to other languages as well.
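
The abstract does not spell out the tagging procedure, so the following is only a minimal sketch of the general idea of tagging from a mined word list, not the authors' implementation. It assumes a lexicon that maps each word to a single part-of-speech tag and falls back to a default tag for unknown words. The names `TOY_LEXICON` and `tag`, the four hand-written entries, the coarse tag names, and the `NOUN` fallback are all hypothetical choices made for illustration.

```python
# Toy lexicon standing in for word/tag pairs mined from Wiktionary.
# These entries are hand-written for illustration, not taken from the paper.
TOY_LEXICON = {
    "il": "DET",
    "gatto": "NOUN",
    "nero": "ADJ",
    "dorme": "VERB",
}


def tag(sentence, lexicon=TOY_LEXICON, default="NOUN"):
    """Tag whitespace-separated tokens by lexicon lookup.

    Words missing from the lexicon receive the default tag
    (an assumed fallback, not one stated in the abstract).
    """
    return [(token, lexicon.get(token.lower(), default))
            for token in sentence.split()]


if __name__ == "__main__":
    for token, pos in tag("Il gatto nero dorme"):
        print(token, pos)
```

A lookup like this ignores context, so ambiguous words always receive the same tag; a practical tagger would need some way to resolve such cases.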

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

De Smedt, T., Marfia, F., Matteucci, M., Daelemans, W. (2014). Using Wiktionary to Build an Italian Part-of-Speech Tagger. In: Métais, E., Roche, M., Teisseire, M. (eds) Natural Language Processing and Information Systems. NLDB 2014. Lecture Notes in Computer Science, vol 8455. Springer, Cham. https://doi.org/10.1007/978-3-319-07983-7_1

  • DOI: https://doi.org/10.1007/978-3-319-07983-7_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-07982-0

  • Online ISBN: 978-3-319-07983-7

  • eBook Packages: Computer Science, Computer Science (R0)
