Using Wiktionary to Build an Italian Part-of-Speech Tagger

De Smedt, Tom; Marfia, Fabio; Matteucci, Matteo; Daelemans, Walter

doi:10.1007/978-3-319-07983-7_1

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8455)

Included in the following conference series:

International Conference on Applications of Natural Language to Data Bases/Information Systems

Abstract

Although Natural Language Processing (NLP) has progressed considerably, basic resources are still missing for many languages, including Italian, in particular resources that are free for both research and commercial use. One such resource is a part-of-speech tagger, a first processing step in many NLP applications. We describe a weakly supervised, fast, free and reasonably accurate part-of-speech tagger for Italian, created by mining words and their part-of-speech tags from Wiktionary. We have integrated the tagger in Pattern, a freely available Python toolkit. We believe that our approach is general enough to be applied to other languages as well.
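To make the idea in the abstract concrete, the following is a minimal sketch of a lexicon-based tagger, assuming the mined Wiktionary data is already available as (word, tag) pairs. It is not the authors' implementation and does not use the Pattern API; the sample entries, the Penn Treebank-style tag names, the function names `build_lexicon` and `tag_tokens`, and the `NN` fallback for unknown words are all illustrative assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical (word, tag) pairs standing in for entries mined from Wiktionary.
# The tag names follow Penn Treebank conventions purely for illustration.
MINED_ENTRIES = [
    ("il", "DT"),
    ("gatto", "NN"),
    ("nero", "JJ"),
    ("nero", "NN"),
    ("nero", "JJ"),
    ("dorme", "VB"),
]


def build_lexicon(entries):
    """Map each lowercased word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for word, tag in entries:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_tokens(tokens, lexicon, default="NN"):
    """Tag each token by lexicon lookup; unknown words get the default tag."""
    return [(token, lexicon.get(token.lower(), default)) for token in tokens]


lexicon = build_lexicon(MINED_ENTRIES)
tagged = tag_tokens("Il gatto nero dorme".split(), lexicon)
print(tagged)
```

A lookup of this kind is fast because it involves no statistical inference at tagging time, but it ignores context; an ambiguous word such as "nero" always receives its most frequent tag.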

Author information

Authors and Affiliations

CLiPS Computational Linguistics Research Group, University of Antwerp, Antwerp, Belgium
Tom De Smedt & Walter Daelemans
DEIB Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy
Fabio Marfia & Matteo Matteucci

Editor information

Editors and Affiliations

Conservatoire National des Arts et Métiers, Computer Science, 2 rue Conté, 75003, Paris, France
Elisabeth Métais
Cirad, TETIS, 500 rue J.F. Breton, 34093, Montpellier Cedex 5, France
Mathieu Roche
Irstea, TETIS, 500 rue J.F. Breton, 34093, Montpellier Cedex 5, France
Maguelonne Teisseire

About this paper

Cite this paper

De Smedt, T., Marfia, F., Matteucci, M., Daelemans, W. (2014). Using Wiktionary to Build an Italian Part-of-Speech Tagger. In: Métais, E., Roche, M., Teisseire, M. (eds) Natural Language Processing and Information Systems. NLDB 2014. Lecture Notes in Computer Science, vol 8455. Springer, Cham. https://doi.org/10.1007/978-3-319-07983-7_1

DOI: https://doi.org/10.1007/978-3-319-07983-7_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07982-0
Online ISBN: 978-3-319-07983-7
eBook Packages: Computer Science, Computer Science (R0)