Developing Competitive HMM PoS Taggers Using Small Training Corpora

  • Conference paper
Advances in Natural Language Processing (EsTAL 2004)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3230)

Abstract

This paper presents a study that seeks the best strategy for developing a fast and accurate HMM tagger when only a limited amount of training material is available. This is a crucial factor when dealing with languages for which annotated material is not easily available.

First, we run experiments on English, using the WSJ corpus as a test bench to establish the differences caused by using a large or a small training set. Then, we port the results to develop an accurate Spanish PoS tagger using a limited amount of training data.

Different configurations of an HMM tagger are studied: trigram and 4-gram models are tested, as well as different smoothing techniques. The performance of each configuration is measured as a function of training corpus size in order to determine the most appropriate setting for developing HMM PoS taggers for languages with little annotated corpus available.
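To make the configuration space concrete, the following is a minimal sketch of how tag-transition probabilities for an n-gram HMM tagger might be estimated with a configurable order (3 for trigram, 4 for 4-gram) and one simple smoothing scheme. The add-lambda (Lidstone) smoothing used here, the function name `train_transitions`, the boundary symbols, and the toy corpus are all illustrative assumptions, not the estimator or data used in the paper.

```python
from collections import Counter


def train_transitions(tagged_sentences, order=3, lam=1.0):
    """Estimate P(tag | previous order-1 tags) with add-lambda smoothing.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    order: n-gram order over tags (3 = trigram, 4 = 4-gram).
    lam: smoothing constant (hypothetical choice; 1.0 is Laplace smoothing).
    """
    context_counts = Counter()
    ngram_counts = Counter()
    tagset = set()
    for sentence in tagged_sentences:
        # Pad with start symbols so the first real tag has a full context.
        tags = ["<s>"] * (order - 1) + [tag for _, tag in sentence] + ["</s>"]
        tagset.update(tags[order - 1:])
        for i in range(order - 1, len(tags)):
            context = tuple(tags[i - order + 1:i])
            ngram_counts[context + (tags[i],)] += 1
            context_counts[context] += 1

    def prob(tag, context):
        # Smoothed relative frequency; unseen contexts fall back to uniform.
        context = tuple(context)
        numerator = ngram_counts[context + (tag,)] + lam
        denominator = context_counts[context] + lam * len(tagset)
        return numerator / denominator

    return prob, tagset


# Toy corpus (invented for illustration only).
corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN")],
]

prob3, tags3 = train_transitions(corpus, order=3, lam=1.0)
prob4, tags4 = train_transitions(corpus, order=4, lam=1.0)

# Trigram estimate: (2 + 1) / (3 + 1 * 4) = 3/7
print(prob3("VBZ", ("DT", "NN")))
# 4-gram estimate for a context never seen in training (uniform over the tag set)
print(prob4("NN", ("NN", "NN", "NN")))
```

With a small training set, higher-order contexts are seen less often, so more of the probability mass comes from the smoothing term; this is the kind of trade-off between model order and smoothing that the study evaluates.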

This research has been partially supported by the European Commission (Meaning, IST-2001-34460) and by the Catalan Government Research Department (DURSI).

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Padró, M., Padró, L. (2004). Developing Competitive HMM PoS Taggers Using Small Training Corpora. In: Vicedo, J.L., Martínez-Barco, P., Muñoz, R., Saiz Noeda, M. (eds) Advances in Natural Language Processing. EsTAL 2004. Lecture Notes in Computer Science, vol 3230. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30228-5_12

  • DOI: https://doi.org/10.1007/978-3-540-30228-5_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23498-2

  • Online ISBN: 978-3-540-30228-5

  • eBook Packages: Springer Book Archive
