Investigating the Best Configuration of HMM Spanish PoS Tagger when Minimum Amount of Training Data Is Available

  • Sergio Ferrández
  • Jesús Peral
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3513)


Part-of-speech (PoS) tagging is an important processing step in many natural language systems (information extraction, question answering, etc.), and it has been tackled with a number of different approaches. In this paper we study the behaviour of a Hidden Markov Model (HMM) PoS tagger for Spanish when only a minimal amount of training data is available. The tagger is an HMM whose states are tag pairs that emit words, and it relies on transition and lexical probabilities. The technique was suggested by Rabiner [11], and our implementation is influenced by Brants [2]. We investigated the best HMM configuration using a small training corpus of about 50,000 words; the maximum precision obtained on previously unseen Spanish text was 95.36%.
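The model outlined in the abstract, with states that are tag pairs, transition probabilities over tag trigrams, and lexical probabilities of words given tags, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy corpus, the tag labels, and the add-k smoothing are all assumptions made for illustration (Brants [2], for instance, describes smoothing by linear interpolation instead). Decoding uses the Viterbi algorithm [14] in log space.

```python
import math
from collections import Counter

START, STOP = "<s>", "</s>"  # hypothetical boundary symbols

def train(tagged_sentences):
    """Collect the counts behind the transition and lexical probabilities."""
    tri, ctx, emit, tag_count, vocab = Counter(), Counter(), Counter(), Counter(), set()
    for sent in tagged_sentences:
        tags = [START, START] + [t for _, t in sent] + [STOP]
        for i in range(2, len(tags)):
            tri[tuple(tags[i - 2:i + 1])] += 1  # tag trigram
            ctx[tuple(tags[i - 2:i])] += 1      # its two-tag history
        for word, tag in sent:
            emit[(tag, word)] += 1
            tag_count[tag] += 1
            vocab.add(word)
    return {"tri": tri, "ctx": ctx, "emit": emit,
            "tag_count": tag_count, "vocab_size": len(vocab)}

def log_trans(m, u, v, t, k=1.0):
    """log P(t | u, v) with add-k smoothing (an assumption of this sketch)."""
    outcomes = len(m["tag_count"]) + 1  # every tag plus STOP
    return math.log((m["tri"][(u, v, t)] + k) / (m["ctx"][(u, v)] + k * outcomes))

def log_emit(m, t, w, k=1.0):
    """log P(w | t) with add-k smoothing; one extra slot covers unknown words."""
    return math.log((m["emit"][(t, w)] + k)
                    / (m["tag_count"][t] + k * (m["vocab_size"] + 1)))

def viterbi(m, words):
    """Most likely tag sequence; each state is a pair (previous tag, current tag)."""
    tags = list(m["tag_count"])
    best = {(START, START): 0.0}  # state -> best log-probability so far
    back = []                     # per word: state -> tag two positions back
    for w in words:
        nxt, bp = {}, {}
        for (u, v), score in best.items():
            for t in tags:
                s = score + log_trans(m, u, v, t) + log_emit(m, t, w)
                if s > nxt.get((v, t), -math.inf):
                    nxt[(v, t)] = s
                    bp[(v, t)] = u
        best = nxt
        back.append(bp)
    # Close the sentence with the transition into STOP, then follow back-pointers.
    state = max(best, key=lambda st: best[st] + log_trans(m, st[0], st[1], STOP))
    out = []
    for bp in reversed(back):
        out.append(state[1])
        state = (bp[state], state[0])
    return out[::-1]

# Toy corpus, invented for illustration only (not the paper's training data).
corpus = [
    [("el", "DET"), ("perro", "NOUN"), ("come", "VERB")],
    [("la", "DET"), ("casa", "NOUN"), ("es", "VERB"), ("grande", "ADJ")],
    [("el", "DET"), ("gato", "NOUN"), ("duerme", "VERB")],
]
model = train(corpus)
print(viterbi(model, ["el", "gato", "come"]))
```

Because each state carries the two most recent tags, a transition from state (u, v) to state (v, t) encodes the trigram probability P(t | u, v), which is why only the tag two positions back needs to be stored as a back-pointer. With a training corpus as small as the one studied here, the choice of smoothing and unknown-word handling matters far more than in this toy example.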


Hidden Markov Model, Emission Probability, Viterbi Algorithm, Question Answering, Training Corpus
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




  1. Atserias, J., Carmona, J., Castellón, I., Cervell, S., Civit, M., Márquez, L., Martí, M.A., Padró, L., Placer, R., Rodríguez, H., Taulé, M., Turmo, J.: Morphosyntactic Analysis and Parsing of Unrestricted Spanish Text. In: First International Conference on Language Resources and Evaluation, LREC 1998, pp. 1267–1272 (1998)
  2. Brants, T.: TnT - a statistical part-of-speech tagger. In: Proceedings of the 6th Conference on Applied Natural Language Processing, ANLP, pp. 224–231 (2000)
  3. Brill, E.: Transformation-based error-driven learning of natural language: A case study in part of speech tagging. Computational Linguistics 21, 543–565 (1995)
  4. Brill, E.: A corpus-based Approach to Language Learning (1993)
  5. Carreras, X., Chao, I., Padró, L., Padró, M.: Freeling: An open-source suite of language analyzers. In: Proceedings of the 4th International Conference on Language Resources and Evaluation, LREC 2004, pp. 1364–1371 (2004)
  6. Civit, M.: Criterios de etiquetación y desambiguación morfosintáctica de corpus en español. PhD thesis, Linguistics Department, Universitat de Barcelona (2003)
  7. Daelemans, W., Zavrel, J., Berck, P., Gillis, S.: A memory-based part-of-speech tagger generator. In: Proceedings of the 4th Workshop on Very Large Corpora, pp. 14–27 (1996)
  8. Figuerola, G., Zazo, F., Rodríguez, E., Alonso, J.: La Recuperación de Información en español y la normalización de términos. Revista Iberoamericana de Inteligencia Artificial VIII(22), 135–145 (2004)
  9. Mérialdo, B.: Tagging English text with a probabilistic model. Computational Linguistics 20(2), 155–171 (1994)
  10. Padró, M., Padró, L.: Developing Competitive HMM PoS Taggers Using Small Training Corpora. España for Natural Language Processing, EsTAL, pp. 127–136 (2004)
  11. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2), 257–286 (1989)
  12. Ratnaparkhi, A.: A maximum entropy part-of-speech tagger. In: Proceedings of the 1st Conference on Empirical Methods in Natural Language Processing, EMNLP, pp. 16–19 (1996)
  13. Schmid, H.: TreeTagger, a language independent part-of-speech tagger. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart (1995)
  14. Viterbi, A.J.: Error bounds for convolutional codes and asymptotically optimal decoding algorithm. IEEE Transactions on Information Theory, 260–269 (1967)

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Sergio Ferrández (1)
  • Jesús Peral (1)
  1. Grupo de Investigación en Procesamiento del Lenguaje y Sistemas de Información, Departamento de Lenguajes y Sistemas Informáticos, University of Alicante, Spain
