Achieving an Almost Correct PoS-Tagged Corpus

  • Conference paper
Text, Speech and Dialogue (TSD 2002)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2448)

Abstract

After a theoretical discussion of the representativity of a corpus, this paper presents a simple yet very efficient technique for the (semi-)automatic detection of positions in a part-of-speech tagged corpus where an error is to be suspected. The approach is based on learning and applying “invalid bigrams”, i.e. on searching for pairs of adjacent tags that constitute an incorrect configuration in a text of a particular language (in English, e.g., the bigram ARTICLE - VERB). The paper then generalizes “invalid bigrams” to “extended invalid bigrams of length n”, for any natural number n, which provides a powerful tool for error detection in a corpus. The approach is illustrated with English, German and Czech examples.
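
The plain invalid-bigram idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes, as a simplification not stated in the abstract, that any tag pair never observed in a small trusted sample counts as invalid. The function names, the three-tag tagset and the example sentences are all hypothetical, and the extension to “extended invalid bigrams of length n” is not covered.

```python
from typing import Iterable, List, Set, Tuple

TaggedSentence = List[Tuple[str, str]]  # (word, tag) pairs
TagPair = Tuple[str, str]


def seen_bigrams(sentences: Iterable[TaggedSentence]) -> Set[TagPair]:
    """Collect every pair of adjacent tags that occurs in the trusted sample."""
    seen: Set[TagPair] = set()
    for sentence in sentences:
        tags = [tag for _, tag in sentence]
        seen.update(zip(tags, tags[1:]))
    return seen


def invalid_bigrams(tagset: Set[str], seen: Set[TagPair]) -> Set[TagPair]:
    """Assumption: a tag pair that never occurs in the trusted sample is invalid."""
    return {(a, b) for a in tagset for b in tagset if (a, b) not in seen}


def suspect_positions(sentence: TaggedSentence, invalid: Set[TagPair]) -> List[int]:
    """Return the index of the first token of each invalid adjacent tag pair."""
    tags = [tag for _, tag in sentence]
    return [i for i, pair in enumerate(zip(tags, tags[1:])) if pair in invalid]


# Hypothetical trusted sample and tagset, for illustration only.
trusted = [
    [("the", "ART"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "ART"), ("cat", "NOUN"), ("sleeps", "VERB")],
]
tagset = {"ART", "NOUN", "VERB"}
invalid = invalid_bigrams(tagset, seen_bigrams(trusted))

# "walk" is mistagged as VERB after an article, so position 0 is flagged.
flagged = suspect_positions([("the", "ART"), ("walk", "VERB")], invalid)
```

With a sample this small, most of the flagged pairs would be false alarms, so the flagged positions are only candidates for manual inspection. That matches the "(semi-)automatic" framing of the abstract.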

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Květoň, P., Oliva, K. (2002). Achieving an Almost Correct PoS-Tagged Corpus. In: Sojka, P., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2002. Lecture Notes in Computer Science, vol 2448. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46154-X_3

  • DOI: https://doi.org/10.1007/3-540-46154-X_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44129-8

  • Online ISBN: 978-3-540-46154-8

  • eBook Packages: Springer Book Archive