Abstract
After some theoretical discussion of corpus representativity, this paper presents a simple yet very efficient technique for the (semi-)automatic detection of positions in a part-of-speech tagged corpus where an error is to be suspected. The approach is based on learning and applying “invalid bigrams”, i.e. pairs of adjacent tags that constitute an incorrect configuration in a text of a particular language (in English, for example, the bigram ARTICLE - VERB). The paper then generalizes “invalid bigrams” to “extended invalid bigrams of length n”, for any natural number n, which provides a powerful tool for error detection in a corpus. The approach is illustrated with English, German and Czech examples.
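The basic idea of invalid bigrams can be illustrated with a minimal Python sketch. It is not the authors' procedure: it assumes, for illustration only, that any pair of tags never observed as adjacent in a small trusted sample counts as invalid, and it does not cover the extended bigrams of length n. The function names, the toy tagset and the example sentences are all hypothetical.

```python
from itertools import product


def learn_invalid_bigrams(trusted_sentences, tagset=None):
    """Return the tag pairs never seen adjacent in the trusted sample.

    Assumption (not from the paper): "never attested" is treated as
    "invalid". Each sentence is a list of (word, tag) pairs.
    """
    seen = set()
    tags = set(tagset or ())
    for sentence in trusted_sentences:
        sentence_tags = [tag for _, tag in sentence]
        tags.update(sentence_tags)
        # Record every pair of adjacent tags that occurs in the sample.
        seen.update(zip(sentence_tags, sentence_tags[1:]))
    # All conceivable pairs over the tagset, minus the attested ones.
    return set(product(tags, repeat=2)) - seen


def suspect_positions(sentence, invalid_bigrams):
    """Return indices i where (tag[i], tag[i+1]) is an invalid bigram."""
    tags = [tag for _, tag in sentence]
    return [
        i for i, pair in enumerate(zip(tags, tags[1:]))
        if pair in invalid_bigrams
    ]


# Hypothetical trusted sample with a toy tagset.
trusted = [
    [("the", "ART"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "ART"), ("small", "ADJ"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
invalid = learn_invalid_bigrams(trusted)

# A sentence in which ART is directly followed by VERB.
tagged = [("the", "ART"), ("barks", "VERB")]
print(suspect_positions(tagged, invalid))  # [0]
```

A flagged position marks only a place where an error is to be suspected, so in a semi-automatic setting each hit would still be inspected by a human annotator.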
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Květoň, P., Oliva, K. (2002). Achieving an Almost Correct PoS-Tagged Corpus. In: Sojka, P., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2002. Lecture Notes in Computer Science, vol 2448. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46154-X_3
DOI: https://doi.org/10.1007/3-540-46154-X_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44129-8
Online ISBN: 978-3-540-46154-8