Achieving an Almost Correct PoS-Tagged Corpus

Květoň, Pavel; Oliva, Karel

doi:10.1007/3-540-46154-X_3


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2448)

Included in the following conference series:

International Conference on Text, Speech and Dialogue


Abstract

After some theoretical discussion of the representativity of a corpus, this paper presents a simple yet very efficient technique for (semi-)automatic detection of those positions in a part-of-speech tagged corpus where an error is to be suspected. The approach is based on learning and applying “invalid bigrams”, i.e. on searching for pairs of adjacent tags which constitute an incorrect configuration in a text of a particular language (in English, e.g., the bigram ARTICLE - VERB). Further, the paper describes the generalization of “invalid bigrams” into “extended invalid bigrams of length n”, for any natural n, which provides a powerful tool for error detection in a corpus. The approach is illustrated by English, German and Czech examples.
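
The basic idea of the abstract, flagging corpus positions where two adjacent tags form an invalid bigram, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `suspect_positions`, the tag labels, and the example sentence are hypothetical, and the set of invalid bigrams is simply assumed to be given, since the abstract does not describe how it is learned. Only the ARTICLE - VERB pair is taken from the text.

```python
from typing import List, Set, Tuple

Bigram = Tuple[str, str]


def suspect_positions(tags: List[str], invalid: Set[Bigram]) -> List[int]:
    """Return every index i where (tags[i], tags[i + 1]) is a listed invalid bigram."""
    return [
        i
        for i in range(len(tags) - 1)
        if (tags[i], tags[i + 1]) in invalid
    ]


# Assumption: the invalid bigrams are supplied by hand here.
# ARTICLE - VERB is the English example mentioned in the abstract.
INVALID: Set[Bigram] = {("ARTICLE", "VERB")}

# Hypothetical tagged sentence in which the noun "walk" was mis-tagged as a verb.
sentence = [("the", "ARTICLE"), ("walk", "VERB"), ("was", "VERB"), ("long", "ADJ")]
tags = [tag for _, tag in sentence]

# Indices of the first tag of each suspicious pair, to be checked by a human.
hits = suspect_positions(tags, INVALID)
for i in hits:
    print(f"suspect at position {i}: {sentence[i]} followed by {sentence[i + 1]}")
```

A flagged position only marks a place where an error is to be suspected; in keeping with the (semi-)automatic character of the technique, the decision about which of the two tags is wrong is left to manual inspection in this sketch.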



Author information

Authors and Affiliations

Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Malostranské nám. 25, CZ - 118 00, Praha 1 - Malá Strana, Czech Republic
Pavel Květoň
Austrian Research Institute for Artificial Intelligence (ÖFAI), Schottengasse 3, A-1010, Wien, Austria
Karel Oliva


Editor information

Editors and Affiliations

Faculty of Informatics Department of Programming Systems and Communication, Masaryk University, Botanická 68a, 602 00, Brno, Czech Republic
Petr Sojka
Faculty of Informatics Department of Information Technologies, Masaryk University, Botanická 68a, 602 00, Brno, Czech Republic
Ivan Kopeček & Karel Pala


About this paper

Cite this paper

Květoň, P., Oliva, K. (2002). Achieving an Almost Correct PoS-Tagged Corpus. In: Sojka, P., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2002. Lecture Notes in Computer Science, vol 2448. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46154-X_3


DOI: https://doi.org/10.1007/3-540-46154-X_3
Published: 23 August 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44129-8
Online ISBN: 978-3-540-46154-8
eBook Packages: Springer Book Archive
