Using Automated Error Profiling of Texts for Improved Selection of Correction Candidates for Garbled Tokens

Mihov, Stoyan; Mitankin, Petar; Gotscharek, Annette; Reffle, Ulrich; Schulz, Klaus U.; Ringlstetter, Christoph

doi:10.1007/978-3-540-76928-6_47

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4830)

Included in the following conference series:

Australasian Joint Conference on Artificial Intelligence

Abstract

Lexical text correction systems are typically based on a central step: when finding a malformed token in the input text, a set of correction candidates for the token is retrieved from the given background dictionary. In previous work we introduced a method for the selection of correction candidates which is fast and leads to small candidate sets with high recall. As a prerequisite, ground truth data were used to find a set of important substitutions, merges and splits that represent characteristic errors found in the text. This prior knowledge was then used to fine-tune the meaningful selection of correction candidates. Here we show that an appropriate set of possible substitutions, merges and splits for the input text can be retrieved without any ground truth data. In the new approach, we compute an error profile of the erroneous input text in a fully automated way, using so-called error dictionaries. From this profile, suitable sets of substitutions, merges and splits are derived. Error profiling with error dictionaries is simple and very fast. As an overall result we obtain an adaptive form of candidate selection which is very efficient, does not need ground truth data and leads to small candidate sets with high recall.
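
The idea of profiling a text with an error dictionary and then restricting candidate selection to the observed error types can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the abstract does not specify the data structures or algorithms, and the toy error dictionary, the "source>garbled" pattern notation, the threshold `min_count`, and the function names are all hypothetical. The naive string replacement in `candidates` stands in for whatever efficient search the authors actually use.

```python
from collections import Counter

# Hypothetical error dictionary: garbled form -> (intended word, edit pattern).
# Patterns are written "source>garbled"; all entries are invented for illustration.
ERROR_DICT = {
    "tbe": ("the", "h>b"),              # substitution: one symbol for another
    "rnodern": ("modern", "m>rn"),      # split: one symbol became two
    "cornputer": ("computer", "m>rn"),
    "tirne": ("time", "m>rn"),
    "cirde": ("circle", "cl>d"),        # merge: two symbols became one
}

def error_profile(tokens, error_dict=ERROR_DICT):
    """Count how often each edit pattern is witnessed by error-dictionary hits."""
    profile = Counter()
    for token in tokens:
        hit = error_dict.get(token.lower())
        if hit is not None:
            _, pattern = hit
            profile[pattern] += 1
    return profile

def select_patterns(profile, min_count=2):
    """Keep only the patterns seen at least min_count times in the text."""
    return {pattern for pattern, count in profile.items() if count >= min_count}

def candidates(token, patterns, lexicon):
    """Undo one selected pattern at one position; keep results found in the lexicon."""
    found = set()
    for pattern in patterns:
        source, garbled = pattern.split(">")
        start = token.find(garbled)
        while start != -1:
            repaired = token[:start] + source + token[start + len(garbled):]
            if repaired in lexicon:
                found.add(repaired)
            start = token.find(garbled, start + 1)
    return found

tokens = "Tbe rnodern cornputer saves tirne and tbe cirde".split()
profile = error_profile(tokens)
patterns = select_patterns(profile)
suggestions = candidates("garne", patterns, {"game", "gate", "the"})
```

In this toy run the profile is built from the input text alone, with no ground truth: only the patterns that occur often enough survive, and the garbled token "garne" is matched against the lexicon using just those patterns, which keeps the candidate set small.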

Author information

Authors and Affiliations

IPP, Bulgarian Academy of Sciences: Stoyan Mihov & Petar Mitankin
CIS, University of Munich: Annette Gotscharek, Ulrich Reffle & Klaus U. Schulz
AICML, University of Alberta: Christoph Ringlstetter

Editor information

Mehmet A. Orgun, John Thornton

About this paper

Cite this paper

Mihov, S., Mitankin, P., Gotscharek, A., Reffle, U., Schulz, K.U., Ringlstetter, C. (2007). Using Automated Error Profiling of Texts for Improved Selection of Correction Candidates for Garbled Tokens. In: Orgun, M.A., Thornton, J. (eds) AI 2007: Advances in Artificial Intelligence. AI 2007. Lecture Notes in Computer Science, vol 4830. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76928-6_47

DOI: https://doi.org/10.1007/978-3-540-76928-6_47
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-76926-2
Online ISBN: 978-3-540-76928-6
eBook Packages: Computer Science, Computer Science (R0)