Abstract
Lexical text correction systems typically rely on a central step: when a malformed token is found in the input text, a set of correction candidates for the token is retrieved from a given background dictionary. In previous work we introduced a method for selecting correction candidates that is fast and yields small candidate sets with high recall. As a prerequisite, ground truth data were used to find a set of important substitutions, merges and splits that represent characteristic errors found in the text. This prior knowledge was then used to fine-tune the selection of correction candidates. Here we show that an appropriate set of possible substitutions, merges and splits for the input text can be retrieved without any ground truth data. In the new approach, we compute an error profile of the erroneous input text in a fully automated way, using so-called error dictionaries. From this profile, suitable sets of substitutions, merges and splits are derived. Error profiling with error dictionaries is simple and very fast. As an overall result we obtain an adaptive form of candidate selection that is very efficient, needs no ground truth data and leads to small candidate sets with high recall.
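The profiling idea can be illustrated with a small sketch. The abstract does not say how the error dictionaries are built or how the profile is thresholded, so everything below is an assumption made for illustration: the function names, the toy word list, the candidate patterns and the `min_count` cutoff are all hypothetical. The sketch garbles correct words with candidate patterns to obtain an error dictionary, counts which patterns explain tokens of the input text, and keeps the frequent ones.

```python
from collections import Counter


def build_error_dictionary(words, patterns):
    """Map each garbled form to the (correct word, pattern) that produced it.

    A pattern is a pair (left, right): 'left' in a correct word is replaced
    by 'right'. ("m", "rn") is a split, ("rn", "m") a merge and ("e", "c")
    a substitution. This construction is an illustrative assumption.
    """
    lexicon = set(words)
    error_dict = {}
    for word in words:
        for left, right in patterns:
            start = word.find(left)
            while start != -1:
                garbled = word[:start] + right + word[start + len(left):]
                # Garbled forms that are correct words cannot signal an error.
                if garbled not in lexicon:
                    # Keep the first explanation if a form is ambiguous.
                    error_dict.setdefault(garbled, (word, (left, right)))
                start = word.find(left, start + 1)
    return error_dict


def profile_text(tokens, error_dict):
    """Count how often each pattern explains a token of the input text."""
    counts = Counter()
    for token in tokens:
        entry = error_dict.get(token)
        if entry is not None:
            counts[entry[1]] += 1
    return counts


def select_patterns(profile, min_count=2):
    """Keep the patterns seen at least min_count times (hypothetical cutoff)."""
    return [pattern for pattern, n in profile.most_common() if n >= min_count]


# Toy data, invented for this example.
words = ["modern", "time", "come", "the", "text"]
patterns = [("m", "rn"), ("rn", "m"), ("e", "c"), ("t", "l")]
tokens = "thc rnodern tirne has corne and thc lext".split()

error_dict = build_error_dictionary(words, patterns)
profile = profile_text(tokens, error_dict)
selected = select_patterns(profile)
print(selected)
```

No ground truth is consulted at any point: the profile is computed from the erroneous text and the error dictionary alone, and the selected patterns could then restrict which substitutions, merges and splits are allowed during candidate retrieval.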
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mihov, S., Mitankin, P., Gotscharek, A., Reffle, U., Schulz, K.U., Ringlstetter, C. (2007). Using Automated Error Profiling of Texts for Improved Selection of Correction Candidates for Garbled Tokens. In: Orgun, M.A., Thornton, J. (eds) AI 2007: Advances in Artificial Intelligence. AI 2007. Lecture Notes in Computer Science(), vol 4830. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76928-6_47
DOI: https://doi.org/10.1007/978-3-540-76928-6_47
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-76926-2
Online ISBN: 978-3-540-76928-6
eBook Packages: Computer Science, Computer Science (R0)