
Using Automated Error Profiling of Texts for Improved Selection of Correction Candidates for Garbled Tokens

  • Conference paper
AI 2007: Advances in Artificial Intelligence (AI 2007)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4830)


Abstract

Lexical text correction systems are typically based on a central step: when finding a malformed token in the input text, a set of correction candidates for the token is retrieved from the given background dictionary. In previous work we introduced a method for the selection of correction candidates which is fast and leads to small candidate sets with high recall. As a prerequisite, ground truth data were used to find a set of important substitutions, merges and splits that represent characteristic errors found in the text. This prior knowledge was then used to fine-tune the meaningful selection of correction candidates. Here we show that an appropriate set of possible substitutions, merges and splits for the input text can be retrieved without any ground truth data. In the new approach, we compute an error profile of the erroneous input text in a fully automated way, using so-called error dictionaries. From this profile, suitable sets of substitutions, merges and splits are derived. Error profiling with error dictionaries is simple and very fast. As an overall result we obtain an adaptive form of candidate selection which is very efficient, does not need ground truth data and leads to small candidate sets with high recall.
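The abstract describes the profiling step only at a high level. As a rough illustration, the Python sketch below shows how matching the tokens of a noisy text against an error dictionary could yield frequency counts of substitutions, merges and splits. Everything here is a simplifying assumption made for illustration: the miniature ERROR_DICT, the single-edit heuristic in classify_edit and all names are hypothetical stand-ins, not the authors' implementation, which relies on large, automatically generated error dictionaries and proper string alignment.

```python
from collections import Counter

# Hypothetical miniature error dictionary: maps garbled word forms to their
# correct spellings. (Error dictionaries of this kind would be built
# beforehand, e.g. by applying typical OCR confusions to dictionary words.)
ERROR_DICT = {
    "tbe": "the",       # substitution h -> b
    "cvent": "event",   # substitution e -> c
    "arnong": "among",  # split m -> rn
    "modem": "modern",  # merge rn -> m
    "sbip": "ship",     # substitution h -> b
}

def classify_edit(garbled, correct):
    """Classify the difference between a garbled token and its correct form
    as a single substitution, merge or split.

    Simplifying assumption: exactly one error per token; a full method would
    align the two strings with an edit-distance traceback instead.
    """
    # Strip the longest common prefix ...
    i = 0
    while i < min(len(garbled), len(correct)) and garbled[i] == correct[i]:
        i += 1
    # ... and the longest common suffix that does not overlap the prefix.
    j = 0
    while (j < min(len(garbled), len(correct)) - i
           and garbled[len(garbled) - 1 - j] == correct[len(correct) - 1 - j]):
        j += 1
    g_mid = garbled[i:len(garbled) - j]
    c_mid = correct[i:len(correct) - j]
    if len(c_mid) == 1 and len(g_mid) == 1:
        return ("sub", c_mid, g_mid)    # one character misrecognized as another
    if len(c_mid) == 2 and len(g_mid) == 1:
        return ("merge", c_mid, g_mid)  # two characters collapsed into one
    if len(c_mid) == 1 and len(g_mid) == 2:
        return ("split", c_mid, g_mid)  # one character split into two
    return ("other", c_mid, g_mid)

def error_profile(tokens, error_dict):
    """Count the edit operations observed in tokens that match the error
    dictionary; the counts characterize the typical errors of the text."""
    profile = Counter()
    for tok in tokens:
        correct = error_dict.get(tok.lower())
        if correct is not None:
            profile[classify_edit(tok.lower(), correct)] += 1
    return profile

if __name__ == "__main__":
    noisy = "Tbe modem sbip was lost arnong tbe waves after tbe cvent"
    for op, count in error_profile(noisy.split(), ERROR_DICT).most_common():
        print(op, count)
```

In such a sketch, the most frequent operations of the resulting profile would then restrict which substitutions, merges and splits the candidate selection step applies when querying the background dictionary, which is what keeps the candidate sets small while preserving recall.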



Editor information

Mehmet A. Orgun, John Thornton


Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Mihov, S., Mitankin, P., Gotscharek, A., Reffle, U., Schulz, K.U., Ringlstetter, C. (2007). Using Automated Error Profiling of Texts for Improved Selection of Correction Candidates for Garbled Tokens. In: Orgun, M.A., Thornton, J. (eds) AI 2007: Advances in Artificial Intelligence. AI 2007. Lecture Notes in Computer Science (LNAI), vol 4830. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76928-6_47

  • DOI: https://doi.org/10.1007/978-3-540-76928-6_47

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-76926-2

  • Online ISBN: 978-3-540-76928-6

