Improving OCR Accuracy for Classical Critical Editions

  • Conference paper
Research and Advanced Technology for Digital Libraries (ECDL 2009)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5714)

Abstract

This paper describes a workflow designed to populate a digital library of ancient Greek critical editions with highly accurate OCR text. Although the most recent OCR engines can, after suitable training, handle the polytonic Greek fonts used in 19th- and 20th-century editions, post-processing can improve accuracy further. In particular, the paper discusses a progressive multiple alignment method applied to different OCR outputs produced from the same page images.
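As an illustration of the general idea only, the following minimal Python sketch aligns several OCR outputs of the same line and takes a per-column majority vote. It is not the authors' implementation: the function names, the scoring values (+1 match, -1 mismatch, -1 gap), the fixed input order (a full progressive alignment would typically choose the order from a guide tree) and the simple voting rule are all assumptions made for this example.

```python
from collections import Counter

GAP = "\0"  # sentinel for "no character here"; assumed never to occur in OCR text


def align_to_profile(columns, seq, n_prev, gap_pen=-1):
    """Globally align `seq` against an existing alignment (Needleman-Wunsch).

    `columns` is a list of columns, each holding one symbol per sequence
    aligned so far; `n_prev` is the number of those sequences.
    The scoring scheme is illustrative, not taken from the paper.
    """
    def score(col, ch):
        # +1 for every symbol in the column equal to ch, -1 for every other one
        return sum(1 if c == ch else -1 for c in col)

    m, n = len(columns), len(seq)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + gap_pen
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + gap_pen
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(columns[i - 1], seq[j - 1]),
                dp[i - 1][j] + gap_pen,   # gap in the new sequence
                dp[i][j - 1] + gap_pen,   # gap in all previous sequences
            )

    # Trace back from the bottom-right cell, rebuilding the widened columns.
    out, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score(columns[i - 1], seq[j - 1]):
            out.append(columns[i - 1] + [seq[j - 1]])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap_pen:
            out.append(columns[i - 1] + [GAP])
            i -= 1
        else:
            out.append([GAP] * n_prev + [seq[j - 1]])
            j -= 1
    out.reverse()
    return out


def progressive_consensus(outputs):
    """Align the OCR outputs one after another and vote column by column.

    Assumes `outputs` is a non-empty list of strings. Ties are resolved in
    favour of the symbol contributed by the earliest output.
    """
    columns = [[ch] for ch in outputs[0]]
    for k, seq in enumerate(outputs[1:], start=1):
        columns = align_to_profile(columns, seq, n_prev=k)
    voted = (Counter(col).most_common(1)[0][0] for col in columns)
    return "".join(ch for ch in voted if ch != GAP)


# Three hypothetical engine outputs for the same transliterated line.
print(progressive_consensus(["menin aeide", "menin aide", "mcnin aeide"]))
```

With three outputs, a character that only one engine misreads or drops is outvoted by the other two. In practice the vote would probably be weighted by engine reliability or checked against a lexicon, which this sketch does not attempt.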


Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Boschetti, F., Romanello, M., Babeu, A., Bamman, D., Crane, G. (2009). Improving OCR Accuracy for Classical Critical Editions. In: Agosti, M., Borbinha, J., Kapidakis, S., Papatheodorou, C., Tsakonas, G. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2009. Lecture Notes in Computer Science, vol 5714. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04346-8_17

  • DOI: https://doi.org/10.1007/978-3-642-04346-8_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-04345-1

  • Online ISBN: 978-3-642-04346-8

  • eBook Packages: Computer Science, Computer Science (R0)
