Improving OCR Accuracy for Classical Critical Editions

Boschetti, Federico; Romanello, Matteo; Babeu, Alison; Bamman, David; Crane, Gregory

doi:10.1007/978-3-642-04346-8_17

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5714)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

This paper describes a workflow designed to populate a digital library of ancient Greek critical editions with highly accurate OCR-scanned text. Although the most recent OCR engines can, after suitable training, handle the polytonic Greek fonts used in 19th- and 20th-century editions, post-processing can improve accuracy further. In particular, the paper discusses progressive multiple alignment applied to different OCR outputs produced from the same images.
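
The idea of aligning several OCR outputs of the same image can be illustrated with a minimal sketch. It is not the authors' implementation, which this preview does not show. It assumes unit costs for substitutions and gaps, aligns the outputs in the order given rather than along a guide tree, and settles each aligned column by a simple majority vote. The function names `align_to_columns`, `progressive_align`, and `vote` are hypothetical.

```python
from collections import Counter

GAP = "-"  # assumption: this marker never occurs in the OCR text itself


def align_to_columns(columns, seq):
    """Globally align `seq` against existing alignment columns.

    Each column is a list holding one character (or GAP) per sequence
    aligned so far. A character matches a column at no cost if any
    sequence already has that character there; otherwise it costs 1.
    Gaps cost 1.
    """
    n, m = len(columns), len(seq)
    width = len(columns[0]) if columns else 0
    # cost[i][j]: cheapest alignment of the first i columns with seq[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if seq[j - 1] in columns[i - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,
                             cost[i - 1][j] + 1,
                             cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to rebuild the columns.
    out, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if seq[j - 1] in columns[i - 1] else 1
            if cost[i][j] == cost[i - 1][j - 1] + sub:
                out.append(columns[i - 1] + [seq[j - 1]])
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            out.append(columns[i - 1] + [GAP])  # seq lacks this column
            i -= 1
        else:
            out.append([GAP] * width + [seq[j - 1]])  # seq has an extra char
            j -= 1
    out.reverse()
    return out


def progressive_align(outputs):
    """Add each OCR output to the growing alignment, in input order."""
    columns = []
    for seq in outputs:
        columns = align_to_columns(columns, seq)
    return columns


def vote(columns):
    """Keep the most frequent symbol per column; drop columns won by GAP."""
    chars = []
    for col in columns:
        winner, _ = Counter(col).most_common(1)[0]
        if winner != GAP:
            chars.append(winner)
    return "".join(chars)


if __name__ == "__main__":
    # Three hypothetical engine outputs for the same line image.
    outputs = ["critical edition", "critcal edition", "critical editlon"]
    print(vote(progressive_align(outputs)))
```

With an odd number of engines that err in different places, the vote recovers the reading shared by the majority. Ties go to the engine listed first, so in this sketch the most trusted output would be placed first.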

Author information

Authors and Affiliations

Perseus Digital Library, Tufts University, Eaton 124, Medford, MA, 02155, USA
Federico Boschetti, Matteo Romanello, Alison Babeu, David Bamman & Gregory Crane

Editor information

Editors and Affiliations

Department of Information Engineering, University of Padua, Via Gradenigo 6/a, 35131, Padova, Italy
Maristella Agosti
Department of Computer Science and Engineering IST, Instituto Superior Técnico, Av. Rovisco Pais, 1049-001, Lisboa, Portugal
José Borbinha
Department of Archives and Library Sciences, Ionian University, 72 Ioannou Theotoki str., 49100, Corfu, Greece
Sarantos Kapidakis
Department of Archives and Library Sciences, Ionian University, 72 Ioannou Theotoki str., 49100, Corfu, Greece
Christos Papatheodorou & Giannis Tsakonas

About this paper

Cite this paper

Boschetti, F., Romanello, M., Babeu, A., Bamman, D., Crane, G. (2009). Improving OCR Accuracy for Classical Critical Editions. In: Agosti, M., Borbinha, J., Kapidakis, S., Papatheodorou, C., Tsakonas, G. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2009. Lecture Notes in Computer Science, vol 5714. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04346-8_17

DOI: https://doi.org/10.1007/978-3-642-04346-8_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04345-1
Online ISBN: 978-3-642-04346-8
eBook Packages: Computer Science, Computer Science (R0)
