Abstract
This paper describes a workflow designed to populate a digital library of ancient Greek critical editions with highly accurate OCR text. The most recent OCR engines can, after suitable training, handle the polytonic Greek fonts used in 19th- and 20th-century editions, but post-processing can improve accuracy further. In particular, the paper discusses progressive multiple alignment applied to different OCR outputs produced from the same images.
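To make the idea of aligning several OCR outputs concrete, the following Python fragment is a minimal sketch and not the authors' implementation. It assumes unit edit costs, adds the outputs in the order given (no guide tree), and resolves each aligned column by simple majority vote, which is a simplification chosen for illustration. The names `GAP`, `align_to_profile`, `vote`, and `merge_outputs` are hypothetical.

```python
from collections import Counter

GAP = "\x00"  # hypothetical gap marker, assumed never to occur in OCR text


def align_to_profile(profile, seq, width):
    """Align string `seq` to `profile`, a list of columns (lists of chars).

    `width` is the number of outputs already merged into the profile.
    Needleman-Wunsch with unit costs; a column matches a character when
    any character already in the column equals it.
    """
    n, m = len(profile), len(seq)
    # cost[i][j]: minimal cost of aligning profile[:i] with seq[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if seq[j - 1] in profile[i - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,  # match or substitute
                             cost[i - 1][j] + 1,        # gap in seq
                             cost[i][j - 1] + 1)        # gap in profile
    # Trace back from the bottom-right corner to rebuild the columns.
    columns = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if seq[j - 1] in profile[i - 1] else 1
            if cost[i][j] == cost[i - 1][j - 1] + sub:
                columns.append(profile[i - 1] + [seq[j - 1]])
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            columns.append(profile[i - 1] + [GAP])
            i -= 1
        else:
            columns.append([GAP] * width + [seq[j - 1]])
            j -= 1
    columns.reverse()
    return columns


def vote(columns):
    """Pick the most frequent character per column; drop winning gaps."""
    chars = []
    for col in columns:
        winner, _ = Counter(col).most_common(1)[0]
        if winner != GAP:
            chars.append(winner)
    return "".join(chars)


def merge_outputs(outputs):
    """Progressively align OCR outputs of one line and vote per column."""
    if not outputs:
        return ""
    profile = [[ch] for ch in outputs[0]]
    for width, seq in enumerate(outputs[1:], start=1):
        profile = align_to_profile(profile, seq, width)
    return vote(profile)
```

Called on three transliterated readings of the same line, such as `merge_outputs(["menin aeide", "mcnin aeide", "menin aeidc"])`, the sketch recovers the reading on which two of the three outputs agree at each position. For polytonic Greek, the inputs would presumably need to share one Unicode normalization form before comparison, since the alignment compares code points.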
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Boschetti, F., Romanello, M., Babeu, A., Bamman, D., Crane, G. (2009). Improving OCR Accuracy for Classical Critical Editions. In: Agosti, M., Borbinha, J., Kapidakis, S., Papatheodorou, C., Tsakonas, G. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2009. Lecture Notes in Computer Science, vol 5714. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04346-8_17
DOI: https://doi.org/10.1007/978-3-642-04346-8_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04345-1
Online ISBN: 978-3-642-04346-8