Abstract
Despite the current practice of re-keying most documents placed in digital libraries, we continue to try to improve the accuracy of automated recognition techniques for obtaining document image content. This task is made more difficult when the document in question has been rendered in letterpress, subjected to hundreds of years of aging, and microfilmed before scanning.
We endeavored to leave intact a previously described document reconstruction technique, and to enhance the document image to bring the perceived production values up to a more modern standard, in order to process a novel of historic importance: Don Quixote by Miguel de Cervantes Saavedra. Pre-processing of the page images before application of the reconstruction techniques was performed to accommodate early 17th-century typography and low-quality scanned microfilm images.
Though our technology easily outstripped the capabilities of commercial OCRs, it too was found lacking, at this stage of development, for automated processing of historical documents for digital libraries.
We had hoped to develop a useful transcription of the text and a lexicon of Spanish contemporary with the composition of this novel. However, the actual accomplishment was limited to improving the recognizability of the page images involved and providing a basis for further research.
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Spitz, A.L. (2004). Tilting at Windmills: Adventures in Attempting to Reconstruct Don Quixote. In: Marinai, S., Dengel, A.R. (eds) Document Analysis Systems VI. DAS 2004. Lecture Notes in Computer Science, vol 3163. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28640-0_6
DOI: https://doi.org/10.1007/978-3-540-28640-0_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23060-1
Online ISBN: 978-3-540-28640-0
eBook Packages: Springer Book Archive