Abstract
Recent book digitization initiatives have made millions of books accessible and searchable. Although OCR remains essential for retrieving printed documents, OCR engines are limited in the languages they handle and are generally expensive to build. This paper proposes a language-independent approach that enables search through printed documents by combining image-based matching with conventional IR techniques, without using OCR. While image-based matching can be effective in finding similar words, complementing it with efficient retrieval techniques allows for sub-word matching, term weighting, and document ranking. The basic idea is that similar connected elements in printed documents are clustered and represented with IDs, which are then used to generate equivalent textual representations. The resultant representations are indexed using an IR engine and searched using the equivalent IDs of the connected elements in queries. Though the main benefit of the proposed approach lies in languages for which no OCR exists, the technique was tested on English and Arabic to ascertain the relative effectiveness of the approach. The approach achieves more than 61% relative effectiveness compared to using OCR for both languages. While the reported numbers are lower than those of OCR-based approaches, the proposed method is fully automated, does not require any supervised training, and allows documents to be searchable within a few hours.
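The pipeline outlined in the abstract (cluster similar connected elements, replace each with a cluster ID, index the resulting pseudo-text, and query with the same IDs) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the 2-D feature vectors, the greedy leader clustering, the distance threshold, the `e<ID>` token format, and the names `ElementClusterer`, `build_index`, and `search`. A plain inverted index stands in for the IR engine, so term weighting and ranking are not shown.

```python
from collections import defaultdict
from math import dist


class ElementClusterer:
    """Greedy leader clustering (an illustrative stand-in, not the paper's method)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.leaders = []  # one representative feature vector per cluster

    def element_id(self, vec):
        # Reuse the first cluster whose leader is close enough, else open a new one.
        for cid, leader in enumerate(self.leaders):
            if dist(vec, leader) <= self.threshold:
                return cid
        self.leaders.append(vec)
        return len(self.leaders) - 1

    def word_token(self, word):
        # A word is a sequence of connected elements; its token is the ID sequence.
        return "-".join(f"e{self.element_id(v)}" for v in word)


def build_index(docs, clusterer):
    """Map each pseudo-text token to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for word in words:
            index[clusterer.word_token(word)].add(doc_id)
    return index


def search(index, clusterer, query_word):
    """Convert the query word image to its token and look it up."""
    return sorted(index.get(clusterer.word_token(query_word), set()))


# Hypothetical data: each word is a list of 2-D element feature vectors.
docs = {
    "page1": [[(0.0, 0.0), (1.0, 1.0)], [(5.0, 5.0)]],
    "page2": [[(0.05, 0.0), (1.0, 0.95)]],
    "page3": [[(5.1, 5.0)], [(9.0, 9.0)]],
}
clusterer = ElementClusterer(threshold=0.2)
index = build_index(docs, clusterer)

two_element_hits = search(index, clusterer, [(0.02, 0.01), (0.98, 1.0)])
one_element_hits = search(index, clusterer, [(5.0, 5.05)])
print(two_element_hits, one_element_hits)
```

In this sketch the query is rendered through the same clusterer as the documents, which is what lets slightly different images of the same element map to the same ID. Emitting n-grams of element IDs instead of whole-word tokens would be one way to approximate the sub-word matching mentioned above.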
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Magdy, W., Darwish, K., El-Saban, M. (2009). Efficient Language-Independent Retrieval of Printed Documents without OCR. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds) String Processing and Information Retrieval. SPIRE 2009. Lecture Notes in Computer Science, vol 5721. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03784-9_33
DOI: https://doi.org/10.1007/978-3-642-03784-9_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03783-2
Online ISBN: 978-3-642-03784-9
eBook Packages: Computer Science (R0)