Efficient Language-Independent Retrieval of Printed Documents without OCR

Magdy, Walid; Darwish, Kareem; El-Saban, Motaz

doi:10.1007/978-3-642-03784-9_33

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5721)

Abstract

Recent book digitization initiatives have facilitated access to and search of millions of books. Although OCR remains essential for retrieving printed documents, OCR engines are limited in the languages they handle and are generally expensive to build. This paper proposes a language-independent approach that enables search through printed documents by combining image-based matching with conventional IR techniques, without using OCR. While image-based matching can be effective in finding similar words, complementing it with efficient retrieval techniques allows for sub-word matching, term weighting, and document ranking. The basic idea is that similar connected elements in printed documents are clustered and represented with IDs, which are then used to generate equivalent textual representations. The resultant representations are indexed using an IR engine and searched using the equivalent IDs of the connected elements in queries. Though the main benefit of the proposed approach lies in languages for which no OCR exists, the technique was tested on English and Arabic to ascertain its relative effectiveness. The approach achieves more than 61% relative effectiveness compared to using OCR for both languages. While the reported numbers are lower than those of OCR-based approaches, the proposed method is fully automated, does not require any supervised training, and allows documents to be searchable within a few hours.
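
The idea outlined in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not the authors' implementation: connected elements are stood in for by short feature tuples, clustering is a simple greedy distance threshold, and ranking is a plain TF-IDF sum over an in-memory inverted index instead of a real IR engine. The names `assign_ids`, `build_index`, and `search` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def assign_ids(elements, centroids, threshold=1.0):
    """Map each feature vector to a cluster ID (assumed greedy clustering).

    A vector joins the first centroid within `threshold`; otherwise it
    starts a new cluster. `centroids` is shared and grows in place.
    """
    ids = []
    for vec in elements:
        for cid, centre in enumerate(centroids):
            if math.dist(vec, centre) <= threshold:
                ids.append(f"c{cid}")
                break
        else:
            centroids.append(vec)
            ids.append(f"c{len(centroids) - 1}")
    return ids


def build_index(docs):
    """Build an inverted index: cluster ID -> {doc name: term frequency}."""
    index = defaultdict(Counter)
    for doc_id, tokens in docs.items():
        for tok in tokens:
            index[tok][doc_id] += 1
    return index


def search(index, n_docs, query_tokens):
    """Rank documents by a simple TF-IDF sum over the query's cluster IDs."""
    scores = Counter()
    for tok in query_tokens:
        postings = index.get(tok, {})
        if not postings:
            continue  # ID never seen in the collection
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return [doc_id for doc_id, _ in scores.most_common()]


# Made-up feature vectors standing in for connected elements on two pages.
pages = {
    "page1": [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0)],
    "page2": [(9.0, 1.0), (9.1, 1.1)],
}
centroids = []
docs = {name: assign_ids(vecs, centroids) for name, vecs in pages.items()}
index = build_index(docs)

# A query element is mapped to IDs with the same centroids, then searched.
query_ids = assign_ids([(0.05, 0.05)], centroids)
results = search(index, len(docs), query_ids)
```

Sharing one centroid list between documents and queries is what lets a query element resolve to the same ID as visually similar elements in the collection; a query element that matches no existing cluster simply gets a fresh ID and retrieves nothing.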

Author information

Authors and Affiliations

School of Computing, Dublin City University, Dublin 9, Ireland
Walid Magdy
Cairo Microsoft Innovation Center, Microsoft, Smart Village, B115, Abou Rawash, Egypt
Kareem Darwish & Motaz El-Saban

Editor information

Editors and Affiliations

Swedish Institute of Computer Science, Kista, Sweden
Jussi Karlgren
Department of Computer Science and Engineering, Helsinki University of Technology, P.O. Box 5400, 02015 HUT, Espoo, Finland
Jorma Tarhio
Department of Computer Sciences, University of Tampere, Tampere, Finland
Heikki Hyyrö

Cite this paper

Magdy, W., Darwish, K., El-Saban, M. (2009). Efficient Language-Independent Retrieval of Printed Documents without OCR. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds) String Processing and Information Retrieval. SPIRE 2009. Lecture Notes in Computer Science, vol 5721. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03784-9_33

DOI: https://doi.org/10.1007/978-3-642-03784-9_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03783-2
Online ISBN: 978-3-642-03784-9
eBook Packages: Computer Science, Computer Science (R0)
