Abstract
We present a segmentation-free method for retrieving keywords from degraded historical documents. The method works directly on the grayscale representation and requires no pre-processing to enhance the document images. Each document image is subdivided into overlapping patches of varying sizes, and each patch is described by a bag-of-visual-words descriptor. The patch descriptors are hashed into several hash tables using a kernelized locality-sensitive hashing scheme for efficient retrieval. Under this scheme, the search for a keyword is restricted to the small fraction of patches stored in the matching entries of the hash tables. Because the method must capture handwriting variations and the supply of historical documents is limited, we synthesize a small number of samples from the given query to improve the retrieval results.
We tested our approach on Hebrew historical document images from the Cairo Genizah collection and obtained impressive results.
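The indexing pipeline outlined in the abstract (overlapping patches, one bag-of-visual-words histogram per patch, several hash tables) can be illustrated with a minimal sketch. This is not the authors' implementation. All names (`sliding_patches`, `bovw_histogram`, `HyperplaneLSH`) and parameters (patch sizes, overlap, number of tables and bits) are hypothetical, keypoints are assumed to be already quantized to visual-word ids, and plain random-hyperplane locality-sensitive hashing is substituted for the kernelized scheme to keep the example short.

```python
import random
from collections import defaultdict


def sliding_patches(width, height, sizes, overlap=0.5):
    """Yield (x, y, w, h) for overlapping patches of every requested size."""
    for w, h in sizes:
        # The step shrinks as overlap grows; it is never allowed to reach zero.
        step_x = max(1, int(w * (1 - overlap)))
        step_y = max(1, int(h * (1 - overlap)))
        for y in range(0, height - h + 1, step_y):
            for x in range(0, width - w + 1, step_x):
                yield (x, y, w, h)


def bovw_histogram(keypoints, patch, vocab_size):
    """L1-normalized visual-word histogram of the keypoints inside a patch.

    keypoints is a list of (x, y, word_id) tuples; the quantization of local
    descriptors to word ids is assumed to have been done beforehand.
    """
    x0, y0, w, h = patch
    hist = [0.0] * vocab_size
    for x, y, word in keypoints:
        if x0 <= x < x0 + w and y0 <= y < y0 + h:
            hist[word] += 1.0
    total = sum(hist)
    return [v / total for v in hist] if total else hist


class HyperplaneLSH:
    """Several hash tables keyed by the sign pattern of random projections.

    Assumption: this is ordinary random-hyperplane LSH, used here only as a
    simple stand-in for kernelized locality-sensitive hashing.
    """

    def __init__(self, dim, n_tables=4, n_bits=8, seed=0):
        rng = random.Random(seed)
        self.planes = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
            for _ in range(n_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def _key(self, table_idx, vec):
        # One bit per hyperplane: which side of the plane the vector lies on.
        return tuple(
            sum(p * v for p, v in zip(plane, vec)) >= 0.0
            for plane in self.planes[table_idx]
        )

    def add(self, item_id, vec):
        for t, table in enumerate(self.tables):
            table[self._key(t, vec)].append(item_id)

    def candidates(self, vec):
        """Return ids that share a bucket with vec in at least one table."""
        found = set()
        for t, table in enumerate(self.tables):
            found.update(table.get(self._key(t, vec), []))
        return found
```

In such a sketch, every patch histogram would be added to the index once, and a query histogram would then be compared only against the ids returned by `candidates`, rather than against every patch in the collection.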
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Rabaev, I., Dinstein, I., El-Sana, J., Kedem, K. (2014). Segmentation-Free Keyword Retrieval in Historical Document Images. In: Campilho, A., Kamel, M. (eds) Image Analysis and Recognition. ICIAR 2014. Lecture Notes in Computer Science, vol 8814. Springer, Cham. https://doi.org/10.1007/978-3-319-11758-4_40
DOI: https://doi.org/10.1007/978-3-319-11758-4_40
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11757-7
Online ISBN: 978-3-319-11758-4
eBook Packages: Computer Science, Computer Science (R0)