Abstract
In this paper, we introduce image-based word clouds as a novel tool for quick and aesthetic overviews of common words in collections of digitized text manuscripts. While OCR can be used to enable summaries and search functionality for modern printed text, historical and handwritten documents remain a challenge. By segmenting and counting word images, without applying manual transcription or OCR, we have developed a method that can produce word or tag clouds from document collections. Our new tool is not limited to any specific kind of text. We make further contributions in stop-word removal, class-based feature weighting, and visualization. The evaluation of the proposed tool includes comparisons with ground-truth word clouds on handwritten marriage licenses from the 17th century and on the George Washington database of handwritten letters from the 18th century. Our experiments show that image-based word clouds capture the same information, albeit approximately, as regular word clouds based on text data.
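The idea of counting word images without transcription can be illustrated with a minimal sketch. It assumes each segmented word image has already been reduced to a numeric feature vector, and it uses a simple greedy distance-threshold grouping as a stand-in; the abstract does not specify the clustering or features used, so the function names, the threshold, and the linear font scaling below are all hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def cluster_word_images(features, threshold):
    """Greedy grouping (illustrative stand-in, not the paper's method).

    Each feature vector joins the first cluster whose representative
    (the cluster's first member) lies within `threshold`; otherwise it
    starts a new cluster. Returns a list of index lists.
    """
    representatives = []  # one feature vector per cluster
    members = []          # indices of the word images in each cluster
    for i, feature in enumerate(features):
        for c, rep in enumerate(representatives):
            if dist(feature, rep) <= threshold:
                members[c].append(i)
                break
        else:
            representatives.append(feature)
            members.append([i])
    return members


def cloud_sizes(members, min_pt=10, max_pt=40):
    """Map cluster counts linearly to font sizes.

    Returns (representative index, count, font size) per cluster, so the
    representative word image can be drawn at that size in the cloud.
    """
    if not members:
        return []
    counts = [len(m) for m in members]
    lo, hi = min(counts), max(counts)
    span = max(hi - lo, 1)  # avoid division by zero when all counts match
    return [
        (m[0], len(m), min_pt + (max_pt - min_pt) * (len(m) - lo) / span)
        for m in members
    ]
```

In this sketch the word cloud is built from images alone: the most frequent cluster contributes its representative word image at the largest size, and no text label is ever needed.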
Acknowledgments
This project is a part of q2b, From quill to bytes, a framework program sponsored by the Swedish Research Council (Dnr 2012-5743) and Uppsala University. The work is done in part as a collaboration with the Swedish Museum of Natural History (Naturhistoriska riksmuseet). We would also like to thank Alicia Fornés and the Document Analysis group of the Computer Vision Center at Universitat Autònoma de Barcelona for access to the BH2M dataset.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Wilkinson, T., Brun, A. (2015). Visualizing Document Image Collections Using Image-Based Word Clouds. In: Bebis, G., et al. Advances in Visual Computing. ISVC 2015. Lecture Notes in Computer Science, vol 9474. Springer, Cham. https://doi.org/10.1007/978-3-319-27857-5_27
DOI: https://doi.org/10.1007/978-3-319-27857-5_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27856-8
Online ISBN: 978-3-319-27857-5
eBook Packages: Computer Science, Computer Science (R0)