Visualizing Document Image Collections Using Image-Based Word Clouds

  • Conference paper
Advances in Visual Computing (ISVC 2015)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 9474)

Abstract

In this paper, we introduce image-based word clouds as a novel tool for quick and aesthetic overviews of common words in collections of digitized text manuscripts. While OCR can be used to enable summaries and search functionality for printed modern text, historical and handwritten documents remain a challenge. By segmenting and counting word images, without applying manual transcription or OCR, we have developed a method that can produce word or tag clouds from document collections. Our new tool is not limited to any specific kind of text. We make further contributions in stop-word removal, class-based feature weighting, and visualization. An evaluation of the proposed tool includes comparisons with ground truth word clouds on handwritten marriage licenses from the 17th century and the George Washington database of handwritten letters from the 18th century. Our experiments show that image-based word clouds capture the same information, albeit approximately, as regular word clouds based on text data.
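
The counting idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no algorithmic details, so the greedy threshold clustering, the two-dimensional toy descriptors, the function names `cluster_word_images` and `cloud_sizes`, and the `drop_top` stand-in for stop-word removal are all assumptions made for illustration. The sketch assumes each segmented word image has already been turned into a numeric descriptor, groups descriptors that lie close together, and maps each group's count to a font size.

```python
import math


def cluster_word_images(descriptors, threshold=0.5):
    """Greedy clustering (an assumed stand-in, not the paper's method).

    Each descriptor joins the first cluster whose representative lies within
    `threshold` (Euclidean distance); otherwise it starts a new cluster.
    Returns a list of index lists, one per cluster.
    """
    representatives = []  # first descriptor seen in each cluster
    members = []          # indices of word images in each cluster
    for idx, vec in enumerate(descriptors):
        for cluster_id, rep in enumerate(representatives):
            if math.dist(vec, rep) <= threshold:
                members[cluster_id].append(idx)
                break
        else:
            representatives.append(vec)
            members.append([idx])
    return members


def cloud_sizes(members, min_size=10, max_size=60, drop_top=0):
    """Map cluster counts to font sizes for a word cloud.

    `drop_top` removes the most frequent clusters, a crude hypothetical
    proxy for stop-word removal. Returns tuples of
    (index of an exemplar word image, count, font size).
    """
    ranked = sorted(members, key=len, reverse=True)[drop_top:]
    if not ranked:
        return []
    largest = len(ranked[0])
    return [
        (cluster[0], len(cluster),
         min_size + (max_size - min_size) * len(cluster) / largest)
        for cluster in ranked
    ]


# Toy descriptors standing in for features of six segmented word images.
descriptors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0),
               (0.0, 0.1), (5.1, 5.0), (9.0, 9.0)]
members = cluster_word_images(descriptors)
sizes = cloud_sizes(members)
```

In such a sketch, the exemplar index would select which word image to draw in the cloud, scaled by the computed size, so no transcription of the word is needed at any point.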

Acknowledgments

This project is a part of q2b, From quill to bytes, a framework program sponsored by the Swedish Research Council (Dnr 2012-5743) and Uppsala University. The work was done in part in collaboration with the Swedish Museum of Natural History (Naturhistoriska riksmuseet). We would also like to thank Alicia Fornés and the Document Analysis group of the Computer Vision Center at Universitat Autònoma de Barcelona for access to the BH2M dataset.

Corresponding author

Correspondence to Tomas Wilkinson.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Wilkinson, T., Brun, A. (2015). Visualizing Document Image Collections Using Image-Based Word Clouds. In: Bebis, G., et al. Advances in Visual Computing. ISVC 2015. Lecture Notes in Computer Science, vol 9474. Springer, Cham. https://doi.org/10.1007/978-3-319-27857-5_27

  • DOI: https://doi.org/10.1007/978-3-319-27857-5_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27856-8

  • Online ISBN: 978-3-319-27857-5

  • eBook Packages: Computer Science, Computer Science (R0)
