Word Clouds for Efficient Document Labeling

  • Christin Seifert
  • Eva Ulbrich
  • Michael Granitzer
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6926)


In text classification, the amount and quality of training data are crucial for the performance of the classifier. Training data is generated by human labelers, which is tedious and time-consuming work. We propose to use condensed representations of text documents instead of the full-text documents to reduce the labeling time for single documents. These condensed representations are key sentences and key phrases and can be generated in a fully unsupervised way. The key phrases are presented in a layout similar to a tag cloud. In a user study with 37 participants, we evaluated whether human labelers can label documents faster and equally accurately with these condensed representations. Our evaluation shows that the users labeled word clouds twice as fast as, and as accurately as, full-text documents. While further investigations for different classification tasks are necessary, this insight could potentially reduce costs for the labeling process of text documents.
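As a rough illustration of the idea, the sketch below extracts key terms from a document without any labeled data and maps their weights to font sizes, as a tag-cloud layout might. It is not the authors' method: the frequency-based scoring, the tiny stopword list, and the names `key_terms` and `font_sizes` are assumptions made for this example only.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on",
             "with", "by", "are", "be", "this", "that", "it", "as"}

def key_terms(text, top_k=5):
    """Return the top_k most frequent non-stopword terms with their counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return counts.most_common(top_k)

def font_sizes(terms, min_pt=10, max_pt=32):
    """Linearly map term counts to font sizes between min_pt and max_pt."""
    if not terms:
        return {}
    hi = max(c for _, c in terms)
    lo = min(c for _, c in terms)
    span = hi - lo or 1  # avoid division by zero when all counts are equal
    return {t: round(min_pt + (c - lo) / span * (max_pt - min_pt))
            for t, c in terms}

doc = ("The classifier needs training data. Labeling training data is slow. "
       "Word clouds may speed up labeling of training data.")
terms = key_terms(doc, top_k=3)
print(font_sizes(terms))
```

Raw term frequency is only the simplest unsupervised weighting. Any other unsupervised score could be passed to `font_sizes` in the same way.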


Text classification, visualization, user interface, word clouds, document labeling, document annotation





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Christin Seifert (1)
  • Eva Ulbrich (2)
  • Michael Granitzer (1, 2)
  1. University of Technology, Graz, Austria
  2. Know-Center, Graz, Austria
