Human Perception of Enriched Topic Models

  • Conference paper

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 320)

Abstract

Topic modeling algorithms such as LDA find topics, i.e. hidden structures, in document corpora in an unsupervised manner. Traditionally, applications of topic modeling over textual data use the bag-of-words model, i.e. they consider only the words in the documents. In our previous work we developed a framework for mining enriched topic models. We proposed a bag-of-features approach, in which a document consists not only of words but also of linked named entities and their related information, such as types or categories.
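To make the bag-of-features idea concrete, the following is a minimal sketch, not the authors' framework. The `LinkedEntity` structure, the `bag_of_features` function, its flags, and the DBpedia-style identifiers are all assumptions chosen for illustration; the abstract only states that a document consists of words plus linked named entities and their types or categories.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class LinkedEntity:
    """A named entity linked to a knowledge-base resource (hypothetical structure)."""
    uri: str
    types: List[str] = field(default_factory=list)
    categories: List[str] = field(default_factory=list)


def bag_of_features(
    words: Sequence[str],
    entities: Sequence[LinkedEntity],
    use_words: bool = True,
    use_types: bool = True,
    use_categories: bool = True,
) -> Counter:
    """Count the features of one document: words, entity URIs, types, categories."""
    features: Counter = Counter()
    if use_words:
        features.update(w.lower() for w in words)
    for entity in entities:
        features[entity.uri] += 1
        if use_types:
            features.update(entity.types)
        if use_categories:
            features.update(entity.categories)
    return features


# Illustrative document; the identifiers below are assumed example values.
doc_words = ["Berlin", "hosts", "the", "film", "festival"]
doc_entities = [
    LinkedEntity("dbr:Berlin", ["dbo:City"], ["dbc:Capitals_in_Europe"]),
]

hybrid = bag_of_features(doc_words, doc_entities)                        # words + resources
resource_only = bag_of_features(doc_words, doc_entities, use_words=False)  # resources only
```

Switching the flags on and off yields word-based, hybrid, and pure resource-based document representations, which is one plausible way to read the model variants compared in the paper.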

In this work we focused on the feature engineering and selection aspects of enriched topic modeling and evaluated the results using two measures of how understandable the estimated topics are to humans: model precision and topic log odds. In our 10-model experimental setup, with 7 pure resource-based, 2 hybrid word/resource-based and 1 word-based model, the traditional bag-of-words models were outperformed by 5 pure resource-based models on both measures. These results show that incorporating background knowledge into topic models makes them more understandable for humans.
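The two measures can be sketched as follows, assuming the definitions from Chang et al. (2009), "Reading tea leaves": model precision is the fraction of annotators who pick the true intruder word for a topic, and topic log odds is the mean log ratio between the document's probability for the true intruder topic and its probability for the topic each annotator selected. Function and parameter names are hypothetical, and the input values are made-up illustrations, not results from the paper.

```python
import math
from typing import Sequence


def model_precision(choices: Sequence[str], intruder: str) -> float:
    """Fraction of annotators whose chosen word is the true intruder."""
    if not choices:
        raise ValueError("need at least one annotator choice")
    return sum(choice == intruder for choice in choices) / len(choices)


def topic_log_odds(
    theta: Sequence[float],
    intruder_topic: int,
    chosen_topics: Sequence[int],
) -> float:
    """Mean of log(theta[intruder]) - log(theta[chosen]) over annotators.

    theta holds the document's estimated topic proportions (all > 0).
    The value is 0 when every annotator picks the true intruder topic
    and negative when annotators pick topics the model rates as more probable.
    """
    if not chosen_topics:
        raise ValueError("need at least one annotator choice")
    log_intruder = math.log(theta[intruder_topic])
    total = sum(log_intruder - math.log(theta[j]) for j in chosen_topics)
    return total / len(chosen_topics)


# Made-up annotator answers for one topic and one document.
mp = model_precision(["apple", "car", "apple", "apple"], intruder="apple")
tlo_perfect = topic_log_odds([0.6, 0.3, 0.1], intruder_topic=2, chosen_topics=[2, 2])
tlo_mixed = topic_log_odds([0.6, 0.3, 0.1], intruder_topic=2, chosen_topics=[2, 0])
```

Higher values are better for both measures; per-topic and per-document scores are typically aggregated (for example by mean or median) to compare models.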

Notes

  1. http://www.spiegel.de/international/
  2. http://mlg.ucd.ie/datasets/bbc.html
  3. https://catalog.ldc.upenn.edu/ldc2008t19
  4. http://news.bbc.co.uk/
  5. http://wiki.dbpedia.org/Downloads2015-10
  6. https://www.mturk.com/

Author information

Correspondence to Alexandru Todor.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Lukasiewicz, W., Todor, A., Paschke, A. (2018). Human Perception of Enriched Topic Models. In: Abramowicz, W., Paschke, A. (eds) Business Information Systems. BIS 2018. Lecture Notes in Business Information Processing, vol 320. Springer, Cham. https://doi.org/10.1007/978-3-319-93931-5_2

  • DOI: https://doi.org/10.1007/978-3-319-93931-5_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-93930-8

  • Online ISBN: 978-3-319-93931-5

  • eBook Packages: Computer Science, Computer Science (R0)
