Abstract
Conducting a topic model analysis requires estimating and tuning parameters that optimize the topic model, i.e., that make it fit the word sets of the documents as well as possible. This estimation and optimization of parameters is what the literature calls the learning of topic models. Learning topic models draws on theory from statistics and from linear algebra. Bayesian models developed in machine learning, such as Latent Dirichlet Allocation (LDA), and Non-negative Matrix Factorization (NMF), developed in linear algebra, have been proposed as topic models for similar purposes. Both are used today to analyze text sources. Finally, we address the evaluation and interpretation of topic models. We also describe three possible implementations of the analysis process: script programming with R, script programming with Python, and the interactive web application TopicExplorer.
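Learning an LDA model is often done with collapsed Gibbs sampling, which repeatedly resamples each token's topic from its full conditional distribution. The following is a minimal pure-Python sketch of this idea; the toy corpus, the topic number K, and the hyperparameters alpha and beta are illustrative assumptions, not values from the chapter:

```python
import random

def lda_gibbs(docs, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampler for LDA: resample each token's topic
    from its full conditional, keeping the count tables up to date."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    wid = {w: i for i, w in enumerate(vocab)}
    ndk = [[0] * K for _ in docs]        # doc-topic counts
    nkw = [[0] * V for _ in range(K)]    # topic-word counts
    nk = [0] * K                         # tokens per topic
    z = []                               # topic assignment per token
    for d, doc in enumerate(docs):       # random initialization
        zd = []
        for w in doc:
            t = rng.randrange(K)
            zd.append(t)
            ndk[d][t] += 1; nkw[t][wid[w]] += 1; nk[t] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]              # remove token from the counts
                ndk[d][t] -= 1; nkw[t][wid[w]] -= 1; nk[t] -= 1
                # unnormalized full conditional p(z_i = t | rest)
                p = [(ndk[d][t] + alpha) * (nkw[t][wid[w]] + beta)
                     / (nk[t] + V * beta) for t in range(K)]
                r = rng.random() * sum(p)
                t, acc = 0, p[0]
                while acc < r:
                    t += 1; acc += p[t]
                z[d][i] = t              # add token back under new topic
                ndk[d][t] += 1; nkw[t][wid[w]] += 1; nk[t] += 1
    # posterior mean estimates of topic-word and doc-topic distributions
    phi = [[(nkw[t][v] + beta) / (nk[t] + V * beta) for v in range(V)]
           for t in range(K)]
    theta = [[(ndk[d][t] + alpha) / (len(docs[d]) + K * alpha)
              for t in range(K)] for d in range(len(docs))]
    return vocab, phi, theta

# Hypothetical toy corpus: two sports documents, two finance documents.
docs = [["ball", "tor", "spiel"], ["tor", "ball", "ball"],
        ["bank", "geld", "zins"], ["geld", "bank", "geld"]]
vocab, phi, theta = lda_gibbs(docs, K=2)
```

In practice one would of course use an existing implementation such as the R package topicmodels or a Python library; the sketch only shows the mechanics the learning step automates.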
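For NMF, parameter learning means factoring the nonnegative term-document matrix V into two smaller nonnegative matrices W and H so that WH approximates V. A minimal pure-Python sketch of the multiplicative update rules of Lee and Seung (2001); the 4x4 toy matrix, the rank k=2, and the iteration count are illustrative assumptions:

```python
import random

def matmul(A, B):
    # naive matrix product for small list-of-lists matrices
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf(V, k, iters=500, seed=0):
    """Factor the nonnegative matrix V (m x n) into W (m x k) and H (k x n)
    with the multiplicative update rules of Lee and Seung (2001)."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    eps = 1e-9  # guard against division by zero
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        WtV = matmul(transpose(W), V)
        WtWH = matmul(matmul(transpose(W), W), H)
        H = [[H[i][j] * WtV[i][j] / (WtWH[i][j] + eps) for j in range(n)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        VHt = matmul(V, transpose(H))
        WHHt = matmul(W, matmul(H, transpose(H)))
        W = [[W[i][j] * VHt[i][j] / (WHHt[i][j] + eps) for j in range(k)]
             for i in range(m)]
    return W, H

# Hypothetical 4x4 term-document matrix with two clearly separated topics.
V = [[2, 1, 0, 0],
     [2, 1, 0, 0],
     [0, 0, 1, 2],
     [0, 0, 1, 2]]
W, H = nmf(V, k=2)
WH = matmul(W, H)
err = sum((V[i][j] - WH[i][j]) ** 2 for i in range(4) for j in range(4))
```

The rows of H then play the role of topics (term weights), and the rows of W give each document's topic weights; the multiplicative form of the updates keeps all entries nonnegative throughout.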
© 2018 Springer Fachmedien Wiesbaden GmbH, part of Springer Nature
Papilloud, C., Hinneburg, A. (2018). Durchführung von Topic-Modell-Analysen. In: Qualitative Textanalyse mit Topic-Modellen. Studienskripten zur Soziologie. Springer VS, Wiesbaden. https://doi.org/10.1007/978-3-658-21980-2_3
Print ISBN: 978-3-658-21979-6
Online ISBN: 978-3-658-21980-2