How Many Topics? Stability Analysis for Topic Models

Greene, Derek; O’Callaghan, Derek; Cunningham, Pádraig

doi:10.1007/978-3-662-44848-9_32

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8724)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

Topic modeling refers to the task of discovering the underlying thematic structure in a text corpus, where the output is commonly presented as a report of the top terms appearing in each topic. Despite the diversity of topic modeling algorithms that have been proposed, a common challenge in successfully applying these techniques is the selection of an appropriate number of topics for a given corpus. Choosing too few topics will produce results that are overly broad, while choosing too many will result in the “over-clustering” of a corpus into many small, highly similar topics. In this paper, we propose a term-centric stability analysis strategy to address this issue, the idea being that a model with an appropriate number of topics will be more robust to perturbations in the data. Using a topic modeling approach based on matrix factorization, evaluations performed on a range of corpora show that this strategy can successfully guide the model selection process.
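The abstract does not spell out the procedure, so the following is only a minimal sketch of what a term-centric stability score could look like, not the authors' implementation. It assumes that each topic is summarized by a list of its top terms, that two topics are compared with Jaccard similarity, that topics from two models are paired by the best one-to-one matching (brute force here, so small k only), and that the data is perturbed by random subsampling. The function names, the `fit_top_terms` callable, and all parameter defaults are hypothetical.

```python
from itertools import permutations
import random


def jaccard(a, b):
    """Jaccard similarity between two collections of terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def agreement(topics_x, topics_y):
    """Mean Jaccard similarity under the best one-to-one topic matching.

    Assumption: brute force over all permutations, so only practical
    for a small number of topics.
    """
    k = len(topics_x)
    if k != len(topics_y):
        raise ValueError("both models must have the same number of topics")
    best = 0.0
    for perm in permutations(range(k)):
        score = sum(jaccard(topics_x[i], topics_y[j])
                    for i, j in enumerate(perm)) / k
        best = max(best, score)
    return best


def stability(docs, k, fit_top_terms, n_samples=10, sample_ratio=0.8, seed=0):
    """Mean agreement between a reference model and subsampled models.

    fit_top_terms(docs, k) is a hypothetical callable that trains any
    topic model and returns k lists of top terms, one list per topic.
    """
    rng = random.Random(seed)
    reference = fit_top_terms(docs, k)
    n = max(1, int(sample_ratio * len(docs)))
    scores = []
    for _ in range(n_samples):
        sample = rng.sample(docs, n)  # perturb the data by subsampling
        scores.append(agreement(reference, fit_top_terms(sample, k)))
    return sum(scores) / len(scores)
```

Under these assumptions, one would compute `stability(docs, k, fit_top_terms)` for each candidate k and prefer the values of k with the highest scores, since a score near 1 means the top terms barely change when the corpus is perturbed.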

Author information

Authors and Affiliations

School of Computer Science & Informatics, University College Dublin, Republic of Ireland
Derek Greene, Derek O’Callaghan & Pádraig Cunningham

Editor information

Editors and Affiliations

Faculty of Applied Sciences, Department of Computer and Decision Engineering, Université Libre de Bruxelles, Av. F. Roosevelt, CP 165/15, 1050, Brussels, Belgium
Toon Calders
Dipartimento di Informatica, Università degli Studi “Aldo Moro”, via Orabona 4, 70125, Bari, Italy
Floriana Esposito
Department of Computer Science, Universität Paderborn, Warburger Str. 100, 33098, Paderborn, Germany
Eyke Hüllermeier
Dipartimento di Informatica, Università degli Studi di Torino, Corso Svizzera 185, 10149, Torino, Italy
Rosa Meo

About this paper

Cite this paper

Greene, D., O’Callaghan, D., Cunningham, P. (2014). How Many Topics? Stability Analysis for Topic Models. In: Calders, T., Esposito, F., Hüllermeier, E., Meo, R. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2014. Lecture Notes in Computer Science, vol 8724. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44848-9_32

DOI: https://doi.org/10.1007/978-3-662-44848-9_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44847-2
Online ISBN: 978-3-662-44848-9
eBook Packages: Computer Science, Computer Science (R0)
