Abstract
We propose a novel method for estimating the number of topics in a document set using consensus clustering based on Non-negative Matrix Factorization (NMF). Estimating this number automatically is useful because many topic-extraction approaches rely on heuristics to determine it. Consensus clustering combines the results of multiple clustering runs into a consensus, which yields a robust clustering, and the number of clusters in that clustering is taken as the optimized number. Specifically, we propose a consensus soft clustering algorithm based on NMF and estimate an optimized number of topics by searching for a robust classification of documents into the topics obtained.
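The general idea can be illustrated with a minimal sketch, which is not the algorithm of the paper. It assumes a term-by-document matrix V, the multiplicative-update NMF of Lee and Seung, soft memberships obtained by normalizing each document's topic weights, and a simple dispersion score as the robustness measure. The function names, the synthetic data, the candidate range, and the score itself are all assumptions made for illustration.

```python
import numpy as np

def nmf(V, k, n_iter=200, seed=0, eps=1e-9):
    """Multiplicative-update NMF (Lee & Seung), minimizing ||V - WH||_F^2."""
    rng = np.random.default_rng(seed)
    n_terms, n_docs = V.shape
    W = rng.random((n_terms, k)) + eps
    H = rng.random((k, n_docs)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def soft_membership(H):
    """Scale each document's topic weights (a column of H) to sum to about 1."""
    return H / (H.sum(axis=0, keepdims=True) + 1e-12)

def consensus_matrix(V, k, n_runs=20):
    """Average the soft co-membership matrix P^T P over random restarts."""
    n_docs = V.shape[1]
    C = np.zeros((n_docs, n_docs))
    for run in range(n_runs):
        _, H = nmf(V, k, seed=run)
        P = soft_membership(H)
        C += P.T @ P
    return C / n_runs

def dispersion(C):
    """Assumed robustness score: near 1 when entries of C are close to 0 or 1."""
    return float(np.mean(4.0 * (C - 0.5) ** 2))

# Synthetic term-document matrix with three block-structured topics (illustrative only).
rng = np.random.default_rng(42)
n_topics, docs_per_topic, terms_per_topic = 3, 10, 8
V = 0.1 * rng.random((n_topics * terms_per_topic, n_topics * docs_per_topic))
for t in range(n_topics):
    rows = slice(t * terms_per_topic, (t + 1) * terms_per_topic)
    cols = slice(t * docs_per_topic, (t + 1) * docs_per_topic)
    V[rows, cols] += rng.random((terms_per_topic, docs_per_topic)) + 0.5

# Score each candidate topic count; k = 1 is excluded because it is trivially stable.
scores = {k: dispersion(consensus_matrix(V, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The candidate with the highest score is taken as the estimate. Whether this recovers the true number of topics depends on the data and on the chosen robustness measure, so the result should be checked against other evidence.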
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yokoi, T. (2010). Topic Number Estimation by Consensus Soft Clustering with NMF. In: Kim, Th., Lee, Yh., Kang, BH., Ślęzak, D. (eds) Future Generation Information Technology. FGIT 2010. Lecture Notes in Computer Science, vol 6485. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17569-5_9
DOI: https://doi.org/10.1007/978-3-642-17569-5_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-17568-8
Online ISBN: 978-3-642-17569-5
eBook Packages: Computer Science, Computer Science (R0)