Abstract
This article presents a unified theory for the analysis of components in discrete data, and compares the methods with techniques such as independent component analysis, non-negative matrix factorisation and latent Dirichlet allocation. The main families of algorithms discussed are a variational approximation, Gibbs sampling, and Rao-Blackwellised Gibbs sampling. Applications are presented for voting records from the United States Senate for 2003, and for the Reuters-21578 newswire collection.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Buntine, W., Jakulin, A. (2006). Discrete Component Analysis. In: Saunders, C., Grobelnik, M., Gunn, S., Shawe-Taylor, J. (eds) Subspace, Latent Structure and Feature Selection. SLSFS 2005. Lecture Notes in Computer Science, vol 3940. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11752790_1
DOI: https://doi.org/10.1007/11752790_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-34137-6
Online ISBN: 978-3-540-34138-3
eBook Packages: Computer Science (R0)