Discrete Component Analysis

  • Conference paper
Subspace, Latent Structure and Feature Selection (SLSFS 2005)

Abstract

This article presents a unified theory for analysis of components in discrete data, and compares the methods with techniques such as independent component analysis, non-negative matrix factorisation and latent Dirichlet allocation. The main families of algorithms discussed are a variational approximation, Gibbs sampling, and Rao-Blackwellised Gibbs sampling. Applications are presented for voting records from the United States Senate for 2003, and for the Reuters-21578 newswire collection.

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Buntine, W., Jakulin, A. (2006). Discrete Component Analysis. In: Saunders, C., Grobelnik, M., Gunn, S., Shawe-Taylor, J. (eds) Subspace, Latent Structure and Feature Selection. SLSFS 2005. Lecture Notes in Computer Science, vol 3940. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11752790_1

  • DOI: https://doi.org/10.1007/11752790_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-34137-6

  • Online ISBN: 978-3-540-34138-3

  • eBook Packages: Computer Science (R0)
