Matrix Factorization and Topic Modeling

Abstract

Most document collections are defined by document-term matrices in which the rows (or columns) are highly correlated with one another. These correlations can be leveraged to create a low-dimensional representation of the data, and this process is referred to as dimensionality reduction.
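
The abstract's claim can be made concrete in a few lines of code. Below is a minimal sketch (added here, not from the chapter) using scikit-learn's TruncatedSVD, which the bibliography references; the toy corpus and the choice of two latent components are illustrative assumptions.

    # Minimal sketch: low-dimensional representation of a document-term
    # matrix via truncated SVD (latent semantic analysis). Toy data only.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD

    corpus = [
        "cats chase mice",
        "dogs chase cats",
        "stocks fall as markets tumble",
        "markets rally as stocks rise",
    ]

    # Document-term matrix: one row per document, one column per term.
    X = CountVectorizer().fit_transform(corpus)

    # Reduce to k = 2 latent dimensions; correlated terms such as
    # "stocks" and "markets" load onto the same latent component.
    svd = TruncatedSVD(n_components=2, random_state=0)
    X_reduced = svd.fit_transform(X)
    print(X_reduced.shape)  # (4, 2)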

Notes

  1. Here, we are assuming a specific type of factorization, referred to as non-negative matrix factorization, because of its interpretability. Other factorizations might not obey these properties.
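
     To make the interpretability point concrete, here is a minimal sketch (an illustration added here, not code from the book) using scikit-learn's NMF on a toy document-term matrix; both factors come out non-negative, so each document can be read as an additive mixture of topics.

        # Minimal sketch: non-negative matrix factorization on a toy
        # document-term matrix; W (document-topic) and H (topic-term)
        # are non-negative, hence interpretable as additive mixtures.
        import numpy as np
        from sklearn.decomposition import NMF

        X = np.array([[2, 1, 0, 0],
                      [1, 2, 0, 0],
                      [0, 0, 1, 2],
                      [0, 0, 2, 1]], dtype=float)

        model = NMF(n_components=2, init="nndsvd", random_state=0)
        W = model.fit_transform(X)  # document-topic weights
        H = model.components_       # topic-term weights
        print(np.round(W @ H, 2))   # approximately reconstructs X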

  2. The factorization is unique up to multiplication of any particular column of P and the corresponding column of Q by −1.

  3. This solution is unique up to multiplication of any column of U, together with the corresponding column of V, by −1.
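
     A quick numerical check of the sign ambiguity in notes 2 and 3 (a sketch added here, not from the book): flipping the sign of a column of U together with the corresponding column of V leaves the reconstructed matrix unchanged.

        # Flipping matched column signs leaves U * diag(s) * V^T intact.
        import numpy as np

        rng = np.random.default_rng(0)
        A = rng.standard_normal((5, 4))
        U, s, Vt = np.linalg.svd(A, full_matrices=False)

        U2, Vt2 = U.copy(), Vt.copy()
        U2[:, 0] *= -1   # flip the first left singular vector...
        Vt2[0, :] *= -1  # ...and the matching right singular vector

        print(np.allclose(U @ np.diag(s) @ Vt, U2 @ np.diag(s) @ Vt2))  # True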

  4. In other words, the columns of P, the columns of Q, and the diagonal of Σ each sum to 1.

  5. The Dirichlet is selected because it is the posterior distribution of the multinomial parameters when the prior distribution of these parameters is a Dirichlet (although the parameters of the prior and posterior Dirichlet may be different). If we throw a loaded die repeatedly, with its faces showing various topics, the resulting observations are referred to as multinomial. In LDA, the selection of the latent components of the different tokens in a document is achieved by throwing such a die repeatedly. Formally, the Dirichlet distribution is a conjugate prior to the multinomial distribution. The use of conjugate priors is widespread in Bayesian statistics because of this property.
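
     The conjugacy can be verified numerically. Below is a minimal sketch (illustrative, not from the book): starting from a Dirichlet prior over three topics, the posterior after observing multinomial "die throws" is again a Dirichlet whose parameters are the prior parameters plus the observed counts.

        # Dirichlet-multinomial conjugacy: posterior = prior + counts.
        import numpy as np

        rng = np.random.default_rng(0)
        alpha = np.array([1.0, 1.0, 1.0])     # Dirichlet prior over 3 topics
        theta = rng.dirichlet(alpha)          # topic proportions (the loaded die)
        counts = rng.multinomial(100, theta)  # 100 throws: observed topic counts

        alpha_posterior = alpha + counts      # conjugate update
        print(alpha_posterior)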

  6. For a positive integer n, the value of Γ(n) is (n − 1)!. For a positive real value x, Γ(x) smoothly interpolates these factorial values and is defined by the integral Γ(x) = ∫₀^∞ y^(x−1) e^(−y) dy. More details of an exact definition and a specific functional form may be found at http://mathworld.wolfram.com/GammaFunction.html.
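
     A two-line check of this in Python (added for illustration):

        import math

        print(math.gamma(5), math.factorial(4))  # 24.0 24, i.e., Gamma(n) = (n-1)!
        print(math.gamma(4.5))                   # ~11.63, between Gamma(4)=6 and Gamma(5)=24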

  7. There does not seem to be a clear consensus on this issue. For the classification problem, slightly better results have been claimed in [519] for the linear kernel. On the other hand, the work in [88] shows that slightly better results are obtained with the Gaussian kernel with proper tuning. Theoretically, the latter claim seems better justified, because a linear kernel can be roughly simulated by a Gaussian kernel with a large bandwidth.
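
     The bandwidth argument can be illustrated with a small experiment (a sketch on synthetic data with illustrative parameters, not an experiment from [519] or [88]): in scikit-learn's RBF kernel, a small gamma corresponds to a large bandwidth, so the Gaussian kernel behaves nearly linearly.

        # An RBF kernel with a large bandwidth (small gamma) roughly
        # mimics a linear kernel; synthetic data, illustrative settings.
        from sklearn.datasets import make_classification
        from sklearn.model_selection import cross_val_score
        from sklearn.svm import SVC

        X, y = make_classification(n_samples=200, n_features=20, random_state=0)

        linear = SVC(kernel="linear")
        rbf_wide = SVC(kernel="rbf", gamma=1e-4)  # large bandwidth

        print(cross_val_score(linear, X, y, cv=5).mean())
        print(cross_val_score(rbf_wide, X, y, cv=5).mean())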

  8. For simplicity, we are including stop words in the 2-grams.
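
     For instance, with scikit-learn's CountVectorizer (an illustration added here, not from the book), keeping stop words yields 2-grams such as "on the":

        from sklearn.feature_extraction.text import CountVectorizer

        vec = CountVectorizer(ngram_range=(2, 2))  # 2-grams; stop words kept by default
        vec.fit(["the cat sat on the mat"])
        print(vec.get_feature_names_out())
        # ['cat sat' 'on the' 'sat on' 'the cat' 'the mat']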

Bibliography

  1. C. Aggarwal. On the effects of dimensionality reduction on high dimensional similarity search. ACM PODS Conference, pp. 256–266, 2001.

  2. C. Aggarwal and S. Sathe. Outlier ensembles: An introduction. Springer, 2017.

  3. C. Aggarwal and C. Zhai. Mining text data. Springer, 2012.

  4. A. Asuncion, M. Welling, P. Smyth, and Y. Teh. On smoothing and inference for topic models. Uncertainty in Artificial Intelligence, pp. 27–34, 2009.

  5. D. Bertsekas. Nonlinear programming. Athena Scientific, 1999.

  6. D. Blei. Probabilistic topic models. Communications of the ACM, 55(4), pp. 77–84, 2012.

  7. D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3, pp. 993–1022, 2003.

  8. D. Blei and J. Lafferty. Dynamic topic models. ICML Conference, pp. 113–120, 2006.

  9. R. Bunescu and R. Mooney. Subsequence kernels for relation extraction. NIPS Conference, pp. 171–178, 2005.

  10. Y. Chang, C. Hsieh, K. Chang, M. Ringgaard, and C. J. Lin. Training and testing low-degree polynomial data mappings via linear SVM. Journal of Machine Learning Research, 11, pp. 1471–1490, 2010.

  11. C. Ding, T. Li, and M. Jordan. Convex and semi-nonnegative matrix factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1), pp. 45–55, 2010.

  12. C. Ding, T. Li, and W. Peng. On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Computational Statistics and Data Analysis, 52(8), pp. 3913–3927, 2008.

  13. C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix t-factorizations for clustering. ACM KDD Conference, pp. 126–135, 2006.

  14. S. Dumais. Latent semantic indexing (LSI) and TREC-2. Text Retrieval Conference (TREC), pp. 105–115, 1993.

  15. S. Dumais. Latent semantic indexing (LSI): TREC-3 report. Text Retrieval Conference (TREC), pp. 219–230, 1995.

  16. S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), pp. 391–407, 1990.

  17. C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3), pp. 211–218, 1936.

  18. T. Gärtner. A survey of kernels for structured data. ACM SIGKDD Explorations Newsletter, 5(1), pp. 49–58, 2003.

  19. E. Gaussier and C. Goutte. Relation between PLSA and NMF and implications. ACM SIGIR Conference, pp. 601–602, 2005.

  20. M. Girolami and A. Kabán. On an equivalence between PLSI and LDA. ACM SIGIR Conference, pp. 433–434, 2003.

  21. G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786), pp. 504–507, 2006.

  22. T. Hofmann. Probabilistic latent semantic indexing. ACM SIGIR Conference, pp. 50–57, 1999.

  23. T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 41(1–2), pp. 177–196, 2001.

  24. K. Hornik and B. Grün. topicmodels: An R package for fitting topic models. Journal of Statistical Software, 40(13), pp. 1–30, 2011.

  25. A. Karatzoglou, A. Smola, K. Hornik, and A. Zeileis. kernlab – An S4 package for kernel methods in R. Journal of Statistical Software, 11(9), 2004. http://epub.wu.ac.at/1048/1/document.pdf http://CRAN.R-project.org/package=kernlab

  26. A. Langville, C. Meyer, R. Albright, J. Cox, and D. Duling. Initializations for the nonnegative matrix factorization. ACM KDD Conference, pp. 23–26, 2006.

  27. Q. Le and T. Mikolov. Distributed representations of sentences and documents. ICML Conference, pp. 1188–1196, 2014.

  28. D. Lee and H. Seung. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, pp. 556–562, 2001.

  29. D. Lee and H. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755), pp. 788–791, 1999.

  30. C. Lin. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10), pp. 2756–2779, 2007.

  31. H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2, pp. 419–444, 2002.

  32. U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4), pp. 395–416, 2007.

  33. D. Metzler, S. Dumais, and C. Meek. Similarity measures for short segments of text. European Conference on Information Retrieval, pp. 16–27, 2007.

  34. T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv:1301.3781, 2013. https://arxiv.org/abs/1301.3781

  35. J. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using multilocus genotype data. Genetics, 155(2), pp. 945–959, 2000.

  36. R. Rehurek and P. Sojka. Software framework for topic modelling with large corpora. LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50, 2010. https://radimrehurek.com/gensim/index.html

  37. S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), pp. 2323–2326, 2000.

  38. M. Sahami and T. D. Heilman. A Web-based kernel function for measuring the similarity of short text snippets. WWW Conference, pp. 377–386, 2006.

  39. B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), pp. 1299–1319, 1998.

  40. G. Strang. An introduction to linear algebra. Wellesley Cambridge Press, 2009.

  41. J. Tenenbaum, V. De Silva, and J. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), pp. 2319–2323, 2000.

  42. H. Wallach, D. Mimno, and A. McCallum. Rethinking LDA: Why priors matter. NIPS Conference, pp. 1973–1981, 2009.

  43. X. Wei and W. B. Croft. LDA-based document models for ad-hoc retrieval. ACM SIGIR Conference, pp. 178–185, 2006.

  44. C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. NIPS Conference, 2000.

  45. Y. Yang and X. Liu. A re-examination of text categorization methods. ACM SIGIR Conference, pp. 42–49, 1999.

  46. http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

  47. https://cran.r-project.org/web/packages/lsa/index.html

  48. http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

  49. http://weka.sourceforge.net/doc.stable/weka/attributeSelection/LatentSemanticAnalysis.html

  50. http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html

  51. http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html

  52. https://cran.r-project.org/

  53. http://www.cs.princeton.edu/~blei/lda-c/

  54. http://scikit-learn.org/stable/modules/manifold.html

  55. https://code.google.com/archive/p/word2vec/

  56. https://www.tensorflow.org/tutorials/word2vec/

  57. http://www.netlib.org/svdpack

  58. http://scikit-learn.org/stable/modules/kernel_approximation.html

  59. http://mallet.cs.umass.edu/

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter

Cite this chapter

Aggarwal, C.C. (2018). Matrix Factorization and Topic Modeling. In: Machine Learning for Text. Springer, Cham. https://doi.org/10.1007/978-3-319-73531-3_3

  • DOI: https://doi.org/10.1007/978-3-319-73531-3_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73530-6

  • Online ISBN: 978-3-319-73531-3

  • eBook Packages: Computer Science, Computer Science (R0)
