Lambert Matrix Factorization

  • Arto Klami
  • Jarkko Lagus
  • Joseph Sakaya
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11052)


Many data-generating processes result in skewed data, which should be modeled by distributions that can capture the skewness. In this work we adopt the flexible family of Lambert W distributions, which combine an arbitrary standard distribution with a specific nonlinear transformation to incorporate skewness. We describe how Lambert W distributions can be used in probabilistic programs by providing stable gradient-based inference, and demonstrate their use in matrix factorization. In particular, we focus on modeling logarithmically transformed count data. We analyze the weighted squared loss used by state-of-the-art word embedding models to learn interpretable representations from word co-occurrences, and show that a generative model capturing the essential properties of those models can be built using Lambert W distributions.
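The nonlinear transformation underlying this family (following Goerg's construction of Lambert W distributions) maps a standard variate u to z = u·exp(γu), which skews the base distribution; the inverse is recovered with the Lambert W function, z ↦ W(γz)/γ. A minimal sketch of this forward/inverse pair, assuming a standard Gaussian base distribution and using SciPy's `lambertw`:

```python
import numpy as np
from scipy.special import lambertw

def skew_transform(u, gamma):
    """Goerg-style skewing transform: z = u * exp(gamma * u)."""
    return u * np.exp(gamma * u)

def inverse_skew_transform(z, gamma):
    """Invert via the principal branch of the Lambert W function:
    u = W(gamma * z) / gamma (valid where gamma * u >= -1)."""
    if gamma == 0.0:
        return z
    return np.real(lambertw(gamma * z)) / gamma

# Draw Gaussian samples and skew them to the right with gamma > 0.
rng = np.random.default_rng(0)
u = rng.standard_normal(10_000)
y = skew_transform(u, gamma=0.2)          # right-skewed samples
u_back = inverse_skew_transform(y, gamma=0.2)  # round-trip recovery
```

The transform is monotone on the region where γu ≥ −1, which is why the inverse (and hence a tractable change-of-variables density for gradient-based inference) is available in closed form through W.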


Keywords: Skewed data · Matrix factorization · Lambert distribution



The project was supported by the Academy of Finland (grants 266969 and 313125) and Tekes (Scalable Probabilistic Analytics).



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Computer Science, University of Helsinki, Helsinki, Finland
