Learning Semidefinite Regularizers

Abstract

Regularization techniques are widely employed in optimization-based approaches for solving ill-posed inverse problems in data analysis and scientific computing. These methods augment the objective with a penalty function, which is specified using prior domain-specific expertise to induce a desired structure in the solution. We consider the problem of learning suitable regularization functions from data in settings in which precise domain knowledge is not directly available. Previous work under the title of ‘dictionary learning’ or ‘sparse coding’ may be viewed as learning a regularization function that can be computed via linear programming. We describe generalizations of these methods to learn regularizers that can be computed and optimized via semidefinite programming. Our framework for learning such semidefinite regularizers is based on obtaining structured factorizations of data matrices, and our algorithmic approach for computing these factorizations combines recent techniques for rank minimization problems with an operator analog of Sinkhorn scaling. Under suitable conditions on the input data, our algorithm provides a locally linearly convergent method for identifying the correct regularizer that promotes the type of structure contained in the data. Our analysis is based on the stability properties of operator Sinkhorn scaling and their relation to geometric aspects of determinantal varieties (in particular, tangent spaces to these varieties). The regularizers obtained using our framework can be employed effectively in semidefinite programming relaxations for solving inverse problems.
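The “operator analog of Sinkhorn scaling” mentioned above generalizes classical Sinkhorn scaling of positive matrices. The sketch below is purely illustrative and is not the paper’s factorization algorithm: it shows the classical matrix iteration and, assuming generic square Kraus operators, a Gurvits-style operator scaling that normalizes a positive map Φ(X) = Σᵢ AᵢXAᵢᵀ so that Φ(I) = I and Φ*(I) = I. All function names here are hypothetical.

```python
import numpy as np

def sinkhorn_scale(A, num_iters=200):
    """Classical Sinkhorn scaling: alternately normalize the rows and
    columns of an entrywise-positive matrix; the iterates converge to
    a doubly stochastic matrix D1 @ A @ D2 (Sinkhorn, 1964)."""
    X = np.array(A, dtype=float)
    for _ in range(num_iters):
        X /= X.sum(axis=1, keepdims=True)  # row sums -> 1
        X /= X.sum(axis=0, keepdims=True)  # column sums -> 1
    return X

def _inv_sqrt_psd(S):
    """Inverse square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def operator_sinkhorn(ops, num_iters=200):
    """Operator analog: given Kraus operators {A_i} of the positive map
    Phi(X) = sum_i A_i @ X @ A_i.T, alternately rescale on the left and
    right so that Phi(I) = I and Phi^*(I) = I (a 'doubly stochastic'
    map); convergence holds for generic inputs (Gurvits, 2004)."""
    ops = [np.array(A, dtype=float) for A in ops]
    for _ in range(num_iters):
        L = _inv_sqrt_psd(sum(A @ A.T for A in ops))  # Phi(I)^{-1/2}
        ops = [L @ A for A in ops]                    # enforce Phi(I) = I
        R = _inv_sqrt_psd(sum(A.T @ A for A in ops))  # Phi^*(I)^{-1/2}
        ops = [A @ R for A in ops]                    # enforce Phi^*(I) = I
    return ops

# Sanity checks on random instances (illustrative only).
rng = np.random.default_rng(0)
S = sinkhorn_scale(rng.random((4, 4)) + 0.1)
print(np.allclose(S.sum(axis=0), 1), np.allclose(S.sum(axis=1), 1))

ops = operator_sinkhorn([rng.standard_normal((3, 3)) for _ in range(4)])
print(np.allclose(sum(A.T @ A for A in ops), np.eye(3)))             # exact by construction
print(np.allclose(sum(A @ A.T for A in ops), np.eye(3), atol=1e-6))  # approximate, from convergence
```

In the paper’s setting, the scaling acts on maps arising from the data factorization; the sketch only illustrates the normalization primitive itself.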

Keywords

Atomic norm · Convex optimization · Low-rank matrices · Nuclear norm · Operator scaling · Representation learning

Mathematics Subject Classification

Primary 90C25; Secondary 90C22, 15A24, 41A45, 52A41

Acknowledgements

The authors were supported in part by NSF CAREER award CCF-1350590, by Air Force Office of Scientific Research Grants FA9550-14-1-0098 and FA9550-16-1-0210, by a Sloan Research Fellowship, and by an A*STAR (Agency for Science, Technology and Research, Singapore) fellowship. The authors thank Joel Tropp for a helpful remark that improved the result in Proposition 17.


Copyright information

© SFoCM 2018

Authors and Affiliations

  1. Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, USA
  2. Department of Electrical Engineering, California Institute of Technology, Pasadena, USA
