## Abstract

Mixed membership factorization is a popular approach for analyzing data sets that have within-sample heterogeneity. In recent years, several algorithms have been developed for mixed membership matrix factorization, but they only guarantee estimates from a local optimum. Here, we derive a global optimization algorithm that provides a guaranteed *ε*-global optimum for a sparse mixed membership matrix factorization problem. We test the algorithm on simulated data and a small real gene expression dataset and find that the algorithm always bounds the global optimum across random initializations and explores multiple modes efficiently.

## References

Airoldi, E.M., Blei, D.M., Fienberg, S.E., Xing, E.P.: Mixed membership stochastic blockmodels. J. Mach. Learn. Res. **9**, 1981–2014 (2008)

Benders, J.F.: Partitioning procedures for solving mixed-variables programming problems. Numer. Math. **4**(1), 238–252 (1962)

Blei, D.M., Lafferty, J.D.: Correlated topic models. In: Proceedings of the International Conference on Machine Learning, pp. 113–120 (2006)

Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. **3**, 993–1022 (2003)

Blei, D.M., Kucukelbir, A., McAuliffe, J.D.: Variational inference: a review for statisticians. J. Am. Stat. Assoc. **112**(518), 859–877 (2017)

Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)

Dheeru, D., Karra T.E.: UCI Machine Learning Repository (2017). http://archive.ics.uci.edu/ml

Floudas, C.A.: Deterministic Global Optimization. Nonconvex Optimization and Its Applications, vol. 37. Springer, Boston (2000)

Floudas, C.A.: Deterministic Global Optimization: Theory, Methods and Applications, vol. 37. Springer, Berlin (2013)

Floudas, C.A., Gounaris, C.E.: A review of recent advances in global optimization. J. Glob. Optim. **45**, 3–38 (2008)

Floudas, C.A., Visweswaran, V.: A global optimization algorithm (GOP) for certain classes of nonconvex NLPs. Comput. Chem. Eng. **14**(12), 1–34 (1990)

Geoffrion, A.M.: Generalized Benders decomposition. J. Optim. Theory Appl. **10**, 237–260 (1972)

Gorski, J., Pfeuffer, F., Klamroth, K.: Biconvex sets and optimization with biconvex functions: a survey and extensions. Math. Methods Oper. Res. **66**, 373–407 (2007)

Gurobi Optimization, Inc.: Gurobi Optimizer version 8.0 (2018)

Horst, R., Tuy, H.: Global Optimization: Deterministic Approaches. Springer, Berlin (2013)

Kabán, A.: On Bayesian classification with Laplace priors. Pattern Recognit. Lett. **28**(10), 1271–1282 (2007)

Lancaster, P., Tismenetsky, M.: The Theory of Matrices: With Applications. Elsevier, San Diego (1985)

Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature **401**, 788–791 (1999)

MacKay, D.J.C.: Bayesian interpolation. Neural Comput. **4**(3), 415–447 (1992)

Mackey, L., Weiss, D., Jordan, M.I.: Mixed membership matrix factorization. In: International Conference on Machine Learning, pp. 1–8 (2010)

Pritchard, J.K., Stephens, M., Donnelly, P.: Inference of population structure using multilocus genotype data. Genetics **155**, 945–959 (2000)

Saddiki, H., McAuliffe, J., Flaherty, P.: GLAD: a mixed-membership model for heterogeneous tumor subtype classification. Bioinformatics **31**(2), 225–232 (2015)

Singh, A.P., Gordon, G.J.: A unified view of matrix factorization models. In: Lecture Notes in Computer Science, vol. 5212, pp. 358–373. Springer, Berlin (2008)

Taddy, M.: Multinomial inverse regression for text analysis. J. Am. Stat. Assoc. **108**(503), 755–770 (2013). https://doi.org/10.1080/01621459.2012.734168

Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Sharing clusters among related groups: hierarchical Dirichlet processes. In: Advances in Neural Information Processing Systems, vol. 1. MIT Press, Cambridge (2005)

Weinstein, J.N., Collisson, E.A., Mills, G.B., Shaw, K.R.M., Ozenberger, B.A., Ellrott, K., Shmulevich, I., Sander, C., Stuart, J.M., Cancer Genome Atlas Research Network, et al.: The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. **45**(10), 1113 (2013)

Xiao, H., Stibor, T.: Efficient collapsed Gibbs sampling for latent Dirichlet allocation. In: Sugiyama, M., Yang, Q. (eds.) Proceedings of the 2nd Asian Conference on Machine Learning, vol. 13, pp. 63–78 (2010)

Xu, W., Liu, X., Gong, Y.: Document clustering based on non-negative matrix factorization. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '03), p. 267 (2003)

Zaslavsky, T.: Facing Up to Arrangements: Face-Count Formulas for Partitions of Space by Hyperplanes, vol. 154. American Mathematical Society (1975)

## Acknowledgements

We acknowledge Hachem Saddiki for valuable discussions and comments on the manuscript.


## Appendices

### Appendix 1: Derivation of Relaxed Dual Problem Constraints

The Lagrange function is the sum of the Lagrange functions for each sample,

and the Lagrange function for a single sample is

We see that the Lagrange function is biconvex in *x* and *θ*_{i}. We develop the constraints for a single sample for the remainder.

### 1.1 Linearized Lagrange Function with Respect to *x*

Casting *x* as a vector and rewriting the Lagrange function gives

where \(\bar {x}\) is formed by stacking the columns of *x* in order. The coefficients are formed such that

The linear coefficient is the *KM* × 1 vector,

The quadratic coefficient is the *KM* × *KM* block matrix

The Taylor series approximation about *x*_{0} is

The gradient with respect to *x* is

Plugging the gradient into the Taylor series approximation gives

Simplifying the linearized Lagrange function gives

Finally, we write the linearized Lagrangian using the matrix form of *x*_{0},

While the original Lagrange function is convex in *θ*_{i} for a fixed *x*, the linearized Lagrange function is not necessarily convex in *θ*_{i}. This can be seen by collecting the quadratic, linear, and constant terms with respect to *θ*_{i},

Now, \(L(y_i, \theta _i, x, \lambda _i, \mu _i) \bigg |{ }^{\text{lin}}_{x_0}\) is convex if and only if \(2x_0^Tx - x_0^Tx_0 \succeq 0\). The condition is satisfied at *x* = *x*_{0} but may be violated at other values of *x*.

### 1.2 Linearized Lagrange Function with Respect to *θ*_{i}

Now, we linearize (7.18) with respect to *θ*_{i}. The Taylor series approximation about *θ*_{0i} gives

The gradient for this Taylor series approximation is

where *g*_{i}(*x*) is the vector of *K* qualifying constraints associated with the Lagrange function. The qualifying constraint is linear in *x*. Plugging the gradient into the approximation gives

The linearized Lagrange function is bilinear in *x* and *θ*_{i}. Finally, simplifying the linearized Lagrange function gives

### Appendix 2: Proof of Biconvexity

To prove the optimization problem is biconvex, first we show the feasible region over which we are optimizing is biconvex. Then, we show the objective function is biconvex by fixing *θ* and showing convexity with respect to *x*, and then vice versa.

### 1.1 The Constraints Form a Biconvex Feasible Region

Our constraints can be written as

The inequality constraint (7.25) is convex when either *x* or *θ* is fixed, because every norm is convex. The equality constraint (7.26) is affine, and it remains affine when either *x* or *θ* is fixed; every affine set is convex. The inequality constraint (7.27) is convex when either *x* or *θ* is fixed, because it is linear in *θ*.

### 1.2 The Objective Is Convex with Respect to *θ*

We prove the objective is a biconvex function using the following two theorems.

### Theorem 1

*Let \(A \subseteq {\mathbb {R}^n}\) be a convex open set and let \(f: A \rightarrow \mathbb {R}\) be twice differentiable. Write H(x) for the Hessian matrix of f at x ∈ A. If H(x) is positive semidefinite for all x ∈ A, then f is convex (Boyd and Vandenberghe 2004).*

### Theorem 2

*A symmetric matrix A is positive semidefinite (PSD) if and only if there exists a matrix B such that A = B^{T}B (Lancaster et al. 1985).*
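As a quick numerical illustration of the "if" direction of Theorem 2 (a sketch of ours, not part of the proof; the matrix `B` is an arbitrary example): for any real matrix *B*, the quadratic form *x*^{T}(*B*^{T}*B*)*x* equals ‖*Bx*‖², which is never negative, so *B*^{T}*B* is PSD.

```python
import random

rng = random.Random(1)
m, n = 3, 4
# an arbitrary real m-by-n matrix B; B^T B is then a symmetric n-by-n matrix
B = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(m)]

def quad_form(B, x):
    """Evaluate x^T (B^T B) x as the squared norm ||B x||^2."""
    Bx = [sum(B[i][j] * x[j] for j in range(len(x))) for i in range(len(B))]
    return sum(v * v for v in Bx)

# the quadratic form is nonnegative for every test vector, consistent
# with B^T B being positive semidefinite (Theorem 2)
vals = [quad_form(B, [rng.uniform(-5.0, 5.0) for _ in range(n)])
        for _ in range(1000)]
print(min(vals) >= 0.0)  # → True
```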

The objective of our problem is,

The objective function is the sum of the objective functions for each sample.

The gradient with respect to *θ*_{i} is

Taking the second derivative with respect to *θ*_{i} gives the Hessian matrix,

The Hessian matrix \(\nabla _{\theta _i}^2 f(y_i,x,\theta _i)\) is positive semidefinite by Theorem 2, so *f*(*y*_{i}, *x*, *θ*_{i}) is convex in *θ*_{i} by Theorem 1. The objective *f*(*y*, *x*, *θ*) is convex with respect to *θ* because the sum of convex functions, \(\sum _{i=1}^{N}f(y_i,x,\theta _i)\), is itself convex.

### 1.3 The Objective Is Convex with Respect to *x*

The objective function for sample *i* is

We cast *x* as a vector \(\bar {x}\), which is formed by stacking the columns of *x* in order. We rewrite the objective function as

The coefficients are formed such that

The linear coefficient is the *KM* × 1 vector

The quadratic coefficient is the *KM* × *KM* block matrix

The gradient with respect to \(\bar {x}\) is

Taking the second derivative gives the Hessian matrix,

The Hessian matrix \(\nabla _{\bar {x}}^2 f(y_i,\bar {x},\theta _i)\) is positive semidefinite by Theorem 2, so \(f(y_i,\bar {x},\theta _i)\) is convex in \(\bar {x}\) by Theorem 1. The objective *f*(*y*, *x*, *θ*) is convex with respect to *x* because the sum of convex functions, \(\sum _{i=1}^{N}f(y_i,x,\theta _i)\), is itself convex.

The objective is therefore biconvex in *x* and *θ*. Combined with the biconvexity of the feasible region, this shows that we have a biconvex optimization problem.
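For intuition, a one-dimensional illustration of the distinction (an example of ours, not from the chapter): the scalar function \(f(x,\theta) = (y - x\theta)^2\) is convex in *x* for fixed *θ* and convex in *θ* for fixed *x*, but not jointly convex:

```latex
f(x,\theta) = (y - x\theta)^2, \qquad
\nabla^2 f \,\big|_{x=\theta=0} =
\begin{pmatrix} 0 & -2y \\ -2y & 0 \end{pmatrix},
```

whose eigenvalues ±2*y* make the Hessian indefinite whenever *y* ≠ 0, even though each partial Hessian (2*θ*² and 2*x*²) is nonnegative.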

### Appendix 3: A-Star Search Algorithm

In this procedure, we first remove duplicate hyperplanes and hyperplanes with all-zero coefficients to obtain a set of unique hyperplanes. We then start from a specific region *r* and place it in an open set, which maintains the list of regions that still need to be explored. At each step, we pick one region from the open set and find its adjacent regions. Once its adjacent regions have been found, region *r* is moved to a closed set, which maintains the list of regions that have already been explored. Any newly discovered adjacent region is added to the open set for later exploration. When the open set is empty, the closed set contains all the unique regions, and the number of unique regions is the size of the closed set. The procedure thus begins from one region and expands to all neighbors until no new neighbor exists.

The overview of the A-star search algorithm to identify unique regions is shown in Algorithm 1.
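To make the open/closed-set bookkeeping concrete, here is a minimal sketch in Python. It is not the chapter's implementation: the interior-point feasibility test is replaced by a brute-force sampling oracle (`sample_regions`), adjacency is approximated as sign vectors differing in exactly one coordinate, and all names are ours.

```python
import random
from collections import deque

def sign_vector(point, hyperplanes):
    """Sign of a·x − b for each hyperplane, given as a pair (A, b)."""
    return tuple(1 if sum(a * x for a, x in zip(A, point)) > b else -1
                 for A, b in hyperplanes)

def sample_regions(hyperplanes, n_samples=50000, lo=-10.0, hi=10.0, seed=0):
    """Stand-in feasibility oracle: sample random points and collect the
    distinct sign vectors, i.e. the nonempty regions hit by sampling."""
    rng = random.Random(seed)
    dim = len(hyperplanes[0][0])
    regions = set()
    for _ in range(n_samples):
        p = [rng.uniform(lo, hi) for _ in range(dim)]
        regions.add(sign_vector(p, hyperplanes))
    return regions

def explore(start, regions):
    """Open/closed-set exploration: start from one region and expand to
    neighbors (sign vectors differing in one coordinate) until the open
    set is empty; the closed set then holds every reachable region."""
    open_set, closed = deque([start]), set()
    while open_set:
        r = open_set.popleft()
        if r in closed:
            continue
        closed.add(r)                                # r is fully explored
        for k in range(len(r)):                      # flip one hyperplane side
            nb = r[:k] + (-r[k],) + r[k + 1:]
            if nb in regions and nb not in closed:
                open_set.append(nb)
    return closed

# the two lines x = 0 and y = 0 cut the plane into 4 regions (quadrants)
hps = [((1.0, 0.0), 0.0), ((0.0, 1.0), 0.0)]
regs = sample_regions(hps)
found = explore(next(iter(regs)), regs)
print(len(found))  # → 4
```

Starting from any quadrant, the exploration reaches all four regions of this arrangement through single-sign flips.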

### Algorithm 1 A-star Search Algorithm

### 1.1 Hyperplane Filtering

Suppose two distinct hyperplanes *H*_{i} and *H*_{j} are represented by \(A_i=\left \{a_{i,0},\ldots ,a_{i,MK}\right \}\) and \(A_j=\left \{a_{j,0},\ldots ,a_{j,MK}\right \}\). We consider the two hyperplanes duplicates when

This can be converted to

where the threshold *τ* is a small positive value.

We eliminate a hyperplane *H*_{i} represented by \(A_i=\left \{a_{i,0},\ldots ,a_{i,MK}\right \}\) from the hyperplane arrangement \({\mathcal {A}}\) if the coefficients of *A*_{i} are all zero,

The arrangement \({\mathcal {A}}^\prime \) is the reduced arrangement, and *A*^{′}*x* = *b* are the equations of the unique hyperplanes.

### 1.2 Interior Point Method

An interior point is found by solving the following optimization problem:
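The program itself did not survive extraction; a common formulation for this task (our hedged reconstruction, not necessarily the chapter's exact program, with \(\sigma_{r,k} \in \{-1, +1\}\) denoting the side of hyperplane *k* that defines region *r* and *s* a slack variable, both our notation) maximizes the minimum signed slack:

```latex
\begin{aligned}
\max_{x,\; s}\quad & s \\
\text{s.t.}\quad & \sigma_{r,k}\,\bigl(A'_k x - b_k\bigr) \ge s,
  \qquad k = 1, \ldots, |\mathcal{A}'|, \\
& 0 \le s \le 1.
\end{aligned}
```

Any solution with *s* > 0 lies strictly inside region *r*; the bound *s* ≤ 1 keeps the linear program bounded.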

### Algorithm 2 Interior Point Method (Component 1)

### Algorithm 3 Get Adjacent Regions (Component 2)

## Copyright information

© 2019 Springer Nature Switzerland AG

## About this chapter

### Cite this chapter

Zhang, F., Wang, C., Trapp, A.C., Flaherty, P. (2019). A Global Optimization Algorithm for Sparse Mixed Membership Matrix Factorization. In: Zhang, L., Chen, DG., Jiang, H., Li, G., Quan, H. (eds) Contemporary Biostatistics with Biopharmaceutical Applications. ICSA Book Series in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-030-15310-6_7


Print ISBN: 978-3-030-15309-0

Online ISBN: 978-3-030-15310-6
