# Fitting Penalized Logistic Regression Models Using QR Factorization


## Abstract

The paper presents an improvement of a commonly used learning algorithm for logistic regression. In the direct approach the Newton method requires inversion of the Hessian, which is cubic with respect to the number of attributes. We study the special case when the number of samples *m* is smaller than the number of attributes *n*, and we prove that, using a previously computed QR factorization of the data matrix, the Hessian inversion in each step can be performed significantly faster, namely in \(\mathcal {O}\left( m^3\right) \) or \(\mathcal {O}\left( m^2n\right) \) instead of the \(\mathcal {O}\left( n^3\right) \) of the ordinary Newton optimization. We show formally that it can be adopted very effectively for \(\ell ^2\) penalized logistic regression and also, less effectively but still competitively, for certain types of sparse penalty terms. This approach is especially interesting for a large number of attributes and a relatively small number of samples, which is the situation in so-called extreme learning. We present a comparison of our approach with commonly used learning tools.

## Keywords

Newton method · Logistic regression · Regularization · QR factorization

## 1 Introduction

We consider a binary classification task with *n* inputs and one output. Let \(\mathbf {X} \in \mathbb {R}^{m\times n}\) be a dense data matrix containing *m* data samples and *n* attributes, and let \(\varvec{y}_{m\times 1}\), \(y_i \in \{-1, +1\}\), be the corresponding targets. We consider the case \(m<n\). In what follows, bold capital letters \(\mathbf {X},\mathbf {Y},\ldots \) denote matrices, bold lower case letters \(\varvec{x},\varvec{w}\) stand for vectors, and normal lower case \(x_{ij}, y_i, \lambda \) for scalars. The paper concerns classification, but the presented approach can be easily adopted to the linear regression model.

We distinguish two classes of penalty functions \(P\):

- 1.
the rotationally invariant case, i.e. \(P(\varvec{w})=\frac{1}{2}\Vert \varvec{w}\Vert _2^2\),

- 2.
other (possibly non-convex) cases, including \(P(\varvec{w})=\frac{1}{q}\Vert \varvec{w}\Vert _q^{q}\).

The diagonal matrix \(\mathbf {D}\) has its *i*-th entry equal to \(\sigma (\varvec{x}_i,\varvec{w})\cdot (1-\sigma (\varvec{x}_i,\varvec{w}))\), and the vector \(\varvec{v}\) has entries \(v_i=y_i\cdot (\sigma (\varvec{x}_i,\varvec{w})-1)\).

The Hessian is the sum of the matrix \(\mathbf {E}\) (the second derivative of the penalty function multiplied by \(\lambda \)) and the matrix \(\mathbf {X}^T \mathbf {D} \mathbf {X}\). Depending on the penalty function *P*, the matrix \(\mathbf {E}\) may be: 1) scalar diagonal (\(\lambda \mathbf {I}\)), 2) non-scalar diagonal, 3) of a type other than diagonal. In this paper we investigate only cases 1) and 2).
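As a concrete illustration of these quantities, the gradient and Hessian of the \(\ell ^2\) penalized logistic loss (case 1) can be sketched in NumPy as follows; the function name and the exact sign conventions are ours, not taken from the paper's algorithm listings:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad_hess_l2(w, X, y, lam):
    """Gradient and Hessian of the l2-penalized logistic loss.

    Loss: sum_i log(1 + exp(-y_i x_i.w)) + (lam/2) ||w||^2, y_i in {-1,+1}.
    The Hessian is lam*I + X^T D X with D = diag(sigma_i * (1 - sigma_i)).
    """
    s = sigmoid(y * (X @ w))                 # sigma evaluated at the margins
    grad = -X.T @ (y * (1.0 - s)) + lam * w
    d = s * (1.0 - s)                        # diagonal of D
    hess = lam * np.eye(X.shape[1]) + X.T @ (d[:, None] * X)
    return grad, hess
```

A finite-difference check of the gradient is a cheap way to validate such derivations.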

**Related Works.** There are many approaches to learning the logistic regression model. Among them are direct second order procedures such as IRLS and Newton (with Hessian inversion using the linear conjugate gradient), and first order procedures, with the nonlinear conjugate gradient as the most representative example. A short review can be found in [14]. Another group of methods comprises second order procedures with Hessian approximation, such as L-BFGS [21], fixed Hessian, or truncated Newton [2, 13]. Some of these techniques are implemented in scikit-learn [17], which is the main environment for our experiments. QR factorization is a common technique for fitting the linear regression model [9, 15].

## 2 Procedure of Optimization with QR Decomposition

We factorize the data matrix \(\mathbf {X}\) using:

LQ factorization for \(m < n\),

QR factorization for \(m \geqslant n\).
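NumPy has no dedicated LQ routine, but for \(m<n\) the LQ factorization of \(\mathbf {X}\) can be obtained from the reduced QR factorization of \(\mathbf {X}^T\); a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 10                      # m < n: the LQ case
X = rng.normal(size=(m, n))

# LQ of X from the reduced QR of X^T:  X^T = Q R  =>  X = R^T Q^T = L Qhat
Q, R = np.linalg.qr(X.T)          # Q: n x m, orthonormal columns; R: m x m
L, Qhat = R.T, Q.T                # L: m x m lower triangular; Qhat: m x n

assert np.allclose(X, L @ Qhat)
assert np.allclose(Qhat @ Qhat.T, np.eye(m))   # Qhat has orthonormal rows
```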

### 2.1 The \(\ell ^2\) Penalty Case and Rotational Invariance

This approach is not new [8, 16]; however, the use of this trick does not seem to be common in machine learning tools.
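The trick itself can be sketched as follows: with the LQ factorization \(\mathbf {X} = \mathbf {L}\hat{\mathbf {Q}}\) and a rotationally invariant penalty, the minimizer lies in the row space of \(\mathbf {X}\), so one may substitute \(\varvec{w}=\hat{\mathbf {Q}}^T\varvec{\beta }\) and run Newton entirely in \(\mathbb {R}^m\). The code below is our illustrative reconstruction (plain undamped Newton, hypothetical function names), not the paper's Algorithm 1:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_l2_qr(X, y, lam, n_iter=30):
    """Newton iterations for the (lam/2)||w||^2-penalized logistic loss,
    run in the reduced m-dimensional space (illustrative sketch)."""
    m, n = X.shape
    Q, R = np.linalg.qr(X.T)           # X^T = Q R  =>  X = L Qhat
    L, Qhat = R.T, Q.T
    beta = np.zeros(m)
    for _ in range(n_iter):
        s = sigmoid(y * (L @ beta))    # note: X w = L beta
        grad = -L.T @ (y * (1.0 - s)) + lam * beta
        d = s * (1.0 - s)
        H = lam * np.eye(m) + L.T @ (d[:, None] * L)   # m x m Hessian
        beta -= np.linalg.solve(H, grad)               # O(m^3) per step
    return Qhat.T @ beta               # map back to R^n
```

At the reduced-space optimum the full-space gradient \(\mathbf {X}^T\varvec{v}+\lambda \varvec{w}\) vanishes as well, since it equals \(\hat{\mathbf {Q}}^T(\mathbf {L}^T\varvec{v}+\lambda \varvec{\beta })\).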

### 2.2 Rotational Variance

In the case of penalty functions whose Hessian \(\mathbf {E}\) is a non-scalar diagonal matrix, it is still possible to construct an algorithm which solves a smaller problem via QR factorization.

**Application to the Smooth \({\varvec{\ell }}^{\mathbf {1}}\) Approximation.** Every convex, twice continuously differentiable regularizer can be put in place of the ridge penalty, and the above procedure may be used to optimize such a problem. In this article we focus on the smoothly approximated \(\ell ^1\)-norm [12] obtained via the integral of the hyperbolic tangent function.
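A common concrete form of this approximation (our assumption for the role of the smoothing constant *a*, in the spirit of (16)) is \(|x|\approx \frac{1}{a}\log \cosh (ax)\), whose derivative is \(\tanh (ax)\):

```python
import numpy as np

def smooth_l1(w, a=10.0):
    """Smooth approximation of ||w||_1 as the integral of tanh:
    |x| ~ (1/a) * log(cosh(a*x)), which tends to |x| as a -> inf.
    log(cosh(t)) is computed stably as |t| + log1p(exp(-2|t|)) - log(2)."""
    t = a * np.abs(w)
    return np.sum((t + np.log1p(np.exp(-2.0 * t)) - np.log(2.0)) / a)

def smooth_l1_grad(w, a=10.0):
    return np.tanh(a * w)                     # derivative of the approximation

def smooth_l1_hess_diag(w, a=10.0):
    return a * (1.0 - np.tanh(a * w) ** 2)    # the non-scalar diagonal of E / lambda
```

Unlike the strict \(\ell ^1\)-norm, this surrogate is twice continuously differentiable everywhere, which is exactly what the Newton-type procedure above requires.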

**Application to the Strict \({\varvec{\ell }}^{\mathbf {1}}\) Penalty.** Fan and Li proposed a unified algorithm for the minimization problem (2) via local quadratic approximations [3]. Here we use the idea presented by Krishnapuram [11], in which a quadratic upper bound on the absolute value function is used.
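The bound in question is commonly written as \(|w| \le \frac{1}{2}\left( \frac{w^2}{|w^0|} + |w^0|\right) \), tight at \(w=w^0\); it follows from \((|w|-|w^0|)^2\ge 0\). A minimal numerical check, assuming this is the inequality meant here:

```python
import numpy as np

def l1_majorizer(w, w0):
    """Quadratic upper bound on |w| used in bound optimization:
    |w| <= 0.5 * (w**2 / |w0| + |w0|), with equality at w = w0."""
    return 0.5 * (w ** 2 / np.abs(w0) + np.abs(w0))

rng = np.random.default_rng(3)
w = rng.normal(size=1000)
w0 = rng.normal(size=1000)
assert np.all(l1_majorizer(w, w0) >= np.abs(w) - 1e-12)   # majorizes |w|
assert np.allclose(l1_majorizer(w0, w0), np.abs(w0))      # tight at w0
```

Replacing \(|w_j|\) by this quadratic makes the penalized subproblem smooth, so each outer step reduces to a (diagonally reweighted) ridge-type problem.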

where \(\varvec{x}_j\) and \(\hat{\varvec{q}}_j\) denote the *j*-th columns of the matrices \(\mathbf {X}\) and \(\hat{\mathbf {Q}}\), respectively. A similar concept is used when multiplying the matrix \(\mathbf {E}^{-1}\hat{\mathbf {Q}}^T\) by a vector, e.g. \(\varvec{z}\): the *j*-th element of the result equals \(e_{jj}^{-1}\hat{\varvec{q}}_j\cdot \varvec{z}\). We refer to this model as L1-QR.
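This column-wise trick avoids ever forming the \(n\times m\) matrix \(\mathbf {E}^{-1}\hat{\mathbf {Q}}^T\) explicitly; a small sketch with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 5, 12
Qhat = rng.normal(size=(m, n))
e = rng.uniform(0.5, 2.0, size=n)      # diagonal of E
z = rng.normal(size=m)

# (E^{-1} Qhat^T) z without materializing the n x m matrix:
fast = (Qhat.T @ z) / e                # j-th entry: e_jj^{-1} * (qhat_j . z)
full = np.diag(1.0 / e) @ Qhat.T @ z   # explicit reference computation
assert np.allclose(fast, full)
```

The fast variant costs \(\mathcal {O}(mn)\) operations and \(\mathcal {O}(n)\) extra memory.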

After obtaining the direction \(\varvec{d}\) we use a backtracking line search with the sufficient decrease condition given by Tseng and Yun [19], with one exception: if a unit step already gives sufficient decrease, we search for a bigger step to ensure faster convergence.
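A simplified Armijo-style sketch of such a search, including the step-enlargement exception; note that the actual condition of Tseng and Yun also accounts for the nonsmooth penalty term, which we omit here:

```python
import numpy as np

def line_search(f, x, d, g, c=1e-4, shrink=0.5, grow=2.0, max_tries=30):
    """Backtracking search for f along direction d with gradient g at x.
    Sufficient decrease: f(x + t*d) <= f(x) + c*t*(g.d).  If the unit
    step already satisfies it, the step is enlarged instead of shrunk."""
    f0, gd = f(x), float(g @ d)
    t = 1.0
    if f(x + d) <= f0 + c * gd:
        # unit step is acceptable: try progressively bigger steps
        while max_tries and f(x + grow * t * d) <= f0 + c * grow * t * gd:
            t *= grow
            max_tries -= 1
    else:
        # shrink until sufficient decrease holds
        while max_tries and f(x + t * d) > f0 + c * t * gd:
            t *= shrink
            max_tries -= 1
    return t

# On a quadratic, the full negative-gradient step from (1, 0) overshoots
# and is backtracked, while a short step gets enlarged.
f = lambda x: float(x @ x)
x0 = np.array([1.0, 0.0])
g0 = 2.0 * x0
t = line_search(f, x0, -g0, g0)
assert f(x0 - t * g0) < f(x0)
```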

**Application to the \({\varvec{\ell }}^{\mathbf{q}<{\mathbf {1}}}\) Penalty.** The idea described above can be directly applied to the \(\ell ^{q<1}\) “norms” [10]; we call this model Lq-QR. The cost function has the form of (2) with \(P(\varvec{w})=\frac{1}{q}\Vert \varvec{w}\Vert _q^{q}\).
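For reference, the \(\ell ^{q<1}\) penalty itself is straightforward to write down; the small example illustrates why it promotes sparsity more aggressively than \(\ell ^1\): a sparse and a dense vector with the same \(\ell ^1\) norm receive different penalties.

```python
import numpy as np

def lq_penalty(w, q=0.9):
    """Penalty (1/q) * sum_j |w_j|^q -- a "norm" only formally for q < 1,
    and non-convex, which is what yields the stronger sparsity."""
    return np.sum(np.abs(w) ** q) / q

w_sparse = np.array([1.0, 0.0])     # same l1 norm as w_dense
w_dense = np.array([0.5, 0.5])
assert lq_penalty(w_sparse, q=0.5) < lq_penalty(w_dense, q=0.5)
```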

## 3 Complexity of Proposed Methods

The cost of each iteration in the ordinary Newton method for logistic regression equals \(k\cdot \left( 2n^2+n\right) \), where *k* is the number of conjugate gradient iterations. In general \(k\le n\), so in the worst case its complexity is \(\mathcal {O}\left( n^3\right) \).

**Rotationally Invariant Case.** QR factorization is done once and its complexity is \(\mathcal {O}\left( 2m^2\cdot \left( n-\frac{m}{3}\right) \right) =\mathcal {O}\left( m^2n\right) \). Using the data transformed to the smaller space, each step of the Newton procedure is much cheaper and requires about \(km^2\) operations (the cost of solving a system of linear equations using the conjugate gradient, \(k\le m\)), which is \(\mathcal {O}\left( m^3\right) \) in general.

As shown in the experimental part, this approach dominates other optimization methods (especially exact second order procedures). Looking at the above estimates, it is clear that the presented approach is especially attractive when \(m \ll n\).

**Rotationally Variant Case.** In the second case the dominant operation is the computation of the matrix \(\mathbf {C}_1\) in Eq. (15). Given the dimensions of the matrices \(\hat{\mathbf {L}}_{m\times m}, \mathbf {X}_{m\times n}\) and \(\hat{\mathbf {Q}}_{m\times n}\), the complexity of computing \(\mathbf {C}_1\) is \(\mathcal {O}(m^2n)\); the cost of inverting the matrix \(\mathbf {C}_1\), \(\mathcal {O}(m^3)\), is less significant. In the case of the \(\ell ^1\) penalty, taking the sparsity of \(\varvec{w}\) into account reduces this complexity to \(\mathcal {O}(m^2\cdot \mathrm {\#nnz})\), where \(\mathrm {\#nnz}\) is the number of non-zero coefficients.
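Equation (15) is not reproduced here, but the underlying mechanism is the Sherman-Morrison-Woodbury identity, which replaces an \(n\times n\) solve by an \(m\times m\) one; a generic sketch with illustrative variable names:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 6, 40
X = rng.normal(size=(m, n))
e = rng.uniform(0.5, 2.0, size=n)          # diagonal of E (n x n)
d = rng.uniform(0.1, 1.0, size=m)          # diagonal of D (m x m)
g = rng.normal(size=n)                     # right-hand side (e.g. the gradient)

# Direct solve of (E + X^T D X) s = g: an O(n^3) operation.
H = np.diag(e) + X.T @ (d[:, None] * X)
s_direct = np.linalg.solve(H, g)

# Woodbury:
# (E + X^T D X)^{-1} = E^{-1} - E^{-1} X^T (D^{-1} + X E^{-1} X^T)^{-1} X E^{-1}
# Only an m x m system is solved; the heavy products cost O(m^2 n).
Einv_g = g / e
C = np.diag(1.0 / d) + (X / e) @ X.T       # m x m capacitance matrix
s_wood = Einv_g - (X / e).T @ np.linalg.solve(C, X @ Einv_g)

assert np.allclose(s_direct, s_wood)
```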

Therefore the theoretical upper bound on one iteration of logistic regression with a rotationally variant penalty function is \(\mathcal {O}\left( m^2n\right) \), which is better than the direct Newton approach. However, looking at (15), we see that the number of multiplications is large, so the constant factor hidden in this estimate is large.

## 4 Experimental Results

- 1.
An artificial dataset with 100 informative attributes and 1000 redundant attributes; the informative part was produced by the function make_classification from the package scikit-learn, and the whole set was transformed by introducing correlations.

- 2.
Micro-array datasets Golub [6] and Singh [18] with their original train and test splits.

- 3.
Artificial non-linearly separable datasets: chessboard \(3\times 3\) and \(4\times 4\), and two spirals — used for learning a neural network.

As a reference we use solvers available in the package scikit-learn for the LogisticRegression model. For the \(\ell ^2\) penalty we use LibLinear [4] in two variants (primal and dual), L-BFGS, and L2-NEWTON-CG; for sparse penalty functions we compare our solutions with two solvers available in scikit-learn: LibLinear and SAGA.

For the \(\ell ^2\) penalty case we provide the algorithm L2-QR presented in Sect. 2.1. In the “sparse” case we compare the three algorithms presented in Sect. 2.2: L1-QR-soft, L1-QR and Lq-QR. Our approach L2-QR (Algorithm 1) is computationally equivalent to L2-NEWTON-CG, meaning that we solve an identical optimization problem (though in the smaller space). In the case of the \(\ell ^2\) penalty all models should theoretically converge to the same solution, so differences in the final value of the objective function are caused by numerical issues (numerical errors, approximations, or exceeding the number of iterations without convergence). These differences affect the predictions on a test set.

The case of the \(\ell ^1\) penalty is harder to compare. The L1-QR algorithm is equivalent to L1-Liblinear, i.e. it minimizes the same cost function. Algorithm L1-QR-soft uses the approximated \(\ell ^1\)-norm, and algorithm Lq-QR uses a slightly different non-convex cost function which gives results similar to \(\ell ^1\) penalized regression for \(q\approx 1\). We should also emphasize that the SAGA algorithm does not directly optimize the penalized log-likelihood function on the training set; it is a stochastic optimizer and sometimes gives qualitatively different models. In the case of L1-QR-soft the final solution is only approximately sparse (and depends on *a* (16)), whereas the other models produce strictly sparse solutions. The measure of sparsity is the number of non-zero coefficients. For L1-QR-soft we check the sparsity with a tolerance of order \(10^{-5}\).
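The tolerance-based sparsity count used for L1-QR-soft can be expressed in one line (the threshold \(10^{-5}\) follows the text; the helper name is ours):

```python
import numpy as np

def nnz(w, tol=1e-5):
    """Number of non-zero coefficients; for the approximately sparse
    L1-QR-soft solution, entries below the tolerance count as zero."""
    return int(np.sum(np.abs(w) > tol))

w = np.array([0.3, 1e-7, -2.0, 0.0, 4e-6])
assert nnz(w) == 2
```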

All models use a regularization parameter *C* (or \(1/\lambda \)). This parameter is selected in the cross-validation procedure from the same range. During experiments with artificial data we vary the size of the training subset. Experiments were performed on an Intel Xeon E5-2699 v4 machine, in a single-threaded environment (with parameters n_jobs=1 and MKL_NUM_THREADS=1).

**Table 1.** Experimental results for micro-array datasets and \(\ell ^2\) penalized logistic regressions. All solvers converge to the same solution; only the running times differ.

Dataset | Classifier | \(TIME_{FIT}\) [s] | Cost Fcn. | \(AUC_{TEST}\) | \(ACC_{TEST}\)
---|---|---|---|---|---
Golub \((38\times 7129)\) | L2-Newton-CG | 0.0520 | 1.17e+11 | 0.8571 | 0.8824
 | L2-QR | 0.0065 | 1.17e+11 | 0.8571 | 0.8824
 | SAG | 1.2560 | 1.17e+11 | 0.8571 | 0.8824
 | Liblinear L2 | 0.0280 | 1.17e+11 | 0.8571 | 0.8824
 | Liblinear L2 dual | 0.0737 | 1.17e+11 | 0.8571 | 0.8824
 | L-BFGS | 0.0341 | 1.17e+11 | 0.8571 | 0.8824
Singh \((102\times 12600)\) | L2-Newton-CG | 0.6038 | 5.14e+11 | 0.9735 | 0.9706
 | L2-QR | 0.0418 | 5.14e+11 | 0.9735 | 0.9706
 | SAG | 5.2822 | 5.13e+11 | 0.9735 | 0.9706
 | Liblinear L2 | 0.1991 | 5.14e+11 | 0.9735 | 0.9706
 | Liblinear L2 dual | 0.6083 | 5.14e+11 | 0.9735 | 0.9706
 | L-BFGS | 0.1192 | 5.14e+11 | 0.9735 | 0.9706

**Table 2.** Experimental results for micro-array datasets and \(\ell ^1\) penalized logistic regressions. The L1-QR solver converges to the same solution as L1-Liblinear; only the running times differ. SAGA and L1-QR-soft give different solutions.

Dataset | Classifier | \(TIME_{FIT}\) [s] | Cost Fcn. | \(AUC_{TEST}\) | \(ACC_{TEST}\) | NNZ coefs.
---|---|---|---|---|---|---
Golub | L1-QR-soft | 8.121 | 2.74e+07 | 0.8929 | 0.9118 | 90.1
 | \(L^{q=0.9}\)-QR | 0.544 | 2.80e+07 | 0.9393 | 0.95 | 9.1
 | L1-QR | 1.062 | 2.28e+07 | 0.8679 | 0.8912 | 10.2
 | Liblinear | 0.042 | 2.28e+07 | 0.8679 | 0.8912 | 10.4
 | SAGA | 4.532 | 2.78e+07 | 0.8857 | 0.9059 | 46.7
Singh | L1-QR-soft | 51.042 | 6.74e+07 | 0.8753 | 0.8794 | 91.2
 | \(L^{q=0.9}\)-QR | 3.941 | 8.65e+07 | 0.8893 | 0.9 | 13.4
 | L1-QR | 6.716 | 6.52e+07 | 0.8976 | 0.8912 | 20.1
 | Liblinear | 0.225 | 6.52e+07 | 0.8976 | 0.8912 | 20.2
 | SAGA | 21.251 | 7.11e+07 | 0.8869 | 0.8912 | 65.9

**Learning the Ordinary Logistic Regression Model.** In the first experiment, presented in Fig. 1, we use an artificial, highly correlated dataset (1). We used a train/test procedure for each size of the learning data, and for each classifier we selected the optimal value of the parameter \(C=1/\lambda \) using cross-validation. The number of samples varies from 20 to 300. As we can see, in the \(\ell ^2\) penalty case our solution using QR decomposition, L2-QR, gives better fitting times than the ordinary solvers available in scikit-learn, and all algorithms behave nearly the same; only L2-lbfgs gives slightly different results. In the sparse penalty case our algorithm L1-QR works faster than L1-Liblinear and obtains comparable but not identical results. For the sparse case L1-SAGA gives the best predictions (about 1–2% better than the other sparse algorithms), but it produces the densest solutions, similarly to L1-QR-soft.

In the second experiment we used the micro-array data with their original train and test sets. For these datasets the ratios (samples/attributes) are fixed (about 0.005–0.01). The results are shown in Table 1 (\(\ell ^2\) case) and Table 2 (\(\ell ^1\) case). The tables present mean values of times and cost functions, averaged over \(\lambda \)s; full traces over \(\lambda \)s are presented in Fig. 2 and Fig. 3. For the \(\ell ^2\) penalty we notice that all tested algorithms give identical results in terms of the quality of prediction and the cost function. However, the fitting times differ, and the best algorithm is the one using QR factorization.

For sparse penalty functions only the algorithms L1-Liblinear and L1-QR give quantitatively the same results; however, L1-Liblinear works about ten times faster. The other models give qualitatively different results. Algorithm Lq-QR obtained the best sparsity and the best prediction accuracy, and was also slightly faster than L1-QR. Looking at the cost function with the \(\ell ^1\) penalty, we see that L1-Liblinear and L1-QR are the same, while SAGA obtains a worse cost function than even L1-QR-soft. We want to stress that scikit-learn provides solvers only for the \(\ell ^2\) and \(\ell ^1\) penalties, not for the general \(\ell ^q\) case.

**Application to Extreme Learning and RVFL Networks.** The Random Vector Functional-Link (RVFL) network is a method of learning two (or more) layer neural networks in two separate steps. In the first step the coefficients of the hidden neurons are chosen randomly and fixed; in the second step a learning algorithm is applied only to the output layer. The second step is equivalent to learning the logistic regression model (a linear model with the sigmoid output function). More recently, this approach has also become known as “extreme learning” (see [20] for more references).
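The first, random stage of such a network is simple to sketch (the weight distribution and naming below are illustrative assumptions, not the paper's exact setup):

```python
import numpy as np

def rvfl_features(X, Z, seed=0):
    """First (random, fixed) stage of an RVFL / extreme-learning network:
    map the inputs through Z randomly weighted tanh neurons.  The second
    stage then fits a (penalized) logistic regression on H."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.normal(size=(n, Z))        # random input-to-hidden weights
    b = rng.normal(size=Z)             # random biases
    return np.tanh(X @ W + b)          # H: m x Z hidden activations

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
H = rvfl_features(X, Z=100)
assert H.shape == (4, 100)
```

Since *Z* (the hidden width) typically far exceeds the number of samples, the second-stage fit is exactly the \(m\ll n\) regime where the QR-based solvers above are attractive.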

where *Z* is the number of hidden neurons and \(\varphi (\varvec{x}; \varvec{w}, b) = \tanh \bigl (\sum _{k=1}^n w_kx_k+b \bigr )\) is the activation function.

All compared models share the number of hidden neurons *Z*, the kind of linear output classifier and its parameters. In the fitting part we ensure the same random part of the classifier. In this experiment we also added a new model: a multi-layer perceptron with two layers and *Z* hidden neurons, fitted in the standard way using the L-BFGS algorithm (MLP-lbfgs).

Results of the experiment are presented in Fig. 4. For each size of the learning data and for each classifier we select the optimal value of the parameter \(C=1/\lambda \) using cross-validation. The number of samples varies from 20 to 300. As we can see, in both cases (\(\ell ^2\) and sparse penalties) our solution using QR decomposition always gives better fitting times than the ordinary solvers available in scikit-learn. The fitting time of L1-QR is 2–5 times shorter than that of L1-Liblinear, especially for the chessboard \(4\times 4\) and two spirals cases. Looking at quality, we see that the sparse models are similar, but slightly different. For two spirals the best one is Lq-QR, which is also the sparsest model. Generally, sparse models are better for two spirals and chessboard \(4\times 4\). The MLP model has the worst quality and a fitting time comparable to the sparse regressions.

The experiment shows that QR factorization can be used to effectively implement the learning of an RVFL network with different regularization terms. Moreover, we confirm that such learning is more stable than ordinary neural network learning algorithms, especially for a large number of hidden neurons. Exemplary decision boundaries, sparsity and the selected hidden neurons are shown in Fig. 5.

## 5 Conclusion

In this paper we presented an application of the QR matrix factorization to improve the Newton procedure for learning logistic regression models with different kinds of penalties. We presented two approaches: the rotationally invariant case with the \(\ell ^2\) penalty, and the general convex, rotationally variant case with sparse penalty functions. Generally speaking, there is strong evidence that the use of QR factorization in the rotationally invariant case can improve the classical Newton-CG algorithm when \(m<n\). The most expensive operation in this approach is the QR factorization itself, which is performed once at the beginning. Our experiments also showed that for \(m\ll n\) this approach surpasses other Hessian-approximating algorithms such as L-BFGS and the truncated Newton method (used in Liblinear). In this case we have shown that the theoretical upper bound on the cost of a Newton iteration is \(\mathcal {O}\left( m^3\right) \).

We also showed that, using the QR decomposition and the Sherman-Morrison-Woodbury formula, we can solve the problem of learning the regression model with different sparse penalty functions. Admittedly, the improvement in this case is not as strong as in the \(\ell ^2\) penalty case; however, we proved that using QR factorization we obtain a theoretical upper bound significantly better than for the general Newton-CG procedure. In fact, the Newton iterations in this case have the same cost as the initial QR decomposition, i.e. \(\mathcal {O}\left( m^2n\right) \). Numerical experiments revealed that for more difficult and correlated data (e.g. in extreme learning) this approach may work faster than L1-Liblinear. However, we should admit that in typical, simpler cases L1-Liblinear may be faster.

## References

- 1. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, New York (2004)
- 2. Dai, Y.H.: On the nonmonotone line search. J. Optim. Theory Appl. **112**(2), 315–330 (2002)
- 3. Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. **96**(456), 1348–1360 (2001)
- 4. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: a library for large linear classification. J. Mach. Learn. Res. **9**, 1871–1874 (2008)
- 5. Golub, G., Van Loan, C.: Matrix Computations. Johns Hopkins Studies in the Mathematical Sciences, Johns Hopkins University Press (2013)
- 6. Golub, T.R., et al.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science **286**(5439), 531–537 (1999)
- 7. Green, P.J.: Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives (with discussion). J. R. Stat. Soc. Ser. B Methodol. **46**, 149–192 (1984)
- 8. Hastie, T., Tibshirani, R.: Expression arrays and the \(p\gg n\) problem (2003)
- 9. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York (2001). https://doi.org/10.1007/978-0-387-21606-5
- 10. Kabán, A., Durrant, R.J.: Learning with \(L_{q<1}\) vs \(L_1\)-norm regularisation with exponentially many irrelevant features. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008. LNCS, vol. 5211, pp. 580–596. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-87479-9_56
- 11. Krishnapuram, B., Carin, L., Figueiredo, M.A.T., Hartemink, A.: Sparse multinomial logistic regression: fast algorithms and generalization bounds. IEEE Trans. Pattern Anal. Mach. Intell. **27**(6), 957–968 (2005)
- 12. Lee, Y.J., Mangasarian, O.: SSVM: a smooth support vector machine for classification. Comput. Optim. Appl. **20**, 5–22 (2001)
- 13. Lin, C.J., Weng, R.C., Keerthi, S.S.: Trust region Newton method for logistic regression. J. Mach. Learn. Res. **9**, 627–650 (2008)
- 14. Minka, T.P.: A comparison of numerical optimizers for logistic regression (2003). https://tminka.github.io/papers/logreg/minka-logreg.pdf
- 15. Murphy, K.P.: Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge (2013)
- 16. Ng, A.Y.: Feature selection, \(L_1\) vs. \(L_2\) regularization, and rotational invariance. In: Proceedings of the Twenty-First International Conference on Machine Learning, ICML 2004, pp. 78–85. ACM, New York (2004)
- 17. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. **12**, 2825–2830 (2011)
- 18. Singh, S., Skanda, S., Scott, S., Arie, B., Sujata, P., Gurmit, S.: Overexpression of vimentin: role in the invasive phenotype in an androgen-independent model of prostate cancer. Cancer Res. **63**(9), 2306–2311 (2003)
- 19. Tseng, P., Yun, S.: A coordinate gradient descent method for nonsmooth separable minimization. Math. Program. **117**, 387–423 (2009)
- 20. Wang, L.P., Wan, C.R.: Comments on “the extreme learning machine”. IEEE Trans. Neural Netw. **19**(8), 1494–1495 (2008)
- 21. Zhu, C., Byrd, R.H., Lu, P., Nocedal, J.: Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Softw. **23**(4), 550–560 (1997)