
Robust and sparse regression in generalized linear model by stochastic optimization

  • Original Paper
  • Information Theory and Statistics
Japanese Journal of Statistics and Data Science

Abstract

The generalized linear model (GLM) plays a key role in regression analyses. For high-dimensional data, the sparse GLM has been used, but it is not robust against outliers. Recently, robust methods have been proposed for specific examples of the sparse GLM. Among them, we focus on the robust and sparse linear regression based on the \(\gamma\)-divergence. The estimator based on the \(\gamma\)-divergence has strong robustness under heavy contamination. In this paper, we extend the robust and sparse linear regression based on the \(\gamma\)-divergence to the robust and sparse GLM based on the \(\gamma\)-divergence, together with a stochastic optimization approach to obtain the estimate. We adopt the randomized stochastic projected gradient descent as the stochastic optimization approach and extend the established convergence property to the classical first-order necessary condition. By virtue of the stochastic optimization approach, we can efficiently estimate parameters for very large problems. In particular, we describe linear regression, logistic regression, and Poisson regression with \(L_1\) regularization in detail as specific examples of the robust and sparse GLM. In numerical experiments and real data analysis, the proposed method outperformed comparative methods.
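As a rough sketch of the type of estimation procedure summarized above, the following Python fragment combines a stochastic gradient step on a smooth robust loss with the \(L_1\) proximal map (soft-thresholding) and returns a randomly chosen iterate, in the spirit of randomized stochastic projected gradient methods. The helper name `grad_gamma_loss`, the constant step size, and the Euclidean proximal map are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def soft_threshold(z, tau):
    # Proximal operator of tau * ||.||_1 (componentwise soft-thresholding).
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def stochastic_prox_grad(grad_gamma_loss, theta0, X, y, lam, eta,
                         n_iters=1000, batch_size=32, seed=0):
    # Minimal sketch: grad_gamma_loss(theta, X_batch, y_batch) is assumed to
    # return a stochastic gradient of the smooth (gamma-divergence type) loss
    # on the mini-batch; it is a placeholder, not the paper's implementation.
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    iterates = []
    for _ in range(n_iters):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        g = grad_gamma_loss(theta, X[idx], y[idx])
        theta = soft_threshold(theta - eta * g, eta * lam)  # gradient step + L1 prox
        iterates.append(theta.copy())
    # Randomized output: return an iterate chosen uniformly at random,
    # as in randomized stochastic projected gradient type methods.
    return iterates[rng.integers(len(iterates))]
```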



Acknowledgements

This work was partially supported by JSPS KAKENHI Grant Number 17K00065.

Author information

Corresponding author

Correspondence to Takayuki Kawashima.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1

Here, we prove the convergence of \(\sum _{y=0}^\infty f(y|x_{t,i};\theta ^{(t)})^{1+\gamma }\) and \(\sum _{y=0}^\infty (y- y_{t,i} ) f(y|x_{t,i};\theta ^{(t)})^{1+\gamma }\). First, consider \(\sum _{y=0}^\infty f(y|x_{t,i};\theta ^{(t)})^{1+\gamma }\) and denote the n-th term by \(S_n = f(n|x_{t,i};\theta ^{(t)})^{1+\gamma }\). Then, we apply the d'Alembert ratio test to \(S_n\):

$$\begin{aligned}&\lim _{n \rightarrow \infty } \left| \frac{ S_{n+1}}{S_n} \right| \\&\quad =\lim _{n \rightarrow \infty } \left| \frac{ f(n+1|x_{t,i};\theta ^{(t)})^{1+\gamma } }{f(n|x_{t,i};\theta ^{(t)})^{1+\gamma }} \right| \\&\quad =\lim _{n \rightarrow \infty } \left| \frac{ \frac{\exp (-\mu _{x_{t,i}}(\beta _0^{(t)}, \beta ^{(t)}) ) }{(n+1)!} \mu _{x_{t,i}}(\beta _0^{(t)}, \beta ^{(t)})^{n+1} }{ \frac{\exp (-\mu _{x_{t,i}}(\beta _0^{(t)}, \beta ^{(t)}) ) }{n!} \mu _{x_{t,i}}(\beta _0^{(t)}, \beta ^{(t)})^n } \right| ^{1+\gamma } \\&\quad =\lim _{n \rightarrow \infty } \left| \frac{ \mu _{x_{t,i}}(\beta _0^{(t)}, \beta ^{(t)}) }{ n+1 } \right| ^{1+\gamma } \\&\quad = 0 , \end{aligned}$$

where the last equality holds because the term \(\mu _{x_{t,i}}(\beta _0^{(t)}, \beta ^{(t)})\) is bounded.

Therefore, \(\sum _{y=0}^\infty f(y|x_{t,i};\theta ^{(t)})^{1+\gamma }\) converges.

Next, consider \(\sum _{y=0}^\infty (y - y_{t,i} ) f(y|x_{t,i};\theta ^{(t)})^{1+\gamma }\) and denote the n-th term by \(S^{'}_n =(n - y_{t,i} ) f(n|x_{t,i};\theta ^{(t)})^{1+\gamma }\). Then, we apply the d'Alembert ratio test to \(S^{'}_n\):

$$\begin{aligned}&\lim _{n \rightarrow \infty } \left| \frac{ S^{'}_{n+1}}{S^{'}_n} \right| \\&\quad = \lim _{n \rightarrow \infty } \left| \frac{ (n+1 - y_{t,i} )f(n+1|x_{t,i};\theta ^{(t)})^{1+\gamma } }{ (n - y_{t,i} )f(n|x_{t,i};\theta ^{(t)})^{1+\gamma } } \right| \\&\quad = \lim _{n \rightarrow \infty } \left| \frac{ 1+\frac{1}{n} - \frac{y_{t,i}}{n} }{ 1 - \frac{y_{t,i}}{n} } \right| \left| \frac{ f(n+1|x_{t,i};\theta ^{(t)})^{1+\gamma } }{ f(n|x_{t,i};\theta ^{(t)})^{1+\gamma } } \right| \\&\quad = 1 \times 0 = 0 , \end{aligned}$$

where the first factor tends to 1 and the second factor tends to 0 by the previous calculation.

Therefore, \(\sum _{y=0}^\infty (y-y_{t,i}) f(y|x_{t,i};\theta ^{(t)})^{1+\gamma }\) converges.
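As a quick numerical check of this convergence, the following Python sketch evaluates truncated partial sums of \(\sum _{y=0}^\infty f(y|x_{t,i};\theta ^{(t)})^{1+\gamma }\) and \(\sum _{y=0}^\infty (y - y_{t,i}) f(y|x_{t,i};\theta ^{(t)})^{1+\gamma }\) for a Poisson density; the mean value `mu` and the observed count `y_obs` are arbitrary illustrative choices.

```python
import math

def poisson_pmf(y, mu):
    # Poisson probability mass function f(y | mu) = exp(-mu) * mu^y / y!
    return math.exp(-mu) * mu**y / math.factorial(y)

def truncated_sums(mu, y_obs, gamma, n_max):
    # Partial sums of sum_y f(y)^{1+gamma} and sum_y (y - y_obs) * f(y)^{1+gamma}.
    s1 = sum(poisson_pmf(y, mu) ** (1.0 + gamma) for y in range(n_max + 1))
    s2 = sum((y - y_obs) * poisson_pmf(y, mu) ** (1.0 + gamma)
             for y in range(n_max + 1))
    return s1, s2

# The partial sums stabilize once n_max is a few times mu, consistent with the
# ratio test above (the ratio behaves like mu / (n + 1)).
for n_max in (10, 20, 50, 100):
    print(n_max, truncated_sums(mu=3.0, y_obs=2, gamma=0.5, n_max=n_max))
```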

Appendix 2

Here, we give the proof of Theorem 2. The directional derivative of the objective \(\varPsi (\theta ) = E_{(x,y)}\left[ l((x,y);\theta ) \right] + \lambda P(\theta )\) at \(\theta ^{(R)}\) in a direction \(\delta\) is decomposed as

$$\begin{aligned}&\lim _{ k \downarrow 0} \frac{ \varPsi (\theta ^{(R)} + k\delta ) - \varPsi (\theta ^{(R)}) }{k} \nonumber \\&\quad = \lim _{ k \downarrow 0} \frac{ E_{(x,y)} \left[ l( (x,y);\theta ^{(R)} + k\delta ) \right] - E_{(x,y)} \left[ l( (x,y);\theta ^{(R)} ) \right] + \lambda P(\theta ^{(R)} + k\delta ) - \lambda P(\theta ^{(R)}) }{k} \nonumber \\&\quad = \lim _{ k \downarrow 0} \frac{ E_{(x,y)} \left[ l( (x,y);\theta ^{(R)} + k\delta ) \right] - E_{(x,y)} \left[ l( (x,y);\theta ^{(R)} ) \right] }{k} \nonumber \\&\qquad + \lim _{ k \downarrow 0} \frac{ \lambda P(\theta ^{(R)} + k\delta ) - \lambda P(\theta ^{(R)}) }{k} . \end{aligned}$$
(22)

The directional derivative of a differentiable function always exists and equals the inner product of its gradient with the direction:

$$\begin{aligned}&\lim _{ k \downarrow 0} \frac{ E_{(x,y)} \left[ l( (x,y);\theta ^{(R)} + k\delta ) \right] - E_{(x,y)} \left[ l( (x,y);\theta ^{(R)} ) \right] }{k} \nonumber \\&\quad = \left\langle \nabla E_{(x,y)} \left[ l( (x,y);\theta ^{(R)}) \right] , \delta \right\rangle . \end{aligned}$$
(23)

Moreover, the directional derivative of a (proper) convex function exists at any relative interior point of its domain and is greater than or equal to the inner product of any subgradient with the direction (Rockafellar 1970):

$$\begin{aligned} \lim _{ k \downarrow 0} \frac{ \lambda P(\theta ^{(R)} + k\delta ) - \lambda P(\theta ^{(R)}) }{k}&= \sup _{g \in \partial P(\theta ^{(R)}) } \lambda \left\langle g , \delta \right\rangle \nonumber \\&\ge \lambda \left\langle g , \delta \right\rangle \quad \text{for any } g \in \partial P(\theta ^{(R)}). \end{aligned}$$
(24)
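For a concrete one-dimensional illustration of (24), take the \(L_1\) penalty \(P(\theta ) = |\theta |\) at the nondifferentiable point \(\theta ^{(R)} = 0\), where \(\partial P(0) = [-1, 1]\) (this particular point is an illustrative choice, not a step of the proof):

$$\begin{aligned} \lim _{ k \downarrow 0} \frac{ \lambda |0 + k\delta | - \lambda |0| }{k} = \lambda |\delta | = \sup _{g \in [-1,1]} \lambda g \delta \ge \lambda g \delta \quad \text{for any } g \in [-1,1] . \end{aligned}$$

The supremum over subgradients recovers the directional derivative exactly, while any fixed subgradient gives only a lower bound.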

Then, by the optimality condition of (16), we have

$$\begin{aligned}&0 \in \nabla E_{(x,y)} \left[ l((x,y); \theta ^{(R)} ) \right] + \lambda \partial P(\theta ^{+}) +\frac{1}{\eta _R} \left\{ \nabla w \left( \theta ^{+} \right) - \nabla w \left( \theta ^{(R)} \right) \right\} , \nonumber \\&\quad \text{that is,} \quad \frac{1}{\eta _R} \left\{ \nabla w \left( \theta ^{(R)} \right) - \nabla w \left( \theta ^{+} \right) \right\} \in \nabla E_{(x,y)} \left[ l((x,y); \theta ^{(R)} ) \right] + \lambda \partial P(\theta ^{+}) . \end{aligned}$$
(25)

Therefore, we can obtain (21) from \(P_{X,R} \approx 0\) and (22), (23), (24), and (25) as follows:

$$\begin{aligned}&\lim _{ k \downarrow 0} \frac{ E_{(x,y)} \left[ l( (x,y);\theta ^{(R)} + k\delta ) \right] - E_{(x,y)} \left[ l( (x,y);\theta ^{(R)} ) \right] }{k} \\&\qquad + \lim _{ k \downarrow 0} \frac{ \lambda P(\theta ^{(R)} + k\delta ) - \lambda P(\theta ^{(R)}) }{k} \\&\quad \ge \left\langle \nabla E_{(x,y)} \left[ l( (x,y);\theta ^{(R)}) \right] , \delta \right\rangle +\lambda \left\langle g , \delta \right\rangle \quad \text{for any } g \in \partial P(\theta ^{(R)}) \\&\quad =\left\langle \nabla E_{(x,y)} \left[ l( (x,y);\theta ^{(R)}) \right] +\lambda g ,\delta \right\rangle \quad \text{for any } g \in \partial P(\theta ^{(R)}) . \end{aligned}$$

By (25) together with \(P_{X,R} \approx 0\) (so that \(\theta ^{+} \approx \theta ^{(R)}\) and \(\nabla w ( \theta ^{(R)} ) - \nabla w ( \theta ^{+} ) \approx 0\)), we have \(0 \in \nabla E_{(x,y)} \left[ l((x,y); \theta ^{(R)} ) \right] + \lambda \partial P(\theta ^{(R)})\); that is, there exists \(g \in \partial P(\theta ^{(R)})\) such that \(\nabla E_{(x,y)} \left[ l( (x,y);\theta ^{(R)}) \right] +\lambda g = 0\). Taking this \(g\) in the last line shows that the directional derivative of \(\varPsi\) at \(\theta ^{(R)}\) is nonnegative in every direction \(\delta\), which is (21).
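For intuition about (25), the following Python sketch takes \(w(\theta ) = \tfrac{1}{2}\Vert \theta \Vert _2^2\) (so \(\nabla w(\theta ) = \theta\)) and \(P(\theta ) = \Vert \theta \Vert _1\), in which case the update characterized by (25) reduces to componentwise soft-thresholding, and then verifies numerically that \(\frac{1}{\eta _R}\{\nabla w (\theta ^{(R)}) - \nabla w (\theta ^{+})\} - \nabla E_{(x,y)}[l]\) lies in \(\lambda \partial \Vert \theta ^{+}\Vert _1\). The vector `grad` is a stand-in for \(\nabla E_{(x,y)}\left[ l((x,y);\theta ^{(R)}) \right]\), and all numeric values are arbitrary illustrative choices.

```python
import numpy as np

def soft_threshold(z, tau):
    # Proximal operator of tau * ||.||_1.
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def check_inclusion(theta_R, grad, eta, lam, tol=1e-10):
    # With w(theta) = 0.5 * ||theta||_2^2 and P = L1 norm, the proximal update is
    # theta_plus = soft_threshold(theta_R - eta * grad, eta * lam).
    # Check that v = (theta_R - theta_plus) / eta - grad lies in lam * subdiff ||theta_plus||_1.
    theta_plus = soft_threshold(theta_R - eta * grad, eta * lam)
    v = (theta_R - theta_plus) / eta - grad
    ok = True
    for vj, tj in zip(v, theta_plus):
        if tj != 0.0:
            ok &= abs(vj - lam * np.sign(tj)) < tol   # subgradient is sign(tj)
        else:
            ok &= abs(vj) <= lam + tol                # subgradient lies in [-lam, lam]
    return ok

# Arbitrary illustrative values; prints True.
print(check_inclusion(theta_R=np.array([0.8, -0.05, 0.0]),
                      grad=np.array([0.3, -0.2, 0.1]),
                      eta=0.5, lam=0.25))
```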

Cite this article

Kawashima, T., Fujisawa, H. Robust and sparse regression in generalized linear model by stochastic optimization. Jpn J Stat Data Sci 2, 465–489 (2019). https://doi.org/10.1007/s42081-019-00049-9
