Online Feature Selection by Adaptive Sub-gradient Methods
Abstract
The overall goal of online feature selection is to iteratively select, from high-dimensional streaming data, a small, “budgeted” number of features for constructing accurate predictors. In this paper, we address the online feature selection problem using novel truncation techniques for two online sub-gradient methods: Adaptive Regularized Dual Averaging (ARDA) and Adaptive Mirror Descent (AMD). The corresponding truncation-based algorithms are called B-ARDA and B-AMD, respectively. The key aspect of our truncation techniques is to take into account the magnitude of feature values in the current predictor, together with their frequency in the history of predictions. A detailed regret analysis for both algorithms is provided. Experiments on six high-dimensional datasets indicate that both B-ARDA and B-AMD outperform two advanced online feature selection algorithms, OFS and SOFS, especially when the number of selected features is small. Compared to sparse online learning algorithms that use \(\ell _1\) regularization, B-ARDA is superior to \(\ell _1\)-ARDA, and B-AMD is superior to Ada-Fobos. Code related to this paper is available at: https://github.com/LUCKY-ting/online-feature-selection.
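The truncation idea described above — ranking features by the magnitude of their current weights together with a frequency signal from the history of predictions — can be sketched as follows. This is a hypothetical illustration, not the paper's exact B-ARDA/B-AMD update: the function name `budget_truncate`, the use of accumulated squared gradients `g_sq` as the frequency proxy, and the particular score `|w_i| * sqrt(g_sq_i)` are all assumptions made for the sketch.

```python
import numpy as np

def budget_truncate(w, g_sq, B):
    """Keep at most B non-zero coordinates of the weight vector w.

    Coordinates are ranked by a score combining the magnitude of the
    current weight with the accumulated squared gradients g_sq (an
    AdaGrad-style proxy for how often each feature has been active);
    all remaining coordinates are set to zero.
    """
    if B >= np.count_nonzero(w):
        return w.copy()                   # already within budget
    scores = np.abs(w) * np.sqrt(g_sq)    # magnitude x frequency proxy
    keep = np.argsort(scores)[-B:]        # indices of the B top scores
    truncated = np.zeros_like(w)
    truncated[keep] = w[keep]             # retain only budgeted features
    return truncated
```

In an online setting such a truncation would be applied after each sub-gradient update, so the predictor never carries more than B active features between rounds.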
Keywords
Online feature selection · Adaptive sub-gradient methods · High-dimensional streaming data
Acknowledgments
The authors would like to acknowledge support for this project from the National Key R&D Program of China (2017YFB0702600, 2017YFB0702601), the National Natural Science Foundation of China (Nos. 61432008, 61503178) and the Natural Science Foundation of Jiangsu Province of China (BK20150587).