Lossless Compression of Random Forests
Ensemble methods are among the state-of-the-art predictive modeling approaches. Applied to modern big data, these methods often require a large number of sub-learners, where the complexity of each learner typically grows with the size of the dataset. This phenomenon results in an increasing demand for storage space, which may be very costly. This problem mostly manifests in a subscriber-based environment, where a user-specific ensemble needs to be stored on a personal device with strict storage limitations (such as a cellular device). In this work we introduce a novel method for lossless compression of tree-based ensemble methods, focusing on random forests. Our suggested method is based on probabilistic modeling of the ensemble’s trees, followed by model clustering via Bregman divergence. This allows us to find a minimal set of models that provides an accurate description of the trees, and at the same time is small enough to store and maintain. Our compression scheme demonstrates high compression rates on a variety of modern datasets. Importantly, our scheme enables predictions from the compressed format and a perfect reconstruction of the original ensemble. In addition, we introduce a theoretically sound lossy compression scheme, which allows us to control the trade-off between the distortion and the coding rate.
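As a rough illustration of the clustering step described above, the sketch below (not the paper's implementation) clusters per-tree probabilistic models under the Kullback-Leibler divergence, which is a Bregman divergence. The choice of statistic (each tree's empirical distribution over split features), the helper names `kl` and `bregman_kmeans`, and all parameter values are illustrative assumptions.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence D(p || q), a Bregman divergence on the probability simplex."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q))

def bregman_kmeans(dists, k, iters=50, seed=0):
    """Lloyd-style clustering under KL divergence.

    For any Bregman divergence the centroid minimizing the total divergence
    of a cluster is the arithmetic mean (Banerjee et al., 2005), so the
    update step is a plain average of the assigned distributions.
    """
    rng = np.random.default_rng(seed)
    centroids = dists[rng.choice(len(dists), size=k, replace=False)]
    for _ in range(iters):
        # Assign each empirical distribution to its KL-nearest centroid.
        labels = np.array([np.argmin([kl(d, c) for c in centroids]) for d in dists])
        for j in range(k):
            members = dists[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Toy "forest": 200 trees, each summarized by its empirical distribution
# over which of 5 features it splits on (synthetic data for illustration).
rng = np.random.default_rng(1)
forest_stats = rng.dirichlet(np.ones(5), size=200)
models, assignment = bregman_kmeans(forest_stats, k=4)
```

In this reading, each centroid would serve as the coding distribution for an entropy coder over the trees assigned to it, playing the role of the "minimal set of models" in the lossless scheme; varying `k` would trade coding rate against distortion in the lossy variant.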
Keywords: entropy coding, lossless compression, lossy compression, random forest