Abstract
Ensemble methods are among the state-of-the-art predictive modeling approaches. Applied to modern big data, these methods often require a large number of sub-learners, where the complexity of each learner typically grows with the size of the dataset. This results in an increasing demand for storage space, which may be very costly. The problem is most acute in a subscriber-based environment, where a user-specific ensemble must be stored on a personal device with strict storage limitations (such as a cellular device). In this work we introduce a novel method for lossless compression of tree-based ensembles, focusing on random forests. Our method is based on probabilistic modeling of the ensemble's trees, followed by model clustering via Bregman divergence. This allows us to find a set of models that describes the trees accurately, yet is small enough to store and maintain. Our compression scheme demonstrates high compression rates on a variety of modern datasets. Importantly, it enables predictions directly from the compressed format as well as a perfect reconstruction of the original ensemble. In addition, we introduce a theoretically sound lossy compression scheme, which allows us to control the trade-off between the distortion and the coding rate.
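The clustering step described above can be sketched in code. The following is a hypothetical illustration, not the authors' implementation: it assumes each tree is summarized by a probability vector (for example, its empirical distribution over splitting features) and clusters those vectors under the KL divergence, the Bregman divergence generated by negative entropy. As shown by Banerjee et al. (JMLR, 2005), the centroid minimizing the total Bregman divergence of a cluster is the arithmetic mean, so the update step mirrors k-means.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q): the Bregman divergence of negative entropy."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def bregman_kmeans(dists, k, n_iter=50, seed=0):
    """Cluster probability vectors under KL divergence.

    For any Bregman divergence, the optimal cluster representative is
    the arithmetic mean of its members, so the algorithm alternates a
    KL-based assignment step with a plain averaging step.
    """
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct input vectors.
    centers = dists[rng.choice(len(dists), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment: each model goes to its closest centroid under KL.
        labels = np.array([
            np.argmin([kl_divergence(d, c) for c in centers]) for d in dists
        ])
        # Update: arithmetic mean of the assigned vectors.
        for j in range(k):
            members = dists[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels, centers
```

A usage sketch: summarizing four hypothetical trees by their feature-split distributions and clustering them into two representative models would reduce the number of distinct models to store from four to two, which is the essence of the compression idea.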
Painsky, A., Rosset, S. Lossless Compression of Random Forests. J. Comput. Sci. Technol. 34, 494–506 (2019). https://doi.org/10.1007/s11390-019-1921-0