Improving on Bagging with Input Smearing

Abstract
Bagging is an ensemble learning method that has proven to be a useful tool in the arsenal of machine learning practitioners. Commonly applied in conjunction with decision tree learners to build an ensemble of decision trees, it often reduces prediction error compared to a single tree built from one training set of size N. Bagging is based on the idea that, ideally, we would like to eliminate the variance due to a particular training set by combining trees built from all possible training sets of size N. In practice, however, only one training set is available, and bagging simulates this platonic method by sampling with replacement from the original training data to form new training sets. In this paper we pursue the idea of sampling from a kernel density estimator of the underlying distribution to form new training sets, in addition to sampling from the data itself. This can be viewed as “smearing out” the resampled training data to generate new datasets, with the amount of “smear” controlled by a parameter. We show that the resulting method, called “input smearing”, can lead to improved results compared to bagging. We present results for both classification and regression problems.
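The core idea described above can be sketched in a few lines of code. The following is a minimal illustration, assuming numeric attributes and a scikit-learn-style base learner for the regression case; the function names, the default smear value of 0.3, and the per-attribute noise scaling by the sample standard deviation are illustrative assumptions drawn from the abstract's description, not a reproduction of the paper's exact procedure.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def input_smearing_ensemble(X, y, base_learner=DecisionTreeRegressor,
                            n_estimators=50, smear=0.3, seed=None):
    """Bagging variant: each bootstrap sample's inputs are perturbed
    with Gaussian noise, which corresponds to sampling from a Gaussian
    kernel density estimate centred on the training points. The `smear`
    parameter (illustrative default) controls the amount of noise."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    scale = smear * X.std(axis=0)                # per-attribute noise scale (assumption)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)         # sample with replacement, as in bagging
        noise = rng.normal(size=(n, d)) * scale  # "smear out" the resampled inputs
        models.append(base_learner().fit(X[idx] + noise, y[idx]))
    return models

def predict(models, X):
    # Regression: average the members' predictions; for classification
    # one would vote or average class probability estimates instead.
    return np.mean([m.predict(X) for m in models], axis=0)
```

Note that with smear = 0 the noise vanishes and the sketch reduces to ordinary bagging, which is why the smearing parameter can be seen as interpolating between plain bootstrap resampling and sampling from a smoothed estimate of the underlying distribution.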
© 2006 Springer-Verlag Berlin Heidelberg
Cite this paper
Frank, E., Pfahringer, B.: Improving on Bagging with Input Smearing. In: Ng, W.K., Kitsuregawa, M., Li, J., Chang, K. (eds.) Advances in Knowledge Discovery and Data Mining, PAKDD 2006. Lecture Notes in Computer Science, vol. 3918. Springer, Berlin, Heidelberg (2006). https://doi.org/10.1007/11731139_14