Synonyms

Bootstrap estimation; Bootstrap sampling

Definition

The bootstrap is a statistical method for estimating the performance (e.g., accuracy) of classification or regression methods. It is based on the statistical procedure of sampling with replacement. Unlike other estimation methods such as cross-validation, the bootstrap may select the same object or tuple for the training set more than once. That is, each time a tuple is selected, it is equally likely to be selected again and re-added to the training set.
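Sampling with replacement can be illustrated with a minimal sketch in Python (the data values here are arbitrary placeholders):

```python
import random

random.seed(42)
data = ["a", "b", "c", "d", "e"]

# Sampling with replacement: each draw is independent of earlier draws,
# so the same tuple can be selected more than once.
sample = random.choices(data, k=len(data))
print(sample)
```

Because every draw considers the full dataset, duplicates in `sample` are expected, which is exactly what distinguishes the bootstrap from cross-validation's disjoint folds.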

Historical Background

Bootstrap sampling was developed by Bradley Efron in 1979, and was mainly used for estimating statistical parameters such as means and standard errors [2]. A meta-classification method using the bootstrap, called bootstrap aggregating (or bagging), was proposed by Leo Breiman in 1994 to improve classification by combining the classifications of randomly generated training sets [1].

Foundations

This section discusses a commonly used bootstrap method, the 0.632 bootstrap. Given a dataset of N tuples, the dataset is sampled N times, with replacement, resulting in a bootstrap sample, or training set, of N tuples. It is very likely that some of the original data tuples will occur more than once in the training set. The data tuples that were not sampled into the training set form the test set. If this process is repeated multiple times, on average 63.2 % of the original data tuples will end up in the training set and the remaining 36.8 % will form the test set (hence the name, 0.632 bootstrap).
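The sampling procedure above can be sketched as follows; the empirical fractions should come out near 63.2 % and 36.8 % for a reasonably large N (N = 1000 here is an arbitrary choice):

```python
import random

random.seed(0)
N = 1000
data = list(range(N))

# Draw N times with replacement to form the bootstrap (training) sample.
train = [random.choice(data) for _ in range(N)]

# Tuples that were never drawn form the test set.
test = set(data) - set(train)

print(f"distinct training tuples: {len(set(train)) / N:.3f}")
print(f"test-set fraction:        {len(test) / N:.3f}")
```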

The figure 63.2 % comes from the following argument. On each draw, every tuple has a probability of 1∕N of being selected, so the probability of not being chosen on a given draw is (1 − 1∕N). The sampling is done N times, so the probability that a tuple is never chosen is (1 − 1∕N)^N. If N is large, this probability approaches e^−1 ≈ 0.368. Thus, on average 36.8 % of the tuples will not be selected for training and thereby end up in the test set, and the remaining 63.2 % will form the training set.
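The convergence of (1 − 1∕N)^N to e^−1 can be checked numerically with a short script:

```python
import math

# Probability that a given tuple is never drawn in N samples with replacement.
for N in (10, 100, 1000, 10000):
    p_never = (1 - 1 / N) ** N
    print(f"N = {N:5d}: (1 - 1/N)^N = {p_never:.4f}")

print(f"limit: e^-1 = {math.exp(-1):.4f}")
```

Already at N = 100 the probability is within about 0.002 of e^−1 ≈ 0.3679, so the 63.2 %/36.8 % split is a good approximation for all but very small datasets.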

The above procedure can be repeated k times, where in each iteration, the current test set is used to obtain an accuracy estimate of the model obtained from the current bootstrap sample. The overall accuracy of the model is then estimated as

$$ Acc(M)=\frac{1}{k}{\displaystyle \sum_{i=1}^{k}\big(0.632\times Acc{\left({M}_i\right)}_{test\hbox{\_}set}+0.368\times Acc{\left({M}_i\right)}_{train\hbox{\_}set}\big)}, $$
(1)

where Acc(Mi)train_set and Acc(Mi)test_set are the accuracies of the model obtained from bootstrap sample i when it is applied to the training set and the test set of iteration i, respectively.
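The 0.632 accuracy estimate can be sketched in Python as below. The function signature and the trivial majority-class "model" are illustrative assumptions, not part of the original method; `fit` and `accuracy` stand in for whatever learner and scoring function are being evaluated, and the per-iteration weighted accuracies are averaged over the k repetitions:

```python
import random

def bootstrap_632_accuracy(data, labels, fit, accuracy, k=10, seed=0):
    """Sketch of the 0.632 bootstrap accuracy estimate.

    Hypothetical interface: fit(X, y) -> model,
    accuracy(model, X, y) -> float in [0, 1].
    """
    rng = random.Random(seed)
    N = len(data)
    total, runs = 0.0, 0
    for _ in range(k):
        # Sample N indices with replacement -> bootstrap training set.
        idx = [rng.randrange(N) for _ in range(N)]
        chosen = set(idx)
        # Out-of-bag tuples (never drawn) form the test set.
        oob = [i for i in range(N) if i not in chosen]
        if not oob:            # vanishingly rare; skip degenerate iteration
            continue
        X_tr = [data[i] for i in idx];  y_tr = [labels[i] for i in idx]
        X_te = [data[i] for i in oob];  y_te = [labels[i] for i in oob]
        model = fit(X_tr, y_tr)
        total += (0.632 * accuracy(model, X_te, y_te)
                  + 0.368 * accuracy(model, X_tr, y_tr))
        runs += 1
    return total / runs

# Usage with a toy majority-class classifier on made-up data:
def fit(X, y):
    return max(set(y), key=y.count)       # "model" = most frequent label

def acc(model, X, y):
    return sum(1 for t in y if t == model) / len(y)

data = list(range(20))
labels = [0] * 12 + [1] * 8
print(bootstrap_632_accuracy(data, labels, fit, acc))
```

The 0.368 weight on training-set accuracy compensates for the optimism of evaluating on tuples the model has already seen, while the 0.632 weight reflects the expected fraction of distinct tuples in each bootstrap sample.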

Key Applications

The bootstrap method is particularly useful for estimating performance when the dataset is relatively small.

Cross-References