Data Randomization, Partitioning, and Augmentation
Although subsampling, resampling, or otherwise rearranging a given dataset cannot increase its information content, these procedures can sometimes be useful in extracting information. Randomly rearranging the observed dataset, for example, can give an indication of how unusual the dataset is with respect to a given null hypothesis. This idea leads to randomization tests.

There are many useful procedures for data analysis that involve partitioning the original sample. Using subsets of the full sample, we may be able to estimate the bias or the variance of the standard estimator or test statistic without relying too heavily on the assumptions that led to that choice of estimator or test statistic. It is often useful to partition the dataset into two parts, using the data in the “training set” or “estimation set” to arrive at a preliminary estimate or fit, and then using the data in the “validation set” or “test set” to evaluate that fit. This approach is particularly appropriate when we are unsure of our model of the data-generating process; in actual applications, of course, we are always at least somewhat unsure of our model. If the full dataset is used to fit the model, we are very limited in the extent to which we can validate it.

No matter what we do with a given dataset, we are still left with uncertainty about its relevance to future modeling problems. Prediction requires some assumption about the model of the data-generating processes, both the one that yielded the given data and the unseen one for which inferences are to be made. The expected error of a predictor on new data is called prediction error or generalization error. Because this involves unseen data and unknown scenarios, there is no way of measuring or estimating this kind of error with full confidence.
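The randomization-test idea can be illustrated with a small sketch. Under the null hypothesis that two samples come from the same distribution, the group labels are exchangeable, so we compare the observed test statistic with its values under random relabelings of the pooled data. This is a minimal illustration, not the text's own implementation; the function name and the choice of a difference-in-means statistic are illustrative.

```python
import random

def permutation_test(x, y, n_perm=10_000, seed=0):
    """Two-sample randomization (permutation) test for a difference in means.

    Returns an approximate p-value: the proportion of random relabelings
    of the pooled data whose mean difference is at least as extreme as
    the observed one.
    """
    rng = random.Random(seed)
    observed = sum(x) / len(x) - sum(y) / len(y)
    pooled = list(x) + list(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                     # random relabeling of the pooled data
        xs, ys = pooled[:len(x)], pooled[len(x):]
        diff = sum(xs) / len(xs) - sum(ys) / len(ys)
        if abs(diff) >= abs(observed):
            count += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (count + 1) / (n_perm + 1)
```

For two clearly separated samples the returned p-value is small; for identical samples every relabeling is at least as extreme as the observed statistic and the p-value is 1.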
The use of partitions of the given dataset, however, is one way of getting some feel for the generalization error in scenarios that are somewhat similar to the one that gave rise to the observed dataset. Subsets of the data can be formed systematically, or they can be formed as random samples from the given dataset. Sometimes the given dataset is viewed as the set of mass points of a finite distribution whose distribution function is the empirical distribution function of the data. In this case, the data partitioning may be done in such a way that observations occur multiple times in the various partitions. In most cases, when we speak of “sets” or “subsets”, we mean “multisets” (that is, collections in which an item may occur more than once, each occurrence counted separately).
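Resampling with replacement from the empirical distribution, in which each resample is a multiset of the original observations, is the basis of the bootstrap. The following is a minimal sketch under that interpretation; the function name and the use of the standard error of a statistic as the target are illustrative choices, not the text's own code.

```python
import random

def bootstrap_se(data, stat, n_boot=2000, seed=0):
    """Bootstrap estimate of the standard error of a statistic.

    Treats the observed data as the mass points of a finite distribution
    (the empirical distribution) and draws resamples with replacement,
    so each resample is a multiset of the original observations.
    """
    rng = random.Random(seed)
    n = len(data)
    reps = []
    for _ in range(n_boot):
        # Sample n observations with replacement from the empirical distribution.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        reps.append(stat(resample))
    mean = sum(reps) / n_boot
    var = sum((r - mean) ** 2 for r in reps) / (n_boot - 1)
    return var ** 0.5
```

For the sample mean, the bootstrap standard error should be close to the plug-in value $s/\sqrt{n}$ computed from the observed data.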
Keywords: Cross validation · Randomization test · Data randomization · Generalization error · Monte Carlo study