Abstract
Data preprocessing techniques generally refer to the addition, deletion, or transformation of the training set data. Preprocessing data is a crucial step prior to modeling since data preparation can make or break a model’s predictive ability. To illustrate general preprocessing techniques, we begin by introducing a cell segmentation data set (Section 3.1). This data set contains common predictor problems such as skewness, outliers, and missing values. Sections 3.2 and 3.3 review predictor transformations for single predictors and multiple predictors, respectively. In Section 3.4 we discuss several approaches for handling missing data. Other preprocessing steps may include removing (Section 3.5), adding (Section 3.6), or binning (Section 3.7) predictors, all of which must be done carefully so that predictive information is not lost or erroneous information is added to the data. The computing section (3.8) provides R syntax for the previously described preprocessing steps. Exercises are provided at the end of the chapter to solidify concepts.
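The chapter's own examples are in R (Section 3.8), but the core preprocessing ideas sketched in the abstract are language-neutral. As a minimal illustration — using invented toy data, with Python standing in for the chapter's R code — here is median imputation of a missing entry followed by centering and scaling of each predictor:

```python
import numpy as np

# Toy training matrix with one missing entry (np.nan); the values are
# invented for illustration only.
X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [3.0, 400.0],
              [4.0, 600.0]])

# Impute missing entries with the column median -- one simple option
# among the approaches for missing data discussed in Section 3.4.
col_median = np.nanmedian(X, axis=0)
X_imp = np.where(np.isnan(X), col_median, X)

# Center and scale each predictor to mean 0, standard deviation 1.
mu = X_imp.mean(axis=0)
sd = X_imp.std(axis=0, ddof=1)
X_std = (X_imp - mu) / sd
```

In practice these statistics (medians, means, standard deviations) are estimated on the training set and then reused unchanged to transform any test or new samples.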
Notes
1. The individual data points can be found on the journal web site or in the AppliedPredictiveModeling R package. See the Computing section at the end of this chapter.
2. The original authors included several “status” features that are binary representations of other features in the data set. We excluded these from the analysis in this chapter.
3. Some readers familiar with Box and Cox (1964) will know that this transformation was developed for outcome data, while Box and Tidwell (1962) describe similar methods for transforming a set of predictors in a linear model. Our experience is that the Box–Cox transformation is more straightforward, less prone to numerical issues, and just as effective for transforming individual predictor variables.
4. Section 20.5 discusses model extrapolation—where the model predicts samples outside of the mainstream of the training data. Another concept is the applicability domain of the model, which is the population of samples that can be effectively predicted by the model.
5. A preprocessed version of these data can also be found in the caret package and is used in later chapters.
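To make note 3 concrete: the Box–Cox family estimates a power λ that makes a strictly positive predictor more symmetric, applying (x^λ − 1)/λ (or log x when λ = 0). A brief sketch — in Python via scipy's `boxcox`, since the chapter's own code is R, with simulated lognormal data standing in for a skewed predictor:

```python
import numpy as np
from scipy import stats

# Simulated right-skewed, strictly positive predictor; lognormal data is
# a stand-in for a skewed measurement such as cell size.
rng = np.random.default_rng(7)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# boxcox estimates lambda by maximum likelihood and returns the
# transformed values along with the estimate.
x_bc, lam = stats.boxcox(x)

# Skewness should drop substantially after the transformation.
print(stats.skew(x), stats.skew(x_bc))
```

For exactly lognormal data the estimated λ lands near zero, recovering the log transform; Box–Cox is useful precisely because it chooses the power empirically rather than requiring the analyst to guess it.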
References
Abdi H, Williams L (2010). “Principal Component Analysis.” Wiley Interdisciplinary Reviews: Computational Statistics, 2(4), 433–459.
Austin P, Brunner L (2004). “Inflation of the Type I Error Rate When a Continuous Confounding Variable Is Categorized in Logistic Regression Analyses.” Statistics in Medicine, 23(7), 1159–1178.
Bone R, Balk R, Cerra F, Dellinger R, Fein A, Knaus W, Schein R, Sibbald W (1992). “Definitions for Sepsis and Organ Failure and Guidelines for the Use of Innovative Therapies in Sepsis.” Chest, 101(6), 1644–1655.
Box G, Cox D (1964). “An Analysis of Transformations.” Journal of the Royal Statistical Society, Series B (Methodological), 26(2), 211–252.
Box G, Tidwell P (1962). “Transformation of the Independent Variables.” Technometrics, 4(4), 531–550.
Everitt B, Landau S, Leese M, Stahl D (2011). Cluster Analysis. Wiley.
Forina M, Casale M, Oliveri P, Lanteri S (2009). “CAIMAN brothers: A Family of Powerful Classification and Class Modeling Techniques.” Chemometrics and Intelligent Laboratory Systems, 96(2), 239–245.
Geladi P, Manley M, Lestander T (2003). “Scatter Plotting in Multivariate Data Analysis.” Journal of Chemometrics, 17(8–9), 503–511.
Giuliano K, DeBiasio R, Dunlay R, Gough A, Volosky J, Zock J, Pavlakis G, Taylor D (1997). “High–Content Screening: A New Approach to Easing Key Bottlenecks in the Drug Discovery Process.” Journal of Biomolecular Screening, 2(4), 249–259.
Hill A, LaPan P, Li Y, Haney S (2007). “Impact of Image Segmentation on High–Content Screening Data Quality for SK–BR-3 Cells.” BMC Bioinformatics, 8(1), 340.
Jerez J, Molina I, Garcia-Laencina P, Alba R, Ribelles N, Martin M, Franco L (2010). “Missing Data Imputation Using Statistical and Machine Learning Methods in a Real Breast Cancer Problem.” Artificial Intelligence in Medicine, 50, 105–115.
Kuiper S (2008). “Introduction to Multiple Regression: How Much Is Your Car Worth?” Journal of Statistics Education, 16(3).
Mente S, Lombardo F (2005). “A Recursive–Partitioning Model for Blood–Brain Barrier Permeation.” Journal of Computer–Aided Molecular Design, 19(7), 465–481.
Myers R (1994). Classical and Modern Regression with Applications. PWS-KENT Publishing Company, Boston, MA, second edition.
Saar-Tsechansky M, Provost F (2007b). “Handling Missing Values When Applying Classification Models.” Journal of Machine Learning Research, 8, 1625–1657.
Serneels S, Nolf ED, Espen PV (2006). “Spatial Sign Pre-processing: A Simple Way to Impart Moderate Robustness to Multivariate Estimators.” Journal of Chemical Information and Modeling, 46(3), 1402–1409.
Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman R (2001). “Missing Value Estimation Methods for DNA Microarrays.” Bioinformatics, 17(6), 520–525.
Copyright information
© 2013 Springer Science+Business Media New York
Cite this chapter
Kuhn, M., Johnson, K. (2013). Data Pre-processing. In: Applied Predictive Modeling. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-6849-3_3
Print ISBN: 978-1-4614-6848-6
Online ISBN: 978-1-4614-6849-3
eBook Packages: Mathematics and Statistics (R0)