How to address missing data is an issue most researchers face. Computerized algorithms have been developed to ingest rectangular data sets, where the rows represent observations and the columns represent variables. These data matrices contain elements whose values are real numbers. In many data sets, some of the elements of the matrix are not observed. Quite often, missing observations arise from instrument failures, values that have not passed quality control criteria, etc. This leads to a quandary for the analyst using techniques that require a full data matrix. The first decision an analyst must make is whether the actual underlying values would have been observed had there not been an instrument failure, an extreme value, or some unknown cause. Since many programs expect complete data, and the most economical way to achieve this is to delete the observations with missing data, the analysis is most often performed on a subset of the available data. This situation can become extreme in cases where a substantial portion of the data are missing or, worse, in cases where many variables exist, each with a seemingly small percentage of missing data. In such cases, large amounts of available data are discarded by deleting observations with one or more pieces of missing data. The problem is important because the investigator is interested in making inferences about the entire population, not just those observations with complete data.
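The cost of deleting incomplete observations can be illustrated with a small simulation. The sketch below is not taken from the chapter: it assumes, purely for illustration, that each matrix element is missing independently with the same probability, in which case the expected fraction of complete rows is $(1 - q)^p$ for $p$ variables and missing rate $q$. The function names and the chosen values (30 variables, 2% missing) are hypothetical.

```python
import numpy as np

def complete_case_fraction(data):
    """Fraction of rows (observations) that contain no missing (NaN) entries."""
    data = np.asarray(data, dtype=float)
    complete_rows = ~np.isnan(data).any(axis=1)
    return complete_rows.mean()

def expected_complete_fraction(n_variables, missing_rate):
    """Expected fraction of complete rows, ASSUMING every element is missing
    independently with probability `missing_rate` (an illustrative assumption)."""
    return (1.0 - missing_rate) ** n_variables

# Hypothetical data matrix: 10,000 observations of 30 variables,
# with 2% of the elements removed at random.
rng = np.random.default_rng(0)
n_obs, n_vars, rate = 10_000, 30, 0.02
data = rng.normal(size=(n_obs, n_vars))
data[rng.random((n_obs, n_vars)) < rate] = np.nan

observed = complete_case_fraction(data)
expected = expected_complete_fraction(n_vars, rate)
print(f"fraction of elements missing:    {np.isnan(data).mean():.3f}")
print(f"fraction of rows kept (sample):  {observed:.3f}")
print(f"fraction of rows kept (formula): {expected:.3f}")
```

Under these assumptions only about 55% of the rows survive deletion, even though just 2% of the individual elements are missing. Real missing-data mechanisms are rarely this simple, so the formula should be read as a rough guide rather than a prediction.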
Before embarking on an analysis of the impact of missing data on the first two moments of data distributions, it is helpful to discuss whether there are patterns in the missing data. Quite often, understanding the way data are missing helps to illuminate the reason for the missing values. In the case of a series of gridpoints, all gridpoints but one may have complete data. If the gridpoint with missing data is considered important, some technique to fill in the missing values may be sought. Spatial interpolation techniques have been developed that are accurate in most situations (e.g., Barnes 1964; Julian 1984; Spencer and Gao 2004). Contrast this type of missing data pattern with another situation, where a series of variables (e.g., temperature, precipitation, station pressure, relative humidity) are measured at a single location. Perhaps all but one of the variables are complete over a set of observations, but the last variable has some missing data. In such cases, interpolation techniques are not the logical alternative; some other method is required. Such problems are not unique to the environmental sciences. In the analysis of agricultural data, patterns of missing data have been noted for nearly a century (Yates 1933). Dodge (1985) discusses the use of least squares estimation to replace missing data in univariate analysis.
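One simple way to look for such patterns is to tabulate which variables are missing together. The following is a minimal sketch, not the authors' procedure: it assumes missing values are coded as NaN, and the function name, the toy station record, and the variable ordering are hypothetical.

```python
import numpy as np

def missing_patterns(data):
    """Summarize missingness in a rectangular data matrix (rows = observations,
    columns = variables), assuming missing values are coded as NaN.

    Returns the number of missing values per variable, the distinct row-wise
    missingness patterns (True = missing), and how often each pattern occurs.
    """
    mask = np.isnan(np.asarray(data, dtype=float))
    per_variable = mask.sum(axis=0)
    patterns, counts = np.unique(mask, axis=0, return_counts=True)
    return per_variable, patterns, counts

# Hypothetical single-station record with columns:
# temperature, precipitation, station pressure, relative humidity.
nan = np.nan
station = np.array([
    [21.3, 0.0, 1012.1, 55.0],
    [19.8, 2.4, 1009.7, nan],
    [22.1, 0.0, 1011.3, 61.0],
    [23.4, 0.0, 1013.0, 58.0],
    [20.2, 5.1, 1008.2, nan],
    [21.0, 0.3, 1010.5, 63.0],
])

per_variable, patterns, counts = missing_patterns(station)
print("missing per variable:", per_variable)
for pattern, count in zip(patterns, counts):
    print(pattern.astype(int), "occurs in", count, "rows")
```

In this toy record only the last variable has gaps, which is the single-location situation described above, where spatial interpolation is not the natural choice.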
References
Afifi, A. A., & Elashoff, R. M. (1966). Missing observations in multivariate statistics: Review of the literature. Journal of the American Statistical Association, 61, 595–604
Barnes, S. L. (1964). A technique for maximizing details in numerical weather map analysis. Journal of Applied Meteorology, 3, 396–409
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), 5th Annual ACM Workshop on COLT (pp. 144–152). Pittsburgh, PA: ACM Press
Chang, C., & Lin, C. (2001). LIBSVM: A library for support vector machines. From http://www.csie.ntu.edu.tw/~cjlin/libsvm
Cox, D. R., & Hinkley, D. V. (1974). Theoretical statistics. New York: Wiley
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38
Dodge, Y. (1985). Analysis of experiments with missing data. New York: Wiley
Duffy, P. B., Doutriaux, C., Santer, B. D., & Fodor, I. K. (2001). Effect of missing data on estimates of near-surface temperature change since 1900. Journal of Climate, 14, 2809–2814
Hartley, H. O. (1958). Maximum likelihood estimation from incomplete data. Biometrics, 14, 174–194
Haykin, S. (1999). Neural networks: A comprehensive foundation (2nd ed.). Englewood Cliffs, NJ: Prentice-Hall
Julian, P. R. (1984). Objective analysis in the tropics: A proposed scheme. Monthly Weather Review, 112, 1752–1767
Kemp, W. P., Burnell, D. G., Everson, D. O., & Thomson, A. J. (1983). Estimating missing daily maximum and minimum temperatures. Journal of Applied Meteorology, 22, 1587–1593
Kidson, J. W., & Trenberth, K. E. (1988). Effects of missing data on estimates of monthly mean general circulation statistics. Journal of Climate, 1, 1261–1275
Lu, Q., Lund, R., & Seymour, L. (2005). An update on U.S. temperature trends. Journal of Climate, 18, 4906–4914
Lund, R. B., Seymour, L., & Kafadar, K. (2001). Temperature trends in the United States. Environmetrics, 12, 673–690
Mann, M. E., Rutherford, S., Wahl, E., & Ammann, C. (2005). Testing the fidelity of methods used in proxy-based reconstructions of past climate. Journal of Climate, 18, 4097–4107
Marini, M. M., Olsen, A. R., & Rubin, D. B. (1980). Maximum likelihood estimation in panel studies with missing data. Sociological Methodology, 11, 314–357
MATLAB Statistics Toolbox (2007). Retrieved August 28, 2007, from http://www.mathworks.com/access/helpdesk/help/toolbox/stats/
Meng, X.-L., & Pedlow, S. (1992). EM: A bibliographic review with missing articles. Proceedings of the Statistical Computing Section, American Statistical Association. Alexandria, VA: American Statistical Association, 24–27
Meng, X.-L., & Rubin, D. B. (1991). Using EM to obtain asymptotic variance-covariance matrices: The SEM algorithm. Journal of the American Statistical Association, 86, 899–909
Orchard, T., & Woodbury, M. A. (1972). A missing information principle: Theory and applications. Proceedings of the 6th Berkeley Symposium of Mathematical Statistics and Probability, 1, 697–715
Richman, M. B., & Lamb, P. J. (1985). Climatic pattern analysis of three- and seven-day summer rainfall in the central United States: Some methodological considerations and a regionalization. Journal of Applied Meteorology, 24, 1325–1343
Roth, P. L., Campion, J. E., & Jones, S. D. (1996). The impact of four missing data techniques on validity estimates in human resource management. Journal of Business and Psychology, 11, 101–112
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592
Rubin, D. B. (1988). An overview of multiple imputation. Proceedings of the Survey Research Methods Section of the American Statistical Association, 79–84
Rutherford, S., Mann, M. E., Osborn, T. J., Bradley, R. S., Briffa, K. R., Hughes, M. K., & Jones, P. D. (2005). Proxy-based Northern Hemisphere surface temperature reconstructions: Sensitivity to method, predictor network, target solution and target domain. Journal of Climate, 18, 2308–2329
Schneider, T. (2001). Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values. Journal of Climate, 14, 853–871
Spencer, P. L., & Gao, J. (2004). Can gradient information be used to improve variational objective analysis? Monthly Weather Review, 132, 2977–2994
Stooksbury, D. E., Idso, C. D., & Hubbard, K. G. (1999). The effects of data gaps on the calculated monthly mean maximum and minimum temperatures in the continental United States: A spatial and temporal study. Journal of Climate, 12, 1524–1533
Trafalis, T. B., Santosa, B., & Richman, M. B. (2003). Prediction of rainfall from WSR-88D radar using kernel-based methods. International Journal of Smart Engineering System Design, 5, 429–438
Vapnik, V. N. (1998). Statistical learning theory. New York: Springer
Wilks, S. S. (1932). Moments and distribution of estimates of population parameters from fragmentary samples. Annals of Mathematical Statistics, 3, 163–195
Yates, F. (1933). The analysis of replicated experiments when the field results are incomplete. Empire Journal of Experimental Agriculture, 1, 129–142
© 2009 Springer Science+Business Media B.V
Cite this chapter
Richman, M.B., Trafalis, T.B., Adrianto, I. (2009). Missing Data Imputation Through Machine Learning Algorithms. In: Haupt, S.E., Pasini, A., Marzban, C. (eds) Artificial Intelligence Methods in the Environmental Sciences. Springer, Dordrecht. https://doi.org/10.1007/978-1-4020-9119-3_7
DOI: https://doi.org/10.1007/978-1-4020-9119-3_7
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-9117-9
Online ISBN: 978-1-4020-9119-3
eBook Packages: Earth and Environmental Science (R0)