Missing Data Imputation Through Machine Learning Algorithms

Richman, Michael B.; Trafalis, Theodore B.; Adrianto, Indra

doi:10.1007/978-1-4020-9119-3_7

How to address missing data is an issue most researchers face. Computerized algorithms have been developed to ingest rectangular data sets, where the rows represent observations and the columns represent variables. These data matrices contain elements whose values are real numbers. In many data sets, some of the elements of the matrix are not observed. Quite often, missing observations arise from instrument failures, values that have not passed quality control criteria, etc. That leads to a quandary for the analyst using techniques that require a full data matrix. The first decision an analyst must make is whether the actual underlying values would have been observed in the absence of an instrument failure, an extreme value, or some unknown cause. Since many programs expect complete data, and the most economical way to achieve this is by deleting the observations with missing data, the analysis is most often performed on a subset of the available data. This situation can become extreme in cases where a substantial portion of the data are missing or, worse, in cases where many variables exist, each with a seemingly small percentage of missing data. In such cases, large amounts of available data are discarded by deleting observations with one or more pieces of missing data. This problem matters because the investigator is interested in making inferences about the entire population, not just those observations with complete data.
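The cost of deleting incomplete observations can be illustrated with a small simulation. The sketch below is not taken from the chapter; it assumes, purely for illustration, that each element of the data matrix is missing independently with the same probability `miss_rate`, in which case the expected fraction of complete rows is `(1 - miss_rate) ** n_vars`. The function names and the chosen sizes (20 variables, 5% missing) are hypothetical.

```python
import numpy as np

def complete_case_fraction(X):
    """Fraction of rows (observations) that contain no missing (NaN) entries."""
    return float(np.mean(~np.isnan(X).any(axis=1)))

def simulate_missing(n_obs=10_000, n_vars=20, miss_rate=0.05, seed=0):
    """Random data matrix with entries set to NaN independently at miss_rate.

    Independent, equal-probability missingness is an illustrative assumption;
    real missing-data patterns are often structured.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_obs, n_vars))
    X[rng.random(size=X.shape) < miss_rate] = np.nan
    return X

X = simulate_missing()
observed = complete_case_fraction(X)
expected = (1 - 0.05) ** 20  # about 0.358 under the independence assumption

print(f"complete rows kept: {observed:.3f} (expected about {expected:.3f})")
```

Under these assumptions, only about a third of the observations survive deletion even though just 5% of the individual values are missing, which is the situation the paragraph above describes.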

Before embarking on an analysis of the impact of missing data on the first two moments of data distributions, it is helpful to discuss whether there are patterns in the missing data. Quite often, understanding the way data are missing helps to illuminate the reason for the missing values. In the case of a series of gridpoints, all gridpoints but one may have complete data. If the gridpoint with missing data is considered important, some technique to fill in the missing values may be sought. Spatial interpolation techniques have been developed that are accurate in most situations (e.g., Barnes 1964; Julian 1984; Spencer and Gao 2004). Contrast this type of missing data pattern with another situation, where a series of variables (e.g., temperature, precipitation, station pressure, relative humidity) are measured at a single location. Perhaps all but one of the variables are complete over a set of observations, but the last variable has some missing data. In such cases, interpolation techniques are not the logical alternative; some other method is required. Such problems are not unique to the environmental sciences. In the analysis of agricultural data, patterns of missing data have been noted for nearly a century (Yates 1933). Dodge (1985) discusses the use of least squares estimation to replace missing data in univariate analysis.
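For the single-location pattern described above, where one variable has gaps and the others are complete, one simple possibility is to regress the incomplete variable on the complete ones and fill the gaps with the fitted values. The sketch below illustrates that general idea with ordinary least squares; it is not the procedure of Dodge (1985) or of this chapter, and the function name `regression_impute` and the synthetic data are hypothetical. It assumes all predictor columns are fully observed.

```python
import numpy as np

def regression_impute(X, target_col):
    """Fill NaNs in one column using a least squares fit on the other columns.

    Assumes every column except target_col is complete (illustrative only).
    """
    X = np.asarray(X, dtype=float).copy()
    missing = np.isnan(X[:, target_col])
    predictors = np.delete(X, target_col, axis=1)
    # Design matrix: intercept column plus the fully observed variables.
    A = np.column_stack([np.ones(len(X)), predictors])
    # Fit on rows where the target was observed.
    coef, *_ = np.linalg.lstsq(A[~missing], X[~missing, target_col], rcond=None)
    # Replace the missing entries with the fitted values.
    X[missing, target_col] = A[missing] @ coef
    return X

# Synthetic example: the third variable is an exact linear function of the others.
rng = np.random.default_rng(1)
Z = rng.normal(size=(200, 2))
y = 2.0 + 3.0 * Z[:, 0] - 1.0 * Z[:, 1]
data = np.column_stack([Z, y])
truth = data[:5, 2].copy()
data[:5, 2] = np.nan  # pretend the first five values were not recorded

filled = regression_impute(data, target_col=2)
```

Because the synthetic relationship is noise-free, the filled values match the withheld ones; with real data, fitted values understate the variability of the missing observations, which is one reason more elaborate imputation methods are studied.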

References

Afifi, A. A., & Elashoff, R. M. (1966). Missing observations in multivariate statistics: Review of the literature. Journal of the American Statistical Association, 61, 595–604
Barnes, S. L. (1964). A technique for maximizing details in numerical weather map analysis. Journal of Applied Meteorology, 3, 396–409
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), 5th Annual ACM Workshop on COLT (pp. 144–152). Pittsburgh, PA: ACM Press
Chang, C., & Lin, C. (2001). LIBSVM: A library for support vector machines. From http://www.csie.ntu.edu.tw/~cjlin/libsvm
Cox, D. R., & Hinkley, D. V. (1974). Theoretical statistics. New York: Wiley
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39, 1–38
Dodge, Y. (1985). Analysis of experiments with missing data. New York: Wiley
Duffy, P. B., Doutriaux, C., Santer, B. D., & Fodor, I. K. (2001). Effect of missing data on estimates of near-surface temperature change since 1900. Journal of Climate, 14, 2809–2814
Hartley, H. O. (1958). Maximum likelihood estimation from incomplete data. Biometrics, 14, 174–194
Haykin, S. (1999). Neural networks: A comprehensive foundation (2nd ed.). Englewood Cliffs, NJ: Prentice-Hall
Julian, P. R. (1984). Objective analysis in the tropics: A proposed scheme. Monthly Weather Review, 112, 1752–1767
Kemp, W. P., Burnell, D. G., Everson, D. O., & Thomson, A. J. (1983). Estimating missing daily maximum and minimum temperatures. Journal of Applied Meteorology, 22, 1587–1593
Kidson, J. W., & Trenberth, K. E. (1988). Effects of missing data on estimates of monthly mean general circulation statistics. Journal of Climate, 1, 1261–1275
Lu, Q., Lund, R., & Seymour, L. (2005). An update on U.S. temperature trends. Journal of Climate, 18, 4906–4914
Lund, R. B., Seymour, L., & Kafadar, K. (2001). Temperature trends in the United States. Environmetrics, 12, 673–690
Mann, M. E., Rutherford, S., Wahl, E., & Ammann, C. (2005). Testing the fidelity of methods used in proxy-based reconstructions of past climate. Journal of Climate, 18, 4097–4107
Marini, M. M., Olsen, A. R., & Rubin, D. B. (1980). Maximum likelihood estimation in panel studies with missing data. Sociological Methodology, 11, 314–357
MATLAB Statistics Toolbox (2007). Retrieved August 28, 2007, from http://www.mathworks.com/access/helpdesk/help/toolbox/stats/
Meng, X.-L., & Pedlow, S. (1992). EM: A bibliographic review with missing articles. Proceedings of the Statistical Computing Section, American Statistical Association. Alexandria, VA: American Statistical Association, 24–27
Meng, X.-L., & Rubin, D. B. (1991). Using EM to obtain asymptotic variance-covariance matrices: The SEM algorithm. Journal of the American Statistical Association, 86, 899–909
Orchard, T., & Woodbury, M. A. (1972). A missing information principle: Theory and applications. Proceedings of the 6th Berkeley Symposium of Mathematical Statistics and Probability, 1, 697–715
Richman, M. B., & Lamb, P. J. (1985). Climatic pattern analysis of three- and seven-day summer rainfall in the central United States: Some methodological considerations and a regionalization. Journal of Applied Meteorology, 24, 1325–1343
Roth, P. L., Campion, J. E., & Jones, S. D. (1996). The impact of four missing data techniques on validity estimates in human resource management. Journal of Business and Psychology, 11, 101–112
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592
Rubin, D. B. (1988). An overview of multiple imputation. Proceedings of the Survey Research Methods Section of the American Statistical Association, 79–84
Rutherford, S., Mann, M. E., Osborn, T. J., Bradley, R. S., Briffa, K. R., Hughes, M. K., & Jones, P. D. (2005). Proxy-based Northern Hemisphere surface temperature reconstructions: Sensitivity to method, predictor network, target season, and target domain. Journal of Climate, 18, 2308–2329
Schneider, T. (2001). Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values. Journal of Climate, 14, 853–871
Spencer, P. L., & Gao, J. (2004). Can gradient information be used to improve variational objective analysis? Monthly Weather Review, 132, 2977–2994
Stooksbury, D. E., Idso, C. D., & Hubbard, K. G. (1999). The effects of data gaps on the calculated monthly mean maximum and minimum temperatures in the continental United States: A spatial and temporal study. Journal of Climate, 12, 1524–1533
Trafalis, T. B., Santosa, B., & Richman, M. B. (2003). Prediction of rainfall from WSR-88D radar using kernel-based methods. International Journal of Smart Engineering System Design, 5, 429–438
Vapnik, V. N. (1998). Statistical learning theory. New York: Springer
Wilks, S. S. (1932). Moments and distribution of estimates of population parameters from fragmentary samples. Annals of Mathematical Statistics, 3, 163–195
Yates, F. (1933). The analysis of replicated experiments when the field results are incomplete. Empire Journal of Experimental Agriculture, 1, 129–142

Author information

Authors and Affiliations

School of Meteorology, University of Oklahoma, 120 David L. Boren Blvd, Suite 5900, Norman, OK, 73072, USA
Michael B. Richman
School of Industrial Engineering, University of Oklahoma, 202 West Boyd St., Room 124, Norman, OK, 73019, USA
Theodore B. Trafalis & Indra Adrianto

Corresponding author

Correspondence to Michael B. Richman.

Editor information

Editors and Affiliations

Applied Research Laboratory, Pennsylvania State University, Box 30, State College, PA, 16804-0030, USA
Sue Ellen Haupt
Institute of Atmospheric Pollution, National Research Council, Via Salaria Km. 29.300, Monterotondo Stazione, Rome, 00016, Italy
Antonello Pasini
Dept. of Statistics, University of Washington and the Applied Physics Laboratory, Box 354322, Seattle, WA, 98195-4322, USA
Caren Marzban

About this chapter

Cite this chapter

Richman, M.B., Trafalis, T.B., Adrianto, I. (2009). Missing Data Imputation Through Machine Learning Algorithms. In: Haupt, S.E., Pasini, A., Marzban, C. (eds) Artificial Intelligence Methods in the Environmental Sciences. Springer, Dordrecht. https://doi.org/10.1007/978-1-4020-9119-3_7

DOI: https://doi.org/10.1007/978-1-4020-9119-3_7
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-9117-9
Online ISBN: 978-1-4020-9119-3
eBook Packages: Earth and Environmental Science, Earth and Environmental Science (R0)
