Missing Data Imputation Through Machine Learning Algorithms

How to address missing data is an issue most researchers face. Computerized algorithms have been developed to ingest rectangular data sets, where the rows represent observations and the columns represent variables. These data matrices contain elements whose values are real numbers. In many data sets, some of the elements of the matrix are not observed. Quite often, missing observations arise from instrument failures, values that have not passed quality control criteria, and similar causes. That leads to a quandary for the analyst using techniques that require a full data matrix. The first decision an analyst must make is whether the actual underlying values would have been observed had there not been an instrument failure, an extreme value, or some unknown cause. Since many programs expect complete data, and the most economical way to achieve this is to delete the observations with missing data, the analysis is most often performed on a subset of the available data. This situation can become extreme when a substantial portion of the data is missing or, worse, when there are many variables, each with a seemingly small percentage of missing data. In such cases, large amounts of available data are discarded by deleting every observation that has one or more missing values. The problem matters because the investigator is interested in making inferences about the entire population, not just about those observations with complete data.
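The cost of deleting incomplete observations when many variables each have a small missing fraction can be illustrated with a short calculation. The sketch below is not taken from the chapter: it assumes, purely for illustration, that every element is missing independently with the same probability, so the expected fraction of complete rows is (1 - q)^p for p variables and missing rate q. The function names `complete_case_fraction` and `simulate_listwise_deletion` are hypothetical.

```python
import numpy as np


def complete_case_fraction(n_vars: int, miss_rate: float) -> float:
    """Expected fraction of rows with no missing element, ASSUMING each
    element is missing independently with probability miss_rate."""
    return (1.0 - miss_rate) ** n_vars


def simulate_listwise_deletion(n_obs: int, n_vars: int,
                               miss_rate: float, seed: int = 0) -> float:
    """Blank out elements of a synthetic data matrix at random and return
    the fraction of rows that survive deletion of incomplete rows."""
    rng = np.random.default_rng(seed)
    data = rng.normal(size=(n_obs, n_vars))          # synthetic observations
    mask = rng.random((n_obs, n_vars)) < miss_rate   # True where value is lost
    data[mask] = np.nan
    complete_rows = ~np.isnan(data).any(axis=1)      # rows with no NaN at all
    return float(complete_rows.mean())


# Compare the closed-form expectation with a simulation at a 2% missing rate.
for p in (5, 20, 50):
    expected = complete_case_fraction(p, 0.02)
    simulated = simulate_listwise_deletion(10_000, p, 0.02)
    print(f"{p:3d} variables: expected {expected:.3f}, simulated {simulated:.3f}")
```

Under this independence assumption the surviving fraction shrinks geometrically with the number of variables, which is why a "seemingly small" per-variable missing rate can still discard most of the data. Real missingness is often not independent across elements, so the numbers are indicative only.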

Before embarking on an analysis of the impact of missing data on the first two moments of data distributions, it is helpful to ask whether there are patterns in the missing data. Quite often, understanding the way data are missing helps to illuminate the reason for the missing values. In the case of a series of gridpoints, all gridpoints but one may have complete data. If the gridpoint with missing data is considered important, some technique to fill in the missing values may be sought. Spatial interpolation techniques have been developed that are accurate in most situations (e.g., Barnes 1964; Julian 1984; Spencer and Gao 2004). Contrast this type of missing data pattern with another situation where a series of variables (e.g., temperature, precipitation, station pressure, relative humidity) is measured at a single location. Perhaps all but one of the variables are complete over a set of observations, but the last variable has some missing data. In such cases, interpolation techniques are not the logical alternative; some other method is required. Such problems are not unique to the environmental sciences. In the analysis of agricultural data, patterns of missing data have been noted for nearly a century (Yates 1933). Dodge (1985) discusses the use of least squares estimation to replace missing data in univariate analysis.
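For the single-location case, where one variable has gaps and the others are complete, one simple option is to predict the incomplete variable from the complete ones by least squares. The following is a minimal sketch of that idea under stated assumptions: it is not the chapter's method nor the specific procedure in Dodge (1985), the helper name `regression_impute` is hypothetical, and it assumes a linear relationship and fully observed predictor columns.

```python
import numpy as np


def regression_impute(data: np.ndarray, target_col: int) -> np.ndarray:
    """Fill NaNs in one column with predictions from an ordinary
    least-squares fit on the remaining columns.

    ASSUMPTION: every column other than target_col is fully observed.
    """
    filled = data.copy()
    missing = np.isnan(data[:, target_col])
    predictors = np.delete(data, target_col, axis=1)
    if np.isnan(predictors).any():
        raise ValueError("predictor columns must be complete")

    # Design matrix with an intercept column.
    design = np.column_stack([np.ones(len(data)), predictors])

    # Fit only on rows where the target variable was observed.
    coef, *_ = np.linalg.lstsq(design[~missing],
                               data[~missing, target_col], rcond=None)

    # Replace only the missing entries; observed values are left untouched.
    filled[missing, target_col] = design[missing] @ coef
    return filled


# Usage on synthetic data: two complete variables and a third with gaps.
rng = np.random.default_rng(1)
x = rng.normal(size=(50, 2))
y = 3.0 + 2.0 * x[:, 0] - 1.0 * x[:, 1] + rng.normal(scale=0.1, size=50)
station = np.column_stack([x, y])
station[[4, 10, 31], 2] = np.nan
completed = regression_impute(station, target_col=2)
print(completed[[4, 10, 31], 2])
```

Because every imputed value lies exactly on the fitted surface, a fill of this kind tends to understate the variance of the completed variable, which is one reason the effect of imputation on the first two moments deserves attention.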

References

  • Afifi, A. A., & Elashoff, R. M. (1966). Missing observations in multivariate statistics: Review of the literature. Journal of the American Statistical Association, 61, 595–604.

  • Barnes, S. L. (1964). A technique for maximizing details in numerical weather map analysis. Journal of Applied Meteorology, 3, 396–409.

  • Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), 5th Annual ACM Workshop on COLT (pp. 144–152). Pittsburgh, PA: ACM Press.

  • Chang, C., & Lin, C. (2001). LIBSVM: A library for support vector machines. From http://www.csie.ntu.edu.tw/~cjlin/libsvm

  • Cox, D. R., & Hinkley, D. V. (1974). Theoretical statistics. New York: Wiley.

  • Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.

  • Dodge, Y. (1985). Analysis of experiments with missing data. New York: Wiley.

  • Duffy, P. B., Doutriaux, C., Santer, B. D., & Fodor, I. K. (2001). Effect of missing data on estimates of near-surface temperature change since 1900. Journal of Climate, 14, 2809–2814.

  • Hartley, H. O. (1958). Maximum likelihood estimation from incomplete data. Biometrics, 14, 174–194.

  • Haykin, S. (1999). Neural networks: A comprehensive foundation (2nd ed.). Englewood Cliffs, NJ: Prentice-Hall.

  • Julian, P. R. (1984). Objective analysis in the tropics: A proposed scheme. Monthly Weather Review, 112, 1752–1767.

  • Kemp, W. P., Burnell, D. G., Everson, D. O., & Thomson, A. J. (1983). Estimating missing daily maximum and minimum temperatures. Journal of Applied Meteorology, 22, 1587–1593.

  • Kidson, J. W., & Trenberth, K. E. (1988). Effects of missing data on estimates of monthly mean general circulation statistics. Journal of Climate, 1, 1261–1275.

  • Lu, Q., Lund, R., & Seymour, L. (2005). An update on U.S. temperature trends. Journal of Climate, 18, 4906–4914.

  • Lund, R. B., Seymour, L., & Kafadar, K. (2001). Temperature trends in the United States. Environmetrics, 12, 673–690.

  • Mann, M. E., Rutherford, S., Wahl, E., & Ammann, C. (2005). Testing the fidelity of methods used in proxy-based reconstructions of past climate. Journal of Climate, 18, 4097–4107.

  • Marini, M. M., Olsen, A. R., & Rubin, D. B. (1980). Maximum likelihood estimation in panel studies with missing data. Sociological Methodology, 11, 314–357.

  • MATLAB Statistics Toolbox (2007). Retrieved August 28, 2007, from http://www.mathworks.com/access/helpdesk/help/toolbox/stats/

  • Meng, X.-L., & Pedlow, S. (1992). EM: A bibliographic review with missing articles. Proceedings of the Statistical Computing Section, American Statistical Association. Alexandria, VA: American Statistical Association, 24–27.

  • Meng, X.-L., & Rubin, D. B. (1991). Using EM to obtain asymptotic variance-covariance matrices: The SEM algorithm. Journal of the American Statistical Association, 86, 899–909.

  • Orchard, T., & Woodbury, M. A. (1972). A missing information principle: Theory and applications. Proceedings of the 6th Berkeley Symposium of Mathematical Statistics and Probability, 1, 697–715.

  • Richman, M. B., & Lamb, P. J. (1985). Climatic pattern analysis of three- and seven-day summer rainfall in the central United States: Some methodological considerations and a regionalization. Journal of Applied Meteorology, 24, 1325–1343.

  • Roth, P. L., Campion, J. E., & Jones, S. D. (1996). The impact of four missing data techniques on validity estimates in human resource management. Journal of Business and Psychology, 11, 101–112.

  • Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592.

  • Rubin, D. B. (1988). An overview of multiple imputation. Proceedings of the Survey Research Methods Section of the American Statistical Association, 79–84.

  • Rutherford, S., Mann, M. E., Osborn, T. J., Bradley, R. S., Briffa, K. R., Hughes, M. K., & Jones, P. D. (2005). Proxy-based Northern Hemisphere surface temperature reconstructions: Sensitivity to method, predictor network, target season, and target domain. Journal of Climate, 18, 2308–2329.

  • Schneider, T. (2001). Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values. Journal of Climate, 14, 853–871.

  • Spencer, P. L., & Gao, J. (2004). Can gradient information be used to improve variational objective analysis? Monthly Weather Review, 132, 2977–2994.

  • Stooksbury, D. E., Idso, C. D., & Hubbard, K. G. (1999). The effects of data gaps on the calculated monthly mean maximum and minimum temperatures in the continental United States: A spatial and temporal study. Journal of Climate, 12, 1524–1533.

  • Trafalis, T. B., Santosa, B., & Richman, M. B. (2003). Prediction of rainfall from WSR-88D radar using kernel-based methods. International Journal of Smart Engineering System Design, 5, 429–438.

  • Vapnik, V. N. (1998). Statistical learning theory. New York: Springer.

  • Wilks, S. S. (1932). Moments and distribution of estimates of population parameters from fragmentary samples. Annals of Mathematical Statistics, 3, 163–195.

  • Yates, F. (1933). The analysis of replicated experiments when the field results are incomplete. Empire Journal of Experimental Agriculture, 1, 129–142.

Author information

Corresponding author

Correspondence to Michael B. Richman.

Copyright information

© 2009 Springer Science+Business Media B.V.

About this chapter

Cite this chapter

Richman, M.B., Trafalis, T.B., Adrianto, I. (2009). Missing Data Imputation Through Machine Learning Algorithms. In: Haupt, S.E., Pasini, A., Marzban, C. (eds) Artificial Intelligence Methods in the Environmental Sciences. Springer, Dordrecht. https://doi.org/10.1007/978-1-4020-9119-3_7
