A robust deep learning model for missing value imputation in big NCDC dataset

Abstract

Missing data are an integral part of most real-world datasets. To produce efficient and accurate analytical results, such datasets must first be processed with imputation and cleaning techniques. Deep learning has recently come to be regarded as one of the most powerful branches of machine learning, used to uncover hidden knowledge in very large datasets and so make predictions more accurate. In this work, an efficient deep learning imputation model is proposed for imputing missing values in the weather data of an individual weather station on a temporal basis. The model is evaluated on several stations from the National Climatic Data Center (NCDC) datasets, predicting the missing data of a station from the geographically nearest stations that have complete data. Five optimizers [RMSprop, Adam, Nadam, stochastic gradient descent (SGD), and Adagrad] are compared on three evaluation criteria: mean absolute error (MAE), mean square error (MSE), and root mean square error (RMSE). Among these, the SGD optimizer is found to be the most accurate in predicting the missing values. The proposed technique imputes missing values with higher accuracy and a lower error rate than previous models.
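The three evaluation criteria named in the abstract can be illustrated with a minimal sketch. It assumes the usual setup in which known values are masked out, imputed, and then compared with the held-out originals. The function name `imputation_errors` and the sample temperatures are hypothetical and are not taken from the paper.

```python
import math

def imputation_errors(observed, imputed):
    """Return (MAE, MSE, RMSE) between held-out observed values and imputed values."""
    if len(observed) != len(imputed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    # Residual for each masked position: true value minus imputed value.
    residuals = [o - p for o, p in zip(observed, imputed)]
    n = len(residuals)
    mae = sum(abs(r) for r in residuals) / n   # mean absolute error
    mse = sum(r * r for r in residuals) / n    # mean square error
    rmse = math.sqrt(mse)                      # root mean square error
    return mae, mse, rmse

# Hypothetical temperatures (deg C) that were masked out and then imputed.
observed = [20.0, 22.0, 19.0, 25.0]
imputed = [21.0, 22.0, 17.0, 24.0]
mae, mse, rmse = imputation_errors(observed, imputed)
print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f}")
```

MAE weights all errors equally, whereas MSE and RMSE penalize large errors more heavily, so reporting all three gives a fuller picture when comparing optimizers.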

Acknowledgements

We are indebted to the National Oceanic and Atmospheric Administration for making the NCDC data available to the public; without it, this work would not have been possible.

Author information

Corresponding author

Correspondence to Ibrahim Gad.

About this article

Cite this article

Gad, I., Hosahalli, D., Manjunatha, B.R. et al. A robust deep learning model for missing value imputation in big NCDC dataset. Iran J Comput Sci (2020). https://doi.org/10.1007/s42044-020-00065-z

Keywords

  • Weather forecasting
  • Hybrid deep learning model
  • Missing data
  • NCDC dataset