Review on General Techniques and Packages for Data Imputation in R on a Real World Dataset

  • Conference paper
Computational Collective Intelligence (ICCCI 2018)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11056)

Abstract

Collected data often consist of small samples with missing values, and this flaw makes data analysis less effective. Almost all algorithms for statistical data analysis need a complete data set, so missing values have to be dealt with during data preprocessing. Some well-known methods for filling in missing values are mean imputation, K-nearest neighbours (kNN), and fuzzy K-means (FKM). Quite a lot of R packages offer imputation of missing values, but it is sometimes hard to find the appropriate algorithm for a particular dataset. With large datasets, these well-known methods sometimes cannot work as intended because they need too much memory to perform their operations. This paper provides an overview of the imputation of a sizeable dataset by applying three different algorithms. The algorithms were compared under a missing completely at random (MCAR) assumption, using root mean squared error (RMSE) as the evaluation criterion. The experimental results show that the Random Forest algorithm can be quite useful for missing value imputation.
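
The evaluation protocol named in the abstract (hide values under MCAR, impute them, score the result with RMSE) can be sketched as follows. This is only an illustrative sketch in Python, not the R code or the dataset used in the paper: the synthetic signal, the 20% missing rate, the helper names `mask_mcar`, `impute_mean` and `rmse_on_missing`, and the choice of mean imputation as the method under test are all assumptions made for the example.

```python
import math
import random


def mask_mcar(values, missing_rate, rng):
    """Hide each entry independently with the same probability (MCAR)."""
    return [None if rng.random() < missing_rate else v for v in values]


def impute_mean(masked):
    """Replace every missing entry with the mean of the observed entries."""
    observed = [v for v in masked if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in masked]


def rmse_on_missing(truth, masked, imputed):
    """RMSE computed only over the positions that were hidden."""
    errors = [(t - i) ** 2
              for t, m, i in zip(truth, masked, imputed) if m is None]
    return math.sqrt(sum(errors) / len(errors))


rng = random.Random(0)  # fixed seed so the mask is reproducible
# Hypothetical complete series standing in for a real sensor column.
truth = [20.0 + 5.0 * math.sin(k / 10.0) for k in range(200)]
masked = mask_mcar(truth, missing_rate=0.2, rng=rng)
imputed = impute_mean(masked)
score = rmse_on_missing(truth, masked, imputed)
print(f"hidden entries: {sum(v is None for v in masked)}, RMSE: {score:.3f}")
```

Because the true values at the hidden positions are known, any other imputation method can be swapped in for `impute_mean` and compared on the same mask, with a lower RMSE indicating a closer reconstruction.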

Notes

  1. http://www.spotseven.de/gecco/gecco-challenge/gecco-challenge-2015/

Author information

Corresponding author

Correspondence to Fitore Muharemi.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Muharemi, F., Logofătu, D., Leon, F. (2018). Review on General Techniques and Packages for Data Imputation in R on a Real World Dataset. In: Nguyen, N., Pimenidis, E., Khan, Z., Trawiński, B. (eds) Computational Collective Intelligence. ICCCI 2018. Lecture Notes in Computer Science, vol 11056. Springer, Cham. https://doi.org/10.1007/978-3-319-98446-9_36

  • DOI: https://doi.org/10.1007/978-3-319-98446-9_36

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-98445-2

  • Online ISBN: 978-3-319-98446-9

  • eBook Packages: Computer Science, Computer Science (R0)
