Review on General Techniques and Packages for Data Imputation in R on a Real World Dataset

Muharemi, Fitore; Logofătu, Doina; Leon, Florin

doi:10.1007/978-3-319-98446-9_36

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11056)

Included in the following conference series:

International Conference on Computational Collective Intelligence

Abstract

Collected data often consist of small samples with missing values, which makes data analysis less effective. Almost all algorithms for statistical data analysis need a complete data set, so missing values have to be handled during data preprocessing. Some well-known methods for filling in missing values are mean imputation, K-nearest neighbours (kNN), and fuzzy K-means (FKM). Quite a lot of R packages offer imputation of missing values, but it is sometimes hard to find the appropriate algorithm for a particular dataset. With large datasets, these well-known methods sometimes cannot work as intended because they need too much memory to perform their operations. This paper provides an overview of imputing a sizeable dataset with three different algorithms. The algorithms were compared under a missing completely at random (MCAR) assumption, using root mean squared error (RMSE) as the evaluation criterion. The experimental results show that the Random Forest algorithm can be quite useful for missing value imputation.
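
The evaluation protocol described in the abstract (hide values under MCAR, impute them, score with RMSE) can be illustrated with a minimal sketch. This is not the authors' code: the paper works in R and compares three algorithms, whereas the sketch below uses Python, synthetic data, and mean imputation only as a simple stand-in. The function names `mask_mcar`, `impute_mean`, and `rmse_on_missing`, the 10% missing rate, and the generated columns are all assumptions made for illustration.

```python
import math
import random


def mask_mcar(rows, rate, rng):
    """Return a copy of rows where each cell is set to None with probability `rate`."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def impute_mean(rows):
    """Replace each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Fall back to 0.0 if a column happens to be entirely missing.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def rmse_on_missing(truth, masked, imputed):
    """RMSE between true and imputed values, computed over the masked cells only."""
    sq_errors = [
        (t - p) ** 2
        for t_row, m_row, p_row in zip(truth, masked, imputed)
        for t, m, p in zip(t_row, m_row, p_row)
        if m is None
    ]
    return math.sqrt(sum(sq_errors) / len(sq_errors)) if sq_errors else 0.0


rng = random.Random(0)
# Synthetic complete data with two columns (hypothetical, not the paper's dataset).
truth = [[rng.gauss(20.0, 2.0), rng.gauss(50.0, 5.0)] for _ in range(200)]
masked = mask_mcar(truth, rate=0.1, rng=rng)  # assumed 10% missing rate
imputed = impute_mean(masked)
score = rmse_on_missing(truth, masked, imputed)
print(f"RMSE of mean imputation on masked cells: {score:.3f}")
```

Because the masking is independent of the data values, the hidden cells are missing completely at random, and the RMSE over those cells gives a like-for-like score that any other imputation method could be compared against.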

Notes

1. http://www.spotseven.de/gecco/gecco-challenge/gecco-challenge-2015/

Author information

Authors and Affiliations

Frankfurt University of Applied Sciences, Frankfurt Am Main, Germany
Fitore Muharemi & Doina Logofătu
Technical University of Iaşi, Iaşi, Romania
Florin Leon

Corresponding author

Correspondence to Fitore Muharemi.

Editor information

Editors and Affiliations

Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam
Ngoc Thanh Nguyen
Department of Computer Science and Creative Technologies, University of the West of England, Bristol, United Kingdom
Elias Pimenidis
Department of Computer Science and Creative Technologies, University of the West of England, Bristol, United Kingdom
Zaheer Khan
Faculty of Computer Science and Management, Wrocław University of Science and Technology, Wrocław, Poland
Bogdan Trawiński

About this paper

Cite this paper

Muharemi, F., Logofătu, D., Leon, F. (2018). Review on General Techniques and Packages for Data Imputation in R on a Real World Dataset. In: Nguyen, N., Pimenidis, E., Khan, Z., Trawiński, B. (eds) Computational Collective Intelligence. ICCCI 2018. Lecture Notes in Computer Science, vol 11056. Springer, Cham. https://doi.org/10.1007/978-3-319-98446-9_36

DOI: https://doi.org/10.1007/978-3-319-98446-9_36
Published: 08 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98445-2
Online ISBN: 978-3-319-98446-9
eBook Packages: Computer Science (R0)
