Multiple Imputation Inference for Missing Values in Distributed Datasets Using Apache Spark

Kaliamoorthy, Sathish; Bhanu, S. Mary Saira

doi:10.1007/978-981-13-1813-9_3


Part of the book series: Communications in Computer and Information Science (CCIS, volume 906)

Included in the following conference series:

International Conference on Advances in Computing and Data Sciences


Abstract

Big data refers to large volumes of data, both structured and unstructured. Because a single machine cannot hold such volumes, big data is partitioned into smaller chunks and distributed across multiple machines for quick and efficient analysis. However, these datasets are generally incomplete because they contain many missing values. Missing values are a serious impediment to data analysis, and multiple imputation is a preferred method for handling them. Existing multiple imputation implementations in statistical software packages are all based on in-memory processing and are unsuitable when the data is distributed, so a method for multiple imputation over distributed data is needed. The goal of this work is to implement a multiple imputation algorithm for missing values using fuzzy clustering on a distributed computing system built with Apache Spark. The results show that, in a distributed computing system, the multiple imputation algorithm outperforms traditional imputation techniques in terms of imputation accuracy.
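
The abstract does not spell out the algorithm, so the following is only a minimal sketch of one common way fuzzy clustering can drive imputation, not the authors' implementation. It assumes cluster centroids are already available, computes fuzzy c-means memberships from the observed features of a row, and fills each missing feature with the membership-weighted average of the centroids. Producing several imputations by varying the fuzzifier `m` is an assumption made for illustration, as are all function names (`memberships`, `impute`, `multiple_impute`).

```python
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]  # None marks a missing value


def memberships(row: Row, centroids: Sequence[Sequence[float]], m: float = 2.0) -> List[float]:
    """Fuzzy c-means memberships, using only the observed features of the row."""
    observed = [j for j, v in enumerate(row) if v is not None]
    # Squared partial distance to each centroid over the observed features.
    d2 = [sum((row[j] - c[j]) ** 2 for j in observed) for c in centroids]
    # A row lying exactly on one or more centroids is shared equally among them.
    if any(d == 0.0 for d in d2):
        hits = [1.0 if d == 0.0 else 0.0 for d in d2]
        total = sum(hits)
        return [h / total for h in hits]
    exponent = 1.0 / (m - 1.0)
    inv = [(1.0 / d) ** exponent for d in d2]
    total = sum(inv)
    return [w / total for w in inv]


def impute(row: Row, centroids: Sequence[Sequence[float]], m: float = 2.0) -> List[float]:
    """Fill each missing feature with the membership-weighted centroid value."""
    u = memberships(row, centroids, m)
    return [
        v if v is not None else sum(ui * c[j] for ui, c in zip(u, centroids))
        for j, v in enumerate(row)
    ]


def multiple_impute(row: Row, centroids: Sequence[Sequence[float]],
                    fuzzifiers: Sequence[float] = (1.5, 2.0, 3.0)) -> List[List[float]]:
    """Hypothetical multiple-imputation step: one completed row per fuzzifier value."""
    return [impute(row, centroids, m) for m in fuzzifiers]


# Example centroids and an incomplete row (illustrative values only).
centroids = [[0.0, 0.0], [10.0, 10.0]]
completed = multiple_impute([2.0, None], centroids)
```

Because `impute` depends only on one row and the shared centroids, it could in principle be applied row by row across the partitions of a distributed dataset once the centroids have been shared with the workers; that wiring is omitted here.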



Author information

Authors and Affiliations

The Department of Computer Science and Engineering, National Institute of Technology, Tiruchirappalli, 620015, Tamil Nadu, India
Sathish Kaliamoorthy & S. Mary Saira Bhanu


Corresponding author

Correspondence to Sathish Kaliamoorthy.

Editor information

Editors and Affiliations

University of KwaZulu-Natal, Durban, South Africa
Mayank Singh
Jaypee University of Information Technology, Solan, India
P. K. Gupta
Jaypee University of Engineering and Technology, Guna, Madhya Pradesh, India
Vipin Tyagi
Institute of Information Theory and Automation, Prague 8, Czech Republic
Jan Flusser
University of Ottawa, Ottawa, Canada
Tuncer Ören

About this paper

Cite this paper

Kaliamoorthy, S., Bhanu, S.M.S. (2018). Multiple Imputation Inference for Missing Values in Distributed Datasets Using Apache Spark. In: Singh, M., Gupta, P., Tyagi, V., Flusser, J., Ören, T. (eds) Advances in Computing and Data Sciences. ICACDS 2018. Communications in Computer and Information Science, vol 906. Springer, Singapore. https://doi.org/10.1007/978-981-13-1813-9_3

DOI: https://doi.org/10.1007/978-981-13-1813-9_3
Published: 26 October 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1812-2
Online ISBN: 978-981-13-1813-9
eBook Packages: Computer Science, Computer Science (R0)
