Data Pre-processing to Apply Multiple Imputation Techniques: A Case Study on Real-World Census Data

Ruiz-Chavez, Zoila; Salvador-Meneses, Jaime; Garcia-Rodriguez, Jose; Tallón-Ballesteros, Antonio J.

doi:10.1007/978-3-030-03496-2_32

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11315)

Included in the following conference series:

International Conference on Intelligent Data Engineering and Automated Learning

Abstract

Improving accuracy and reducing computational cost are the main goals of machine learning techniques, but both depend heavily on the test data used. This is even more so for real-world data such as censuses, surveys or tokens, which contain a high level of missing values. Missing data and the presence of outliers are problems that must be treated carefully before any data analysis process. This work presents an overview of data pre-processing and the steps to follow before processing large volumes of high-dimensionality data with categorical variables. As part of the dimensionality reduction process, when one or more variables contain a high level of missing values, we use the Pairwise and Listwise Deletion methods. The generation of m clusters using the Kohonen Self-Organizing Maps (SOM) algorithm with H2O over R is also considered as a way of dividing the data into similar groups, which are then used as the clusters on which Multiple Imputation algorithms are applied, creating m different values to impute each missing value.
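The abstract names three steps without showing them, so the following is a minimal illustrative sketch of listwise deletion, pairwise deletion, and drawing m candidate values per missing entry within a cluster. It is not the authors' implementation, which uses SOM with H2O over R. The toy records, the variable names, the cluster labels and the helper functions are all hypothetical, and the random within-cluster draw is a simple stand-in for a real Multiple Imputation algorithm.

```python
import random

# Hypothetical toy records; None marks a missing categorical value.
records = [
    {"sex": "F", "area": "urban", "educ": "primary"},
    {"sex": "M", "area": None, "educ": "secondary"},
    {"sex": "F", "area": "rural", "educ": None},
    {"sex": None, "area": "urban", "educ": "primary"},
    {"sex": "M", "area": "rural", "educ": "secondary"},
]


def listwise_delete(rows):
    """Listwise deletion: keep only rows with no missing value at all."""
    return [r for r in rows if all(v is not None for v in r.values())]


def pairwise_rows(rows, var_a, var_b):
    """Pairwise deletion: keep rows where both analysed variables are observed."""
    return [r for r in rows if r[var_a] is not None and r[var_b] is not None]


def impute_m_times(rows, cluster_ids, var, m, seed=0):
    """For each row missing `var`, draw m candidate values from the
    observed values of `var` among rows in the same cluster."""
    rng = random.Random(seed)
    donors = {}
    for row, cluster in zip(rows, cluster_ids):
        if row[var] is not None:
            donors.setdefault(cluster, []).append(row[var])
    draws = {}
    for index, (row, cluster) in enumerate(zip(rows, cluster_ids)):
        if row[var] is None and donors.get(cluster):
            draws[index] = [rng.choice(donors[cluster]) for _ in range(m)]
    return draws


complete = listwise_delete(records)
sex_educ = pairwise_rows(records, "sex", "educ")

# Assumed cluster labels; in the paper these would come from the SOM step.
clusters = [0, 0, 1, 0, 1]
area_draws = impute_m_times(records, clusters, "area", m=3)

print(len(complete), len(sex_educ), area_draws)
```

In this toy data, listwise deletion discards every record with any gap, while pairwise deletion keeps a record for the sex/educ analysis even if its area value is missing, which is why the second result is larger.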

Author information

Authors and Affiliations

Universidad Central del Ecuador, Ciudadela Universitaria, Quito, Ecuador
Zoila Ruiz-Chavez & Jaime Salvador-Meneses
Universidad de Alicante, Ap. 99., 03080, Alicante, Spain
Jose Garcia-Rodriguez
University of Seville, Seville, Spain
Antonio J. Tallón-Ballesteros

Corresponding author

Correspondence to Zoila Ruiz-Chavez.

Editor information

Editors and Affiliations

University of Manchester, Manchester, UK
Hujun Yin
Rm 209, Building B, Autonomous University of Madrid, Madrid, Spain
David Camacho
Campus of Gualtar, University of Minho, Braga, Portugal
Paulo Novais
University of Seville, Seville, Spain
Antonio J. Tallón-Ballesteros

About this paper

Cite this paper

Ruiz-Chavez, Z., Salvador-Meneses, J., Garcia-Rodriguez, J., Tallón-Ballesteros, A.J. (2018). Data Pre-processing to Apply Multiple Imputation Techniques: A Case Study on Real-World Census Data. In: Yin, H., Camacho, D., Novais, P., Tallón-Ballesteros, A. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2018. Lecture Notes in Computer Science, vol 11315. Springer, Cham. https://doi.org/10.1007/978-3-030-03496-2_32

DOI: https://doi.org/10.1007/978-3-030-03496-2_32
Published: 09 November 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-03495-5
Online ISBN: 978-3-030-03496-2
eBook Packages: Computer Science (R0)
