Abstract
Improving accuracy and reducing computational cost are the main goals of machine learning techniques, but both depend heavily on the test data used. This is even more true of real-world data such as censuses, surveys or tokens, which contain a high level of missing values. Missing data and outliers are problems that must be treated carefully before any data analysis. This work presents an overview of data pre-processing and describes the steps to follow before processing large volumes of high-dimensional data with categorical variables. As part of the dimensionality reduction process, when one or more variables have a high level of missing values, we use the Pairwise and Listwise Deletion methods. We also generate m clusters using the Kohonen Self-Organizing Maps (SOM) algorithm with H2O over R, dividing the data into groups of similar records. Multiple Imputation algorithms are then applied within each cluster, creating m different values to impute each missing value.
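The deletion and cluster-wise imputation steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the paper uses H2O over R, whereas this uses plain Python. The toy records, the function names, and the cluster labels are all hypothetical, and the labels are simply assumed to come from a clustering step such as a SOM. The imputation step is a simple random draw from observed values in the same cluster, used here as a stand-in for a full Multiple Imputation algorithm.

```python
import random

# Hypothetical toy records with categorical variables; None marks a missing value.
records = [
    {"sex": "F", "area": "urban", "educ": "primary"},
    {"sex": "M", "area": None, "educ": "secondary"},
    {"sex": "F", "area": "rural", "educ": None},
    {"sex": "M", "area": "urban", "educ": "secondary"},
    {"sex": None, "area": "rural", "educ": "primary"},
]


def listwise_delete(rows):
    """Listwise deletion: keep only rows with no missing value in any variable."""
    return [r for r in rows if all(v is not None for v in r.values())]


def pairwise_rows(rows, var_a, var_b):
    """Pairwise deletion: keep rows where both variables of the analysed pair are observed."""
    return [r for r in rows if r[var_a] is not None and r[var_b] is not None]


def draw_imputations(rows, clusters, target, var, m, seed=0):
    """Draw m candidate values for rows[target][var] from donors in the same cluster.

    This is a simple within-cluster random draw, not a full
    Multiple Imputation model; it only illustrates the idea of m candidates.
    """
    rng = random.Random(seed)
    donors = [
        r[var]
        for r, c in zip(rows, clusters)
        if c == clusters[target] and r[var] is not None
    ]
    if not donors:
        return []  # no observed value in this cluster to draw from
    return [rng.choice(donors) for _ in range(m)]


complete = listwise_delete(records)              # rows 0 and 3 survive
sex_area = pairwise_rows(records, "sex", "area")  # rows 0, 2 and 3 survive

# Assumed cluster labels, one per record (in the paper these would come from the SOM).
clusters = [0, 0, 1, 0, 1]
candidates = draw_imputations(records, clusters, target=1, var="area", m=3)

print(len(complete), len(sex_area), candidates)
```

Listwise deletion discards every record with any missing value, while pairwise deletion keeps a record for each analysis in which the variables involved are observed, so it usually retains more data. Restricting donors to the same cluster means the candidate values come from records that are similar to the incomplete one.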
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Ruiz-Chavez, Z., Salvador-Meneses, J., Garcia-Rodriguez, J., Tallón-Ballesteros, A.J. (2018). Data Pre-processing to Apply Multiple Imputation Techniques: A Case Study on Real-World Census Data. In: Yin, H., Camacho, D., Novais, P., Tallón-Ballesteros, A. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2018. Lecture Notes in Computer Science, vol 11315. Springer, Cham. https://doi.org/10.1007/978-3-030-03496-2_32
DOI: https://doi.org/10.1007/978-3-030-03496-2_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-03495-5
Online ISBN: 978-3-030-03496-2
eBook Packages: Computer Science (R0)