Abstract
Structured Data Bases which include both numerical and categorical attributes (Mixed Databases or MD) ought to be adequately pre-processed so that machine learning algorithms may be applied to their analysis and further processing. Of primordial importance is that the instances of all the categorical attributes be encoded so that the patterns embedded in the MD be preserved. We discuss CESAMO, an algorithm that achieves this by statistically sampling the space of possible codes. CESAMO’s implementation requires the determination of the moment when the codes distribute normally. It also requires the approximation of an encoded attribute as a function of other attributes such that the best code assignment may be identified. The MD’s categorical attributes are thusly mapped into purely numerical ones. The resulting numerical database (ND) is then accessible to supervised and non-supervised learning algorithms. We discuss CESAMO, normality assessment and functional approximation. A case study of the US census database is described. Data is made strictly numerical using CESAMO. Neural Networks and Self-Organized Maps are then applied. Our results are compared to classical analysis. We show that CESAMO’s application yields better results.
Keywords
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsNotes
- 1.
Artificial Neural Networks
- 2.
Genetic Algorithms
- 3.
For which see Sect. 4.
- 4.
Several possible forms for the approximant are possible. We selected a polynomial form due to the well known Weierstrass approximation theorem, which states that every continuous function defined on a closed interval [a, b] can be uniformly approximated as closely as desired by a polynomial function.
- 5.
One of the requirements for the generality of the algorithm is that the characteristics of the data are not known in advance. Assuming them to be experimental underlines this fact.
- 6.
(N) denotes a numerical attribute; (C) denotes a categorical attribute.
References
Goebel, M., Gruenwald, L.: A survey of data mining and knowledge discovery software tools. ACM SIGKDD Explor. Newsl. 1(1), 20–33 (1999)
Sokal, R.R.: The principles of numerical taxonomy: twenty-five years later. Comput.-Assist. Bacterial Syst. 15, 1 (1985)
Barbará, D., Li, Y., Couto, J.: COOLCAT: an entropy-based algorithm for categorical clustering. In: Proceedings of the Eleventh International Conference on Information and Knowledge Management, pp. 582–589. ACM (2002)
Kuri-Morales, A.F.: Categorical encoding with neural networks and genetic algorithms. In: Zhuang, X., Guarnaccia, C. (eds.) WSEAS Proceedings of the 6th International Conference on Applied Informatics and. Computing Theory, pp. 167–175, 01 Jul 2015. ISBN 9781618043139, ISSN 1790-5109
Kuri-Morales, A., Sagastuy-Breña, J.: A parallel genetic algorithm for pattern recognition in mixed databases. In: Carrasco-Ochoa, J.A., Martínez-Trinidad, J.F., Olvera-López, J.A. (eds.) MCPR 2017. LNCS, vol. 10267, pp. 13–21. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-59226-8_2
Kuri-Morales, A.: Pattern discovery in mixed data bases. In: Martínez-Trinidad, J.F., Carrasco-Ochoa, J.A., Olvera-López, J.A., Sarkar, S. (eds.) MCPR 2018. LNCS, vol. 10880, pp. 178–188. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-92198-3_18
Deb, K., Agrawal, S., Pratap, A., Meyarivan, T.: A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II. In: Schoenauer, M., et al. (eds.) PPSN 2000. LNCS, vol. 1917, pp. 849–858. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-45356-3_83
Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Sig. Syst. 2(4), 303–314 (1989)
Rudolph, G.: Convergence analysis of canonical genetic algorithms. IEEE Trans. Neural Netw. 5(1), 96–101 (1994)
Kuri-Morales, A.F., Aldana-Bobadilla, E., López-Peña, I.: The best genetic algorithm II. In: Castro, F., Gelbukh, A., González, M. (eds.) MICAI 2013. LNCS (LNAI), vol. 8266, pp. 16–29. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-45111-9_2
Widrow, B., Lehr, M.A.: 30 years of adaptive neural networks: perceptron, madaline, and backpropagation. Proc. IEEE 78(9), 1415–1442 (1990)
Lopez-Peña, I., Kuri-Morales, A.: Multivariate approximation methods using polynomial models: a comparative study. In: 2015 Fourteenth Mexican International Conference on Artificial Intelligence (MICAI). IEEE (2015)
Kuri-Morales, A., Cartas-Ayala, A.: Polynomial multivariate approximation with genetic algorithms. In: Sokolova, M., van Beek, P. (eds.) AI 2014. LNCS (LNAI), vol. 8436, pp. 307–312. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-06483-3_30
Kuri-Morales, A.F., López-Peña, I.: Normality from monte carlo simulation for statistical validation of computer intensive algorithms. In: Pichardo-Lagunas, O., Miranda-Jiménez, S. (eds.) MICAI 2016. LNCS (LNAI), vol. 10062, pp. 3–14. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-62428-0_1
Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall PTR, Upper Saddle River (1994)
Kwon, S.H.: Cluster validity index for fuzzy clustering. Electron. Lett. 34(22), 2176–2177 (1998)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Kuri-Morales, A. (2018). Transforming Mixed Data Bases for Machine Learning: A Case Study. In: Batyrshin, I., Martínez-Villaseñor, M., Ponce Espinosa, H. (eds) Advances in Soft Computing. MICAI 2018. Lecture Notes in Computer Science(), vol 11288. Springer, Cham. https://doi.org/10.1007/978-3-030-04491-6_12
Download citation
DOI: https://doi.org/10.1007/978-3-030-04491-6_12
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-04490-9
Online ISBN: 978-3-030-04491-6
eBook Packages: Computer ScienceComputer Science (R0)