Transforming Mixed Data Bases for Machine Learning: A Case Study

  • Angel Kuri-Morales
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11288)


Structured databases that include both numerical and categorical attributes (Mixed Databases, or MD) must be adequately pre-processed before machine learning algorithms can be applied to their analysis and further processing. It is of prime importance that the instances of all the categorical attributes be encoded so that the patterns embedded in the MD are preserved. We discuss CESAMO, an algorithm that achieves this by statistically sampling the space of possible codes. CESAMO’s implementation requires determining the moment at which the codes distribute normally. It also requires approximating an encoded attribute as a function of other attributes, so that the best code assignment can be identified. The MD’s categorical attributes are thus mapped into purely numerical ones, and the resulting numerical database (ND) is then accessible to supervised and non-supervised learning algorithms. We cover the algorithm itself, normality assessment and functional approximation. A case study of the US census database is described: the data is made strictly numerical using CESAMO, and Neural Networks and Self-Organized Maps are then applied. Our results are compared to classical analysis, and we show that CESAMO’s application yields better results.
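The idea of sampling the space of codes and scoring each candidate by how well the encoded attribute can be approximated from other attributes can be illustrated with a small sketch. This is not CESAMO itself: the function names, the single numeric predictor, the straight-line fit, the 1 - r² error measure, and the fixed sample budget (used here in place of the normality-based stopping rule mentioned above) are all assumptions made for illustration.

```python
import random


def fit_error(x, y):
    """Return 1 - r^2 for the least-squares line y ~ a + b*x.

    A scale-free error is used (an assumption of this sketch) so that
    codes are not rewarded merely for lying close together.
    """
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if sxx == 0.0 or syy == 0.0:
        return float("inf")  # degenerate candidate: nothing to fit
    return 1.0 - (sxy * sxy) / (sxx * syy)


def sample_codes(categories, numeric, n_samples=500, seed=0):
    """Draw random codes in [0, 1) per category level; keep the best-scoring set."""
    rng = random.Random(seed)
    levels = sorted(set(categories))
    best_codes, best_err = None, float("inf")
    for _ in range(n_samples):  # fixed budget: hypothetical stopping rule
        codes = {level: rng.random() for level in levels}
        encoded = [codes[c] for c in categories]
        err = fit_error(numeric, encoded)
        if err < best_err:
            best_codes, best_err = codes, err
    return best_codes, best_err


# Toy data (invented for this sketch): one categorical and one numeric attribute.
categories = ["a", "b", "c", "a", "b", "c", "a", "c"]
numeric = [1.0, 2.1, 2.9, 1.2, 1.9, 3.1, 0.9, 3.0]

codes, err = sample_codes(categories, numeric)
encoded = [codes[c] for c in categories]  # the now purely numerical column
```

The chosen codes replace the categorical column, so the toy table becomes fully numerical and could be passed to any numeric learner.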


Keywords: Machine Learning, Mixed Databases, Non-linear regression, Goodness-of-fit



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Instituto Tecnológico Autónomo de México, D.F., Mexico
