Full Model Selection in Huge Datasets and for Proxy Models Construction

  • Angel Díaz-Pacheco
  • Carlos Alberto Reyes-García
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11288)


Full Model Selection is a technique for improving the accuracy of machine learning algorithms by searching, for each dataset, for the most adequate combination of feature selection, data preparation, learning algorithm, and hyper-parameter settings. With the ever larger quantities of information generated in the world, the emergence of the paradigm known as Big Data has made it possible to analyze gigantic datasets in order to obtain information useful for science and business. Although Full Model Selection is a powerful tool, it has been little explored in the Big Data context because of the vast search space and the high number of fitness evaluations required for candidate models. To overcome this obstacle, we propose the use of proxy models to reduce the number of expensive fitness-function evaluations, as well as the use of the Full Model Selection paradigm itself in the construction of such proxy models.
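The core idea described above can be illustrated with a minimal sketch: a cheap proxy (here simulated as a noisy evaluation on a small subsample) pre-ranks candidate configurations, and the expensive full-dataset fitness is spent only on the proxy's top picks. All names and the synthetic fitness landscape below are illustrative assumptions, not the paper's actual method or data.

```python
import random

random.seed(0)

# Toy search space: each candidate "model" is a single hyper-parameter
# value in [0, 1]; its true quality peaks at x = 0.3 (assumed landscape).
expensive_calls = 0

def true_quality(x):
    return 1.0 - (x - 0.3) ** 2

def expensive_fitness(x):
    """Fitness evaluated on the full dataset: accurate but costly."""
    global expensive_calls
    expensive_calls += 1
    return true_quality(x)

def proxy_fitness(x):
    """Proxy fitness on a small subsample: cheap but slightly noisy."""
    return true_quality(x) + random.gauss(0, 0.005)

# 1. Score every candidate with the cheap proxy only.
candidates = [i / 50 for i in range(51)]          # 51 configurations
ranked = sorted(candidates, key=proxy_fitness, reverse=True)

# 2. Spend expensive evaluations only on the proxy's top few.
TOP_K = 5
best = max(ranked[:TOP_K], key=expensive_fitness)

print(f"best hyper-parameter: {best:.2f}")
print(f"expensive evaluations: {expensive_calls} of {len(candidates)}")
```

With this setup only 5 of 51 candidates ever touch the expensive fitness function; the proxy absorbs the rest of the search cost, which is the trade-off the abstract argues for.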


Big Data · Model Selection · Machine learning



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Angel Díaz-Pacheco (1)
  • Carlos Alberto Reyes-García (1)
  1. Computer Science Department, Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE), San Andrés Cholula, Mexico
