Full Model Selection in Big Data

  • Angel Díaz-Pacheco
  • Jesús A. Gonzalez-Bernal
  • Carlos Alberto Reyes-García
  • Hugo Jair Escalante-Balderas
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10632)

Abstract

The increasingly large quantities of information generated in the world over the last few years have led to the emergence of the paradigm known as Big Data. Analyzing these vast quantities of data has become an important task in science and business in order to turn that information into a valuable asset. Many data analysis tasks involve the use of machine learning techniques during the model creation step. The goal of these predictive models is to achieve the highest possible accuracy on new samples, and for this reason there is great interest in selecting the most suitable algorithm for a specific dataset. This task is known as model selection; it has been widely studied on datasets of common size, but it remains poorly explored in the Big Data context. As an effort in this direction, this work proposes an algorithm for model selection in Big Data.
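
To make the idea concrete, the following is a minimal sketch of model selection over a distributed dataset, assuming Spark MLlib as the learning backend and using LogisticRegression and RandomForestClassifier as illustrative candidate algorithms on a small synthetic DataFrame. It shows the general pattern of searching jointly over model types and their hyperparameters by cross-validated accuracy; it is not the algorithm proposed in this paper.

```python
# Sketch: full model selection = choosing both the algorithm and its
# hyperparameters, here by cross-validated accuracy on a Spark DataFrame.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("full-model-selection-sketch").getOrCreate()

# Toy data standing in for a large distributed DataFrame (label, x1, x2).
df = spark.createDataFrame(
    [(0.0, 1.2, 0.7), (1.0, -0.3, 2.1), (0.0, 0.8, 0.2), (1.0, -1.1, 1.5)] * 25,
    ["label", "x1", "x2"],
)
df = VectorAssembler(inputCols=["x1", "x2"], outputCol="features").transform(df)

evaluator = MulticlassClassificationEvaluator(metricName="accuracy")

lr = LogisticRegression(featuresCol="features", labelCol="label")
rf = RandomForestClassifier(featuresCol="features", labelCol="label")

# One (estimator, hyperparameter grid) pair per candidate model type.
candidates = [
    (lr, ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1, 1.0]).build()),
    (rf, ParamGridBuilder().addGrid(rf.numTrees, [10, 50]).build()),
]

best_model, best_score = None, float("-inf")
for estimator, grid in candidates:
    cv = CrossValidator(estimator=estimator, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=3)
    fitted = cv.fit(df)
    score = max(fitted.avgMetrics)  # best cross-validated accuracy for this type
    if score > best_score:
        best_model, best_score = fitted.bestModel, score

print(type(best_model).__name__, best_score)
spark.stop()
```

The outer loop over candidate model types, on top of each type's own hyperparameter grid, is what distinguishes full model selection from plain hyperparameter tuning of a single fixed algorithm.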

Keywords

Big Data · Model selection · Machine learning

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Computer Science Department, Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE), Puebla, Mexico
