Difficult first strategy GP: an inexpensive sampling technique to improve the performance of genetic programming

Abstract

Genetic programming (GP) is a top performer in solving classification and clustering problems in general, and symbolic regression problems in particular. GP has produced impressive results and has outperformed human-generated results on 76 problems drawn from 22 different fields. Despite these results, a number of significant open issues remain, among them high computational cost, premature convergence and high error rates. These issues must be addressed for GP to realise its full potential. In this paper a simple and cost-effective technique called Difficult First Strategy GP (DFS-GP) is proposed to address these problems. The technique comprises a pre-processing step and a sampling step. In the pre-processing step, data points in the given data set that are difficult for GP to evolve are marked; in the sampling step, they are introduced into the evolutionary run through two newly defined sampling techniques, called difficult points first and difficulty proportionate selection. Both techniques are biased towards selecting difficult data points in the initial stage of a run and easy points in the later stage, which ensures that GP does not ignore difficult-to-evolve data points during a run. Experiments show that GP coupled with DFS avoids premature convergence and attains higher fitness than standard GP for the same number of fitness evaluations. Performance of the proposed technique was evaluated on three commonly used metrics: convergence speed, fitness and variance in the best results. The proposed setups achieved 10–15% better fitness values than standard GP, consistently generated better quality solutions on all problems, and used 30–50% fewer computations to match the best performance of standard GP.
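The abstract does not specify the exact weighting schedule used by difficulty proportionate selection, but the idea of biasing sampling towards difficult points early in a run and easy points later can be sketched roughly as follows. All names and the linear bias schedule here are assumptions for illustration, not the authors' implementation; `difficulty` is taken to be a pre-computed per-point score in [0, 1] from the pre-processing step.

```python
import random

def difficulty_proportionate_sample(points, difficulty, gen, max_gen, k):
    """Sample k fitness cases, favouring difficult points early in the
    run and easy points later (illustrative linear schedule only)."""
    # Bias decays linearly from 1 (favour difficult) to 0 (favour easy).
    bias = 1.0 - gen / max_gen
    # Each point's weight interpolates between its difficulty score
    # (early generations) and its "easiness" 1 - difficulty (late ones).
    weights = [bias * d + (1.0 - bias) * (1.0 - d) for d in difficulty]
    return random.choices(points, weights=weights, k=k)
```

Under this sketch a maximally difficult point (score 1.0) dominates the sample at generation 0 and is never drawn at the final generation, matching the early-difficult, late-easy bias the abstract describes.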


[Figs. 1–9 appear in the full text of the article.]


Author information


Corresponding author

Correspondence to Hammad Majeed.



About this article


Cite this article

Ali, M.Q., Majeed, H. Difficult first strategy GP: an inexpensive sampling technique to improve the performance of genetic programming. Evol. Intel. 13, 537–549 (2020). https://doi.org/10.1007/s12065-020-00355-2


Keywords

  • Difficult first strategy
  • Genetic programming
  • Pre-processing
  • Sampling techniques
  • Dataset sampling
  • Machine learning