Software review: the GPTIPS platform
GPTIPS is a widely used genetic programming platform developed in Matlab. The most recent version, GPTIPS 2.0, provides symbolic multi-gene regression for data analysis in addition to traditional evolutionary algorithms. We briefly explain the GPTIPS methodology, describe its main features, strengths and weaknesses, and give examples of GPTIPS applications.
Keywords: GP, MGGP, SMGR
In Sect. 2 we introduce GPTIPS and explain its main features, including symbolic multi-gene regression (SMGR) and rate-based crossover. The major strengths (Sect. 3) and weaknesses (Sect. 4) of GPTIPS are discussed in depth. We include an example of a GPTIPS input (configuration) file and output file with instructions, plus a few examples of GPTIPS solving real-world problems from different fields in science and industry.
2 Major features of GPTIPS
3 GPTIPS strengths
GPTIPS is a freely available GP platform. Most GPTIPS features operate automatically, but the user needs a basic understanding of Matlab programming and must set configuration parameters such as the maximum number of trees in each solution, the population size and the maximum number of generations.
GPTIPS provides three tournament selection options: goodness of fit (RMS error), solution complexity (size), and Pareto front. Complexity can be measured either by the number of nodes per tree or by expressional complexity [1, 3]. The user can set the probability with which each tournament selection method is used.
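The idea of choosing a tournament type with a user-set probability can be illustrated with a minimal Python sketch. It is not GPTIPS code: the function name `tournament_select`, the parameter `p_pareto` and the `(fitness, complexity)` tuple layout are assumptions made for illustration, and the sketch simplifies the three options above to two (a fitness tournament with complexity as tie-breaker, and a Pareto tournament).

```python
import random


def tournament_select(population, size, p_pareto, rng):
    """Pick one individual from a random tournament.

    population: list of (fitness, complexity) tuples; lower is better for both.
    With probability p_pareto the winner is drawn at random from the
    tournament's non-dominated members. Otherwise the member with the lowest
    fitness wins, with ties broken in favour of lower complexity.
    """
    entrants = rng.sample(population, size)
    if rng.random() < p_pareto:
        # b dominates a if b is no worse on both criteria and differs from a.
        front = [a for a in entrants
                 if not any(b[0] <= a[0] and b[1] <= a[1] and b != a
                            for b in entrants)]
        return rng.choice(front)
    return min(entrants)  # tuple ordering: fitness first, then complexity


# Invented example population: (RMS error, number of nodes).
population = [(0.9, 3), (0.5, 10), (0.4, 30), (0.6, 12)]
rng = random.Random(0)

fitness_winner = tournament_select(population, size=4, p_pareto=0.0, rng=rng)
pareto_winner = tournament_select(population, size=4, p_pareto=1.0, rng=rng)
print(fitness_winner, pareto_winner)
```

With `p_pareto=0.0` the most accurate entrant always wins, whereas with `p_pareto=1.0` simpler but less accurate models also have a chance of being selected, which is one way to keep compact solutions in the population.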
GPTIPS produces an HTML report that contains comprehensive data on configuration and run parameters as well as the results of the analysis. The GPTIPS report contains a statistical analysis of the solutions on training, validation and test datasets. The GPTIPS report also provides details of models that lie on the Pareto front (using the training dataset), along with a plot of the Pareto front.
The Pareto curve plots expressional complexity against the goodness of fit (R²) for the models that are not dominated by other solutions in terms of both complexity and accuracy. The Pareto curve enables the user to visualize the performance of solutions and select a solution that retains a balance between complexity and accuracy. The final solutions can be saved for future use in Matlab as a symbolic function and exported to C [1, 3].
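The notion of a non-dominated model can be made concrete with a short Python sketch. This is an illustration of the general technique rather than the GPTIPS implementation; the `Model` type, the function names and the numbers are invented for the example.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str          # hypothetical label for an evolved model
    complexity: float  # e.g. expressional complexity (lower is simpler)
    r2: float          # goodness of fit on the training data (higher is better)


def dominates(a: Model, b: Model) -> bool:
    """True if `a` is no worse than `b` on both criteria and better on one."""
    no_worse = a.complexity <= b.complexity and a.r2 >= b.r2
    strictly_better = a.complexity < b.complexity or a.r2 > b.r2
    return no_worse and strictly_better


def pareto_front(models: list[Model]) -> list[Model]:
    """Return the non-dominated models, sorted by increasing complexity."""
    front = [m for m in models if not any(dominates(o, m) for o in models)]
    return sorted(front, key=lambda m: m.complexity)


models = [
    Model("A", complexity=5, r2=0.70),
    Model("B", complexity=12, r2=0.85),
    Model("C", complexity=15, r2=0.80),  # dominated by B
    Model("D", complexity=40, r2=0.93),
    Model("E", complexity=45, r2=0.93),  # dominated by D
]
front = pareto_front(models)
print([m.name for m in front])
```

In this toy population, models A, B and D form the front: each one can only be made more accurate by accepting a more complex model, which is the trade-off the user inspects when picking a final solution.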
To limit run time, the user can define, in addition to the maximum number of generations, a maximum run time and a target fitness score at which GPTIPS will stop. By default GPTIPS uses root mean squared error (RMSE) as the fitness measure, but it also allows users to define their own fitness function. GPTIPS is also compatible with the Matlab Parallel Computing Toolbox, which reduces run time by using multiple processor cores [1, 3].
To improve robustness, GPTIPS uses a novel crossover technique, provides an extensive set of functions that can serve as the internal nodes of the trees from which the evolved models are built, and forces each tree in the initial generation to be unique.
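As a rough illustration of how the three stopping criteria and an RMSE fitness interact, consider the following Python sketch. It does not reproduce GPTIPS internals: the function names, the argument names and the precomputed "fitness trace" that stands in for an actual evolutionary run are all assumptions made for the example.

```python
import math
import time


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def should_stop(generation, elapsed_s, best_fitness,
                max_generations, timeout_s, target_fitness):
    """Return the reason for stopping, or None to continue (lower is better)."""
    if best_fitness <= target_fitness:
        return "target fitness reached"
    if generation >= max_generations:
        return "maximum generations reached"
    if elapsed_s >= timeout_s:
        return "time limit reached"
    return None


def run(fitness_trace, max_generations, timeout_s, target_fitness):
    """Replay a list of per-generation best fitness values until a stop fires."""
    start = time.monotonic()
    best = math.inf
    generation = 0
    for generation, fitness in enumerate(fitness_trace, start=1):
        best = min(best, fitness)
        reason = should_stop(generation, time.monotonic() - start, best,
                             max_generations, timeout_s, target_fitness)
        if reason:
            return generation, best, reason
    return generation, best, "fitness trace exhausted"


trace = [0.9, 0.5, 0.2, 0.05, 0.01]  # invented RMSE values, one per generation
print(run(trace, max_generations=100, timeout_s=60.0, target_fitness=0.1))
print(run(trace, max_generations=2, timeout_s=60.0, target_fitness=0.1))
```

The first call stops as soon as the target fitness is met, while the second stops at the generation limit with a worse model; whichever criterion fires first ends the run.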
Evolutionary algorithms tend to evolve ever larger programs. This phenomenon, called bloat, may cause overfitting to the training dataset: it significantly increases the complexity of solutions while adding little to the accuracy of the model. GPTIPS supports features that suppress bloat. For instance, the user can limit the depth of the trees and include expressional complexity in the fitness of solutions. Since GPTIPS uses multi-gene genetic programming (MGGP), it also tends to generate horizontal bloat, meaning that extra trees are added with little improvement in model performance. To mitigate this, GPTIPS lets the user cap the maximum number of genes per solution and, by monitoring gene performance in the evolving populations at run time, manually delete low-performing genes (i.e. interactive evolution).
As shown in Fig. 3a, GPTIPS tracks the input variables used in each predictive model. For instance, in Fig. 3a, only four of the six input variables are used in the best training model. This information can be used to further analyze the importance of each independent variable by measuring how frequently each variable is used.
GPTIPS has been widely used for solving engineering and science problems. In structural engineering (e.g. bridges and pipelines), studying the geotechnical behavior of structural systems is known to be complex because it depends on several variables, such as soil properties. GPTIPS has been applied in this field to model material and structural problems [4, 5] as well as geotechnical and earthquake engineering problems [6, 7]. Other examples of GPTIPS applications include solving multi-objective management problems, energy forecasting, and biotechnology and bioprocess optimization [10, 11].
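The variable-usage analysis described above can be sketched in a few lines of Python. The sketch assumes, purely for illustration, that models are available as expression strings with inputs named x1, x2, and so on; the function names and the example expressions are invented and are not GPTIPS output.

```python
import re
from collections import Counter


def variables_in(expression: str) -> set[str]:
    """Return the distinct input variables (named x1, x2, ...) in an expression."""
    return set(re.findall(r"\bx\d+\b", expression))


def usage_frequency(expressions: list[str], num_inputs: int) -> dict[str, float]:
    """Fraction of models in which each input variable appears at least once."""
    counts = Counter(v for expr in expressions for v in variables_in(expr))
    n = len(expressions)
    return {f"x{i}": (counts[f"x{i}"] / n if n else 0.0)
            for i in range(1, num_inputs + 1)}


# Invented population of four models over six candidate inputs.
population = [
    "0.5*x1 + tanh(x3*x4) - 1.2",
    "x1*x6 + exp(-x3)",
    "x4 - x1/x3 + 0.1*x6",
    "sin(x1) + x3",
]
freq = usage_frequency(population, num_inputs=6)
unused = [name for name, f in freq.items() if f == 0.0]
print(freq)
print("never used:", unused)
```

Variables that appear in most models are candidates for being important predictors, while variables that never appear (x2 and x5 in this toy population) may be irrelevant. Frequency alone is only a heuristic, since a rarely used variable can still matter in the best model.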
4 GPTIPS weaknesses
GPTIPS is a Matlab-based platform. Although GPTIPS itself is freely available, the user must purchase Matlab. GPTIPS also requires additional Matlab toolboxes, including the Symbolic Math Toolbox and the Statistics Toolbox. Matlab is not always as fast as other programming languages, such as Python. The trees in a solution may be collinear, i.e. not independent of one another, in which case they contribute little additional information when combined. GPTIPS requires the user to define the function set, the maximum number of trees and the population size. Additionally, GPTIPS does not permit seeding the initial population. Finally, it is not simple to access individual parts of the solutions in each population in order to extract or manipulate them.
We would like to thank William B. Langdon for his careful review and valuable comments on the draft of this paper.
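One simple way a user might screen for the collinearity issue mentioned above is to compare the outputs of the individual trees on the training data. The Python sketch below is a diagnostic of our own devising, not a GPTIPS feature: the function names, the 0.99 threshold and the gene outputs are assumptions, and pairwise correlation only detects collinearity between two genes, not dependence among three or more.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return cov / (norm_u * norm_v)


def collinear_pairs(gene_outputs, threshold=0.99):
    """List gene pairs whose outputs have |correlation| >= threshold."""
    names = list(gene_outputs)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(gene_outputs[a], gene_outputs[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


x = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_outputs = {                      # invented outputs of three genes
    "gene1": x,                       # x
    "gene2": [2 * v + 1 for v in x],  # 2x + 1, an affine copy of gene1
    "gene3": [v ** 2 for v in x],     # x^2, correlated but not collinear
}
pairs = collinear_pairs(gene_outputs)
print([(a, b) for a, b, _ in pairs])
```

Here gene2 is an affine function of gene1, so the pair is flagged and one of the two genes could be dropped without losing predictive power, whereas gene3 falls below the threshold and is kept.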
- 1. D.P. Searson, GPTIPS 2: an open-source software platform for symbolic data mining, in Handbook of Genetic Programming Applications (Springer, Berlin, 2015), pp. 551–573. https://doi.org/10.1007/978-3-319-20883-1_22
- 5. H. Bolandi, W. Banzhaf, N. Lajnef, K. Barri, A.H. Alavi, Bond strength prediction of FRP-bar reinforced concrete: a multi-gene genetic programming approach, in Proceedings of Genetic and Evolutionary Computation Conference Companion, pp. 364–364 (2019). https://doi.org/10.1145/3319619.3322066