2.1 Introduction

2.1.1 Diversity and Evolutionary Dynamics

“Evolutionary dynamics” is an often-encountered expression in genetic programming (GP) research. It refers to changes within the population, such as quality and size distribution [15], genotype-phenotype maps and neutral networks [4, 9], diversity [5, 6], modularity and building blocks [11, 25], bloat [18], evolvability [2, 22] or emergent phenomena [3].

The dynamics of the population are shaped by the interplay between selection and the variation operators (crossover, mutation), as well as by specific parameterizations and problem instances. As a biologically-inspired process, GP is able to deal with noisy data, multiple local optima and non-smooth objective functions, while also critically depending on genetic diversity in order to evolve the solution candidates towards the given goal. Population diversity at both the genotypic and phenotypic level remains one of the main focus points for GP research.

In this work we analyze population diversity by looking at the distribution of solution candidates into subsets that belong to the same schema or structural template. We define such templates as rooted trees containing wildcard node symbols in their structure. Additionally, we describe population convergence via schema frequency curves over the evolutionary run.

2.1.2 Genetic Programming Schemata

The study of schema theorems began with John Holland’s work on providing a mathematical justification for the performance of genetic algorithms. The canonical version of a genetic algorithm (Holland [8]) used a fixed-length binary string encoding where each bit took a value from the set {0, 1}. Holland then defined schemata (or schemas) as binary string templates with symbols from the set {0, 1, ∗}, where ∗ represents a wildcard symbol that can be matched by either a 0 or a 1.

The fixed length schemata, each equivalent to a hyperplane in the search space, represented suitable theoretical instruments for the analysis of genetic operators and their effects on the distribution of solution candidates along the hyperplanes, in relation with the average fitness of the population. Holland’s schema theory states that the number of low-order, low defining-length schemata with above average fitness increases exponentially between successive generations, where:

  • A schema’s order is given by the number of fixed positions in the binary string

  • The defining length is given by the distance between the first and last fixed positions in the binary string

  • Schema average fitness is the average fitness of its matching individuals
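The three quantities above can be illustrated with a minimal sketch for fixed-length binary strings. The function names and the use of the ASCII character `*` as the wildcard are assumptions made for this example only.

```python
def matches(schema: str, bits: str) -> bool:
    """True if every fixed position of the schema equals the corresponding bit."""
    return len(schema) == len(bits) and all(
        s in ("*", b) for s, b in zip(schema, bits)
    )


def order(schema: str) -> int:
    """Number of fixed (non-wildcard) positions."""
    return sum(c != "*" for c in schema)


def defining_length(schema: str) -> int:
    """Distance between the first and the last fixed position."""
    fixed = [i for i, c in enumerate(schema) if c != "*"]
    return fixed[-1] - fixed[0] if fixed else 0


def average_fitness(schema, population, fitness):
    """Average fitness of the individuals matching the schema (None if no match)."""
    matched = [fitness(ind) for ind in population if matches(schema, ind)]
    return sum(matched) / len(matched) if matched else None
```

For instance, the schema `1**0*` has two fixed positions (order 2) located at indices 0 and 3 (defining length 3).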

In this context, low-order, low-defining-length schemata are seen as building blocks, structural patterns increasingly sampled by selection and used by the genetic algorithm to assemble better and better solutions.

It was later shown by Poli [13] that Holland’s findings are also valid in the context of GP, with some small provisions: “building blocks in GP and variable-size GAs with one-point crossover exist, but they are not necessarily all short, low-order or highly fit”. Schema theorems for GP are complicated by the variable-length tree encoding, requiring mathematical formulations for the expected schema frequencies to also account for the size variation of individuals under the action of selection, crossover and mutation. Several schema definitions dealing with these issues were proposed in the literature [24].

Despite significant progress in the last couple of decades, leading to exact formulations of the expected number of individuals sampling a schema at the next generation [16, 17], “large gaps remain between GP theory and practice”, due to the large number of schema equations in typical GP populations, and the “large number of terms growing proportionally to the square of the number of program shapes times the square of the number of possible crossover points” [19]. Thus, from a practical perspective, the application of schema theory on concrete algorithms and problem instances remains problematic.

In this work we attempt to close the gap between schemata as theoretical instruments for the analysis of population dynamics and their role in empirical investigations and we introduce a practical methodology to identify GP schemata and compute their frequencies. We consider the hyperschema definition by Poli et al. [12], where a schema is a rooted tree template that may include two types of wildcard symbols:

  • The ‘=’ symbol matches any valid node of the same type and arity

  • The ‘#’ symbol matches any valid subtree

Figure 2.1 shows an example of a hyperschema and matching trees. We notice that the # symbol can match both leaf and function nodes, while the = symbol only matches nodes of the same type (a function node and a leaf node, respectively, for the two occurrences below).

Fig. 2.1 Example hyperschema and matching tree individuals [12]
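The semantics of the two wildcards can be sketched as a short recursive check. This is only an illustration under stated assumptions: trees are encoded as nested tuples `(label, child, ...)`, and "same type and arity" is approximated by comparing the number of children, so a childless ‘=’ matches any leaf and an ‘=’ with two children matches any binary function node.

```python
def match(schema, tree) -> bool:
    """Ordered, position-by-position matching of a hyperschema against a tree."""
    label, *schema_children = schema
    tree_label, *tree_children = tree
    if label == "#":
        return True  # '#' matches any subtree
    if len(schema_children) != len(tree_children):
        return False  # arity (and thus leaf/function type) must agree
    if label != "=" and label != tree_label:
        return False  # '=' accepts any label of the same arity
    return all(match(s, t) for s, t in zip(schema_children, tree_children))


# Hypothetical schema: a '+' root, a binary '=' node over x and any leaf,
# and an arbitrary subtree as the second argument
schema = ("+", ("=", ("x",), ("=",)), ("#",))
```

With this schema, the tree `("+", ("*", ("x",), ("y",)), ("-", ("x",), ("1",)))` is matched, while a tree whose first argument is a leaf is not.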

Originally, the set of wildcards in Poli’s hyperschema was chosen in such a way as to make it easier to evaluate the effects of the genetic operators on schema frequencies and to enable a more concise mathematical representation of the schema equations. Our proposed methodology is not bound by such considerations and supports different schema structures, containing either or both wildcard symbols.

The remainder of the chapter is organized as follows: Sect. 2.2 describes our methodology for schema generation and matching, Sect. 2.3 gives details about our empirical experiments, Sect. 2.4 shows the obtained results and Sect. 2.5 discusses some final conclusions.

2.2 Methodology

We construct relevant schemata using hereditary relationships between crossover parents and their offspring. The schemata may include wildcard symbols from the set {=, #} and are matched against the population of solution candidates using a pattern matching algorithm adapted from the field of XML query matching. Since GP schemata represent a more restricted instance of wildcard query matching, we adapt the algorithm’s implementation with additional constraints. The two steps, schema generation and schema matching, are described in more detail below. The methodology was implemented in HeuristicLab [23].

2.2.1 Schema Generation

Conceptually, we expand on the idea by Stephens and Waelbroeck [20] that “at the level of the microscopic degrees of freedom, the strings, the action of crossover by its very nature introduces the notion of a schema.”

The schema generation algorithm tries to exploit the fact that structural similarity is passed on (to various degrees) from parents to their offspring via the crossover operation. Additionally, it is assumed that successful individuals selected for reproduction will participate as root parents in multiple crossover operations. In these circumstances, we can generate schemata from crossover root parents by considering crossover cutpoints as potential candidates for wildcard placement. We arrive at the following heuristic:

  1. Group individuals based on their common root parent

  2. Identify all genetic fragments and their respective positions in the root parent

  3. For each fragment f with preorder index f_i in the root parent, replace the node at position f_i with a wildcard.

The heuristic is controlled by a minimum schema length parameter which limits wildcard placement in order to avoid the creation of ‘match-all’ schemata (schemata that contain wildcards in the tree root or in its close proximity). The method is listed as pseudocode in Algorithm 2.1.
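The three steps of the heuristic can be sketched as follows. This is a simplified illustration, not a transcription of Algorithm 2.1: trees are nested tuples, only the ‘#’ wildcard is inserted, and the minimum schema length is interpreted (as an assumption) as a lower bound on the node count of the schema after a replacement.

```python
from collections import defaultdict


def size(tree) -> int:
    return 1 + sum(size(child) for child in tree[1:])


def replace_at(tree, index, replacement):
    """Copy of `tree` with the subtree at preorder `index` replaced."""
    if index == 0:
        return replacement
    label, *children = tree
    new_children, offset = [], 1
    for child in children:
        n = size(child)
        if offset <= index < offset + n:
            child = replace_at(child, index - offset, replacement)
        new_children.append(child)
        offset += n
    return (label, *new_children)


def generate_schema(root_parent, cutpoints, min_length=3):
    """Step 3: place a '#' wildcard at each cutpoint of the root parent."""
    schema = root_parent
    # Descending order keeps the smaller preorder indices valid after each edit.
    for index in sorted(set(cutpoints), reverse=True):
        candidate = replace_at(schema, index, ("#",))
        if size(candidate) >= min_length:  # assumed reading of the parameter
            schema = candidate
    return schema


def generate_schemata(crossovers, min_length=3):
    """Steps 1 and 2: group (root_parent, cut_index) records by root parent."""
    groups = defaultdict(set)
    for root_parent, cut_index in crossovers:
        groups[root_parent].add(cut_index)
    return {p: generate_schema(p, cuts, min_length) for p, cuts in groups.items()}
```

A root parent that took part in two crossovers with cutpoints at preorder indices 2 and 4 would yield a schema with two wildcards, whereas a cutpoint at the root (index 0) is skipped because it would produce a match-all schema.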

Since wildcards are inserted at cutpoint locations, the structure of the generated schemata is influenced indirectly by the selection pressure applied on the population, which determines the multiplicity of root parent individuals (how many times each individual participates in crossover as a root parent) and therefore the number of wildcards. Intuitively, the method will generate more general schemata under high selection pressure and more specific schemata (containing fewer wildcards) under lower selection pressure. The algorithm can generate different kinds of schemata (according to the schema definitions in the literature, for an excellent summary see [24]), depending on the kind of wildcard symbols used for replacement.

2.2.2 Schema Matching

The schema matching part of our methodology is based on the algorithm for the tree homeomorphism decision problem by Götz et al. [7], which tries to find a non-injective mapping between every parent-child pair in a query tree Q (the schema) and corresponding ancestor-descendant pairs in data tree D (the matched individual). Such a situation is shown in Fig. 2.2 where, according to the algorithm, the query tree Q is matched by the data tree D. The algorithm runs in O(|D|⋅|Q|⋅ depth(Q)) time using a stack of depth bounded by O(depth(D) ⋅ branch(D)).

Fig. 2.2 Example query matching between query tree Q (left) and data tree D (right) [7]. The algorithm finds a non-injective mapping between every parent-child pair in Q and a corresponding ancestor-descendant pair in D. The answer will be yes (there is a matching) if, starting from the bottom up, the procedure can map the root nodes of the two trees (q_5 and d_6)

We notice that the algorithm in its default implementation does not enforce matching rules strict enough for schema matching, since tree nodes are matched from the bottom up if they have the same label, without additional considerations for their depth in the tree (relative to the root node). Therefore we added rules in our implementation to make sure two nodes are only matched if they are on the same level in the tree and their parent and children nodes are matched as well. Another important detail is the matching of commutative symbols, in which case the algorithm does not consider the order of the child subtrees (internally, a sorting is performed). For example, a schema that places x in one argument position of a + node can be matched by an individual that has x in the other argument position, such as x y + in postfix notation, because the + symbol is commutative.
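The effect of the stricter rules can be illustrated with a small recursive matcher. It is a stand-in for illustration only and not the adapted algorithm of Götz et al.: nodes are compared level by level starting at the root, and commutativity is handled by trying the permutations of the child subtrees rather than by sorting them. The tuple encoding and the set of commutative symbols are assumptions.

```python
from itertools import permutations

COMMUTATIVE = {"+", "*"}  # assumed set of commutative function symbols


def match(schema, tree) -> bool:
    label, *schema_children = schema
    tree_label, *tree_children = tree
    if label == "#":
        return True  # any subtree
    if len(schema_children) != len(tree_children):
        return False  # same arity required
    if label != "=" and label != tree_label:
        return False
    # For commutative symbols the argument order is ignored.
    if tree_label in COMMUTATIVE:
        orders = permutations(tree_children)
    else:
        orders = [tuple(tree_children)]
    return any(
        all(match(s, t) for s, t in zip(schema_children, order))
        for order in orders
    )
```

Under these assumptions the schema `("+", ("#",), ("x",))` is matched by the individual `("+", ("x",), ("y",))`, while the same pair of trees with a `-` root is rejected because subtraction is order-sensitive.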

2.3 Experimental Setup

We compared the evolution of schema frequencies between two algorithmic variants: standard genetic programming (SGP) [10] and genetic programming with strict offspring selection (OSGP) [1]. The difference between the two algorithms consists of an extra selection step enforced by OSGP on the generated offspring, such that offspring get rejected if they do not fulfil certain performance criteria. In effect, the extra selection step concentrates the algorithm’s efforts on generating adaptive changes (that do not decrease fitness), making it possible for less fit individuals to participate as parents if they can produce children fitter than themselves, while high fitness individuals might not contribute if they cannot be improved.
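The acceptance rule of the extra selection step can be sketched as below. The formula with a comparison factor follows the commonly described offspring selection criterion and is an assumption here, since the text only states that offspring must fulfil certain performance criteria. Higher quality is assumed to be better, and a comparison factor of 1 corresponds to the strict variant in which a child must outperform its better parent.

```python
def accept_offspring(child_quality, parent_qualities, comparison_factor=1.0):
    """Return True if the child passes the offspring selection criterion."""
    worse, better = min(parent_qualities), max(parent_qualities)
    # Threshold slides from the worse parent (factor 0) to the better one (factor 1).
    threshold = worse + comparison_factor * (better - worse)
    return child_quality > threshold
```

With parents of quality 0.5 and 0.8, a child of quality 0.7 is rejected under the strict setting but would be accepted with a comparison factor of 0.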

Each problem and algorithm configuration was repeated for 20 runs, from which a single representative run was selected based on best performance on the training data. This final run selection step was necessary for clarity and space reasons, as the slight differences between runs (particularly at the genotypic level) make it impossible for schemata generated from one population genealogy to be applied to another population genealogy.

2.3.1 Algorithm Parameters

We applied our schema generation and matching methodology at each generation on the whole population of solution candidates. The parameterizations for the two algorithms are presented in Tables 2.1 and 2.2.

Table 2.1 SGP configuration
Table 2.2 OSGP configuration

The OSGP algorithm uses the same primitive set, tree depth and size limits and crossover and mutation operators as SGP, with differences in population size, stopping criteria and selection mechanism.

2.3.2 Problem Instances

For this experiment, we selected one symbolic regression benchmark problem that facilitates discernible genotypic representations of solutions, in order to more easily observe solution fragments or building blocks contained by the schemata. We used the Poly-10 [14] synthetic symbolic regression benchmark, where the goal is to find the target function:

$$ f(\mathbf{x}) = x_1 x_2 + x_3 x_4 + x_5 x_6 + x_1 x_7 x_9 + x_3 x_6 x_{10} \tag{2.1} $$
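For reference, the target function of Eq. (2.1) can be written as a short function. The name `poly10` and the zero-based indexing of the input vector are choices made for this sketch.

```python
def poly10(x):
    """Evaluate Eq. (2.1) for a sequence x of ten values (x[0] holds x_1)."""
    return (
        x[0] * x[1]
        + x[2] * x[3]
        + x[4] * x[5]
        + x[0] * x[6] * x[8]
        + x[2] * x[5] * x[9]
    )
```

Note that x_8 does not appear in any term, so it acts as an irrelevant input.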

For the second test problem we used the Tower dataset [21], containing real-world data in the form of gas chromatography measurements of the composition of a distillation tower.

2.3.3 Analysis Methods

We perform our analysis a posteriori with the help of a complete genealogical record of the algorithmic run. We generate a set of potential schemata from the population at each generation, and match it against the whole genealogy in order to determine the evolution of schema frequencies over time.

As diversity loss in the course of the evolutionary process reflects itself in the set of schemata obtained each generation (which can contain duplicates or can repeat structures obtained in previous generations), we additionally perform filtering based on the schema frequency curves. If two schemata have highly correlated frequency curves (with a Pearson’s R^2 correlation coefficient value > 0.99), one of them is removed from the set of all schemata.
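The filtering step can be sketched as a greedy pass over the frequency curves. The text does not specify which member of a correlated pair is removed; keeping the schema encountered first is an assumption of this example, as is treating a constant curve as uncorrelated with every other curve.

```python
def r_squared(a, b) -> float:
    """Squared Pearson correlation of two equally long frequency curves."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant curve: correlation undefined, treated as 0
    return cov * cov / (var_a * var_b)


def filter_correlated(curves: dict, threshold: float = 0.99) -> dict:
    """Keep a schema only if its curve is not highly correlated with a kept one."""
    kept = {}
    for name, curve in curves.items():
        if all(r_squared(curve, other) <= threshold for other in kept.values()):
            kept[name] = curve
    return kept
```

Because the pass is greedy, the result depends on the iteration order of `curves`; sorting the schemata by frequency beforehand would make the more frequent schema of each pair survive.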

2.4 Empirical Results

We prefix each tested configuration with the name of the algorithm, followed by distinctive parameters such as the selection mechanism and maximum tree length, and then followed by the problem name. For example, the name SGP-P-25-Poly10 denotes a standard GP run with proportional selection and a maximum tree length of 25, while SGP-T-25-Poly10 denotes the same configuration with tournament selection instead.

When discussing schema frequencies, we use the notation S_{1,P}, …, S_{10,P} for schemata generated by SGP with proportional selection, and S_{1,T}, …, S_{10,T} for SGP with tournament selection. For OSGP, we use the notation S_{1,G}, …, S_{10,G} to denote the 10 most common schemata. To keep the notation concise, we reuse the same symbols in each section corresponding to each tested problem.

2.4.1 Standard GP

2.4.1.1 Poly-10 Problem

We first look at the convergence of SGP-P-25-Poly10. At the structural level, convergence should manifest itself as an increased occurrence count of repeated patterns in the population. Table 2.3 shows the most frequent schemata found in the last generation, represented in postfix notation. The notation S_{1,P}, …, S_{10,P} in the first column of the table is used to designate the schemata obtained in the SGP run with proportional selection.

Table 2.3 SGP-P-25-Poly10: most common schemata in the last generation

We notice that some schemata (for example, S_{1,P} and S_{2,P}, as well as S_{3,P} and S_{4,P}) share a degree of structural similarity. A closer look at their respective frequency curves (not detailed here for space reasons) reveals that:

  • The frequency curves for S_{1,P} and S_{2,P} are highly correlated (R^2 = 0.962); however, S_{2,P} represents a more specific template which matches fewer individuals at each generation.

  • The frequency curves for S_{3,P} and S_{4,P} are correlated (R^2 = 0.916). In this case S_{4,P} represents the slightly more specific template, matching fewer individuals than S_{3,P}.

  • The frequency curves for S_{7,P} and S_{10,P} are correlated (R^2 = 0.907), with S_{10,P} being the slightly more specific schema.

The fact that we obtained similar and frequency-correlated schemata via our crossover-based generation procedure indicates the presence of similar parent individuals in the population, suggesting loss of diversity. We focus on the most relevant schemata (S_{1,P}, S_{3,P} and S_{7,P}) and show their frequency evolution in Fig. 2.3.

Fig. 2.3 SGP-P-25-Poly10: frequency evolution of relevant schemata

The frequency curves show the moments when the algorithm was able to discover parts of the formula such as x_1 x_2, x_3 x_4 and x_5 x_6. The schemata sharply increase their frequency in the population in the beginning of the run and then vary according to the internal dynamics of the evolutionary search (competition between schemata, stagnation in the later stages).

From a diversity perspective, the schema frequency approach has the ability to identify high level similarities in the population (e.g., when 30% of the population share the same genetic template) that would otherwise be hard to notice with conventional metrics like tree distances.

The results so far confirm that schemata identified by our method correspond to what could be considered as building blocks for this problem, including in their structure the terms of the formula and showing an exponential increase in frequency from the moment of their occurrence.

We calculate schema average quality (as the average quality of their matching individuals) and show the results in Fig. 2.4. The quality curves suggest that the identified schemata are of above-average quality.

Fig. 2.4 SGP-P-25-Poly10: average population and schema qualities
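The per-generation schema frequency and schema average quality can be computed as in the following sketch. The input layout, one list of `(matches_schema, quality)` pairs per generation, is an assumption made for the example; in practice the boolean would come from the schema matching step.

```python
def schema_curves(generations):
    """Return (frequencies, qualities), one value per generation.

    The quality entry is None for generations in which no individual
    matches the schema, since no average can be computed there.
    """
    frequencies, qualities = [], []
    for population in generations:
        matched = [quality for is_match, quality in population if is_match]
        frequencies.append(len(matched) / len(population))
        qualities.append(sum(matched) / len(matched) if matched else None)
    return frequencies, qualities
```

Plotting libraries typically render such `None` entries as gaps in the quality curve.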

A similar situation can be observed for SGP with tournament selection, where several frequent schemata are present in the last generation and shown in Table 2.4. We notice that the top four schemata are matching relatively high proportions of the population and show their detailed frequency evolution in Fig. 2.5.

Fig. 2.5 SGP-T-25-Poly10: frequency evolution of relevant schemata

Table 2.4 SGP-T-25-Poly10: most common schemata in the last generation

The generated schemata correspond to solution building blocks, containing terms of the target formula. Compared to proportional selection, the extra selection pressure applied on the population by the tournament selection (with a group size of 5) leads to larger schemata.

The observed schema frequency evolutions for SGP with proportional and tournament selection support the idea that relevant schemata increase in frequency over the generations.

Quality measurements in Fig. 2.6 show a significant difference between the average quality of the population and the average schema qualities. The gaps in the curves of this figure correspond to generations in which the schema frequency dropped to zero, so that an average quality could not be calculated. The results suggest that tournament selection (applying higher pressure on the population) promotes higher quality schemata.

Fig. 2.6 SGP-T-25-Poly10: average population and schema qualities

2.4.1.2 Tower Problem

We compare the two standard GP configurations using proportional and tournament selection, denoted SGP-P-25-Tower and SGP-T-25-Tower. The most common schemata for the SGP variant with proportional selection are given in Table 2.5.

Table 2.5 SGP-P-25-Tower: most common schemata in the last generation

The obtained templates are short and only include 2 out of 25 input variables, with the most common schema matching 15% of the population in the last generation. This result suggests that the two variables x_1 and x_6 are more relevant (in terms of the implicit variable ranking performed by GP) for the modeling of the target. In terms of quality, the produced symbolic regression solution achieved a Pearson’s R^2 correlation with the target variable of 0.8. As with the previous problem, we plot the evolution of schema frequencies, using a correlation-based filtering step to eliminate similar curves. The Pearson’s R^2 correlation values for S_{1,P}, …, S_{10,P} show that:

  • S_{1,P} is highly correlated with S_{4,P} and S_{5,P}

  • S_{2,P} is highly correlated with S_{6,P}

  • S_{8,P} is highly correlated with S_{9,P}

The frequencies of the remaining schemata are shown in Fig. 2.7, while their qualities, along with the average quality of the population, are shown in Fig. 2.8. We see that S_{2,P} becomes frequent rather early and is overall more frequent than S_{1,P}, while the latter has a marginally higher frequency in the last generation. Quality-wise, S_{1,P} and S_{2,P} are clearly above the average of the population, while S_{7,P} and S_{8,P} occasionally dip below the average.

Fig. 2.7 SGP-P-25-Tower: frequency evolution of relevant schemata

Fig. 2.8 SGP-P-25-Tower: average population and schema qualities

Tournament selection leads to the evolution of more complex schemata. The ten most frequent schemata in the last generation, shown in Table 2.6, are larger in size, match more individuals and contain more variables from the dataset.

Table 2.6 SGP-T-25-Tower: most common schemata in the last generation

Correlation analysis of the frequency curves reveals that:

  • S_{1,T} is highly correlated with S_{2,T}, S_{3,T} and S_{4,T}, with an R^2 value of 0.96

  • S_{7,T} is highly correlated with S_{10,T}, with an R^2 value of 0.95

Filtering correlated schemata, we display the remaining schema frequency curves in Fig. 2.9. Interestingly, S_{1,T}, the most common schema in the last generation, has a noticeably lower average quality compared to S_{5,T}, S_{6,T} and S_{7,T}, although it still manages to rise above the average population quality, as seen in Fig. 2.10.

Fig. 2.9 SGP-T-25-Tower: frequency evolution of relevant schemata

Fig. 2.10 SGP-T-25-Tower: average population and schema qualities

2.4.2 Offspring Selection GP

As previously mentioned, OSGP implements an additional selection step which decides if the offspring produced by mutation and crossover are accepted into the next generation. We analyze the influence of offspring selection on the generated schemata and their frequencies.

2.4.2.1 Poly-10 Problem

Surprisingly, schema frequencies in the last generation show that only two out of all the generated schemata managed to survive. Furthermore, these two schemata represent a very similar genotypic template which managed to propagate itself to all of the individuals in the population. The two schemata are displayed in Table 2.7. This result shows that it is entirely possible under strict offspring selection for the algorithm to converge to a single genetic template.

Table 2.7 OSGP-25-Poly10: most common schemata in the last generation

Since we only have two schemata in the last generation, we investigate the evolution of schema frequencies using a different strategy: we rank the schemata based on their overall frequency, that is, the average of their individual frequencies in each generation. The new ranking is shown in Table 2.8, where the frequency represents an average of the schema frequency over all generations.

Table 2.8 OSGP-25-Poly10: most common schemata overall
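The alternative ranking can be sketched in a few lines: each schema is scored by the mean of its per-generation frequencies instead of its frequency in the last generation. The function name and the dictionary layout of the curves are assumptions for this example.

```python
def rank_by_overall_frequency(curves: dict, top: int = 10):
    """Rank schemata by their frequency averaged over all generations."""
    overall = {name: sum(curve) / len(curve) for name, curve in curves.items()}
    return sorted(overall.items(), key=lambda item: item[1], reverse=True)[:top]


# Hypothetical curves: A dominates only the last generation,
# B is frequent early in the run and nearly extinct at the end.
example = {"A": [0.0, 0.0, 1.0], "B": [0.5, 0.5, 0.2]}
ranking = rank_by_overall_frequency(example)
```

In this toy example schema B ranks first overall even though A matches the entire last generation, which is the kind of early-proliferating schema this ranking is meant to surface.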

Several of the schemata from Table 2.8 match the same individuals and have highly correlated frequency curves. These schemata were filtered from Fig. 2.11 to eliminate clutter. The figure shows multiple schemata (S_{3,G}, S_{6,G} and S_{8,G}) proliferating in the population in the earlier generations of the run, only to become extinct later.

Fig. 2.11 OSGP-25-Poly10: frequency evolution of relevant schemata

After generation 38, the two most frequent schemata in the last generation, S_{1,G} and S_{2,G}, have overlapping frequency curves, suggesting that S_{2,G} has a higher degree of specificity, presumably due to the lack of ‘#’ wildcard symbols in its structure.

2.4.2.2 Tower Problem

We notice a similar behavior for the Tower problem, where a single schema denoted as S_{1,G} matches all the individuals in the last generation:

C X1 C X12 X6 X23 X22 # C X5 X1 C

Like before, we consider in this situation the most frequent schemata overall, shown in Table 2.9.

Table 2.9 OSGP-25-Tower: most common schemata overall

Figure 2.12 shows the evolution of schema frequencies for the top three most frequent schemata from Table 2.9. We see schema S_{1,G} rising in frequency after generation 20 and driving other schemata to extinction.

Fig. 2.12 OSGP-25-Tower: frequency evolution of relevant schemata

Compared to SGP, the schemata obtained by OSGP and their frequency evolution suggest a more pronounced loss of diversity as the population becomes dominated by a single schema.

2.5 Conclusion

We described in this chapter a practical approach for performing schema analysis on GP populations, considering a well-known schema definition (Poli’s hyperschema) that uses two types of wildcard symbols, matching single nodes and whole subtrees, respectively. The methodology can be easily extended to include different schema definitions or stricter matching rules.

Hyperschemata are generated algorithmically by taking into account genealogical information about crossover offspring and their respective parents. A pattern matching algorithm is then used to match schemata against the GP population at each generation.

We tested our methodology using two test problems (Poly-10 and Tower) and two algorithmic variants: Standard GP and Offspring Selection GP. The results validate our approach: the identified schemata for each test problem are of increasing frequency in the population and above-average quality. Compared to other methods for measuring genotypic diversity, our schema-based approach offers a detailed picture of the propagation of repeated patterns, while also being able to identify these patterns.

The evolution of schema frequencies suggests that diversity loss starts to occur early in the evolutionary run and tends to homogenize the genotypic structure of the population. As expected, this phenomenon is highly influenced by the selection mechanism. For both problems, the SGP runs using tournament selection displayed lengthier, more frequent and more specific schemata. Offspring selection has even more drastic effects, as the population comes to share a single (and rather specific) genetic template.

Future research in this direction will focus on a more detailed analysis of population dynamics where we also consider schema disruption events. The approach can also be employed online to guide the evolutionary process, for example by avoiding loss of diversity via localized mutation rates within frequent schemata.