1 Introduction: Making E Stupid and Then Smart Again

State-of-the-art saturation-based automated theorem provers (ATPs) for first-order logic (FOL), such as E [14] and Vampire [12], employ the given clause algorithm [13], translating the input FOL problem \(T\cup \{\lnot C\}\) (background theory and negated conjecture) into a refutationally equivalent set of clauses. The search for a contradiction is performed by maintaining sets of processed (P) and unprocessed (U) clauses (the proof state \(\Pi \)). The algorithm repeatedly selects a given clause g from U, moves g to P, and extends U with all clauses inferred from g together with clauses in P. This process continues until a contradiction is found, U becomes empty, or a resource limit is reached.
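The loop can be summarized in a short sketch. The following is an illustrative Python rendering of the given clause algorithm as described above, not E’s implementation; eval_fn and infer stand in for E’s clause evaluation and inference machinery.

```python
# Illustrative sketch of the given clause algorithm (not E's actual code).
# `eval_fn` scores clauses (lower = better); `infer` generates all clauses
# derivable from the given clause g together with the processed set P.
def given_clause_loop(initial_clauses, eval_fn, infer, step_limit):
    processed, unprocessed = set(), set(initial_clauses)
    for _ in range(step_limit):
        if not unprocessed:
            return "saturated"             # U empty: no contradiction found
        g = min(unprocessed, key=eval_fn)  # select the given clause
        unprocessed.remove(g)
        if g.is_empty():                   # the empty clause = contradiction
            return "proof found"
        processed.add(g)
        unprocessed |= infer(g, processed)
    return "resource limit reached"
```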

Historically, term ordering and literal selection have been used to guarantee the completeness of the proof search [1] and to “tame the growth of the search space and help steer proof search” [5]. Term ordering ensures that rewriting happens in only one direction, toward smaller terms: for example, an equation such as \(f(f(x)) \approx f(x)\) may only be used to rewrite instances of the larger side \(f(f(x))\) to the smaller \(f(x)\), never the reverse. Literal selection restricts the inferences done with each given clause g to the selected literals, which slows the growth of the search space and reduces redundant inferences.

E includes a strategy language of clause evaluation functions, composed of weight and priority functions, to heuristically guide the proof search. In this work, I use two algorithmically invented [6, 7] strategies, E1 and E2, that employ many sophisticated clause evaluation functions, the Knuth-Bendix ordering (KBO6), literal selection, and other E heuristics.

The ENIGMA [4, 8,9,10] system, using the XGBoost [2] implementation of gradient boosted decision trees, has recently demonstrated a strong capability to learn to guide the E [14] theorem prover’s inferences in real time. ENIGMA uses the XGBoost model as a clause evaluation function that recommends clauses for selection based on clause and conjecture features. In particular, after several proving and learning iterations, its performance on the 57880 problems of the Mizar40 [11] benchmark improved by \(70\%\) (\(25397/14933 \approx 1.70\)) [10] over the strategy E1 used for the initial proving phase.
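As a rough sketch of this setup (featurize and the feature scheme below are placeholders, not ENIGMA’s actual feature extraction), a trained XGBoost booster can be wrapped as a clause evaluation function:

```python
import numpy as np
import xgboost as xgb

def make_clause_eval_fn(model: xgb.Booster, featurize):
    """Wrap a trained booster as a clause evaluation function.

    `featurize` maps a clause (plus conjecture context) to a fixed-length
    numeric vector; lower returned values mean "select sooner".
    """
    def eval_fn(clause) -> float:
        x = np.asarray(featurize(clause), dtype=np.float32).reshape(1, -1)
        # Negate the predicted "proof clause" probability so that better
        # clauses get smaller evaluations, matching min-based selection.
        return -float(model.predict(xgb.DMatrix(x))[0])
    return eval_fn
```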

In this work, E is stripped to the bare bones by disabling term ordering and literal selection. KBO6 is replaced with an identity relation as the minimal possible ordering (called IDEN, an addition to E). While this frees E to perform inferences in any order, E can no longer perform rewriting inferences. The strategy E1 is replaced with a simple combination of the clause weight and FIFO (first in, first out) evaluation functions. E is thus practically reduced to a basic superposition prover, without advanced heuristics, rewriting, or completeness guarantees. We call this strategy E0:

[Figure a: the E0 strategy specification.]
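The exact option string of figure a is not reproduced here; the following is a plausible reconstruction from the description above (IDEN ordering, no literal selection, clause weight plus FIFO), using E’s standard option names. Treat the specific flag values as assumptions.

```python
import subprocess

# Plausible reconstruction of an E0 run (flag values are assumptions based
# on the prose above, not a verbatim copy of figure a).
e0_cmd = [
    "eprover",
    "--term-ordering=IDEN",                      # identity relation ordering
    "--literal-selection-strategy=NoSelection",  # disable literal selection
    "-H(1*Clauseweight(ConstPrio,1,1,1),1*FIFOWeight(ConstPrio))",
    "--cpu-limit=5",                             # 5 s per problem
    "problem.p",
]
result = subprocess.run(e0_cmd, capture_output=True, text=True)
print(result.stdout)
```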

E0 solves only 3872 of the Mizar40 problems in 5 s, compared to 14526 for E1. The first research question is the extent to which ENIGMA with this basic prover can learn ATP guidance completely on its own. The second is to what extent ENIGMA’s learning can be boosted with data from strong strategies and models. That is, I explore how smart machine learning can become in this zero-strategy setting. The more general related question is to what extent machine learning can replace the sophisticated, human-invented body of theorem-proving wisdom used in today’s ATPs for restricting advanced proof calculi.

2 Experiments

We evaluate ENIGMA with the basic strategy, E0, in several scenarios over two datasets of different sizes. All experiments are run with a time limit of 5 s per problem.

ENIGMA has so far been used in two ways: coop combines the learned model with a standard E strategy in equal measure (50:50), while solo uses only the learned model for choosing the given clauses. The best results have been achieved with MaLARea-style [16] looping: an ENIGMA model is trained and run with E (loop 0), the resulting data are added to the initial training data, and a new ENIGMA model is trained (loop 1).
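Schematically, the looping works as follows. Here run_e, extract_examples, and train_model are placeholders for running E (solo or coop), collecting positive and negative clause examples from the runs, and XGBoost training, respectively.

```python
# Schematic MaLARea-style looping (placeholders, not ENIGMA's actual code).
def enigma_looping(problems, n_loops, run_e, extract_examples, train_model):
    training_data, model = [], None
    for loop in range(n_loops):
        # Loop 0 runs E without a model; later loops run E guided by it.
        results = [run_e(problem, model) for problem in problems]
        training_data.extend(extract_examples(results))
        model = train_model(training_data)
    return model
```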

In this work, ENIGMA trains on both solo and coop data. I present results from solo runs because they represent the minimal setting.

2.1 Small Data (2000 Problems)

[Figure b: problems solved per loop by the small-data protocols.]

The E evaluations and XGBoost training can take a long time on the full Mizar40 dataset, so meta-parameters are tested on 2000 randomly sampled problems. Each XGBoost model consists of T decision trees of maximum depth D; together with the learning rate (\(\eta =0.2\)), these are the most important training meta-parameters. In previous work with ENIGMA, T and D were fixed for all loops of learning. Here we vary the values of T and D over 16 loops. Let \(S_{D,T}\) denote the experiment with specific values of D and T. Of the many protocols tested, the following are included in the plot of solved problems (Figure b): Fives (\(S_{5,100}\)), Nines (\(S_{9,100}\)), Thirteens (\(S_{13,200}\)), Sixteens (\(S_{16,100}\)).
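In XGBoost terms, T and D map directly onto num_boost_round and max_depth. A minimal sketch of one training round follows, assuming dtrain is an xgboost.DMatrix of clause feature vectors labeled by proof membership:

```python
import xgboost as xgb

def train_enigma_model(dtrain: xgb.DMatrix, T: int, D: int) -> xgb.Booster:
    """One training round: T trees of maximum depth D, learning rate 0.2."""
    params = {"objective": "binary:logistic", "max_depth": D, "eta": 0.2}
    return xgb.train(params, dtrain, num_boost_round=T)
```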

We also experiment with adaptively setting the meta-parameters as the number of training examples increases, according to the following protocols (see the sketch after this list):

  • Inc (\(S_{[3,33],100}\)) increases D by 2 from 3 to 33 and keeps \(T=100\) fixed.

  • 32_inc (\(S_{32,[50,250]}\)) fixes \(D=32\) and gradually increases T from 50 to 250.

  • Inc2 (\(S_{[3,33],*}\)) gradually decreases T from 150 to 50, varying the value intuitively.

  • Inc3 (\(S_{[3,33],[50,250]}\)) aims to be more systematic and steps T from 50 to 250.

  • Dec3 (\(S_{[3,33],[250,50]}\)) decreases T from 250 to 50.
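Two of these schedules are sketched below over the 16 loops. The paper fixes only the endpoints, so the linear interpolation used for Inc3’s T schedule is an assumption.

```python
def inc(loop: int) -> tuple[int, int]:
    """Inc: D grows by 2 per loop from 3 to 33; T is fixed at 100."""
    return min(3 + 2 * loop, 33), 100

def inc3(loop: int, n_loops: int = 16) -> tuple[int, int]:
    """Inc3: D steps from 3 to 33 and T from 50 to 250 (assumed linear)."""
    frac = loop / (n_loops - 1)
    return round(3 + 30 * frac), round(50 + 200 * frac)

# Example: (D, T) per loop for Inc -> (3, 100), (5, 100), ..., (33, 100)
schedule = [inc(i) for i in range(16)]
```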

At the 16th loop, Inc performs best, solving 299 problems and doubling the performance of E0 (152). However, Inc2 and Inc3 are close behind with 298 problems, and 32_inc solves 291. The conclusion is that simple protocols work well so long as T or D is increased adaptively rather than kept fixed.

2.2 Big Data (57880 Problems)

[Figure c: problems solved per loop in the big-data experiments.]

These experiments are done on the large benchmark of 57880 Mizar40 [11] problems from the MPTP dataset [15]. E1 and E2 are two strong E strategies that solve 14526 and 12788 problems, respectively.

  • Experiment 1 is done with \(D=9\) and \(T=200\) and uses our previously trained model, which solved 25562 problems when cooperating with E1 in our previous experiments [10]. This strong model, which hashes the features into 32768 (\(2^{15}\)) buckets [3, Sect. 3.4] (see the hashing sketch after this list), is now used with E0.

  • Experiment 2’s parameters were toggled intuitively during the looping, as in Inc3, and a feature size of \(2^{16}\) is used. Exp. 2 uses training data from E1 and E2 for additional guidance up to the 4th loop, after which they are excluded from the training data on the assumption that they may confuse learning.

  • Experiment 3 sets T and D according to protocol Inc3. Exp. 3 learns only from E runs with E0 and trains on the GPU, which requires the feature size to be reduced to 256.

  • Experiment 4 mimics Exp. 3 but uses E1 and E2 data for training (up to the 4th loop).

  • Experiment 5 further tests boosting with data from an E0 ENIGMA model that proved 9759 problems and an E1 model that proved 21542 problems. Tree depth is varied intuitively among 32, 512, and 1000, and the number of trees among 2, 100, 200, and 32. The feature vector size starts at \(2^{14}\) and is decreased, down to 32 (\(2^{5}\)), to allow the data to fit in RAM.
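A minimal sketch of the feature hashing used in Exp. 1 and Exp. 5 follows; the concrete hash function here is an assumption, while ENIGMA’s actual scheme follows [3, Sect. 3.4].

```python
import hashlib
from collections import Counter

def hash_features(feature_strings, n_buckets: int = 2**15) -> Counter:
    """Fold arbitrarily many clause feature strings into n_buckets counts."""
    vec = Counter()
    for feature in feature_strings:
        digest = hashlib.md5(feature.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % n_buckets] += 1
    return vec
```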

As seen in the figure, the strong model does not help much in guiding E without ordering or selection in Exp. 1. Exp. 2 learns gradually and catches up with Exp. 1, but seems to plateau around 10000 problems. Surprisingly, the pure Exp. 3 learns fast with the small feature size, but plateaus and then drops in performance (perhaps due to overfitting). Exp. 4 indicates that the additional guidance data are useful: it surpasses E2, solving 13805 problems in round 13. Exp. 5 solves 15990 problems, showing that with appropriate parameters and boosting, ENIGMA can take E0 beyond the smart strategies. This is a great improvement over the 3872 problems solved by E0.

3 Conclusion

ENIGMA can learn to guide the E prover effectively even without smart strategies and term orderings. The learned models confer a \(256\%\) increase over the naive E0 after 13 rounds of the proving/learning loop, and even models trained without guidance data confer a \(121\%\) increase.

The experiments indicate that machine learning can be used to fully control an ATP’s guidance, learning to replace term orderings and heuristic strategies and to cope with the increase in generated clauses in the absence of literal selection. However, the combination of ENIGMA and standard ATP heuristics still significantly outperforms ENIGMA alone.

Given the large symmetry-breaking impact of these methods in classical ATP, future work includes, e.g., training the guidance in such a way that redundant (symmetric) inferences are not performed by the trained model once it has committed to a certain path. This probably means equipping the learning with more history and knowledge of the proof state in the saturation-style setting. ENIGMAWatch [4] may aid with symmetry breaking by focusing the proof search on particular proof paths. Additional work is needed to isolate the factors behind Exp. 5’s performance and to determine the most effective boosting methods beyond increasing D and T across training loops. Ablation studies should be done to discover the individual impact of term ordering and literal selection on the performance of E and ENIGMA. Perhaps term ordering alone is sufficient for training good ENIGMA models.

Running ENIGMA without term ordering and other restrictions is important because it may allow us to combine training data from different strategies, and it may allow ENIGMA to find novel proofs.