1 Introduction

We saw in Sect. 8.11 that a nested sampling calculation can require billions of energy evaluations. If we want to apply nested sampling to interatomic potentials that are slower to evaluate than Lennard-Jonesium, then it will certainly be necessary to parallelise the algorithm.

In this section we describe two approaches to parallelising nested sampling. The first [1, 2] is to parallelise over the number of iterations performed, by discarding \(\mathcal{P}\) configurations at each iteration rather than just one. At each iteration, \(\mathcal{P}\) samples are cloned and decorrelated, using \(\mathcal{P}\) processors in parallel. In this way the calculation proceeds through \(\chi \) in larger steps and takes less time to run.

The second approach, developed by Lívia Bartók-Pártay and myself, parallelises within the individual iterations, by parallelising over the walk length \(L\). We detail an efficient algorithm implementing this parallelisation scheme, and evaluate the resulting speedup, finding it to be almost perfect.

2 Parallelising over the Number of Iterations

Unlike standard nested sampling, where \(E_{\mathrm {lim}}\) is updated to the highest energy in our sample set at each iteration, here it is updated to the \(\mathcal{P}^{\mathrm {th}}\) highest energy. We then record the \(\mathcal{P}\) highest energies and their configurations, with appropriate weightings [2]. Finally, \(\mathcal{P}\) other configurations are chosen at random from the remaining samples, and the clone-and-decorrelate procedure (Algorithm 7.3) is applied to the resulting \(\mathcal{P}\) clones in parallel.
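To make the bookkeeping concrete, the following Python sketch performs one iteration of this scheme. It is a minimal sketch under the decorrelation assumptions above: the names (`samples`, `energies`, `walk`) are illustrative rather than the interface of any particular code, and `walk` stands in for the clone-and-decorrelate procedure of Algorithm 7.3, returning an updated configuration and its energy.

```python
import numpy as np
from multiprocessing import Pool

def parallel_ns_iteration(samples, energies, P, walk, rng):
    """One nested sampling iteration parallelised over iterations.

    `samples` is a list of K configurations, `energies` a matching NumPy
    array; `walk(config, e_lim)` is assumed to clone-and-decorrelate a
    configuration below the energy limit and return (config, energy).
    """
    K = len(energies)

    # E_lim becomes the P-th highest energy; the P highest-energy samples
    # are recorded as output (with the appropriate weights) and discarded.
    worst = np.argsort(energies)[-P:]
    e_lim = float(energies[worst].min())
    recorded = [(samples[i], float(energies[i])) for i in worst]

    # Choose P of the surviving configurations at random and clone them
    # into the vacated slots.
    survivors = np.setdiff1d(np.arange(K), worst)
    parents = rng.choice(survivors, size=P, replace=False)

    # Decorrelate the P clones in parallel, one per processor.
    with Pool(processes=P) as pool:
        new = pool.starmap(walk, [(samples[p], e_lim) for p in parents])
    for slot, (config, energy) in zip(worst, new):
        samples[slot], energies[slot] = config, energy

    return recorded, e_lim
```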

Recall from Sect. 7.5 that, for standard nested sampling, after \(\mathcal{K}\) iterations \(\log \chi \) is reduced from \(\log \chi _i\) to \(\log \chi _{i+\mathcal{K}}=\log \chi _i - 1\), with a standard deviation of \(\frac{1}{\sqrt{\mathcal{K}}}\). We now calculate the standard deviation for the same reduction, \(\log \chi \leftarrow \log \chi -1\), when taking steps of \(\mathcal{P}\) configurations as described above.

The pdf for \(t=\frac{\zeta _i}{\zeta _{i-1}}\), when our live set contains \(\mathcal{K}\) configurations in total, is [3]

$$\begin{aligned} \varrho \left( t\vert \mathcal{P},\mathcal{K}\right) = \frac{\mathcal{K}!}{\left( \mathcal{P}-1\right) !\left( \mathcal{K}-\mathcal{P}\right) !} t^{\mathcal{K}-\mathcal{P}}\left( 1-t\right) ^{\mathcal{P}-1}. \end{aligned}$$
(10.1)

Evaluating the mean and variance of \(\log t\), we find [4]

$$\begin{aligned} \langle \log t\rangle&= -\left[ \sum _{i=\mathcal{K}-\mathcal{P}+1}^{\mathcal{K}}{\frac{1}{i}}\right] \end{aligned}$$
(10.2)
$$\begin{aligned} \mathrm {Var}\left( \log t\right)&= \frac{d^{2}\ln \Gamma (z)}{dz^{2}}\Bigg |_{z=\mathcal{K}-\mathcal{P}+1} - \frac{d^{2}\ln \Gamma (z)}{dz^{2}}\Bigg |_{z=\mathcal{K}+1} \end{aligned}$$
(10.3)
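Equations (10.1)–(10.3) can be checked numerically by drawing the \(\mathcal{P}^{\mathrm {th}}\) largest of \(\mathcal{K}\) uniform variates many times and comparing the sample moments of \(\log t\) with the analytic expressions. A minimal sketch, assuming NumPy and SciPy and an arbitrary choice of \(\mathcal{K}\) and \(\mathcal{P}\):

```python
import numpy as np
from scipy.special import polygamma

K, P = 640, 8                      # arbitrary example values
rng = np.random.default_rng(0)

# Draw t = (P-th largest of K uniform variates) many times; Eq. (10.1) says
# t follows the Beta density given above.
u = rng.random((20_000, K))
log_t = np.log(np.sort(u, axis=1)[:, K - P])   # index K-P holds the P-th largest

# Eq. (10.2): <log t> = -(1/(K-P+1) + ... + 1/K);
# Eq. (10.3): Var(log t) is a difference of trigamma functions.
mean_exact = -np.sum(1.0 / np.arange(K - P + 1, K + 1))
var_exact = polygamma(1, K - P + 1) - polygamma(1, K + 1)

print(log_t.mean(), mean_exact)    # agree to within sampling error
print(log_t.var(), var_exact)
```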

When parallelising over the number of iterations, \(Y_{\mathrm {lim}}\) is updated to the lowest of the \(\mathcal{P}\) highest enthalpies in our sample set, and the configuration space volume contained by the updated \(Y_{\mathrm {lim}}\) is \(\chi _i\approx \chi _0[(\mathcal{K}-\mathcal{P}+1)/(\mathcal{K}+1)]^i\), where i is the NS iteration number. For \(\mathcal{P}>1\) it is also possible to give analytic estimates of the configuration space volumes contained by the \(\mathcal{P}-1\) higher enthalpy values between \(Y_{\mathrm {lim}}^{\left( i-1\right) }\) and \(Y_{\mathrm {lim}}^{\left( i\right) }\) [2]. Thus one may consider the configurational entropy contained at fractional numbers of NS enthalpy levels.

After a number of enthalpy levels

$$\begin{aligned} n_\Delta =\left( \sum _{i=\mathcal{K}-\mathcal{P}+1}^{\mathcal{K}}{\frac{1}{i}}\right) ^{-1} \end{aligned}$$
(10.4)

the expectation of the logarithm of the configuration space enclosed by \(Y_{\mathrm {lim}}\) decreases by 1:

$$\begin{aligned} \langle \log \chi _i - \log \chi _{i+n_{\Delta }} \rangle = 1. \end{aligned}$$
(10.5)

If we assume that it is possible to draw exact random samples in each iteration, then it can be shown that, after the same number of enthalpy levels \(n_{\Delta }\), the variance of \(\Delta \log \chi =\log \chi _i - \log \chi _{i + n_\Delta }\) is given by

$$\begin{aligned} \mathrm {Var}(\Delta \log \chi ) = n_{\Delta }\left[ \frac{d^{2}\ln \Gamma (z)}{dz^{2}}\Bigg |_{z=\mathcal{K}-\mathcal{P}+1} - \frac{d^{2}\ln \Gamma (z)}{dz^{2}}\Bigg |_{z=\mathcal{K}+1}\right] \end{aligned}$$
(10.6)

where \(\Gamma (z)\) is the gamma function. The standard deviation, \(\left[ \mathrm {Var}(\Delta \log \chi )\right] ^{\frac{1}{2}}\), represents the rate at which uncertainty in \(\log \chi \) accumulates during a nested sampling calculation. For a serial calculation (\(\mathcal{P}=1\)), \(\left[ \mathrm {Var}(\Delta \log \chi )\right] ^{\frac{1}{2}} = \frac{1}{\sqrt{\mathcal{K}}}\).

Figure 10.1 shows how the ratio of \(\left[ \mathrm {Var}(\Delta \log \chi )\right] ^{\frac{1}{2}}\) for parallel and serial NS, \(R = \sqrt{\mathcal{K}}\,[\mathrm {Var}(\Delta \log \chi )]^{\frac{1}{2}}\), depends on \(\frac{\mathcal{P}-1}{\mathcal{K}}\). R represents the relative rate at which uncertainty in \(\log \chi \) accumulates in parallel and serial calculations. One can see that \(\log {R}\) converges for \(\mathcal{K}>10^2\), and that for \(\frac{\mathcal{P}-1}{\mathcal{K}}\le 0.25\), \( \left[ \mathrm {Var}(\Delta \log \chi )\right] ^{\frac{1}{2}} \sim \exp {\left( 0.28 \frac{\mathcal{P}-1}{\mathcal{K}} \right) } \frac{1}{\sqrt{\mathcal{K}}}\). For larger \(\frac{\mathcal{P}-1}{\mathcal{K}}\), R increases more rapidly.
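The quantity plotted in Fig. 10.1 follows directly from Eqs. (10.4) and (10.6). The short sketch below (assuming SciPy for the trigamma function; the values of \(\mathcal{K}\) and \(\mathcal{P}\) are arbitrary examples) evaluates R and compares it with the fitted form \(\exp \left( 0.28 \frac{\mathcal{P}-1}{\mathcal{K}} \right) \):

```python
import numpy as np
from scipy.special import polygamma

def uncertainty_ratio(P, K):
    """R of Fig. 10.1: sqrt(Var(Delta log chi)) relative to the serial 1/sqrt(K)."""
    n_delta = 1.0 / np.sum(1.0 / np.arange(K - P + 1, K + 1))        # Eq. (10.4)
    var = n_delta * (polygamma(1, K - P + 1) - polygamma(1, K + 1))  # Eq. (10.6)
    return np.sqrt(var * K)

K = 10_000
for P in (1, 101, 1001, 2501):
    x = (P - 1) / K
    print(f"(P-1)/K = {x:5.2f}   R = {uncertainty_ratio(P, K):.4f}"
          f"   exp(0.28 x) = {np.exp(0.28 * x):.4f}")
```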

Fig. 10.1

Ratio of the rates, R, at which uncertainty in \(\log \chi \) accumulates for runs parallelised by removing and walking \(\mathcal{P}> 1\) configurations, compared to a serial run with \(\mathcal{P}= 1\), as a function of the scaled number of configurations removed, \(\frac{\mathcal{P}-1}{\mathcal{K}}\), for several values of \(\mathcal{K}\). The dashed line indicates the linear trend for \(\mathcal{P}-1 \ll \mathcal{K}\)

3 Parallelising Within Each Iteration

Instead of parallelising over the number of iterations, we propose a method of parallelising within each iteration, with the aim of avoiding the cumulative error shown in Fig. 10.1.

In the serial algorithm, each new clone is decorrelated using a trajectory of walk length \(L\). In our parallelised algorithm, new clones are decorrelated using a trajectory of walk length \(\frac{L}{\mathcal{P}}\), and \(\left( \mathcal{P}-1\right) \) other random configurations are also (independently) propagated through \(\frac{L}{\mathcal{P}}\) steps. Thus, rather than propagating a single configuration for \(L\) steps, we propagate \(\mathcal{P}\) configurations for \(\frac{L}{\mathcal{P}}\) steps.
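For contrast with the sketch in Sect. 10.2, one iteration of this scheme might be organised as follows. Again this is only a sketch with illustrative names; `walk` now takes the number of steps as an argument, and the parallel walks are written as a plain loop:

```python
import copy
import numpy as np

def within_iteration_step(samples, energies, P, L, walk, rng):
    """One NS iteration parallelised over the walk length.

    Exactly one sample is discarded, as in serial NS, but P configurations
    (the clone plus P-1 randomly chosen others) are each walked L/P steps,
    one per processor.
    """
    K = len(energies)
    worst = int(np.argmax(energies))            # single highest-energy sample
    e_lim = float(energies[worst])
    recorded = (samples[worst], e_lim)          # written as output

    # Clone a randomly chosen survivor into the vacated slot.
    parent = int(rng.choice([i for i in range(K) if i != worst]))
    samples[worst] = copy.deepcopy(samples[parent])
    energies[worst] = energies[parent]

    # The clone and P-1 other randomly chosen configurations each receive
    # L/P decorrelation steps.
    others = rng.choice([i for i in range(K) if i != worst], size=P - 1, replace=False)
    for i in [worst] + [int(j) for j in others]:
        samples[i], energies[i] = walk(samples[i], e_lim, L // P)

    return recorded, e_lim
```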

It is possible to derive the variance of \(L'\), the total number of MC steps applied to a sample between when it is created by cloning and when it is eventually written as output:

$$\begin{aligned} \mathrm {Var}(L') = L^2\, \frac{\mathcal{P}-1 }{\mathcal{P}} \end{aligned}$$
(10.7)

Unlike the case of parallelising over the number of iterations, there are no analytic results for the error due to the variability of \(L'\); nevertheless, several observations can be made. One is that the standard deviation of \(L'\) (except in the serial case \(\mathcal{P}=1\)) is almost as large as L itself. Another is that the scaling of \(\mathrm {Var}(L')\) with the number of extra parallel tasks \(\mathcal{P}- 1\) is polynomial, unlike the exponential scaling of the error ratio R with extra tasks when parallelising over the number of iterations. It is unclear, however, how this variability in walk length affects the error in the results, although the empirical observation is that the effect does not appear to be strong. Nevertheless, the serial case results in the lowest uncertainty in estimates of configuration space volumes.
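Equation (10.7) can be illustrated with a simple stochastic model in which, consistent with the decorrelation assumption used above, the sample to be discarded and the \(\mathcal{P}-1\) extra walkers are chosen uniformly at random at each iteration. Under that idealised assumption a cloned sample survives a geometric number of iterations and is walked a binomial number of times, and a few lines of NumPy reproduce the quoted mean and variance; the parameter values are arbitrary examples:

```python
import numpy as np

# Idealised model: the discarded sample and the P-1 extra walkers are taken to
# be uniformly random at each iteration (i.e. perfect decorrelation).
K, P, L = 640, 8, 4160
rng = np.random.default_rng(1)
n_clones = 1_000_000

# Iterations a freshly cloned sample survives before being written as output.
survived = rng.geometric(1.0 / K, size=n_clones) - 1
# Number of those iterations in which it is picked as one of the P-1 extra walkers.
extra_walks = rng.binomial(survived, (P - 1) / (K - 1))
# Total walk length: L/P at birth (the clone is always walked) plus L/P per pick.
L_prime = (L // P) * (1 + extra_walks)

print(L_prime.mean(), L)                           # expect ~L
print(L_prime.var(), L**2 * (P - 1) / P)           # expect ~Eq. (10.7)
```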

3.1 Details of Our Implementation

The MCMC strategy described in Sect. 8.5 makes abundant use of single-atom moves. When computing the change in energy after such a move, it is often unnecessary to recompute the interaction energies between all pairs of atoms: most of the interaction energies remain unchanged, so only a few terms need to be evaluated. In this case it is efficient to store the individual interaction energies and update only a few at a time. However, the number of such interaction energies can quickly become very large, and sending large arrays between processors using the Message Passing Interface (MPI) can be slow, depending on the network connecting the processors. For this reason we developed an implementation that makes minimal use of MPI communication.

In our implementation, each processor stores \(\frac{\mathcal{K}}{\mathcal{P}}\) configurations locally; this number of local configurations must be the same on every processor. At the end of each iteration, each processor identifies its local configuration with the highest energy, and a single MPI_ALLREDUCE call is made to identify which processor stores the highest energy overall. Next, a “sending” processor, chosen at random, sends a cloned configuration to the processor storing the maximum energy; the configuration to send is picked at random from the set of configurations the sender stores. The configuration received is taken to be the “cloned” configuration, and is propagated by the receiving processor for \( \frac{L}{\mathcal{P}} \) steps. Each of the other processors (including the “sending” processor) picks a configuration at random from its local set, and propagates that configuration through \(\frac{L}{\mathcal{P}}\) steps.
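A schematic mpi4py translation of this communication pattern is given below. It is a sketch only: the helper `walk`, the random-number handling, and the treatment of corner cases are assumptions rather than the production implementation, and it relies on mpi4py's object-layer allreduce supporting MPI.MAXLOC on (value, rank) pairs.

```python
import copy
import numpy as np
from mpi4py import MPI

def ns_iteration_mpi(comm, local_samples, local_energies, L, P,
                     shared_rng, local_rng, walk):
    """Communication pattern for one NS iteration, K/P configurations per rank.

    `walk(config, e_lim, n_steps)` stands in for the decorrelation walk;
    `shared_rng` is seeded identically on every rank (so all ranks agree on
    which rank sends), `local_rng` differently on each rank.
    """
    rank = comm.Get_rank()

    # A single allreduce identifies the rank holding the globally highest energy.
    local_max = float(np.max(local_energies))
    e_lim, max_rank = comm.allreduce((local_max, rank), op=MPI.MAXLOC)

    # A randomly chosen "sending" rank ships one of its configurations to the
    # rank holding the maximum; that configuration becomes the clone.
    send_rank = int(shared_rng.integers(P))
    if rank == send_rank and rank != max_rank:
        i = int(local_rng.integers(len(local_samples)))
        comm.send((local_samples[i], local_energies[i]), dest=max_rank, tag=0)

    if rank == max_rank:
        worst = int(np.argmax(local_energies))   # discarded; written as output in the full code
        if send_rank != max_rank:
            clone, clone_e = comm.recv(source=send_rank, tag=0)
        else:                                    # sender and receiver coincide: clone locally
            i = int(local_rng.integers(len(local_samples)))   # (i == worst corner case ignored)
            clone, clone_e = copy.deepcopy(local_samples[i]), local_energies[i]
        local_samples[worst], local_energies[worst] = clone, clone_e
        j = worst                                # the clone is walked on this rank
    else:
        j = int(local_rng.integers(len(local_samples)))   # others walk a random local sample

    # Every rank, including the sender, walks one configuration for L/P steps.
    local_samples[j], local_energies[j] = walk(local_samples[j], e_lim, L // P)
    return e_lim
```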

I performed 8 sets of 16 simulations of 64 Lennard-Jonesium particles, at a reduced pressure \(\log _{10}\left( P^*\equiv \frac{P\sigma ^3}{\epsilon }\right) =-1.194\), with \(\mathcal{K}=640\) and \(L=4160\). Each set of 16 calculations was performed with a different value of \(\mathcal{P}\). The mean run time \(\tau \) in seconds for these calculations is shown in Fig. 10.2. I fitted the mean run times with the function \(\tau = A\times \mathcal{P}^B + C\), and found the fixed run time of the algorithm, C, to be less than \(1{\%}\) of the serial run time A, even for Lennard-Jonesium. The exponent was found to be \(B=-0.9799\pm 0.0050\), extremely close to the value of \(-1\) expected for perfect parallelisation.
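The fit itself is straightforward with SciPy. In the sketch below the run times are synthetic, generated from the quoted fit parameters purely to demonstrate the procedure; in practice `tau` would hold the measured mean run times plotted in Fig. 10.2.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(P, A, B, C):
    """Run-time model tau = A * P**B + C."""
    return A * P**B + C

procs = np.array([1, 2, 4, 8, 16, 32, 64, 128], dtype=float)

# Synthetic run times built from the quoted parameters (A = 8.694e4 s,
# B = -0.98, C << A) with 1% multiplicative noise; NOT measured data.
rng = np.random.default_rng(0)
tau = model(procs, 8.694e4, -0.98, 500.0) * rng.normal(1.0, 0.01, size=procs.size)

(A, B, C), cov = curve_fit(model, procs, tau, p0=(tau[0], -1.0, 0.0))
print(f"A = {A:.4g} s   B = {B:.4f} +/- {np.sqrt(cov[1, 1]):.4f}   C = {C:.4g} s")
```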

Fig. 10.2

Parallelisation over the walk length leads to an almost perfect speed-up. The fixed run time of the algorithm, C, was found to be less than \(1{\%}\) of the serial run time A. This is particularly surprising for Lennard-Jonesium, for which a single energy evaluation is almost as quick as it could possibly be. The exponent B was found to be \(B=-0.9799\pm 0.0050\), which is extremely close to perfect parallelisation

4 Summary

We assessed the performance of two parallelisation schemes for nested sampling: parallelisation over the number of iterations and parallelisation over the walk length, within each iteration.

We found that parallelising over the number of iterations increases the rate at which error is introduced into our estimate of \(\chi \left( U\right) \). This error rate grows approximately exponentially in \(\frac{\mathcal{P}-1}{\mathcal{K}}\) for \(\frac{\mathcal{P}-1}{\mathcal{K}}\le 0.25\), and grows still more rapidly for larger values.

To address this, we proposed a method of parallelising over the walk length, within each iteration. In this algorithm \(\mathcal{P}\) processors decorrelate \(\mathcal{P}\) configurations, each for \(\frac{L}{\mathcal{P}}\) steps. The cloned configuration is always one of the \(\mathcal{P}\) configurations to be decorrelated. We found the variance of \(L'\), the total number of MC steps applied to a sample between when it is created by cloning and when it is eventually written as output, to be \( \mathrm {Var}(L') = L^2\, \frac{\mathcal{P}-1 }{\mathcal{P}} \). For \(\mathcal{P}\gg 1\), the standard deviation of \(L'\) is almost as large as the total walk length \(L\). Nevertheless, we observe empirically that this variability in walk length does not strongly affect the error in the results.

Finally, we gave a detailed account of our implementation, and measured its performance. We performed sixteen calculations at each of \(\mathcal{P}\in \left\{ 2^0, 2^1, 2^2,\dots ,2^7 \right\} \), with a total walk length \(L=4160\) and \(\mathcal{K}=640\). Fitting the mean run times with a function \(\tau = A\times \mathcal{P}^B + C\), we found that the fixed calculation costs C were less than \(1{\%}\) of the serial run time, \(A=8.694\times 10^{4}\mathrm {s} \pm 1.8\times 10^{2}\mathrm {s}\). This is surprising for Lennard-Jonesium, where one might have expected the fixed costs to be a larger proportion of the run time, since the energy evaluations are very fast. We found the exponent to be \(B=-0.9799\pm 0.0050\), which is extremely close to perfect parallelisation.

5 Further Work

It would be enlightening to make a comparison of the two parallelisation schemes, similar to Fig. 10.1. This would require many calculations using the “parallelisation within each iteration” method, at each number of processors, up to at least \(\frac{\mathcal{P}-1}{\mathcal{K}}=0.5\). For each number of processors, the error, relative to serial calculations, would be obtained. These could then be compared to the theoretical curve for the relative error in the method of “parallelisation over the number of iterations”. Repeating this process at a range of pressures would constitute a fair comparison of the two methods.