Abstract
This chapter describes and compares two strategies for parallelising nested sampling.
1 Introduction
We saw in Sect. 8.11 that a nested sampling calculation can require billions of energy evaluations. If we want to apply nested sampling to atomic potentials that are slower to evaluate than Lennard-Jonesium, then it will certainly be necessary to parallelise the algorithm.
In this section we describe two approaches to parallelising nested sampling. The first [1, 2] is to parallelise over the number of iterations performed, by discarding \(\mathcal{P}\) configurations at each iteration rather than just one. At each iteration, \(\mathcal{P}\) samples are then cloned and decorrelated, using \(\mathcal{P}\) processors in parallel. In this way the calculation proceeds through \(\chi \) in larger steps and takes less time to run.
The second approach, developed by Lívia Bartók-Pártay and myself, parallelises within the individual iterations, by parallelising over the walk length \(L\). We detail an efficient algorithm for performing our parallelisation scheme, and evaluate the speedup, finding it to be almost perfect.
2 Parallelising over the Number of Iterations
Unlike standard nested sampling, where \(E_{\mathrm {lim}}\) is updated to the highest energy in our sample set at each iteration, here it is updated to the \(\mathcal{P}^{\mathrm {th}}\) highest energy. We then record the \(\mathcal{P}\) highest energies and their configurations with appropriate weightings [2]. Finally, \(\mathcal{P}\) other configurations are chosen at random from our remaining samples, and the clone-and-decorrelate algorithm (Algorithm 7.3) is applied to those \(\mathcal{P}\) clones in parallel.
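To make the bookkeeping concrete, the following Python sketch performs one such iteration on a toy list of (energy, configuration) pairs. The `decorrelate` argument is a placeholder for the clone-and-decorrelate walk of Algorithm 7.3; in practice the \(\mathcal{P}\) calls below would run on \(\mathcal{P}\) processors simultaneously, and the serial loop here is purely illustrative.

```python
import random

def parallel_ns_iteration(samples, P, decorrelate):
    """One iteration of nested sampling parallelised over iterations.

    `samples` is a list of (energy, config) pairs. `decorrelate` stands in
    for the clone-and-decorrelate walk (Algorithm 7.3); in the real code
    the P calls would execute on P processors in parallel.
    """
    samples = sorted(samples, key=lambda s: s[0])
    live, discarded = samples[:-P], samples[-P:]
    E_lim = discarded[0][0]  # the P-th highest energy overall
    # `discarded` is recorded with appropriate weightings; the live set is
    # refilled by cloning P random survivors and decorrelating below E_lim.
    clones = [decorrelate(random.choice(live), E_lim) for _ in range(P)]
    return live + clones, discarded, E_lim
```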
Recall from Sect. 7.5 that, for standard nested sampling, after \(\mathcal{K}\) iterations \(\log \chi \) is reduced from \(\log \chi _i\) to \(\log \chi _{i+\mathcal{K}}=\log \chi _i - 1\), with standard deviation \(\frac{1}{\sqrt{\mathcal{K}}}\). We will now calculate the standard deviation for the same reduction, \(\log \chi \leftarrow \log \chi -1\), when taking steps of \(\mathcal{P}\) configurations as described above.
The pdf for \(t=\frac{\zeta _i}{\zeta _{i-1}}\), when our live set includes \(\mathcal{K}\) configurations overall, is [3]
\[ P\left( t\right) = \frac{\Gamma \left( \mathcal{K}+1\right) }{\Gamma \left( \mathcal{K}-\mathcal{P}+1\right) \Gamma \left( \mathcal{P}\right) }\, t^{\mathcal{K}-\mathcal{P}}\left( 1-t\right) ^{\mathcal{P}-1}, \]
that is, \(t\) is distributed as the \(\mathcal{P}^{\mathrm {th}}\) largest of \(\mathcal{K}\) independent uniform variates, a Beta\(\left( \mathcal{K}-\mathcal{P}+1,\, \mathcal{P}\right) \) variate.
Evaluating the mean and variance of \(\log t\), we find [4]
\[ \mathbb{E}\left[ \log t\right] = -\sum _{j=\mathcal{K}-\mathcal{P}+1}^{\mathcal{K}} \frac{1}{j}, \qquad \mathrm {Var}\left( \log t\right) = \sum _{j=\mathcal{K}-\mathcal{P}+1}^{\mathcal{K}} \frac{1}{j^{2}}. \]
When parallelising over the number of iterations, \(Y_{\mathrm {lim}}\) is updated to the lowest of the \(\mathcal{P}\) highest enthalpies in our sample set, and the configuration space volume contained by the updated \(Y_{\mathrm {lim}}\) is \(\chi _i\approx \chi _0[(\mathcal{K}-\mathcal{P}+1)/(\mathcal{K}+1)]^i\), where i is the NS iteration number. For \(\mathcal{P}>1\) it is also possible to give analytic estimates of the configuration space volumes contained by the \(\mathcal{P}-1\) higher enthalpy values between \(Y_{\mathrm {lim}}^{\left( i-1\right) }\) and \(Y_{\mathrm {lim}}^{\left( i\right) }\) [2]. Thus one may consider the configurational entropy contained at fractional numbers of NS enthalpy levels.
After a number of enthalpy levels
\[ n_{\Delta } = \left[ \sum _{j=\mathcal{K}-\mathcal{P}+1}^{\mathcal{K}} \frac{1}{j}\right] ^{-1} \]
the expectation of the logarithm of the configuration space volume enclosed by \(Y_{\mathrm {lim}}\) decreases by 1:
\[ \mathbb{E}\left[ \log \chi _{i} - \log \chi _{i+n_{\Delta }}\right] = n_{\Delta }\, \mathbb{E}\left[ -\log t\right] = 1. \]
If we assume that it is possible to draw exact random samples in each iteration, then it can be shown that, after the same number of enthalpy levels \(n_{\Delta }\), the variance of \(\Delta \log \chi =\log \chi _i - \log \chi _{i + n_\Delta }\) is given by
\[ \mathrm {Var}(\Delta \log \chi ) = n_{\Delta }\left[ \psi ^{(1)}\left( \mathcal{K}-\mathcal{P}+1\right) - \psi ^{(1)}\left( \mathcal{K}+1\right) \right] , \qquad \psi ^{(1)}(z) = \frac{\mathrm {d}^{2}}{\mathrm {d}z^{2}}\log \Gamma (z), \]
where \(\Gamma (z)\) is the gamma function. The standard deviation, \(\left[ \mathrm {Var}(\Delta \log \chi )\right] ^{\frac{1}{2}}\), represents the rate at which uncertainty in \(\log \chi \) accumulates during a nested sampling calculation. For a serial calculation (\(\mathcal{P}=1\)), \(n_{\Delta }=\mathcal{K}\) and \(\left[ \mathrm {Var}(\Delta \log \chi )\right] ^{\frac{1}{2}} = \frac{1}{\sqrt{\mathcal{K}}}\).
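These moments can be checked numerically. Assuming, consistently with the estimate \(\chi _i\approx \chi _0[(\mathcal{K}-\mathcal{P}+1)/(\mathcal{K}+1)]^i\), that the shrinkage factor \(t\) is distributed as the \(\mathcal{P}^{\mathrm {th}}\) largest of \(\mathcal{K}\) independent uniform variates, the following sketch compares the sampled mean and variance of \(\log t\) with the analytic harmonic sums:

```python
import math
import random

def logt_moments(K, P):
    """Analytic mean and variance of log t, as harmonic sums."""
    mean = -sum(1.0 / j for j in range(K - P + 1, K + 1))
    var = sum(1.0 / j ** 2 for j in range(K - P + 1, K + 1))
    return mean, var

def sample_logt(K, P, rng):
    """Draw log t as the P-th largest of K uniform variates."""
    u = sorted(rng.random() for _ in range(K))
    return math.log(u[K - P])
```

For \(\mathcal{K}=100\), \(\mathcal{P}=10\), twenty thousand draws reproduce both moments to within sampling error.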
Figure 10.1 shows how the ratio of \(\left[ \mathrm {Var}(\Delta \log \chi )\right] ^{\frac{1}{2}}\) for parallel and serial NS, \(R = \sqrt{\mathcal{K}}\left[ \mathrm {Var}(\Delta \log \chi )\right] ^{\frac{1}{2}}\), depends on \(\frac{\mathcal{P}}{\mathcal{K}}\). \(R\) represents the rate at which uncertainty in \(\log \chi \) accumulates during a parallel calculation, relative to a serial one. One can see that \(\log {R}\) converges for \(\mathcal{K}>10^2\), and that for \(\frac{\mathcal{P}}{\mathcal{K}}\le 0.25\), \( \left[ \mathrm {Var}(\Delta \log \chi )\right] ^{\frac{1}{2}} \sim \exp {\left( 0.28 \frac{\mathcal{P}-1}{\mathcal{K}} \right) } \frac{1}{\sqrt{\mathcal{K}}}\). For larger \(\frac{\mathcal{P}-1}{\mathcal{K}}\), \(R\) increases more rapidly.
3 Parallelising Within Each Iteration
Instead of parallelising over the number of iterations, we propose a method of parallelising within each iteration, with the aim of avoiding the cumulative error shown in Fig. 10.1.
In the serial algorithm, each new clone is decorrelated using a trajectory of walk length \(L\). In our parallelised algorithm, new clones are decorrelated using a trajectory of walk length \(\frac{L}{\mathcal{P}}\), and \(\left( \mathcal{P}-1\right) \) other random configurations are also (independently) propagated through \(\frac{L}{\mathcal{P}}\) steps. Thus, rather than propagating a single configuration for \(L\) steps, we propagate \(\mathcal{P}\) configurations for \(\frac{L}{\mathcal{P}}\) steps.
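A minimal sketch of this scheme, with `walk` standing in as an illustrative placeholder for an MCMC walk below the current energy limit (not the production implementation):

```python
import random

def parallel_iteration_walks(live, clone_idx, L, P, walk):
    """Propagate P configurations for L/P steps each.

    `live` is the list of current configurations, `clone_idx` is the index
    of the freshly cloned configuration, and `walk(config, nsteps)` is a
    placeholder for an MCMC walk. The clone is always among the P
    configurations propagated; the remaining P - 1 are chosen at random.
    """
    steps = L // P
    chosen = {clone_idx}
    while len(chosen) < P:          # assumes P <= len(live)
        chosen.add(random.randrange(len(live)))
    for i in chosen:
        live[i] = walk(live[i], steps)
    return live
```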
It is possible to derive the variance of \(L'\), the total number of MC steps applied to a sample between when it is cloned by copying and when it is eventually written as output:
\[ \mathrm {Var}(L') = \langle L\rangle ^{2}\, \frac{\mathcal{P}-1}{\mathcal{P}}. \]
Unlike the case of parallelising over the number of iterations, there are no analytic results for the error due to the variability of \(L'\); several observations can nevertheless be made. One is that the square root of the variance of \(L'\) (except in the serial case, \(\mathcal{P}=1\)) is almost as large as \(L\) itself. Another is that \(\mathrm {Var}(L')\) scales polynomially with the number of extra parallel tasks \(\mathcal{P}- 1\), unlike the error \(R\), which scales exponentially with the number of extra tasks when parallelising over the number of iterations. It is unclear how this variability in walk length will affect the error in the results, although the empirical observation is that this effect does not appear to be strong. Nevertheless, the serial case results in the lowest uncertainty in estimates of configuration space volumes.
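The variance of \(L'\) can be reproduced with a simple stochastic model (an assumption introduced here for illustration, not a derivation from the text): a sample receives \(\frac{L}{\mathcal{P}}\) steps when cloned, and in each later iteration it is either walked again, with relative probability \(\mathcal{P}-1\), or discarded, with relative probability 1, so the total number of walk sessions is geometric with success probability \(\frac{1}{\mathcal{P}}\).

```python
import random

def sample_Lp(L, P, rng):
    """Draw the total walk length L' of one output sample.

    Model (assumed for illustration): one session of L/P steps at birth,
    then, per subsequent iteration, walked again with probability
    (P-1)/P or discarded with probability 1/P.
    """
    sessions = 1
    while rng.random() > 1.0 / P:   # survives and is walked again
        sessions += 1
    return sessions * (L / P)
```

Under this model \(\mathbb{E}\left[ L'\right] = L\) and \(\mathrm {Var}(L') = L^{2}\frac{\mathcal{P}-1}{\mathcal{P}}\), which also reproduces the serial limit \(\mathrm {Var}(L') = 0\) at \(\mathcal{P}=1\).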
3.1 Details of Our Implementation
The MCMC strategy described in Sect. 8.5 makes abundant use of single atom moves. When computing the change in energy after such a move it is often not necessary to recompute the interaction energies between all atoms. Rather, most of the interaction energies may remain unchanged, such that it is only necessary to compute a few terms. In this case it is efficient to store the individual interaction energies, updating only a few at a time. However, the number of such interaction energies can quickly become very large, and sending large arrays between processors using Message Passing Interface (MPI) can be slow, depending on the network connecting the processors. For this reason we developed an implementation with minimal use of MPI.
In our implementation, each processor stores \(\frac{\mathcal{K}}{\mathcal{P}}\) configurations locally. This number of local configurations should be the same for each processor. At the end of each iteration, each processor identifies the local configuration with highest energy, and a single MPI_ALLREDUCE call is made to identify which processor stores the highest energy value overall. Next, a specific processor sends a cloned configuration to the processor storing the maximum energy. The “sending” processor is chosen at random, and picks the configuration to send at random from the set of configurations it stores. The configuration received is taken to be the “cloned” configuration, and will be propagated by the receiving processor, using \( \frac{L}{\mathcal{P}} \) steps. Each of the other processors (including the “sending” processor) picks a configuration at random from its local set, and propagates that configuration through \(\frac{L}{\mathcal{P}}\) steps.
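The communication pattern can be mocked up serially in Python; the ranks become lists in a single process, the single MPI_ALLREDUCE is replaced by an argmax over the local maxima, and the point-to-point send becomes plain indexing (an illustrative sketch only, not our production code):

```python
import random

def ns_step_distributed(local_sets, L, walk):
    """Serial mock-up of one iteration of the distributed scheme.

    `local_sets[r]` holds the K/P (energy, config) pairs stored by rank r.
    `walk(config, nsteps)` is a placeholder for an MCMC walk.
    """
    P = len(local_sets)
    steps = L // P
    # Each rank finds its local maximum; the reduction picks the global one.
    local_max = [max(range(len(s)), key=lambda i: s[i][0]) for s in local_sets]
    winner = max(range(P), key=lambda r: local_sets[r][local_max[r]][0])
    # A random rank sends one of its configurations, chosen at random, to
    # the winner; it overwrites the discarded maximum (which the real code
    # records with its weight) and is walked for L/P steps on arrival.
    sender = random.randrange(P)
    clone = random.choice(local_sets[sender])
    local_sets[winner][local_max[winner]] = walk(clone, steps)
    # Every other rank, the sender included, walks one random local sample.
    for r in range(P):
        if r != winner:
            i = random.randrange(len(local_sets[r]))
            local_sets[r][i] = walk(local_sets[r][i], steps)
    return local_sets
```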
I performed 8 sets of 16 simulations of 64 Lennard-Jonesium particles, at a reduced pressure \(\log _{10}\left( P^*\equiv \frac{P\sigma ^3}{\epsilon }\right) =-1.194\), with \(\mathcal{K}=640\) and \(L=4160\). Each set of 16 calculations was performed with a different value of \(\mathcal{P}\). The mean run time \(\tau \) in seconds for these calculations is shown in Fig. 10.2. I fitted the average run times with a function \(\tau = A\times \mathcal{P}^B + C\), and found the fixed run time of the algorithm C to be less than \(1{\%}\) of the serial run time A, even for Lennard-Jonesium. The exponent B was found to be \(B=-0.9799\pm 0.0050\), which is extremely close to perfect parallelisation.
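For reference, a small helper evaluating the fitted model with the quoted values of \(A\) and \(B\) (taking \(C=0\), since \(C\) was found to be under \(1\%\) of \(A\)) gives the implied parallel efficiency:

```python
def predicted_runtime(P, A=8.694e4, B=-0.9799, C=0.0):
    """Fitted run-time model tau = A * P**B + C (values quoted in the
    text; C approximated as zero since it was under 1% of A)."""
    return A * P ** B + C

def parallel_efficiency(P):
    """Serial run time divided by (P x parallel run time)."""
    return predicted_runtime(1) / (P * predicted_runtime(P))
```

At \(\mathcal{P}=128\), the model gives an efficiency of about 0.91.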
4 Summary
We assessed the performance of two parallelisation schemes for nested sampling: parallelisation over the number of iterations and parallelisation over the walk length, within each iteration.
We found that parallelising over the number of iterations increases the rate at which error is introduced into our estimate of \(\chi \left( U\right) \). This error rate grows approximately exponentially in \(\frac{\mathcal{P}-1}{\mathcal{K}}\) for \(\frac{\mathcal{P}-1}{\mathcal{K}}\le 0.25\), and much faster for larger values.
To address this, we proposed a method of parallelising over the walk length, within each iteration. In this algorithm \(\mathcal{P}\) processors decorrelate \(\mathcal{P}\) configurations, each for \(\frac{L}{\mathcal{P}}\) steps. The cloned configuration is always one of the \(\mathcal{P}\) configurations to be decorrelated. We found the variance of the total number of MC steps applied to a sample between when it is cloned by copying and when it is eventually written as output, \(L'\), to be \( \mathrm {Var}(L') = \langle L\rangle ^2\, \frac{\mathcal{P}-1 }{\mathcal{P}} \). For \(\mathcal{P}\gg 1\), the standard deviation of \(L'\) is almost as large as the total walk length \(L\). Nevertheless, we observe empirically that this variability in walk length does not strongly affect the error in the results.
Finally, we gave a detailed account of our implementation, and measured its performance. We performed sixteen calculations at each of \(\mathcal{P}\in \left\{ 2^0, 2^1, 2^2,\dots ,2^7 \right\} \), with a total walk length \(L=4160\) and \(\mathcal{K}=640\). Fitting the mean run times with a function \(\tau = A\times \mathcal{P}^B + C\), we found that the fixed calculation costs C were less than \(1{\%}\) of the serial run time, \(A=8.694\times 10^{4}\mathrm {s} \pm 1.8\times 10^{2}\mathrm {s}\). This is surprising for Lennard-Jonesium, where one might have expected the fixed costs to be a larger proportion of the run time, since the energy evaluations are very fast. We found the exponent to be \(B=-0.9799\pm 0.0050\), which is extremely close to perfect parallelisation.
5 Further Work
It would be enlightening to make a comparison of the two parallelisation schemes, similar to Fig. 10.1. This would require many calculations using the “parallelisation within each iteration” method, at each number of processors, up to at least \(\frac{\mathcal{P}-1}{\mathcal{K}}=0.5\). For each number of processors, the error, relative to serial calculations, would be obtained. These could then be compared to the theoretical curve for the relative error in the method of “parallelisation over the number of iterations”. Repeating this process at a range of pressures would constitute a fair comparison of the two methods.
References
1. N.S. Burkoff, C. Várnai, S.A. Wells, D.L. Wild, Exploring the energy landscapes of protein folding with nested sampling. Biophys. J. 102, 878 (2012)
2. S. Martiniani, J.D. Stevenson, D.J. Wales, D. Frenkel, Superposition enhanced nested sampling. Phys. Rev. X 4, 031034 (2014)
3. S. Martiniani, Master's thesis (unpublished), Cavendish Laboratory, University of Cambridge, 2013
4. Wolfram Research, Inc., Mathematica, Version 10.0 (Champaign, IL, 2014)
© 2017 Springer International Publishing AG
Baldock, R.J.N. (2017). Parallelising Nested Sampling. In: Classical Statistical Mechanics with Nested Sampling. Springer Theses. Springer, Cham. https://doi.org/10.1007/978-3-319-66769-0_10
Print ISBN: 978-3-319-66768-3
Online ISBN: 978-3-319-66769-0