Introduction

The Ising model [1] gives a microscopic description of ferromagnetism, which arises from the interaction between the spins of electrons in a crystal. The particles are assumed to be fixed at the sites of a lattice, and the spin is treated as a scalar quantity that can take only the two values \(+1\) and \(-1\). The model is a simple statistical one that exhibits a phase transition between a high-temperature paramagnetic phase and a low-temperature ferromagnetic phase at a specific temperature; the symmetry between up and down spins is spontaneously broken when the temperature drops below this critical temperature. However, the one-dimensional Ising model, which has been solved exactly, shows no phase transition. The two-dimensional Ising model has been solved analytically with zero [2] and nonzero [3] external field. In spite of many attempts to solve the 3D Ising model, it has never been solved exactly; all results for the three-dimensional Ising model have been obtained with approximation approaches and Monte Carlo methods.

Monte Carlo methods, or statistical simulation methods, are widely used in different fields of science such as physics, chemistry, biology, computational finance and even newer fields like econophysics [4,5,6,7,8,9]. The simulation proceeds by sampling from a probability density function using uniformly generated random numbers. Simulating the Ising model on large lattices is computationally expensive. One way to reduce the cost is to design faster algorithms; the Swendsen-Wang and Wolff algorithms [10, 11] and multi-spin coding methods [12,13,14] are examples of such methods. Another way is to parallelize the model and execute it on GPUs, GPU clusters and cluster computers [15,16,17,18,19,20,21,22,23,24,25,26,27].

In this paper, we present a parallel algorithm to simulate the 2D Ising model using the Monte Carlo method. We then run the algorithm on a cluster computer using the C++ programming language and MPI. The Message Passing Interface (MPI) is a widely used programming model in HPC systems [28,29,30,31,32,33,34]; it was designed for distributed-memory architectures, and its processes communicate through message passing. MPI provides functions that allow two specified processes to exchange data by sending and receiving messages. To achieve high efficiency, it is necessary to have good load balancing and to minimize communication between processes.

In our algorithm, each process creates its own sub-lattice, initializes it, performs all Monte Carlo iterations and calculates the energy of the sub-lattice for a specific temperature. During the run, each process communicates with its two neighbor processes to exchange the boundary spin variables. Finally, the total energy of the lattice is calculated by a map-reduce operation. Since the multi-spin coding technique stores each spin in 3 bits and packs many spins into a single memory word, inter-process communication is reduced considerably. Because the computational load of one sub-lattice is assigned to each process and all sub-lattices have the same size, the load is well balanced. Since each process, independent of the total number of processes, communicates only with its two neighbor processes and the lattice is decomposed into sub-lattices, the algorithm scales well.

This paper is organized as follows. In “Metropolis algorithm and Ising model” section, the Metropolis algorithm and the Ising model are reviewed briefly. In “Multi-spin coding method” section, we explain how the multi-spin coding method is used to calculate the interaction energy between a specific spin and its nearest neighbors, and we also discuss the boundary conditions in the memory-word lattice. Details of the parallelization of the algorithm are discussed in “Parallelization” section and the method of implementation is given in “Implementation” section. Finally, the results are given in “Results” section.

Metropolis algorithm and Ising model

The Ising model consists of spin variables which take the values \(+1\) or \(-1\) and are arranged on a one-, two- or three-dimensional lattice. Each spin interacts with its neighbors, and the interaction is given by the Hamiltonian:

$$\begin{aligned} \mathcal{H}=-J\sum _{\langle m,n\rangle} s_m s_n, \end{aligned}$$
(1)

where J is the coupling coefficient. The summation in Eq. (1) is taken over the nearest-neighbor pairs \(\langle m,n\rangle \). Periodic boundary conditions are used, so that spins on one edge of the lattice are neighbors of the spins on the opposite edge. In this paper, we focus on the simulation of the 2D square Ising model using the Metropolis Monte Carlo algorithm [35]. The lattice is initialized randomly and is updated as follows:

  1. Select a spin \(s_{i,j}\) at random and calculate the interaction energy E between this spin and its nearest neighbors.

  2. Flip the spin \(s_{i,j}\) to \(s^{\prime }_{i,j}\) and calculate the interaction energy \(E^{\prime }\) again.

  3. Compute \(\Delta E=E^{\prime }-E\). If \(\Delta E\le 0\), \(s^{\prime }_{i,j}\) is accepted; otherwise, \(s^{\prime }_{i,j}\) is accepted with the probability \(e^{-\Delta E/{KT}}\), where K is the Boltzmann constant and T is the temperature.

  4. Repeat steps 1–3 until every spin has had a chance to be flipped.

  5. Calculate the total energy of the lattice for the ith iteration, \(E_{\mathrm{total}}^i\).

The steps above form one Monte Carlo iteration. We perform a sufficient number of iterations (N times) and finally average over \(E_{\mathrm{total}}^i\) to obtain \(E_{\mathrm{total}}\):

$$\begin{aligned} E_{\mathrm{total}}=\frac{1}{N}\sum _{i=1}^{N}E_{\mathrm{total}}^i. \end{aligned}$$
(2)
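For concreteness, a minimal C++ sketch of a single Metropolis update (steps 1–3 above) on a plain, non-multi-spin-coded \(L\times L\) lattice is given below; the array s, the lattice size L and the random-number handling are assumptions of this illustration only, not part of our implementation.

    #include <cmath>
    #include <random>
    #include <vector>

    // One Metropolis update on a plain L x L lattice of +1/-1 spins (illustrative sketch).
    void metropolis_step(std::vector<std::vector<int>>& s, int L, double J, double KT,
                         std::mt19937& gen) {
        std::uniform_int_distribution<int> pick(0, L - 1);
        std::uniform_real_distribution<double> uni(0.0, 1.0);
        int i = pick(gen), j = pick(gen);                      // step 1: select a spin at random
        int nn = s[(i + 1) % L][j] + s[(i + L - 1) % L][j]     // periodic nearest neighbors
               + s[i][(j + 1) % L] + s[i][(j + L - 1) % L];
        double dE = 2.0 * J * s[i][j] * nn;                    // steps 2-3: energy change of a flip
        if (dE <= 0.0 || uni(gen) < std::exp(-dE / KT))        // accept with probability exp(-dE/KT)
            s[i][j] = -s[i][j];
    }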

Multi-spin coding method

Multi-spin coding refers to all techniques that store and process multiple spins in one memory word. In this paper, we apply the multi-spin coding technique to the 2D Ising model. In general, the multi-spin coding technique results in a faster algorithm as a consequence of updating multiple spins simultaneously. However, we mainly employ this technique to reduce inter-process communication.

We apply the multi-spin coding introduced in Ref. [12]. However, in our implementation the size of a memory word is 64 bits, in contrast to Jacobs’s 60-bit memory word. Each spin is stored in three consecutive bits, and the value of the 64th bit is always set to zero; 000 represents spin down and 001 represents spin up. Since a memory word contains 21 spins, the size of the lattice is taken to be \(21N \times 21N\), where N is an integer greater than one. Now, we need to convert the spin lattice (Fig. 1a) into a lattice of memory words (Fig. 1b). The size of the memory-word lattice is therefore \(N \times 21N\). Each column of the spin lattice is coded into the same column of the memory-word lattice, so the 21N spins in one column of the spin lattice are arranged in the N memory words of that column of the memory-word lattice as follows:

$$\begin{array}{ll}S(0,J):&\quad s_{0,j}, s_{N,j}, s_{2N,j}, \ldots , s_{19N,j}, s_{20N,j}\\ S(1,J):&\quad s_{1,j}, s_{N+1,j}, s_{2N+1,j}, \ldots , s_{19N+1,j}, s_{20N+1,j}\\ \vdots&\quad \vdots \\ S(N-1,J):&\quad s_{N-1,j}, s_{2N-1,j}, s_{3N-1,j}, \ldots , s_{20N-1,j}, s_{21N-1,j},\end{array}$$

where S(I, J) represents the memory word at row I and column J, \(0 \le I \le N-1\), \(0 \le J \le 21N-1\), and \(s_{i,j}\) denotes the spin located at row i and column j, with \(j=J\). The advantage of this arrangement is that each spin is placed in the position corresponding to its neighbors: consider the kth spin in a given memory word S(I, J); its right/left/top/down neighbor in the spin lattice is exactly the kth spin of the right/left/top/down neighbor of the memory word S(I, J) in the memory-word lattice.
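To make the packing concrete, the following minimal C++ sketch writes and reads a single spin of a 64-bit memory word, using the convention above (3 bits per spin, 000 for down, 001 for up); the helper names are assumptions of this example.

    #include <cstdint>

    // Each memory word holds 21 spins, 3 bits per spin; the 64th bit stays zero.
    inline void set_spin(uint64_t& word, int k, int up) {        // k in [0, 20], up is 0 or 1
        word &= ~(0x7ULL << (3 * k));                             // clear the 3-bit group
        word |= (static_cast<uint64_t>(up) << (3 * k));           // write 000 (down) or 001 (up)
    }

    inline int get_spin(uint64_t word, int k) {
        return static_cast<int>((word >> (3 * k)) & 0x7ULL);      // returns 0 (down) or 1 (up)
    }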

Fig. 1 Arranging spins in memory words. a Spin lattice, b memory-word lattice

In order to apply periodic boundary conditions to the memory words in the first and last rows (Fig. 2a), we need to adjust the up and down neighbors in advance. In fact, the down neighbor of \(S(N-1,J)\) is not exactly S(0, J), and the up neighbor of S(0, J) is not exactly \(S(N-1,J)\). For a memory word in the first (last) row, its up (down) neighbor, which is the memory word in the last (first) row of the same column, has to be shifted 3 bits to the right (left). These two cases are shown in diagrams (b) and (c) of Fig. 2. Recall that the 64th bit is always set to zero.

Fig. 2 a Memory words in the first and last rows of a memory-word lattice, b up neighbor of S(0, J) formed by shifting three bits of \(S(N-1,J)\) to the right, c down neighbor of \(S(N-1,J)\) formed by shifting three bits of S(0, J) to the left
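A hedged sketch of one way to realize the 3-bit shift for these boundary neighbors is given below; it performs a circular rotation of the 21 3-bit groups (a plain shift with the group that falls off reinserted at the other end), which is only our reading of the description, and the exact treatment of the wrapped group should be taken from Fig. 2.

    #include <cstdint>

    // Rotate the 21 3-bit spin groups of a memory word by one group position.
    // This is one plausible reading of the "shift by 3 bits" used for the
    // periodic up/down neighbors of the first and last rows; bit 64 stays zero.
    inline uint64_t rotate_groups_right(uint64_t w) {             // toward lower bit positions
        uint64_t low = w & 0x7ULL;                                 // group 0 wraps around
        return (w >> 3) | (low << 60);                             // reinsert it as group 20
    }

    inline uint64_t rotate_groups_left(uint64_t w) {              // toward higher bit positions
        uint64_t high = (w >> 60) & 0x7ULL;                        // group 20 wraps around
        return ((w << 3) & 0x7FFFFFFFFFFFFFFFULL) | high;          // keep the 64th bit zero
    }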

Calculation of energy

Now, using the multi-spin coding method, we show how to calculate the energy difference (\(\Delta E\)) between two configurations of the Ising model. To better understand the process, first consider two 3-bit spins \(s_1\) and \(s_2\): \(s_1\ \hbox {XOR}\ s_2\) produces 000 when the two spins point in the same direction and 001 when they point in opposite directions. Hence, for a given memory word S(I, J), the expression

$$\begin{aligned}&(S(I,J)\ \hbox {XOR}\ S(I-1,J)) + (S(I,J)\ \hbox {XOR}\ S(I+1,J)) \\ &\quad +\,(S(I,J)\ \hbox {XOR}\ S(I,J-1)) + (S(I,J)\ \hbox {XOR}\ S(I,J+1)), \end{aligned}$$
(3)

generates a value in the range [0, 4] for every 3-bit group, given in the last column of Table 1. In the second and third columns, we have considered the different cases that might occur between a selected spin and its four neighbors. The initial interaction energy E and the energy \(E^\prime \) calculated after flipping the selected spin are presented in the fourth and fifth columns, respectively.

Table 1 Different configurations which might happen between a selected spin and its four nearest neighbors
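As an illustration of how expression (3) translates into an energy change, the sketch below extracts, for one spin, the number n of anti-parallel neighbors (0 to 4); for the Hamiltonian of Eq. (1), \(E=(2n-4)J\) and \(\Delta E=(8-4n)J\). The function signature and variable names are assumptions of this example, and the shifted boundary neighbors described above are presumed to have been prepared already.

    #include <cstdint>

    // For the kth spin of memory word s and its four neighbor words, return the
    // energy change dE if that spin is flipped.
    // n = number of anti-parallel neighbors, E = (2n - 4) J, dE = (8 - 4n) J.
    double delta_E(uint64_t s, uint64_t up, uint64_t down,
                   uint64_t left, uint64_t right, int k, double J) {
        uint64_t sum = (s ^ up) + (s ^ down) + (s ^ left) + (s ^ right);   // expression (3)
        int n = static_cast<int>((sum >> (3 * k)) & 0x7ULL);               // 0 .. 4
        return (8 - 4 * n) * J;
    }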

Parallelization

In a Metropolis Monte Carlo iteration, each memory word is updated at least once, and the iterations must be repeated enough times to yield an accurate energy. The lattice is divided vertically into \(N_p\) sub-lattices of equal size, where \(N_p\) is the number of processes, and the computational load of the sub-lattices is assigned to processes 0 to \(N_p-1\) from left to right. Each process creates a sub-lattice of the specified size, initializes it, performs all Monte Carlo iterations and calculates the energy of the sub-lattice using Eq. (2). When all processes have calculated the energy of their own sub-lattice, these energies are added up, through a map-reduce operation, to obtain the total energy of the lattice. However, this approach raises two problems. First, as illustrated in Fig. 3, half of the neighbors of the memory words on the border are placed in the sub-lattice of the neighbor process; therefore, to calculate the energy of the border memory words, some memory words of the adjacent sub-lattice are needed and have to be obtained when required. Second, neighboring memory words must not be updated simultaneously by different processes.

To deal with the second problem, we propose a method in which the memory words are updated in two phases (see Fig. 3). In each phase, half of the sub-lattice is updated while the other half stays unchanged, i.e., in phase 1 (2) the left (right) half of the sub-lattice is updated. After phase 1 (2) is done, each process passes to phase 2 (1) only when its right (left) neighbor process has completed phase 1 (2). To illustrate, consider the three consecutive processes in Fig. 3, which are updating the left halves of their sub-lattices in phase 1. Process p2 has updated the left half of its sub-lattice and is ready to start phase 2 to update the right half. However, it cannot enter phase 2 until process p3 has completed phase 1 and finished updating its left half. Thus, the memory words on the border between p2 and p3, marked with \(\times \) in Fig. 3, are not updated simultaneously. In the same manner, the memory words on the border between p1 and p2, marked with + in Fig. 3, are not updated at the same time.

Fig. 3 Sub-lattices of the memory words belonging to three consecutive processes. The memory words on the border of two different sub-lattices which interact with each other are marked the same. Each half of the memory-word lattice is updated in phase 1 or 2; therefore, the border memory words are not updated simultaneously

Now, we turn to the first problem. As mentioned before, in each phase half of a sub-lattice is updated. Before a process starts updating that half, it must receive the corresponding border memory words from the neighbor process. Suppose that process p2 is going to update the left half of its sub-lattice in phase 1: it waits to receive the right-side border memory words of process p1. Process p1 sends its right border memory words to p2 asynchronously just after it completes phase 2 of the previous iteration. After p2 has received the border memory words from p1 synchronously, it starts updating the memory words of phase 1. Just after finishing phase 1, p2 sends its updated left-side border memory words to p1 asynchronously and proceeds to phase 2. A similar procedure takes place for the other processes. It should be mentioned that we use periodic boundary conditions, so the left neighbor of the first process is the last process and, likewise, the right neighbor of the last process is the first process.
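A hedged C++/MPI sketch of the phase-1 exchange just described is given below: the border column is sent asynchronously to the right neighbor and the left neighbor's column is received synchronously. The variable names, the message tag and the buffer layout are assumptions of this example, not the actual implementation.

    #include <mpi.h>
    #include <cstdint>
    #include <vector>

    // Phase-1 halo exchange: send the rightmost column to the right neighbor
    // asynchronously, then block until the left neighbor's border column arrives.
    void exchange_phase1(std::vector<uint64_t>& right_col,   // column to send (N words)
                         std::vector<uint64_t>& left_halo,   // buffer for the received column
                         int rank, int Np, MPI_Comm comm) {
        int right = (rank + 1) % Np;                          // periodic ring of processes
        int left  = (rank - 1 + Np) % Np;
        MPI_Request req;
        MPI_Isend(right_col.data(), static_cast<int>(right_col.size()), MPI_UINT64_T,
                  right, 0, comm, &req);                      // asynchronous send
        MPI_Recv(left_halo.data(), static_cast<int>(left_halo.size()), MPI_UINT64_T,
                 left, 0, comm, MPI_STATUS_IGNORE);           // synchronous receive
        MPI_Wait(&req, MPI_STATUS_IGNORE);                    // complete the send before reusing the buffer
        // ... the left half of the sub-lattice can now be updated ...
    }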

Implementation

In this section, we describe the implementation details of the algorithm presented in the previous section. Consider a memory-word lattice of size \(N\times 21N\), where N is an arbitrary integer greater than one. We execute the algorithm on \(N_p\) processes, and each process is identified by an integer between 0 and \(N_p -1\) called its rank. Each process is responsible for \(N_c\) columns of the memory-word lattice, where \(N_c=\frac{21N}{N_p}\). Each process creates its own sub-lattice, initializes it, performs all Monte Carlo iterations and calculates the energy of the sub-lattice for a specific temperature. Within each Monte Carlo iteration, the sub-lattice is updated many times and the energy of the iteration is calculated. Finally, the total energy of the memory-word lattice for a specific temperature is obtained via a reduce operation. This procedure is illustrated in Fig. 4. Now, every step of the algorithm is studied in detail.

Fig. 4 Function of our implemented program. Left and right arrows denote inter-process communications

Initialization

Now, we explain how a process creates its own sub-lattice and initializes it randomly. We use a 64-bit long integer as a memory word to store 21 spins. A 2D array of \(N\times N_c\) long integers forms the sub-lattice of the process, where \(N_c=\frac{21N}{N_p}\). However, we actually allocate an array with two additional columns (\(N\times (N_c+2)\)) reserved for the border memory words of the neighbor processes (see Fig. 5a); in this paper, we call this array the S-lattice. Each spin in the memory words of the S-lattice, except for the first and last columns, is initialized randomly with 0 or 1, representing spin down and up, respectively.
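A minimal sketch of this allocation and random initialization is given below, under the assumption that the S-lattice is stored as a vector of rows of 64-bit words and that a standard Mersenne-Twister generator is used; the names S, N and Nc are illustrative only.

    #include <cstdint>
    #include <random>
    #include <vector>

    // Build the S-lattice: N rows, Nc + 2 columns (columns 0 and Nc + 1 are halo columns).
    std::vector<std::vector<uint64_t>> make_s_lattice(int N, int Nc, std::mt19937_64& gen) {
        std::vector<std::vector<uint64_t>> S(N, std::vector<uint64_t>(Nc + 2, 0));
        std::uniform_int_distribution<int> coin(0, 1);
        for (int i = 0; i < N; ++i)
            for (int j = 1; j <= Nc; ++j)                     // skip the two halo columns
                for (int k = 0; k < 21; ++k)                  // 21 spins per word, 3 bits each
                    S[i][j] |= static_cast<uint64_t>(coin(gen)) << (3 * k);
        return S;
    }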

Fig. 5 a S-lattice of each process, b communication between processes before updating the memory words in phase 1, c updating memory words in phase 1, d communication between processes before updating memory words in phase 2, e updating the memory words in phase 2

Updating

As mentioned before, the updating process is done in two phases, and in each phase one half of the sub-lattice is updated. In cases where \(N_c\) is odd, \(\hbox {floor}(N_c/2)\) columns are updated in phase 1 and the remaining columns are updated in phase 2. Before the update starts in each phase, some inter-process communication has to be carried out.

At first, each process sends its border memory words to its neighbor process asynchronously. Then, it waits to receive the border memory words from its neighbor process. Once the required border memory words have been received, the process can complete the phase by repeatedly updating the memory words that belong to that phase. In phase 1, in which the left half of the sub-lattice is updated, each process sends the rightmost column of its sub-lattice to its right neighbor (Fig. 5b). The destination of the sent memory words is therefore determined by the following code:

figure a
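The listing itself is rendered as a figure in the original; judging from the sentence that follows, it plausibly amounts to a single modular expression (variable names assumed):

    // Right neighbor on the periodic ring of processes (assumed names).
    int dest = (rank + 1) % Np;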

It means that, due to the periodic boundary conditions, the right neighbor of the process with rank \(N_p-1\) is process 0. Likewise, the source process from which the border memory words are received is determined by the following code:

figure b
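Again not reproduced in the original; a plausible mirror-image sketch is:

    // Left neighbor on the periodic ring of processes (assumed names).
    int source = (rank - 1 + Np) % Np;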

which means that the left neighbor of the process with rank 0 is the process \(N_p-1\).

The received column of memory words is stored in the first column of the S-lattice, which has been reserved for the border memory words of the neighbor process. When the border memory words have been received, the left half of the sub-lattice is updated (Fig. 5c). Likewise, the destination and the source in phase 2 are determined by the following code (Fig. 5d):

figure c
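As before, the original listing is a figure; a plausible sketch for phase 2, in which the border column travels in the opposite direction, is:

    // Phase 2: send the leftmost column to the left neighbor and
    // receive the border column from the right neighbor (assumed names).
    int dest   = (rank - 1 + Np) % Np;
    int source = (rank + 1) % Np;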

The received column of memory words is stored in the last column of the S-lattice, which has been reserved for the border memory words of the neighbor process (Fig. 5e).

Calculating the Energy of a Monte Carlo Iteration

In order to obtain the total energy of the lattice, the energy of all nearest-neighbor pairs must be counted. It suffices, however, to consider for each memory word only the interaction energy with its right and down neighbors; summed over all memory words, this covers every pair exactly once. Each process calculates this energy for every memory word in the S-lattice except for the first and last columns. Notice that the last column of the S-lattice contains the copy of the border memory words of the right process (Fig. 5e); since these border memory words are not used until they are sent, the copy is still valid and is used as the right neighbor of the last column of the sub-lattice. The code in Listing 1 shows how the interaction energy between a specific memory word and its right and down neighbors is calculated. In line 3, the outcome of the expression on the right-hand side of the assignment is a memory word consisting of 21 3-bit groups. Every group contains a number between 0 and 2, i.e., 0 represents \(-2J\), 1 represents 0 and 2 represents \(+2J\); each 3-bit group thus holds the sum of the interaction energies of one spin with its right and down neighbors. The for loop iterates over the 3-bit groups of E; in each iteration, the value of one group is extracted and added to rv, so that rv finally contains the total energy of the 21 3-bit groups.

figure d
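Listing 1 is rendered as a figure in the original; a hedged C++ reconstruction consistent with the description above is sketched below. The function and variable names are assumptions, and the final conversion of rv to physical units is this sketch's own convention (group value g corresponds to \((g-1)\cdot 2J\)).

    #include <cstdint>

    // Pair energy of one memory word with its right and down neighbor words.
    // Each 3-bit group of E holds 0, 1 or 2 (i.e. -2J, 0 or +2J for one spin).
    double word_pair_energy(uint64_t word, uint64_t right, uint64_t down, double J) {
        uint64_t E = (word ^ right) + (word ^ down);          // "line 3" of Listing 1
        int rv = 0;
        for (int k = 0; k < 21; ++k)                          // loop over the 21 3-bit groups
            rv += static_cast<int>((E >> (3 * k)) & 0x7ULL);  // extract one group and accumulate
        return (rv - 21) * 2.0 * J;                           // group value g maps to (g - 1) * 2J
    }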

Results

We have executed our program on a part of the computer cluster of the Plasma Physics Research Center, which includes 16 nodes connected by a switch. Each node is equipped with two Intel Xeon X5365 CPUs. We have used up to 9 nodes to test our program. Three test cases with different numbers of iterations are presented in Table 2.

Table 2 Three different cases which have been examined in this paper

The measured speedup and efficiency versus the number of cores are shown in Figs. 6 and 7, respectively. As shown, for all three test cases, the efficiency decreases as the number of cores and nodes increases. In particular, when one more node is added, the efficiency drops considerably. This drop is due to the fact that the overhead of communication between processes on different nodes is higher than that between processes on the same node.

Fig. 6 Parallel speedup versus the number of cores for the test cases presented in Table 2

Fig. 7 Parallel efficiency versus the number of cores for the test cases presented in Table 2

Now, we can inspect the impact of the lattice dimension and the number of Monte Carlo iterations on the performance of our algorithm. Comparing test cases 2 and 3, we infer that larger lattice sizes yield better speedup and efficiency. In addition, comparing test cases 1 and 2, we see that better speedup and efficiency are obtained when the number of Monte Carlo iterations increases. Therefore, our algorithm performs better for larger lattice sizes and more Monte Carlo iterations.