Algorithm and Data Structures for Xeon Phi
Abstract
Algorithms and data structures appropriate for Xeon Phi are active fields of research and deserve a book on their own. This chapter will touch only on some common algorithm and data structure optimization techniques that I have found useful for common technical computing applications. These algorithms will definitely evolve as we gain more experience with the hardware. This chapter does not derive the algorithms but rather focuses on optimization techniques to achieve good performance on Xeon Phi. For example, I assume familiarity with Monte Carlo simulation techniques and the algorithms used in financial applications and instead focus on those components of the algorithms that be optimized to make the most effective use of Xeon Phi architecture capabilities.
Keywords
Sorting AliasingAlgorithms and data structures appropriate for Xeon Phi are active fields of research and deserve a book on their own. This chapter will touch only on some common algorithm and data structure optimization techniques that I have found useful for common technical computing applications. These algorithms will definitely evolve as we gain more experience with the hardware. This chapter does not derive the algorithms but rather focuses on optimization techniques to achieve good performance on Xeon Phi. For example, I assume familiarity with Monte Carlo simulation techniques and the algorithms used in financial applications and instead focus on those components of the algorithms that be optimized to make the most effective use of Xeon Phi architecture capabilities.

Xeon Phi has a much higher number of threads (240) compared with the 24 threads supported on the typical Intel Xeon processor core. Moreover, the singlethreaded performance of Xeon Phi is much slower than that of the Xeon processor. As a result of these two characteristics of the Xeon Phi coprocessor, the part of an algorithm that depends on singlethreaded performance will be adversely affected and need modification. The low serial performance causes thread synchronization primitives to take on a higher percentage of the overall runtime and may require a fundamental change in the algorithm to reduce the parallel overhead.

Xeon Phi has a larger vector length than Xeon: 512 bits vs. 256 bits. Loops that benefit from a 256bit vector length may not perform well on 512 bits due to insufficient loop counts.

The total flops achievable on the 60core Xeon Phi coprocessor is 2 petaflops (PF) single precision vs. ∼1 teraflops (TF) on a twosocket Xeon E52670 processor running at 2.6 GHz. This is achieved by using all the cores and all the vector operations efficiently. On the other hand, the practical memory bandwidth per core on Xeon Phi is less than that on Xeon. The lower flops/byte supported by the Xeon Phi coprocessor compared to the Xeon processor may require different algorithms to solve the same problem.

The amount of memory per core, ∼266 MB per Xeon Phi core vs. 4 GB per core on a 24core Xeon twosocket system with 96 GB memory, will in many cases require changes in the algorithm to accommodate core scheduling changes to handle a large number of tasks with a small work size per task.
These data points show that the only way to achieve better performance on Xeon Phi is to have the algorithms exploit the massive flops available on the Xeon Phi and mostly running from cache. The Xeon Phi coprocessor will also need to be paired with the Xeon processor capabilities of large memory and high singlethreaded performance to achieve this goal. Since the vector sizes are wide, the data to be vectorized will need to be wide enough to make use of the vector capabilities. Key to achieving high performance on Xeon Phi architecture is implementing an efficient algorithm paired with an appropriate workload size.
Algorithm and Data Structure Design Rules for Xeon Phi

Rule 1. Scalable parallelization: The code should be able to scale to all the cores and preferably all the threads. Not only will you need to vectorize at the thread level, but you must also scale the algorithm to a large number of threads.
 Rule 2. Efficient vectorization: The code should be able to efficiently use 512bit vectors on Xeon Phi. This requires that the input dataset be large enough to provide enough work for the coprocessor such that:

The input array size is long enough.

The working dataset residing on memory is properly aligned for optimal vector load and store.

The working dataset accessed is coalesced so that, when filling the cache line, the consecutive bytes come from consecutive regions rather than needing to be gathered from multiple data locations.


Rule 3. Optimal cache reuse: Maximum reuse of the cache should overcome the memory bandwidth limitation and increase sustainable flops/byte.
Rules 1 and 2 require that the problem size match the machine size, which is defined as the (vector length × number of cores). For a 60core Xeon Phi, the machine size is (64 bytes × 60 cores) = 3.8 kB. To fully utilize the hardware, the algorithm will need at least 3.8 kB of data to work on.
 1.
Divide the data equally among the cores and threads within a core.
 2.
Divide the data so that each portion of working dataset can fit in the L2 cache of each core.
 3.
Work on the data using vector primitives.
 4.
Merge the results back using vector and optimal reduction algorithms.
 1.
If needed and possible, convert the array of structures to an aligned structure of arrays for singlestride access for all threads running on the coprocessor.
 2.
Block the data structure. The data should be blockable so that they can fit in the cache and be worked on.
Let’s look at the logical thinking behind some common algorithms to see how they can be implemented to run efficiently on Xeon Phi architecture.
General MatrixMatrix Multiply Algorithm (GEMM)
General matrixmatrix multiplication (GEMM) is one of the most commonly used algorithms in technical computing applications and is available in the basic linear algebra library (BLAS) in the form of the routines SGEMM or DGEMM for singleprecision and doubleprecision dense matrixmatrix multiply operations. xGEMM, where x can be ‘s’ for single precision and ‘d’ for double precision, computes the new value of matrix C based on the matrix product of matrices A and B and the old value of matrix C, formulated as:
C = alpha*op(A) * op(B) + beta*C
where alpha and beta are scalars and op indicates a possible transpose of the matrices.
The general operation is a triplenested loop:
for(i=0; i<M;i++)
for(j=0;j<N;j++)
for(l=0;l<k;l++)
C(i,j) = C(i,j) + A(i,l)*B(i,j)

Will not vectorize by the compiler because the compiler is unable to resolve aliasing among arrays A, B, and C.

Has in general a very large memory footprint to fit in the L2 cache of each core.

Demands a lot of memory bandwidth, requiring three memory accesses to A, B, and C to perform an FMA operation.
Rules 1 and 3: Scalable Parallelization and Optimal Cache Reuse
In order to optimize this algorithm for Xeon Phi, make use of the blocked dense matrix multiply method. It can be shown that if A, B, and C matrices are divided into blocks with conformable block sizes, the blocked matrixmatrix multiplication can be done like a regular matrixmatrix multiplication.^{1} This property enables you to utilize the same triple nested loop algorithm described previously to work on blocks of matrices instead of individual elements and arrive at the same result as multiplying the whole matrices. This method allows you to break up the work among the cores of Xeon Phi and optimize cache reuse.
Rule 2: Efficient Vectorization
In Figure 112, to compute a subblock of C, a column of A coming from the L1 cache can be multiplied by a row of B and accumulated into 16 vector registers holding C. To illustrate the process, think about a 2x2 square matrix A, B, and C. By sketching on a sheet of paper, you will see that the first components of C11, C12, C21, and C22 can be obtained by multiplying A11 × B11, A11 × B12, A21 × B11, and A21 × B12, which is obtained by multiplying a column of A (A11, A21) by a row of B (B11 and B12). Assuming a 2wide vector register, you load and broadcast A11 in one register and multiply with a register loaded with a row of B to update C registers holding C11 and C12 values (first components) of the matrix multiplication. This can be extended to the 16x16 matrix subblocks discussed here. Using the cache blocking and software prefetch, you can get the values of the matrix blocks into cache levels L2 and L1 to achieve good efficiency with this algorithm on Xeon Phi.
It can be verified that this algorithm conforms to all three design rules of an optimal algorithm for the Xeon Phi coprocessor. It satisfies Rule 1 by dividing the input matrices into smaller chunks to be able to distribute them equally among the cores and threads within a core. The algorithm satisfies Rule 3 by fitting the data into the L2 and L1 caches and including prefetches to reduce data access latencies. And it satisfies Rule 2 by working on the data unit by vector instructions. At the output generation step, the output is done in parallel by each core and individual threads within a core without contention by working on the disjoint area of array C.
The simple square matrix multiplication example considered in this section is an artificial example. Implementing matrix multiplication in practice requires handling various shapes and boundary conditions for optimal performance. I highly recommend that you make use of offtheshelf libraries such as Intel MKL for such operations. These libraries implement various BLAS and Sparse BLAS routines and make it easier for users to optimize for Xeon Phi coprocessors.
Molecular Dynamics
Molecular dynamics (MD) are a subset of the broader class of applications known as the particle dynamics or nbody applications . The goal of these applications is to simulate the interaction between particles using the Newtonian equations of motion caused by force fields defined by the underlying physics. In the MD problem, the goal is to determine the molecular movement over a period of time till the molecules and atoms involved in the simulation get to a steady state. The forces and potential energy involved are defined by molecular mechanics and used to compute the velocities and movement of the atoms and molecules. The MD algorithm is used in many fields besides classical mechanics, including material science, biochemistry, and biophysics.
The basic algorithm of MD simulation is very simple and straightforward. It is a timestepped simulation where, at each time step, the application calculates the forces between the atoms. The forces may be bonded or nonbonded forces. Using the calculated forces, the atoms’ positions are updated. The time step is repeated till either steady state is reached or a certain number of time steps have been performed.
The molecules and atoms are represented by an array of data structures containing the forces, velocities, and positions of each particle. The most timeconsuming computation of MD code is computing the nonbonded force field, which requires computing the force exerted on a particle by every other particle in the system. This computation consumes almost 80 percent of execution time for practical workload sizes. A common optimization is to set a cutoff distance for the neighbor list of interacting molecules and to calculate forces only among them. MD code can also benefit from spatial sorting of the neighbor list to help optimize the force calculation. This section looks at the timeconsuming forcecomputation component of MD applications on Xeon Phi. Other computations such as the neighborlist build and sort may be done on the host processor.
In general, MD code can compute various types of forces between the molecules. Code Listing 111 shows the pseudocode for a LennardJones force calculation in an MD kernel.^{2} The LennardJones potential is widely used in MD code to describe interactions between the molecules or atoms being simulated.
Code Listing 111. LennardJones Force Calculation
// loop over all the atoms
for (int i = 0; i < numAtoms; i++)
{
//Get position info for a given atom
POSITION ipos = position[i];
neighList = getNeigborList(i);
numNeighbors = neighList.getCount();
// loop over all the neighbors of the given atom
for(int j = 0;j < numNeighbors;j++)
{
int jidx = neighList[j];
// get position of the nrighbor jidx
POSITION jpos = position[jidx];
double delx = ipos.x  jpos.x;
double dely = ipos.y  jpos.y;
double delz = ipos.z  jpos.z;
// compute distance between the neighbors
double r2inv = delx*delx + dely*dely + delz*delz;
// if the distance is less than cutoff distance
if (r2inv < cutsq) {
// calculate the force
r2inv = 1.0f/r2inv;
double r6inv = r2inv * r2inv * r2inv;
double force = r2inv*r6inv*(lj1*r6inv  lj2);
// accumulate the force value for this atom
Force[i].x += delx * force;
Force[i].y += dely * force;
Force[i].z += delz * force;
Force[jidx].x = delx * force;
Force[jidx].y = dely * force;
Force[jidx].z = delz * force;
}
}
}
 1.
The code goes over each atom one at a time.
 2.
The code loops over all the neighbors of the atom.
 3.
The code gets the positions of the neighboring atoms and calculates the force using the distance between the atoms.
 4.
If the distance between the atoms is less than the cutoff distance, the code updates the force for this atom and the interacting atoms.
Rule 1: Scalable Parallelization
In order to run this code for Xeon Phi in accordance with the rules for optimal algorithms, I make the modifications shown in Code Listing 112. The first rule dictates that the code should be scalable to all the threads. In this case, I use OpenMP parallel and dynamic schedule to achieve this scalability. Since the amount of work for each atom depends on the neighbor size and may not be equal, dynamic scheduling will prevent load imbalance in the overall force computation task on Xeon Phi. This modification is provided in line 1 of Code Listing 112. Since in general MD code will have millions of atoms to be simulated per node, this will provide enough work for each thread.
Code Listing 112. Xeon Phi LennardJones Force Calculation
// loop over all the atoms in parallel
#pragma omp parallel for schedule(dynamic)  line 1
for (int i = 0; i < numAtoms; i++)
{
POSITION ipos = position[i];
neighList = get NeigborList(i);
numNeighbors = neighList.getCount();
// loop over all the neighbors of the given atom
for(int j = 0;j < numNeighbors;j++)
{
int jidx = neighList[j];
// prefetch position array to cache level 1 and level 2
// Note you need to prefetch 16 elements to help with the
//gather – just shown 2 for clarity
__mm_prefetch(&position[j+offset], _MM_HINT_T0);  line 15 // prefetch to L0 cache
__mm_prefetch(&position[j+offset], __MM_HINT_T1;  line 16 // prefetch to L1 cache
// get position of the nrighbor jidx
POSITION jpos = position[jidx];
double delx = ipos.x  jpos.x;
double dely = ipos.y  jpos.y;
double delz = ipos.z  jpos.z;
// compute distance between the neighbors
double r2inv = delx*delx + dely*dely + delz*delz;
// if the distance is less than cutoff distance
if (r2inv < cutsq) {
// calculate the force
r2inv = 1.0f/r2inv;
double r6inv = r2inv * r2inv * r2inv;
double force = r2inv*r6inv*(lj1*r6inv  lj2);
// accumulate the force value for this atom
Force[i].x += delx * force;
Force[i].y += dely * force;
Force[i].z += delz * force;
Force[jidx].x = delx * force;
Force[jidx].y = dely * force;
Force[jidx].z = delz * force;
}
}
}
Rules 2 and 3: Efficient Vectorization and Optimal Cache Reuse
 1.
Since the different data structures such as position and force in the code are accessed using the vector gather/scatter instruction, which results in path length increase and memory access latencies, a big performance gain may be obtained by inserting prefetch as shown in lines 15 and 16. If successfully prefetched into the cache level, the inefficiency due to scatter gather code is vastly reduced, thus providing good performance.
 2.
Data structures such as positions and the neighbor list can be aligned to cacheline boundaries when allocating the data structures and padded to be multiples of cache line size.
 3.
Since the gather operation works in a loop where each iteration brings in a cache line, it might be possible to have more neighbor data in a cache line if the neighbor position matches the memory layout of the neighbor list. This can be effected by spatial sorting of the neighbor list so that particle orders in the sorted neighbor list are in the nearby cachefriendly memory region.^{3}
 4.
Divide the nonbonded force compute between the host and Xeon Phi using asynchronous computations.
 5.
In some cases it may be possible to utilize mixedprecision arithmetic—such as computing force and position in single precision and accumulating in double precision—to achieve higher performance.
Stencil Operation
Stencil operations are commonly used in many technical computing applications for simulating diffusion in fields such as computational fluid dynamics, electromagnetics, and heat propagation. These problems are associated with structured grids and use mathematical finite difference representation of differential operators—such as Laplacian (u _{ t + 1} ← ∇^{2} u _{ t }) operators, divergence, and gradient—to find answers to the problems formulated. Stencil refers to the predetermined set or pattern of nearest neighbors including the element itself. The stencil can be used to compute the value of various elements in an array at a given timestep based on its neighbors’ values computed from previous timesteps. The algorithm in general steps in time with some given initial conditions over all the elements to simulate the diffusion or other physical process in time.
Code Listing 113. Stencil Operation Pseudocode
for(t=0; t<numTimeSteps;t++){
for(z=2; z<NZ2; z++){
for(y=2; y<N2; y++){
for(x=2; x<NX2; x++){
fout[z,y,x] = c1*f[z,y,x] +
c2*f[z,y,x2] + c3*f[z,y,x1] + c4*f[z,y,x+1] + c5*f[z,y,x+2] +
c6*f[z,y2,x] + c7*f[z,y1,x] + c8*f[z,y+1,x] + c9*f[z,y+2,x] +
c10*f[z2,y,x] + c11*f[z1,y,x] + c12*f[z+1,y,x] + c13*f[z+2,y,x]
}}}
double *tmp = fout; fout = f; f=tmp; //switch buffers
}
The constants c1c11 represent the weighted contribution of the various neighboring elements of the given cell value being computed and defined by the physics of the problem. In this code you are working with one input and one output array and switching between them in every time step. For other operators, such as gradient and divergence, the number of input and output arrays may be multiple and you may need simultaneous access to multiple arrays. In addition, the physics of the problem may impose formulations with more data arrays to be read per grid point computation, increasing the data traffic even more.
The code sweeps through the input 3D array f[], which is usually larger than the L2 cache of the Xeon Phi hardware. This causes low flops/byte for this operation, in turn causing performance to be memory bandwidthbound. Having more arrays to work on puts more pressure on the memory subsystem of Xeon Phi, causing a further performance drop.
Rule 1: Scalable Parallelization
In the stencilbased application, parallelization happens by naturally breaking up the larger 3D grid into smaller ones per card. The boundary (halo) exchange may happen between neighboring cells at every time step to exchange the boundary data. Within a card, because this is a structured grid, a parallelization technique is often used to divide the contiguous blocks allocated to Xeon Phi card equally among the cores so that the threads in a core share one block.
To parallelize the loop, you can do OpenMPbased parallelization, as shown in Code Listing 114, on the outer two loops and vectorizing the inner loop. To increase the amount of work for each OpenMP thread, you can use the OpenMP loop collapse construct on the outer loops as well, as shown in Code Listing 114. Also make sure to affinitize the code to the Xeon Phi core to make optimal cache reuse using KMP_AFFINITY.
Code Listing 114. Stencil Calculation Pseudocode Parallelized with OpenMP
for(t=0; t<numTimeSteps;t++){
#pragma omp for collapse(2)
for(z=2; z<NZ2; z++){
for(y=2; y<NY2; y++){
for(x=2; x<NX2; x++){
fout[z,y,x] = c1*f[z,y,x] +
c2*f[z,y,x2] + c3*f[z,y,x1] + c4*f[z,y,x+1] + c5*f[z,y,x+2] +
c6*f[z,y2,x] + c7*f[z,y1,x] + c8*f[z,y+1,x] + c9*f[z,y+2,x] +
c10*f[z2,y,x] + c11*f[z1,y,x] + c12*f[z+1,y,x] + c13*f[z+2,y,x]
}}}
double *tmp = fout; fout = f; f=tmp; //switch buffers
}
You can now execute the code in parallel on all the available Xeon Phi cores to attain better performance compared with that of the serial version of the code.
Rule 2: Efficient Vectorization
The stencil methods have a high degree of parallelism but usually are not friendly to vector architectures such as Xeon Phi. To start with, the compiler is unable to vectorize the inner loop in Code Listing 114 due to the possible alias between the two c pointers: fout[] and f[]. In order to vectorize this code, you need to tell the compiler to assume the array pointers point to the disjoint location by using #pragma vector always or #pragma ivdep. This pragma will vectorize the inner loop, but the vectorization is inefficient due to memory access issue, as you will see shortly. To help vectorize the code more efficiently, you can remove the prologue and epilogue of the vectorized code by aligning the array data to the cacheline boundary and padding the arrays so that the innermost dimension is a multiple of the cacheline size. Aligning the data will also help with cacheway oversubscription in some scenarios. Once the data are aligned, you can use pragma vector aligned to tell the compiler to assume aligned vectors. Finally, you don’t want to waste BW when writing to the fout[] array, which causes output data to be read, to complete the read for ownership operation. This can be done by indicating to the compiler that fout is a nontemporal write and thus not to waste valuable BW needed to read the data in.
The compilergenerated code created by the above changes and directives will be inefficient. Inefficient vectorization happens because you are adding shifted versions of data elements such as f[z,y,x2] .. f[z,y,x+2] in the same computing statement. This inefficiency is known as stream alignment conflict and refers to the fact that the vectors needed to perform the computation from the same data stream but are not aligned to one another, thus requiring extra data manipulations. In this case, the compiler generates code that requires redundant load and interregister shifts following a load operation. To get around this problem, you will need to use a data structure transformation such as dimensionlifted transposition.^{4}
Rule 3: Optimal Cache Reuse
Cache use can be improved by using cacheblocking in the X and Y directions.^{5} By blocking the code, you can increase the temporal locality. The goal of cacheblocking is to render those elements that are needed by the stencil code available in cache as the code works along the column of the Zaxis. This is shown in the hypothetical example in Code Listing 115 and graphically represented in Figure 114. If threads in the same core work on the same block or neighbor blocks, data reuse would be higher and balanced affinitization would enable better performance.
Code Listing 115. Stencil Pseudocode Optimized with Blocking and Vectorization
for(t=0; t<numTimeSteps;t++){
#pragma omp for collapse(3)
for(yy=2;yy<NY2;yy+=By)
for(xx=2;xx<NX2;xx+=Bx)
for(z=2; z<NZ2; z++){
for(y=yy; y<min(NY2,yy+By2); y++){
#pragma vector aligned //indicate no alising
#pragma vector nontemporal (fout) // indicate fout can be streaming store
for(x=xx; x<min(NX2,xx+Bx2); x++){
fout[z,y,x] = c1*f[z,y,x] +
c2*f[z,y,x2] + c3*f[z,y,x1] + c4*f[z,y,x+1] + c5*f[z,y,x+2] +
c6*f[z,y2,x] + c7*f[z,y1,x] + c8*f[z,y+1,x] + c9*f[z,y+2,x] +
c10*f[z2,y,x] + c11*f[z1,y,x] + c12*f[z+1,y,x] + c13*f[z+2,y,x]
}}}
}}}
double *tmp = fout; fout = f; f=tmp; //switch buffers
}
European Option Pricing Using Monte Carlo Simulation in Financial Applications
Monte Carlo simulation is a computational method widely used in the financial sector to model option prices^{6} of an underlying stock. This section looks at optimizing the Monte Carlo algorithm used for European option pricing on the Xeon Phi coprocessor.
Figure 115 is a simplified illustration of the basic step of Monte Carlo simulation of European option pricing. It suggests how such simulation can be optimized for Xeon Phi coprocessors. The general Monte Carlo process of option pricing starts by generating a set of random numbers (Step 1 in Figure 115). The simulation process takes a batch of queries to set option prices of the underlying security by specifying their input parameters, such as current stock price, option strike price, expiration date, and so forth. At Step 2, for each of these options to be calculated, the computation generates a block of possible stock price at the expiry date using a solution to the stochastic process that simulates underlying stock prices over time. The solution to the stochastic process for each random variate x generated in Step 1 results in a function that is of the form:
S(T) = S(0) Exp2(f(x,T))
where:
S(T) = Stock price at maturity.
S(0) = Stock price at the option issue date.
Exp2(f(x,T)) = exponential of a function of variate ‘x’ and time T
Profiling of the Monte Carlo simulation shows that the most computeintensive part of the computation is evaluating the S(T) price in Step 3.
Rule 1: Scalable Parallelization
Rule 2: Efficient Vectorization
The code can be vectorized efficiently using vectorized versions of random number generators and transcendental functions. The MKL’s vector statistical functions contain vector random number generators with various statistical distributions, including the Gaussian distribution needed by Monte Carlo simulation. Functions such as vsRngGaussian or vdRngGaussian from the Intel MKL can be used to generate vectorized single and doubleprecision random numbers.^{7} The loop at Step 2 in Figure 115 is vectorized by the compiler by using vector transcendental functions. For singleprecision arithmetic, highthroughput transcendental instructions such as vexp223ps are implemented in Xeon Phi hardware and can provide an excellent boost in performance by vectorizing the key calculation at Step 3 in the loop. In my experiments, the Intel Compiler was able to vectorize such loops. For doubleprecision computation, you can use the Intel short vector math library to vectorize the computation as well. The biggest benefit and performance gain can be obtained by using the fast hardwareimplemented transcendental functions provided the algorithm can handle the lower precision of these functions.
Rule 3: Optimal Cache Reuse
Monte Carlo simulations are computebound applications and the data usage easily fits in the L2 cache. So cache reuse is fairly optimal and not a concern for this algorithm.
Conforming to all three rules for optimal algorithms—scalable parallelization, efficient vectorization, and optimal cache reuse—Monte Carlo simulation is ideally suited to reap the potential benefits of the Intel Xeon Phi hardware.
Summary
This chapter looked at various common algorithms and their data structures and how they help and hinder performance while executing on Xeon Phi. You have seen that there are ways to restructure code and data structure so that they follow the three rules of scalable parallelization, efficient vectorization, and optimal cache reuse to achieve high performance on Xeon Phi. This chapter is in no way a comprehensive treatment of technical computing applications, but it does impart a sense of the vista of exciting possibilities now before you, as you prepare to leverage what you have learned in this book to develop new algorithms and data structures for the Intel Xeon Phi coprocessor.
Footnotes
 1.
Howard Eves, Elementary Matrix Theory. Dover Books, 1980.
 2.
A similar implementation of this code can be found in the SHOC benchmark for Xeon Phi. https://github.com/vetter/shocmic
 3.
S. Meloni, M. Rosati, and L. Colombo, “Efficient Particle Labeling in Atomistic Simulations,” Journal of Chemical Physics, Vol. 126, No. 12, 2007.
 4.
T. Henretty, K. Stock, L. N. Pouchet, F. Franchetti, J. Ramanujam, and P. Sadayappan, “Data layout transformation for stencil computations on shortvector SIMD architectures,” Compiler Construction: Lecture Notes in Computer Science, Vol. 6601, 2011, pp. 225–245. http://link.springer.com/chapter/10.1007/9783642198618_13#page1
 5.
G. Rivera and C. Tseng, “Tiling optimizations for 3D scientific computations.” In Proceedings of SC’00, Dallas, TX, November 2000.
 6.
Phelim P. Boyle, “Options: A Monte Carlo Approach,” Journal of Financial Economics , Vol. 4, No. 3, pp. 323338, May 1977.
 7.