Insight into tiles generated by means of a correction technique
Abstract
Well-known techniques for tiled code generation are based on the polyhedral model and affine transformations. An alternative approach to generation of tiled code is to correct original rectangular tiles defined for a loop nest by means of the transitive closure of a dependence graph instead of deriving and applying affine transformations. In this paper, we present the results of an analysis of the basic features of tiles generated by correcting original rectangular tiles. We introduce procedures which allow us to recognize such features as target tile type (fixed, varied, parametric), dimensionality, size (the number of statement instances within a tile), and loop nest tileability (the percentage of statement instances that can be tiled with rectangular tiles). We consider differences between those features of tiles generated by means of affine transformations and transitive closure. We also discuss results of experiments with PolyBench benchmarks and show how differences in tiles generated with the examined approach and affine transformations affect serial tiled code performance.
Keywords
Optimizing compilers, Tiling, Transitive closure, Dependence graph, Code locality
Mathematics Subject Classification
68N20, 68N15, 68M20, 68R01, 57M15, 90C10
1 Introduction
Tiling [8, 11, 13, 15, 24, 30] is an iteration reordering transformation of significant importance for both improving data locality and extracting coarse-grained loop nest parallelism.
To the best of our knowledge, well-known tiling techniques are based on linear or affine transformations of program loops [8, 11, 13, 15, 24, 30, 32]. Under affine transformations, tiling is valid when there exists a band of fully permutable loops [13, 30, 33], i.e., when all correspondingly permuted distance vectors have nonnegative elements.
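The permutability condition can be illustrated with a small check. The following is a minimal sketch, not taken from any of the cited tools: the distance vectors and the skewing matrix are hypothetical, and the sketch relies only on the standard fact that a uniform distance vector d becomes T·d under a linear loop transformation T.

```python
def is_fully_permutable(distance_vectors):
    """True when every element of every distance vector is nonnegative."""
    return all(c >= 0 for d in distance_vectors for c in d)


def transform(matrix, vector):
    """Apply a linear loop transformation (given as matrix rows) to a distance vector."""
    return tuple(sum(m * v for m, v in zip(row, vector)) for row in matrix)


# Hypothetical distance vectors of a two-deep loop nest.
original = [(1, -1), (1, 0)]

# Hypothetical skewing transformation: (i, j) -> (i, i + j).
skew = [(1, 0), (1, 1)]
skewed = [transform(skew, d) for d in original]

print(is_fully_permutable(original))  # False
print(is_fully_permutable(skewed))    # True
```

In this toy case, rectangular tiling of the original loops is not valid under the affine framework, while tiling of the skewed loops is.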
In paper [4], the authors present a novel approach for generation of tiled code for affine loop nests. It is based on correction of original rectangular tiles by means of the transitive closure of a loop nest dependence graph. That approach produces tiled code even when no affine transformation exists that yields a fully permutable loop nest. It breaks cycles in the inter-tile dependence graph and makes all target tiles valid under the lexicographical order of target tile enumeration.
The effectiveness of the tile correction technique used for tiling and parallelizing bioinformatics code is demonstrated in papers [18, 19].
In general, corrected tiles are different from those generated by means of affine transformations. The goal of this paper is to demonstrate the basic features of corrected tiles, such as type, size, and dimensionality. We also consider loop nest tileability, i.e., the percentage of statement instances that can be tiled with rectangular tiles in the original iteration space. The main contributions of this paper are as follows:

presentation of basic features of corrected tiles generated by means of the transitive closure of a loop nest dependence graph such as type, size, dimensionality and how those features can be recognized by means of the introduced procedures;

demonstration of similarities and differences of tiles generated by means of transitive closure and affine transformations;

implementation of the presented procedures to recognize basic features of corrected tiles in the TC optimizing compiler applying the ISL library and presentation of experimental results obtained by means of that compiler for PolyBench benchmarks.
2 Background
In this paper, we deal with affine loop nests where, for given loop indices, lower and upper bounds as well as array subscripts and conditionals are affine functions of surrounding loop indices and possibly of structure parameters (defining loop index bounds), and the loop steps are known constants.
Given a loop nest with q statements, we transform it into its polyhedral representation, including: an iteration space \(IS_i\) for each statement \(S_i,i=1,\ldots ,q\), read/write access relations (\( RA / WA \), respectively), and global schedule S corresponding to the original execution order of statement instances in the loop nest.
The loop nest iteration space \(IS_i\) is the set of statement instances executed by a loop nest for statement \(S_i\). An access relation maps an iteration vector \(I_i\) to one or more memory locations of array elements. Schedule S is represented with a relation which maps an iteration vector of a statement to a corresponding multidimensional timestamp, i.e., a discrete time when the statement instance has to be executed. We define a global iteration space IS as \(IS=\bigcup _{i=1}^{i=q}IS_i\) and a global iteration vector I as \(I=\bigcup _{i=1}^{i=q}I_i\). Further on, under I and IS, we mean the global iteration vector and global iteration space, respectively.
In order to compute the transitive closure of a dependence graph for loop nests examined in this paper, we have used the isl_map_transitive_closure function [27], the iterative approach [3], and the modified Floyd–Warshall algorithm [2]. The necessity of the usage of different algorithms is justified by our goal to get exact dependence graph transitive closure for each studied kernel and therefore to avoid any performance loss of target code unrelated to the algorithm itself. If one algorithm fails to calculate exact transitive closure, we choose another from those mentioned above.
In paper [4], a novel approach allowing for generation of valid tiles based on the transitive closure of a dependence graph is presented. Below we recap the main idea of that algorithm. First, original rectangular tiles are defined in the loop nest iteration space. Each such tile is associated with a parametric identifier II. Then an invalid dependence target and the corresponding invalid tile are defined as follows: if there exists a direct or transitive dependence whose target belongs to a tile with identifier II while its source belongs to a tile with an identifier lexicographically greater than II, then that dependence target and that tile are invalid. Each invalid tile should be corrected so that invalid dependence targets are relocated to lexicographically greater tiles.
In order to present that technique, let us consider the following example.
It is obvious that scanning those tiles and iterations within each tile in lexicographic order is incorrect because of the violation of the valid execution of dependent iterations (to honor a dependence, we should first execute the source of this dependence, then its target). For example, iteration (2, 2)—the target of the dependence \((1,3) \rightarrow (2,2)\) would be executed before iteration (1, 3)—the source of this dependence. To cope with such a problem, we correct the content of the tiles as follows. We remove iteration (2, 2) from T1 and add it to T2, remove iteration (4, 2) from T3 and add it to T4. After these changes, we get target tiles \(T1\_VLD\), \(T2\_VLD\), \(T3\_VLD\), and \(T4\_VLD\) presented in Fig. 1b. Now scanning tiles \(T1\_VLD\), \(T2\_VLD\), \(T3\_VLD\), \(T4\_VLD\) and iterations within each tile in lexicographic order is valid.
A formal algorithm for generation of target tiles based on transitive closure and the proof of the correctness of that algorithm are presented in paper [4].
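The correction step recapped above can be sketched by brute force on a tiny iteration space. This is an illustration under stated assumptions, not the formal algorithm of [4]: the 4 x 4 iteration space, the 2 x 2 original tiles, and the dependence pattern (i, j) -> (i + 1, j - 1) are hypothetical choices that are merely consistent with the dependence (1, 3) -> (2, 2) mentioned in the text, and relocating a point to the lexicographically greatest tile among its own tile and the tiles of its transitive sources is a simplification.

```python
from itertools import product

N, B = 4, 2  # assumed loop bounds and original tile size
points = set(product(range(1, N + 1), repeat=2))


def tile_id(point):
    """Identifier of the original rectangular tile containing the point."""
    return tuple((c - 1) // B for c in point)


def successors(point):
    """Direct dependence targets under the assumed pattern (i, j) -> (i + 1, j - 1)."""
    target = (point[0] + 1, point[1] - 1)
    return [target] if target in points else []


def reachable(point):
    """Transitive closure restricted to one source: all direct and transitive targets."""
    seen, stack = set(), successors(point)
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(successors(p))
    return seen


# Start from the original tiles, then push every dependence target into a
# tile that is not lexicographically smaller than the tile of any source.
target_tile = {p: tile_id(p) for p in points}
for source in points:
    for target in reachable(source):
        target_tile[target] = max(target_tile[target], tile_id(source))

moved = {p: (tile_id(p), t) for p, t in target_tile.items() if t != tile_id(p)}
print(sorted(moved.items()))
```

With these assumptions, only iterations (2, 2) and (4, 2) change tiles, which agrees with the relocations from T1 to T2 and from T3 to T4 described in the example.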
In this paper, we deal with the following sets generated by means of the tiling algorithm: (i) set TILE(II) including statement instances of the original rectangular tile whose identifier is II; (ii) set \(II\_SET\) comprising all tile identifiers; (iii) set \(TILE\_GT(II)\) including elements of all of the tiles whose identifiers are greater than identifier II; and (iv) set \(TILE\_VLD(II)\) representing target corrected tiles. Hereinafter, we present an analysis of only target tiles and the corresponding serial tiled code because for a given fixed original tile size, the considered algorithm generates unique tiles and unique serial code. Parallel tiled code can be generated by means of many different approaches: assuming that a tile is a macroatomic statement, all known parallelization algorithms can be applied to serial tiled code: techniques based on affine transformations and/or those based on transitive closure, for example, see paper [20]. In general, each technique can generate a particular parallel tiled code different from that returned by another technique, i.e., parallel tiled code is not unique. Parallel code generation is out of the scope of this paper.
The next three sections present three algorithms that allow us (i) to recognize types of target tiles; (ii) to calculate the number of tiles within a particular category of tiles and tile dimensions; (iii) to determine loop nest tileability. Those algorithms are implemented in the TC optimizing compiler, which enables the user to run one, two, or all three algorithms and collect statistics about target tiles generated with that compiler by means of the tile correction technique.
3 Types of target tiles
In this section, we introduce the following types of target tiles: fixed, varied, parametric, and fixed boundary. Subsequently, we show how those types can be recognized. Recall that under the examined algorithm, all original tiles are rectangular and their size is defined by input values \(b_1\), \(b_2,\ldots , b_d\), where d is the loop nest depth.
If condition (3) is not true, this means that some points (at least one) within set TILE are invalid dependence targets and they will be eliminated from set TILE and moved to particular lexicographically greater tiles so that all target tiles become valid. As a consequence, target tiles will differ in shape from original ones; in general, they can be nonrectangular. In such a case, the discussed algorithm can generate fixed, and/or varied, and/or parametric tiles. The size of a fixed tile is constant: it depends neither on indices \(ii_1,ii_2, \ldots ,ii_d,\) defining a tile identifier, nor on parametric upper loop index bounds. The size of varied tiles depends on indices \(ii_1\), \(ii_2\), \(\ldots \), \(ii_d\), defining tile identifiers, but it does not depend on parametric upper loop index bounds. The size of a parametric tile is defined by parametric upper loop index bounds.
To recognize target tile types for given upper loop index bounds, we use the Barvinok library [26] to calculate the cardinality of set \(TILE\_VLD\) (\(\mathrm {card}(TILE\_VLD)\)), which is represented with a piecewise quasipolynomial, i.e., a subdivision of one or more spaces with a quasipolynomial associated with each cell in the subdivision. Each quasipolynomial is of the form \(\{\, expression : domain \,\}\), where expression defines the number of elements of a tile group, while domain defines the tile identifiers of this group. Subsequently, we analyze the expression part of each quasipolynomial. If it involves parameters representing upper loop index bounds, we conclude that the number of statement instances included in a tile depends on upper loop index bounds, i.e., the tile is parametric. If it involves symbolic constants representing tile identifiers (\(ii_1, ii_2, \ldots , ii_d\)), we categorize the tile as varied. If both types of symbols are present (parametric upper loop index bounds and tile identifiers), we refer to the corrected tile as parametric and varied. If an analyzed expression does not depend on any variable, the corresponding tile is fixed.
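The decision rule described above can be sketched as follows. This is a hypothetical helper, not the Barvinok or TC interface: it assumes that the set of symbols occurring in the expression part of a quasipolynomial has already been extracted, and the symbol names N, ii1, and ii2 are illustrative only.

```python
def classify_tile(expression_symbols, bound_params, tile_id_symbols):
    """Classify a tile group from the symbols that occur in its size expression."""
    symbols = set(expression_symbols)
    uses_bounds = bool(symbols & set(bound_params))
    uses_ids = bool(symbols & set(tile_id_symbols))
    if uses_bounds and uses_ids:
        return "parametric and varied"
    if uses_bounds:
        return "parametric"
    if uses_ids:
        return "varied"
    return "fixed"


# Hypothetical symbol names: N is an upper loop bound, ii1 and ii2 identify a tile.
params, ids = {"N"}, {"ii1", "ii2"}
print(classify_tile(set(), params, ids))         # fixed
print(classify_tile({"ii2"}, params, ids))       # varied
print(classify_tile({"N"}, params, ids))         # parametric
print(classify_tile({"N", "ii1"}, params, ids))  # parametric and varied
```

The sketch does not separate fixed boundary tiles from the other groups, since that requires comparing the number of points in a tile with the size of an original rectangular tile.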
Within the category of fixed tiles, we distinguish fixed boundary tiles: their size is fixed, and some of their points belong to the boundaries of the loop nest iteration space. Expressions representing the size of such tiles include upper loop bounds as parameters, but the number of points within them is equal to or less than that of a nonboundary original rectangular tile.
Below, we illustrate all of the aforementioned cases. Let us consider the following loop nest.
Figure 2 presents the dependence graph, original tiles T1, \(T2,~\ldots \), T6 of the size \(2 \times 2\) (shown with dashed lines), and target tiles \(T1\_VLD\), \(T2\_VLD\), \(\ldots \), \(T6\_VLD\) (shown with solid lines) produced by means of the examined algorithm for Example 2 when \(N=3\). It is worth noting that the structure of the dependence graph for the examined loop nest is a chain: each next iteration depends on the previous one, i.e., only the serial (lexicographical) order of iteration enumeration is valid.
Summing up, we may conclude that the considered algorithm relying on the dependences available in the loop nest can generate fixed and/or varied, and/or parametric tiles, all of which can be recognized by applying Algorithm 1.
4 The number of tiles within a particular group and tile dimensions
In this section, we demonstrate how to calculate the number of tiles within a particular group, tile dimensions, and the percentage of subspaces occupied with target tiles of a particular group.
In order to reveal the number of tiles within each group, we intersect set \(II\_SET\) comprising tile identifiers with a set whose constraints are represented with the domain of the corresponding quasipolynomial.

the examined tiling algorithm divides the iteration spaces into 6 tiles;

four tiles, classified as fixed and fixed boundary, make up \(66.6 \%\) of all tiles, with a total of eight statement instances (\(40 \%\) of all statement instances are included in those tiles); each such tile is of the size \(1 \times 2\) and includes instances of statement \(\mathrm {S1}\) (see Fig. 2);

two tiles, recognized to be parametric, constitute \(33.3 \%\) of all tiles, with a total of twelve statement instances (\(60 \%\) of all statement instances are included in those tiles); each such tile consists of a subtile of the size \(1 \times 4\) which includes four instances of statement \(\mathrm {S1}\) and a subtile of the size \(2 \times 1\) which includes two instances of statement \(\mathrm {S2}\) (see Fig. 2).
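The shares listed above follow from simple counting. The sketch below only redoes that arithmetic from the tile counts and tile sizes given in the list; the dictionary layout and names are illustrative.

```python
# Tile counts and per-tile sizes as stated in the list above.
groups = {
    "fixed and fixed boundary": {"tiles": 4, "instances_per_tile": 1 * 2},
    "parametric": {"tiles": 2, "instances_per_tile": 1 * 4 + 2 * 1},
}

total_tiles = sum(g["tiles"] for g in groups.values())
total_instances = sum(g["tiles"] * g["instances_per_tile"] for g in groups.values())

shares = {}
for name, g in groups.items():
    instances = g["tiles"] * g["instances_per_tile"]
    shares[name] = (
        100.0 * g["tiles"] / total_tiles,      # percentage of all tiles
        100.0 * instances / total_instances,   # percentage of all statement instances
    )
    print(f"{name}: {shares[name][0]:.1f}% of tiles, {shares[name][1]:.1f}% of instances")
```

The tile shares are 4/6 and 2/6, which the list reports truncated to one decimal place, and the instance shares are 8/20 and 12/20.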
Figure 3 illustrates dependences and tiles for Example 3 when \(N = 6\). Original tiles are marked with dashed lines while target tiles are depicted with solid ones.
5 Loop nest tileability
In this section, we address the following question: what percentage of the iteration space can be tiled with original rectangular tiles, i.e., without correcting them? Hereinafter, we will refer to this property as tileability.
In order to compute the tileability of a loop nest, we form original rectangular tiles. Next, we remove all invalid dependence targets and all of the statement instances transitively dependent on those targets. Let us refer to such statement instances as problematic ones. The instances that are neither the sources of invalid dependence targets nor those dependent on any of problematic statement instances will be referred to as nonproblematic.
For the purpose of demonstrating the above concepts, let us consider the following loop nest.
We call such a tileability \(tileability\_before\) because tiled code should be run before serial code enumerating problematic iterations contained within set PROBLEMATIC.
Algorithm 3 presents in a formal way the calculation of loop nest tileability provided that upper loop index bounds are constants.
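For constant bounds, the idea behind tileability_before can be sketched by brute force. This is not Algorithm 3: it assumes a hypothetical 4 x 4 iteration space, 2 x 2 original tiles, and the dependence pattern (i, j) -> (i + 1, j - 1), and it takes problematic instances to be the invalid dependence targets plus every instance transitively dependent on them.

```python
from itertools import product

N, B = 4, 2  # assumed constant loop bounds and original tile size
points = set(product(range(1, N + 1), repeat=2))


def tile_id(point):
    """Identifier of the original rectangular tile containing the point."""
    return tuple((c - 1) // B for c in point)


def reachable(point):
    """All direct and transitive targets under the assumed pattern (i, j) -> (i + 1, j - 1)."""
    seen = set()
    i, j = point
    while (i + 1, j - 1) in points:
        i, j = i + 1, j - 1
        seen.add((i, j))
    return seen


# A target is invalid when one of its sources lies in a lexicographically greater tile.
invalid_targets = {
    t for s in points for t in reachable(s) if tile_id(s) > tile_id(t)
}

# Problematic instances: invalid targets and everything that depends on them.
problematic = invalid_targets | {d for t in invalid_targets for d in reachable(t)}

tileability_before = 100.0 * (len(points) - len(problematic)) / len(points)
print(sorted(problematic), tileability_before)
```

Under these assumptions, three of the sixteen iterations are problematic, so 81.25% of the iteration space can be covered by the uncorrected rectangular tiles.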
6 Results of experiments with PolyBench kernels
In this section, we present the results of an analysis of target tile features and corresponding code performance for PolyBench benchmarks.
The algorithm presented in [4] and procedures introduced in this paper have been incorporated into the TC optimizing compiler, which utilizes the Polyhedral Extraction Tool [29] for extracting polyhedral representations of original loop nests, the Integer Set Library [27] for performing dependence analysis, manipulating sets and relations as well as generating output code, and the Barvinok library [26] for calculating set cardinality and processing its representation.
We have experimented with the PolyBench/C 4.1 [21] benchmark suite, which includes linear algebra kernels and solvers, data mining programs, stencil computations, and dynamic programming algorithms. Generated tiles were analyzed by the introduced algorithms for data sizes defined by PolyBench as medium, large, and extra large. Reports generated by TC as well as tiled code generated by means of TC and Pluto for all studied PolyBench kernels can be found under the results directory in the project repository.
TC is able to generate set \(TILE\_VLD\) and a corresponding report for 28 out of 30 programs included in PolyBench. For the two remaining kernels—heat3d and deriche—deriving set \(TILE\_VLD\) or its cardinality is too computationally intensive.
Table 1 presents types of target tiles for PolyBench kernels. TC finds a total of 13 programs that do not require any correction of original rectangular tiles. For two kernels, original tiles are corrected, but target tiles are fixed or mostly fixed. For the remaining kernels, target tiles are a mix of different types.
For five chosen kernels, we compare features of generated tiles with those produced by Pluto v.0.11.4 [8], a state-of-the-art optimizing compiler based on the Affine Transformations Framework.

programs for which the original tiling does not require correction;

programs for which corrected tiles are categorized as only fixed or mostly fixed;

programs for which the correction algorithm produces tiles of various categories: fixed, parametric, varied.
Table 1 Tile types within PolyBench benchmarks
Tile categories  Kernels
Fixed (no correction)  2mm, 3mm, atax, bicg, correlation, covariance, gemm, gemver, gesummv, mvt, syr2k, syrk, trmm
Fixed (corrected) or mostly fixed  cholesky, trisolv
Fixed, parametric  lu
Fixed, parametric & varied  adi, jacobi1d, jacobi2d, seidel2d
Fixed, varied, parametric & varied  durbin, fdtd2d, gramschmidt, symm
Fixed, parametric, varied, parametric & varied  doitgen, floydwarshall, ludcmp, nussinov
Table 2 summarizes execution time and speedup of tiled code produced by TC and Pluto for the studied kernels. Here and further on, speedup means the ratio of the execution time of the original code to that of the serial tiled code.
The evaluation of tiled code performance was carried out on an Intel Xeon E5-2699 v3 processor clocked at 2.3 GHz, with 32 kB L1 data cache, 256 kB L2 cache, 45 MB L3 cache, and 256 GB RAM clocked at 2133 MHz. Code of both original and transformed loop nests was compiled under the Linux kernel 3.10.0 x86_64 by GCC 4.8.3 with the -O3 optimization enabled.
Table 2 Execution time and speedup for TC and Pluto for the studied kernels
Kernel  Original tile size  Problem size  Original time (s)  TC time (s)  Pluto time (s)  TC speedup  Pluto speedup
syr2k  64  Medium  0.0315  0.0153  0.0265  2.06  1.19
syr2k  64  Large  2.8107  1.9329  3.2618  1.45  0.86
syr2k  64  Extra large  93.1386  17.8868  30.7912  5.21  3.02
cholesky  32  Medium  0.0107  0.0078  0.0105  1.37  1.02
cholesky  32  Large  1.1610  0.9706  0.9655  1.20  1.20
cholesky  32  Extra large  10.2771  7.9269  7.8775  1.30  1.30
cholesky  64  Medium  0.0107  0.0092  0.0093  1.17  1.15
cholesky  64  Large  1.1610  1.1347  1.1401  1.02  1.02
cholesky  64  Extra large  10.2771  9.1666  9.1964  1.12  1.12
symm  64  Medium  0.0146  0.0100  *  1.50  *
symm  64  Large  1.6532  1.1889  *  1.39  *
symm  64  Extra large  21.1922  10.3501  *  2.05  *
nussinov  64  Medium  0.0258  0.0219  0.0212  1.17  1.22
nussinov  64  Large  3.0717  3.0276  2.9624  1.01  1.04
nussinov  64  Extra large  52.2635  39.8301  41.6972  1.31  1.25
jacobi2d  64  Medium  0.0172  0.0155  0.0172  1.11  1.00
jacobi2d  64  Large  1.7527  1.9976  2.3172  0.88  0.76
jacobi2d  64  Extra large  25.4783  19.0168  21.6232  1.34  1.18
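The speedup values in Table 2 can be reproduced directly from the measured times. The short sketch below does this for the nussinov kernel at the extra large problem size; the function and variable names are illustrative.

```python
def speedup(original_time, tiled_time):
    """Ratio of the execution time of the original code to that of the tiled code."""
    return original_time / tiled_time


# nussinov, extra large problem size, times in seconds taken from Table 2.
original, tc, pluto = 52.2635, 39.8301, 41.6972

tc_speedup = round(speedup(original, tc), 2)
pluto_speedup = round(speedup(original, pluto), 2)
tc_vs_pluto = round(pluto / tc, 2)  # how much faster TC code is than Pluto code

print(tc_speedup, pluto_speedup, tc_vs_pluto)  # 1.31 1.25 1.05
```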
6.1 syr2k
Table 3 Features of tiles for kernel syr2k (original tile size \(64 \times 64 \times 64\))
Tile category  Tiles (N = 240, M = 200)  Stmts (N = 240, M = 200)  Tiles (N = 1200, M = 1000)  Stmts (N = 1200, M = 1000)  Tiles (N = 2600, M = 2000)  Stmts (N = 2600, M = 2000)
Fixed 2D  11.25%  0.32%  5.28%  0.09%  2.88%  0.05%
Fixed 3D  33.75%  61.13%  79.19%  88.39%  89.41%  96.12%
Fixed boundary 2D  8.75%  0.18%  0.60%  0.01%  0.15%  \(<0.01\%\)
Fixed boundary 3D  46.25%  38.37%  14.93%  11.51%  7.56%  3.83%
Total (count)  80  11,577,600  6137  1,441,440,000  55,473  13,526,760,000
Property  N = 240, M = 200  N = 1200, M = 1000  N = 2600, M = 2000
Invalid  0.00%  0.00%  0.00%
Tileability before  100.00%  100.00%  100.00%
Tileability after  100.00%  100.00%  100.00%
syr2k is a basic linear algebra subprogram updating a square \(N \times N\) matrix C according to the formula: \(C := \alpha A B^{T} + \alpha B A^{T} + \beta C\). The kernel consists of two separate perfectly nested loop nests: one of depth 2 multiplying the elements of C by \(\beta \) and one of depth 3 performing the \(AB^{T}\) and \(BA^{T}\) matrix multiplications. The syr2k kernel serves as an example of a loop nest for which original tiles do not require any correction, i.e., there are no invalid dependence targets in any set TILE, so target tiles are fixed and the same as the original ones. Tiling such a loop nest is straightforward: no correction of original tiles is needed, and output code enumerates fixed 2D and 3D tiles. Because the discussed algorithm starts from rectangular tiling, we can also expect that the code produced by TC will be similar to that generated by Pluto.
Table 3 summarizes the features of tiles for different upper loop index bounds and the original tile size \(64 \times 64 \times 64\), extracted by means of Algorithms 1 and 2. In this and the following tables, Tiles is the percentage of all tiles that belong to a given group, and Stmts is the percentage of all statement instances included in the tiles of that group. As we can see, nearly the entire loop nest domain is occupied by 3D fixed rectangular tiles (fixed and fixed boundary), which comprise more than \(99\%\) of all statement instances for the largest problem size.
As shown in Table 2, the tiled code leads to a significant reduction in execution time. The only difference causing visibly worse performance of the code generated by Pluto is that Pluto interchanges the two innermost loops; this does not improve overall program performance and even leads to a slowdown for certain problem sizes.
6.2 cholesky
The cholesky kernel implements the Cholesky–Banachiewicz algorithm. It involves an imperfectly nested loop nest of depth 3 with 4 statements. This kernel is representative of loop nests for which tile correction is required, but corrected tiles are fixed or mostly fixed, depending on the original tile size.
Table 4 Features of tiles for the cholesky kernel (original tile size \(32 \times 32 \times 32\))
Tile category  Tiles (N = 400)  Stmts (N = 400)  Tiles (N = 2000)  Stmts (N = 2000)  Tiles (N = 4000)  Stmts (N = 4000)
Fixed 2D  10.51%  0.53%  4.01%  0.14%  2.24%  0.07%
Fixed 3D  55.91%  69.48%  87.16%  92.91%  95.46%  97.57%
Parametric 3D  –  –  –  –  0.04%  0.02%
Fixed boundary 2D  4.13%  0.15%  0.27%  0.01%  0.04%  \(< 0.01\%\)
Fixed boundary 3D  29.46%  29.84%  8.56%  6.94%  2.24%  2.34%
Total (count)  533  10,746,800  45,633  1,335,334,000  341,125  10,674,668,000
Property  N = 400  N = 2000  N = 4000
Invalid  11.31%  2.31%  1.16%
Tileability before  0.64%  0.15%  0.07%
Tileability after  \(<0.01\%\)  \(<0.01\%\)  \(<0.01\%\)
Analyzing the data in Table 4, we discover that target tiles have the following features: (i) there are 2D and 3D tiles, (ii) when the values of the upper bounds of the loop indices increase, the percentage of three-dimensional tiles also increases and constitutes over \(90\%\) of all tiles, (iii) most of the statement instances are located inside three-dimensional tiles, (iv) almost all tiles are either fixed or fixed boundary.
Analyzing the structure of tiles in detail, we discover that the most common tile is a 3D fixed tile of the size \(32 \times 32 \times 32\), including 32,768 instances of statement S1. The number of those tiles constitutes \(31\%\), \(79\%\), and \(91\%\) of all tiles for \(N=400\), \(N=2000\), and \(N=4000\), respectively. The statement instances included in each such tile access approximately \(24~\hbox {kB}\) of memory, while the capacity of the L1 data cache of the computer used for experiments is \(32~\hbox {kB}\). For the largest problem size, the overall number of statement instances included in these tiles is nearly \(96\%\). Additionally, TC produces approximately \(4\%\) of fixed 2D tiles of the size \(32 \times 32\) fully covered with statement instances. Due to the regular shape of tiles, this allows us to expect a satisfactory performance of tiled code similar to that produced by Pluto. Indeed, for the cholesky kernel, Pluto also generates 3D tiles, and the measured speedup is practically the same for all problem sizes as that of tiled code generated by TC.
Table 5 Features of tiles for kernel cholesky (original tile size \(64 \times 64 \times 64\))
Tile category  Tiles (N = 400)  Stmts (N = 400)  Tiles (N = 2000)  Stmts (N = 2000)  Tiles (N = 4000)  Stmts (N = 4000)
Fixed 2D  10.48%  0.42%  6.73%  0.13%  4.01%  0.07%
Fixed 3D  39.05%  57.53%  77.02%  88.65%  87.16%  92.98%
Fixed boundary 2D  9.52%  0.21%  0.93%  0.01%  0.27%  \(< 0.01\%\)
Fixed boundary 3D  40.95%  41.84%  15.32%  11.21%  8.56%  6.95%
Total (count)  105  10,746,800  6480  1,335,334,000  45,633  10,674,668,000
Property  N = 400  N = 2000  N = 4000
Invalid  22.37%  4.67%  2.35%
Tileability before  0.53%  0.14%  0.07%
Tileability after  \(< 0.01\%\)  \(< 0.01\%\)  \(< 0.01\%\)
Analyzing the data in Table 5, we can see the effect of increasing the tile size of each dimension by a factor of 2. For the largest problem bounds, we find 35,990 fixed 3D tiles fully covered with statement instances; they constitute \(79\%\) of all tiles. Each such tile accesses \(96\,\hbox {kB}\) of memory, while the capacity of the L1 data cache of the computer used for experiments is 32 kB. This reduces tiled code locality. As shown in Table 2, the speedup of the tiled code generated for \(64 \times 64 \times 64\) original tiles is less than that of the code generated using \(32 \times 32 \times 32\) original tiles.
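The relation between tile size and cache capacity can be checked with a back-of-the-envelope estimate. The model below is an assumption made only for illustration: it supposes that a full b x b x b tile touches three b x b blocks of 8-byte array elements, which happens to reproduce the 24 kB and 96 kB figures quoted in the text. The actual footprint is derived by TC from the access relations.

```python
L1_DATA_CACHE_KB = 32   # L1 data cache capacity of the test machine
BYTES_PER_ELEMENT = 8   # assumption: double-precision array elements
BLOCKS_PER_TILE = 3     # assumption: three b x b blocks are touched per tile


def tile_footprint_kb(b):
    """Estimated memory accessed by one full b x b x b tile, in kB."""
    return BLOCKS_PER_TILE * b * b * BYTES_PER_ELEMENT / 1024


for b in (32, 64):
    footprint = tile_footprint_kb(b)
    fits = footprint <= L1_DATA_CACHE_KB
    print(f"b = {b}: {footprint:.0f} kB, fits in L1: {fits}")
```

Under this model, doubling the tile size in each dimension quadruples the footprint, so a 64 x 64 x 64 tile no longer fits in the 32 kB L1 data cache.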
6.3 symm
The symm kernel implements a symmetric matrix multiplication of the form \(\alpha A B + \beta C\) with matrix A of the size \(M \times M\), and B, C being matrices of the size \(M \times N\). The results of applying Algorithms 1, 2, and 3 to this benchmark are presented in Table 6.
For the symm kernel, Pluto v.0.11.4 does not return any affine transformation to generate tiled code. In our opinion, this is due to dependences that involve the scalar variable temp2 in the symm kernel. Privatization of this variable would allow all the dependences associated with it to be respected, but to the best of our knowledge, Pluto v.0.11.4 does not support privatization of variables.
Table 6 Features of tiles for the symm kernel (original tile size \(64 \times 64 \times 64\))
Tile category  Tiles (N = 240, M = 200)  Stmts (N = 240, M = 200)  Tiles (N = 1200, M = 1000)  Stmts (N = 1200, M = 1000)  Tiles (N = 2600, M = 2000)  Stmts (N = 2600, M = 2000)
Fixed 1D  20.00%  \(<0.01\%\)  5.88%  \(<0.01\%\)  3.03%  \(<0.01\%\)
Fixed 2D  3.75%  0.13%  0.35%  0.01%  0.09%  \(<0.01\%\)
Fixed 3D  11.25%  24.45%  36.57%  41.25%  42.96%  46.86%
Varied 1D  30.00%  0.02%  44.12%  0.01%  46.97%  0.01%
Varied 2D  7.50%  0.63%  4.88%  0.72%  2.77%  0.77%
Parametric/varied 2D  3.75%  0.39%  0.35%  0.09%  0.09%  0.05%
Parametric/varied 3D  5.00%  64.44%  0.31%  52.36%  0.07%  50.78%
Fixed boundary 3D  18.75%  9.93%  7.55%  5.55%  4.01%  1.53%
Total (count)  80  9,648,000  5168  1,201,200,000  43,296  10,405,200,000
Property  N = 240, M = 200  N = 1200, M = 1000  N = 2600, M = 2000
Invalid  65.49%  53.18%  51.60%
Tileability before  \(<0.01\%\)  \(<0.01\%\)  \(<0.01\%\)
Tileability after  34.88%  46.90%  48.43%
From Table 6, we can see that for each of the problem sizes, \(50\%\) of all tiles are one-dimensional. From a detailed report returned by TC, we find that those one-dimensional tiles are \(1 \times 1 \times 64\) ones including 64 instances of statement S3. However, the total number of statement instances included in those tiles constitutes only \(0.01\%\) of all statement instances in the loop nest domain. The majority of statement instances are included in three-dimensional tiles which constitute up to \(48\%\) of all tiles, most of which are fixed (\(64 \times 64 \times 64\) tiles including 262,144 instances of statement S2). However, the procedures also find relatively large parametric and varied tiles, up to the size \(64 \times 2600 \times 1983\) for \(M=2000\) and \(N=2600\), each accessing 42 MB of data. On average, target tiles access \(52\hbox { kB}\), \(76\hbox { kB}\), and \(78\hbox { kB}\) of memory for each of the studied problem sizes: medium, large, and extra large, respectively.
An interesting property of the symm kernel is its tileability. Nearly half of the iteration space can be tiled with rectangular tiles after the removal of problematic statement instances using the inverse of the transitive closure of a dependence relation (tileability_after).
6.4 nussinov
Table 7 Features of tiles for the nussinov kernel (original tile size \(64 \times 64 \times 64\))
Tile category  Tiles (N = 500)  Stmts (N = 500)  Tiles (N = 2500)  Stmts (N = 2500)  Tiles (N = 5500)  Stmts (N = 5500)
Fixed 1D  49.61%  0.02%  86.15%  0.03%  93.27%  0.02%
Fixed 2D  2.73%  \(<0.01\%\)  0.28%  \(<0.01\%\)  0.07%  \(<0.01\%\)
Fixed 3D  –  –  0.02%  0.02%  0.07%  \(<0.01\%\)
Varied 3D  –  –  0.52%  7.39%  –  –
Parametric 3D  –  –  0.01%  0.40%  –  –
Parametric/varied 3D  13.28%  99.48%  5.26%  92.10%  3.08%  99.97%
Fixed boundary 1D  16.02%  0.01%  6.08%  \(<0.01\%\)  3.15%  \(<0.01\%\)
Fixed boundary 2D  14.84%  0.01%  1.40%  \(<0.01\%\)  0.35%  \(<0.01\%\)
Fixed boundary 3D  3.52%  0.48%  0.28%  0.06%  \(<0.01\%\)  \(<0.01\%\)
Total (count)  256  21,082,750  14,096  2,610,413,750  121,299  27,759,410,250
Property  N = 500  N = 2500  N = 5500
Invalid  82.33%  96.13%  98.22%
Tileability before  \(<0.01\%\)  \(<0.01\%\)  \(<0.01\%\)
Tileability after  \(<0.01\%\)  \(<0.01\%\)  \(<0.01\%\)
Based on Table 7, we conclude that the tile correction technique is able to produce three-dimensional tiles for the Nussinov loop nest. Although the majority of tiles are fixed, they contain only approximately \(0.02\%\) of statement instances. Most of the statement instances are included in parametric & varied 3D tiles; this leads to a variety of target tile sizes. For the studied problem sizes (\(N=500\), \(N=2500\), \(N=5500\)), tiles access on average only \(14~\hbox {kB}\), \(31~\hbox {kB}\), and \(35~\hbox {kB}\) of memory, respectively. However, some of the parametric & varied tiles reach the size of \(2.92~\hbox {MB}\).
For the nussinov kernel, Pluto is able to produce only 2D tiles, while TC generates mostly 3D target tiles both fixed and parametric. For the largest problem size, TC code speedup is 1.31 and 1.05 against original and Pluto code, respectively.
6.5 jacobi2d
Table 8 summarizes the results collected for the 5-point jacobi2d kernel for the original \(64 \times 64 \times 64\) tile. According to reports generated with TC, most of the statement instances (up to \(90\%\)) are included in 3D tiles, which constitute \(44\%\) of all tiles. These tiles contain 258,048 instances of statement S1 in subspace \(63 \times 188 \times 127\), and 262,144 instances of statement S2 in subspace \(64 \times 190 \times 127\). The coverage of such a tile with statement instances, however, is below \(20\%\). Among 2D tiles, almost all are fixed \(1 \times 64 \times 64\) tiles including 4096 instances of statement S1. On average, tiles access \(139\hbox { kB}\), \(188\hbox { kB}\), and \(200~\hbox {kB}\) of memory for the examined problem sizes (medium, large, and extra large), respectively.
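The coverage figure quoted above can be verified by dividing the number of statement instances by the volume of the enclosing subspace. The sketch below uses only the instance counts and subspace extents from the text; the function name is illustrative.

```python
def coverage_percent(instances, extents):
    """Share of a rectangular subspace that is occupied by statement instances."""
    volume = 1
    for extent in extents:
        volume *= extent
    return 100.0 * instances / volume


s1_coverage = coverage_percent(258_048, (63, 188, 127))
s2_coverage = coverage_percent(262_144, (64, 190, 127))
print(f"S1: {s1_coverage:.1f}%, S2: {s2_coverage:.1f}%")
```

Both values are close to 17%, which is consistent with the statement that the coverage stays below 20%.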
For the jacobi2d kernel, both properties—tileability_before and tileability_after—yield almost the same outcome.
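Table 8 Features of tiles for kernel jacobi2d (original tile size \(64 \times 64 \times 64\)). Column groups give the upper bounds; for each problem size, Tiles is the share of tiles and Stmts is the share of statement instances.

| Tile category | Tiles (N = 250, TSTEPS = 100) | Stmts (N = 250, TSTEPS = 100) | Tiles (N = 1300, TSTEPS = 500) | Stmts (N = 1300, TSTEPS = 500) | Tiles (N = 2800, TSTEPS = 1000) | Stmts (N = 2800, TSTEPS = 1000) |
|---|---|---|---|---|---|---|
| Fixed 2D | 28.13% | 0.60% | 45.35% | 0.78% | 47.75% | 0.77% |
| Fixed 3D | 18.75% | 24.98% | 39.97% | 80.67% | 44.84% | 89.22% |
| Parametric/varied 3D | 31.25% | 74.02% | 10.03% | 18.53% | 5.16% | 9.98% |
| Fixed boundary 2D | 21.88% | 0.40% | 4.65% | 0.02% | 2.25% | 0.03% |
| Total | 64 | 12,300,800 | 7056 | 1,684,804,000 | 61,952 | 15,657,608,000 |

| Property | N = 250, TSTEPS = 100 | N = 1300, TSTEPS = 500 | N = 2800, TSTEPS = 1000 |
|---|---|---|---|
| Invalid | 83.99% | 90.14% | 90.11% |
| Tileability before | 1.00% | 0.20% | 0.10% |
| Tileability after | 0.52% | 0.10% | 0.05% |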
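The Total row for jacobi2d can be cross-checked with simple arithmetic. The sketch below is only an illustration under two assumptions that are not stated in this section: the loop bounds are those of the PolyBench jacobi-2d kernel (the time loop runs over TSTEPS iterations and both spatial loops run over the N − 2 interior points), and original rectangular tiles are counted separately for each of the two statements. The function name `jacobi2d_totals` is hypothetical.

```python
from math import ceil


def jacobi2d_totals(n, tsteps, tile=64, statements=2):
    """Return (statement instances, original rectangular tiles).

    Assumes PolyBench-style bounds: t in [0, tsteps), i and j in [1, n - 1),
    and that each statement's iteration space is tiled on its own.
    """
    inner = n - 2  # interior points along each spatial dimension
    instances = statements * tsteps * inner * inner
    tiles = statements * ceil(tsteps / tile) * ceil(inner / tile) ** 2
    return instances, tiles


for n, tsteps in [(250, 100), (1300, 500), (2800, 1000)]:
    instances, tiles = jacobi2d_totals(n, tsteps)
    print(f"N={n}, TSTEPS={tsteps}: {tiles} tiles, {instances} instances")
```

Under these assumptions the computed values agree with the Total row for all three problem sizes. The same arithmetic is consistent with the fixed 3D tile sizes quoted in the text: \(64^3 = 262{,}144\) and \(63 \times 64 \times 64 = 258{,}048\).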
7 Summarizing basic features of target tiles
In this section, we summarize basic features of tiles generated by means of the examined algorithm and compare them with those generated by means of affine transformations. An analysis of the obtained results allows us to state the following: (i) the tile correction technique does not require full loop permutability and always generates target code provided that the transitive closure of the loop nest dependence graph can be calculated, but the size of target tiles can be fixed, and/or varied, and/or parametric; a parametric tile de facto represents a subspace where fixed tiles cannot be generated, i.e., a subspace where tiling is excluded provided that the data size per parametric tile is greater than the cache capacity; (ii) corrected target tile origins (the lexicographically earliest points of tiles) are the same as the original ones, while the shape, size, and dimension of target tiles are determined automatically to respect all the dependences present in the original loop nest; (iii) dimensionalities of corrected target tiles can vary from one to the depth of the loop nest; (iv) in general, the number of tiles produced by means of affine transformations is greater than that generated by applying transitive closure because affine transformations may skew the iteration space, while the examined algorithm does not transform the iteration space; (v) the tile correction technique produces tiled code in a single run, while well-known techniques require two runs: the first finds affine transformations that yield fully permutable loop bands, and the second applies tiling transformations to those bands to produce tiled code.
Comparison of features of tiles generated by the examined algorithm and affine transformations

| Tile features | Examined algorithm | Affine transformations |
|---|---|---|
| Dimensionality | Subject to loop nest dependences; in general, there may be a mix of tiles of different dimensions | Defined by the number of fully permutable loops in a band |
| Shape | There may be different tile shapes for a given loop nest | Rectangular in a transformed iteration space except for boundary tiles |
| Number of tiles | Defined by the original tile size | Depends on affine transformations applied; in general, there may be more tiles than those generated by the considered algorithm |
| Type | There may be fixed, and/or varied, and/or parametric tiles | Fixed tiles defined by constants representing tile size in a transformed iteration space |
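The observation that skewing may increase the number of tiles can be illustrated on a toy two-dimensional iteration space. The following sketch is not taken from the paper: the skewing transformation \((t, j) \rightarrow (t, j + t)\), the space sizes, and the function names are assumptions chosen only for illustration.

```python
from math import ceil


def rectangular_tile_count(t_size, n_size, b):
    """Number of b x b rectangular tiles covering the untransformed space."""
    return ceil(t_size / b) * ceil(n_size / b)


def skewed_tile_count(t_size, n_size, b, skew=1):
    """Number of non-empty b x b tiles after skewing (t, j) -> (t, j + skew * t).

    Tiles are enumerated explicitly by mapping every iteration to the
    identifier of the tile that contains it in the transformed space.
    """
    tiles = {
        (t // b, (j + skew * t) // b)
        for t in range(t_size)
        for j in range(n_size)
    }
    return len(tiles)


print(rectangular_tile_count(8, 8, 4))  # 4 tiles in the original space
print(skewed_tile_count(8, 8, 4))       # 6 non-empty tiles after skewing
```

In this small case the skewed space needs six tiles, several of them only partially filled, while the untransformed space is covered by four full tiles.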
8 Related work
There has been a considerable amount of research into tiling demonstrating how to aggregate a set of loop nest iterations into tiles with each tile as an atomic macro statement, from pioneering papers [13, 24, 30, 32] to those presenting advanced techniques [7, 11, 15, 28].
One of the most advanced reordering transformation frameworks is based on the polyhedral model and affine transformations. This approach includes the following three steps: (i) program analysis, which translates high-level code to its polyhedral representation and provides data dependence analysis based on this representation; (ii) program transformation with the aim of improving program locality and/or parallelization, for which affine transformations are derived and applied; (iii) code generation [7, 9, 10, 16, 24, 28]. All three steps are present in the examined approach. However, step (ii) differs: the approach based on transitive closure does not find or use any affine function. Instead, it applies the transitive closure of a program dependence graph to correct invalid rectangular tiles.
The tiling validity condition by Irigoin and Triolet [13] requires non-negative elements of dependence vectors. The tiling validity condition by Xue [33] checks for lexicographic non-negativity of inter-tile dependences. Mullapudi and Bondhugula [17] demonstrate that those conditions are conservative, i.e., they miss tiling schemes for which the tile schedule is not easy to present statically. They suggest checking whether the inter-tile dependence graph is cycle-free. If it is not, splitting or merging problematic original tiles can be applied manually to break cycles, and a tile schedule is then formed dynamically, i.e., at runtime. The examined approach does not require any validity condition; it is able to generate tiled code for an arbitrary affine loop nest provided that transitive closure can be calculated. However, it does not guarantee that the generated code will perform better than the corresponding original loop nest, for example, when only the serial execution order of statement instances in the original loop nest is valid. An additional analysis of tiled code is required to predict its locality and performance. The goal of the algorithms introduced in this paper and implemented in the TC compiler is to support such an analysis.
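The difference between the non-negativity condition and the cycle-freeness check can be seen on a small example. The sketch below is a simplified illustration, not the implementation used by TC or by the cited authors: it assumes a square two-dimensional iteration space, uniform dependences given as distance vectors, and hypothetical function names. It builds the inter-tile dependence graph for rectangular tiles explicitly and tests it for cycles with Kahn's topological sort.

```python
from itertools import product


def all_nonnegative(distance_vectors):
    """Non-negativity condition: every element of every distance vector is >= 0."""
    return all(c >= 0 for d in distance_vectors for c in d)


def inter_tile_edges(n, tile, distance_vectors):
    """Edges between distinct rectangular tiles of an n x n iteration space."""
    edges = set()
    for i, j in product(range(n), repeat=2):
        for di, dj in distance_vectors:
            ti, tj = i + di, j + dj
            if 0 <= ti < n and 0 <= tj < n:
                src = (i // tile, j // tile)
                dst = (ti // tile, tj // tile)
                if src != dst:
                    edges.add((src, dst))
    return edges


def has_cycle(edges):
    """Kahn's algorithm: a cycle exists iff some node is never removed."""
    nodes = {v for e in edges for v in e}
    indegree = {v: 0 for v in nodes}
    successors = {v: [] for v in nodes}
    for a, b in edges:
        successors[a].append(b)
        indegree[b] += 1
    ready = [v for v in nodes if indegree[v] == 0]
    removed = 0
    while ready:
        v = ready.pop()
        removed += 1
        for w in successors[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)
    return removed != len(nodes)


valid = [(0, 1), (1, 0)]     # all elements non-negative
invalid = [(0, 1), (1, -1)]  # lexicographically positive, one negative element
print(all_nonnegative(valid), has_cycle(inter_tile_edges(4, 2, valid)))
print(all_nonnegative(invalid), has_cycle(inter_tile_edges(4, 2, invalid)))
```

For the second set of distance vectors, iteration (0, 1) reaches (0, 2) in the neighbouring tile, and (0, 2) reaches (1, 1) back in the first tile, so the two rectangular tiles depend on each other. This is the kind of cycle that the examined approach removes by correcting the tiles.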
Papers [22, 23] consider the usage of transitive closure for deriving loop transformations, but they do not consider any tiling.
In paper [31], the authors introduce the definition of “mostly tileable” loop nests for which classic tiling is prevented by an asymptotically insignificant number of iterations. They suggest peeling the problematic iterations of the loop nest and applying rectangular tiling to the remaining iterations. The authors demonstrate the application of their algorithm to only one code implementing Nussinov’s algorithm. The scope of the applicability of that algorithm is not presented. Instead of peeling problematic tiles, the examined algorithm corrects them to make tiling valid. In this paper, we present a technique that utilizes the transitive closure of a dependence relation to find a set of problematic statement instances and calculate loop nest tileability. That technique is fully automatic and can be applied to an arbitrary affine loop nest.
Index set splitting [12] partitions the loop nest iteration space. This enables finding affine transformations for different partitions and can be useful to break cycles in an inter-tile dependence graph.
Papers [1, 5] demonstrate how to extract coarse- and fine-grained parallelism by applying different Iteration Space Slicing algorithms; however, they do not consider any tiling transformation.
Paper [4] presents the examined algorithm and a proof of the correctness of the target code it generates, but it does not provide any analysis of features of the generated tiles.
Paper [6] presents how to generate parallel synchronization-free tiled code based on transitive closure and the application of the discussed algorithm to different real-life benchmarks, but it does not provide any analysis of features of target tiles.
Summing up, we may conclude that this paper is the first attempt to present results allowing us to recognize basic features of tiles generated by means of the discussed algorithm.
9 Conclusion
In this paper, we presented the results of an analysis of features of tiles generated by means of correction of original rectangular tiles applying the transitive closure of dependence graphs. We introduced three algorithms which allow us to recognize such features of target tiles as type (fixed/parametric/varied), size, and dimensionality as well as loop nest tileability. We discussed differences between these features for tiles generated by means of applying transitive closure and affine transformations.
We also presented the results of experiments demonstrating the features of target tiles generated with the examined approach for real-life codes included in the PolyBench benchmark suite and how they affect serial tiled code performance.
In the future, we intend to present other algorithms based on transitive closure that will allow us to (i) use different shapes of original tiles (e.g., parallelepiped, diamond); (ii) form target tiles automatically without using original ones; (iii) examine different techniques to run target tiles in parallel.
Footnotes
 3. Detailed reports can be found at http://tcoptimizer.sourceforge.net under the results directory.
References
 1. Beletska A, Bielecki W, Cohen A, Palkowski M, Siedlecki K (2011) Coarse-grained loop parallelization: Iteration space slicing vs affine transformations. Parallel Comput 37:479–497
 2. Bielecki W (2013) Using basis dependence distance vectors to calculate the transitive closure of dependence relations by means of the Floyd-Warshall algorithm. In: Widmayer P, Xu Y, Zhu B (eds) Combinatorial Optimization and Applications. Springer International Publishing, Cham, pp 129–140
 3. Bielecki W, Klimek T, Palkowski M, Beletska A (2010) An iterative algorithm of computing the transitive closure of a union of parameterized affine integer tuple relations. In: COCOA 2010: Fourth International Conference on Combinatorial Optimization and Applications. Lecture Notes in Computer Science, vol 6508/2010, pp 104–113
 4. Bielecki W, Palkowski M (2016) Tiling arbitrarily nested loops by means of the transitive closure of dependence graphs. Int J Appl Math Comput Sci 26(4):919–939
 5. Bielecki W, Palkowski M, Klimek T (2012) Free scheduling for statement instances of parameterized arbitrarily nested affine loops. Parallel Comput 38(9):518–532
 6. Bielecki W, Palkowski M, Skotnicki P (2018) Generation of parallel synchronization-free tiled code. Computing 100(3):277–302
 7. Bondhugula U et al (2008) Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model. In: Hendren L (ed) Compiler construction. Lecture notes in computer science. Springer, Berlin, pp 132–146
 8. Bondhugula U et al (2008) A practical automatic polyhedral parallelizer and locality optimizer. SIGPLAN Not 43(6):101–113
 9. Feautrier P (1992) Some efficient solutions to the affine scheduling problem: I. One-dimensional time. Int J Parallel Program 21(5):313–348
 10. Feautrier P (1992) Some efficient solutions to the affine scheduling problem: II. Multidimensional time. Int J Parallel Program 21(6):389–420
 11. Griebl M (2004) Automatic Parallelization of Loop Programs for Distributed Memory Architectures. University of Passau. Habilitation thesis
 12. Griebl M, Feautrier P, Lengauer C (2000) Index set splitting. Int J Parallel Program 28(6):607–631
 13. Irigoin F, Triolet R (1988) Supernode partitioning. In: Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’88. ACM, New York, NY, USA, pp 319–329
 14. Kelly W et al (1996) Transitive closure of infinite graphs and its applications. Int J Parallel Program 24(6):579–598
 15. Lim A et al (1999) An affine partitioning algorithm to maximize parallelism and minimize communication. In: Proceedings of the 13th ACM SIGARCH International Conference on Supercomputing. ACM Press, pp 228–237
 16. Lim AW, Lam MS (1994) Communication-free parallelization via affine transformations. In: 24th ACM Symposium on Principles of Programming Languages. Springer, pp 92–106
 17. Mullapudi RT, Bondhugula U (2014) Tiling for dynamic scheduling. In: Fourth International Workshop on Polyhedral Compilation Techniques, Vienna
 18. Palkowski M, Bielecki W (2018) Parallel tiled codes implementing the Smith-Waterman alignment algorithm for two and three sequences. J Comput Biol 25(10):1106–1119
 19. Palkowski M, Bielecki W (2018) Tuning iteration space slicing based tiled multicore code implementing Nussinov’s RNA folding. BMC Bioinform 19(1):12
 20. Palkowski M, Klimek T, Bielecki W (2015) TRACO: an automatic loop nest parallelizer for numerical applications. In: Federated Conference on Computer Science and Information Systems
 21.Pouchet LN (2015) The polyhedral benchmark suite/c4.1. http://web.cse.ohiostate.edu/~pouchet/software/polybench. Accessed 28 Dec 2017
 22. Pugh W, Rosser E (1997) Iteration space slicing and its application to communication optimization. In: International Conference on Supercomputing, pp 221–228
 23. Pugh W, Rosser E (1999) Iteration space slicing for locality. In: International Workshop on Languages and Compilers for Parallel Computing. Springer, pp 164–184
 24. Ramanujam J, Sadayappan P (1992) Tiling multidimensional iteration spaces for multicomputers. J Parallel Distrib Comput 16(2):108–120
 25. Verdoolaege S et al (2011) Transitive closures of affine integer tuple relations and their over-approximations. In: Proceedings of the 18th International Conference on Static Analysis, SAS’11. Springer, Berlin, pp 216–232
 26.Verdoolaege S (2007) barvinok: user guide. Version 0.40. http://barvinok.gforge.inria.fr/barvinok.pdf. Accessed 28 Dec 2017
 27. Verdoolaege S (2010) isl: an integer set library for the polyhedral model. In: Mathematical software—ICMS 2010. Lecture notes in computer science, vol 6327. Springer, Berlin, pp 299–302
 28. Verdoolaege S, Carlos Juega J, Cohen A, Ignacio Gomez J, Tenllado C, Catthoor F (2013) Polyhedral parallel code generation for CUDA. ACM Trans Arch Code Optim 9(4):54
 29. Verdoolaege S, Grosser T (2012) Polyhedral extraction tool. In: Proceedings of the 2nd International Workshop on Polyhedral Compilation Techniques. Paris, France
 30. Wolf ME, Lam MS (1991) A data locality optimizing algorithm. In: Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation, PLDI ’91. ACM, New York, NY, USA, pp 30–44
 31. Wonnacott D, Jin T, Lake A (2015) Automatic tiling of “mostly-tileable” loop nests. In: 5th International Workshop on Polyhedral Compilation Techniques, Amsterdam
 32. Xue J (1997) On tiling as a loop transformation. Parallel Process Lett 7(4):409–424
 33. Xue J (2000) Loop tiling for parallelism. Kluwer Academic Publishers, Norwell, MA, USA
Copyright information
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.