=Paper= {{Paper |id=Vol-2543/rpaper19 |storemode=property |title=Reconstruction of Multi-dimensional Arrays in SAPFOR |pdfUrl=https://ceur-ws.org/Vol-2543/rpaper19.pdf |volume=Vol-2543 |authors=Nikita Kataev,Vladislav Vasilkin |dblpUrl=https://dblp.org/rec/conf/ssi/KataevV19 }} ==Reconstruction of Multi-dimensional Arrays in SAPFOR== https://ceur-ws.org/Vol-2543/rpaper19.pdf
 Reconstruction of multi-dimensional arrays in SAPFOR

              Nikita Kataev1[0000-0002-7603-4026], Vladislav Vasilkin2[0000-0002-7918-1849]
    1 Keldysh Institute of Applied Mathematic RAS Miusskaya sq., 4, 125047, Moscow, Russia
     2
         Lomonosov Moscow State University, GSP-1, Leninskie Gory, 11999, Moscow, Russia
                                 kaniandr@gmail.com



           Abstract. Low-level representation of programs in the form of LLVM IR al-
           lows for various optimizations to improve the quality of program analysis in
           SAPFOR (System FOR Automated Parallelization). Being the same for differ-
           ent high-level languages, LLVM IR allows us to explore multilingual applica-
           tions. At the same time, it loses some features of the program, which are availa-
           ble in its higher level representation. One of these features is the multi-
           dimensional structure of the arrays. Multi-dimensional view improves the accu-
           racy of data dependency analysis, since the linearized representation of arrays
           with parametric sizes may not be an affine expression and it will not be possible
           to apply integer linear programming to analyze it. In addition, the use of multi-
           dimensional arrays allows us to use a multi-dimensional processor matrix and to
           parallelize a whole loop nests, rather than a single loop in the nest. This way,
           parallelism of a program is going to be increased. These opportunities are na-
           tively supported in the DVM system. This paper discusses the approach used in
           the SAPFOR system to recover the form of multi-dimensional arrays by their
           linearized representation in LLVM IR. The proposed approach has been suc-
           cessfully evaluated on various applications.

           Keywords: Program Analysis, Semi-automatic Parallelization, SAPFOR,
           DVM, LLVM.


1          Introduction

The multi-dimensional arrays are widely used in many computational programs. For
example, they enable descriptions of the object properties at various points in a multi-
dimensional computing space. The DVM system [1, 2] relies on the multi-dimensional
view of arrays to increase parallelization opportunity of a program. For example, it
proposes specifications which simplify mapping of multi-dimensional arrays to a
multi-dimensional grid of processors. Thus, to achieve sustainable performance of
parallel programs it is necessary to analyze multi-dimensional data structures in the
program.
   SAPFOR (System FOR Automated Parallelization) [3, 4] produces a parallel ver-
sion of a program in a semi-automatic way according to DVMH model of the parallel
programming for heterogeneous computational clusters. The DVMH model enables
consistent distribution of data between processors of a computational cluster which


Copyright © 2020 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
210


are represented as a multi-dimensional grid. The presence of align directive in a
source code introduces the alignment rules which specify the location of elements of
arrays relative to each other. These rules use affine expressions to set a correspond-
ence between points of multi-dimensional index spaces of two objects. To avoid a
communication overhead, the elements of different arrays which are used on the same
iteration of the loop should be mapped to the same processor. Thus, accesses to arrays
in loops and the form of subscript expressions in these accesses imply the alignment
rules.
    Data dependence analysis is obviously required for program parallelization. The
visibility of multi-dimensional view of data allows us to significantly increase the
accuracy and reduce the computational complexity of data dependency analysis in
many cases. To compute loop-carried data dependencies it is possible to perform the
pairwise comparison of expressions which calculate the addresses of accessed ele-
ments. If the expressions that calculate the offset relative to the beginning of the array
are affine with respect to the induction variables of the enclosing loops, then compu-
tation of data dependencies becomes NP-hard problem. In this case integer linear
programming (ILP) solvers are applicable. Moreover, to reduce the complexity of the
problem being solved, heuristics are used. If subscript expressions in a delinearized
accesses satisfy some conditions (for example, expressions are SIV [5] (i.e. contains a
single index variable)) the pairwise comparison of subscript expressions drastically
reduce the complexity of the analysis. The knowledge of the multi-dimensional view
of arrays makes it possible to recover subscript expressions from the linearized repre-
sentation of array accesses.
    It should be noted that a parametric size of an array makes the linearized accesses
non-affine so it becomes almost impossible to verify the absence of data dependen-
cies.
    The SAPFOR system uses LLVM [6] intermediate representation (IR) to perform
program analysis. This representation allows us to analyze programs for diverse pro-
gramming languages (Fortran, C). So, it becomes possible to overcome multi-lingual
issue essential for large-scale computational applications [7]. LLVM also provides the
ability to hide program transformation from the user in order to improve the quality of
the analysis [8, 9]. Implementation of LLVM-based analysis tool also eliminates the
need to consider syntactic features of different programming languages and to analyze
the diverse set of available language constructs. To preserve the relation with a higher
level representation, the debug information can be used. For these purposes LLVM
uses DWARF debugging file format (DWARF [10]) which is common for various
languages. However, linearized view of array accesses impedes program analysis.
Multi-dimensional view of arrays of known constant size is only visible in LLVM IR.
For example, the type [100 x [200 x float]] can be used to declare an array
of 100 * 200 elements.
    There are two purposes of array delinearization in the context of SAPFOR. The
first one is the reconstruction of the array view which was multi-dimensional in the
source code. And the second one is the search for equivalent the multi-dimensional
view for arrays represented in the program source code in a linearized form. In this
paper, we are primarily interested in the approach to recover multi-dimensional view
                                                                                      211


of arrays which were originally multi-dimensional. This follows from the restrictions
which DVM system imposes on the programs: the use of multi-dimensional arrays in
the source code is required.


2      Delinearization in LLVM

The paper [11] proposes an approach to recovering the multi-dimensional view of
arrays of parametric size. The presented approach is partially implemented in the
context of LLVM and it was initially focused on the use in the Polly framework [12].
Polly is a high-level loop and data-locality optimizer. It can also exploit loop-level
parallelism for systems with shared memory, including GPU. Polly performs IR-level
optimization, therefore, the correspondence between delinearized arrays and the
source code objects is not required. Loops can be optimized separately, so delineari-
zation is required in the context of the evaluated loop only instead of the entire pro-
gram. However, consistent delinearization for all accesses in the entire program is
required to exploit parallelization opportunity in case of HPC systems with distributed
memory.
   The paper [11] is focused on recovering the multi-dimensional view of arrays of
parametric size only. Separately, Polly implements delinearization for arrays of
known constant size. LLVM provides a simple array type to represent sequences of
elements in memory. The array types can be nested, so definition of multi-
dimensional arrays of known constant sizes is also supported. That means that in this
case delinearization is not actually required, since the array type can be used to restore
each subscript expression.
   If some of array dimensions have parametric sizes while other dimensions have
known constant sizes the whole array is treated as an array of parametric size. So, the
recovered number of dimensions and recovered multi-dimensional view may differ
from the definition of array in the original program.
   It also should be noted that to implement delinearization Polly uses its internal data
structures which are only partially included in the LLVM core.
   Thus, the use of the proposed algorithms in SAPFOR directly becomes impossible.
   Let us consider the main points of the approach presented in [11]. The linearized
view of the access A [S_0] ... [S_N – 1] to the array A [D_0] ... [D_N – 1] is

                   A + S_0 * D_1 *…* D_N–1 + … + S_N–1.                               (1)

   In this case, D_I is the size of the dimension I and S_I is the corresponding sub-
script expression, I takes values from 0 to N – 1. This equation implies the following
ones:

    C_N–1 * D_N–1=GCD(S_0 * D_1 * … * D_N–1, …, S_N–2 * D_N–1),                       (2)

C_I * D_I=GCD(S_0 * D_1 * … * D_N–1, … S_I–1 * D_I * … D_N–1) / (D_I+1
*…* D_N–1), 0