Interactive Parallelization of C Programs in SAPFOR

Nikita Kataev [0000-0002-7603-4026]
Keldysh Institute of Applied Mathematics RAS, Miusskaya sq., 4, 125047, Moscow, Russia
kaniandr@gmail.com

Abstract. SAPFOR (System FOR Automated Parallelization) is a software development suite that focuses on reducing the cost of manual program parallelization. SAPFOR produces parallel programs according to the high-level DVMH parallel programming model. SAPFOR relies on an implicitly parallel programming model, so it includes an automatic parallelizing compiler. On the other hand, it allows the user to guide parallelization. The system provides the user with a set of source-to-source transformations which can be performed automatically. The user may also assert implicit program properties useful for parallelization. This paper presents the interactive subsystem of SAPFOR and discusses how it supports iterative parallelization. We advocate the use of the client-server model to organize the interaction with the user. We also present the data transfer optimization technique implemented in the automatic parallelizing compiler. This technique reduces communication overhead when a GPU is used to execute a parallel program. We guide SAPFOR to perform semi-automatic parallelization of the NAS Parallel Benchmarks 3.3.1. The paper evaluates the performance of the parallel versions that SAPFOR builds automatically and compares it with the performance of manually written versions.

Keywords: Semi-automatic Parallelization, Program Analysis, Program Transformation, Graphical User Interface, SAPFOR, DVM, LLVM

1 Introduction

Automation of parallel programming aims to simplify the time-consuming and error-prone manual development of programs for diverse parallel architectures. To achieve the best performance, developers have to choose between different parallel programming models and even apply them simultaneously. Thus, the growing complexity of both hardware and software makes the development of automation tools very much in demand. Existing research covers a wide range of approaches.

On the one hand, automatic parallelizing compilers [1–3] are the most convenient way to parallelize a program. However, in general, manually written parallel programs still outperform the results of automatic parallelization. Another approach is to impose restrictions on the use of natural constructs of sequential programming languages and to rely on additional user-defined assertions that reveal implicit properties of a sequential program [4, 5]. Much attention is also paid to tools that examine the program being parallelized but are not able to make decisions on their own [6, 7]. These assistance tools only highlight the most important program properties, while the developer remains responsible for program transformation and the insertion of parallel constructs.

SAPFOR (System FOR Automated Parallelization) [8] combines the approaches mentioned above to reduce the cost of parallel programming and, at the same time, to provide the performance gain requested by the user for the resulting programs. Essentially, SAPFOR relies on an automatic parallelizing compiler that generates parallel programs in the directive-based DVMH model [9, 10]. Thus, the system is responsible for the exploitation of parallelism inherent in the program.
At the same time, SAPFOR allows the user to guide program transformation and to provide the analysis tools with hints that enable program optimization. In this semi-automatic way, a program can be prepared for further automatic parallelization. SAPFOR works on the lower-level LLVM IR [11] and the higher-level Clang AST (Abstract Syntax Tree) to analyze and transform C programs. The paper makes the following contributions:

– The interactive subsystem of SAPFOR, which gives the user control over the parallelization process, and an approach to its application for semi-automatic program parallelization.
– An automatic heterogeneous compiler for well-formed sequential C programs, which enables SAPFOR to generate parallel programs according to the DVMH model.
– Experimental evaluation of the implemented approaches on some C programs from the NAS Parallel Benchmarks [12, 13] and comparison of the obtained parallel programs with existing OpenMP and OpenCL versions.

The rest of the paper is organized as follows. Section 2 presents the capabilities of the interactive subsystem of SAPFOR and discusses its application to program analysis, transformation, and parallelization. Section 3 is devoted to the automatic mapping of well-formed programs to systems with shared memory; we also consider the optimization of data transfer between CPU memory and accelerator memory. Section 4 briefly outlines the implementation details of the SAPFOR components. Section 5 focuses on the experimental evaluation of the implemented approaches and compares the performance of the generated parallel programs with existing ones. Finally, Section 6 presents the conclusion and future work.

2 Interactive Parallelization

SAPFOR allows the user to control the parallelization from the analysis of an original program to the construction of its parallel version. Interaction with the user in SAPFOR comprises several key aspects which determine the main capabilities of the interactive subsystem. This subsystem has to allow the user:

– to explore the information structure of the program being parallelized, using a suitable way to investigate its properties,
– to emphasize the program properties which cannot be determined automatically but which are still essential for program parallelization,
– to guide the system through the available program transformations which help to overcome parallelization issues,
– to manually parallelize some code regions and to control decisions made by the system (for example, the choice of data distribution).

The interactive subsystem is a separate component of SAPFOR, and it relies on the client-server model. The SAPFOR core subsystem, which includes analysis and transform passes, provides an interface to exchange data with clients. It specifies a set of available requests, which a client produces, and the corresponding responses, which the client has to understand, encoded in JSON. Thus, the client does not need to know how the core subsystem processes a request, so different interactive subsystems may exist at the same time. Some of them could be implemented as plugins for well-known development environments, while others could be standalone tools. SAPFOR also provides a console application to access the main capabilities of the core subsystem. It simplifies the integration of SAPFOR with build automation tools, such as Make. For example, an existing Makefile can be modified to instrument the program before it is analyzed at runtime [14].

The interactive subsystem helps the user explore the parallelism available in a program and highlights the issues that prevent its parallelization. As SAPFOR is designed to reveal loop-level parallelism, the interactive subsystem shows information about each loop in the program. This information summarizes properties of the program control flow and of the memory accesses inside the loop body. It also determines the form of each loop nest, because parallelization is only possible for loops that have canonical loop form [15]. Moreover, parallelization of a whole perfect loop nest instead of a single loop may significantly increase program performance and even enables parallel execution of loops with regular loop-carried data dependencies [16].

SAPFOR investigates whether any function call inside the loop has a side effect or performs I/O operations. The absence of indirect calls and the absence of multiple exits out of the loop body are also checked. To explore the mentioned control-flow issues in detail, the user may request SAPFOR to draw the sequence of calls which produces them. If a spurious data dependence is found, the system shows a way to eliminate it. The system supports recognition of privatizable, reduction, and induction variables. The DVMH model makes it possible to implement hyperplane or pipeline parallel algorithms in a natural way, which means that SAPFOR exploits loop-level parallelism even if there are some kinds of loop-carried data dependencies.
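As a simple illustration of these loop properties, consider the following sketch (constructed for this discussion, not taken from the benchmarks). The loop has canonical form, t is privatizable because it is written before it is read on every iteration, and s is a sum reduction; a break statement or an I/O call inside the body would be reported as a parallelization issue.

    /* A loop that satisfies the listed requirements. */
    double dot(const double *a, const double *b, int n) {
      double s = 0.0;
      for (int i = 0; i < n; ++i) { /* canonical loop form: integer induction
                                       variable with a loop-invariant bound */
        double t = a[i] * b[i];     /* t is written before it is read,
                                       hence it can be privatized */
        s += t;                     /* s is recognized as a sum reduction */
      }
      return s;
    }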
A source-level alias tree [17] is suitable to represent the properties of memory accesses in detail. For instance, let us consider the source code in Table 1; the interactive subsystem draws the alias tree shown in Fig. 1.

Table 1. Example of a source code with a regular data dependence

    #pragma dvm parallel([I][J][K]) \
                tie(A[I][J][K]) across(A[1:1][0:0][0:0])
    for (I = 1; I < NX - 1; I++)
      for (J = 1; J < NY - 1; J++)
        for (K = 1; K < NZ - 1; K++)
          A[I][J][K] = (A[I - 1][J][K] + A[I + 1][J][K]) / 2;

If there is a loop-carried data dependence between accesses to a variable, the corresponding node in the alias tree is colored red (array A in Fig. 1). Next to the node, the user can observe the dependence type and distance, the type of the accesses which cause this dependence (read, write, function call), and the accuracy of the analysis (may or must). Spurious data dependencies and shared variables are colored green. Induction variables and some types of privatizable variables (for instance, a variable whose value is read by a statement after the loop) may require additional program transformation and are colored white. The alias tree for a loop is a subtree of the alias tree for the entire function; it contains only the variables which are accessed in the loop.

Fig. 1. Analysis results for the code in Table 1.

The data dependence in Fig. 1 has a constant distance, so the loop nest in Table 1 can be parallelized automatically, and SAPFOR inserts the corresponding annotations into the source code. The tie clause describes the correspondence between the dimensions of the array and the loops in the nest. It allows the DVMH run-time system to optimize the placement of the array in the memory of the accelerator.

In the presence of dynamic analysis results, SAPFOR integrates static and dynamic data and shows the user an elaborate summary of the data-dependence analysis.
Unfortunately, the lack of analysis results often affects parallel program performance. That is why the interactive subsystem helps the user manually describe unknown properties of memory accesses and encode them, similarly to dynamic analysis results, in JSON.

In general, when SAPFOR is used, parallelization is an iterative process which includes five main steps:

1. The user prepares the program to be analyzed and transformed. Firstly, the compiler options which are necessary to parse the source code should be specified. Secondly, the user has to deal with some limitations of the current version of SAPFOR. The current implementation of the interactive subsystem is able to process a single source file at a time; multiple header files and include directives can certainly be used without restrictions. This limitation primarily stems from the source-to-source transform passes, which cannot transform sources across different ASTs. It is also not recommended to place macros inside the source code regions to be transformed: the Clang AST does not contain these constructs explicitly, so transform passes are not aware of macro definitions at an arbitrary point in the source code. We suggest replacing definitions of integer constants with enumerations and definitions of floating-point constants with const-qualified variables (see the sketch after this list). Otherwise, SAPFOR would not be able to transform the program while ensuring its correctness.
2. The user measures the original program performance to determine the most time-consuming program regions, which should be parallelized in the first place. The performance analyzer, a part of the DVMH system, allows the user to collect such data for the original sequential program. A future version of the interactive subsystem will summarize the gathered data and identify loops that need attention; unfortunately, at the moment the user has to investigate the collected data manually. The second applicable dynamic tool is the dynamic analyzer, which is a part of SAPFOR. The data it collects are integrated with static analysis results in the next step.
3. The user runs the static analysis tool, which is a part of SAPFOR, and explores the analysis results. The interactive subsystem allows the user to specify global analysis options and to assist SAPFOR with high-level hints.
4. The user selects the regions of the source code to parallelize in the first place. If parallelization of these regions is impossible without preliminary program transformation, the interactive subsystem highlights the main issues and offers the user a choice of transformations to apply (inline expansion, dead code elimination, expression propagation, and others). If the automatic transformation is successful, the parallelization process goes back to step (2) or (3). Otherwise, the program has to be transformed manually.
5. Finally, if the previous steps have produced a well-formed program suitable for automatic parallelization, the user can annotate the source code to specify the parts that should be executed in parallel. Then the automatic parallelizing compiler is used to generate a parallel program; the directive-based DVMH model or OpenMP can be used. If the user does not annotate code regions, the entire program is parallelized.
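The replacement of preprocessor constants suggested in step (1) may look as follows (a minimal sketch with invented names; any integer or floating-point constant follows the same pattern):

    /* Before: the constants exist only at the preprocessing level,
       so they are invisible to source-to-source transform passes. */
    #define NX  64
    #define EPS 1.0e-8

    /* After: the constants are explicit nodes of the Clang AST. */
    enum { NX = 64 };
    static const double EPS = 1.0e-8;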
3 Data Transfer Optimization

One of the key features of SAPFOR is an automatic parallelizing compiler which enables it to generate parallel code not only for a multi-core CPU but also for an accelerator. This compiler inserts high-level specifications into the original program according to the directive-based DVMH programming model. SAPFOR identifies parallelizable loops, and the outermost ones become the target for parallelization. As mentioned in the previous section, the interactive subsystem highlights these loops, so the user can foresee the result of the subsequent automatic compilation.

However, the most complicated problem is to specify the data transfer between CPU memory and accelerator memory. A compute region is a region of a program which can be executed on accelerators; it comprises one or more parallel loops. If the data transfer specifications were placed immediately before and after each compute region, communication overhead would drastically degrade the performance of the parallel program. Hence, the compiler has to prepare data for the accelerator as early as possible and to request data from the accelerator as late as possible. The data transfer optimization algorithm works as follows (a sketch of the resulting code is given after the list):

1. We analyze the data flow graph to determine input, output, and local data for each compute region separately. A variable is said to be input if it has a value before the compute region and a statement in the region uses this value. A variable is said to be output if a statement defines its value inside the compute region and another statement uses this new value after the compute region. The value of a local variable is not important outside the region.
2. We visit the strongly connected components of the call graph in postorder [18].
   a. For each function, we consider sibling compute regions and investigate the control flow between them. The compiler aims to remove data transfer specifications from any path between these regions and to place them before the first region and after the second one. This optimization is possible if the output data of the first region and the input data of the second region are not accessed on any path between the regions. If there are no computations between sibling regions, the compiler also joins these regions to decrease the initialization overhead at the region entry point.
   b. For each function, we consider a sequential loop if its body contains compute regions. The compiler aims to move data transfer specifications out of the body of this loop.
   c. Finally, the control flow graph of the entire function is analyzed in order to move the data transfer specifications out of the function body. The necessary specifications are then placed immediately before and after each call to the processed function.
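The following sketch illustrates the intended result for case (2b). It is our own example, assuming the region, actual, and get_actual directives of the C-DVMH language [9, 10]; the specifications actually emitted by SAPFOR may differ in detail:

    enum { N = 1024 };

    void smooth(double A[N], double B[N], int niter) {
      int it, i;
      /* Without optimization, the transfer specifications would surround
         the region inside the loop, copying data on every iteration. */
      #pragma dvm actual(A, B)          /* input data: prepared as early as possible */
      for (it = 0; it < niter; ++it) {  /* sequential loop around a compute region */
        #pragma dvm region              /* may be executed on an accelerator */
        {
          #pragma dvm parallel([i]) tie(A[i])
          for (i = 1; i < N - 1; ++i)
            A[i] = 0.5 * (A[i] + B[i]);
        }
      }
      #pragma dvm get_actual(A)         /* output data: requested as late as possible */
    }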
4 Implementation Details

The main component of SAPFOR is the core subsystem. For our work, we use LLVM to analyze the program and Clang to transform the source code; LLVM 11 is currently supported. We built a Visual Studio Code extension to implement the interactive subsystem. Fig. 2 shows a high-level overview of SAPFOR.

Fig. 2. A high-level overview of SAPFOR

Firstly, the user creates a project which comprises the program sources and a project file that includes a JSON compilation database. This database specifies the way to compile the translation units. Then the interactive subsystem initializes the connection with the core subsystem; we use Node.js and C++ socket libraries to send and receive messages encoded in JSON. After the connection is established, the core subsystem uses the compilation database to build the Clang AST for the original program and then lowers the Clang AST to LLVM IR. The program representation loses some language-specific information when it is lowered to the LLVM IR level, so we maintain a correspondence between the Clang AST and the LLVM IR. This allows us to propagate analysis results from the lower level to the higher level of program representation.

To improve analysis accuracy, the core subsystem runs transform passes, among them scalar replacement of aggregates, loop rotation, unreachable code elimination, removal of declarations of unused functions, elimination of unreachable internal globals, propagation of function attributes, and instruction and control flow simplification. Moreover, SAPFOR makes property-sensitive transformations, i.e., it applies one sequence of transformations to analyze one kind of property and another sequence to analyze another kind. To maintain multiple transformation sequences, we clone the LLVM IR and use the source-level alias tree [17] to propagate analysis results from the transformed LLVM IR to the original one.

5 Experimental Evaluation

We evaluate SAPFOR capabilities on three benchmarks from the NAS Parallel Benchmarks: EP (Embarrassingly Parallel), BT (Block Tri-diagonal solver), and CG (Conjugate Gradient). As hardware, we use a workstation with a 6-core Intel Xeon E5-1660 v2 CPU, 3.70 GHz (2 threads per core, 12 threads in total), and a GeForce GTX 1660 Ti GPU. We compile the programs with Intel Compiler 19.0.2.187 and CUDA tools V10.2; all programs are compiled with the -O3 optimization option. After user-guided transformation of the code, the automatic parallelizing compiler is run. We also evaluate manually written OpenCL and OpenMP versions [12, 13]. Fig. 3 shows the execution time of the parallel versions and compares the results for three classes A, B, and C, which determine problem sizes and parameters.

Fig. 3. The execution time (s) of the NAS Parallel Benchmarks (BT, CG, and EP, classes A, B, and C) for six versions: the original and transformed sequential programs, the SAPFOR DVMH versions on CPU and GPU, and the manually parallelized OpenMP (CPU) and OpenCL (GPU) versions

The OpenCL version of the CG benchmark has vectorized inner loops and uses shared memory on the GPU, which explains its faster execution on the GPU compared to the DVMH version. Despite this, a significant advantage of the DVMH versions is their lower maintenance complexity: an ordinary compiler ignores the specifications of parallelism, so the DVMH versions are still suitable for sequential execution, and the user may use plain C to modify them.

The main transformation required to parallelize the benchmarks is inline expansion. The current version of SAPFOR builds a coarse summary to represent the memory a called function accesses, so it is not able to disprove data dependencies in some cases. We also apply the dynamic analysis tool to reveal privatizable arrays in the BT and EP benchmarks.

6 Conclusion

The paper presents the interactive subsystem of SAPFOR (System FOR Automated Parallelization) and proposes an approach to the optimization of data transfer between the CPU and accelerators. The automatic parallelizing compiler, as a part of SAPFOR, relies on this approach when it annotates source code with data transfer specifications. It uses the DVMH programming model to exploit loop-level parallelism on multi-core processors and accelerators.

We implemented the interactive subsystem as an extension for the Visual Studio Code editor [19], which is available for many platforms.
Moreover, the editor can use SSH to run extensions directly on a remote machine, so the user can work directly on a parallel computing system and take full advantage of interactive parallelization.

The interactive subsystem helps the user understand parallelization issues and evaluate the decisions made by SAPFOR. It not only shows analysis results, but also allows the user to participate in parallelization and to guide SAPFOR through the available program transformations. The main goal of interactive parallelization is to obtain a well-formed sequential program which can be parallelized automatically.

Application of SAPFOR to the NAS Parallel Benchmarks 3.3.1 shows promising results. The automatically generated parallel versions have a performance similar to the manually parallelized ones. However, the conducted study demonstrates that SAPFOR still suffers from an inaccurate interprocedural analysis that is sometimes unable to disprove data dependencies. In future work, we plan to improve the analysis accuracy and to extend the set of available transformations in order to increase the performance of parallel programs.

References

1. Grosser, T., Groesslinger, A., Lengauer, C.: Polly - performing polyhedral optimizations on a low-level intermediate representation. Parallel Processing Letters, 22(04), 1250010 (2012).
2. Grosser, T., Hoefler, T.: Polly-ACC: Transparent compilation to heterogeneous hardware. In: ICS'16: Proceedings of the 2016 International Conference on Supercomputing, pp. 1–13 (2016). https://doi.org/10.1145/2925426.2926286.
3. Bondhugula, U., Hartono, A., Ramanujam, J., Sadayappan, P.: A practical automatic polyhedral parallelizer and locality optimizer. SIGPLAN Notices, 43(6), 101–113 (2008).
4. Vandierendonck, H., Rul, S., De Bosschere, K.: The Paralax Infrastructure: Automatic Parallelization with a Helping Hand. In: 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT), IEEE (2010).
5. Baghdadi, R., Beaugnon, U., Cohen, A., Grosser, T., Kruse, M., Reddy, C., Verdoolaege, S., Betts, A., Donaldson, A.F., Ketema, J., Absar, J., van Haastregt, S., Kravets, A., Lokhmotov, A., David, R., Hajiyev, E.: PENCIL: A platform-neutral compute intermediate language for accelerator programming. In: Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT'15), pp. 138–149. IEEE Computer Society, Washington, DC, USA (2015). https://doi.org/10.1109/PACT.2015.17.
6. Kim, M., Kim, H., Luk, C.-K.: Prospector: A Dynamic Data-Dependence Profiler to Help Parallel Programming. In: 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar'10) (2010).
7. Intel Parallel Studio, https://software.intel.com/en-us/parallel-studio-xe, last accessed 2020/11/25.
8. Klinov, M.S., Krukov, V.A.: Avtomaticheskoe rasparallelivanie Fortran-programm. Otobrazhenie na klaster [Automatic parallelization of Fortran programs. Mapping onto a cluster]. Vestnik Nizhegorodskogo universiteta im. N.I. Lobachevskogo, (2), 128–134 (2009). (In Russian)
9. Konovalov, N.A., Krukov, V.A., Mikhajlov, S.N., Pogrebtsov, A.A.: Fortran DVM: a Language for Portable Parallel Program Development. Programming and Computer Software, 21(1), 35–38 (1995).
10. Bakhtin, V.A., Klinov, M.S., Krukov, V.A., Podderiugina, N.V., Pritula, M.N., Sazanov, Iu.L.: Rasshirenie DVM-modeli parallelnogo programmirovaniia dlia klasterov s geterogennymi uzlami [Extension of the DVM parallel programming model for clusters with heterogeneous nodes]. Vestnik Iuzhno-Uralskogo gosudarstvennogo universiteta, seriia "Matematicheskoe modelirovanie i programmirovanie", 18(277), (12), 82–92 (2012). (In Russian)
11. Lattner, C., Adve, V.: LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In: Proc. of the 2004 International Symposium on Code Generation and Optimization (CGO'04), Palo Alto, California (2004).
12. NAS Parallel Benchmarks, https://www.nas.nasa.gov/publications/npb.html, last accessed 2020/11/25.
13. Seo, S., Jo, G., Lee, J.: Performance Characterization of the NAS Parallel Benchmarks in OpenCL. In: 2011 IEEE International Symposium on Workload Characterization (IISWC), pp. 137–148 (2011).
14. Kataev, N., Smirnov, A., Zhukov, A.: Dynamic data-dependence analysis in SAPFOR. In: CEUR Workshop Proceedings, 2543, 199–208 (2020). http://ceur-ws.org/Vol-2543/rpaper18.pdf.
15. OpenMP Application Programming Interface, https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5-1.pdf, last accessed 2020/11/25.
16. Kataev, N., Kolganov, A.: The experience of using DVM and SAPFOR systems in semi-automatic parallelization of an application for 3D modeling in geophysics. The Journal of Supercomputing, 75, 7833–7843 (2019). https://doi.org/10.1007/s11227-018-2551-y.
17. Kataev, N.: Application of the LLVM Compiler Infrastructure to the Program Analysis in SAPFOR. In: Voevodin, V., Sobolev, S. (eds) Supercomputing. RuSCDays 2018. Communications in Computer and Information Science, vol. 965, pp. 487–499. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-05807-4_41.
18. Aho, A.V., Lam, M.S., Sethi, R., Ullman, J.D.: Compilers: Principles, Techniques, and Tools, 2nd edition. Addison-Wesley (2006), Chapter 9.
19. Visual Studio Code, https://code.visualstudio.com/, last accessed 2020/11/25.