Optimization of Parallel Software Tuning with Statistical Modeling and Machine Learning

Anatoliy Doroshenko, Pavlo Ivanenko, Oleksandr Novak, Olena Yatsenko

Institute of Software Systems of National Academy of Sciences of Ukraine, Glushkov prosp. 40, 03187 Kyiv, Ukraine
doroshenkoanatoliy2@gmail.com, paiv@ukr.net, oayat@ukr.net

Abstract. High-performance computation is the main goal of parallel computers, but the performance of compiled code is often far from the best achievable. Parallel program auto-tuning is a method of adjusting structural parameters (mainly, data structures) of an application program to a target hardware platform to speed up computation as much as possible. In previous work, the authors developed a framework intended to automate the generation of an auto-tuner from application source code. However, auto-tuning of complex and nontrivial parallel systems is usually time-consuming due to the empirical evaluation of a huge number of parameter value combinations of the initial parallel program in a target environment. In this paper, we propose to improve the auto-tuning method with statistical modeling and neural network algorithms, which significantly reduce the space of possible parameter combinations. The resulting optimization is illustrated by the example of tuning a parallel sorting program that combines several sorting methods. The optimization is achieved by automatically training a neural network model on the results of “traditional” tuning cycles and subsequently replacing some auto-tuner calls with an evaluation from the statistical model.

Keywords. Auto-tuning, parallel computation, machine learning, neural network, statistical modeling.

1 Introduction

The problem of optimal use of computing resources has always been important throughout the development of any software, from mobile applications to complex client-server systems. The auto-tuning paradigm [1, 2], which has become a standard for software application optimization over the last decade, allows this process to be fully automated for any computing environment. Its popularity stems primarily from its simplicity of use and its independence from the qualitative characteristics of the computer and operating system. Auto-tuning traditionally uses empirical data to obtain a qualitative evaluation of the optimized code (quality usually refers to program execution time and the accuracy of output results). It automates the search for the optimal program version among a set of provided possibilities by running each candidate and measuring its performance on a given parallel architecture. Its main benefit is a high level of abstraction: a program is optimized without explicit knowledge of hardware implementation details, such as the number of cores, cache size or memory access speed at various levels. Instead, it uses subject domain concepts such as the number and size of independent tasks.

In previous works [3–6], we have developed a theory, methodology and tools for automated program design, synthesis, and auto-tuning, based on Glushkov's systems of algorithmic algebras (SAA) and the term rewriting technique. The model for parallel program optimization and the auto-tuning framework named TuningGenie, aimed at automating the adjustment of programs to a target platform, were proposed in [6]. The framework works with the source code of parallel software and performs source-to-source transformations using the facilities of the rule-based rewriting system TermWare [3].
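To make the idea concrete, the following minimal Python sketch shows what such an empirical search loop amounts to: every candidate configuration is run and timed, and the fastest one wins. The function run_candidate and the parameter grid are illustrative placeholders, not part of any framework discussed below.

    import itertools
    import time

    def run_candidate(threads, block_size):
        # Stand-in for building and executing one version of the tuned
        # program; a real tuner would launch the actual application here.
        data = list(range(block_size, 0, -1))
        start = time.perf_counter()
        sorted(data)  # dummy workload whose cost depends on block_size
        return time.perf_counter() - start

    # Empirical auto-tuning: try every parameter combination and measure it.
    search_space = itertools.product([1, 2, 4, 8],             # threads
                                     [1000, 10000, 100000])    # block size
    best = min(search_space, key=lambda cfg: run_candidate(*cfg))
    print("fastest configuration (threads, block size):", best)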
The main drawback of the auto-tuning approach is the significant one-time cost of the optimization process: if the number of program versions is large enough, the optimization may run for many hours or even days. In this paper we propose a hybrid approach to auto-tuning that uses statistical modeling and machine learning techniques to reduce the time needed to search for an optimal program version. The approach consists in automatically training a neural network model on the results of common tuning cycles and subsequently replacing some auto-tuner calls with an evaluation from the statistical model.

2 Auto-Tuning Software Framework and Machine Learning

In [6], the TuningGenie framework for the automated generation of auto-tuner applications from source code was developed. The idea of an auto-tuner consists in the empirical evaluation of several versions of an input program and the selection of the best one, where the main evaluation criteria are lower execution time of the input program and the accuracy of the results obtained. The framework works with program source code using the expert knowledge of a developer and the automation facilities of the framework. A developer adds metadata (parameter names and value ranges) to the source code in the form of special comment pragmas. Exploiting such expert knowledge, (s)he can reduce the number of program versions to be evaluated and therefore increase optimization performance.

The auto-tuning software implementation is based on the rewriting rules system TermWare [3]. TermWare is an open-source implementation of a rewriting rules engine written in Java. It provides a language for describing rewriting rules that operate on data structures called terms, and a rule engine that interprets the rules to transform terms. TuningGenie uses TermWare to extract expert knowledge from program source code and generates a new program version on each tuning iteration. TermWare translates source code into a term and performs transformations according to the rewriting rules. The current TermWare version contains components for interaction with the Java and C# languages, and the current TuningGenie version supports Java programs.

The application of auto-tuning to complex and nontrivial program systems usually takes a lot of time due to the empirical estimation of a large number of parameter combinations of the input program in a target environment (let us denote the set of parameter combinations as C). In this paper we propose to optimize the auto-tuning method using statistical modeling and machine learning. The improvement consists in reducing the number of auto-tuner launches by building an approximation model which allows dismissing the parameter combinations that are unlikely to be fast. The model approximation often results in a reduction of the dimensionality of the input parameters of the set C, which means a significant speed-up of the auto-tuning process.

Generally, machine learning methods are based on the concept of learning some behavior from data [2, 7]. In the context of auto-tuning, the behavior to be learnt can be, for example, program performance at different settings of program parameters. A machine learning method first evaluates several alternatives within the search space for n different input programs P_1, ..., P_n, defined by configurations C_1, ..., C_n. The set of evaluated alternatives is called training data. The process of generating and evaluating the training data and learning behavior from this data is called training. Once the training is completed, and given a new version of a program P to be evaluated, the execution of P is replaced with an estimate obtained from the trained model.
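As a sketch of this replacement step, the fragment below fits a regression model to a handful of already-measured (configuration, execution time) pairs and then queries the model instead of launching the program. The numbers are synthetic, and a random forest is used purely for brevity; it stands in for whatever learner is chosen (the perceptron actually used in this work appears in Section 3).

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    # Training data: parameter combinations already measured by real runs.
    configs = np.array([[1, 1000], [2, 1000], [4, 5000],
                        [8, 5000], [4, 20000], [8, 20000]])  # [threads, block size]
    times_ms = np.array([4400.0, 2300.0, 1400.0, 950.0, 1300.0, 900.0])

    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(configs, times_ms)

    # A new configuration is now estimated instead of being executed.
    candidate = np.array([[8, 10000]])
    print("predicted time, ms:", model.predict(candidate)[0])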
Machine learning is closely linked to (and often overlaps with) computational statistics [8]. All statistical algorithms (including machine learning algorithms) require a significant amount of statistical data for analysis and model construction. In the context of auto-tuning tasks, the collection of such data can be a long process. Therefore, the problem of selecting algorithms that narrow the search space with a minimal number of real launches of an auto-tuner is very acute. As a partial solution to this problem, in this work we use a neural network for data extrapolation (see Section 3). In this case, a relatively small number of real launches is required to construct an approximate model, after which the neural network model can be used by other algorithms according to the black box principle.

3 A Case Study

In the design process, we follow the top-down formal transformational style provided by our automated toolkit for the design and synthesis of programs (IDS) [4, 5]. We begin with a high-level specification presented as a generalized scheme of the algorithm in the language of Glushkov's algorithmic algebras [4], which has the advantage of being human-friendly, and complement it with code in one of the parallel programming languages (Java or C++, in our case).

Below, we consider a case study of performance tuning by the example of a hybrid parallel sorting algorithm which applies a merge sort or an insertion sort depending on the block size (insertionSortThreshold) of the input numerical array. The initial SAA scheme of the algorithm contains the tuneAbleParam pragmas, which specify the search domain for the optimal values of the variables insertionSortThreshold and mergeSortBucketSize. The resulting algorithm is implemented in Java.

    "Parallel Hybrid Sorting (arr)" ====
      "Comment(tuneAbleParam name=insertionSortThreshold start=10 stop=200 step=10)";
      "Declare a variable (insertionSortThreshold) of type (int) with initial value (100)";
      "Comment(tuneAbleParam name=mergeSortBucketSize start=5000 stop=1000000 step=5000)";
      "Declare a variable (mergeSortBucketSize) of type (int) with initial value (5000)";
      IF 'Length of the array (arr) is less or equal to (insertionSortThreshold)'
      THEN "insertionSort(arr)"
      ELSE
        IF 'Length of the array (arr) is less or equal to (mergeSortBucketSize)'
        THEN "sequentialMergeSort(arr)"
        ELSE "concurrentMergeSort(arr)"
        END IF
      END IF
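For readers unfamiliar with the SAA notation, here is a minimal Python rendering of the decision logic of the scheme above (the actual implementation is in Java; the sort routines below are simplified sequential stand-ins, and concurrent_merge_sort only marks where the Java version forks parallel subtasks):

    def insertion_sort(arr):
        # In-place insertion sort: efficient for small, cache-resident blocks.
        for i in range(1, len(arr)):
            key, j = arr[i], i - 1
            while j >= 0 and arr[j] > key:
                arr[j + 1] = arr[j]
                j -= 1
            arr[j + 1] = key
        return arr

    def sequential_merge_sort(arr):
        if len(arr) <= 1:
            return arr
        mid = len(arr) // 2
        left = sequential_merge_sort(arr[:mid])
        right = sequential_merge_sort(arr[mid:])
        out = []
        while left and right:
            out.append(left.pop(0) if left[0] <= right[0] else right.pop(0))
        return out + left + right

    def concurrent_merge_sort(arr):
        # Stand-in: the Java version assigns the halves to parallel tasks.
        return sequential_merge_sort(arr)

    def hybrid_sort(arr, insertion_sort_threshold=100, merge_sort_bucket_size=5000):
        # Dispatch mirroring the IF/ELSE structure of the SAA scheme.
        if len(arr) <= insertion_sort_threshold:
            return insertion_sort(arr)
        if len(arr) <= merge_sort_bucket_size:
            return sequential_merge_sort(arr)
        return concurrent_merge_sort(arr)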
In the auto-tuning experiment, an array of 2×10^7 random integers was sorted. The auto-tuner parameters are C = {T_cn, T_s, T_h}, where T_cn is the number of parallel threads, T_s is a threshold for the block size to be sorted sequentially within the current thread (blocks with size > T_s are split into smaller blocks and assigned to different threads), and T_h is the block size at which insertion sort is used. The experiment was performed in the following environment: 2.7 GHz Intel Core i7 processor (6820HQ) with 4 cores and 8 MB L3 cache; 16 GB 2133 MHz RAM; 512 GB Apple SSD SM0512L; macOS 10.12.

In the first phase, the auto-tuner was executed without a statistical model to estimate how quick the tuned algorithm can be. In the second phase, statistical modeling was plugged in to understand how heavily the search space can be pruned while preserving the near-optimum performance of the tuned algorithm.

Let us look at the results of the first phase, given in Table 1. Three configurations are listed: slow (a "default" configuration that behaves almost as a classical sequential merge sort), optimal (the quickest one, automatically picked by the auto-tuner) and intuitive (values filled in by intuition with respect to the known hardware specifications and algorithm details).

Table 1. The results of the first auto-tuning phase (test data size: 20 000 000 integers).

    Configuration                        | slow                            | optimal | intuitive
    Parallelism level T_cn               | 1 (one thread)                  | 8       | 4
    Insertion sort threshold T_h         | 0 (do not switch to insertion sort at all) | 120 | 30 (the common notion is to set a couple dozen as a threshold for this trick)
    Threshold for sequential sorting T_s | 100 000 000 (bigger than the test data size, so no data decomposition is applied) | 50 000 | 10 000
    Average sorting time                 | 4432 ms                         | 898 ms  | 1426 ms

The optimal configuration is 4.93 times quicker than the slow one. This result is quite good for a 4-core processor and was achieved primarily by a combination of two factors: optimal usage of processor caches (by switching to in-place sorting for small data sets) and an efficient parallelization schema (merge sort is easy to parallelize with the "divide and conquer" method). The intuitive combination was 3.1 times faster than slow, also a decent result, but it was easy to guess due to the relative simplicity of the test algorithm; optimal configurations are usually not so obvious for real-life parallel programs. The optimal configuration is still substantially (by 58%) quicker than the intuitive one, so we can say that it was worth the time spent on tuning.

Now let us move to the second phase to see how the auto-tuner's search space can be reduced with the help of statistical analysis methods. The T_s parameter is excluded from the model during the primary analysis phase because of its minor impact on overall performance: once the number of subtasks after the decomposition of the input data is a couple of times bigger than the parallelism level, it makes almost no difference what value is used. This can be explained by the high effectiveness of Java's RecursiveAction [9] mechanism that was used in the implementation. RecursiveAction is a recursive ForkJoinTask, which is "a thread-like entity that is much lighter weight than a normal thread. Huge numbers of tasks and subtasks may be hosted by a small number of actual threads in a ForkJoinPool, at the price of some usage limitations" [10]. The experiment showed that the computational overhead of executing a new RecursiveAction is negligible.

The primary analysis of the data was performed in Python with the help of the Scikit-learn library [11]. Further analysis was implemented by means of R [12], a programming language for statistical computation, analysis and graphical representation of data. The experiment consisted of several stages: preparation and loading of the auto-tuner results into the R environment, data preparation (including normalization), building a neural network model on a training dataset and checking the model on a test dataset.

The data analysis process is shown in Fig. 1. At first, the auto-tuner performs N experiments and saves the result data to a separate file. This data is used to train the neural network. After the training, the neural network extrapolates the data and generates a new dataset, which is written into a separate file. In the end, both datasets are analyzed and compared by a human.

As a neural network, a multilayer perceptron with three input neurons, three hidden layers (20, 10 and 5 neurons per layer) and one output neuron was applied. The rectified linear function f(x) = max(0, x) was used as the activation function. Backward propagation of errors was used as the machine learning method, and the Broyden-Fletcher-Goldfarb-Shanno algorithm [13] was applied for the optimization of the weighting factors.
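In Scikit-learn terms, a perceptron of this shape can be specified roughly as follows. This is a minimal sketch under stated assumptions: the training set is a random placeholder for the normalized auto-tuner measurements, and Scikit-learn's 'lbfgs' solver is the limited-memory variant of BFGS rather than the full algorithm.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(3300, 3))  # 3 inputs per launch (placeholder values)
    y = rng.uniform(size=3300)       # placeholder for normalized execution times

    # Three hidden layers of 20, 10 and 5 neurons, ReLU activation,
    # quasi-Newton ((L-)BFGS) optimization of the weights.
    model = MLPRegressor(hidden_layer_sizes=(20, 10, 5),
                         activation='relu',
                         solver='lbfgs',
                         max_iter=2000)
    model.fit(X, y)

    print("estimated time for one configuration:", model.predict(X[:1])[0])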
Fig. 1. The process of analysis

The initial neural network was built based on the results of 3300 launches. Then it was used for further data generation. The use of the neural network for the initial approximation allowed us to reduce the search region by 58% (from 10^6 to 4.2×10^5 combinations). To estimate the quality of the obtained results, more than 30 000 real launches of the auto-tuner (evenly distributed over the combination set) were performed.

Figure 2 shows the dependence of the model accuracy Acc of 10 neural networks on the fraction of the sample data used for training.

Fig. 2. The dependence of the model accuracy Acc on the fraction of sample data used for model training

The evaluation of the accuracy Acc is based on a confusion matrix [14] and is calculated according to the formula Acc = (TP + TN) / (P + N), where TP is the number of true positives, TN is the number of true negatives, P is the number of real positive cases in the data, and N is the number of real negative cases.
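A tiny worked example of this formula (the labels are fabricated; in the auto-tuning setting, "positive" could mean, for instance, that a configuration is predicted to be fast enough to keep):

    def accuracy(y_true, y_pred):
        # Acc = (TP + TN) / (P + N), taken from the confusion matrix.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
        tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
        return (tp + tn) / len(y_true)  # P + N is the total number of cases

    y_true = [True, True, False, False, True, False]
    y_pred = [True, False, False, False, True, True]
    print("Acc =", accuracy(y_true, y_pred))  # 4 of 6 correct -> 0.666...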
4 Related Work

Many approaches have been proposed to the problem of auto-tuner development. Well-known examples of auto-tuners are ATLAS [15] and FFTW [16], which are specialized libraries introducing high-performance implementations of some specific functions. Unlike our TuningGenie framework, which provides domain-independent optimization, they are tied to a domain and a language. TuningGenie is quite similar to Atune-IL [17], a language extension for auto-tuning: it also uses pragmas and is not tied to a specific programming language. The main difference of TuningGenie is the term rewriting engine that is used for source code transformation. Representing program code as a term allows modifying the program structure in a declarative way. This feature significantly increases the capabilities of the auto-tuning framework.

There are also auto-tuners based on machine learning techniques [2]. In [18], the open-source self-tuning compiler Milepost GCC is described, which exploits machine learning to predict the optimal setting of compilation flags for a program compiled with GCC. In [19], neural networks are used to learn the behavior of a given program transformation (parametric loop tiling) for different values of an input parameter (the tile size); the model is then used to search for optimal parameter values. In [20], a machine learning approach is applied to the automatic optimization of task partitioning for OpenCL for different input problem sizes and different heterogeneous architectures consisting of CPUs and GPUs. In our work, we use neural networks to learn from the results of tuning cycles (program execution time at different values of internal program parameters) and subsequently replace some auto-tuner calls with an evaluation from the model.

Conclusion

In this paper, we explore a promising method of software auto-tuning improved by statistical modeling and neural networks. The method substantially mitigates the main weakness of the auto-tuning methodology: it significantly accelerates the search for an optimal program version by automatically training a neural network model on the results of regular tuning cycles and subsequently replacing some auto-tuner calls with an evaluation from the model. Furthermore, the use of a perceptron at the primary analysis stage helps to identify the most important input parameters (i.e. those which have the largest influence on the final result). The approach is illustrated by the example of performance tuning of a hybrid parallel sorting program and exploits the earlier developed TuningGenie framework. The results of the experiment confirmed the efficiency of the proposed approach and the usefulness of its further development, in particular, the use of more complex approximation functions and conducting experiments with more computationally and semantically complex programs.

References

1. Naono, K., Teranishi, K., Cavazos, J., Suda, R.: Software automatic tuning: from concepts to state-of-the-art results. Springer, Berlin (2010)
2. Durillo, J., Fahringer, T.: From single- to multi-objective auto-tuning of programs: advantages and implications. Scientific Programming, 22(4), 285–297 (2014)
3. Doroshenko, A., Shevchenko, R.: A rewriting framework for rule-based programming dynamic applications. Fundamenta Informaticae, 72(1–3), 95–108 (2006)
4. Andon, F.I., Doroshenko, A.Y., Tseytlin, G.O., Yatsenko, O.A.: Algebra-algorithmic models and methods of parallel programming. Akademperiodyka, Kyiv (2007) (in Russian)
5. Doroshenko, A., Zhereb, K., Yatsenko, O.: Developing and optimizing parallel programs with algebra-algorithmic and term rewriting tools. In: Ermolayev, V., Mayr, H.C., Nikitchenko, M., Spivakovsky, A., Zholtkevych, G. (eds.) ICTERI 2013. Communications in Computer and Information Science, vol. 412, pp. 70–92. Springer, Cham (2013)
6. Ivanenko, P., Doroshenko, A., Zhereb, K.: TuningGenie: auto-tuning framework based on rewriting rules. In: Ermolayev, V., Mayr, H., Nikitchenko, M., Spivakovsky, A., Zholtkevych, G. (eds.) ICTERI 2014. Communications in Computer and Information Science, vol. 469, pp. 139–158. Springer, Cham (2014)
7. Mitchell, T.M.: Machine learning. 1st edn. McGraw-Hill Education, New York (1997)
8. Givens, G.H., Hoeting, J.A.: Computational statistics. 2nd edn. Wiley, Chichester (2012)
9. Class RecursiveAction (Java SE 9 & JDK 9) – Oracle Help Center, https://docs.oracle.com/javase/9/docs/api/java/util/concurrent/RecursiveAction.html, last accessed 2017/12/20
10. Class ForkJoinTask (Java SE 9 & JDK 9) – Oracle Help Center, https://docs.oracle.com/javase/9/docs/api/java/util/concurrent/ForkJoinTask.html, last accessed 2017/12/20
11. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V. et al.: Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830 (2011)
12. Crawley, M.J.: The R book. 1st edn. Wiley, Chichester (2012)
13. Fletcher, R.: Practical methods of optimization. 2nd edn. Wiley, Chichester (2000)
14. Fawcett, T.: An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874 (2006)
15. Whaley, R., Petitet, A., Dongarra, J.J.: Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(1–2), 3–35 (2001)
16. Frigo, M., Johnson, S.: FFTW: an adaptive software architecture for the FFT. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'98), vol. 3, pp. 1381–1384 (1998)
17. Schaefer, C.A., Pankratius, V., Tichy, W.F.: Atune-IL: an instrumentation language for auto-tuning parallel applications. In: Sips, H., Epema, D., Lin, H.-X. (eds.) Euro-Par 2009. LNCS, vol. 5704, pp. 9–20. Springer, Berlin, Heidelberg (2009)
18. Fursin, G., Kashnikov, Y., Memon, A.W., Chamski, Z. et al.: Milepost GCC: machine learning enabled self-tuning compiler. International Journal of Parallel Programming, 39(3), 296–327 (2011)
19. Rahman, M., Pouchet, L.-N., Sadayappan, P.: Neural network assisted tile size selection. In: 5th International Workshop on Automatic Performance Tuning (IWAPT 2010). Springer, Berkeley, CA (2010)
20. Kofler, K., Grasso, I., Cosenza, B., Fahringer, T.: An automatic input-sensitive approach for heterogeneous task partitioning. In: 27th ACM International Conference on Supercomputing (ICS'13), pp. 149–160. ACM, New York (2013)