<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Algorithm of Automatic Parallelization of Generalized Matrix Multiplication</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elena N. Akimova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roman A. Gareev</string-name>
          <email>gareev.roman@urfu.ru</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Krasovskii Institute of Mathematics and Mechanics, Ural Branch of RAS</institution>
          ,
          <addr-line>Ekaterinburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Ural Federal University</institution>
          ,
          <addr-line>Ekaterinburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Parallelization of generalized matrix-matrix multiplication is crucial for achieving the high performance required in many situations. Parallelization performed by contemporary compilers is not sufficient to replace expert-tuned multi-threaded implementations or to come close to their performance. All competitive solutions require previously optimized external implementations, which may not be available for a given data type and hardware architecture. In this paper, we introduce an automatic compiler transformation that requires neither external code nor automatic tuning to attain more than 85% of the performance of an optimized BLAS library. Our optimization shows competitive performance across various hardware architectures and for different forms of generalized matrix-matrix multiplication. We believe that the availability of automatically generated multi-threaded implementations of generalized matrix-matrix multiplication can save time when optimized libraries are not available.</p>
      </abstract>
      <kwd-group>
        <kwd>high-performance</kwd>
        <kwd>multicore</kwd>
        <kwd>compilers</kwd>
        <kwd>mathematical software</kwd>
        <kwd>computations on matrices</kwd>
        <kwd>linear algebra algorithms</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Let A, B, and C be appropriately sized matrices containing elements in a set S,
and let S and the operations ⊕ and ⊗ be contained in a closed semiring [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Then
the formula C = β ⊗ C ⊕ α ⊗ A ⊗ B, where ⊕ and ⊗ are operations from the
corresponding matrix semiring, and α and β are constants that are not equal to
zero, is a generalized matrix multiplication or matrix multiply-and-add operation
(MMA) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
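      <p>Concretely, a minimal C sketch of MMA for a user-supplied semiring may look as follows (our illustration, not code from the paper; oplus and otimes stand for ⊕ and ⊗):</p>
      <p>typedef double (*op)(double, double);

/* Elementwise form of MMA: c_ij = (beta ⊗ c_ij) ⊕ ⊕_p (alpha ⊗ a_ip ⊗ b_pj),
   for a semiring given by the operations oplus and otimes. */
void mma(int M, int N, int K, double alpha, double beta,
         const double *A, const double *B, double *C,
         op oplus, op otimes) {
  for (int i = 0; i &lt; M; i++)
    for (int j = 0; j &lt; N; j++) {
      double acc = otimes(beta, C[i * N + j]);
      for (int p = 0; p &lt; K; p++)
        acc = oplus(acc, otimes(alpha, otimes(A[i * K + p], B[p * N + j])));
      C[i * N + j] = acc;
    }
}</p>
      <p>With oplus = + and otimes = × this computes general matrix-matrix multiplication; with oplus = min and otimes = + it computes the distance product used in algebraic path problems.</p>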
      <p>
        MMA is important in many situations. For instance, it can be used to solve
algebraic path problems (APPs) such as finding shortest connecting paths, finding
the least and the most reliable paths, finding paths with maximum capacity, or
finding paths with maximum cost. Examples of their usage are finding directions
between physical locations, such as driving directions on web mapping websites
like MapQuest or Google Maps [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], as well as applications studied in operations research,
robotics, plant and facility layout, transportation, and VLSI design [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. There are
various applications of general matrix-matrix multiplication, i.e. MMA with ⊕ = +
and ⊗ = ×. General matrix-matrix multiplication can be used for encoding a
database before it goes for data mining in a centralized environment [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. It can
also be used during the computation in the convolution layers of convolutional
neural networks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] applied in machine learning. Another example of its
application is solving the inverse structural gravity problem of mathematical physics
using the Levenberg-Marquardt algorithm [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Matrix-matrix multiplication can
be applied in the calculation of the RI-MP2 correlation energy of a molecule
studied by quantum chemistry [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. General matrix-matrix multiplication is typically
a building block for other matrix-matrix operations studied in high performance
computing [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        Contemporary compilers and compiler front ends (e.g., GCC [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], Clang [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
ICC [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], IBM XL [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]) cannot automatically transform a textbook-style
implementation of general matrix-matrix multiplication into code that comes close
to the performance of expert-tuned multi-threaded implementations of MMA.
Specialized libraries (e.g., Intel's MKL [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], BLIS [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], OpenBLAS [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]) provide
high-performance implementations of general matrix-matrix multiplication, but
they require previously optimized external code and
can only be used if an optimized implementation is available [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>
        This work presents a new compiler optimization for MMA based on the Polly
loop optimizer [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. It aims to close the performance gap between compilers
and expert-tuned multi-threaded libraries.
      </p>
      <p>Our contributions are:
– an automatic transformation for parallelizing MMA based on the
computational structure proposed by the BLIS framework (Section 4.3);
– a comparison of our approach to existing production compilers and
vendor-optimized BLAS libraries: we attain more than 85% of the performance of the
multi-threaded instances of general matrix-matrix multiplication that are
available in optimized BLAS libraries (Section 5).</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>
        In this section, we briefly describe the polyhedral model, i.e. a mathematical
framework for loop nest optimizations, which is used to focus on modeling and
optimization of the memory access behavior of a program [
        <xref ref-type="bibr" rid="ref19 ref20">19,20</xref>
        ].
      </p>
      <p>
        For Static Control Parts (SCoPs) [
        <xref ref-type="bibr" rid="ref21 ref22">21,22</xref>
        ], i.e. program regions that are
“sufficiently regular” to be modeled, each compute statement is described by three
components: iteration domain, scheduling function, and access relation. An
iteration domain is represented as a Z-Polytope [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] describing the dynamic instances
of the SCoP statement. A scheduling function is represented by a Z-Polytope
that relates dynamic statement instances to their execution time vectors defining
the execution order. An access relation is described using a Z-Polytope defining
the relation between the iteration vector and the accessed array subscripts.
      </p>
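      <p>As an illustration (a hypothetical example of ours, not one of the paper's kernels), consider a minimal SCoP and the polyhedral description of its single statement, written in the set notation of isl, the integer set library used by Polly:</p>
      <p>/* A SCoP: loop bounds and subscripts are affine in i, j and the parameter N. */
for (i = 0; i &lt; N; i++)
  for (j = 0; j &lt;= i; j++)
S:  A[i][j] = B[j][i] + 1.0;

/* Iteration domain: [N] -&gt; { S[i, j] : 0 &lt;= i &lt; N and 0 &lt;= j &lt;= i }
   Schedule:         { S[i, j] -&gt; [i, j] }   (original execution order)
   Access (write):   { S[i, j] -&gt; A[i, j] }
   Access (read):    { S[i, j] -&gt; B[j, i] }  */</p>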
    </sec>
    <sec id="sec-3">
      <title>Automatic Detection of Kernel</title>
      <p>
        In this paper, we consider the optimization of MMA-like kernels as introduced in Polly [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
They have the following definition:
Definition 1. A generalized matrix multiplication-like kernel (MMA-like
kernel) is a perfectly nested loop nest such that
– it satisfies the requirements of the polyhedral model;
– without loss of generality, it contains three one-dimensional for loops Loop
1, Loop 2, and Loop 3 with induction variables i, j, and p, respectively, that
are incremented by one;
– the innermost loop body is representable as a statement of the form C[i][j]
= E(A[i][p], B[p][j], C[i][j]), where A[i][p], B[p][j], C[i][j] are accesses to
matrices A, B, C, respectively, and E is an expression that contains reads
from the matrices A, B, C and an arbitrary number of reads from constants
with respect to Loop 1, Loop 2, and Loop 3.
      </p>
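      <p>For illustration (a sketch of ours, not code from Polly), the following loop nest is an MMA-like kernel over the semiring with ⊕ = min and ⊗ = +; it performs one step of the distance product used for shortest-path APPs, with E(a, b, c) = min(c, a + b):</p>
      <p>/* Satisfies Definition 1: three loops with induction variables i, j, p
   incremented by one, and an innermost body of the form
   C[i][j] = E(A[i][p], B[p][j], C[i][j]). */
for (int i = 0; i &lt; M; i++)
  for (int j = 0; j &lt; N; j++)
    for (int p = 0; p &lt; K; p++)
      C[i][j] = C[i][j] &lt; A[i][p] + B[p][j] ? C[i][j] : A[i][p] + B[p][j];</p>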
      <p>To detect and subsequently optimize the SCoP statements that implement
MMA-like kernels, we use the algorithm based on the analysis of
memory accesses and memory dependencies implemented in Polly. Its description is
beyond the scope of this paper. MMA is a particular case of an MMA-like kernel.
We consider how it affects the optimization in Section 4.3.</p>
    </sec>
    <sec id="sec-4">
      <title>Optimizing MMA-like kernel</title>
      <p>In this section, we present an algorithm for obtaining code structured
similarly to an expert-optimized multi-threaded general matrix-matrix multiplication.
First, we consider the expert-designed multi-threaded general matrix-matrix
multiplication and then discuss our optimization.</p>
      <sec id="sec-4-1">
        <title>The expert implementation of general matrix-matrix multiplication</title>
        <p>
          An example of an expert implementation of the general matrix-matrix
multiplication algorithm (i.e. an implementation tuned by dense linear algebra
experts) can be found in BLIS [
        <xref ref-type="bibr" rid="ref15">15</xref>
          ]. It consists of two packing routines and five
loops around the micro-kernel. The packing routines copy the data of
matrices A and B (the operands of the MMA-like kernel) into newly created
arrays Ac and Bc, respectively, to ensure that their elements are aligned to cache
line boundaries, preloaded into certain cache levels, and read with in-stride accesses [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ].
The micro-kernel is a loop around an outer product that can be implemented in
assembly. The micro-kernel and two surrounding loops form the macro-kernel.
Pseudo-code of the described implementation can be found in Listing 1.
      </p>
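      <p>A C sketch of this structure (ours, with placeholder blocking parameter values; M, N, K are assumed to be multiples of the corresponding parameters, and the packing is simplified to plain row-major copies):</p>
      <p>/* Two packing routines and five loops around the micro-kernel; the
   micro-kernel plus the two loops around it (jr, ir) form the macro-kernel. */
enum { MC = 96, NC = 1024, KC = 256, MR = 4, NR = 8 }; /* placeholder values */

void gemm(int M, int N, int K, const double *A, const double *B, double *C) {
  static double Ac[MC * KC], Bc[KC * NC];
  for (int jc = 0; jc &lt; N; jc += NC)          /* Loop 5 */
    for (int pc = 0; pc &lt; K; pc += KC) {      /* Loop 4 */
      for (int p = 0; p &lt; KC; p++)            /* pack a KC x NC block of B */
        for (int j = 0; j &lt; NC; j++)
          Bc[p * NC + j] = B[(pc + p) * N + jc + j];
      for (int ic = 0; ic &lt; M; ic += MC) {    /* Loop 3 */
        for (int i = 0; i &lt; MC; i++)          /* pack an MC x KC block of A */
          for (int p = 0; p &lt; KC; p++)
            Ac[i * KC + p] = A[(ic + i) * K + pc + p];
        for (int jr = 0; jr &lt; NC; jr += NR)   /* Loop 2 (macro-kernel) */
          for (int ir = 0; ir &lt; MC; ir += MR) /* Loop 1 (macro-kernel) */
            /* micro-kernel: a loop around an MR x NR outer product;
               expert implementations write it in assembly */
            for (int p = 0; p &lt; KC; p++)
              for (int i = 0; i &lt; MR; i++)
                for (int j = 0; j &lt; NR; j++)
                  C[(ic + ir + i) * N + jc + jr + j] +=
                      Ac[(ir + i) * KC + p] * Bc[p * NC + jr + j];
      }
    }
}</p>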
      <p>
        Determination of the Mc, Nc, Kc, Mr, and Nr parameters of the described
implementation can be done manually or using an analytical model [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. We use
an analytical model for determining the parameters of single-threaded
MMA-like kernels presented in Polly [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Determination of optimal values for a
multi-threaded implementation is only partially considered in [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] and can be a direction
for future work.
      </p>
      </sec>
      <sec id="sec-4-2">
        <title>Expert implementation of multi-threaded general matrix-matrix multiplication</title>
        <p>
          The implementation of general matrix-matrix multiplication in BLIS consists of
five loops that can be parallelized. To find the loop that should be parallelized to
obtain the best performance gain, the parameters of the target architecture (e.g.,
the availability of the L3 cache, the ability to share the L2 and L3 caches between
threads, and support for cache coherency) along with the values of the parameters
of the implementation (i.e., M, Mc, Mr, N, Nc, Nr) need to be considered [
        <xref ref-type="bibr" rid="ref25">25</xref>
          ]. In
this paper, we discuss the automatic parallelization of the second loop around
the micro-kernel (i.e., the loop indexed by jc). When the ratio of Nc to Nr is large,
which is usually the case, this loop offers good opportunities for parallelism [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ], helping
to amortize the cost of transferring the block of Ac from main memory
into the L2 cache and to reduce the execution time of general matrix-matrix
multiplication.
      </p>
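        <p>In terms of the macro-kernel loops in the sketch of Section 4.1 (where the second loop around the micro-kernel is named jr), its OpenMP parallelization can be sketched as follows (our illustration):</p>
        <p>/* Threads share the packed block Ac residing in the L2 cache and split the
   packed block Bc into NR-wide column panels, so the cost of transferring
   Ac from main memory is amortized over all threads; each thread writes a
   disjoint set of columns of C. */
#pragma omp parallel for
for (int jr = 0; jr &lt; NC; jr += NR)      /* second loop around the micro-kernel */
  for (int ir = 0; ir &lt; MC; ir += MR)    /* first loop around the micro-kernel */
    for (int p = 0; p &lt; KC; p++)         /* micro-kernel */
      for (int i = 0; i &lt; MR; i++)
        for (int j = 0; j &lt; NR; j++)
          C[(ic + ir + i) * N + jc + jr + j] +=
              Ac[(ir + i) * KC + p] * Bc[p * NC + jr + j];</p>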
      </sec>
      <sec id="sec-4-3">
        <title>An automatic parallelization of the MMA-like kernel</title>
        <p>In this subsection, we consider the algorithm for automatic parallelization
implemented in Polly along with the modifications that help to apply it in the case
of MMA-like kernels and, in particular, general matrix-matrix multiplication.</p>
      <p>
        Polly can detect the outermost loop of the loop nest that can be executed in
parallel to generate the OpenMP [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] code and to take advantage of shared
memory parallelism in a SCoP [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. To do this, Polly uses the dependence analysis [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]
to discover data dependencies among program statements. A data dependency is
a situation in which a program statement refers to the data of a preceding
statement [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. In the case of Polly, data dependencies between SCoP
statements are considered.
      </p>
      <p>For example, let us consider the matrix-matrix multiplication of the form
C += A × B, where the sizes of the matrices A, B, and C are M × K, K × N, and
M × N, respectively. Listing 2 contains an example of a program that implements
the described matrix-matrix multiplication. In this case, we have exactly one true
dependency, one anti-dependency, and one output dependency. The
dependencies have the form S(i, j, p) → S(i, j, p + 1).</p>
      <p>for (i = 0; i &lt; M; i++)
  for (j = 0; j &lt; N; j++)
    for (p = 0; p &lt; K; p++)
S:    C[i][j] += A[i][p] * B[p][j];</p>
      <p>Listing 2: Example of matrix-matrix multiplication</p>
      <p>We apply Polly to get the code, which is a generalization of the expert
implementation of the general matrix-matrix multiplication in terms of the outer
product defined in the semiring representing the multiplication. Subsequently,
we add two output dependencies and two flow dependencies introduced by the
packing routines (Section 4.1). The output dependencies have the following form:
Copy0(j, p, i, jc, ic, pc) → Copy0(j + Nc, p + Kc, i, jc, ic, pc), Copy1(j, p, i, jc, ic, pc)
→ Copy1(j, p + Kc, i + Mc, jc, ic, pc). The flow dependencies have the
following form: Copy0(j, p, i, jc, ic, pc) → S(j, p, i, jc, ic, pc), Copy1(j, p, i, jc, ic, pc) →
S(j, p, i, jc, ic, pc). Since jc becomes the outermost parallel loop, it is
automatically parallelized by Polly.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experimental Results</title>
      <p>We compare the performance of the code generated by the presented optimization
to the multi-threaded instances of MMA. The experimental setup is presented
in Table 1 and Table 2. The throughput and latency of the processor floating-point
arithmetic units are denoted by NVFMA and LVFMA, respectively. The sizes of the
first two cache levels of all CPUs under evaluation are 32 Kbytes and 256 Kbytes,
respectively. Their associativity degrees are equal to 8. The results of the measurements
considered in this section are the corresponding arithmetic means, collected until
the 95% confidence intervals were within 10% of the reported means.</p>
      <p>Additional options:
polly (original): -O3 -march=native -mllvm -polly
  -mllvm -polly-parallel -lgomp
polly (opt): -O3 -march=native -mllvm -polly
  -mllvm -polly-target-throughput-vector-fma=NVFMA
  -mllvm -polly-target-latency-vector-fma=LVFMA
  -mllvm -polly-target-1st-cache-level-associativity=WL1
  -mllvm -polly-target-2nd-cache-level-associativity=WL2
  -mllvm -polly-target-1st-cache-level-size=SL1
  -mllvm -polly-target-2nd-cache-level-size=SL2
  -ffp-contract=fast -mllvm -polly-parallel -lgomp
gcc: -O3 -march=native -ftree-parallelize-loops=n
icc: -O3 -march=native -parallel
xlc: -O3 -qarch=auto -qtune=auto -qsmp=auto</p>
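      <p>For example, a complete command line for the polly (opt) configuration may look as follows (a sketch; the FMA throughput and latency values are placeholders, while the cache parameters match the L1/L2 sizes and associativity degrees given above):</p>
      <p>clang -O3 -march=native -mllvm -polly \
  -mllvm -polly-target-throughput-vector-fma=1 \
  -mllvm -polly-target-latency-vector-fma=8 \
  -mllvm -polly-target-1st-cache-level-associativity=8 \
  -mllvm -polly-target-2nd-cache-level-associativity=8 \
  -mllvm -polly-target-1st-cache-level-size=32768 \
  -mllvm -polly-target-2nd-cache-level-size=262144 \
  -ffp-contract=fast -mllvm -polly-parallel -lgomp gemm.c -o gemm</p>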
      <p>
        We consider the matrix-matrix multiplication of the form
C = β × C + α × A × B implemented in the Polybench 3.2 benchmark suite [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] along
with the maximum reliability path problem. We evaluate the matrix-matrix
multiplication for four data sets defined in the Polybench 3.2 benchmark suite:
small, standard, large, and extra large. (Values of hardware parameters that are
not publicly available in the vendors' instruction set/optimization manuals were
determined empirically and can differ from the real values.)
      </p>
      <p>[Figure 1: Performance (GFlops/sec) for the small and large data sets on the evaluated platforms (including Intel SandyBridge and ARM); series: theoretical peak, polly (original), polly (opt), gcc, icc, Intel MKL, BLIS, and OpenBLAS.]</p>
        <p>
          Figure 1 presents the results of performance evaluation of the
implementation of the matrix-matrix multiplication provided by PolyBench 3.2 benchmark
suite [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] for matrices that contain elements of type double. The maximum
number of threads in the OpenMP parallel region is equal to the number of cores
specified in Table 1. In the case of Intel SandyBridge with different numbers of
threads, the extra large data set is evaluated with different values of the maximum
number of threads in the OpenMP parallel region. We can conclude that
the optimization attains more than 85% of performance of the multi-threaded
instances of the matrix-matrix multiplication that are available in the optimized
BLAS libraries. Figure 1 shows that additional optimizations of the generated
assembly code and, in particular, selection of an appropriate ABI may be required
to attain high performance in the case of IBM Power 7.
        </p>
        <p>Figure 2 presents the results of the performance evaluation for the maximum
reliability path problem. We can conclude that the optimization attains more
than 66% of the theoretical peak performance.</p>
        <p>[Figure 2: Performance (GFlops/sec) for the maximum reliability path problem on the small and large data sets.]</p>
    </sec>
    <sec id="sec-related">
      <title>Related Work</title>
      <p>In this section, we describe the work related to the automatic parallelization of MMA
and how our algorithm contributes.</p>
        <p>
          Automatic parallelization can be performed by compilers (e.g., ICC [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], IBM
XL [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], GCC [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]) to translate a serial program into multithreaded code.
Automatic parallelization attempts to justify the parallelization effort for loops,
performs data-flow analysis (to determine whether each iteration of the loop
can be executed independently of the others), and uses different technologies (e.g.,
OpenMP [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]) to take advantage of it. Such an approach can suffer from a lack of
domain-specific knowledge about algorithms (e.g., information about the use of
the cache memory and SIMD registers [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]). Moreover, the extra overhead
associated with using multiple processors can negate the speedup of parallelized
code.
        </p>
        <p>
          To obtain highly optimized multithreaded implementations, autotuning can
be used. Automatically Tuned Linear Algebra Software (ATLAS) [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] introduces
autotuning to empirically determine optimal parameters of BLAS routines (e.g.,
blocking and unrolling factors). It can be difficult to apply this approach in
production compilers since autotuning can consume a lot of time.
        </p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and future directions of work</title>
      <p>This work presents a new compiler optimization that helps to automatically obtain
highly optimized multi-threaded instances of MMA without external code.
Our optimization is based on the automatic optimization of MMA implemented
in Polly along with approaches used in expert multi-threaded implementations.
This allows one to provide domain-specific knowledge and to reuse the automatic
parallelization implemented in Polly.</p>
      <p>
        In this paper, we compare the execution time of the code produced by the
optimization of MMA with the code produced using contemporary compilers
(GCC [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], ICC [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], IBM XL [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]), and Clang [
        <xref ref-type="bibr" rid="ref11">11</xref>
          ], the compiler front end, as
well as with the multi-threaded instances that are available in Intel's MKL [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], BLIS [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ],
and OpenBLAS [
        <xref ref-type="bibr" rid="ref16">16</xref>
          ]. In the case of GEMM, we attain more than 85% of the performance of
vendor-optimized BLAS libraries. In the case of APPs, we attain more
than 66% of the theoretical peak performance.
      </p>
      <p>We believe that higher performance can be achieved with better automatic
vectorization used by Polly to produce single-threaded MMA. Determination of
optimal values for the analytical model of a multithreaded implementation can
be another direction for improving the presented algorithm.</p>
      <p>Acknowledgment. This work was partly performed at the Center of Excellence “Geoinformation technologies and geophysical data complex interpretation” of the Ural Federal University.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>D. J.:</given-names>
          </string-name>
          <article-title>Algebraic structures for transitive closure</article-title>
          .
          <source>Theoretical Computer Science</source>
          .
          <volume>4</volume>
          (
          <issue>1</issue>
          ),
          <volume>59</volume>
          –
          <fpage>76</fpage>
          (
          <year>1977</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Sedukhin</surname>
            ,
            <given-names>S. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paprzycki</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Generalizing Matrix Multiplication for E cient Computations on Modern Computers</article-title>
          .
          <source>Proceedings of the 9th International Conference on Parallel Processing and Applied Mathematics Volume Part I. PPAM'11</source>
          , Springer-Verlag, Berlin, Heidelberg.
          <volume>225</volume>
          –
          <issue>234</issue>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Sanders</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Fast route planning</article-title>
          .
          <source>Google Tech Talk</source>
          . (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D. Z.</given-names>
          </string-name>
          :
          <article-title>Developing algorithms and software for geometric path planning problems</article-title>
          .
          <source>ACM Comput. Surv</source>
          .
          <volume>28</volume>
          (
          <issue>4</issue>
          ) (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Jayachandran</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Venkatachalam</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>A Secure Scheme for Privacy Preserving Data Mining Using Matrix Encoding</article-title>
          .
          <source>World Engineering &amp; Applied Sciences Journal. 7</source>
          (
          <issue>3</issue>
          ),
          <volume>190</volume>
          –
          <fpage>193</fpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Cong</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Minimizing Computation in Convolutional Neural Networks</article-title>
          . Springer International Publishing, Cham.
          <volume>281</volume>
          –
          <issue>290</issue>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Akimova</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Skurydina</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>On solving the three-dimensional structural gravity problem for the case of a multilayered medium by the componentwise Levenberg-Marquardt method</article-title>
          .
          <source>In: 15th EAGE International Conference on Geoinformatics - Theoretical and Applied Aspects. EAGE</source>
          (
          <year>2016</year>
          ). http://earthdoc.eage.org/publication/publicationdetails/?publication=
          <fpage>84606</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Watson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olivares-Amaya</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Edgar</surname>
            ,
            <given-names>R. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aspuru-Guzik</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Accelerating correlated quantum chemistry calculations using graphical processing units</article-title>
          .
          <source>Computing in Science Engineering</source>
          .
          <volume>12</volume>
          (
          <issue>4</issue>
          ),
          <volume>40</volume>
          –
          <fpage>51</fpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Goto</surname>
          </string-name>
          , K., van de Geijn, R. A.:
          <article-title>Anatomy of high-performance matrix multiplication</article-title>
          .
          <source>ACM Trans. Math. Softw</source>
          .
          <volume>34</volume>
          (
          <issue>3</issue>
          ),
          <volume>1</volume>
          –
          <fpage>25</fpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Stallman</surname>
          </string-name>
          , R.:
          <article-title>Using and Porting the GNU Compiler Collection: For Gcc-2.95</article-title>
          . Free Software Foundation
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <article-title>A C language family frontend for LLVM</article-title>
          . https://clang.llvm.org/
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <article-title>Intel C++ Compiler 16.0 Update 4 User and Reference Guide</article-title>
          . https://software.intel.com/en-us/intel-cplusplus-compiler-16.0-user-and-referenceguide-pdf
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>13. XL C/C++: Compiler Reference - IBM. http://www-01.ibm.com/support/docview.wss</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. Intel Math Kernel Library (Intel MKL). https://software.intel.com/ru-ru/intelmkl/?cid=sem43700011401059448&amp;intel term=intel+mkl
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Van Zee</surname>
            ,
            <given-names>F. G.</given-names>
          </string-name>
          , van de Geijn, R. A.:
          <article-title>BLIS: A Framework for Rapidly Instantiating BLAS Functionality</article-title>
          .
          <source>ACM Trans. Math. Softw</source>
          .
          <volume>41</volume>
          (
          <issue>3</issue>
          ),
          <volume>1</volume>
          –
          <fpage>33</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Xianyi</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qian</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yunquan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Model-driven level 3 BLAS performance optimization on Loongson 3A processor</article-title>
          .
          <source>Parallel and Distributed Systems (ICPADS)</source>
          ,
          <source>2012 IEEE 18th International Conference on. 684–691</source>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Menon</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pingali</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>High-level semantic optimization of numerical codes</article-title>
          .
          <source>Proceedings of the 13th International Conference on Supercomputing. ICS '99</source>
          ,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York, NY, USA.
          <volume>434</volume>
          –
          <issue>443</issue>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Grosser</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aloor</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simbürger</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Größlinger</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pouchet</surname>
            ,
            <given-names>L. N.</given-names>
          </string-name>
          :
          <source>Polly – Polyhedral Optimization in LLVM. 1st International Workshop on Polyhedral Compilation Techniques (IMPACT)</source>
          . Chamonix, France. (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <article-title>Polyhedral Compilation without Polyhedra</article-title>
          . http://polycomp.gforge.inria.fr
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Feautrier</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lengauer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          : Polyhedron Model. Springer US, Boston, MA.
          <volume>1581</volume>
          –
          <issue>1592</issue>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Girbal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vasilache</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bastoul</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parello</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sigler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Temam</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Semi-automatic Composition of Loop Transformations for Deep Parallelism and Memory Hierarchies</article-title>
          .
          <source>Int. J. Parallel Program</source>
          .
          <volume>34</volume>
          (
          <issue>3</issue>
          ),
          <volume>261</volume>
          –
          <fpage>317</fpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Bondhugula</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hartono</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramanujam</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sadayappan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>A Practical Automatic Polyhedral Parallelizer and Locality Optimizer</article-title>
          .
          <source>SIGPLAN Not</source>
          .
          <volume>43</volume>
          (
          <issue>6</issue>
          ),
          <volume>101</volume>
          –
          <fpage>113</fpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Loechner</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilde</surname>
            ,
            <given-names>D. K.</given-names>
          </string-name>
          :
          <article-title>Parameterized polyhedra and their vertices</article-title>
          .
          <source>International Journal of Parallel Programming</source>
          .
          <volume>25</volume>
          (
          <issue>6</issue>
          ),
          <volume>525</volume>
          –
          <fpage>549</fpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Low</surname>
            ,
            <given-names>T. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Igual</surname>
            ,
            <given-names>F. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>T. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quintana-Orti</surname>
            ,
            <given-names>E. S.</given-names>
          </string-name>
          :
          <article-title>Analytical Modeling Is Enough for High-Performance BLIS</article-title>
          .
          <source>ACM Trans. Math. Softw</source>
          .
          <volume>43</volume>
          (
          <issue>2</issue>
          ),
          <volume>1</volume>
          –
          <fpage>18</fpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Smith</surname>
          </string-name>
          , T. M., van de Geijn, R.,
          <string-name>
            <surname>Smelyanskiy</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hammond</surname>
            ,
            <given-names>J. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Zee</surname>
            ,
            <given-names>F. G.</given-names>
          </string-name>
          :
          <article-title>Anatomy of High-Performance Many-Threaded Matrix Multiplication</article-title>
          .
          <source>Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium. IPDPS '14</source>
          , IEEE Computer Society, Washington, DC, USA.
          <volume>1049</volume>
          –
          <issue>1059</issue>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <article-title>OpenMP Application Programming Interface</article-title>
          . http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Aho</surname>
            ,
            <given-names>A. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lam</surname>
            ,
            <given-names>M. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sethi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ullman</surname>
            ,
            <given-names>J. D.</given-names>
          </string-name>
          : Compilers: Principles, Techniques, and Tools (2nd Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA. (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <article-title>PolyBench/C the Polyhedral Benchmark suite</article-title>
          . http://web.cse.ohio-state.edu/~pouchet/software/polybench/
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Whaley</surname>
            ,
            <given-names>R. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Petitet</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dongarra</surname>
            ,
            <given-names>J. J.</given-names>
          </string-name>
          :
          <source>Automated empirical optimizations of software and the ATLAS project. Parallel Computing</source>
          .
          <volume>27</volume>
          (
          <issue>1</issue>
          ),
          <volume>3</volume>
          –
          <fpage>35</fpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>