<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A SIMD-based Approach to the Enhancement of Convolution Operation Performance</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Bogolyubov Institute for Theoretical Physics</institution>
          ,
          <addr-line>Kyiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Aviation University</institution>
          ,
          <addr-line>Kyiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Optimization of two-dimensional convolution by means of 16-bit SIMD technologies (ARM NEON) is considered. It is shown that by utilizing 16-bit SIMD NEON and inline assembler one can achieve a significant increase in performance compared to the analogous functions of the OpenCV and ARM Compute (ACL) libraries. Throughout the research, filter coefficients were quantized to match the 8-bit range.</p>
      </abstract>
      <kwd-group>
        <kwd>Convolution</kwd>
        <kwd>vectorization</kwd>
        <kwd>SIMD</kwd>
        <kwd>optimization</kwd>
        <kwd>CPU</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>where i  a,,W  (r  a) 1, j  a,, H  (c  a) 1 are indexing pixels of the
destination image p ; W and H are width and height of the source P and
destination p images (we neglect border effects in the destination at the moment),  is the
kernel of the convolution (matrix r  c ), and a , a - so-called “anchors” that define
relative position of a filtered point within the kernel.</p>
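      <p>As an illustration of (1), the operation can be sketched as a plain scalar routine in C++ (a minimal sketch: the function name, the row-major layout, and copying the source into the untouched border pixels are our assumptions, not the paper's):</p>

```cpp
#include <cassert>
#include <vector>

// Scalar reference for the convolution (1): each destination pixel inside the
// region where the whole r x c window fits accumulates kernel * source values;
// (ar, ac) is the anchor, i.e. the position of the filtered point in the window.
std::vector<int> convolve2d(const std::vector<int>& src, int W, int H,
                            const std::vector<int>& kernel, int r, int c,
                            int ar, int ac) {
    std::vector<int> dst(src);  // border effects neglected: borders keep source values
    for (int j = ac; j <= H - c + ac; ++j) {       // rows where the window fits
        for (int i = ar; i <= W - r + ar; ++i) {   // columns where the window fits
            int acc = 0;
            for (int v = 0; v < c; ++v)
                for (int u = 0; u < r; ++u)
                    acc += kernel[v * r + u] * src[(j + v - ac) * W + (i + u - ar)];
            dst[j * W + i] = acc;
        }
    }
    return dst;
}
```

      <p>With a 3×3 identity kernel anchored at its center the routine reproduces the source, while a box kernel sums the whole window.</p>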
      <p>
        Equation (1) is rather general and perfectly compatible with the cv::filter2D(...)
function of the OpenCV library [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and the ARM Compute Library (ACL), but in what follows we will
give our attention to square kernels, i.e. c = r; thus from now on we presume the
kernel to be square without special mention.
      </p>
      <p>The good part is that every pixel of the destination can, at least in principle, be
calculated simultaneously: the task is parallelizable. Parallelizable problems are of such
importance that hardware developers have created a set of parallel computing platforms
(PCPs): Nvidia CUDA, ATI Stream Technology (ATI-ST), etc., all accessible through
the OpenCL API. Every PCP developer provides a software toolkit to interact with the
PCP: a programming language with C-like syntax, additional modules, frameworks, etc.
The basis for PCPs are modern GPUs, which are able to perform nearly any
parallelizable task in parallel. For example, at the moment the GeForce GTX 1080 Ti is very popular
for CNN training and significantly accelerates the process. One should notice that the shader
blocks of a GPU are similar to the mobile CPUs with RISC architecture that are used in
modern smartphones.</p>
      <p>In conclusion, improvement of CO performance positively affects nearly any
software on any platform due to the wide variety of tasks it is involved in: DI filtering
(sharpening, edge detection, blurring, etc.), DI scaling, CNN training, multimedia,
etc. It is worth noting that CO (1) is the basis of convolutional neural network
(CNN) functioning. Therefore, by accelerating CO we achieve higher CNN performance
and decrease its training time. Moreover, in some sense CO and similar tasks have
shaped modern approaches to GPU architecture, which emphasizes their importance.</p>
      <p>In the current contribution we propose a new method of CO optimization that utilizes
SIMD. Since SIMD can be applied to integer-valued kernels only, we will
demonstrate a reduction method for real-valued kernels that allows this technique to be
applied. Besides, we provide an experimental comparison of hand-written code based on
this approach with other recognized solutions (the OpenCV library and the ARM
Compute Library). The rest of the paper is organized as follows. First, we
consider SIMD pros and cons, and in subsection II-B we introduce the reduction method
itself.</p>
    </sec>
    <sec id="sec-2">
      <title>A brief overview of modern software optimization</title>
      <p>We will perform the overview in a “bottom to top” style: we consider hardware first,
then software, and then algorithmic methods of performance enhancement.</p>
      <sec id="sec-2-1">
        <title>2.1 Acceleration by means of hardware</title>
        <p>
          Well-chosen hardware architecture may significantly enhance software product
performance. A general perspective on the possible options is given by Flynn’s
taxonomy [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], and the first question to answer is whether the task (e.g. CO) allows
parallelization of the data flow or the instruction flow. Today SIMD principles are implemented in
both RISC (e.g. Cortex-A8-23 and Cortex-A53-72 ARM CPUs) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and CISC (e.g.
the Intel x86 series) CPUs. One can access this feature through specific extensions of the
assembly language: NEON for the RISC architecture, while the CISC architecture implies SSEn
and AVX1/2 usage. In contrast, MIMD principles are not implemented in modern
CPUs, but are partially supported by GPUs.
        </p>
        <p>Modern GPUs provide parallel computing features by means of shader-block CPUs
(SCPUs), based on the RISC architecture, that are used in parallel. Principles of MIMD
are achieved via SIMD/MIMD-like instructions for SCPUs: vector instructions for
numerous 128/256/...-bit registers (there are 32 or more such registers, which is more than
modern CPUs have). The bad part is that you cannot access the SIMD/MIMD
instructions directly; only a small number of intrinsics and pre-implemented
operations are accessible: bit shifts, binary logic, etc. In most cases programmers use specific
frameworks to access the mentioned features, e.g. CUDA and OpenCL for GPUs, or
OpenCL for CPU/DSP/FPGA. The prevalence of these frameworks has led most GPU
manufacturers to get certified by AMD/ATI (compatibility with OpenCL) or Intel/NVidia
(compatibility with CUDA).</p>
        <p>
          Besides GPUs, one can employ co-processor units, e.g. a Digital Signal
Processor (DSP). For example, Qualcomm has developed the Hexagon DSP for embedding into
Snapdragon-625/635/835/825 CPUs. This DSP provides a very long instruction word
(VLIW), which means multithreading at the assembly level: during one clock cycle
3 assembly instructions with different inputs are processed. Compared to plain
SIMD (NEON32 or NEON64) its performance is 4 times higher. Algorithms
optimized for the DSP reduce CPU load by up to ∼75% and improve audio/video
encoding/decoding performance by up to ∼18 times [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Optimization by means of software</title>
        <p>
          The software we use to produce binary code (e.g. the compiler itself, additional libraries,
frameworks) highly influences program performance by employing different
optimizations and hardware platform capabilities. Within the scope of the current article we are mostly
concerned with their ability to perform vectorization without significant loss of
precision. Let us consider three well-known compilers: the GNU Compiler Collection
(GCC/G++) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], Clang [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], and nvcc (compiles cu-files for CUDA).
        </p>
        <p>
          Probably the most popular nowadays is GCC, developed and supported by the FSF
community. GCC, first developed by Richard Stallman, is actually a whole
collection of compilers suitable for different programming languages and architectures. Its
main competitor is the ``rising star'' of compilers, Clang. For example, Apple already
uses it as the base compiler for its products. Clang itself is a frontend for different
programming languages, e.g. C, C++, Objective-C, Objective-C++, and OpenCL. The
actual generation of binary code and vectorization is performed by the LLVM
framework. Both GCC and Clang are performance-oriented, but they still fall short of
hand-written assembly code (see the comparison [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] of Clang ndk-r14b vs. inline assembly
on an Android 5.1.1 (x64) phone with an MT6752 CPU).
        </p>
        <p>The last compiler we want to mention is nvcc. It utilizes CUDA and thus allows
significant improvement of performance on platforms with an NVidia GPU. As we
can see, the mentioned compilers and technologies introduce large heterogeneity to the
field of program optimization. In response, the OpenCL standard was developed (by The
Khronos Group Inc.); it is supported by all the mentioned hardware developers and
provides access to parallel computations on GPU/DSP/CPU.</p>
        <p>For all their advantages, PCPs have a drawback: a big overhead on
transferring data. To mitigate the problem, programmers organize data into pools of ∼100-200
items, which allows a 20-fold increase in performance compared to the CPU. But using
big pools is not always a solution: while CNN training fits this model perfectly,
processing a stream from a video camera does not at all.</p>
        <p>
          Besides a good choice of compiler, one can achieve performance enhancement
by using optimized binary code of frequently used functions (CO, scaling, etc.) supplied
by various libraries. Many of them contain SIMD-optimized code for armeabi-v7a
and arm64-v8a. Besides, a collection of libraries can be combined into a single
framework in such a way that the advantages of one library compensate for the drawbacks of the
others. OpenCV and ACL [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] are good examples of libraries comprising a wide
variety of algorithms, including DI processing, DI analysis, and even a module for
CNN training, that are optimized for different CPU architectures and their SIMD extensions:
AVX1/2, SSE4.x, ARM NEON x32/x64. OpenCV is well known and of high quality,
but ACL has better extensibility due to its modular architecture and seems to perform
better on CO-like tasks (e.g. it is up to 14× faster than OpenCV on CO
for CNNs in one thread [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]). Thus, further we will use both of them as a
reference for comparison.
        </p>
        <p>At the moment SIMD optimization has spread over a wide range of software
products, both proprietary and open source. For example, the kernel of Windows 10 uses
AVX1/2 to achieve better performance (obviously, this influences the whole system),
while the Oracle Java VM utilizes AVX1/2/3DNow!, and thus any Java application runs
faster. The id Tech 2-4 game engines (e.g. the one used in Quake III Arena) are a good
example of open-source projects with SIMD optimization. But, using SIMD, they all
face the issue of translating floating-point code to fixed-point with an acceptable loss of
precision. This can be quite complicated, thus the SIMD optimizations used in proprietary
software are mostly not disclosed.</p>
        <p>
          One more technique to mention is so-called loop unrolling and tiling [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ],
[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] that allows avoidance of redundant comparison operations (e.g. &lt;,&gt;) in cost
of slightly enlargening the code. It is mostly performed by means of compiler or by
introducing appropriate assembly inline-code into the application. Some libraries like
ACL may take advantage of high-level programming language features (e.g.
templates in C++) to perform loop unrolling. A simplified ACL-style code is provided in
listing to demonstrate example implementation (see Помилка! Джерело
посилання не знайдено.).
        </p>
        <p>
          Passing appropriate parameters to do_unroll&lt;...&gt;::run(...) (see Fig. 1), one may
call the function f(...) baseStep×unrollDelta+restSteps times, avoiding unrollDelta−1
counter comparisons and decrements per block. The achieved performance improvement is
discussed in more detail in [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
Instead of the rather general considerations above, let us now focus on CO. The
main obstruction to SIMD optimization is translating floating-point CO to
fixed-point with an acceptable loss of precision. First of all, SIMD operations are performed on
integers only.
        </p>
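        <p>The template-based unrolling referred to above might look roughly as follows (our sketch, not ACL's actual code; the names do_unroll, baseStep, unrollDelta, and restSteps follow the text):</p>

```cpp
#include <cassert>
#include <cstddef>

// Compile-time unrolling: do_unroll<N>::run(f) expands into N calls of f()
// with no runtime counter comparisons or decrements.
template <std::size_t N>
struct do_unroll {
    template <typename F>
    static void run(F&& f) {
        f();
        do_unroll<N - 1>::run(f);  // recursion is resolved at compile time
    }
};

template <>
struct do_unroll<0> {
    template <typename F>
    static void run(F&&) {}  // recursion terminator
};

// Driver: baseStep unrolled blocks of unrollDelta calls, then restSteps
// leftover calls, i.e. baseStep * unrollDelta + restSteps calls in total.
template <std::size_t unrollDelta, typename F>
void unrolled_loop(std::size_t baseStep, std::size_t restSteps, F&& f) {
    for (std::size_t b = 0; b < baseStep; ++b)
        do_unroll<unrollDelta>::run(f);
    for (std::size_t rest = 0; rest < restSteps; ++rest)
        f();
}
```

        <p>The unrolled block compiles down to straight-line calls, which is the effect the text attributes to the ACL listing.</p>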
        <p>Thus we should represent the elements of the kernel Ω from (1) in a suitable form:
ω_{i,j} = λ · ω̃_{i,j},  ω̃_{i,j} ∈ ℤ,  (2)
where λ is a coefficient for normalization. Now we can perform the most
resource-demanding part (additions and multiplications) in a SIMD style and afterwards
normalize the result.</p>
        <p>Any kernel can be represented in form (2), but the more precise a result we want, the
more digits ω̃_{i,j} should have. At this point we meet the limitations of the platform on which
we intend to run the program. Thus, we should set some constraints on ω̃ to avoid
overflow when doing CO.</p>
        <p>Suppose every pixel in the original image is represented as a byte and thus takes
8-bit values 0, …, 255. The same range is taken by the kernel elements ω̃_{i,j}.
Intermediate results are stored as 16-bit signed or unsigned values. To guarantee that no
overflow occurs, we should make sure that it does not occur at any step of the algorithm.
If the kernel has positive elements only, the condition we need looks as follows:
(2^8 − 1) · Σ_{i=0}^{r} Σ_{j=0}^{r} ω̃_{i,j} ≤ 2^16 − 1.  (3)
Substantially, this means that even the largest possible inputs from the image do not lead
to overflow.</p>
        <p>If the kernel contains negative elements, the condition would be much more complicated
and would depend on the order of additions when doing CO. Instead, we will use the much
stronger but easier to check condition
(2^8 − 1) · Σ_{i=0}^{r} Σ_{j=0}^{r} |ω̃_{i,j}| ≤ 2^{16−1} − 1,  (4)
which is independent of the order of operations. This condition can be slightly relaxed:
we can use it for the positive and negative entries of the kernel ω̃ separately. One last
thing to mention: one can easily obtain similar results for signed/unsigned 32-bit
intermediate values by substituting 16 → 32 in (3) and (4).</p>
        <p>What we propose is selecting, for a given Ω, the biggest λ possible such that ω̃ still
satisfies (3) or (4) (which one depends on whether the kernel is purely positive or not). Of
course, we should be concerned whether there exist any useful kernels that can be
reduced to a suitable form. And it seems there are plenty of them.</p>
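        <p>One possible reading of this reduction, sketched in C++ (the names are ours; truncation toward zero is used so that the bound of (3) or (4) is guaranteed to hold after quantization):</p>

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedKernel {
    std::vector<int16_t> coeffs;  // integer kernel, the omega-tilde of (2)
    double lambda;                // normalization coefficient of (2)
};

// Pick the largest integer scale s = 1/lambda such that 255 * sum|trunc(w*s)|
// stays within the 16-bit accumulator: 2^16 - 1 unsigned, 2^15 - 1 signed.
QuantizedKernel quantize(const std::vector<double>& w, bool has_negative) {
    const double budget = (has_negative ? 32767.0 : 65535.0) / 255.0;
    double abs_sum = 0.0;
    for (double v : w) abs_sum += std::fabs(v);
    double s = std::floor(budget / abs_sum);
    if (s < 1.0) s = 1.0;  // degenerate case: kernel too large to represent
    QuantizedKernel q;
    q.lambda = 1.0 / s;
    for (double v : w)  // trunc keeps sum|coeffs| <= s * abs_sum <= budget
        q.coeffs.push_back(static_cast<int16_t>(std::trunc(v * s)));
    return q;
}
```

        <p>E.g. the smoothing kernel (0.25, 0.5, 0.25) quantizes to (64, 128, 64) with λ = 1/257, and condition (3) holds by construction.</p>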
        <p>In conclusion, modern hardware provides mechanisms for vectorization, i.e. SIMD
technologies, that can be used by programmers to enhance application
performance. In most cases this technology is utilized by the compiler to generate binary code
without the programmer's participation. A suitable choice of library may be handy as well:
many libraries contain SIMD-optimized code. But in some cases human intervention
is needed to get the best result. When developing assembly code, one should
represent the function in a form suitable for SIMD optimization. This is not always possible, and
often restrictions (3) and (4) should be satisfied. In the next section we will provide a new
method of CO optimization and then compare it with existing solutions, e.g. OpenCV
and ARM CL.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Optimization of Convolution Operation by means of SIMD</title>
      <p>In the current contribution we propose a new method of Convolution Operation (CO)
optimization based on the SIMD technique. We presume the target kernel satisfies condition
(3). In this section we will provide all the necessary considerations and the assembly code
that illustrates the proposed approach. The next section will be devoted to an experimental
comparison of this method's performance against known CO implementations (OpenCV and
ARM CL).</p>
      <p>Regarding condition (4), the provided code need only be slightly modified. We will
avoid redundant listings and provide code considering only condition (3), while at the
end of the section all the necessary modifications for condition (4) will be described.
We start with a basic implementation of CO (see Fig. 2). It contains no specific
optimizations, but is still a good starting point for our considerations.</p>
      <p>Here qn are ARM NEON registers; regarding syntax and instruction order we
strictly follow the ARM reference manuals. For the sake of simplicity we avoided
normalization by the coefficient λ in Fig. 2, but for completeness we provide it as a
separate listing (see Fig. 3).</p>
      <p>In Fig. 3 we suppose the data for normalization to be stored in registers q12 …
q15, while d3 contains the normalization coefficient λ. The presented code is in some sense
multipurpose and may be used with different CO implementations.</p>
      <p>Now we switch gears to the CO optimization itself. In Fig. 2 we provided
an initial version of this operation in assembly code. But it has one significant
drawback: slow data loading. The following listing (see Fig. 4) avoids this problem by using
one of the registers as a buffer. It is known that simultaneous loading of 16 bytes is
quicker than loading them one by one (approximately 10 and 40 cycles, respectively).
Thus we use one register for preloading extra data and then use this data byte by byte
without redundant load operations.
The main feature of the presented approach (see Fig. 4) is the usage of a cyclic shift (i.e.
vext.8 q0,q0,q1,#1) that allows kernel buffering, so fewer loading
operations are needed (for more details please see the comments in Fig. 4). It is worth noting that the
code in Fig. 4 requires the kernel to contain no more than 16 elements in one row. If
we need kernels with more than 16 elements in a row, the listing need only be
slightly modified.</p>
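      <p>The effect of the cyclic-shift instruction can be modeled in scalar C++ for illustration (our sketch; real code uses the NEON instruction or intrinsics):</p>

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Scalar model of "vext.8 qd, qn, qm, #1": concatenate two 16-byte registers
// qn:qm and extract 16 bytes starting at byte 1, i.e. slide the window by one
// byte without issuing a new load.
std::array<uint8_t, 16> vext1(const std::array<uint8_t, 16>& qn,
                              const std::array<uint8_t, 16>& qm) {
    std::array<uint8_t, 16> out{};
    for (int i = 0; i < 16; ++i)
        out[i] = (i < 15) ? qn[i + 1] : qm[0];
    return out;
}
```

      <p>Loading 16 bytes once and sliding the window this way replaces up to 15 one-byte loads, which is where the cycle savings come from.</p>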
      <p>As we mentioned earlier, this code works for kernels satisfying condition (3). To
make it applicable to kernels satisfying (4) we need to change all vmlal.u8/u16
operations to vmlal.s8/s16. These small but crucial changes transform the code of Fig. 4 into code
capable of working with signed integer kernels. Depending on the given kernel, one can
choose between these two options.</p>
      <p>
        In conclusion, we found a class of kernels that allows significant optimization of CO
by means of SIMD and were able to implement appropriate code combining
loop-unrolling approaches and the method from [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. By exploiting the significant difference in
time between simultaneous 16-byte loading and one-by-one loading, we were able
to achieve a significant speedup of CO. More detailed results and a description of the
measurement procedure will be presented in the following section.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Experimental setup and Results</title>
      <p>Ground truth. To evaluate our results a certain reference is needed. As such we chose
the functions cv::filter2D(...) from the OpenCV library and NEConvolution{N}x{N}::run()
from the ACL library. The latter is well known among AI and DIP researchers due to its
high-quality, optimized code.</p>
      <p>For comparison we used the latest stable tags available at the moment we started the
research; the release tags are 4.0.0 (2018-11-18 [11:08:36]) for OpenCV and v19.02
(2019-02-28 [14:25:18]) for ACL. Compilation was performed with NDK-r18b, the latest
stable NDK at that moment, to ensure API compatibility between them. We ensured that the
libraries utilize vectorization by compiling them with the flags
ANDROID_ABI=armeabi-v7a with NEON, ANDROID_NATIVE_API_LEVEL=22,
CPU_BASELINE/CPU_BASELINE_FINAL=NEON,
CPU_BASELINE_FLAGS=-mfpu=neon -O3 -DNDEBUG. Both OpenCV and ACL were linked as static libraries.
Devices. To make our measurements more relevant we used a set of different devices.
This helps us to understand the influence of architecture, CPU series, and other
parameters on the execution time. The following table lists the devices we used and the
parameters of their CPUs.
Measurement procedure. The pivotal parameter we need to measure is the execution
time of each function. Such measurement might be tricky, since it is highly
susceptible to transient processes in Android OS. To avoid this problem we used the
following procedure: each function (cv::filter2D(...), NEConvolution{N}x{N}::run(), and
newCO(...)) was successively called 3 times (for robustness and to simulate RGB
processing) and the result was stored to an array. After collecting 35 data points we
calculated the median value and treated it as thrice the execution time of the function under
consideration.</p>
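      <p>The timing procedure can be sketched as follows (a simplified sketch; the clock choice and the names are our assumptions):</p>

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

// Measure a function as described in the text: call it calls_per_sample times
// per sample (simulating the three RGB planes), collect `samples` timings, and
// take the median to suppress OS-induced transients; divide to get per-call time.
double median_runtime_ns(const std::function<void()>& fn,
                         int samples = 35, int calls_per_sample = 3) {
    std::vector<double> t;
    for (int s = 0; s < samples; ++s) {
        auto start = std::chrono::steady_clock::now();
        for (int c = 0; c < calls_per_sample; ++c) fn();
        auto stop = std::chrono::steady_clock::now();
        t.push_back(std::chrono::duration<double, std::nano>(stop - start).count());
    }
    std::nth_element(t.begin(), t.begin() + t.size() / 2, t.end());
    return t[t.size() / 2] / calls_per_sample;
}
```

      <p>The median is preferred over the mean here because a single OS-induced stall would skew an average but leaves the median intact.</p>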
      <p>Kernel sizes varied over 2×2, 3×3, …, 15×15 for experiments with our implementation
and cv::filter2D(...), while the implementation of NEConvolution{N}x{N}::run()
necessitates the usage of odd-sized kernels only, e.g. 3×3, 5×5, etc. Digital images (DIs) were
generated with equal width and height according to the formula
W_image = H_image = [125·n/8]·32 + W_kernel − 1, where square brackets […] denote the integer
part of the number. Results are further presented in the form of fractions: cv::filter2D(...)
execution time divided by the execution time of our implementation, and
NEConvolution{N}x{N}::run() execution time divided by the execution time of our implementation.
Results. First we compared the time consumption of our code (see Fig. 4) and the reference
function cv::filter2D(...); the result is presented in figures 5a and 5b.</p>
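      <p>Under our reading of the sizing formula above (the reconstruction is ours), the generated side lengths can be sketched as:</p>

```cpp
#include <cassert>

// Side length of the generated square test images, reading the text's formula
// as W = H = floor(125 * n / 8) * 32 + W_kernel - 1 (integer division = floor).
int image_side(int n, int kernel_side) {
    return (125 * n / 8) * 32 + kernel_side - 1;
}
```

      <p>This reading is consistent with the images of up to roughly 4500×4500 pixels used in the experiments.</p>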
      <p>As coordinates we use the sizes of the kernel and the image, while color intensity designates
the acceleration one may achieve using the proposed method instead of the reference one
(i.e. the fraction of execution times: reference function over proposed).</p>
      <p>
        Although the presented results demonstrate the advantage of the proposed method, there is
still room for improvement. It seems the compiler is unable to unroll cycles effectively
on its own; one may check this by compiling the presented code and exploring the binary
with any suitable disassembler (e.g. IDA or the objdump tool). Thus, we may achieve
an additional 30%-40% of acceleration by utilizing the techniques of [
        <xref ref-type="bibr" rid="ref12 ref13">12-13</xref>
        ].
      </p>
      <p>
        Results for the modified code are shown in figures 5c-5f. We compared the time
consumption of the code of Fig. 4, modified with the approaches of [
        <xref ref-type="bibr" rid="ref12 ref13">12-13</xref>
        ], against both reference
functions (cv::filter2D(...) and NEConvolution{N}x{N}::run()). Besides, we varied
image sizes up to 4500×4500 (∼20 MP) to emulate modern cameras.
      </p>
      <p>As fig. 5 suggests, the acceleration is (almost) independent of the input size, i.e. the
complexity (big-O) of our solution and of the reference solutions coincide. Some small decline
in acceleration (though it still remains greater than 1) may be noted for big kernels (9×9 to
15×15). The mean acceleration is estimated as approximately 1.7 times.</p>
      <p>
        It is worth noting that we did not use parallelism for acceleration. Employing OpenMP
or implementing parallelism by any other means may improve the presented results by a factor of
two or more. Moreover, no preprocessing, e.g. image tiling, was performed.
This technique would probably increase the performance of the approach as well [
        <xref ref-type="bibr" rid="ref16 ref17">16-17</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In conclusion, we propose a method of convolution operation (CO) acceleration. We
show that many kernels utilized in practical applications can be reduced to integer
form (table I), which allows for the use of SIMD optimization. Although SIMD itself leads to a
significant boost in performance, we were able to push the frontier even further by
exploiting the significant difference in time between simultaneous 16-byte loading
(approximately 10 cycles) and one-by-one loading (approximately 40 cycles):
the q2 register is used as a buffer and loading operations are partially substituted with a
cyclic shift (see Fig. 4).</p>
      <p>To test the approach we performed a comparison with the cv::filter2D(...) function from
the OpenCV library and with NEConvolution{N}x{N}::run() from the ACL library (fig. 5).
Our results suggest that the current approach leads to a significant speedup (mean values:
∼1.7× compared to OpenCV and ∼1.5× compared to ACL). Measuring the acceleration
for different kernels and images, we observed no dependence on image size, but kernel
size may influence the result: for kernels smaller than 9×9 we were able to achieve
4.5× acceleration (compared to the cv::filter2D(...) function from OpenCV), while for
larger kernels the presented approach allows only a 1.5× speedup. We did not use
parallelism in our code, thus an additional 2× or more acceleration is possible by employing
appropriate techniques, e.g. the OpenMP library.</p>
      <p>We expect the current approach to be useful for real-time image processing and
convolutional neural network training, as it significantly reduces processing time.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Chyrkov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prystavka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Suspicious Object Search in Airborne Camera Video Stream</article-title>
          . In: Hu Z. et al. (
          <article-title>eds) Advances in Computer Science for Engineering and Education</article-title>
          .
          <source>ICCSEEA 2018. Advances in Intelligent Systems and Computing</source>
          , vol
          <volume>754</volume>
          , pp.
          <fpage>340</fpage>
          -
          <lpage>348</lpage>
          . Springer, Cham, Switzerland (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>S.</given-names>
            <surname>Gnatyuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kinzeryavyy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iavich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Prysiazhnyi</surname>
          </string-name>
          , Kh. Yubuzova,
          <article-title>High-Performance Reliable Block Encryption Algorithms Secured against Linear and Differential Cryptanalytic Attacks</article-title>
          ,
          <source>CEUR Workshop Proceedings</source>
          , Vol.
          <volume>2104</volume>
          , pp.
          <fpage>657</fpage>
          -
          <lpage>668</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. “Documentation for open-cv,” https://docs.opencv.org/trunk/d4/d86/group__imgproc__filter.
          <source>html#ga27c049795ce870216ddfb366086b5a04</source>
          ,
          <year>2017</year>
          , [Online; accessed 27-November-2017].
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>M. J. Flynn</surname>
          </string-name>
          , “
          <article-title>Very high-speed computing systems</article-title>
          ,
          <source>” Proceedings of the IEEE</source>
          , vol.
          <volume>54</volume>
          , no.
          <issue>12</issue>
          , pp.
          <fpage>1901</fpage>
          -
          <lpage>1909</lpage>
          ,
          <year>1966</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. “Arm® Cortex®
          <article-title>- a53 mpcore processor: Reference book of cortex-a53 cpus</article-title>
          ,”http://infocenter.arm.com/help/topic/com.arm.
          <source>doc.ddi0500g/DDI0500G_cortex_a53_tr m.pdf</source>
          ,
          <year>2013</year>
          , [Online; accessed 27-November-2017].
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. “Qualcomm extends Hexagon DSP,” http://pages.cs.wisc.edu/~danav/pubs/qcom/hexagon_microreport2013_v5.pdf,
          <year>2013</year>
          ,[Online; accessed 27-November-2017].
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. “
          <article-title>Qualcomm hexagon dsp: An architecture optimizedfor mobile multimedia</article-title>
          and communications,” https://developer.qualcomm.com/download/hexagon/hexagon-dsp-architecture.pdf,
          <year>2013</year>
          , [Online; accessed 27-November-2017].
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Griffith</surname>
          </string-name>
          ,
          <article-title>GCC: the complete reference</article-title>
          .
          <source>McGraw-Hill</source>
          , Inc.,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Lopes</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Auler</surname>
          </string-name>
          ,
          <article-title>Getting started with LLVM core libraries</article-title>
          .
          <source>Packt Publishing Ltd</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. “Presentation of ARM-CL,” https://community.arm.com/graphics/b/blog/posts/arm-computelibrary-for-computer-vision-and-machine-learning-now-publicly-available,
          <year>2017</year>
          , [Online; accessed 24-May-2017].
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. “
          <article-title>Video presentation for arm-cl</article-title>
          ,” https://developer.arm.com/technologies/computelibrary?_ga=2.909169.1792656346.1530630636-1257957724.1521634632,
          <year>2017</year>
          , [Online; accessed 24-May-2017].
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. A. Nicolau, “
          <article-title>Loop quantization: unwinding for fine-grain parallelism exploitation</article-title>
          ,” Cornell University, Tech. Rep.,
          <year>1985</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. J. Xue, “
          <article-title>Loop tiling for parallelism</article-title>
          , volume
          <volume>575</volume>
          of kluwer international series in engineering and computer science,”
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. T. Veldhuizen, “Template metaprograms. c++ report,”
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15. V. Sarkar, “
          <article-title>Optimized unrolling of nested loops</article-title>
          ,”
          <source>in Proceedings of the 14th International Conference on Supercomputing, ser. ICS '00</source>
          . New York, NY, USA: ACM,
          <year>2000</year>
          , pp.
          <fpage>153</fpage>
          -
          <lpage>166</lpage>
          . [Online]. Available: http://doi.acm.org/10.1145/335231.335246
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Fedushko</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benova</surname>
            <given-names>E</given-names>
          </string-name>
          .
          <article-title>Semantic analysis for information and communication threats detection of online service users</article-title>
          .
          <source>The 10th International Conference on Emerging Ubiquitous Systems and Pervasive Networks (EUSPN 2019) November 4-7</source>
          ,
          <year>2019</year>
          , Coimbra, Portugal.
          <source>Procedia Computer Science</source>
          , Volume
          <volume>160</volume>
          ,
          <year>2019</year>
          , Pages
          <fpage>254</fpage>
          -
          <lpage>259</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Gnatyuk</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Akhmetov</surname>
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kozlovskyi</surname>
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kinzeryavyy</surname>
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aleksander</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prysiazhnyi</surname>
            <given-names>D</given-names>
          </string-name>
          .
          <article-title>New Secure Block Cipher for Critical Applications: Design, Implementation, Speed and Security Analysis</article-title>
          ,
          <source>Advances in Intelligent Systems and Computing</source>
          , Vol.
          <volume>1126</volume>
          , pp.
          <fpage>93</fpage>
          -
          <lpage>104</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>