=Paper=
{{Paper
|id=Vol-3826/short32
|storemode=property
|title=Enhancement of convolution operation performance using SIMD of AArch64 (short paper)
|pdfUrl=https://ceur-ws.org/Vol-3826/short32.pdf
|volume=Vol-3826
|authors=Andrii Shevchenko,Pylyp Prystavka
|dblpUrl=https://dblp.org/rec/conf/cpits/ShevchenkoP24
}}
==Enhancement of convolution operation performance using SIMD of AArch64 (short paper)==
<pdf width="1500px">https://ceur-ws.org/Vol-3826/short32.pdf</pdf>
<pre>
                                Enhancement of convolution operation performance
                                using SIMD of AArch64⋆
                                Andrii Shevchenko1,*,† and Pylyp Prystavka1,†
                                1
                                    National Aviation University, 1 Lubomyra Guzara ave., 03058 Kyiv, Ukraine

                                                      Abstract
                                                      Optimization of two-dimensional convolution through 16-bit SIMD technologies of ARM x64 (aarch64) is
                                                      considered. It is shown that by utilizing inline assembler and 16-bit SIMD commands of aarch64, one can
                                                      achieve a signiﬁcant performance increase compared to the similar functions of OpenCV. Throughout the
                                                      research, filter coeﬃcients were quantized to match the 8-bit range.

                                                      Keywords
                                                      convolution, SIMD, optimization, performance 1


                         1. Introduction                                                                        Equation (1) is rather general and perfectly compatible with
                                                                                                                cv::ﬁlter2D(...) function of the OpenCV library [3], but
                         The process of automatic program code vectorization                                    further we will give our attention to square kernels, i.e. r = c,
                         (APCV) is based on the SIMD instructions of the CPU. APCV                              and thus from now on we presume kernel to be square-
                         is utilized in modern compilers, e.g., GCC and Clang/LLVM.                             shaped without special mentioning.
                         The benefits that provide APCV can be achieved by                                          So, every pixel of the destination DI can be calculated
                         compiling the program with -O3 (or “aggressive” -O4/-                                  simultaneously. This means that the task can be parallelized.
                         Ofast) flag (actually, the flag may differ depending on the                            To accomplish this task, hardware developers created a set
                         platform and compiler). But as was shown in [1, 2], we                                 of parallel computing platforms (PCP) (like Nvidia CUDA,
                         achieve performance from APCV less than we need in the                                 ATI Stream Technology (ATI-ST), etc.) to perform these
                         context of digital image (DI) processing. Moreover, DI                                 parallelizable problems. To create some common approach
                         problems have great importance due to the wide variety of                              for all PCP Chronos is distributing the OpenCL API/lib (just
                         applications in video-stream processing (stabilization,                                like Nvidia distributes CUDA and so on). So every PCP
                         filtration, noise correction, or applying some effects for a                           developer provides the software toolkit to interact with the
                         single image, etc.). In processing DI, one should always take                          PCP: programming language with C-like syntax, additional
                         into account the following features:                                                   modules, frameworks, etc.
                              1.    The computational complexity of the method                                      Modern GPUs that can perform in parallel nearly any
                                    chosen.                                                                     parallelizable task is the basis of PCPs. Currently, GeForce
                              2.    Whether the method is optimized.                                            and ATI video accelerators are very popular for CNN
                              3.    Hardware resources of the target architecture.                              learning and signiﬁcantly accelerate the process. But the
                              One can emphasize resource-demanding (but                                         core/base of this calculation process is performed by shader
                         significant) operations of DI processing: convolution,                                 blocks of GPU. The curious fact is that shader blocks of GPU
                         scaling (mostly achieved through convolution), and analysis                            are similar to mobile CPUs with RISK architecture, which
                         (of color, brightness, contrast, etc.). Convolution operation                          are used in modern smartphones.
                         (CO) (1) is the simplest but most valuable and resource-                                   In conclusion, for to positive effect on the software that
                         demanding operation:                                                                   is using the DI processing (image filtration, scaling, edge
                                                  r    c
                                                                                                                detection, blurring, etc.), it is essential to provide speed
                                       pi , j     k ,l Pk  i  a ,l  j  a  ,                (1)
                                                                                                                improvement of CO. This will automatically lead to speed
                                                k 0 l 0
                                                                                                                improvement in such fields/areas as multimedia (video
                         where 𝑖 = 𝑎, … , 𝑊 − (𝑟 − 𝑎) − 1, 𝑗 = 𝑎 , … , 𝐻 − (𝑐 −
                                                                                                                codecs), CNN learning, etc. It is worth noting that CO (1) is
                         𝑎 ) − 1 are indexing pixels of the destination image p; W
                                                                                                                the basis for (CNN) functioning if it (CO) is used in the base
                         and H are the width and height of the source P and
                                                                                                                optimized computation approach like “Integer-Arithmetic-
                         destination p images (we neglect border effects in the
                                                                                                                Only”. Therefore, by accelerating CO we achieve higher
                         destination at the moment), Γ is the kernel of the
                                                                                                                CNN performance and decrease its learning time. Moreover,
                         convolution (matrix r × c), and a, a´ is so-called “anchors”
                                                                                                                GPU architecture is based/has been created to
                         that deﬁne relative position of a filtered point within the
                                                                                                                improve/solve/speed up such tasks as CO and similar tasks.
                         kernel.


                                CPITS-II 2024: Workshop on Cybersecurity Providing in Information                    0000-0003-3863-0473 (A. Shevchenko);
                                and Telecommunication Systems II, October 26, 2024, Kyiv, Ukraine                  0000-0002-0360-2459 (P. Prystavka)
                                ∗
                                  Corresponding author.                                                                         © 2024 Copyright for this paper by its authors. Use permitted under
                                                                                                                                Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                †
                                  These authors contributed equally.
                                   lllandreyshevchenkolll@gmail.com (A. Shevchenko);
                                chindakor37@gmail.com (P. Prystavka)
CEUR
Workshop
                  ceur-ws.org
              ISSN 1613-0073
                                                                                                          385
Proceedings
In the current contribution, we propose a new method of                2.2. Optimization using software
CO optimization that utilizes ARMx64 SIMD-like aarch64
(NEON64) operations. Since NEON64 can be applied to                    The software we use (e.g. compiler itself, additional
integer-valued kernels, we will demonstrate a method for               libraries, frameworks) highly influences program
kernels with real values that allows this technique to be              performance (that we produce) by employing different
implied. Besides, to prove the suggested approach is                   optimizations to use more effective the hardware platform
effective we provide an experimental comparison of a                   capabilities. In the scope of the current paper, we are
human-made code (based on this approach) with recognized               primarily concerned with their ability to perform
solutions like OpenCV lib.                                             vectorization without significant loss in precision and
    The rest of the paper is organized as follows. First, we           speed.
consider NEON64’s pros and cons, and in subsection II-B we                  Let’s consider three well-known compilers: GNU
introduce the reduction/proposed method itself.                        Compiler Collection (GCC/G++) [8], Clang [9], and nvcc
                                                                       (compiles cu-files for CUDA).
                                                                            The most popular nowadays is still the GCC compiler
2. A brief overview of modern                                          developed/supported by the FSF community. The first
   software optimization                                               versions of GCC were a collection of compilers for different
Will perform the overview in a “bottom to top” style—                  programming languages developed by Richard Stallman.
consider hardware ﬁrst, then software, and then algorithmic            Nowadays GCC is no longer a GNU C compiler now it is a
methods of performance enhancement.                                    GNU Compiler Collection. GNU is an optimizing compiler
                                                                       produced by the GNU Project supporting various
2.1. Acceleration using hardware                                       programming languages, hardware architectures, and
                                                                       operating systems.
It is worth noting that to signiﬁcantly enhance software                    GCC’s main competitor is Clang. For example, Apple
product performance, well-chosen hardware architecture is              already uses it as the basic compiler for its products.
the most important. The ﬁrst question is whether the task              Moreover, the UNIX/BSD OS/distributives also use it as a
(e.g., CO) allows parallelization of data ﬂow (or instructions         default compiler. The Android NDK no longer uses GCC and
ﬂow). Flynn’s taxonomy [4] gives a general perspective on              by default, the clang compiler is used for it. Clang itself is a
possible solutions.                                                    frontend for different programming languages, e.g. C, C++,
     Today NEON64 (A64 instruction set; Neon SIMD for                  Objective-C, Objective-C++, and OpenCL. The actual
ARM64 CPUs) principles are implemented in both RISC                    generation of binary code and vectorization is performed by
(e.g., Cortex-A53-72/X1 ARM64 CPUs) [5] and CISC (e.g.,                the LLVM framework. Both GCC and Clang are
Intel x64/x86 series) CPUs. To obtain access to such a feature         performance-oriented, but still, they fail compared to
(NEON64) specific extensions of the assembly language are              human-made assembly code [1, 2, 10].
needed. CISC architecture implies SSEn and AVX1/2                           nvcc is the last compiler that we want to mention. It
extensions of the assembly language. In contrast, MIMD                 widely utilizes NVidia CUDA plus the power of C language,
principles are not implemented in modern CPUs but are                  which significantly improves PC performance with NVidia
partially supported by GPUs. As was noted before, based on             GPU only. The main peculiarity is that these GPUs can use
RISC architecture modern GPUs provide parallel computing               SIMT Architecture whose core feature is that the
features due to shader blocks-CPU (SCPU). SCPU contains                multiprocessor creates, manages, schedules, and executes
some specific 128/256/...-bit registers using whom partial             threads in groups of 32 parallel threads.
MIMD principles are implemented. The number of these                        But as we can see, the mentioned compilers and
special registers is 32 or more, above the number of SIMD              technologies introduce significant heterogeneity in the field
registers that modern ARM64 CPUs have. The bad part is                 of program optimization. They represent a family of
you cannot access SIMD/MIMD instructions directly                      separated devices/technologies. In response, the OpenCL
through the language that maintains the possibility of                 standard was developed (The Khronos Group Inc.) that is
communicating with SIMD regs of SCPU. There are some                   supported by all mentioned hardware developers and
preordered intrinsics and pre-implemented operations                   provides access to parallel computations on GPU/DSP/CPU.
accessible: bit shifts, binary logic, etc. Mostly, programmers              But PCPs have a drawback—a big overhead on
use specific frameworks to access the mentioned features,              transferring data through the bus. To avoid the problem,
e.g. CUDA and OpenCL for GPU, or OpenCL for                            programmers organize data into pools, which allows for
CPU/DSP/FPGA.                                                          achieving more than a 20-fold increase in performance
     Except using GPU, one can employ co-processor units,              compared to CPU (CNN learning perfectly fits in this
e.g. Digital Signal Processor (DSP) like Qualcomm Hexagon.             model). But using big pools is not always the solution—
It has been developed for embedding into Snapdragon-                   while processing streams from a video camera does not at
6XX/8XX CPUs to reduce the CPU load by up to ∼75% and                  all.
improve audio/video encoding/decoding performance by up                     One more reasonable approach to achieve performance
to ∼18 times [6, 7]. Moreover, compared to simple NEON64,              enhancement of DI processing is supplied by different
its performance is ∼4 times higher. This DSP uses a very               libraries (proprietary or not) like OpenCV or arm
long instruction word (VLIW), which means multithreading               ComputeLibrary. Many of them contain NEON64-optimized
at the assembler level (as SIMD) during one interruption,              code for armeaby-v7a and arm64-v8a. Another smart
three assembly instructions with different inputs are                  strategy is to use a collection of libraries that can be
processed.                                                             combined into a single framework. As a result, the


                                                                 386
advantages of one library compensate drawbacks of the
others. OpenCV and ACL [9] are good examples of libraries
comprising a wide variety of algorithms, including DI
processing, and DI analysis. Moreover, OpenCV contains
even modules for CNN learning, optimized for different
CPU architectures that use SIMD (AVX1/2/SSE4, NEON64)
and GPU optimized approaches/solutions. Also, OpenCV is
well-known for its high-quality DI processing. Thus,
further, we will OpenCV as a reference for comparison.
     At the moment SIMD optimization has spread over a
wide range of programming products, both proprietary and
open-source. For example, the kernel of Windows 10 OS is
widely used AVX1/2/3DNow SIMD optimizations to achieve
better performance (obviously, this influences the whole
system). Oracle Java VM utilizes AVX1/2/3DNow and thus
any Java application runs faster. But, using SIMD
optimization, they all face the issue of translating floating-         Figure 1: Loop unrolling with C++ templates
point code to fixed-point with acceptable loss in precision.
Therefore, it is quite complicated. Thus SIMD optimizations            Thus, we should represent elements of the kernel Γ from (1)
used in proprietary software are mostly non-disclosable.               in a suitable form:
     One more technique to mention is the so-called loop                              Γ , = 𝜈𝛾 , , 𝜈 ∈ 𝑅, 𝛾 , ∈ 𝑍              (2)
unrolling and tiling [11–13]. This technique avoids                    where ν is a coefficient for normalization. Now we can
redundant comparison operations at the cost of slightly                perform/discuss the most resource-demanding part
enlarging the out/binary file. It is mainly performed by               (additions and multiplications) in a SIMD style and
utilizing the compiler or by introducing appropriate                   afterward normalize the result.
assembly inline code into the application.                                 Any kernel can be represented in form (2), but the more
     Some libraries like ACL may use high-level                        precise the result we want, the more digits should have γi,j.
programming language features (e.g., templates in C++) to              So, we should set some constraints on γ to avoid overﬂow
perform loop unrolling. A simplified ACL-style code is                 when doing CO because of the platform's limitations on
provided in the listing to demonstrate an example                      which we intend to run the program.
implementation in Figure 1: Loop unrolling with C++                        Suppose, every pixel in the original image is represented
templates. Our previous paper [2] provided a detailed                  as a byte and thus possesses 8-bit values 0, …, 255. The same
description that leads to a huge (over +25%) speed                     range is possessed by kernel elements γi,j. Intermediate
improvement to an algorithm. However, the ARM64                        results are stored as 16-bit signed or unsigned values. To
architecture was significantly improved compared to the                warrant that no overflow occurs, we should ensure that it
ARMv7-A. If not go too deep in details main conclusion                 does not happen on any algorithm step. If the kernel has
about this kind of approach is that we do not need this                positive elements only, a condition we need looks as follows
technique. Moreover, we have done some simple research in                                         r       r

which we compare the speed of two equivalent functions,                           (28  1)     i , j  216  1.            (3)
                                                                                              i 0 j 0
one with loop unrolling and another without it. The result
                                                                           Substantially, this means that even the largest possible
was unexpected. The function with a loop unrolling gives a
                                                                       inputs from the image do not lead to overflow.
3-5% speed reduction. To get proof about the fact that the
                                                                           If the kernel contains negative elements, the condition
loop was unrolled the IDA was used. As on the ARMv7-A
                                                                       should be much more complicated and depend on the order
arch, the cycle was unrolled on ARM64 by the clang (9
                                                                       of additions when doing CO. Instead, we will use much
versions) compiler, and as expected the body of the
                                                                       stronger but more straightforward to check the condition
bottleneck CO function part was repeated 8 times. But there
                                                                                              r       r
is one thing to mention—the bottleneck CO function part                          (28  1)    | i , j | 216 1  1,        (4)
was covered by redundant comparisons which can have                                          i 0 j 0
such a negative effect. The ACL lib part that was optimized            independent of the operations’ order. Moreover, this
using NEON64 was rewritten without a loop unrolling                    condition can be slightly relaxed—we can use it for positive
approach.                                                              and negative entries of the kernel γ separately. And the last
     This unusual fact gives food for think and in further             thing to mention: one can easily obtain similar results for
research about different CO approaches/methods, we will                signed/unsigned 32-bit intermediate values by substituting
cover (go deeper) them.                                                16 → 32 in (3) and (4).
                                                                           What we propose is selecting for giving Γ the most
2.3. Optimization using special algorithms                             extensive ν possible, such that γ still satisﬁes (3) or (4)
Let’s focus on CO. The primary obstacle for SIMD                       (which one depends on whether the kernel is purely positive
optimization is the act of translation of floating-point CO            or not). Of course, we shouldn’t be concerned about whether
algo into fixed-point algorithm CO algo with an acceptable             any valuable kernels can be reduced to a suitable form/size
loss of precision or even without it. First of all, SIMD               because there are plenty of them.
operations will be performed on integers further.


                                                                 387
    In conclusion, modern hardware provides mechanisms                        Let’s comment on the sections of this code/approach
for vectorization, i.e., SIMD technologies, that programmers              (Figure 2b). This is a naïve approach representing the
can use to enhance the performance of the application. In                 loading operation for each kernel element and loading
most cases, this technology is utilized by the compiler to                source image elements (lines 5, 6). The loading and storing
generate binary code without the participation of the                     operations are the most expensive operations. (lines from 8–
programmer. A suitable choice of the library may be handy                 11) represent the multiply-and-accumulated image values
as well—many libraries contain SIMD-optimized code. But                   (v2, v3) with the kernel element (v0). The results of these
in some cases, human intervention is needed to get the most               operations are stored in the buffer regs (v12, v13, v14, v15).
optimal result. More specifically—the code must represent                 The buffer regs represent the result of the 8-bit
the function/code which can be/suitable for the SIMD                      multiplication of image values on each kernel element
optimization. However, it is not always possible, and in our              extended up to 16-bit unsigned int using the “umlal”
case restrictions (3,4) should be satisfied. In the next section,         operation. These operations are performed for every kernel
we will provide a new method of CO optimization and then                  element. So as you can see, this is time time-consuming
compare it with existing results from OpenCV lib.                         approach.
                                                                              Let’s comment on the sections of this code/approach
3. Optimization of Convolution                                            (see Figure 2a): line 4 loading 48 bytes of grayscale image to
                                                                          v0–v2; line 5 loading 16 bytes of CO kernel in v8; lines
   Operation using SIMD                                                   13,14,17,18 provide conversion from 8-bit to 16-bit and
In the current contribution, we propose a new Convolution                 multiplication calculation with kernel element in v5
Operation (CO) optimization method based on the SIMD                      simultaneously. Please note that v0–v2 registers contain
technique. We presume that the target kernel satisfies the                part of the image that should be convolved with the kernel
3rd condition. This section will provide all necessary                    stored in v8. Register v2 is exploited as a buffer for 16 more
considerations and an inline assembly code that illustrates               bytes of the input image to speed up the CO by utilizing the
the proposed approach. The following section will be                      “ext” operations. Moreover, data from buffer v2 is being
devoted to an experimental comparison of this method’s                    used to perform cyclically shifting content of v0 (line 26), v1
performance to known CO implementations of OpenCV.                        (line 29), and v2 (line 32 with itself) byte-by-byte performed
     Regarding condition (4), the provided code should be                 with the “ext” command. It is not quite clear but we utilize
just slightly modified. Therefore, we will avoid redundant                different names of the registers to save shifted states of v0–
code listings and deliver code that realizes condition (3). In            v2 (lines 12, 16, 11, 23) which is called the register rename
contrast, all necessary modifications for a realization of                technique. Also as you can see we utilized some reordering
condition (4) will be described at the end of the section. We             of instructions which brought little obfuscation.
start with the basic implementation of CO (see Figure 2b). It             Nevertheless, all this gives about 7-10% speedup in
contains no specific optimizations but still is a good point to           comparison to the ordered instruction set which utilizes the
begin our considerations.                                                 process of saving all the time in the same names registers
     Here νn are the NEON64 registers. Regarding syntax and               names v0..v2.
instructions order, we will strictly follow ARM reference                     Moreover, if we save the result of the shift in the same
manuals. For the sake of simplicity, we avoided                           register name (like v0, v1, v2), we receive speed-reduced
normalization by the coefficient ν in (see Figure 2b), but for            impacts. This is because the operation “ext” saves the result
completeness, let us provide it separately (see Figure 2c).               in a state of progress, and when the next operation tries to
     In (Figure 2c) we suppose data for normalization to be               obtain the content of the v0 (or v1, or v2), it produces the
stored in registers v12–v15, while v1[0] contains the                     waiting/bottleneck state. So, the more such conditions
normalization coefficient ν. The presented code is in some                appear in the program, the less win of time provided by the
sense multipurpose and may be used with different CO                      algorithm. The most resource part of optimized CO algo
implementations.                                                          (Fig. 2a) was almost entirely described by us. Finally, the
     Now we switch gears to the CO optimization itself by                 “case” state (Fig. 2a, lines 37 up to 63) represents the
utilizing NEON64. In (see Figure 2b) have been provided a                 calculation finishing of the kernel row.
naive version/approach of this operation (in assembly code).                  So as you can see the main feature of the presented
But this variant contains one significant drawback—data                   approach (see Figure 2a) is the usage of cyclic shift (i.e., ext
loading. The data loading/storing process is the slowest                  v10.16b, v0.16b, v1.16b, #1) that provides the kernel
operation because it involves sub/inner processes like                    buffering, and thus, we need fewer operations of loading.
communication with the CPU and RAM. Even though such                      One more thing that should be mentioned is the pre-save of
hardware approaches like CPU cache cover this operation,                  the shifted data (see Figure 2a) (in lines 11,15) on to 1
it is still slow.                                                         element and (in lines 19, 22) on to 2 elements were used for
     To avoid this problem, one of the registers was used as              the current iteration of CO. Other “ext” operations (lines 25,
a buffer. The following approach (see Figure 2a) avoids this              28, 31) provide the data initialization for the next iteration
problem by using one of the registers as a buffer. It is known            of CO. Worth noting, that provided (see Figure 2a) demands
that simultaneous loading of 16 bytes is quicker than                     a kernel containing not more than 16 elements in one row.
loading them one by one. Thus we use one register for                     Another variation of this interpretation in which CO kernel
preloading extra data and then use this data to perform                   size is more than 16 elements should utilize data
byte-by-byte shift to exclude redundant load operations.                  reinitialization of the base registers, which can be seen (in
                                                                          lines 3–4).


                                                                    388
                                                                                             (b)


                      (a)                                                    (c)
  Figure 2: CO optimization with SIMD NEON64: (a) optimized approach; (b) naive approach; (c) normalization
                                      procedure and saving the result.

Let’s comment on the sections of this code/approach                  As we mentioned earlier, this code works for kernels
(Figure 2c). This section provides the normalization of              satisfying conditions (3). To make it applicable to
coefficient ν. Lines from 2–9 represent the conversion               kernels satisfying (4), we need to change all "umlal"
process from 16-bit data types up to 32-bit data types.              operations to "smlal" but before it, the extended
Saving all data needs twice as much register stack (v12–             operation is required (like “sxtl”). These small but crucial
v19) as it was before (v12–v15). Lines from 10–17                    changes transform (see Figure 2a) into code that works
represent the data conversion from unsigned integer                  with signed integer kernels. Depending on elements in
(32-bit) up to (32-bit) floating-point. Finally, lines 19–25         the given kernel, one can choose between these two
describe the normalization process with the ν coefficient            options.
(placed in v1.s[0]). All other lines (26–46) represent the               In conclusion, we found a class of kernels that allow
reverse process: the normalized data converts from the               significant optimization CO utilizing NEON64 and
(32-bit) floating-point up to (8-bit) unsigned integer               implementing appropriate code/algo. For example, the
(lines 26-45) and the result saving (line 46).                       Subband low pass filtering kernel, like (5).
                                                               389
               1 6 1                                             understand the influence of architecture, CPU series,
                                    1
             6 36 6  ,  =
                                                                   and other parameters on the execution time. The
                                                   (5)
               1 6 1             64                              Odroid-C4 CPU is Cortex-A55; the OS is Ubuntu 20.04;
                                                                 Linux 5.7.0-odroid-arm64 is the kernel, and its API is
Furthermore, we achieved a significant CO speedup by               aarch64. The CPU series of this device is Amlogic
exploiting substantial differences in time for                     S905X3 which is more powerful than the latest
simultaneous 16-byte loading with a byte-shift approach            Raspberry Pi CPUs.
compared to one-by-one line loading. More detailed                      Measurement procedure. The pivoting parameter
results and considerations of the measurement                      we need to measure is the execution time of each
procedure will be presented in the following section.              function. Such measurement might be tricky since it is
                                                                   highly susceptible to transition processes in any GNU
4. Experimental setup and results                                  OS (Ubuntu, Android, etc.).
                                                                        To avoid this problem, we used the following
Ground truth. To evaluate our results certain reference            procedure: each function (cv::filter2d(...) and proposed
is needed. As the etalon, we chose functions                       method - newCO(...)) was successively called three times
cv::filter2d(...) from the OpenCV library. The latter is           (for robustness and to simulate Grayscale processing),
well-known among AI and DIP researchers due to its                 and the result was stored to the array—this is one data
high-quality and optimized code. Especially when quick             point. Then, after collecting 35 data points, we calculated
prototyping is needed.                                             the median value and treated it as twice the function's
     For comparison, we used the latest stable tag                 execution time under consideration.
available when we started to research. The release tag is               Kernel sizes varied 2×2, 3×3, …, 15×15 for
4.5.2 (2021-04-02 11:23) for OpenCV. The compilation               experiments         with     our   implementation       and
was performed with clang-9 - the latest stable clang               cv::filter2d(...). DIs were generated with equal width and
version. We ensured that libraries utilize vectorization,          height,     the       corresponding    formula      follows
compiling            them      with        flags:       -
DCMAKE_BUILD_TYPE=RELEASE                               -                            125 n 
DENABLE_NEON=ON ... and the compilation process
                                                                   Wimage  Himage          32  Wkernel 1,
was with the verbose mode on. The result is that some                                 8                        where
critical fields like "CPU_BASELINE" (NEON F16) and                 square brackets […] denote the integer part of the
"C++ flags (Release)" (...-O3 -DNDEBUG...) provided                number. Results are further presented in the form of
needed content. Also, we mention the fact that OpenCV              fractions cv::filter2d(...) execution time divided by
lib was linked as a dynamic library.                               execution time of proposed/our implementation.
     Devices. To make our measurements more relevant,
we used such a device as Odroid-C4. This helps us


                         (a)                                                     (b)
Figure 3: Performance comparison of the CO usage of cv::filter2D vs. the proposed method on the devices with
Cortex-A55 ARM CPU.

Color intensity designates relative time consumption for           5. Results
reference function about the proposed method.
Acceleration one may achieve by using the presented                First, we compared the time consumption of the
approach instead of the reference function (the brighter           proposed code (see Figure 2a) and reference function
is color—the greater is acceleration). Legends on each             cv::filter2d(...). The result is presented in Figs. 5a and 5b.
plot designate how to translate color to acceleration; if          As coordinates, we use sizes of kernel and image. At the
this number is greater than 1, it is profitable to use the         same time, color intensity designates acceleration,
proposed method.                                                   which one may achieve using the proposed method
                                                                   instead of the reference method (e.g., a fraction of the

                                                             390
execution times of the reference function divided by the             acceleration compared to cv::filter2D(...), while for larger
execution time of the proposed method).                              kernels presented approach allows ~3.7 speedup.
     Despite the presented results demonstrating the                     We expect the current approach to be useful for real-
advantage of the proposed method, there is still room for            time image processing and convolutional neural
improvement. For example, it seems the compiler cannot               network training as it significantly reduces processing
unroll cycles effectively on its own, and we mentioned               time.
this above. But if we do the same as was done in ACL—
unroll all "bottle-neck cycles" on our own, it seems we              References
can achieve a more speedy approach/results. Thus, we
may reach an additional 10-20% acceleration by utilizing             [1] P. Prystavka, A. Shevchenko, Investigation of the
techniques [11–13] by writing cycle unrolling with the                    Implementation of the Linear Operator of Digital
online assembly by hand.                                                  Image Convolution in 16-bit Computing, Actual
     Results for the modified code are shown in Figure 3.                 Problems of Automation and Information
We have compared the time consumption of the                              Technologies, no. 20 (2016) 78-90.
proposed method (see Figure 2a) and function                         [2] A. Shevchenko, V. Tymchyshyn, A SIMD-based
cv::filter2d(...). Besides, we varied image sizes up to                   Approach to the Enhancement of Convolution
4500×4500 (~20 [MP]) to emulate modern cameras and                        Operation Performance, CMiGIN (2019).
picture libs.                                                        [3] Image Filtering (2021). URL: https://docs.opencv.
     As Figure 3 suggests, acceleration is independent                    org/4.x/d4/d86/group__imgproc__filter.html#ga2
(almost) of the input size, e.g. complexity (big-O) of our                7c049795ce870216ddfb366086b5a04.
solution and reference solutions coincide. Some small                [4] M. J. Flynn, Very High-Speed Computing Systems,
decline in acceleration (but it is still greater than 2) may              in: Proceedings of the IEEE 54.12 (1966) 1901–1909.
be noted for big kernels (13×13 … 15×15) and smaller                 [5] ARM Cortex- A53 MPCore Processor Technical
kernels (2×2 … 4×4). Regarding mean acceleration, it is                   Reference Manual (2021). URL: https://developer.
estimated as approximately 3.7 times.                                     arm.com/documentation/ddi0500/j.
     It is worth noting that we didn’t use parallelism for           [6] Qualcomm Extends Hexagon DSP (2013) URL:
acceleration. Moreover, no preprocessing, e.g., image                     http://pages.cs.wisc.edu/~danav/pubs/qcom/hexa
tiling, was performed. Probably, this technique may                       gon_microreport2013_v5.pdf.
increase the performance of the approach as well.                    [7] Qualcomm Hexagon DSP: An Architecture
                                                                          Optimized      for Mobile       Multimedia       and
                                                                          Communications (2013). URL: https://developer.
6. Conclusions                                                            qualcomm.com/download/hexagon/hexagon-dsp-
In conclusion, we propose a method of convolution                         architecture.pdf.
operation acceleration. We have shown that speed                     [8] Griffith, Arthur. GCC: The Complete Reference.
improvement can be achieved if kernels have been                          McGraw-Hill, Inc., 2002.
reduced to integer values that allow SIMD command                    [9] B. C. Lopes, R. Auler. Getting Started with LLVM
usage. Furthermore, despite SIMD itself leading to a                      Core Libraries. Packt. Publishing ltd. (2014).
significant boost of performance, we were able to push               [10] ARM       Compute       Library     (2021).    URL:
the frontiers even further by exploiting the considerable                 https://developer.arm.com/ip-products/
difference in time for simultaneous 32-byte loading                       processors/machine-learning/compute-librar
compared to their one-by-one loading and using buffer                [11] A. Nicolau, Loop Quantization: Unwinding for
(one-time load for the kernel row—48-byte), and loading                   Fine-Grain Parallelism Exploitation. Cornell
operations are partially substituted with cyclic shift.                   University (1985).
    About ALC, we should mention in addition. There                  [12] Xue, Jingling. Loop Tiling for Parallelism, vol. 575
was a severe code rewriting event in this lib.                            (2000).
Furthermore, the patches became cumulative ("less                    [13] T. Veldhuizen, Expression Templates. C++ Report
description more code"). This fact brought more                           7.5 (1995) 26–31.
obscurity/obfuscation than clarity/understanding. So,
we will compare the ACL lib and modifications of our
suggested approach in our following paper but it is
needed to mention that ALC provides all additional code
optimization approaches that we mentioned above
(cycle unrolling, image tiling, etc.).
    To test the approach we performed a comparison
with the cv::filter2D(...) function from the OpenCV
library. Our results suggest the current approach leads
to significant speedup (mean values: ~3.7× compared to
OpenCV). Measuring acceleration for different kernels
and images we observed no dependence on image size,
but kernel size may influence the result—for kernels
smaller than 8×8 we were able to achieve ×7.379


                                                               391

</pre>