=Paper=
{{Paper
|id=Vol-3826/short32
|storemode=property
|title=Enhancement of convolution operation performance using SIMD of AArch64 (short paper)
|pdfUrl=https://ceur-ws.org/Vol-3826/short32.pdf
|volume=Vol-3826
|authors=Andrii Shevchenko,Pylyp Prystavka
|dblpUrl=https://dblp.org/rec/conf/cpits/ShevchenkoP24
}}
==Enhancement of convolution operation performance using SIMD of AArch64 (short paper)==
Enhancement of convolution operation performance
using SIMD of AArch64⋆
Andrii Shevchenko1,*,† and Pylyp Prystavka1,†
1
National Aviation University, 1 Lubomyra Guzara ave., 03058 Kyiv, Ukraine
Abstract
Optimization of two-dimensional convolution through 16-bit SIMD technologies of ARM x64 (aarch64) is
considered. It is shown that by utilizing inline assembler and 16-bit SIMD commands of aarch64, one can
achieve a significant performance increase compared to the similar functions of OpenCV. Throughout the
research, filter coefficients were quantized to match the 8-bit range.
Keywords
convolution, SIMD, optimization, performance 1
1. Introduction Equation (1) is rather general and perfectly compatible with
cv::filter2D(...) function of the OpenCV library [3], but
The process of automatic program code vectorization further we will give our attention to square kernels, i.e. r = c,
(APCV) is based on the SIMD instructions of the CPU. APCV and thus from now on we presume kernel to be square-
is utilized in modern compilers, e.g., GCC and Clang/LLVM. shaped without special mentioning.
The benefits that provide APCV can be achieved by So, every pixel of the destination DI can be calculated
compiling the program with -O3 (or “aggressive” -O4/- simultaneously. This means that the task can be parallelized.
Ofast) flag (actually, the flag may differ depending on the To accomplish this task, hardware developers created a set
platform and compiler). But as was shown in [1, 2], we of parallel computing platforms (PCP) (like Nvidia CUDA,
achieve performance from APCV less than we need in the ATI Stream Technology (ATI-ST), etc.) to perform these
context of digital image (DI) processing. Moreover, DI parallelizable problems. To create some common approach
problems have great importance due to the wide variety of for all PCP Chronos is distributing the OpenCL API/lib (just
applications in video-stream processing (stabilization, like Nvidia distributes CUDA and so on). So every PCP
filtration, noise correction, or applying some effects for a developer provides the software toolkit to interact with the
single image, etc.). In processing DI, one should always take PCP: programming language with C-like syntax, additional
into account the following features: modules, frameworks, etc.
1. The computational complexity of the method Modern GPUs that can perform in parallel nearly any
chosen. parallelizable task is the basis of PCPs. Currently, GeForce
2. Whether the method is optimized. and ATI video accelerators are very popular for CNN
3. Hardware resources of the target architecture. learning and significantly accelerate the process. But the
One can emphasize resource-demanding (but core/base of this calculation process is performed by shader
significant) operations of DI processing: convolution, blocks of GPU. The curious fact is that shader blocks of GPU
scaling (mostly achieved through convolution), and analysis are similar to mobile CPUs with RISK architecture, which
(of color, brightness, contrast, etc.). Convolution operation are used in modern smartphones.
(CO) (1) is the simplest but most valuable and resource- In conclusion, for to positive effect on the software that
demanding operation: is using the DI processing (image filtration, scaling, edge
r c
detection, blurring, etc.), it is essential to provide speed
pi , j k ,l Pk i a ,l j a , (1)
improvement of CO. This will automatically lead to speed
k 0 l 0
improvement in such fields/areas as multimedia (video
where 𝑖 = 𝑎, … , 𝑊 − (𝑟 − 𝑎) − 1, 𝑗 = 𝑎 , … , 𝐻 − (𝑐 −
codecs), CNN learning, etc. It is worth noting that CO (1) is
𝑎 ) − 1 are indexing pixels of the destination image p; W
the basis for (CNN) functioning if it (CO) is used in the base
and H are the width and height of the source P and
optimized computation approach like “Integer-Arithmetic-
destination p images (we neglect border effects in the
Only”. Therefore, by accelerating CO we achieve higher
destination at the moment), Γ is the kernel of the
CNN performance and decrease its learning time. Moreover,
convolution (matrix r × c), and a, a´ is so-called “anchors”
GPU architecture is based/has been created to
that define relative position of a filtered point within the
improve/solve/speed up such tasks as CO and similar tasks.
kernel.
CPITS-II 2024: Workshop on Cybersecurity Providing in Information 0000-0003-3863-0473 (A. Shevchenko);
and Telecommunication Systems II, October 26, 2024, Kyiv, Ukraine 0000-0002-0360-2459 (P. Prystavka)
∗
Corresponding author. © 2024 Copyright for this paper by its authors. Use permitted under
Creative Commons License Attribution 4.0 International (CC BY 4.0).
†
These authors contributed equally.
lllandreyshevchenkolll@gmail.com (A. Shevchenko);
chindakor37@gmail.com (P. Prystavka)
CEUR
Workshop
ceur-ws.org
ISSN 1613-0073
385
Proceedings
In the current contribution, we propose a new method of 2.2. Optimization using software
CO optimization that utilizes ARMx64 SIMD-like aarch64
(NEON64) operations. Since NEON64 can be applied to The software we use (e.g. compiler itself, additional
integer-valued kernels, we will demonstrate a method for libraries, frameworks) highly influences program
kernels with real values that allows this technique to be performance (that we produce) by employing different
implied. Besides, to prove the suggested approach is optimizations to use more effective the hardware platform
effective we provide an experimental comparison of a capabilities. In the scope of the current paper, we are
human-made code (based on this approach) with recognized primarily concerned with their ability to perform
solutions like OpenCV lib. vectorization without significant loss in precision and
The rest of the paper is organized as follows. First, we speed.
consider NEON64’s pros and cons, and in subsection II-B we Let’s consider three well-known compilers: GNU
introduce the reduction/proposed method itself. Compiler Collection (GCC/G++) [8], Clang [9], and nvcc
(compiles cu-files for CUDA).
The most popular nowadays is still the GCC compiler
2. A brief overview of modern developed/supported by the FSF community. The first
software optimization versions of GCC were a collection of compilers for different
Will perform the overview in a “bottom to top” style— programming languages developed by Richard Stallman.
consider hardware first, then software, and then algorithmic Nowadays GCC is no longer a GNU C compiler now it is a
methods of performance enhancement. GNU Compiler Collection. GNU is an optimizing compiler
produced by the GNU Project supporting various
2.1. Acceleration using hardware programming languages, hardware architectures, and
operating systems.
It is worth noting that to significantly enhance software GCC’s main competitor is Clang. For example, Apple
product performance, well-chosen hardware architecture is already uses it as the basic compiler for its products.
the most important. The first question is whether the task Moreover, the UNIX/BSD OS/distributives also use it as a
(e.g., CO) allows parallelization of data flow (or instructions default compiler. The Android NDK no longer uses GCC and
flow). Flynn’s taxonomy [4] gives a general perspective on by default, the clang compiler is used for it. Clang itself is a
possible solutions. frontend for different programming languages, e.g. C, C++,
Today NEON64 (A64 instruction set; Neon SIMD for Objective-C, Objective-C++, and OpenCL. The actual
ARM64 CPUs) principles are implemented in both RISC generation of binary code and vectorization is performed by
(e.g., Cortex-A53-72/X1 ARM64 CPUs) [5] and CISC (e.g., the LLVM framework. Both GCC and Clang are
Intel x64/x86 series) CPUs. To obtain access to such a feature performance-oriented, but still, they fail compared to
(NEON64) specific extensions of the assembly language are human-made assembly code [1, 2, 10].
needed. CISC architecture implies SSEn and AVX1/2 nvcc is the last compiler that we want to mention. It
extensions of the assembly language. In contrast, MIMD widely utilizes NVidia CUDA plus the power of C language,
principles are not implemented in modern CPUs but are which significantly improves PC performance with NVidia
partially supported by GPUs. As was noted before, based on GPU only. The main peculiarity is that these GPUs can use
RISC architecture modern GPUs provide parallel computing SIMT Architecture whose core feature is that the
features due to shader blocks-CPU (SCPU). SCPU contains multiprocessor creates, manages, schedules, and executes
some specific 128/256/...-bit registers using whom partial threads in groups of 32 parallel threads.
MIMD principles are implemented. The number of these But as we can see, the mentioned compilers and
special registers is 32 or more, above the number of SIMD technologies introduce significant heterogeneity in the field
registers that modern ARM64 CPUs have. The bad part is of program optimization. They represent a family of
you cannot access SIMD/MIMD instructions directly separated devices/technologies. In response, the OpenCL
through the language that maintains the possibility of standard was developed (The Khronos Group Inc.) that is
communicating with SIMD regs of SCPU. There are some supported by all mentioned hardware developers and
preordered intrinsics and pre-implemented operations provides access to parallel computations on GPU/DSP/CPU.
accessible: bit shifts, binary logic, etc. Mostly, programmers But PCPs have a drawback—a big overhead on
use specific frameworks to access the mentioned features, transferring data through the bus. To avoid the problem,
e.g. CUDA and OpenCL for GPU, or OpenCL for programmers organize data into pools, which allows for
CPU/DSP/FPGA. achieving more than a 20-fold increase in performance
Except using GPU, one can employ co-processor units, compared to CPU (CNN learning perfectly fits in this
e.g. Digital Signal Processor (DSP) like Qualcomm Hexagon. model). But using big pools is not always the solution—
It has been developed for embedding into Snapdragon- while processing streams from a video camera does not at
6XX/8XX CPUs to reduce the CPU load by up to ∼75% and all.
improve audio/video encoding/decoding performance by up One more reasonable approach to achieve performance
to ∼18 times [6, 7]. Moreover, compared to simple NEON64, enhancement of DI processing is supplied by different
its performance is ∼4 times higher. This DSP uses a very libraries (proprietary or not) like OpenCV or arm
long instruction word (VLIW), which means multithreading ComputeLibrary. Many of them contain NEON64-optimized
at the assembler level (as SIMD) during one interruption, code for armeaby-v7a and arm64-v8a. Another smart
three assembly instructions with different inputs are strategy is to use a collection of libraries that can be
processed. combined into a single framework. As a result, the
386
advantages of one library compensate drawbacks of the
others. OpenCV and ACL [9] are good examples of libraries
comprising a wide variety of algorithms, including DI
processing, and DI analysis. Moreover, OpenCV contains
even modules for CNN learning, optimized for different
CPU architectures that use SIMD (AVX1/2/SSE4, NEON64)
and GPU optimized approaches/solutions. Also, OpenCV is
well-known for its high-quality DI processing. Thus,
further, we will OpenCV as a reference for comparison.
At the moment SIMD optimization has spread over a
wide range of programming products, both proprietary and
open-source. For example, the kernel of Windows 10 OS is
widely used AVX1/2/3DNow SIMD optimizations to achieve
better performance (obviously, this influences the whole
system). Oracle Java VM utilizes AVX1/2/3DNow and thus
any Java application runs faster. But, using SIMD
optimization, they all face the issue of translating floating- Figure 1: Loop unrolling with C++ templates
point code to fixed-point with acceptable loss in precision.
Therefore, it is quite complicated. Thus SIMD optimizations Thus, we should represent elements of the kernel Γ from (1)
used in proprietary software are mostly non-disclosable. in a suitable form:
One more technique to mention is the so-called loop Γ , = 𝜈𝛾 , , 𝜈 ∈ 𝑅, 𝛾 , ∈ 𝑍 (2)
unrolling and tiling [11–13]. This technique avoids where ν is a coefficient for normalization. Now we can
redundant comparison operations at the cost of slightly perform/discuss the most resource-demanding part
enlarging the out/binary file. It is mainly performed by (additions and multiplications) in a SIMD style and
utilizing the compiler or by introducing appropriate afterward normalize the result.
assembly inline code into the application. Any kernel can be represented in form (2), but the more
Some libraries like ACL may use high-level precise the result we want, the more digits should have γi,j.
programming language features (e.g., templates in C++) to So, we should set some constraints on γ to avoid overflow
perform loop unrolling. A simplified ACL-style code is when doing CO because of the platform's limitations on
provided in the listing to demonstrate an example which we intend to run the program.
implementation in Figure 1: Loop unrolling with C++ Suppose, every pixel in the original image is represented
templates. Our previous paper [2] provided a detailed as a byte and thus possesses 8-bit values 0, …, 255. The same
description that leads to a huge (over +25%) speed range is possessed by kernel elements γi,j. Intermediate
improvement to an algorithm. However, the ARM64 results are stored as 16-bit signed or unsigned values. To
architecture was significantly improved compared to the warrant that no overflow occurs, we should ensure that it
ARMv7-A. If not go too deep in details main conclusion does not happen on any algorithm step. If the kernel has
about this kind of approach is that we do not need this positive elements only, a condition we need looks as follows
technique. Moreover, we have done some simple research in r r
which we compare the speed of two equivalent functions, (28 1) i , j 216 1. (3)
i 0 j 0
one with loop unrolling and another without it. The result
Substantially, this means that even the largest possible
was unexpected. The function with a loop unrolling gives a
inputs from the image do not lead to overflow.
3-5% speed reduction. To get proof about the fact that the
If the kernel contains negative elements, the condition
loop was unrolled the IDA was used. As on the ARMv7-A
should be much more complicated and depend on the order
arch, the cycle was unrolled on ARM64 by the clang (9
of additions when doing CO. Instead, we will use much
versions) compiler, and as expected the body of the
stronger but more straightforward to check the condition
bottleneck CO function part was repeated 8 times. But there
r r
is one thing to mention—the bottleneck CO function part (28 1) | i , j | 216 1 1, (4)
was covered by redundant comparisons which can have i 0 j 0
such a negative effect. The ACL lib part that was optimized independent of the operations’ order. Moreover, this
using NEON64 was rewritten without a loop unrolling condition can be slightly relaxed—we can use it for positive
approach. and negative entries of the kernel γ separately. And the last
This unusual fact gives food for think and in further thing to mention: one can easily obtain similar results for
research about different CO approaches/methods, we will signed/unsigned 32-bit intermediate values by substituting
cover (go deeper) them. 16 → 32 in (3) and (4).
What we propose is selecting for giving Γ the most
2.3. Optimization using special algorithms extensive ν possible, such that γ still satisfies (3) or (4)
Let’s focus on CO. The primary obstacle for SIMD (which one depends on whether the kernel is purely positive
optimization is the act of translation of floating-point CO or not). Of course, we shouldn’t be concerned about whether
algo into fixed-point algorithm CO algo with an acceptable any valuable kernels can be reduced to a suitable form/size
loss of precision or even without it. First of all, SIMD because there are plenty of them.
operations will be performed on integers further.
387
In conclusion, modern hardware provides mechanisms Let’s comment on the sections of this code/approach
for vectorization, i.e., SIMD technologies, that programmers (Figure 2b). This is a naïve approach representing the
can use to enhance the performance of the application. In loading operation for each kernel element and loading
most cases, this technology is utilized by the compiler to source image elements (lines 5, 6). The loading and storing
generate binary code without the participation of the operations are the most expensive operations. (lines from 8–
programmer. A suitable choice of the library may be handy 11) represent the multiply-and-accumulated image values
as well—many libraries contain SIMD-optimized code. But (v2, v3) with the kernel element (v0). The results of these
in some cases, human intervention is needed to get the most operations are stored in the buffer regs (v12, v13, v14, v15).
optimal result. More specifically—the code must represent The buffer regs represent the result of the 8-bit
the function/code which can be/suitable for the SIMD multiplication of image values on each kernel element
optimization. However, it is not always possible, and in our extended up to 16-bit unsigned int using the “umlal”
case restrictions (3,4) should be satisfied. In the next section, operation. These operations are performed for every kernel
we will provide a new method of CO optimization and then element. So as you can see, this is time time-consuming
compare it with existing results from OpenCV lib. approach.
Let’s comment on the sections of this code/approach
3. Optimization of Convolution (see Figure 2a): line 4 loading 48 bytes of grayscale image to
v0–v2; line 5 loading 16 bytes of CO kernel in v8; lines
Operation using SIMD 13,14,17,18 provide conversion from 8-bit to 16-bit and
In the current contribution, we propose a new Convolution multiplication calculation with kernel element in v5
Operation (CO) optimization method based on the SIMD simultaneously. Please note that v0–v2 registers contain
technique. We presume that the target kernel satisfies the part of the image that should be convolved with the kernel
3rd condition. This section will provide all necessary stored in v8. Register v2 is exploited as a buffer for 16 more
considerations and an inline assembly code that illustrates bytes of the input image to speed up the CO by utilizing the
the proposed approach. The following section will be “ext” operations. Moreover, data from buffer v2 is being
devoted to an experimental comparison of this method’s used to perform cyclically shifting content of v0 (line 26), v1
performance to known CO implementations of OpenCV. (line 29), and v2 (line 32 with itself) byte-by-byte performed
Regarding condition (4), the provided code should be with the “ext” command. It is not quite clear but we utilize
just slightly modified. Therefore, we will avoid redundant different names of the registers to save shifted states of v0–
code listings and deliver code that realizes condition (3). In v2 (lines 12, 16, 11, 23) which is called the register rename
contrast, all necessary modifications for a realization of technique. Also as you can see we utilized some reordering
condition (4) will be described at the end of the section. We of instructions which brought little obfuscation.
start with the basic implementation of CO (see Figure 2b). It Nevertheless, all this gives about 7-10% speedup in
contains no specific optimizations but still is a good point to comparison to the ordered instruction set which utilizes the
begin our considerations. process of saving all the time in the same names registers
Here νn are the NEON64 registers. Regarding syntax and names v0..v2.
instructions order, we will strictly follow ARM reference Moreover, if we save the result of the shift in the same
manuals. For the sake of simplicity, we avoided register name (like v0, v1, v2), we receive speed-reduced
normalization by the coefficient ν in (see Figure 2b), but for impacts. This is because the operation “ext” saves the result
completeness, let us provide it separately (see Figure 2c). in a state of progress, and when the next operation tries to
In (Figure 2c) we suppose data for normalization to be obtain the content of the v0 (or v1, or v2), it produces the
stored in registers v12–v15, while v1[0] contains the waiting/bottleneck state. So, the more such conditions
normalization coefficient ν. The presented code is in some appear in the program, the less win of time provided by the
sense multipurpose and may be used with different CO algorithm. The most resource part of optimized CO algo
implementations. (Fig. 2a) was almost entirely described by us. Finally, the
Now we switch gears to the CO optimization itself by “case” state (Fig. 2a, lines 37 up to 63) represents the
utilizing NEON64. In (see Figure 2b) have been provided a calculation finishing of the kernel row.
naive version/approach of this operation (in assembly code). So as you can see the main feature of the presented
But this variant contains one significant drawback—data approach (see Figure 2a) is the usage of cyclic shift (i.e., ext
loading. The data loading/storing process is the slowest v10.16b, v0.16b, v1.16b, #1) that provides the kernel
operation because it involves sub/inner processes like buffering, and thus, we need fewer operations of loading.
communication with the CPU and RAM. Even though such One more thing that should be mentioned is the pre-save of
hardware approaches like CPU cache cover this operation, the shifted data (see Figure 2a) (in lines 11,15) on to 1
it is still slow. element and (in lines 19, 22) on to 2 elements were used for
To avoid this problem, one of the registers was used as the current iteration of CO. Other “ext” operations (lines 25,
a buffer. The following approach (see Figure 2a) avoids this 28, 31) provide the data initialization for the next iteration
problem by using one of the registers as a buffer. It is known of CO. Worth noting, that provided (see Figure 2a) demands
that simultaneous loading of 16 bytes is quicker than a kernel containing not more than 16 elements in one row.
loading them one by one. Thus we use one register for Another variation of this interpretation in which CO kernel
preloading extra data and then use this data to perform size is more than 16 elements should utilize data
byte-by-byte shift to exclude redundant load operations. reinitialization of the base registers, which can be seen (in
lines 3–4).
388
(b)
(a) (c)
Figure 2: CO optimization with SIMD NEON64: (a) optimized approach; (b) naive approach; (c) normalization
procedure and saving the result.
Let’s comment on the sections of this code/approach As we mentioned earlier, this code works for kernels
(Figure 2c). This section provides the normalization of satisfying conditions (3). To make it applicable to
coefficient ν. Lines from 2–9 represent the conversion kernels satisfying (4), we need to change all "umlal"
process from 16-bit data types up to 32-bit data types. operations to "smlal" but before it, the extended
Saving all data needs twice as much register stack (v12– operation is required (like “sxtl”). These small but crucial
v19) as it was before (v12–v15). Lines from 10–17 changes transform (see Figure 2a) into code that works
represent the data conversion from unsigned integer with signed integer kernels. Depending on elements in
(32-bit) up to (32-bit) floating-point. Finally, lines 19–25 the given kernel, one can choose between these two
describe the normalization process with the ν coefficient options.
(placed in v1.s[0]). All other lines (26–46) represent the In conclusion, we found a class of kernels that allow
reverse process: the normalized data converts from the significant optimization CO utilizing NEON64 and
(32-bit) floating-point up to (8-bit) unsigned integer implementing appropriate code/algo. For example, the
(lines 26-45) and the result saving (line 46). Subband low pass filtering kernel, like (5).
389
1 6 1 understand the influence of architecture, CPU series,
1
6 36 6 , =
and other parameters on the execution time. The
(5)
1 6 1 64 Odroid-C4 CPU is Cortex-A55; the OS is Ubuntu 20.04;
Linux 5.7.0-odroid-arm64 is the kernel, and its API is
Furthermore, we achieved a significant CO speedup by aarch64. The CPU series of this device is Amlogic
exploiting substantial differences in time for S905X3 which is more powerful than the latest
simultaneous 16-byte loading with a byte-shift approach Raspberry Pi CPUs.
compared to one-by-one line loading. More detailed Measurement procedure. The pivoting parameter
results and considerations of the measurement we need to measure is the execution time of each
procedure will be presented in the following section. function. Such measurement might be tricky since it is
highly susceptible to transition processes in any GNU
4. Experimental setup and results OS (Ubuntu, Android, etc.).
To avoid this problem, we used the following
Ground truth. To evaluate our results certain reference procedure: each function (cv::filter2d(...) and proposed
is needed. As the etalon, we chose functions method - newCO(...)) was successively called three times
cv::filter2d(...) from the OpenCV library. The latter is (for robustness and to simulate Grayscale processing),
well-known among AI and DIP researchers due to its and the result was stored to the array—this is one data
high-quality and optimized code. Especially when quick point. Then, after collecting 35 data points, we calculated
prototyping is needed. the median value and treated it as twice the function's
For comparison, we used the latest stable tag execution time under consideration.
available when we started to research. The release tag is Kernel sizes varied 2×2, 3×3, …, 15×15 for
4.5.2 (2021-04-02 11:23) for OpenCV. The compilation experiments with our implementation and
was performed with clang-9 - the latest stable clang cv::filter2d(...). DIs were generated with equal width and
version. We ensured that libraries utilize vectorization, height, the corresponding formula follows
compiling them with flags: -
DCMAKE_BUILD_TYPE=RELEASE - 125 n
DENABLE_NEON=ON ... and the compilation process
Wimage Himage 32 Wkernel 1,
was with the verbose mode on. The result is that some 8 where
critical fields like "CPU_BASELINE" (NEON F16) and square brackets […] denote the integer part of the
"C++ flags (Release)" (...-O3 -DNDEBUG...) provided number. Results are further presented in the form of
needed content. Also, we mention the fact that OpenCV fractions cv::filter2d(...) execution time divided by
lib was linked as a dynamic library. execution time of proposed/our implementation.
Devices. To make our measurements more relevant,
we used such a device as Odroid-C4. This helps us
(a) (b)
Figure 3: Performance comparison of the CO usage of cv::filter2D vs. the proposed method on the devices with
Cortex-A55 ARM CPU.
Color intensity designates relative time consumption for 5. Results
reference function about the proposed method.
Acceleration one may achieve by using the presented First, we compared the time consumption of the
approach instead of the reference function (the brighter proposed code (see Figure 2a) and reference function
is color—the greater is acceleration). Legends on each cv::filter2d(...). The result is presented in Figs. 5a and 5b.
plot designate how to translate color to acceleration; if As coordinates, we use sizes of kernel and image. At the
this number is greater than 1, it is profitable to use the same time, color intensity designates acceleration,
proposed method. which one may achieve using the proposed method
instead of the reference method (e.g., a fraction of the
390
execution times of the reference function divided by the acceleration compared to cv::filter2D(...), while for larger
execution time of the proposed method). kernels presented approach allows ~3.7 speedup.
Despite the presented results demonstrating the We expect the current approach to be useful for real-
advantage of the proposed method, there is still room for time image processing and convolutional neural
improvement. For example, it seems the compiler cannot network training as it significantly reduces processing
unroll cycles effectively on its own, and we mentioned time.
this above. But if we do the same as was done in ACL—
unroll all "bottle-neck cycles" on our own, it seems we References
can achieve a more speedy approach/results. Thus, we
may reach an additional 10-20% acceleration by utilizing [1] P. Prystavka, A. Shevchenko, Investigation of the
techniques [11–13] by writing cycle unrolling with the Implementation of the Linear Operator of Digital
online assembly by hand. Image Convolution in 16-bit Computing, Actual
Results for the modified code are shown in Figure 3. Problems of Automation and Information
We have compared the time consumption of the Technologies, no. 20 (2016) 78-90.
proposed method (see Figure 2a) and function [2] A. Shevchenko, V. Tymchyshyn, A SIMD-based
cv::filter2d(...). Besides, we varied image sizes up to Approach to the Enhancement of Convolution
4500×4500 (~20 [MP]) to emulate modern cameras and Operation Performance, CMiGIN (2019).
picture libs. [3] Image Filtering (2021). URL: https://docs.opencv.
As Figure 3 suggests, acceleration is independent org/4.x/d4/d86/group__imgproc__filter.html#ga2
(almost) of the input size, e.g. complexity (big-O) of our 7c049795ce870216ddfb366086b5a04.
solution and reference solutions coincide. Some small [4] M. J. Flynn, Very High-Speed Computing Systems,
decline in acceleration (but it is still greater than 2) may in: Proceedings of the IEEE 54.12 (1966) 1901–1909.
be noted for big kernels (13×13 … 15×15) and smaller [5] ARM Cortex- A53 MPCore Processor Technical
kernels (2×2 … 4×4). Regarding mean acceleration, it is Reference Manual (2021). URL: https://developer.
estimated as approximately 3.7 times. arm.com/documentation/ddi0500/j.
It is worth noting that we didn’t use parallelism for [6] Qualcomm Extends Hexagon DSP (2013) URL:
acceleration. Moreover, no preprocessing, e.g., image http://pages.cs.wisc.edu/~danav/pubs/qcom/hexa
tiling, was performed. Probably, this technique may gon_microreport2013_v5.pdf.
increase the performance of the approach as well. [7] Qualcomm Hexagon DSP: An Architecture
Optimized for Mobile Multimedia and
Communications (2013). URL: https://developer.
6. Conclusions qualcomm.com/download/hexagon/hexagon-dsp-
In conclusion, we propose a method of convolution architecture.pdf.
operation acceleration. We have shown that speed [8] Griffith, Arthur. GCC: The Complete Reference.
improvement can be achieved if kernels have been McGraw-Hill, Inc., 2002.
reduced to integer values that allow SIMD command [9] B. C. Lopes, R. Auler. Getting Started with LLVM
usage. Furthermore, despite SIMD itself leading to a Core Libraries. Packt. Publishing ltd. (2014).
significant boost of performance, we were able to push [10] ARM Compute Library (2021). URL:
the frontiers even further by exploiting the considerable https://developer.arm.com/ip-products/
difference in time for simultaneous 32-byte loading processors/machine-learning/compute-librar
compared to their one-by-one loading and using buffer [11] A. Nicolau, Loop Quantization: Unwinding for
(one-time load for the kernel row—48-byte), and loading Fine-Grain Parallelism Exploitation. Cornell
operations are partially substituted with cyclic shift. University (1985).
About ALC, we should mention in addition. There [12] Xue, Jingling. Loop Tiling for Parallelism, vol. 575
was a severe code rewriting event in this lib. (2000).
Furthermore, the patches became cumulative ("less [13] T. Veldhuizen, Expression Templates. C++ Report
description more code"). This fact brought more 7.5 (1995) 26–31.
obscurity/obfuscation than clarity/understanding. So,
we will compare the ACL lib and modifications of our
suggested approach in our following paper but it is
needed to mention that ALC provides all additional code
optimization approaches that we mentioned above
(cycle unrolling, image tiling, etc.).
To test the approach we performed a comparison
with the cv::filter2D(...) function from the OpenCV
library. Our results suggest the current approach leads
to significant speedup (mean values: ~3.7× compared to
OpenCV). Measuring acceleration for different kernels
and images we observed no dependence on image size,
but kernel size may influence the result—for kernels
smaller than 8×8 we were able to achieve ×7.379
391