<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Enhancement of convolution operation performance using SIMD of AArch64</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Andrii</forename><surname>Shevchenko</surname></persName>
							<email>lllandreyshevchenkolll@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">National Aviation University</orgName>
								<address>
									<addrLine>1 Lubomyra Guzara ave</addrLine>
									<postCode>03058</postCode>
									<settlement>Kyiv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Pylyp</forename><surname>Prystavka</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">National Aviation University</orgName>
								<address>
									<addrLine>1 Lubomyra Guzara ave</addrLine>
									<postCode>03058</postCode>
									<settlement>Kyiv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Enhancement of convolution operation performance using SIMD of AArch64</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">3E884361720323FD8BDCFCEC1547F282</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:49+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>convolution</term>
					<term>SIMD</term>
					<term>optimization</term>
					<term>performance</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Optimization of two-dimensional convolution through 16-bit SIMD technologies of ARM x64 (aarch64) is considered. It is shown that by utilizing inline assembler and 16-bit SIMD commands of aarch64, one can achieve a significant performance increase compared to the similar functions of OpenCV. Throughout the research, filter coefficients were quantized to match the 8-bit range.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Automatic program code vectorization (APCV) is based on the SIMD instructions of the CPU. APCV is implemented in modern compilers, e.g., GCC and Clang/LLVM. Its benefits can be obtained by compiling the program with the -O3 (or "aggressive" -O4/-Ofast) flag; the exact flag may differ depending on the platform and compiler. But as was shown in <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>, the performance gained from APCV is lower than what is needed for digital image (DI) processing. Moreover, DI problems are of great importance owing to the wide variety of applications in video-stream processing (stabilization, filtration, noise correction, applying effects to a single image, etc.). When processing DI, one should always take into account the following factors:</p><p>1. The computational complexity of the chosen method.</p><p>2. Whether the method is optimized.</p><p>3. Hardware resources of the target architecture.</p><p>One can single out resource-demanding (but significant) operations of DI processing: convolution, scaling (mostly achieved through convolution), and analysis (of color, brightness, contrast, etc.). The convolution operation (CO) <ref type="formula">(1)</ref> is the simplest but most valuable and resource-demanding operation: </p><p>where 𝑖 = 𝑎, … , 𝑊 − (𝑟 − 𝑎) − 1, 𝑗 = 𝑎′, … , 𝐻 − (𝑐 − 𝑎′) − 1 index the pixels of the destination image p; W and H are the width and height of the source P and destination p images (we neglect border effects in the destination for the moment), Γ is the kernel of the convolution (an r × c matrix), and a, a′ are the so-called "anchors" that define the relative position of the filtered point within the kernel.</p><p>Equation ( <ref type="formula" target="#formula_0">1</ref>) is rather general and fully compatible with the cv::filter2D(...) function of the OpenCV library <ref type="bibr" target="#b2">[3]</ref>, but from here on we restrict our attention to square kernels, i.e., r = c, and presume the kernel to be square-shaped without special mention. Every pixel of the destination DI can be calculated independently of the others, which means that the task can be parallelized. To serve such tasks, hardware developers created a set of parallel computing platforms (PCP), such as Nvidia CUDA and ATI Stream Technology (ATI-ST). To provide a common approach for all PCPs, Khronos distributes the OpenCL API/library (just as Nvidia distributes CUDA, and so on). Every PCP developer thus provides a software toolkit to interact with the PCP: a programming language with C-like syntax, additional modules, frameworks, etc.</p><p>Modern GPUs, which can perform nearly any parallelizable task in parallel, are the basis of PCPs. 
Currently, GeForce and ATI video accelerators are very popular for CNN learning and significantly accelerate the process. The core of this computation is performed by the shader blocks of the GPU. Curiously, the shader blocks of a GPU are similar to the mobile CPUs with RISC architecture that are used in modern smartphones.</p><p>In conclusion, to have a positive effect on software that uses DI processing (image filtration, scaling, edge detection, blurring, etc.), it is essential to speed up CO. This automatically leads to speed improvements in areas such as multimedia (video codecs), CNN learning, etc. It is worth noting that CO (1) is the basis of CNN functioning when it is used within an optimized computation scheme such as "Integer-Arithmetic-Only". Therefore, by accelerating CO we achieve higher CNN performance and decrease its learning time. Moreover, the GPU architecture was created to speed up CO and similar tasks.</p><note place="foot">ORCID: 0000-0003-3863-0473 (A. Shevchenko); 0000-0002-0360-2459 (P. Prystavka)</note><p>In the current contribution, we propose a new method of CO optimization that utilizes the AArch64 SIMD (NEON64) operations. Since NEON64 can be applied to integer-valued kernels, we demonstrate a reduction for kernels with real values that allows this technique to be applied. In addition, to show that the suggested approach is effective, we provide an experimental comparison of hand-written code (based on this approach) with recognized solutions such as the OpenCV library.</p><p>The rest of the paper is organized as follows. First, we consider NEON64's pros and cons, and in subsection II-B we introduce the proposed reduction method itself.</p></div>
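<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the role of the anchors and the index ranges of CO concrete, the direct (unoptimized) form of the operation can be sketched in Python as follows. This is only an illustrative sketch and not the implementation measured in this paper: it assumes the correlation-style indexing documented for cv::filter2D (no kernel flip), it fills only the destination pixels whose window lies fully inside the source, and the function name filter2d_valid and its parameters are hypothetical.</p>
<p>
```python
def filter2d_valid(src, kernel, anchor=(0, 0)):
    """Direct correlation-style filtering of interior pixels (illustrative only)."""
    height, width = len(src), len(src[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    a_row, a_col = anchor
    # Border pixels are left at zero: border effects are neglected here.
    dst = [[0] * width for _ in range(height)]
    # Visit only destination pixels whose whole window lies inside the source.
    for y in range(a_row, height - (k_rows - a_row) + 1):
        for x in range(a_col, width - (k_cols - a_col) + 1):
            acc = 0
            for ky in range(k_rows):
                for kx in range(k_cols):
                    acc += kernel[ky][kx] * src[y + ky - a_row][x + kx - a_col]
            dst[y][x] = acc
    return dst
```
</p>
<p>Every destination pixel is computed from the source only, so the two outer loops carry no dependency between iterations; this is the property that makes CO suitable for SIMD and other parallel platforms.</p></div>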
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">A brief overview of modern software optimization</head><p>We perform the overview in a "bottom-up" style: we consider hardware first, then software, and then algorithmic methods of performance enhancement.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Acceleration using hardware</head><p>To significantly enhance the performance of a software product, a well-chosen hardware architecture is of primary importance. The first question is whether the task (e.g., CO) allows parallelization of the data flow (or the instruction flow). Flynn's taxonomy <ref type="bibr" target="#b3">[4]</ref> gives a general perspective on possible solutions. Today SIMD principles are implemented in both RISC CPUs (e.g., NEON64, the A64 Neon SIMD instruction set of Cortex-A53-72/X1 ARM64 CPUs) <ref type="bibr" target="#b4">[5]</ref> and CISC CPUs (e.g., the Intel x64/x86 series). Access to such features requires specific extensions of the assembly language: NEON64 on ARM64, and SSEn and AVX1/2 on CISC architectures. In contrast, MIMD principles are not implemented in modern CPUs but are partially supported by GPUs. As noted before, modern GPUs are based on RISC architecture and provide parallel computing features through shader-block CPUs (SCPU). An SCPU contains specific 128/256/...-bit registers through which partial MIMD principles are implemented. The number of these special registers is 32 or more, i.e., no fewer than the number of SIMD registers that modern ARM64 CPUs have. The drawback is that the SIMD/MIMD instructions of the SCPU cannot be accessed directly from the programming language: only a set of predefined intrinsics and pre-implemented operations (bit shifts, binary logic, etc.) is available. Mostly, programmers use specific frameworks to access the mentioned features, e.g., CUDA and OpenCL for GPU, or OpenCL for CPU/DSP/FPGA.</p><p>Besides the GPU, one can employ co-processor units, e.g., a Digital Signal Processor (DSP) such as Qualcomm Hexagon. 
It has been developed for embedding into Snapdragon-6XX/8XX CPUs to reduce the CPU load by up to ∼75% and improve audio/video encoding/decoding performance by up to ∼18 times <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7]</ref>. Moreover, its performance is ∼4 times higher than that of plain NEON64. This DSP uses a very long instruction word (VLIW), which means parallelism at the assembly level (as with SIMD): within one instruction word, three assembly instructions with different inputs are processed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Optimization using software</head><p>The software we use (e.g., the compiler itself, additional libraries, frameworks) strongly influences the performance of the programs we produce by employing different optimizations to use the capabilities of the hardware platform more effectively. In the scope of the current paper, we are primarily concerned with its ability to perform vectorization without significant loss in precision and speed.</p><p>Let us consider three well-known compilers: GNU Compiler Collection (GCC/G++) <ref type="bibr" target="#b7">[8]</ref>, Clang <ref type="bibr" target="#b8">[9]</ref>, and nvcc (which compiles cu-files for CUDA).</p><p>The most popular nowadays is still the GCC compiler, developed and supported by the FSF community. The first versions of GCC, developed by Richard Stallman, were a C compiler only; nowadays GCC no longer stands for GNU C Compiler but for GNU Compiler Collection. GCC is an optimizing compiler produced by the GNU Project that supports various programming languages, hardware architectures, and operating systems.</p><p>GCC's main competitor is Clang. For example, Apple already uses it as the basic compiler for its products. Moreover, UNIX/BSD operating systems and distributions also use it as a default compiler. The Android NDK no longer uses GCC and relies on Clang by default. Clang itself is a frontend for different programming languages, e.g., C, C++, Objective-C, Objective-C++, and OpenCL. The actual generation of binary code and vectorization is performed by the LLVM framework. Both GCC and Clang are performance-oriented, but they still lose to hand-written assembly code <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr">10]</ref>.</p><p>nvcc is the last compiler that we want to mention. 
It combines NVidia CUDA with the power of the C language, which significantly improves performance on PCs with NVidia GPUs only. Its main peculiarity is that these GPUs use the SIMT architecture, whose core feature is that the multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads.</p><p>As we can see, the mentioned compilers and technologies introduce significant heterogeneity into the field of program optimization. They represent a family of separate devices/technologies. In response, the OpenCL standard was developed (The Khronos Group Inc.); it is supported by all the mentioned hardware developers and provides access to parallel computations on GPU/DSP/CPU.</p><p>But PCPs have a drawback: a large overhead for transferring data through the bus. To avoid the problem, programmers organize data into pools, which allows a more than 20-fold increase in performance compared to the CPU (CNN learning fits this model perfectly). But using big pools is not always a solution: it does not help at all when processing streams from a video camera.</p><p>One more reasonable approach to enhancing the performance of DI processing is supplied by different libraries (proprietary or not) such as OpenCV or the Arm Compute Library (ACL). Many of them contain NEON64-optimized code for armeabi-v7a and arm64-v8a. Another smart strategy is to use a collection of libraries that can be combined into a single framework, so that the advantages of one library compensate for the drawbacks of the others. OpenCV and ACL <ref type="bibr" target="#b8">[9]</ref> are good examples of libraries comprising a wide variety of algorithms, including DI processing and DI analysis. Moreover, OpenCV even contains modules for CNN learning, optimized for different CPU architectures that use SIMD (AVX1/2/SSE4, NEON64), as well as GPU-optimized solutions. Also, OpenCV is well known for its high-quality DI processing. 
Thus, in what follows, we use OpenCV as a reference for comparison.</p><p>At the moment SIMD optimization has spread over a wide range of software products, both proprietary and open-source. For example, the kernel of Windows 10 widely uses AVX1/2/3DNow SIMD optimizations to achieve better performance (obviously, this influences the whole system). Oracle Java VM utilizes AVX1/2/3DNow, and thus any Java application runs faster. But, when using SIMD optimization, they all face the issue of translating floating-point code to fixed-point code with an acceptable loss in precision, which is quite complicated. Thus, SIMD optimizations used in proprietary software are mostly not disclosed.</p><p>One more technique to mention is the so-called loop unrolling and tiling <ref type="bibr" target="#b9">[11]</ref><ref type="bibr" target="#b10">[12]</ref><ref type="bibr" target="#b11">[13]</ref>. This technique avoids redundant comparison operations at the cost of slightly enlarging the output binary file. It is mainly performed by the compiler or by introducing appropriate inline assembly code into the application.</p><p>Some libraries, such as ACL, may use high-level programming language features (e.g., templates in C++) to perform loop unrolling. A simplified ACL-style example implementation is shown in Figure <ref type="figure" target="#fig_1">1</ref>. Our previous paper <ref type="bibr" target="#b1">[2]</ref> provided a detailed description of this technique, which led to a large (over 25%) speed improvement of an algorithm. However, the ARM64 architecture was significantly improved compared to ARMv7-A. Without going too deep into details, the main conclusion is that this technique is no longer needed on ARM64. Moreover, we carried out a simple experiment comparing the speed of two equivalent functions, one with loop unrolling and another without it. The result was unexpected. 
The function with loop unrolling turned out to be 3-5% slower. To verify that the loop was actually unrolled, the IDA disassembler was used. As on the ARMv7-A architecture, the loop was unrolled on ARM64 by the clang compiler (version 9), and, as expected, the body of the bottleneck part of the CO function was repeated 8 times. But there is one thing to mention: the bottleneck part of the CO function was surrounded by redundant comparisons, which may cause such a negative effect. The part of the ACL library that was optimized using NEON64 was rewritten without the loop unrolling approach.</p><p>This unusual fact gives food for thought, and we will examine it more deeply in further research on different CO approaches.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Optimization using special algorithms</head><p>Let us focus on CO. The primary obstacle to SIMD optimization is the translation of the floating-point CO algorithm into a fixed-point one with an acceptable loss of precision, or even without any loss. First of all, the SIMD operations will be performed on integers. Any kernel can be represented in form ( <ref type="formula">2</ref>), i.e., Γi,j = νγi,j with a real normalization coefficient ν and integer γi,j, but the more precise the result we want, the more digits γi,j should have. So, we should set some constraints on γ to avoid overflow during CO, owing to the limitations of the platform on which we intend to run the program.</p><p>Suppose every pixel in the original image is represented as a byte and thus takes 8-bit values 0, …, 255. The kernel elements γi,j take values in the same range. Intermediate results are stored as 16-bit signed or unsigned values. To guarantee that no overflow occurs, we should ensure that it does not happen at any step of the algorithm. If the kernel has positive elements only, the condition we need is as follows</p><formula xml:id="formula_1">(2^{8} - 1) \sum_{i=0}^{r} \sum_{j=0}^{r} \gamma_{i,j} \le 2^{16} - 1. <label>(3)</label></formula><p>Essentially, this means that even the largest possible inputs from the image do not lead to overflow.</p><p>If the kernel contains negative elements, the exact condition is much more complicated and depends on the order of additions in CO. Instead, we use a much stronger but easier-to-check condition</p><formula xml:id="formula_2">(2^{8} - 1) \sum_{i=0}^{r} \sum_{j=0}^{r} |\gamma_{i,j}| \le 2^{16-1} - 1, <label>(4)</label></formula><p>which is independent of the order of operations. Moreover, this condition can be slightly relaxed: we can apply it to the positive and negative entries of the kernel γ separately. 
And the last thing to mention: one can easily obtain similar conditions for signed/unsigned 32-bit intermediate values by substituting 16 → 32 in (3) and ( <ref type="formula" target="#formula_2">4</ref>). What we propose is to select, for a given Γ, the largest ν possible such that γ still satisfies (3) or (4) (which one depends on whether the kernel is purely positive or not). Of course, there is no need to worry about whether any valuable kernels can be reduced to a suitable form: there are plenty of them.</p><p>In conclusion, modern hardware provides mechanisms for vectorization, i.e., SIMD technologies, that programmers can use to enhance the performance of an application. In most cases, this technology is utilized by the compiler to generate binary code without the participation of the programmer. A suitable choice of library may be handy as well, since many libraries contain SIMD-optimized code. But in some cases, human intervention is needed to get the best result. More specifically, the code must be written in a form that is suitable for SIMD optimization. However, this is not always possible, and in our case restrictions <ref type="formula" target="#formula_1">(3)</ref> and <ref type="formula" target="#formula_2">(4)</ref> should be satisfied. In the next section, we present a new method of CO optimization and then compare it with existing results from the OpenCV library.</p></div>
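<div xmlns="http://www.tei-c.org/ns/1.0"><p>The reduction of a real-valued kernel to an integer one under the overflow bounds of this subsection can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the procedure used in our experiments: the function name quantize_kernel, the brute-force search, and the max_scale limit are hypothetical, rounding is to the nearest integer, and the search is parametrized by an integer scale with ν = 1/scale so that Γ ≈ νγ; how this scale corresponds to the choice of ν described above is an assumption of the sketch.</p>
<p>
```python
import math

PIXEL_MAX = 2 ** 8 - 1  # largest 8-bit pixel value


def quantize_kernel(kernel, acc_bits=16, max_scale=4096):
    """Return (gamma, nu) with kernel ~= nu * gamma and integer gamma."""
    signed = any(0 > v for row in kernel for v in row)
    # Unsigned accumulator bound for non-negative kernels (condition (3)),
    # signed accumulator bound otherwise (condition (4)).
    acc_max = 2 ** (acc_bits - 1) - 1 if signed else 2 ** acc_bits - 1
    best = None
    for scale in range(1, max_scale + 1):
        gamma = [[math.floor(v * scale + 0.5) for v in row] for row in kernel]
        mags = [abs(g) for row in gamma for g in row]
        fits_element = PIXEL_MAX >= max(mags)          # 8-bit kernel elements
        fits_accumulator = acc_max >= PIXEL_MAX * sum(mags)
        if fits_element and fits_accumulator:
            best = (gamma, 1.0 / scale)                # keep the largest feasible scale
    if best is None:
        raise ValueError("no integer scale satisfies the overflow bound")
    return best
```
</p>
<p>For example, for the integer kernel [[0, -1, 0], [-1, 4, -1], [0, -1, 0]] the signed 16-bit bound admits a scale of at most 16, since the absolute values then sum to 128 and 255 · 128 = 32640 does not exceed 32767. Note that the largest feasible scale does not necessarily minimize the rounding error; a refinement could choose among all feasible scales.</p></div>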
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Optimization of Convolution Operation using SIMD</head><p>In the current contribution, we propose a new Convolution Operation (CO) optimization method based on the SIMD technique. We presume that the target kernel satisfies condition (3). This section provides all necessary considerations and inline assembly code that illustrates the proposed approach. The following section is devoted to an experimental comparison of this method's performance with the known CO implementation of OpenCV.</p><p>For condition (4), the provided code needs only slight modification. Therefore, we avoid redundant code listings and present code that implements condition (3); all modifications necessary for condition (4) are described at the end of the section. We start with the basic implementation of CO (see Figure <ref type="figure" target="#fig_2">2b</ref>). It contains no specific optimizations but is still a good starting point for our considerations.</p><p>Here vn are the NEON64 registers. Regarding syntax and instruction order, we strictly follow the ARM reference manuals. For the sake of simplicity, we omitted normalization by the coefficient ν in Figure <ref type="figure" target="#fig_2">2b</ref>, but for completeness we provide it separately (see Figure <ref type="figure" target="#fig_2">2c</ref>).</p><p>In Figure <ref type="figure" target="#fig_2">2c</ref> we suppose the data for normalization to be stored in registers v12-v15, while v1[0] contains the normalization coefficient ν. The presented code is in some sense multipurpose and may be used with different CO implementations.</p><p>Now we turn to the CO optimization itself by utilizing NEON64. Figure <ref type="figure" target="#fig_2">2b</ref> provides a naive version of this operation (in assembly code). This variant has one significant drawback: data loading. 
The data loading/storing process is the slowest operation because it involves inner processes such as communication between the CPU and RAM. Even though hardware mechanisms such as the CPU cache mitigate this cost, it is still slow.</p><p>The following approach (see Figure <ref type="figure" target="#fig_2">2a</ref>) avoids this problem by using one of the registers as a buffer. It is known that loading 16 bytes simultaneously is quicker than loading them one by one. Thus, we use one register to preload extra data and then perform a byte-by-byte shift on this data to exclude redundant load operations.</p><p>Let us comment on the sections of the naive code (Figure <ref type="figure" target="#fig_2">2b</ref>). It performs a load for each kernel element together with a load of the source image elements (lines 5, 6). Loading and storing are the most expensive operations. Lines 8-11 multiply the image values (v2, v3) by the kernel element (v0) and accumulate the products. The results of these operations are stored in the buffer registers (v12, v13, v14, v15). The buffer registers hold the products of the 8-bit image values with each kernel element, extended to 16-bit unsigned integers by the "umlal" operation. These operations are performed for every kernel element. As one can see, this is a time-consuming approach.</p><p>Let us comment on the sections of the optimized code (see Figure <ref type="figure" target="#fig_2">2a</ref>): line 4 loads 48 bytes of the grayscale image into v0-v2; line 5 loads 16 bytes of the CO kernel into v8; lines 13, 14, 17, 18 perform the conversion from 8-bit to 16-bit and the multiplication by the kernel element in v5 simultaneously. 
Please note that registers v0-v2 contain the part of the image that should be convolved with the kernel stored in v8. Register v2 is used as a buffer for 16 more bytes of the input image to speed up the CO by means of the "ext" operations. Moreover, data from buffer v2 is used to cyclically shift the content of v0 (line 26), v1 (line 29), and v2 (line 32, with itself) byte by byte with the "ext" command. Although it is not obvious, we use different register names to save the shifted states of v0-v2 (lines 12, 16, 11, 23), which is called the register renaming technique. Also, as one can see, we reordered some instructions, which makes the code slightly harder to read. Nevertheless, all this gives about a 7-10% speedup in comparison to the ordered instruction sequence that always saves the result into the same registers v0-v2.</p><p>Moreover, if we save the result of the shift in the same register (v0, v1, or v2), performance degrades. This is because the result of the "ext" operation is still in progress when the next operation tries to read the content of v0 (or v1, or v2), which produces a wait (bottleneck) state. So, the more such situations appear in the program, the smaller the time gain provided by the algorithm. With this, the most resource-demanding part of the optimized CO algorithm (Fig. <ref type="figure" target="#fig_2">2a</ref>) has been described almost entirely. Finally, the "case" block (Fig. <ref type="figure" target="#fig_2">2a</ref>, lines 37 to 63) finishes the calculation of the kernel row.</p><p>As one can see, the main feature of the presented approach (see Figure <ref type="figure" target="#fig_2">2a</ref>) is the use of the cyclic shift (i.e., ext v10.16b, v0.16b, v1.16b, #1), which provides the kernel buffering, and thus fewer load operations are needed. 
One more thing that should be mentioned is the pre-saving of the shifted data (see Figure <ref type="figure" target="#fig_2">2a</ref>): the shifts by 1 element (lines 11, 15) and by 2 elements (lines 19, 22) are used for the current iteration of CO. The other "ext" operations (lines 25, 28, 31) initialize the data for the next iteration of CO. It is worth noting that the provided code (see Figure <ref type="figure" target="#fig_2">2a</ref>) requires a kernel containing no more than 16 elements in one row. A variation of this implementation for CO kernels with more than 16 elements per row should reinitialize the base registers, as is done in lines 3-4.</p><p>Let us comment on the sections of the normalization code (Figure <ref type="figure" target="#fig_2">2c</ref>). This section performs normalization by the coefficient ν. Lines 2-9 convert the data from 16-bit to 32-bit types. Storing all the data requires twice as many registers (v12-v19) as before (v12-v15). Lines 10-17 convert the data from unsigned 32-bit integers to 32-bit floating-point values. As mentioned earlier, this code works for kernels satisfying condition (3). To make it applicable to kernels satisfying (4), we need to change all "umlal" operations to "smlal", preceded by an extension operation (such as "sxtl"). These small but crucial changes transform the code in Figure <ref type="figure" target="#fig_2">2a</ref> into code that works with signed integer kernels. Depending on the elements of the given kernel, one can choose between these two options.</p><p>In conclusion, we found a class of kernels that allow significant optimization of CO by utilizing NEON64 and implementing appropriate code. An example is the subband low-pass filtering kernel (5). 
</p><formula xml:id="formula_3">            <label>(5)</label></formula><p>Furthermore, we achieved a significant CO speedup by exploiting substantial differences in time for simultaneous 16-byte loading with a byte-shift approach compared to one-by-one line loading. More detailed results and considerations of the measurement procedure will be presented in the following section.</p></div>
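<div xmlns="http://www.tei-c.org/ns/1.0"><p>The idea of replacing repeated loads with byte shifts of preloaded registers can be emulated in Python on plain lists, as sketched below. This is a simplified model under stated assumptions and not the inline assembly of Figure 2a: it uses two 16-byte registers instead of three, processes one kernel row and 16 output pixels, models the 16-bit unsigned accumulators with arithmetic modulo 2^16, and the helper names ext and row_conv_ext are hypothetical.</p>
<p>
```python
def ext(vn, vm, index):
    """Emulate EXT Vd.16B, Vn.16B, Vm.16B, #index on two lists of 16 bytes."""
    # Bytes index..15 of vn followed by bytes 0..index-1 of vm.
    return (vn + vm)[index:index + 16]


def row_conv_ext(row, kernel_row):
    """Accumulate one kernel row for 16 adjacent output pixels of one image row."""
    if len(kernel_row) > 16:
        raise ValueError("this sketch supports at most 16 kernel elements per row")
    v0, v1 = row[0:16], row[16:32]   # two 16-byte loads; v1 serves as the buffer
    acc = [0] * 16                   # widened 16-bit accumulators
    for k, coeff in enumerate(kernel_row):
        window = ext(v0, v1, k)      # pixels x + k for x = 0..15, without reloading
        # Widening multiply-accumulate with 16-bit wraparound.
        acc = [(a + coeff * p) % 65536 for a, p in zip(acc, window)]
    return acc
```
</p>
<p>With a row of at least 32 pixels, the image data are loaded only once per 16 outputs, while each further kernel element costs one shift instead of one load; the limit of 16 kernel elements per row follows from the shift range of a single 16-byte register.</p></div>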
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental setup and results</head><p>Ground truth. To evaluate our results, a certain reference is needed. As the reference, we chose the function cv::filter2D(...) from the OpenCV library. The latter is well known among AI and DIP researchers for its high-quality and optimized code, especially when quick prototyping is needed.</p><p>For comparison, we used the latest stable tag available when we started the research: release tag 4.5.2 (2021-04-02 11:23) of OpenCV. The compilation was performed with clang-9, the latest stable clang version. We ensured that the library utilizes vectorization by compiling it with the flags -DCMAKE_BUILD_TYPE=RELEASE -DENABLE_NEON=ON ... with verbose mode on. As a result, critical fields such as "CPU_BASELINE" (NEON F16) and "C++ flags (Release)" (...-O3 -DNDEBUG...) contained the needed content. We also note that the OpenCV library was linked dynamically.</p><p>Devices. To make our measurements more relevant, we used an Odroid-C4 device. This helps us understand the influence of architecture, CPU series, and other parameters on the execution time. The Odroid-C4 CPU is Cortex-A55; the OS is Ubuntu 20.04 with the Linux 5.7.0-odroid-arm64 kernel, and the architecture is aarch64. The CPU series of this device is Amlogic S905X3, which is more powerful than the latest Raspberry Pi CPUs.</p><p>Measurement procedure. The key parameter we need to measure is the execution time of each function. Such a measurement might be tricky since it is highly susceptible to transient processes in any Linux-based OS (Ubuntu, Android, etc.).</p><p>To avoid this problem, we used the following procedure: each function (cv::filter2D(...) and the proposed method, newCO(...)) was successively called three times (for robustness and to simulate Grayscale processing), and the result was stored in an array; this is one data point. 
Then, after collecting 35 data points, we calculated the median value and treated it as twice the function's execution time under consideration.</p><p>Kernel sizes varied over 2×2, 3×3, …, 15×15 for experiments with our implementation and cv::filter2D(...). DIs were generated with equal width and height; the corresponding formula follows</p><p>Color intensity designates the time consumption of the reference function relative to the proposed method, i.e., the acceleration one may achieve by using the presented approach instead of the reference function (the brighter the color, the greater the acceleration). Legends on each plot indicate how to translate color to acceleration; if this number is greater than 1, it is profitable to use the proposed method.</p></div>
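<div xmlns="http://www.tei-c.org/ns/1.0"><p>The measurement procedure can be outlined in Python as follows. This is only a sketch of the described scheme and not our actual benchmarking code: the names measure and acceleration are hypothetical, time.perf_counter is assumed to be an adequate clock, and the sketch returns the raw median of the data points, leaving any further scaling to a single-call time to the caller.</p>
<p>
```python
import statistics
import time


def measure(func, calls_per_point=3, points=35):
    """Median wall-clock time of data points, each made of successive calls."""
    samples = []
    for _ in range(points):
        start = time.perf_counter()
        for _ in range(calls_per_point):
            func()
        samples.append(time.perf_counter() - start)
    # The median is less sensitive to sporadic OS activity than the mean.
    return statistics.median(samples)


def acceleration(reference, proposed):
    """Ratio of reference time to proposed time; above 1 favors the proposed code."""
    return measure(reference) / measure(proposed)
```
</p>
<p>Here reference and proposed are expected to be zero-argument callables that wrap the compared functions together with their fixed input image and kernel, so that only the filtering itself is timed.</p></div>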
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results</head><p>First, we compared the time consumption of the proposed code (see Figure <ref type="figure" target="#fig_2">2a</ref>) and the reference function cv::filter2D(...). The result is presented in Figs. <ref type="figure">5a and 5b</ref>. As coordinates, we use the sizes of the kernel and the image, while color intensity designates the acceleration one may achieve using the proposed method instead of the reference method (i.e., the execution time of the reference function divided by the execution time of the proposed method).</p><p>Although the presented results demonstrate the advantage of the proposed method, there is still room for improvement. For example, it seems that the compiler cannot unroll loops effectively on its own, as mentioned above. But if we do what was done in ACL, i.e., unroll all bottleneck loops by hand, it seems we can achieve even faster results. Thus, we may reach an additional 10-20% acceleration by applying techniques <ref type="bibr" target="#b9">[11]</ref><ref type="bibr" target="#b10">[12]</ref><ref type="bibr" target="#b11">[13]</ref>, writing the loop unrolling by hand in inline assembly.</p><p>Results for the modified code are shown in Figure <ref type="figure" target="#fig_5">3</ref>. We compared the time consumption of the proposed method (see Figure <ref type="figure" target="#fig_2">2a</ref>) and the function cv::filter2D(...). Besides, we varied image sizes up to 4500×4500 (~20 [MP]) to emulate modern cameras and picture libraries.</p><p>As Figure <ref type="figure" target="#fig_5">3</ref> suggests, the acceleration is almost independent of the input size, i.e., the complexity (big-O) of our solution and that of the reference solution coincide. A small decline in acceleration (which still remains greater than 2) may be noted for big kernels (13×13 … 15×15) and small kernels (2×2 … 4×4). 
The mean acceleration is estimated at approximately 3.7 times.</p><p>It is worth noting that we did not use parallelism for acceleration. Moreover, no preprocessing, e.g., image tiling, was performed. Such techniques may further increase the performance of the approach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions</head><p>In conclusion, we propose a method for accelerating the convolution operation. We have shown that a speed improvement can be achieved if kernels are reduced to integer values, which allows the use of SIMD instructions. Furthermore, although SIMD by itself gives a significant performance boost, we were able to go further by exploiting the considerable time difference between loading 32 bytes at once and loading them one by one, by using a buffer (a single 48-byte load per kernel row), and by partially replacing load operations with cyclic shifts.</p><p>An additional remark about ACL is in order. The library underwent a major code rewrite, and its patches became cumulative ("less description, more code"), which brought more obscurity than clarity. We will therefore compare ACL with modifications of our approach in a following paper. It should be mentioned, however, that ACL provides all the additional code optimizations mentioned above (loop unrolling, image tiling, etc.).</p><p>To test the approach, we performed a comparison with the cv::filter2D(...) function from the OpenCV library. Our results suggest that the current approach leads to a significant speedup (mean value: ~3.7× compared to OpenCV). 
Measuring the acceleration for different kernels and images, we observed no dependence on image size, but kernel size may influence the result: for kernels smaller than 8×8 we were able to achieve ×7.379 acceleration compared to cv::filter2D(...), while for larger kernels the presented approach allows a ~3.7× speedup.</p><p>We expect the current approach to be useful for real-time image processing and convolutional neural network training, as it significantly reduces processing time.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Loop unrolling with C++ templates. Thus, we should represent the elements of the kernel Γ from (1) in a suitable form: $\Gamma_{i,j} = \nu\gamma_{i,j},\ \nu \in \mathbb{R},\ \gamma_{i,j} \in \mathbb{Z}$ (2), where ν is a normalization coefficient. Now we can perform the most resource-demanding part (additions and multiplications) in a SIMD style and normalize the result afterward. Any kernel can be represented in form (2), but the more precise the result we want, the more digits $\gamma_{i,j}$ must have. So, we should set some constraints on γ to avoid overflow during CO, because of the limitations of the platform on which we intend to run the program. Suppose every pixel in the original image is represented as a byte and thus takes 8-bit values 0, …, 255. The kernel elements $\gamma_{i,j}$ take the same range. Intermediate results are stored as 16-bit signed or unsigned values. To guarantee that no overflow occurs, we must ensure that it does not happen at any step of the algorithm. If the kernel has positive elements only, the condition we need looks as follows</figDesc><graphic coords="3,311.16,83.40,205.32,164.16" type="bitmap" /></figure>
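<div xmlns="http://www.tei-c.org/ns/1.0"><p>Representation (2) from the text attached to Figure 1 can be illustrated with a minimal Python sketch. All function names are hypothetical, and Python is used only for readability, whereas the paper targets AArch64 SIMD. The 16-bit bound check is an illustrative assumption for kernels with non-negative elements and an unsigned accumulator; it is not the condition derived in the paper.</p>

```python
UINT16_MAX = 65535  # assumed limit of an unsigned 16-bit accumulator


def quantize_kernel(kernel, nu):
    """Return integers gamma such that kernel[i][j] is approximately nu * gamma[i][j]."""
    return [[round(value / nu) for value in row] for row in kernel]


def worst_case_sum(gamma, max_pixel=255):
    """Largest possible accumulator value for a kernel with non-negative elements."""
    return max_pixel * sum(sum(row) for row in gamma)


def fits_in_uint16(gamma):
    """Illustrative check (an assumption): the worst case must not exceed 65535."""
    return UINT16_MAX >= worst_case_sum(gamma)


def convolve_at(image, gamma, nu, top, left):
    """Integer multiply-accumulate over one window, normalized by nu at the end."""
    acc = 0
    for i, row in enumerate(gamma):
        for j, g in enumerate(row):
            acc += g * image[top + i][left + j]
    return nu * acc


if __name__ == "__main__":
    # 3x3 binomial smoothing kernel, i.e. [[1, 2, 1], [2, 4, 2], [1, 2, 1]] / 16.
    kernel = [[0.0625, 0.125, 0.0625],
              [0.125, 0.25, 0.125],
              [0.0625, 0.125, 0.0625]]
    nu = 0.0625
    gamma = quantize_kernel(kernel, nu)
    image = [[100] * 3 for _ in range(3)]
    print(gamma, fits_in_uint16(gamma), convolve_at(image, gamma, nu, 0, 0))
```

<p>In this sketch all multiplications and additions run on integers, and the single multiplication by ν happens once per output pixel, which mirrors the idea of performing the expensive part in integer SIMD lanes and normalizing afterward.</p></div>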
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>CO optimization with SIMD NEON64: (a) optimized approach; (b) naive approach; (c) normalization procedure and saving the result.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>Finally, lines 19-25 describe the normalization process with the ν coefficient (placed in v1.s[0]). All other lines (26-46) represent the reverse process: the normalized data is converted from 32-bit floating point to 8-bit unsigned integer (lines 26-45), and the result is saved (line 46).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 3 :</head><label>3</label><figDesc>…] denote the integer part of the number. Results are further presented as the ratio of the cv::filter2D(...) execution time to the execution time of our implementation. Performance comparison of CO using cv::filter2D vs. the proposed method on devices with a Cortex-A55 ARM CPU.</figDesc></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Investigation of the Implementation of the Linear Operator of Digital Image Convolution in 16-bit Computing</title>
		<author>
			<persName><forename type="first">P</forename><surname>Prystavka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Shevchenko</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Actual Problems of Automation and Information Technologies</title>
		<imprint>
			<biblScope unit="issue">20</biblScope>
			<biblScope unit="page" from="78" to="90" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">A SIMD-based Approach to the Enhancement of Convolution Operation Performance</title>
		<author>
			<persName><forename type="first">A</forename><surname>Shevchenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Tymchyshyn</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
			<publisher>CMiGIN</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<ptr target="https://docs.opencv.org/4.x/d4/d86/group__imgproc__filter.html#ga27c049795ce870216ddfb366086b5a04" />
		<title level="m">Image Filtering</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Very High-Speed Computing Systems</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Flynn</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE</title>
				<meeting>the IEEE</meeting>
		<imprint>
			<date type="published" when="1966">1966</date>
			<biblScope unit="volume">54</biblScope>
			<biblScope unit="page" from="1901" to="1909" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<ptr target="https://developer.arm.com/documentation/ddi0500/j" />
		<title level="m">ARM Cortex-A53 MPCore Processor Technical Reference Manual</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<ptr target="http://pages.cs.wisc.edu/~danav/pubs/qcom/hexagon_microreport2013_v5.pdf" />
		<title level="m">Qualcomm Extends Hexagon DSP</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<ptr target="https://developer.qualcomm.com/download/hexagon/hexagon-dsp-architecture.pdf" />
		<title level="m">Qualcomm Hexagon DSP: An Architecture Optimized for Mobile Multimedia and Communications</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">GCC: The Complete Reference</title>
		<author>
			<persName><forename type="first">Arthur</forename><surname>Griffith</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2002">2002</date>
			<publisher>McGraw-Hill, Inc</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Getting Started with LLVM Core Libraries</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">C</forename><surname>Lopes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Auler</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2014">2014</date>
			<publisher>Packt Publishing Ltd</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Loop Quantization: Unwinding for Fine-Grain Parallelism Exploitation</title>
		<author>
			<persName><forename type="first">A</forename><surname>Nicolau</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1985">1985</date>
		</imprint>
		<respStmt>
			<orgName>Cornell University</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Loop Tiling for Parallelism</title>
		<author>
			<persName><forename type="first">Jingling</forename><surname>Xue</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2000">2000</date>
			<biblScope unit="volume">575</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Expression Templates</title>
		<author>
			<persName><forename type="first">T</forename><surname>Veldhuizen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">C++ Report</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="26" to="31" />
			<date type="published" when="1995">1995</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
