<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Demystifying Power-of-Two Quantization: Benchmarking Inference on AVX and RVV</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Saleh Jamali Golzar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Pagano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Biagio Cosenza</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Salerno</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Recent trends in deep learning models have been characterized by a rapid increase in the number of parameters to achieve higher performance across a wide range of tasks. However, this growth in model size has intensified computational demands, leading to increased power consumption and latency during inference, making model compression more important than ever. As a result, it has become extremely important to provide and use efficient compression techniques, such as quantization, on different target architectures and platforms. This work revisits power-of-two (PoT) quantization (specifically the MatMul kernel) for inference workloads, evaluating a wide range of configurations, including fixed- and floating-point PoT, and targeting AVX512 and RVV-1.0. Our work proposes techniques tailored to the underlying architecture, including two methods for unpacking 8- and 16-bit PoT-encoded data for AVX512, two packing configurations for RVV-1.0, as well as a novel, lightweight solution for properly handling signed infinity and NaN floating-point values. Experimental results for floating-point PoT quantization of MatMul workloads show speedups of up to 3.67x.</p>
      </abstract>
      <kwd-group>
        <kwd>AVX512</kwd>
        <kwd>Deep Neural Networks</kwd>
        <kwd>Quantization</kwd>
        <kwd>RVV</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>❸ A novel approach to handle infinity and NaN floating-point values for PoT with an
acceptable overhead;
❹ An experimental evaluation of the PoT implementations for RVV-1.0 on SpacemiT K1, and for
AVX512 on Intel Xeon 5218, Xeon 8260 and AMD Ryzen 9 7950X, all evaluated with autotuning,
different input sizes and compilers.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background on Quantization</title>
      <p>
        Quantization reduces the precision of model parameters to improve latency and memory usage, trading
off an acceptable accuracy drop. Depending on its scale and offset, it can be symmetric/asymmetric and
uniform/non-uniform [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Post-Training Quantization (PTQ) applies quantization after training with
optional fine-tuning, while Quantization-Aware Training (QAT) integrates it directly into training for
better accuracy under the same constraints. Power-of-Two quantization offers a higher dynamic
range and a more efficient implementation compared to its fixed-point counterparts. This work focuses on
QAT PoT quantization.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Fixed-point PoT</title>
        <p>
          DeepShift [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] offers two schemes to train PoT-quantized models, namely DeepShift-Q and DeepShift-PS,
which use the same PoT encoding for the forward pass but different approaches for the backward pass.
It uses a shift matrix P̃ along with a sign matrix S̃ to replace a floating-point weight matrix W with
W̃ = Flip(2^P̃, S̃) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Furthermore, since the sign of the elements of P̃ only indicates left or right shifts, the
function Flip(x, s̃) = x · sign(s̃) handles the negation of x. This work maps full-precision values to
{0} ∪ {±2^p1}, p1 ∈ 𝕎.
        </p>
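        <p>
          As a worked instance of the definitions above, a shift-matrix entry p̃ = 2 together with a negative sign-matrix entry s̃ gives Flip(2^p̃, s̃) = 2² · sign(s̃) = −4, so the corresponding weight is approximated by −4 and multiplying by it reduces to a left shift by two followed by a negation.
        </p>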
        <p>
          S3 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] aims to solve the gradient vanishing [10] and weight sign freezing problems that arise when using
re-parametrization to train models with discrete power-of-two weights. Similar to DeepShift, it maps
full-precision values to {0} ∪ {±2^p2}, p2 ∈ 𝕎. The weights are re-parameterized by Eq. 1, where p is
defined in Eq. 2, b is the target quantization bitwidth, and ⊮(x) = +1 if x ≥ 0, else 0.

w_shift = ⊮(w_sparse) · {2 ⊮(w_sign) − 1} · 2^p    (1)
        </p>
        <p>
          The three factors in Eq. 1 act as a zero term, a sign term, and a scale term, respectively. The exponent p is computed by the recursion

p_i = ⊮(w_i) · (p_{i−1} + 1), 1 ≤ i ≤ b, p_0 = 0    (2)
        </p>
        <p>2.2. Floating-point PoT</p>
        <p>
          DenseShift [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], on the other hand, improves accuracy by removing zero from the dynamic range,
effectively mapping full-precision values to {±2^p3} instead of {0} ∪ {±2^p4}, p3, p4 ∈ 𝕎. Compared
to S3 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], this work uses the same recursion formula for p, while using Eq. 3 for computing quantized
weights. DenseShift [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] reports a 1.6x speedup on an ARM A57 Neon.

w_shift = {2 ⊮(w_sign) − 1} · 2^p    (3)
        </p>
        <p>
          For clarification, p for b = 3 is computed using Eq. 2 recursively and simplifies to Eq. 4. The
range of p3 is {0, 1, 2, 3}, leading to quantization levels of {±1, ±2, ±4, ±8} for w_shift (Eq. 5). Every
floating-point weight in the original model is replaced by four unique floating-point trainable parameters:
w_sign and w_1, w_2, w_3 for b = 3. As a comparison, for b = 3, S3 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] yields a range of {0, ±1, ±2, ±4}.

p_3 = ⊮(w_3)⊮(w_2)⊮(w_1) + ⊮(w_3)⊮(w_2) + ⊮(w_3)    (4)

p_3 ∈ {0, 1, 2, 3} ⟹ w_shift ∈ {±1, ±2, ±4, ±8}    (5)
        </p>
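        <p>
          Expanding Eq. 2 step by step for b = 3 makes this simplification explicit: p_1 = ⊮(w_1), p_2 = ⊮(w_2)(p_1 + 1) = ⊮(w_2)⊮(w_1) + ⊮(w_2), and p_3 = ⊮(w_3)(p_2 + 1), which is exactly Eq. 4 and takes values in {0, 1, 2, 3}.
        </p>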
        <p>DenseShift requires only two bits to store the exponents of the quantized levels ({±1, ±2, ±4, ±8}),
along with one bit for their signs, adding up to three bits (b = 3). It exploits the floating-point format
to replace multiplications of floating-point numbers by PoT values with unsigned integer additions (Fig.
1). Our work investigates the conventional fixed-point PoT briefly, along with a deeper study
of the floating-point PoT. Building on top of DenseShift, we explore how the sign and exponent bits
can be efficiently encoded and packed for modern processors with SIMD capabilities, targeting various
encoding configurations. In addition, we introduce strategies to correctly handle special floating-point values like
NaN and ±∞.</p>
        <sec id="sec-2-1-1">
          <title>Element of A</title>
          <p>+</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Element of BPoT</title>
          <p>=

+
PoT
=
Element of C   + PoT

+
0
=</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.3. Other Work</title>
        <p>ShiftAddViT [11] and P2-ViT [12] target PoT quantization for Vision Transformers. Similar studies
[13, 14, 15, 16, 17, 18, 19] propose specialized RISC-V extensions, FPGA-, and ASIC-based designs related
to PoT and ternary models. Other works, such as SIMDE [20], aim for portability of SIMD kernels across
different ISAs. Unfortunately, support for the AVX512 extensions needed for our use case was
missing in SIMDE. In addition, APoT [21] and the related literature [22] pursue a different approach to
implementing PoT quantization.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Methodology</title>
      <p>This section provides details on ❶ handwritten vectorized baseline kernels for RVV-1.0 and AVX512, ❷
handwritten vectorized floating-point PoT kernels for AVX512 and RVV-1.0, ❸ the same kernels with extra logic
to handle ±∞ and NaN properly, and ❹ handwritten vectorized fixed-point PoT kernels for AVX512
and RVV-1.0.</p>
      <sec id="sec-3-1">
        <title>3.1. Notation</title>
        <p>We define the target operation as C = A · B, where A is the non-trainable input tensor and B is the
trainable tensor, known at compile time. Using PoT quantization, C = A ⊕ B_PoT is used to replace
multiplication with an addition along with some extra operations, depending on whether a fixed-point
or a floating-point PoT is used. We use a Python-like approach for loops and arrays, meaning that the
index range “0 : 2 : 8” (first, step, last) maps to the set {0, 2, 4, 6}. However, we use address offsets (as in
C++) for vectorized loads and stores. To refer to a PoT quantization configuration, we use T1:T2:E:P,
where T1 and T2 refer to the data types of A and B_PoT. The number of bits required to store the signed exponent of
each element of B in B_PoT is denoted by E, while P refers to the number of PoT-encoded elements of
B packed into a word of B_PoT. An example would be F32:U8:E5:P1. Also, to save space, FPoT and
FXPoT are used to refer to floating-point and fixed-point PoT quantization.</p>
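        <p>
          As a minimal illustration of the loop notation (the variable name k is ours), the index range “0 : 2 : 8” corresponds to the following C loop, whose body sees k ∈ {0, 2, 4, 6}:
        </p>
        <p>
for (int k = 0; k &lt; 8; k += 2) {
    /* k takes the values 0, 2, 4, 6 */
}
        </p>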
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Quantization Methodology</title>
        <p>We have limited our work to the MatMul inference workload with a second operand known at
compile time. This is critical, as the encoding operation on the raw elements of tensor B affects runtime
performance unless the tensor data is known at compile time, in which case the encoding can be
performed ahead of model inference at no additional cost. Throughout this work, we use the PoT
scheme proposed in DenseShift ({±2^p}) for the floating-point PoT and assume the quantized PoT weight
tensor is always the second operand.</p>
        <p>For FXPoT, considering the c += a · b operation in a MatMul kernel, the variables a, b, and c are signed
integers and e = log2(|b|). Therefore, a · b can be replaced by sign(b) · (a ≪ e). This approach
replaces an integer multiplication with a variable shift (as opposed to an immediate shift) and some extra logic to
handle the negation if b is negative.</p>
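        <p>
          A minimal scalar sketch of this replacement is shown below (hypothetical function and variable names, not the paper’s kernel code); it assumes the sign of b and the exponent e = log2(|b|) have been precomputed at encoding time:
        </p>
        <p>
#include &lt;stdint.h&gt;

/* One FXPoT multiply-accumulate step: c += a * b with b = ±2^e. */
static inline int32_t fxpot_mac(int32_t c, int32_t a, uint32_t e, int b_negative)
{
    int32_t shifted = a &lt;&lt; e;                      /* a * 2^e as a variable shift */
    return c + (b_negative ? -shifted : shifted);  /* apply the sign of b */
}
        </p>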
        <p>
          For FPoT, in the floating-point multiplication c += a · b, using DenseShift’s floating-point
methodology, we replace a · b with a direct unsigned integer addition of the sign and exponent fields of a with a
PoT-quantized value [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], assuming that the value is not represented in two’s complement and that its sign
and positive exponent field align with those of a (Fig. 1).
        </p>
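        <p>
          The bit-level effect of this replacement can be sketched in scalar C as follows (a hedged illustration of the idea rather than the vectorized kernels; names are ours). It assumes a is a normal number, e ≥ 0, and that the exponent addition does not overflow, which is exactly the case treated separately in Sec. 6:
        </p>
        <p>
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

/* One FPoT multiply: returns a * (±2^e) by adding e directly to the IEEE 754
   exponent field of a and flipping the sign bit when b is negative. */
static inline float fpot_mul(float a, uint32_t e, int b_negative)
{
    uint32_t bits;
    memcpy(&amp;bits, &amp;a, sizeof bits);   /* reinterpret float as uint32 */
    bits += e &lt;&lt; 23;                  /* unsigned add on the exponent field */
    if (b_negative)
        bits ^= 0x80000000u;          /* flip the sign bit */
    memcpy(&amp;a, &amp;bits, sizeof a);
    return a;
}
        </p>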
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Encoding</title>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Baselines</title>
        <p>This paper aims to investigate how PoT influences MatMul inference. Therefore, we use a naive
MatMul implementation (no tiling, single-threaded). The code consists of independent executables for
each configuration of PoT kernels, each with three baselines: scalar, scalar-autovectorized, and
intrinsic-based. For example, Alg. 1 shows the baseline Float32 MatMul kernel with RVV-1.0. Note
that AVX512 does not have an FMA instruction for integer types, so non-fused multiplication and
addition instructions are used.</p>
        <p>Algorithm 1 The baseline RVV-1.0 F32:F32.</p>
        <p>1: for i ← 0 : N do UF0
2:   for j ← 0 : N do UF1
3:     v_acc ← 0, c ← 0 ◁ vfmv_v_f_f32m1 (for v_acc)
4:     for k ← 0 : vl : N do UF2
5:       vl ← vsetvl_e32m1(N − k)
6:       v_a ← A[i, k], v_b ← B[j, k] ◁ vle32_v_f32m1
7:       v_acc ← v_acc + v_a · v_b ◁ vfmacc_vv_f32m1
8:     c ← reduce(v_acc) ◁ vfmv_f_s_f32m1_f32(vfredosum_vs_f32m4_f32m1())
9:     C[i, j] ← c</p>
        <p>3.5. Autotuning</p>
        <p>We used brute-force autotuning implemented in Bash, sweeping all the possible combinations of the
defined tunable parameters. A shared combination of unrolling factors for all kernels in a benchmark
executable was used to reduce the autotuning time. Each PoT kernel benchmark has one executable.
The tuning objective is to minimize the runtime of the PoT kernels. The unrolling factors assigned to
each loop (if any) are shown with the blue markings on the algorithms.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. FPoT-quantized MatMul on AVX512</title>
      <p>Tensor B_PoT can hold more than one element per word and has to be unpacked so that each element aligns
with the relevant Float32 fields of tensor A. We introduce two methods for unpacking. The first
unpacking logic, which is shared by the two RVV-1.0 kernels and one of the AVX512 kernels, and will
from now on be referred to as Unpacking1, consists of unpacking 8-bit elements into 32-bit elements.</p>
      <p>The second unpacking method, referred to as Unpacking2, is implemented using the permutexvar
intrinsic to place an element of value zero to the left of every 16-bit element of the packed vector, while
unpacking to 32-bit vector registers. This is done by providing 0xFF as the indices for permutexvar.</p>
      <p>The first AVX kernel (F32:U8:E5:P1) unpacks the elements with Unpacking1 and creates a mask from
the sign bits, which is used to remove the sign bit itself (mask_sub) before performing the multiplication via addition.
Finally, the sign bits are flipped selectively by mask_xor (Alg. 2). The second AVX kernel, with Unpacking2
(F32:U16:E8:P1), uses 16-bit words to simplify the runtime operations, using permutexvar to unpack
the elements by inserting zeros to align the fields against Float32 (Alg. 3).</p>
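      <p>
        To make the data flow concrete, the following is a hedged AVX512 sketch of one 16-element step of the Unpacking1 path (E5:P1); variable and function names are ours, and the surrounding loop structure and 64-byte unrolling of Alg. 2 are omitted:
      </p>
      <p>
#include &lt;immintrin.h&gt;
#include &lt;stdint.h&gt;

/* acc += A[0..15] * decode(codes[0..15]); each 8-bit code holds a 5-bit
   exponent (bits 0-4) and a sign bit (bit 5). */
static inline __m512 fpot_step_avx512(__m512 acc, const float *a, const uint8_t *codes)
{
    __m128i   packed = _mm_loadu_si128((const __m128i *)codes);                /* 16 x u8 codes */
    __m512i   code   = _mm512_cvtepu8_epi32(packed);                           /* Unpacking1: widen to 32 bits */
    __mmask16 neg    = _mm512_cmpgt_epu32_mask(code, _mm512_set1_epi32(0x1F)); /* sign bit set? */
    __m512i   expo   = _mm512_mask_sub_epi32(code, neg, code, _mm512_set1_epi32(0x20)); /* drop sign bit */
    __m512i   a_bits = _mm512_castps_si512(_mm512_loadu_ps(a));
    __m512i   prod   = _mm512_add_epi32(a_bits, _mm512_slli_epi32(expo, 23));  /* a * 2^e via exponent add */
    prod             = _mm512_mask_xor_epi32(prod, neg, prod, _mm512_set1_epi32((int)0x80000000u));
    return _mm512_add_ps(acc, _mm512_castsi512_ps(prod));
}
      </p>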
      <p>Algorithm 2 AVX512 FPoT Unpacking1 F32:U8:E5:P1
1: v_expmask ← 0x1F ◁ _mm512_set1_epi32
2: v_signbit ← 0x20 ◁ _mm512_set1_epi32
3: v_f32sign ← 0x80000000 ◁ _mm512_set1_epi32
4: for i ← 0 : N do UF0
5:   for j ← 0 : N do UF1
6:     v_acc ← 0
7:     for k ← 0 : 64 : N do UF2
8:       v_b ← B_PoT[j, k]
9:       for t ← 0 : 4 do
10:        v_a ← A[i, k + 16t]
11:        v_code ← unpack(v_b[16t]) ◁ Unpacking1: 16 codes widened to 32 bits
12:        v_aint ← cast(v_a)
13:        mask ← cmp(v_code, v_expmask)
14:        v_code ← mask_sub(v_code, mask, v_code, v_signbit)
15:        v_code ← v_code ≪ 23
16:        v_prod ← v_aint + v_code
17:        v_prod ← mask_xor(v_prod, mask, v_prod, v_f32sign)
18:        v_prodf ← cast(v_prod)
19:        v_acc ← v_acc + v_prodf
20:    c ← reduce(v_acc)
21:    C[i, j] ← c</p>
    </sec>
    <sec id="sec-5">
      <title>5. FPoT-quantized MatMul on RVV</title>
      <p>For RVV, Vector Length Specific (VLS) is a programming style with vectors whose size is known
beforehand and always static, allowing for greater optimization opportunities at the cost of less portability.
In Vector Length Agnostic (VLA) code, the size of the vectors depends on the target machine and
the code has to query for it. This leads to better portability across different machines. As for the Vector
Length Multiplier (LMUL), for a value greater than one, the vector registers are combined to form
a vector register group. If LMUL is less than one, only fractions of the vector registers are utilized
(LMUL ∈ {1/8, 1/4, 1/2, 1, 2, 4, 8}). If set to dynamic, the compiler will adapt the LMUL for the code, unless a
specific LMUL is enforced by the code.</p>
      <p>Algorithm 3 AVX512 FPoT Unpacking2 F32:U16:E8:P1
1: indices ← {0xFF, 0, 0xFF, 1, 0xFF, 2, . . . , 0xFF, 15} ◁ alignas(64)
2: v_indices ← indices ◁ _mm512_load_si512
3: for i ← 0 : N do UF0
4:   for j ← 0 : N do UF1
5:     v_acc ← 0
6:     for k ← 0 : 16 : N do UF2
7:       v_a ← A[i, k]
8:       v_aint ← cast(v_a)
9:       v_b ← B_PoT[j, k]
10:      v_b512 ← cast(v_b)
11:      v_b512,unpacked ← permute(v_indices, v_b512)
12:      v_prod ← v_aint + v_b512,unpacked
13:      v_prodf ← cast(v_prod)
14:      v_acc ← v_acc + v_prodf
15:    sum ← reduce(v_acc)
16:    C[i, j] ← sum</p>
      <p>Both F32:U8:E5:P1 (Alg. 5) and F32:U8:E3:P2 (Alg. 4), which use
Unpacking1, are VLA and implemented via the riscv_vwcvtu intrinsic. They both create a mask for the
negative elements (cmp) to remove the sign bits (mask_sub) and XOR the sign bits (mask_xor) after
the multiplication by addition (Alg. 5). In the other kernel, shown in Alg. 4, after the initial unpacking, the
lower and upper nibbles of the 8-bit elements of the vectors are concatenated and reordered via the use
of concat and gather. Then, a similar logic to that of E5:P1 is utilized.</p>
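      <p>
        As a hedged illustration of the E5:P1 decode path on RVV-1.0 (names are ours, the intrinsics use the current __riscv_ prefix, and the sign is applied with a mask-free XOR rather than the masked sub/xor used by the kernels above):
      </p>
      <p>
#include &lt;riscv_vector.h&gt;
#include &lt;stdint.h&gt;

/* One VLA strip: acc += A[0..vl-1] * decode(codes[0..vl-1]); each 8-bit code
   holds a 5-bit exponent (bits 0-4) and a sign bit (bit 5). */
static inline vfloat32m8_t fpot_strip_rvv(vfloat32m8_t acc, const float *a,
                                          const uint8_t *codes, size_t vl)
{
    vuint8m2_t  code = __riscv_vle8_v_u8m2(codes, vl);
    vuint8m2_t  expo = __riscv_vand_vx_u8m2(code, 0x1F, vl);        /* exponent bits */
    vuint8m2_t  sign = __riscv_vsrl_vx_u8m2(code, 5, vl);           /* sign bit into bit 0 */
    vuint32m8_t e32  = __riscv_vwcvtu_x_x_v_u32m8(__riscv_vwcvtu_x_x_v_u16m4(expo, vl), vl);
    vuint32m8_t s32  = __riscv_vwcvtu_x_x_v_u32m8(__riscv_vwcvtu_x_x_v_u16m4(sign, vl), vl);
    vuint32m8_t abit = __riscv_vreinterpret_v_f32m8_u32m8(__riscv_vle32_v_f32m8(a, vl));
    vuint32m8_t prod = __riscv_vadd_vv_u32m8(abit, __riscv_vsll_vx_u32m8(e32, 23, vl), vl);
    prod             = __riscv_vxor_vv_u32m8(prod, __riscv_vsll_vx_u32m8(s32, 31, vl), vl);
    return __riscv_vfadd_vv_f32m8(acc, __riscv_vreinterpret_v_u32m8_f32m8(prod), vl);
}
      </p>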
      <p>Algorithm 4 RVV-1.0 FPoT F32:U8:E3:P2.</p>
      <p>1: vlmax ← vsetvlmax_e32m8()
2: elements ← vsetvlmax_e8m2()
3: indices ← {0}
4: for t ← 0 : elements do
5:   if t mod 2 then
6:     indices[t] ← t/2
7: v_index ← indices ◁ vle8_v_u8m2
8: for i ← 0 : N do UF0
9:   for j ← 0 : N do UF1
10:    v_acc ← 0 ◁ vfmv_v_f_f32m8
11:    c ← 0
12:    for k ← 0 : vl : N do UF2
13:      vl ← vsetvl_e32m8(N − k)
14:      v_a ← A[i, k] ◁ vle32_v_f32m8
15:      v_b ← B_PoT[j, k] ◁ vle8_v_u8m1
16:      v_lo ← lowerHalf(v_b) ◁ vand_vx_u8m1
17:      v_hi ← upperHalf(v_b) ◁ vand_vx_u8m1
18:      v_cat ← concat(v_lo, v_hi) ◁ vset_v_u8m1_u8m2
19:      v_code ← gather(v_cat, v_index) ◁ vrgather_vv_u8m2
20:      mask ← cmp(v_code, 0b11111) ◁ vmsgtu_vx_u8m1_b8
21:      v_code ← mask_sub(v_code, mask, v_code, 0b100000) ◁ vsub_vx_u8m1_tumu
22:      v_exp ← expand(v_code) ◁ vwcvtu_x_x_v_u32m8(vwcvtu_x_x_v_u16m4())
23:      v_exp ← v_exp ≪ 23 ◁ vsll_vx_u32m8
24:      v_auint ← rintrp(v_a) ◁ vreinterpret_v_f32m8_u32m8
25:      v_uint ← v_auint + v_exp ◁ vadd_vv_u32m8
26:      v_prod ← rintrp(v_uint) ◁ vreinterpret_v_u32m8_f32m8
27:      v_acc ← v_acc + v_prod ◁ vfadd_vv_f32m8
28:    c ← reduce(v_acc) ◁ vfmv_f_s_f32m1_f32(vfredosum_vs_f32m4_f32m1())
29:    C[i, j] ← c</p>
      <p>Algorithm 5 RVV-1.0 FPoT F32:U8:E5:P1.</p>
      <p>6. Handling ±∞ and NaN</p>
      <p>
        Since FPoT replaces a multiplication of two floating-point values with an integer addition, the edge
cases defined in IEEE 754 are no longer handled by the hardware. Therefore, the exponent bits can overflow after
the addition and flip the sign bit of the output. DenseShift [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is also prone to
this issue. A full software solution for this could render the FPoT speedup pointless, but we can still
implement extra logic to manage just the ±∞ and NaN elements of A. These values (±∞ and NaN)
both have their exponent bits set to 0xFF. Only the non-trainable tensor (A) can contain such values, as the trainable tensor is restricted to
valid PoT-quantized levels in DenseShift, which do not include zero or the special IEEE 754 representations.
As shown in Fig. 2, since ∞ · x = ∞ if x is not ∞, NaN, or zero, as long as the FPoT exponents are
not added to these problematic elements of tensor A, the accumulated results in Float32 will be correct. Later,
the hardware handles the edge cases for any pairs of ±∞ or NaN values in Float32. As shown
in Alg. 6, the mask mask_inf is used to perform a masked addition that excludes the problematic elements from the
addition with the FPoT exponents. The extra instructions needed to perform the masked addition come
with an acceptable overhead, discussed in Sec. 8.3.
      </p>
      <p>Fig. 2: Elements of A whose exponent field is 0xFF (±∞ or NaN) bypass the unsigned addition of the PoT exponent and are passed through unchanged to the Float32 accumulation into C.</p>
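      <p>
        A hedged AVX512 sketch of the bypass in Fig. 2 is shown below (names are ours; Alg. 6 integrates the same masked addition into the kernel of Alg. 2):
      </p>
      <p>
#include &lt;immintrin.h&gt;

/* Add the PoT exponent addend (already shifted to bit 23) to the bit patterns of A,
   but only in lanes where A is finite; ±Inf/NaN lanes pass through unchanged. */
static inline __m512i fpot_add_skip_special(__m512i a_bits, __m512i addend)
{
    const __m512i exp_mask = _mm512_set1_epi32(0x7F800000);
    __m512i   exp_a  = _mm512_and_epi32(a_bits, exp_mask);
    __mmask16 finite = _mm512_cmpneq_epi32_mask(exp_a, exp_mask);  /* exponent field != 0xFF */
    return _mm512_mask_add_epi32(a_bits, finite, a_bits, addend);  /* masked addition */
}
      </p>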
    </sec>
    <sec id="sec-6">
      <title>7. FXPoT</title>
      <p>To clarify the nature of the speedup, if any, of the FXPoT kernels compared to a normal fixed-point MatMul,
we developed two benchmarks for AVX512 and RVV. The idea is to see whether a performance gain can
be achieved by using shift instructions to replace integer multiplications in a loop. Therefore, we
excluded the negation logic for handling negative numbers from the code and opted for I32:I32 data
types for the baseline kernel. For AVX512, the multiplication intrinsic instances (mullo_epi32)
are replaced by bitwise left shifts. The elements of A and B are loaded with load_si512,
left-shifted with sllv_epi32, and accumulated by add_epi32. The resulting vector is then reduced to a
scalar using reduce_add_ps and written to the output tensor C. The same approach is used for the
RVV kernel using sll_vv_i32m4, vadd_vv_i32m4, and vredsum_vs_i32m4_i32m1. Note that the
FXPoT kernels do not need more than 5 bits to store the exponent bits, since the selected base data type
is 32 bits wide, which means the PoT type is I32:U32:E5:P1.</p>
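      <p>
        A hedged sketch of the two inner-loop variants compared here is given below (names are ours; the actual kernels additionally accumulate over the full row and reduce, as in the other listings):
      </p>
      <p>
#include &lt;immintrin.h&gt;
#include &lt;stdint.h&gt;

/* Baseline step: acc += a * b with 32-bit integer multiplies. */
static inline __m512i i32_mac_step(__m512i acc, const int32_t *a, const int32_t *b)
{
    __m512i va = _mm512_loadu_si512((const void *)a);
    __m512i vb = _mm512_loadu_si512((const void *)b);
    return _mm512_add_epi32(acc, _mm512_mullo_epi32(va, vb));
}

/* FXPoT step: b is stored as exponents e, so a * 2^e becomes a variable left shift. */
static inline __m512i fxpot_mac_step(__m512i acc, const int32_t *a, const uint32_t *e)
{
    __m512i va = _mm512_loadu_si512((const void *)a);
    __m512i ve = _mm512_loadu_si512((const void *)e);
    return _mm512_add_epi32(acc, _mm512_sllv_epi32(va, ve));
}
      </p>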
    </sec>
    <sec id="sec-7">
      <title>8. Experimental Evaluation</title>
      <p>Autotuning For RISC-V, we used dynamic LMUL and let the autotuner script choose between VLA
and VLS. Note that the VLS build is selected by adding the GCC flag -mrvv-vector-bits=zvl and the LLVM
flag -mrvv-vector-bits=256. For GCC, the VLS build flags work in conjunction with -march. To
manage slow tuning, we only considered tuning the unrolling factor of the inner-most loop with a
value from {1, 2, 4}. We kept SMT (i.e., Intel’s hyperthreading) active and the rest of the cores idling.
The CPU affinity was enforced with taskset. The SIMD kernels’ measurements are repeated 75 times.
Tensor Shapes We considered square matrices (N × N) that satisfy the constraints of all our baseline
and PoT kernels. For the AVX512 kernels, the most restrictive constraint is for N to be a multiple of 64.
For the RVV kernels, the VLA design imposes no restrictions on N, while the E3:P2 encoding requires N
to be a multiple of 2. Therefore, we considered N ∈ {64, 128, 256, 512, 768, 1024, 2048, 4096, 5120}
to cover a wide range of shapes while keeping the runtime of the experiments manageable. Handling
the cases of non-square matrices and of matrices that do not satisfy the constraints of our kernels is
straightforward, but left for future work, as our main goal is to prove the effectiveness of the core idea
behind FPoT and FXPoT in a fair setting.</p>
      <sec id="sec-7-1">
        <title>8.1. AVX512 Results</title>
        <p>The results for Alg. 2 and 3 for various square matrix sizes (N) are shown in Fig. 3. The same data
for  = 5 and LLVM 18 is shown in Fig. 4. Unpacking1 is 1.13x (geomean) faster than Unpacking2
with the LLVM compilers. The profiling data on Xeon 5218 with LLVM 18 using Intel Advisor and VTune
indicates that the baselines are L2-bound and perform worse for  = 3 as opposed to  = 1, while
the FPoT kernels sustain the same performance thanks to the increased arithmetic intensity of FPoT due to
the spared loads. This leads to the steep increase of the speedup in 1024 ≤ N &lt; 4096. For N ≤ 768, the
tensors can fit entirely in the caches, diminishing the effectiveness of the FPoT kernels over the baselines.
Because Unpacking2-based kernels use F32:U16 rather than F32:U8 as in Unpacking1, the increase
in arithmetic intensity is smaller, resulting in performance that relies more heavily on the L3 cache.
This is evident in the green lines in Fig. 3; performance correlates with L3 cache size (Ryzen 9 &gt; Xeon
8260 &gt; Xeon 5218) and the kernels that use Unpacking2 achieve higher speedups than the ones using
Unpacking1 on devices with larger L3 caches. Unpacking1 (blue lines) achieves better speedups on devices
with smaller L3 caches. Furthermore, a better performance is observed for Unpacking2-based FPoT on
Xeon 8260 compared to Xeon 5218. The official specifications indicate that Xeon 5218 has one FMA unit,
while Xeon 8260 has two, which may partly explain the steeper speedup increase on Xeon 5218. The
documentation of Ryzen 9 does not specify the number of available FMA units. Unlike on the Intel CPUs,
Unpacking2-based FPoT is 1.72x faster than Unpacking1 on Ryzen 9 with the GCC compilers. As shown in Table 2,
Ryzen 9 has faster memory. Therefore, the load instructions spared by PoT are less effective compared
to the Intel devices. LLVM 17 and 18 perform best, with a geomean speedup of 1.28 on Intel and 0.85 on AMD.
The maximum speedups for Xeon 5218, Xeon 8260, and Ryzen 9 are 3.67x, 3.60x, and 2.28x, respectively.</p>
        <p>8.2. RVV-1.0 Results</p>
        <p>The results for Alg. 1, 5, and 4 for different N values on Banana Pi F3 with the SpacemiT-K1 SoC are shown
in Fig. 3. The results for  = 5 with LLVM 18 are demonstrated in Fig. 4. The geomean speedup of
FPoT E3:P2 over E5:P1 is 0.72 for GCC, due to the overhead of unpacking. The maximum speedup on
SpacemiT-K1 is 1.45x with LLVM 17, followed by 1.43x with GCC 14.2 and 1.42x with LLVM 18. The
geomean speedup of LLVM 17 and 18 is 1.01.</p>
        <p>8.3. AVX512 ±∞ and NaN Handling Results</p>
        <p>In this section, we experimentally evaluate the infinity and NaN handling algorithm, as described in
Sec. 6 and Alg. 6. We built the kernel with LLVM 18 and ran it on Xeon 5218. A slightly lower speedup
is observed (Fig. 5) compared to the kernels that do not account for the extra ±∞ and NaN handling
(Fig. 3). The maximum speedup for this kernel is 2.78x, while the kernel described in Alg. 2 achieves 3.67x.</p>
      </sec>
      <sec id="sec-7-2">
        <title>8.4. FXPoT Results</title>
        <p>We used LLVM 18 on Xeon 5218 and Banana Pi F3 (SpacemiT-K1) with autotuning. Fig. 6 provides the
results for these kernels. As can be seen, FXPoT provides no meaningful and dependable speedup over
the baseline. Since both FXPoT and its baseline use a 32-bit data type (I32 and U32) without saving
any load instructions, the transient gain for N ≤ 512 on Xeon 5218 reflects architectural properties
such as the available units for shifting and multiplication.</p>
      </sec>
      <sec id="sec-7-3">
        <title>8.5. Autovectorization</title>
        <p>To measure how well different compilers were able to auto-vectorize the scalar baseline kernels, we define
Speedup_SS = t_SNA / t_SAV. Note that SAV and SNA refer to scalar kernels with and without auto-vectorization.
We also provide Speedup_VS = t_SNA / t_SIMD to compare FPoT kernels against the scalar non-autovectorized baselines.
SIMD refers to the FPoT handwritten SIMD kernels, all shown in Fig. 7, for the matrix size of 5 and
LLVM 18. The compiler flags used for these settings are similar to the ones listed in Table 3, but without the
flags that disable autovectorization. The SNA and SAV kernels’ measurements are repeated 7 times. Note
that Speedup_VS is a good approximate indicator of the upper bound for Speedup_SS. For AVX512, GCC
achieves the highest geomean Speedup_SS, with 7.34 on Intel and 14.75 on AMD. For Speedup_VS, GCC also
leads with 9.0 (Intel) and 11.81 (AMD). For RVV, the best-performing compiler is GCC with a geomean
Speedup_SS of 3.66.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>9. Discussion</title>
      <p>
        Based on our results, FPoT provided a maximum speedup of 3.67x. Using the geomean speedup with the
best compiler, implementing FPoT with permutexvar yielded a 1.16x speedup on Ryzen 9, but resulted
in a slowdown (0.87x) on the Intel CPUs (Fig. 3). Furthermore, on SpacemiT-K1, FPoT achieved a geomean
speedup of 1.39x without packing multiple elements into a byte (Fig. 3). Finally, the LLVM compilers
yielded better speedups overall for AVX512 and RVV (Fig. 3). While DenseShift [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] uses F16:F16 GEMM
kernels from NCNN [23], we opted for Float32 handwritten baselines, as the AVX512-FP16 ISA
extension has only been available since Sapphire Rapids on Intel, even though the SpacemiT-K1 has support
for Float16. A direct comparison with DenseShift would not be fair, considering the different CPU
architectures, ISAs, data types, workloads, and missing details on the encoding scheme used in DenseShift.
Unfortunately, we did not have access to an ARM A57 SoC to reproduce the results from DenseShift with
our implementation. We hope our work will encourage future research in this direction, providing
detailed information on the encoding scheme and the baselines used for comparison. Nevertheless,
we observed a similar speedup of 1.45x on SpacemiT-K1 with RVV-1.0 compared to the speedup of
1.6x on ARM A57 Neon [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] (we report the results published in the paper because the artifacts are not
available). To the best of our knowledge, no other work provides results for handwritten AVX512- and
RVV-1.0-based FPoT-quantized MatMul with various configurations.
      </p>
    </sec>
    <sec id="sec-10">
      <title>10. Conclusion</title>
      <p>
        This work focuses on AVX512 and RVV-1.0 implementations of floating-point PoT-quantized MatMul
inference kernels, initially proposed by DenseShift. On AVX512, we proposed two ways to implement
unpacking for floating-point PoT MatMul with one element per word. Furthermore, for RVV-1.0, we
explored the implementation of floating-point PoT MatMul inference with one and two elements per
word. We proposed and evaluated a novel approach to handle ±∞ and NaN floating-point values.
Overall, the experimental results indicated that floating-point PoT quantization for MatMul inference
yields speedups of up to 3.67x on Xeon 5218 with AVX512 and 1.45x on SpacemiT-K1 with RVV-1.0 in the
single-threaded configuration.
      </p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
      <p>[10] S. Hochreiter, The vanishing gradient problem during learning recurrent neural nets and problem
solutions, Int. J. of Uncertainty, Fuzziness and Knowledge-Based Systems 6 (1998) 107–116.
[11] H. You, H. Shi, Y. Guo, Y. Lin, ShiftAddViT: Mixture of multiplication primitives towards efficient
vision transformer, in: Advances in Neural Information Processing Systems 36: Annual Conference
on Neural Information Processing Systems, NeurIPS, 2023.
[12] H. Shi, X. Cheng, W. Mao, Z. Wang, P2-ViT: Power-of-two post-training quantization and
acceleration for fully quantized vision transformer, IEEE Transactions on Very Large Scale Integration
(VLSI) Systems 32 (2024) 1704–1717. doi:10.1109/TVLSI.2024.3422684.
[13] D. A. Gudovskiy, L. Rigazio, ShiftCNN: Generalized low-precision architecture for inference of
convolutional neural networks, arXiv preprint arXiv:1706.02393 (2017).
[14] R. Saha, J. Haris, J. Cano, Accelerating PoT quantization on edge devices, in: 2024 31st IEEE Int.
Conf. on Electronics, Circuits and Systems (ICECS), IEEE, 2024, pp. 1–4.
[15] T. Dupuis, Y. Fournier, M. AskariHemmat, N. El Zarif, F. Leduc-Primeau, J. P. David, Y. Savaria,
SparQ: A custom RISC-V vector processor for efficient sub-byte quantized inference, in: 21st IEEE
Interregional NEWCAS Conf. (NEWCAS), IEEE, 2023, pp. 1–5.
[16] G. Rutishauser, J. Mihali, M. Scherer, L. Benini, xTern: Energy-efficient ternary neural network
inference on RISC-V-based edge systems, in: IEEE 35th Int. Conf. on Application-specific Systems,
Architectures and Processors (ASAP), IEEE, 2024, pp. 206–213.
[17] S. Kalapothas, M. Galetakis, G. Flamis, F. Plessas, P. Kitsos, A survey on RISC-V-based machine
learning ecosystem, Information 14 (2023) 64.
[18] X. Geng, S. Liu, J. Jiang, K. Jiang, H. Jiang, Compact powers-of-two: An efficient non-uniform
quantization for deep neural networks, in: Design, Automation &amp; Test in Europe Conference &amp;
Exhibition (DATE), 2024, pp. 1–6. doi:10.23919/DATE58400.2024.10546652.
[19] T. Xia, B. Zhao, J. Ma, G. Fu, W. Zhao, N. Zheng, P. Ren, An energy-and-area-efficient CNN
accelerator for universal powers-of-two quantization, IEEE Transactions on Circuits and Systems
I: Regular Papers 70 (2023) 1242–1255. doi:10.1109/TCSI.2022.3227608.
[20] J.-H. Li, J.-K. Lin, Y.-C. Su, C.-W. Chu, L.-T. Kuok, H.-M. Lai, C.-L. Lee, J.-K. Lee, SIMD Everywhere
optimization from ARM NEON to RISC-V vector extensions, arXiv preprint arXiv:2309.16509 (2023).
[21] Y. Li, X. Dong, W. Wang, Additive powers-of-two quantization: An efficient non-uniform
discretization for neural networks, in: 8th Int. Conf. on Learning Representations, ICLR, 2020.
[22] Y. M. Kim, K. Han, W.-K. Lee, H. J. Chang, S. O. Hwang, Non-zero grid for accurate 2-bit additive
power-of-two CNN quantization, IEEE Access 11 (2023) 32051–32060. doi:10.1109/ACCESS.2023.3259959.
[23] H. Ni, The NCNN contributors, NCNN, 2017. URL: https://github.com/Tencent/ncnn.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W. A.</given-names>
            <surname>Wulf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>McKee</surname>
          </string-name>
          ,
          <article-title>Hitting the memory wall: implications of the obvious</article-title>
          ,
          <source>SIGARCH Comput. Archit. News</source>
          <volume>23</volume>
          (
          <year>1995</year>
          )
          <fpage>20</fpage>
          -
          <lpage>24</lpage>
          . doi:
          <volume>10</volume>
          .1145/216585.216588.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoffmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Borgeaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          , E. Buchatskaya,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rutherford</surname>
          </string-name>
          , D. d. L.
          <string-name>
            <surname>Casas</surname>
            ,
            <given-names>L. A.</given-names>
          </string-name>
          <string-name>
            <surname>Hendricks</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Welbl</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Training compute-optimal large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2203.15556</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Frantar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ashkboos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hoefler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Alistarh</surname>
          </string-name>
          , GPTQ:
          <article-title>Accurate post-training quantization for generative pre-trained transformers</article-title>
          ,
          <source>arXiv preprint arXiv:2210.17323</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Elhoushi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Shafiq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. H.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>DeepShift: Towards multiplication-less neural networks</article-title>
          ,
          <source>in: IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops</source>
          <year>2021</year>
          , virtual, June 19-25,
          <year>2021</year>
          , Computer Vision Foundation / IEEE,
          <year>2021</year>
          , pp.
          <fpage>2359</fpage>
          -
          <lpage>2368</lpage>
          . doi:
          <volume>10</volume>
          .1109/CVPRW53098.
          <year>2021</year>
          .
          <volume>00268</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. P.</given-names>
            <surname>Nia</surname>
          </string-name>
          ,
          <article-title>DenseShift: Towards accurate and efficient low-bit power-of-two quantization</article-title>
          ,
          <source>in: IEEE/CVF Int. Conf. on Computer Vision</source>
          , ICCV, IEEE,
          <year>2023</year>
          , pp.
          <fpage>16964</fpage>
          -
          <lpage>16974</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICCV51070.
          <year>2023</year>
          .
          <volume>01560</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Jamali Golzar</surname>
          </string-name>
          , G. Karimian,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shoaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fattahi Sani</surname>
          </string-name>
          ,
          <article-title>DGCNN on FPGA: acceleration of the point cloud classifier using FPGAs</article-title>
          ,
          <source>Circuits, Systems, and Signal Processing</source>
          <volume>42</volume>
          (
          <year>2023</year>
          )
          <fpage>748</fpage>
          -
          <lpage>779</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Sarma</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. M. Bronstein</surname>
            ,
            <given-names>J. M.</given-names>
          </string-name>
          <string-name>
            <surname>Solomon</surname>
          </string-name>
          ,
          <article-title>Dynamic graph cnn for learning on point clouds</article-title>
          ,
          <source>ACM Transactions on Graphics (tog) 38</source>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nagel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fournarakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Amjad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bondarenko</surname>
          </string-name>
          , M. van Baalen,
          <string-name>
            <given-names>T.</given-names>
            <surname>Blankevoort</surname>
          </string-name>
          ,
          <article-title>A white paper on neural network quantization</article-title>
          ,
          <source>ArXiv abs/2106</source>
          .08295 (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          , W. Liu,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. P.</given-names>
            <surname>Nia</surname>
          </string-name>
          ,
          <article-title>S3: sign-sparse-shift reparametrization for effective training of low-bit shift networks</article-title>
          ,
          <source>in: Proceedings of the 35th Int. Conf. on Neural Information Processing Systems</source>
          , NIPS,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>