<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Demystifying Power-of-Two Quantization: Benchmarking Inference on AVX and RVV</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Saleh Jamali Golzar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Pagano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Biagio Cosenza</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Salerno</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Recent trends in deep learning models have been characterized by a rapid increase in the number of parameters to achieve higher performance across a wide range of tasks. However, this growth in model size has intensified computational demands, leading to increased power consumption and latency during inference, making model compression more important than ever. As a result, it has become extremely important to provide and use efficient compression techniques, such as quantization, on different target architectures and platforms. This work revisits power-of-two (PoT) quantization (specifically the MatMul kernel) for inference workloads, evaluating a wide range of configurations, including fixed- and floating-point PoT, and targeting AVX512 and RVV-1.0. Our work proposes techniques tailored to the underlying architecture, including two methods for unpacking 8- and 16-bit PoT-encoded data for AVX512, two packing configurations for RVV-1.0, as well as a novel, lightweight solution for properly handling signed infinity and NaN floating-point values. Experimental results for floating-point PoT quantization of MatMul workloads show speedups of up to 3.67x.</p>
      </abstract>
      <kwd-group>
        <kwd>AVX512</kwd>
        <kwd>Deep Neural Networks</kwd>
        <kwd>Quantization</kwd>
        <kwd>RVV</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>❸ A novel approach to handle infinity and NaN floating-point values for PoT with an
acceptable overhead;
❹ An experimental evaluation of the PoT implementations for RVV-1.0 on SpacemiT K1, and for
AVX512 on Intel Xeon 5218, Xeon 8260 and AMD Ryzen 9 7950X, all evaluated with autotuning,
different input sizes and compilers.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background on Quantization</title>
      <p>
        Quantization reduces the precision of model parameters to improve latency and memory usage, trading
off an acceptable accuracy drop. Depending on its scale and offset, it can be symmetric/asymmetric and
uniform/non-uniform [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Post-Training Quantization (PTQ) applies quantization after training with
optional fine-tuning, while Quantization-Aware Training (QAT) integrates it directly into training for
better accuracy under the same constraints. Power-of-Two quantization offers a higher dynamic
range and a more efficient implementation compared to its fixed-point counterparts. This work focuses on
QAT PoT quantization.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Fixed-point PoT</title>
        <p>
          DeepShift [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] offers two schemes to train PoT-quantized models, namely DeepShift-Q and DeepShift-PS,
which use the same PoT encoding for the forward pass but different approaches for the backward pass.
It uses a shift matrix P̃ along with a sign matrix S̃ to replace a floating-point weight matrix W with
W̃ = Flip(2^P̃, S̃) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Furthermore, since the sign of the elements of P̃ only indicates left or right shifts, the
function Flip(x, s̃) = x · sign(s̃) handles the negation of x. This work maps full-precision values to
{0} ∪ {±2^p1}, p1 ∈ 𝕎.
        </p>
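        <p>
          As a worked instance of the definitions above, a shift-matrix entry p̃ = 2 together with a negative sign-matrix entry s̃ gives Flip(2^p̃, s̃) = 2² · sign(s̃) = −4, so the corresponding weight is approximated by −4 and multiplying by it reduces to a left shift by two followed by a negation.
        </p>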
        <p>
          S3 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] aims to solve the gradient vanishing [10] and weight sign freezing problems that arise when using
re-parametrization to train models with discrete power-of-two weights. Similar to DeepShift, it maps
full-precision values to {0} ∪ {±2^p2}, p2 ∈ 𝕎. The weights are re-parameterized by Eq. 1, where p is
defined in Eq. 2, b is the target quantization bitwidth, and ⊮(x) = +1 if x ≥ 0, else 0.

w_shift = ⊮(w_sparse) · {2 ⊮(w_sign) − 1} · 2^p    (1)
        </p>
        <p>
          The three factors in Eq. 1 act as a zero term, a sign term, and a scale term, respectively. The exponent p is computed by the recursion

p_i = ⊮(w_i) · (p_{i−1} + 1), 1 ≤ i ≤ b, p_0 = 0    (2)
        </p>
        <p>2.2. Floating-point PoT</p>
        <p>
          DenseShift [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], on the other hand, improves accuracy by removing zero from the dynamic range,
effectively mapping full-precision values to {±2^p3} instead of {0} ∪ {±2^p4}, p3, p4 ∈ 𝕎. Compared
to S3 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], this work uses the same recursion formula for p, while using Eq. 3 for computing quantized
weights. DenseShift [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] reports a 1.6x speedup on an ARM A57 Neon.

w_shift = {2 ⊮(w_sign) − 1} · 2^p    (3)
        </p>
        <p>
          For clarification, p for b = 3 is computed using Eq. 2 recursively and simplifies to Eq. 4. The
range of p3 is {0, 1, 2, 3}, leading to quantization levels of {±1, ±2, ±4, ±8} for w_shift (Eq. 5). Every
floating-point weight in the original model is replaced by four unique floating-point trainable parameters:
w_sign and w_1, w_2, w_3 for b = 3. As a comparison, for b = 3, S3 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] yields a range of {0, ±1, ±2, ±4}.

p_3 = ⊮(w_3)⊮(w_2)⊮(w_1) + ⊮(w_3)⊮(w_2) + ⊮(w_3)    (4)

p_3 ∈ {0, 1, 2, 3} ⟹ w_shift ∈ {±1, ±2, ±4, ±8}    (5)
        </p>
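        <p>
          Expanding Eq. 2 step by step for b = 3 makes this simplification explicit: p_1 = ⊮(w_1), p_2 = ⊮(w_2)(p_1 + 1) = ⊮(w_2)⊮(w_1) + ⊮(w_2), and p_3 = ⊮(w_3)(p_2 + 1), which is exactly Eq. 4 and takes values in {0, 1, 2, 3}.
        </p>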
        <p>DenseShift requires only two bits to store the exponents of the quantized levels ({±1, ±2, ±4, ±8}),
along with one bit for their signs, adding up to three bits (b = 3). It exploits the floating-point format
to replace multiplications of floating-point numbers by PoT values with unsigned integer additions (Fig.
1). Our work investigates the conventional fixed-point PoT briefly, along with a deeper study
of the floating-point PoT. Building on top of DenseShift, we explore how the sign and exponent bits
can be efficiently encoded and packed for modern processors with SIMD capabilities, targeting various
encoding configurations. In addition, we introduce strategies to correctly handle special floating-point values like
NaN and ±∞.</p>
        <sec id="sec-2-1-1">
          <title>Element of A</title>
          <p>+</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Element of BPoT</title>
          <p>=

+
PoT
=
Element of C   + PoT

+
0
=</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.3. Other Work</title>
        <p>ShiftAddViT [11] and P2-ViT [12] target PoT quantization for Vision Transformers. Similar studies
[13, 14, 15, 16, 17, 18, 19] propose specialized RISC-V extensions, FPGA-, and ASIC-based designs related
to PoT and ternary models. Other works, such as SIMDE [20], aim for portability of SIMD kernels across
different ISAs. Unfortunately, support for the AVX512 extensions needed for our use case was
missing in SIMDE. In addition, APoT [21] and the related literature [22] pursue a different approach to
implementing PoT quantization.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Methodology</title>
      <p>This section provides details on ❶ handwritten vectorized baseline kernels for RVV-1.0 and AVX512, ❷
handwritten vectorized floating-point PoT kernels for AVX512 and RVV-1.0, ❸ the same kernels with extra logic
to handle ±∞ and NaN properly, and ❹ handwritten vectorized fixed-point PoT kernels for AVX512
and RVV-1.0.</p>
      <sec id="sec-3-1">
        <title>3.1. Notation</title>
        <p>We define the target operation as C = A · B, where A is the non-trainable input tensor and B is the
trainable tensor, known at compile time. Using PoT quantization, C = A ⊕ B_PoT is used to replace
multiplication with an addition along with some extra operations, depending on whether a fixed-point
or a floating-point PoT is used. We use a Python-like approach for loops and arrays, meaning that the
index range “0 : 2 : 8” (first, step, last) maps to the set {0, 2, 4, 6}. However, we use address offsets (as in
C++) for vectorized loads and stores. To refer to a PoT quantization configuration, we use T1:T2:E:P,
where T1 and T2 refer to the data types of A and B_PoT. The number of bits required to store the signed exponent of
each element of B in B_PoT is denoted by E, while P refers to the number of PoT-encoded elements of
B packed into a word of B_PoT. An example would be F32:U8:E5:P1. Also, to save space, FPoT and
FXPoT are used to refer to floating-point and fixed-point PoT quantization.</p>
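        <p>
          As a minimal illustration of the loop notation (the variable name k is ours), the index range “0 : 2 : 8” corresponds to the following C loop, whose body sees k ∈ {0, 2, 4, 6}:
        </p>
        <p>
for (int k = 0; k &lt; 8; k += 2) {
    /* k takes the values 0, 2, 4, 6 */
}
        </p>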
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Quantization Methodology</title>
        <p>We have limited our work to the MatMul inference workload with a second operand known at
compile time. This is critical, as the encoding operation on the raw elements of tensor B affects runtime
performance unless the tensor data is known at compile time, in which case the encoding can be
performed ahead of model inference at no additional cost. Throughout this work, we use the PoT
scheme proposed in DenseShift ({±2^p}) for the floating-point PoT and assume the quantized PoT weight
tensor is always the second operand.</p>
        <p>For FXPoT, considering the c += a · b operation in a MatMul kernel, the variables a, b, and c are signed
integers and e = log2(|b|). Therefore, a · b can be replaced by sign(b) · (a ≪ e). This approach
replaces an integer multiplication with a variable shift (as opposed to an immediate shift) and some extra logic to
handle the negation if b is negative.</p>
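        <p>
          A minimal scalar sketch of this replacement is shown below (hypothetical function and variable names, not the paper’s kernel code); it assumes the sign of b and the exponent e = log2(|b|) have been precomputed at encoding time:
        </p>
        <p>
#include &lt;stdint.h&gt;

/* One FXPoT multiply-accumulate step: c += a * b with b = ±2^e. */
static inline int32_t fxpot_mac(int32_t c, int32_t a, uint32_t e, int b_negative)
{
    int32_t shifted = a &lt;&lt; e;                      /* a * 2^e as a variable shift */
    return c + (b_negative ? -shifted : shifted);  /* apply the sign of b */
}
        </p>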
        <p>
          For FPoT, in the floating-point multiplication c += a · b, using DenseShift’s floating-point
methodology, we replace a · b with a direct unsigned integer addition of the sign and exponent fields of a with a
PoT-quantized value [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], assuming that the value is not represented in two’s complement and that its sign
and positive exponent field align with those of a (Fig. 1).
        </p>
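        <p>
          The bit-level effect of this replacement can be sketched in scalar C as follows (a hedged illustration of the idea rather than the vectorized kernels; names are ours). It assumes a is a normal number, e ≥ 0, and that the exponent addition does not overflow, which is exactly the case treated separately in Sec. 6:
        </p>
        <p>
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

/* One FPoT multiply: returns a * (±2^e) by adding e directly to the IEEE 754
   exponent field of a and flipping the sign bit when b is negative. */
static inline float fpot_mul(float a, uint32_t e, int b_negative)
{
    uint32_t bits;
    memcpy(&amp;bits, &amp;a, sizeof bits);   /* reinterpret float as uint32 */
    bits += e &lt;&lt; 23;                  /* unsigned add on the exponent field */
    if (b_negative)
        bits ^= 0x80000000u;          /* flip the sign bit */
    memcpy(&amp;a, &amp;bits, sizeof a);
    return a;
}
        </p>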
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Encoding</title>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Baselines</title>
        <p>This paper aims to investigate how PoT influences MatMul inference. Therefore, we use a naive
MatMul implementation (no tiling, single-threaded). The code consists of independent executables for
each configuration of PoT kernels, each with three baselines: scalar, scalar-autovectorized, and
intrinsic-based. For example, Alg. 1 shows the baseline Float32 MatMul kernel with RVV-1.0. Note
that AVX512 does not have an FMA instruction for integer types, so non-fused multiplication and
addition instructions are used.</p>
        <p>Algorithm 1 The baseline RVV-1.0 F32:F32.</p>
        <p>1: for i ← 0 : N do UF0
2:   for j ← 0 : N do UF1
3:     v_acc ← 0, c ← 0 ◁ vfmv_v_f_f32m1 (for v_acc)
4:     for k ← 0 : vl : N do UF2
5:       vl ← vsetvl_e32m1(N − k)
6:       v_a ← A[i, k], v_b ← B[j, k] ◁ vle32_v_f32m1
7:       v_acc ← v_acc + v_a · v_b ◁ vfmacc_vv_f32m1
8:     c ← reduce(v_acc) ◁ vfmv_f_s_f32m1_f32(vfredosum_vs_f32m4_f32m1())
9:     C[i, j] ← c</p>
        <p>3.5. Autotuning</p>
        <p>We used brute-force autotuning implemented in Bash, sweeping all the possible combinations of the
defined tunable parameters. A shared combination of unrolling factors for all kernels in a benchmark
executable was used to reduce the autotuning time. Each PoT kernel benchmark has one executable.
The tuning objective is to minimize the runtime of the PoT kernels. The unrolling factors assigned to
each loop (if any) are shown with the blue markings on the algorithms.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. FPoT-quantized MatMul on AVX512</title>
      <p>Tensor B_PoT can hold more than one element per word and has to be unpacked so that each element aligns
with the relevant Float32 fields of tensor A. We introduce two methods for unpacking. The first
unpacking logic, which is shared by the two RVV-1.0 kernels and one of the AVX512 kernels, and will
from now on be referred to as Unpacking1, consists of unpacking 8-bit elements into 32-bit elements.</p>
      <p>The second unpacking method, referred to as Unpacking2, is implemented using the permutexvar
intrinsic to place an element of value zero to the left of every 16-bit element of the packed vector, while
unpacking to 32-bit vector registers. This is done by providing 0xFF as the indices for permutexvar.</p>
      <p>The first AVX kernel (F32:U8:E5:P1) unpacks the elements with Unpacking1 and creates a mask from
the sign bits, which is used to remove the sign bit itself (mask_sub) before performing the multiplication via addition.
Finally, the sign bits are flipped selectively by mask_xor (Alg. 2). The second AVX kernel, with Unpacking2
(F32:U16:E8:P1), uses 16-bit words to simplify the runtime operations, using permutexvar to unpack
the elements by inserting zeros to align the fields against Float32 (Alg. 3).</p>
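      <p>
        To make the data flow concrete, the following is a hedged AVX512 sketch of one 16-element step of the Unpacking1 path (E5:P1); variable and function names are ours, and the surrounding loop structure and 64-byte unrolling of Alg. 2 are omitted:
      </p>
      <p>
#include &lt;immintrin.h&gt;
#include &lt;stdint.h&gt;

/* acc += A[0..15] * decode(codes[0..15]); each 8-bit code holds a 5-bit
   exponent (bits 0-4) and a sign bit (bit 5). */
static inline __m512 fpot_step_avx512(__m512 acc, const float *a, const uint8_t *codes)
{
    __m128i   packed = _mm_loadu_si128((const __m128i *)codes);                /* 16 x u8 codes */
    __m512i   code   = _mm512_cvtepu8_epi32(packed);                           /* Unpacking1: widen to 32 bits */
    __mmask16 neg    = _mm512_cmpgt_epu32_mask(code, _mm512_set1_epi32(0x1F)); /* sign bit set? */
    __m512i   expo   = _mm512_mask_sub_epi32(code, neg, code, _mm512_set1_epi32(0x20)); /* drop sign bit */
    __m512i   a_bits = _mm512_castps_si512(_mm512_loadu_ps(a));
    __m512i   prod   = _mm512_add_epi32(a_bits, _mm512_slli_epi32(expo, 23));  /* a * 2^e via exponent add */
    prod             = _mm512_mask_xor_epi32(prod, neg, prod, _mm512_set1_epi32((int)0x80000000u));
    return _mm512_add_ps(acc, _mm512_castsi512_ps(prod));
}
      </p>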
      <p>Algorithm 2 AVX512 FPoT Unpacking1 F32:U8:E5:P1
1: v_expmask ← 0x1F ◁ _mm512_set1_epi32
2: v_signbit ← 0x20 ◁ _mm512_set1_epi32
3: v_f32sign ← 0x80000000 ◁ _mm512_set1_epi32
4: for i ← 0 : N do UF0
5:   for j ← 0 : N do UF1
6:     v_acc ← 0
7:     for k ← 0 : 64 : N do UF2
8:       v_b ← B_PoT[j, k]
9:       for t ← 0 : 4 do
10:        v_a ← A[i, k + 16t]
11:        v_code ← unpack(v_b[16t]) ◁ Unpacking1: 16 codes widened to 32 bits
12:        v_aint ← cast(v_a)
13:        mask ← cmp(v_code, v_expmask)
14:        v_code ← mask_sub(v_code, mask, v_code, v_signbit)
15:        v_code ← v_code ≪ 23
16:        v_prod ← v_aint + v_code
17:        v_prod ← mask_xor(v_prod, mask, v_prod, v_f32sign)
18:        v_prodf ← cast(v_prod)
19:        v_acc ← v_acc + v_prodf
20:    c ← reduce(v_acc)
21:    C[i, j] ← c</p>
    </sec>
    <sec id="sec-5">
      <title>5. FPoT-quantized MatMul on RVV</title>
      <p>For RVV, Vector Length Specific (VLS) is a programming style with vectors whose size is known
beforehand and always static, allowing for greater optimization opportunities at the cost of less portability.
In Vector Length Agnostic (VLA) code, the size of the vectors depends on the target machine and
the code has to query for it. This leads to better portability across different machines. As for the Vector
Length Multiplier (LMUL), for a value greater than one, the vector registers are combined to form
a vector register group. If LMUL is less than one, only fractions of the vector registers are utilized
(LMUL ∈ {1/8, 1/4, 1/2, 1, 2, 4, 8}). If set to dynamic, the compiler will adapt the LMUL for the code, unless a
specific LMUL is enforced by the code.</p>
      <p>Algorithm 3 AVX512 FPoT Unpacking2 F32:U16:E8:P1
1: indices ← {0xFF, 0, 0xFF, 1, 0xFF, 2, . . . , 0xFF, 15} ◁ alignas(64)
2: v_indices ← indices ◁ _mm512_load_si512
3: for i ← 0 : N do UF0
4:   for j ← 0 : N do UF1
5:     v_acc ← 0
6:     for k ← 0 : 16 : N do UF2
7:       v_a ← A[i, k]
8:       v_aint ← cast(v_a)
9:       v_b ← B_PoT[j, k]
10:      v_b512 ← cast(v_b)
11:      v_b512,unpacked ← permute(v_indices, v_b512)
12:      v_prod ← v_aint + v_b512,unpacked
13:      v_prodf ← cast(v_prod)
14:      v_acc ← v_acc + v_prodf
15:    sum ← reduce(v_acc)
16:    C[i, j] ← sum</p>
      <p>Both F32:U8:E5:P1 (Alg. 5) and F32:U8:E3:P2 (Alg. 4), which use
Unpacking1, are VLA and implemented via the riscv_vwcvtu intrinsic. They both create a mask for the
negative elements (cmp) to remove the sign bits (mask_sub) and XOR the sign bits (mask_xor) after
the multiplication by addition (Alg. 5). In the other kernel, shown in Alg. 4, after the initial unpacking, the
lower and upper nibbles of the 8-bit elements of the vectors are concatenated and reordered via the use
of concat and gather. Then, a similar logic to that of E5:P1 is utilized.</p>
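      <p>
        As a hedged illustration of the E5:P1 decode path on RVV-1.0 (names are ours, the intrinsics use the current __riscv_ prefix, and the sign is applied with a mask-free XOR rather than the masked sub/xor used by the kernels above):
      </p>
      <p>
#include &lt;riscv_vector.h&gt;
#include &lt;stdint.h&gt;

/* One VLA strip: acc += A[0..vl-1] * decode(codes[0..vl-1]); each 8-bit code
   holds a 5-bit exponent (bits 0-4) and a sign bit (bit 5). */
static inline vfloat32m8_t fpot_strip_rvv(vfloat32m8_t acc, const float *a,
                                          const uint8_t *codes, size_t vl)
{
    vuint8m2_t  code = __riscv_vle8_v_u8m2(codes, vl);
    vuint8m2_t  expo = __riscv_vand_vx_u8m2(code, 0x1F, vl);        /* exponent bits */
    vuint8m2_t  sign = __riscv_vsrl_vx_u8m2(code, 5, vl);           /* sign bit into bit 0 */
    vuint32m8_t e32  = __riscv_vwcvtu_x_x_v_u32m8(__riscv_vwcvtu_x_x_v_u16m4(expo, vl), vl);
    vuint32m8_t s32  = __riscv_vwcvtu_x_x_v_u32m8(__riscv_vwcvtu_x_x_v_u16m4(sign, vl), vl);
    vuint32m8_t abit = __riscv_vreinterpret_v_f32m8_u32m8(__riscv_vle32_v_f32m8(a, vl));
    vuint32m8_t prod = __riscv_vadd_vv_u32m8(abit, __riscv_vsll_vx_u32m8(e32, 23, vl), vl);
    prod             = __riscv_vxor_vv_u32m8(prod, __riscv_vsll_vx_u32m8(s32, 31, vl), vl);
    return __riscv_vfadd_vv_f32m8(acc, __riscv_vreinterpret_v_u32m8_f32m8(prod), vl);
}
      </p>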
      <p>Algorithm 4 RVV-1.0 FPoT F32:U8:E3:P2.</p>
      <p>1: vlmax ← vsetvlmax_e32m8()
2: elements ← vsetvlmax_e8m2()
3: indices ← {0}
4: for t ← 0 : elements do
5:   if t mod 2 then
6:     indices[t] ← t/2
7: v_index ← indices ◁ vle8_v_u8m2
8: for i ← 0 : N do UF0
9:   for j ← 0 : N do UF1
10:    v_acc ← 0 ◁ vfmv_v_f_f32m8
11:    c ← 0
12:    for k ← 0 : vl : N do UF2
13:      vl ← vsetvl_e32m8(N − k)
14:      v_a ← A[i, k] ◁ vle32_v_f32m8
15:      v_b ← B_PoT[j, k] ◁ vle8_v_u8m1
16:      v_lo ← lowerHalf(v_b) ◁ vand_vx_u8m1
17:      v_hi ← upperHalf(v_b) ◁ vand_vx_u8m1
18:      v_cat ← concat(v_lo, v_hi) ◁ vset_v_u8m1_u8m2
19:      v_code ← gather(v_cat, v_index) ◁ vrgather_vv_u8m2
20:      mask ← cmp(v_code, 0b11111) ◁ vmsgtu_vx_u8m1_b8
21:      v_code ← mask_sub(v_code, mask, v_code, 0b100000) ◁ vsub_vx_u8m1_tumu
22:      v_exp ← expand(v_code) ◁ vwcvtu_x_x_v_u32m8(vwcvtu_x_x_v_u16m4())
23:      v_exp ← v_exp ≪ 23 ◁ vsll_vx_u32m8
24:      v_auint ← rintrp(v_a) ◁ vreinterpret_v_f32m8_u32m8
25:      v_uint ← v_auint + v_exp ◁ vadd_vv_u32m8
26:      v_prod ← rintrp(v_uint) ◁ vreinterpret_v_u32m8_f32m8
27:      v_acc ← v_acc + v_prod ◁ vfadd_vv_f32m8
28:    c ← reduce(v_acc) ◁ vfmv_f_s_f32m1_f32(vfredosum_vs_f32m4_f32m1())
29:    C[i, j] ← c</p>
      <p>Algorithm 5 RVV-1.0 FPoT F32:U8:E5:P1.</p>
      <p>6. Handling ±∞ and NaN</p>
      <p>
        Since FPoT replaces a multiplication of two floating-point values with an integer addition, the edge
cases defined in IEEE 754 are no longer handled by the hardware. Therefore, the exponent bits can overflow after
the addition and flip the sign bit of the output. DenseShift [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is also prone to
this issue. A full software solution for this could render the FPoT speedup pointless, but we can still
implement extra logic to manage just the ±∞ and NaN elements of A. These values (±∞ and NaN)
both have their exponent bits set to 0xFF. Only the non-trainable tensor (A) can contain such values, as the trainable tensor is restricted to
valid PoT-quantized levels in DenseShift, which do not include zero or the special IEEE 754 representations.
As shown in Fig. 2, since ∞ · x = ∞ if x is not ∞, NaN, or zero, as long as the FPoT exponents are
not added to these problematic elements of tensor A, the accumulated results in Float32 will be correct. Later,
the hardware handles the edge cases for any pairs of ±∞ or NaN values in Float32. As shown
in Alg. 6, the mask mask_inf is used to perform a masked addition that excludes the problematic elements from the
addition with the FPoT exponents. The extra instructions needed to perform the masked addition come
with an acceptable overhead, discussed in Sec. 8.3.
      </p>
      <p>Fig. 2: Elements of A whose exponent field is 0xFF (±∞ or NaN) bypass the unsigned addition of the PoT exponent and are passed through unchanged to the Float32 accumulation into C.</p>
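      <p>
        A hedged AVX512 sketch of the bypass in Fig. 2 is shown below (names are ours; Alg. 6 integrates the same masked addition into the kernel of Alg. 2):
      </p>
      <p>
#include &lt;immintrin.h&gt;

/* Add the PoT exponent addend (already shifted to bit 23) to the bit patterns of A,
   but only in lanes where A is finite; ±Inf/NaN lanes pass through unchanged. */
static inline __m512i fpot_add_skip_special(__m512i a_bits, __m512i addend)
{
    const __m512i exp_mask = _mm512_set1_epi32(0x7F800000);
    __m512i   exp_a  = _mm512_and_epi32(a_bits, exp_mask);
    __mmask16 finite = _mm512_cmpneq_epi32_mask(exp_a, exp_mask);  /* exponent field != 0xFF */
    return _mm512_mask_add_epi32(a_bits, finite, a_bits, addend);  /* masked addition */
}
      </p>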
    </sec>
    <sec id="sec-6">
      <title>7. FXPoT</title>
      <p>To clarify the nature of the speedup, if any, of the FXPoT kernels compared to a normal fixed-point MatMul,
we developed two benchmarks for AVX512 and RVV. The idea is to see whether a performance gain can
be achieved by using shift instructions to replace integer multiplications in a loop. Therefore, we
excluded the negation logic for handling negative numbers from the code and opted for I32:I32 data
types for the baseline kernel. For AVX512, the multiplication intrinsic instances (mullo_epi32)
are replaced by bitwise left shifts. The elements of A and B are loaded with load_si512,
left-shifted with sllv_epi32, and accumulated by add_epi32. The resulting vector is then reduced to a
scalar using reduce_add_ps and written to the output tensor C. The same approach is used for the
RVV kernel using sll_vv_i32m4, vadd_vv_i32m4, and vredsum_vs_i32m4_i32m1. Note that the
FXPoT kernels do not need more than 5 bits to store the exponent bits, since the selected base data type
is 32 bits wide, which means the PoT type is I32:U32:E5:P1.</p>
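      <p>
        A hedged sketch of the two inner-loop variants compared here is given below (names are ours; the actual kernels additionally accumulate over the full row and reduce, as in the other listings):
      </p>
      <p>
#include &lt;immintrin.h&gt;
#include &lt;stdint.h&gt;

/* Baseline step: acc += a * b with 32-bit integer multiplies. */
static inline __m512i i32_mac_step(__m512i acc, const int32_t *a, const int32_t *b)
{
    __m512i va = _mm512_loadu_si512((const void *)a);
    __m512i vb = _mm512_loadu_si512((const void *)b);
    return _mm512_add_epi32(acc, _mm512_mullo_epi32(va, vb));
}

/* FXPoT step: b is stored as exponents e, so a * 2^e becomes a variable left shift. */
static inline __m512i fxpot_mac_step(__m512i acc, const int32_t *a, const uint32_t *e)
{
    __m512i va = _mm512_loadu_si512((const void *)a);
    __m512i ve = _mm512_loadu_si512((const void *)e);
    return _mm512_add_epi32(acc, _mm512_sllv_epi32(va, ve));
}
      </p>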
    </sec>
    <sec id="sec-7">
      <title>8. Experimental Evaluation</title>
      <p>Autotuning For RISC-V, we used dynamic LMUL and let the autotuner script choose between VLA
and VLS. Note that the VLS build is selected by adding the GCC flag -mrvv-vector-bits=zvl and the LLVM
flag -mrvv-vector-bits=256. For GCC, the VLS build flags work in conjunction with -march. To
manage slow tuning, we only considered tuning the unrolling factor of the inner-most loop with a
value from {1, 2, 4}. We kept SMT (i.e., Intel’s hyperthreading) active and the rest of the cores idling.
The CPU affinity was enforced with taskset. The SIMD kernels’ measurements are repeated 75 times.
Tensor Shapes We considered square matrices (N × N) that satisfy the constraints of all our baseline
and PoT kernels. For the AVX512 kernels, the most restrictive constraint is for N to be a multiple of 64.
For the RVV kernels, the VLA design imposes no restrictions on N, while the E3:P2 encoding requires N
to be a multiple of 2. Therefore, we considered N ∈ {64, 128, 256, 512, 768, 1024, 2048, 4096, 5120}
to cover a wide range of shapes while keeping the runtime of the experiments manageable. Handling
the cases of non-square matrices and of matrices that do not satisfy the constraints of our kernels is
straightforward, but left for future work, as our main goal is to prove the effectiveness of the core idea
behind FPoT and FXPoT in a fair setting.</p>
      <sec id="sec-7-1">
        <title>8.1. AVX512 Results</title>
        <p>The results for Alg. 2 and 3 for various square matrix sizes (N) are shown in Fig. 3. The same data
for  = 5 and LLVM 18 is shown in Fig. 4. Unpacking1 is 1.13x (geomean) faster than Unpacking2
with the LLVM compilers. The profiling data on Xeon 5218 with LLVM 18 using Intel Advisor and VTune
indicates that the baselines are L2-bound and perform worse for  = 3 as opposed to  = 1, while
the FPoT kernels sustain the same performance thanks to the increased arithmetic intensity of FPoT due to
the spared loads. This leads to the steep increase of the speedup in 1024 ≤ N &lt; 4096. For N ≤ 768, the
tensors can fit entirely in the caches, diminishing the effectiveness of the FPoT kernels over the baselines.
Because Unpacking2-based kernels use F32:U16 rather than F32:U8 as in Unpacking1, the increase
in arithmetic intensity is smaller, resulting in performance that relies more heavily on the L3 cache.
This is evident in the green lines in Fig. 3; performance correlates with L3 cache size (Ryzen 9 &gt; Xeon
8260 &gt; Xeon 5218) and the kernels that use Unpacking2 achieve higher speedups than the ones using
Unpacking1 on devices with larger L3 caches. Unpacking1 (blue lines) achieves better speedups on devices
with smaller L3 caches. Furthermore, a better performance is observed for Unpacking2-based FPoT on
Xeon 8260 compared to Xeon 5218. The official specifications indicate that Xeon 5218 has one FMA unit,
while Xeon 8260 has two, which may partly explain the steeper speedup increase on Xeon 5218. The
documentation of Ryzen 9 does not specify the number of available FMA units. Unlike on the Intel CPUs,
Unpacking2-based FPoT is 1.72x faster than Unpacking1 on Ryzen 9 with the GCC compilers. As shown in Table 2,
Ryzen 9 has faster memory. Therefore, the load instructions spared by PoT are less effective compared
to the Intel devices. LLVM 17 and 18 perform best, with a geomean speedup of 1.28 on Intel and 0.85 on AMD.
The maximum speedups for Xeon 5218, Xeon 8260, and Ryzen 9 are 3.67x, 3.60x, and 2.28x, respectively.</p>
        <p>8.2. RVV-1.0 Results</p>
        <p>The results for Alg. 1, 5, and 4 for different N values on Banana Pi F3 with the SpacemiT-K1 SoC are shown
in Fig. 3. The results for  = 5 with LLVM 18 are demonstrated in Fig. 4. The geomean speedup of
FPoT E3:P2 over E5:P1 is 0.72 for GCC, due to the overhead of unpacking. The maximum speedup on
SpacemiT-K1 is 1.45x with LLVM 17, followed by 1.43x with GCC 14.2 and 1.42x with LLVM 18. The
geomean speedup of LLVM 17 and 18 is 1.01.</p>
        <p>8.3. AVX512 ±∞ and NaN Handling Results</p>
        <p>In this section, we experimentally evaluate the infinity and NaN handling algorithm, as described in
Sec. 6 and Alg. 6. We built the kernel with LLVM 18 and ran it on Xeon 5218. A slightly lower speedup
is observed (Fig. 5) compared to the kernels that do not account for the extra ±∞ and NaN handling
(Fig. 3). The maximum speedup for this kernel is 2.78x, while the kernel described in Alg. 2 achieves 3.67x.</p>
      </sec>
      <sec id="sec-7-2">
        <title>8.4. FXPoT Results</title>
        <p>We used LLVM 18 on Xeon 5218 and Banana Pi F3 (SpacemiT-K1) with autotuning. Fig. 6 provides the
results for these kernels. As can be seen, FXPoT provides no meaningful and dependable speedup over
the baseline. Since both FXPoT and its baseline use a 32-bit data type (I32 and U32) without saving
any load instructions, the transient gain for N ≤ 512 on Xeon 5218 reflects architectural properties
such as the available units for shifting and multiplication.</p>
      </sec>
      <sec id="sec-7-3">
        <title>8.5. Autovectorization</title>
        <p>To measure how well different compilers were able to auto-vectorize the scalar baseline kernels, we define
Speedup_SS = t_SNA / t_SAV. Note that SAV and SNA refer to scalar kernels with and without auto-vectorization.
We also provide Speedup_VS = t_SNA / t_SIMD to compare FPoT kernels against the scalar non-autovectorized baselines.
SIMD refers to the FPoT handwritten SIMD kernels, all shown in Fig. 7, for the matrix size of 5 and
LLVM 18. The compiler flags used for these settings are similar to the ones listed in Table 3, but without the
flags that disable autovectorization. The SNA and SAV kernels’ measurements are repeated 7 times. Note
that Speedup_VS is a good approximate indicator of the upper bound for Speedup_SS. For AVX512, GCC
achieves the highest geomean Speedup_SS, with 7.34 on Intel and 14.75 on AMD. For Speedup_VS, GCC also
leads with 9.0 (Intel) and 11.81 (AMD). For RVV, the best-performing compiler is GCC with a geomean
Speedup_SS of 3.66.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>9. Discussion</title>
      <p>
        Based on our results, FPoT provided a maximum speedup of 3.67x. Using the geomean speedup with the
best compiler, implementing FPoT with permutexvar yielded a 1.16x speedup on Ryzen 9, but resulted
in a slowdown (0.87x) on the Intel CPUs (Fig. 3). Furthermore, on SpacemiT-K1, FPoT achieved a geomean
speedup of 1.39x without packing multiple elements into a byte (Fig. 3). Finally, the LLVM compilers
yielded better speedups overall for AVX512 and RVV (Fig. 3). While DenseShift [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] uses F16:F16 GEMM
kernels from NCNN [23], we opted for Float32 handwritten baselines, as the AVX512-FP16 ISA
extension has only been available since Sapphire Rapids on Intel, even though the SpacemiT-K1 has support
for Float16. A direct comparison with DenseShift would not be fair, considering the different CPU
architectures, ISAs, data types, workloads, and missing details on the encoding scheme used in DenseShift.
Unfortunately, we did not have access to an ARM A57 SoC to reproduce the results from DenseShift with
our implementation. We hope our work will encourage future research in this direction, providing
detailed information on the encoding scheme and the baselines used for comparison. Nevertheless,
we observed a similar speedup of 1.45x on SpacemiT-K1 with RVV-1.0 compared to the speedup of
1.6x on ARM A57 Neon [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] (we report the results published in the paper because the artifacts are not
available). To the best of our knowledge, no other work provides results for handwritten AVX512- and
RVV-1.0-based FPoT-quantized MatMul with various configurations.
      </p>
    </sec>
    <sec id="sec-10">
      <title>10. Conclusion</title>
      <p>
        This work focuses on AVX512 and RVV-1.0 implementations of floating-point PoT-quantized MatMul
inference kernels, initially proposed by DenseShift. On AVX512, we proposed two ways to implement
unpacking for floating-point PoT MatMul with one element per word. Furthermore, for RVV-1.0, we
explored the implementation of floating-point PoT MatMul inference with one and two elements per
word. We proposed and evaluated a novel approach to handle ±∞ and NaN floating-point values.
Overall, the experimental results indicated that floating-point PoT quantization for MatMul inference
yields speedups of up to 3.67x on Xeon 5218 with AVX512 and 1.45x on SpacemiT-K1 with RVV-1.0 in the
single-threaded configuration.
      </p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
      <p>[10] S. Hochreiter, The vanishing gradient problem during learning recurrent neural nets and problem
solutions, Int. J. of Uncertainty, Fuzziness and Knowledge-Based Systems 6 (1998) 107–116.
[11] H. You, H. Shi, Y. Guo, Y. Lin, ShiftAddViT: Mixture of multiplication primitives towards efficient
vision transformer, in: Advances in Neural Information Processing Systems 36: Annual Conference
on Neural Information Processing Systems, NeurIPS, 2023.
[12] H. Shi, X. Cheng, W. Mao, Z. Wang, P2-ViT: Power-of-two post-training quantization and
acceleration for fully quantized vision transformer, IEEE Transactions on Very Large Scale Integration
(VLSI) Systems 32 (2024) 1704–1717. doi:10.1109/TVLSI.2024.3422684.
[13] D. A. Gudovskiy, L. Rigazio, ShiftCNN: Generalized low-precision architecture for inference of
convolutional neural networks, arXiv preprint arXiv:1706.02393 (2017).
[14] R. Saha, J. Haris, J. Cano, Accelerating PoT quantization on edge devices, in: 2024 31st IEEE Int.
Conf. on Electronics, Circuits and Systems (ICECS), IEEE, 2024, pp. 1–4.
[15] T. Dupuis, Y. Fournier, M. AskariHemmat, N. El Zarif, F. Leduc-Primeau, J. P. David, Y. Savaria,
SparQ: A custom RISC-V vector processor for efficient sub-byte quantized inference, in: 21st IEEE
Interregional NEWCAS Conf. (NEWCAS), IEEE, 2023, pp. 1–5.
[16] G. Rutishauser, J. Mihali, M. Scherer, L. Benini, xTern: Energy-efficient ternary neural network
inference on RISC-V-based edge systems, in: IEEE 35th Int. Conf. on Application-specific Systems,
Architectures and Processors (ASAP), IEEE, 2024, pp. 206–213.
[17] S. Kalapothas, M. Galetakis, G. Flamis, F. Plessas, P. Kitsos, A survey on RISC-V-based machine
learning ecosystem, Information 14 (2023) 64.
[18] X. Geng, S. Liu, J. Jiang, K. Jiang, H. Jiang, Compact powers-of-two: An efficient non-uniform
quantization for deep neural networks, in: Design, Automation &amp; Test in Europe Conference &amp;
Exhibition (DATE), 2024, pp. 1–6. doi:10.23919/DATE58400.2024.10546652.
[19] T. Xia, B. Zhao, J. Ma, G. Fu, W. Zhao, N. Zheng, P. Ren, An energy-and-area-efficient CNN
accelerator for universal powers-of-two quantization, IEEE Transactions on Circuits and Systems
I: Regular Papers 70 (2023) 1242–1255. doi:10.1109/TCSI.2022.3227608.
[20] J.-H. Li, J.-K. Lin, Y.-C. Su, C.-W. Chu, L.-T. Kuok, H.-M. Lai, C.-L. Lee, J.-K. Lee, SIMD Everywhere
optimization from ARM NEON to RISC-V vector extensions, arXiv preprint arXiv:2309.16509 (2023).
[21] Y. Li, X. Dong, W. Wang, Additive powers-of-two quantization: An efficient non-uniform
discretization for neural networks, in: 8th Int. Conf. on Learning Representations, ICLR, 2020.
[22] Y. M. Kim, K. Han, W.-K. Lee, H. J. Chang, S. O. Hwang, Non-zero grid for accurate 2-bit additive
power-of-two CNN quantization, IEEE Access 11 (2023) 32051–32060. doi:10.1109/ACCESS.2023.3259959.
[23] H. Ni, The NCNN contributors, NCNN, 2017. URL: https://github.com/Tencent/ncnn.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W. A.</given-names>
            <surname>Wulf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>McKee</surname>
          </string-name>
          ,
          <article-title>Hitting the memory wall: implications of the obvious</article-title>
          ,
          <source>SIGARCH Comput. Archit. News</source>
          <volume>23</volume>
          (
          <year>1995</year>
          )
          <fpage>20</fpage>
          -
          <lpage>24</lpage>
          . doi:
          <volume>10</volume>
          .1145/216585.216588.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoffmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Borgeaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          , E. Buchatskaya,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rutherford</surname>
          </string-name>
          , D. d. L.
          <string-name>
            <surname>Casas</surname>
            ,
            <given-names>L. A.</given-names>
          </string-name>
          <string-name>
            <surname>Hendricks</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Welbl</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Training compute-optimal large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2203.15556</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Frantar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ashkboos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hoefler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Alistarh</surname>
          </string-name>
          , GPTQ:
          <article-title>Accurate post-training quantization for generative pre-trained transformers</article-title>
          ,
          <source>arXiv preprint arXiv:2210.17323</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Elhoushi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Shafiq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. H.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>DeepShift: Towards multiplication-less neural networks</article-title>
          ,
          <source>in: IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops</source>
          <year>2021</year>
          , virtual, June 19-25,
          <year>2021</year>
          , Computer Vision Foundation / IEEE,
          <year>2021</year>
          , pp.
          <fpage>2359</fpage>
          -
          <lpage>2368</lpage>
          . doi:
          <volume>10</volume>
          .1109/CVPRW53098.
          <year>2021</year>
          .
          <volume>00268</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. P.</given-names>
            <surname>Nia</surname>
          </string-name>
          ,
          <article-title>DenseShift: Towards accurate and efficient low-bit power-of-two quantization</article-title>
          ,
          <source>in: IEEE/CVF Int. Conf. on Computer Vision</source>
          , ICCV, IEEE,
          <year>2023</year>
          , pp.
          <fpage>16964</fpage>
          -
          <lpage>16974</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICCV51070.
          <year>2023</year>
          .
          <volume>01560</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Jamali Golzar</surname>
          </string-name>
          , G. Karimian,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shoaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fattahi Sani</surname>
          </string-name>
          ,
          <article-title>DGCNN on FPGA: acceleration of the point cloud classifier using FPGAs</article-title>
          ,
          <source>Circuits, Systems, and Signal Processing</source>
          <volume>42</volume>
          (
          <year>2023</year>
          )
          <fpage>748</fpage>
          -
          <lpage>779</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Sarma</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. M. Bronstein</surname>
            ,
            <given-names>J. M.</given-names>
          </string-name>
          <string-name>
            <surname>Solomon</surname>
          </string-name>
          ,
          <article-title>Dynamic graph cnn for learning on point clouds</article-title>
          ,
          <source>ACM Transactions on Graphics (tog) 38</source>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nagel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fournarakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Amjad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bondarenko</surname>
          </string-name>
          , M. van Baalen,
          <string-name>
            <given-names>T.</given-names>
            <surname>Blankevoort</surname>
          </string-name>
          ,
          <article-title>A white paper on neural network quantization</article-title>
          ,
          <source>ArXiv abs/2106</source>
          .08295 (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          , W. Liu,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. P.</given-names>
            <surname>Nia</surname>
          </string-name>
          ,
          <article-title>S3: sign-sparse-shift reparametrization for effective training of low-bit shift networks</article-title>
          ,
          <source>in: Proceedings of the 35th Int. Conf. on Neural Information Processing Systems</source>
          , NIPS,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>