<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <!-- Pisa, Italy. Corresponding author: adriano.marquesgarcia@unito.it (A. M. Garcia); giulio.malenza@unito.it (G. Malenza); robert.birke@unito.it (R. Birke); marco.aldinucci@unito.it (M. Aldinucci) -->
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Assessing Large Language Models Inference Performance on a 64-core RISC-V CPU with Silicon-Enabled Vectors</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Adriano Marques Garcia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giulio Malenza</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Robert Birke</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Aldinucci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, University of Torino</institution>
          ,
          <addr-line>Corso Svizzera 185, 10149 Torino</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>The rising usage of compute-intensive AI applications with fast response-time requirements, such as text generation using large language models, underscores the need for more efficient and versatile hardware solutions. This drives the exploration of emerging architectures like RISC-V, which has the potential to deliver strong performance within tight power constraints. The recent commercial release of processors with silicon-enabled RISC-V Vector (RVV) extensions further amplifies the significance of RISC-V architectures, offering enhanced capabilities for parallel processing and for accelerating tasks critical to large language models and other AI applications. This work evaluates the inference performance of the BERT and GPT-2 language models on the SOPHON SG2042 64-core RISC-V architecture with silicon-enabled RVV v0.7.1. We benchmarked the models with and without RVV, using OpenBLAS and BLIS as BLAS backends for PyTorch to enable vectorization. Enabling RVV in OpenBLAS improved inference performance by up to 40% in some cases.</p>
      </abstract>
      <kwd-group>
        <kwd>RISC-V</kwd>
        <kwd>RVV</kwd>
        <kwd>PyTorch</kwd>
        <kwd>LLM</kwd>
        <kwd>XuanTie C920</kwd>
        <kwd>SOPHON SG2042</kwd>
        <kwd>OpenBLAS</kwd>
        <kwd>BLIS</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The contributions of this work are as follows: (1) We provide a tutorial on how to enable RISC-V
Vectors v0.7.1 support for PyTorch on the SOPHON SG2042. (2) We evaluate how enabling RVV impacts
the inference performance of large language models (BERT and GPT-2). (3) We evaluate an experimental
version of the BLIS library that leverages RVV 0.7.1 on a real RISC-V architecture. (4) We analyze the
scalability of model inference performance when increasing the parallelism of the SOPHON SG2042
64-core RISC-V processor.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <sec id="sec-2-1">
        <title>2.1. Large Language Models (LLMs)</title>
        <p>
          A language model can be defined as a probabilistic model of natural language: a model that
predicts the likelihood of a sequence of words given the preceding context [
          <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
          ]. Such models learn a
language structure, grammar, and vocabulary from vast amounts of text data [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. They then use that
knowledge to perform various natural-language processing tasks, such as text generation, translation,
summarization, sentiment analysis, and more [
          <xref ref-type="bibr" rid="ref11 ref12 ref13">11, 12, 13</xref>
          ]. The first language model was proposed in
1980, and researchers and industry have been steadily improving these models over the past
decades. In 2018, researchers from Google introduced the BERT model [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], which was notable
for its dramatic improvement over previous state-of-the-art models. BERT adopted the
transformer architecture and its attention mechanism, paving the way for the upcoming large language
models (LLMs). LLMs experienced a boost in popularity among the general public with the release
of ChatGPT, a chat tool based on the series of Generative Pre-trained Transformer (GPT) [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] models
developed by OpenAI [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
        <p>
          GPT-2 [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] is a well-known state-of-the-art language model developed by OpenAI, the second in
the series. It is also the last model fully disclosed by OpenAI before Microsoft acquired the exclusive
rights to GPT-3 [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. It is a causal language model that predicts the next token (word) in the
sequence (sentence) by attending only to the tokens to its left; that is, the model cannot see into the
future. Although the latest GPT version is GPT-4 Turbo, GPT-2 still offers strong text generation
capabilities. It is designed to generate human-like text based on the input it receives.
        </p>
        <p>
          BERT [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] stands for Bidirectional Encoder Representations from Transformers. It is a language
model based on the transformer architecture. BERT is not a causal language model but a masked one. It
is an encoder-only architecture, lacking a decoder, which means that BERT cannot be prompted to
generate text directly. Although it was not originally designed to predict the next sentences or
words of a text, it can still be fine-tuned for this purpose. However, this increases the computational cost,
and the quality of the generated text may not be comparable to that of other models, such as GPT-2.
        </p>
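        <p>As a toy illustration of the causal restriction described above (our own sketch, not code from either model), a lower-triangular attention mask lets position i attend only to positions up to i, so a causal model like GPT-2 never sees future tokens, while BERT's bidirectional encoder uses no such mask:</p>

```python
# Toy causal attention mask (illustrative only): True means "may attend".
import torch

seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Row i is True only up to column i: token i cannot attend to later tokens.
print(causal_mask)
assert not causal_mask[0, 4].item()   # the first token cannot see the last
assert causal_mask[4, 0].item()       # the last token sees the first
```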
      </sec>
      <sec id="sec-2-2">
        <title>2.2. RISC-V Vectors (RVV) Enabled Silicon</title>
        <p>
          The vector instructions for RISC-V are defined in the open “V” vector ISA extension standardized by
RISC-V International [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. The first stable release, 1.0, has now been ratified. While most draft versions
never made it into silicon [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], one can find off-the-shelf hardware supporting either v0.7.1 (e.g., the SOPHON
SG2042 SoC used here) or v1.0 (e.g., the Canaan Kendryte K230 SoC).
        </p>
        <p>
          SOPHON SG2042 system-on-chip (SoC) contains 64 RISC-V cores divided into 16 clusters connected
through a grid network [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Each cluster comprises a XuanTie C920 4-core RISC-V CPU. Each core
is equipped with 64KB of L1 instruction cache and 64KB of L1 data cache, and each cluster of 4 cores shares 1MB of
L2 cache. The unified L2 cache can handle two access requests in parallel within one cycle. The grid
interconnect finally offers access to 64MB of level-3 cache shared among all 64 cores. Four DDR4-3200
memory controllers manage access to the main memory system. For peripherals, the SG2042
is equipped with 32 PCIe Gen4 lanes.
        </p>
        <p>
          XuanTie C920 [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] is a homogeneous high-performance 64-bit multi-core RISC-V CPU architecture
designed by T-Head that supports 1 to 4 cores at a maximum operation frequency of 2 GHz. It targets
high-performance applications and implements a 12-stage, out-of-order, multiple-issue superscalar
pipeline. Based on the RISC-V instruction set architecture (ISA), this CPU provides the RV64GCV
instruction set [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], which supports the standard vector instruction extension version 0.7.1 (RVV 0.7.1).
The vector processing unit’s vector registers are 128 bits long and support the FP16, FP32, FP64, INT8,
INT16, INT32, and INT64 data types.
        </p>
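        <p>The 128-bit register width directly bounds how many elements a single vector instruction can process per register; a quick back-of-the-envelope calculation (our own, for illustration):</p>

```python
# Elements per 128-bit RVV register for each data type the C920 supports.
VLEN = 128  # vector register width in bits on the XuanTie C920

widths = {"FP16": 16, "FP32": 32, "FP64": 64,
          "INT8": 8, "INT16": 16, "INT32": 32, "INT64": 64}
lanes = {name: VLEN // bits for name, bits in widths.items()}
print(lanes)  # e.g. 4 FP32 elements or 2 FP64 elements per register
```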
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. PyTorch with Vector Support</title>
      <p>
        PyTorch can delegate the computation to various low-level libraries to best leverage the capabilities of
the underlying hardware, e.g. Intel MKL [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] when running on Intel CPUs, cuDNN [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] or MIOpen
[
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] when running on NVIDIA or AMD GPUs, respectively, or more generic ones such as the various libraries of the
BLAS family.
      </p>
      <p>
        Colonnelli et al. [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] describe the first porting of PyTorch [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] v2.0 to the RISC-V ISA, but the
underlying platform offered only limited acceleration capabilities, i.e., only fused multiply-add (FMA)
support. Newer RISC-V silicon also supports the RVV standard, providing vector instructions that can
process multiple computations in parallel following the single instruction, multiple data (SIMD) parallel
computing paradigm. To leverage this capability, we rely on the recently ported OpenBLAS and BLIS [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]
libraries. These BLAS-like libraries can be compiled with or without RVV support based on the defined
build target. Notice that, due to the different versions of the RVV standard, a careful match between target
architecture and a compatible compiler is required to correctly compile vector-enabled code for the
SG2042 SoC.
      </p>
      <p>
        In more detail, we first compiled OpenBLAS v0.3.26 using the XuanTie GNU Compiler Toolchain
v2.8.0 [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] and set TARGET=C910V to enable RVV. For BLIS, we used a modified version [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] based on
BLIS v0.9.0, set “rv64iv0p7” as the target, and compiled it using the LLVM-EPI 0.7 compiler (https://repo.hca.bsc.es/gitlab/rferrer/llvm-epi). Then, we
compiled PyTorch v2.3 for Python v3.10.10, using XuanTie’s GCC v13.2, OpenMP v4.5, and OpenBLAS/BLIS,
enabling only the following build options: USE_OPENMP=1 to leverage the multiple cores on the
SG2042 SoC; USE_BLAS=1 and USE_LAPACK=1 to use the BLAS libraries as the main computation backend;
USE_KINETO=ON for profiling; and USE_NUMPY=ON for NumPy support. Note that since the work
in [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], the SLEEF vector math library now supports only RVV v1.0, which is incompatible with the
SG2042 SoC.
      </p>
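        <p>After building, one can sanity-check that the resulting PyTorch actually picked up the intended backends. A minimal check (our own sketch, using standard PyTorch introspection APIs):</p>

```python
# Inspect the compile-time configuration of a PyTorch build (any platform).
import torch

cfg = torch.__config__.show()        # compile-time flags and BLAS/LAPACK info
print(cfg)                           # should mention OpenMP and the BLAS in use
print(torch.backends.openmp.is_available())  # True if USE_OPENMP took effect
print(torch.get_num_threads())       # intra-op threads available to BLAS/ATen
```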
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Methodology and Results</title>
      <p>This section presents the model inference performance of BERT and GPT-2 for text generation. We ran
the experiments on the Milk-V Pioneer Box, a commercial ready-to-use development platform in the
form of a desktop computer equipped with a Pioneer motherboard powered by the SOPHON SG2042.
The one we used in this paper is equipped with 128GB of DDR4 RAM (3200MHz) and a 1TB PCIe 3.0
SSD. The operating system is Linux fedora-riscv 6.1.31.</p>
      <sec id="sec-4-1">
        <title>4.1. Benchmarks</title>
        <p>
          We wrote a simple text-generation benchmark application based on the GPT-2 and BERT language
models to test and evaluate their inference performance. This application receives a prompt as input
and generates/predicts the next tokens (words) as output. For GPT-2, we used the GPT-2 (revision
909a290) model provided by OpenAI, with 124M parameters [
          <xref ref-type="bibr" rid="ref27 ref28">27, 28</xref>
          ]. For BERT, we used the
bert-large-cased pre-trained version (commit 06fa25d, https://huggingface.co/google-bert/bert-large-cased), which is similar in size to GPT-2 but has 24 layers, 4096
intermediate dimensions, 1024 hidden dimensions, 16 attention heads, and 336M parameters [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. To
make the generated text more human-like, we added extra routines to control some aspects, such as
sentence length.
        </p>
        <p>For all the experiments, we used the following input prompt for both BERT and GPT-2:
– “The quick brown fox ran”
Here is an example of the generated output text from GPT-2:</p>
        <p>– “The quick brown fox ran off. This fox needs to think for a while. This fox needs a rest. ”
And here is an example of the text generated by BERT:</p>
        <p>– “The quick brown fox ran. Run fast fox running hard. Go ahead. Running on faster the fox went
forward quickly.”</p>
        <p>Notice that the generation depends on sampling the next-token (word) probabilities via a random
number generator. Hence, unless the seed of the random number generator is fixed, each execution
leads to different sentences being generated.</p>
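        <p>This effect can be reproduced in isolation. A minimal sketch (ours, not the benchmark itself) showing that sampling-based generation is deterministic only under a fixed seed:</p>

```python
# Next-token generation samples from a probability distribution, so two
# runs agree only when the RNG seed is fixed (toy distribution, not a model).
import torch

probs = torch.tensor([0.5, 0.3, 0.2])  # toy next-token distribution

def sample_sequence(seed, n=10):
    torch.manual_seed(seed)
    return [torch.multinomial(probs, 1).item() for _ in range(n)]

assert sample_sequence(42) == sample_sequence(42)  # same seed, same "sentence"
```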
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Performance Results</title>
        <p>The chart with the performance results of BERT on the SOPHON SG2042 RISC-V CPU is presented
in Figure 1. This chart compares the performance of the BERT benchmark with PyTorch using different
BLAS backends. For this, we compiled five different versions of PyTorch: (1) PyTorch without OpenBLAS
or BLIS (PyTorch-default); (2) PyTorch + BLIS compiled for a generic target architecture; (3) PyTorch +
an experimental BLIS library with support for RVV v0.7.1; (4) PyTorch + OpenBLAS targeting a generic
RISC-V architecture (no RVV); (5) PyTorch + OpenBLAS targeting the specific C910 architecture of the
SOPHON SG2042 CPU.</p>
        <p>For generating 25 tokens on the RISC-V CPU with a single thread, BERT with PyTorch-default is
about 2 times slower than PyTorch + OpenBLAS with RVV. With 32 threads, it is 22 times slower. With
24 threads, PyTorch + OpenBLAS with RVV shows a 40% performance gain compared to OpenBLAS
without RVV. This performance gain may stem solely from SIMD operations; however, other
optimizations may also be applied when OpenBLAS is compiled for a specific architecture rather
than a generic one.</p>
        <p>On the other hand, BLIS shows no significant performance gains over the sequential execution with
PyTorch-default. We wrote test programs in C++ to check the capabilities of OpenBLAS and BLIS in generating
RVV instructions for GEMM operations, and both libraries passed the test: they generate RVV 0.7.1
instructions in the assembly code, which are properly executed. In these tests, BLIS with RVV behaves
as expected and delivers better performance. Our test programs implement the same main operation
the models execute (aten::addmm). We also ensured that PyTorch was running the BLIS library as the
backend for the main operations and that the parallelism was being set correctly. Therefore, a deeper
investigation will be needed to understand the BLIS results.</p>
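        <p>Independently of our C++ tests, one can confirm from the Python side which ATen primitives dominate. A hedged sketch using torch.profiler (illustrative shapes, not our exact test program):</p>

```python
# Profile a linear layer: its forward pass lowers to aten::addmm, the same
# BLAS-backed primitive that dominates BERT and GPT-2 inference.
import torch
from torch.profiler import profile, ProfilerActivity

layer = torch.nn.Linear(1024, 1024)
x = torch.randn(64, 1024)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        layer(x)

table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(table)
assert "aten::addmm" in table
```

This requires a PyTorch build with Kineto enabled (USE_KINETO=ON, as in our build).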
        <p>[Figure 2: GPT-2 text generation (25 tokens): performance of PyTorch-default, PyTorch+OpenBLAS (with and without RVV), and PyTorch+BLIS (with and without RVV) across increasing thread counts.]</p>
        <p>
          The inference performance results using the GPT-2 model are presented in Figure 2. The behavior of
PyTorch-default is somewhat similar to the one observed with BERT: increasing parallelism only
leads to performance degradation. This is mainly due to the cost added by sharing data in the upper
levels of the memory hierarchy, which is worsened by the mesh-like memory architecture of the SG2042
CPU [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. PyTorch + BLIS also behaves the same way as in the BERT results. With OpenBLAS,
performance scales only up to 24 threads. We hypothesize this is primarily due to overheads added by
data communication, plus the lighter load imposed by GPT-2 for generating 25 tokens compared to
BERT. In GPT-2, PyTorch + OpenBLAS with 1 and 2 threads performs unexpectedly poorly compared to
PyTorch-default. Further investigation is needed to understand this behavior. However, with a higher
number of threads, PyTorch + OpenBLAS with RVV enabled performs the best, improving performance
by over 30% with 16 threads compared to the sequential PyTorch-default. With 64 threads,
OpenBLAS with RVV is twice as fast as OpenBLAS with no RVV. For both BERT and GPT-2, we ran
experiments using different, smaller model sizes and observed similar results.
        <p>Although the inference performance results show gains using PyTorch + OpenBLAS with RVV, it is
unclear whether this gain comes from using RVV or from other optimizations. OpenBLAS and BLIS do not
provide tracing mechanisms to detail the type of instructions executed. oneDNN (oneAPI Deep Neural
Network Library), for example, is a library that can be used with PyTorch to provide RVV support
and can show which layers of the models execute vector instructions. However, the latest version
of oneDNN supports only RVV v1.0 and still cannot vectorize any of the main layers of the models
we tested, so we did not use it in this work. We tried using an experimental version of oneDNN that
leverages RVV v0.7, but it is incompatible with other PyTorch dependencies.</p>
        <p>Both language models should be highly vectorizable. Most of the operations involved in these
two models are aten::addmm (matrix multiplication and addition), plus aten::mm for GPT-2.
In addition, for BERT, we notice that a minor percentage of processing time is spent on the
aten::gelu (Gaussian Error Linear Units) primitive. For both models, the rest of the primitives
comprise several calls to more negligible operations, as shown in Figures 3 and 4. Since OpenBLAS and
BLIS are expected to properly vectorize matrix-matrix operations, and both manage to do it in our
tests without involving PyTorch, we are unsure why BLIS presented this unexpected behavior.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Related Work</title>
      <p>
        To our knowledge, no other work in the literature has investigated the performance of language models on a
RISC-V architecture similar to the one we use in this work. Brown et al. [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] evaluated the parallel
workloads from the RAJAPerf benchmarking suite on the SOPHON SG2042 SoC and compared the
performance with state-of-the-art ARM and x86 architectures. The authors explored the RVV v0.7.1
capabilities using the XuanTie GCC compiler provided by the vendor to enable single-precision
autovectorization on the RAJAPerf kernels. They observed that the SG2042 CPU delivers up to ten times
more performance per core than the nearest widely available RISC-V hardware. Still, it was outperformed
by 4 to 8 times by x86 CPUs in multi-threaded scenarios. They added that using custom thread
mapping strategies is essential for leveraging performance on this architecture.
      </p>
      <p>
        Lee et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] also tested the RAJAPerf kernels on a RISC-V architecture with silicon-enabled RVV v0.7.1,
in their case the T-Head C906 single-core CPU. They compared the RVV performance against the ARM
NEON and SVE instruction sets and against the SiFive U74 RISC-V, which does not implement RVV.
In some cases, the vectorized code on the C906 outperformed the U74 CPU with no vectors by about
80%. However, many other factors may influence this result when comparing two different platforms.
Both [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] and [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] point out the challenges of developing and running vectorized code on RISC-V due to
the immaturity of tooling and hardware.
      </p>
      <p>Igual et al. [30] evaluated general matrix multiplication (GEMM) kernels on the C906 and C910
T-Head RISC-V architectures, both implementing RVV v0.7.1. Evaluating GEMM kernels is important
because GEMM is the type of operation most frequently found inside language-model layers. They evaluated
the performance of the SGEMM kernels using OpenBLAS targeting RISC-V and RVV, as well as a
version of OpenBLAS targeting a generic architecture with no particular support for RVV. They report
performance gains of up to 80% when using OpenBLAS with RVV. However, they also do not distinguish
specific vector operations from any other optimizations that could be added by OpenBLAS.</p>
      <p>
        The three related works we found [
        <xref ref-type="bibr" rid="ref29 ref6">29, 6, 30</xref>
        ] evaluate the RVV capabilities on CPUs similar to the one used in this
work. However, only [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] investigated the SOPHON SG2042, with its complex NUMA design, and they
did not focus on evaluating RVV performance. Although these works evaluated kernels with common
operations found in inference model layers, evaluating the whole application adds extra complexity
and can be more representative of real-world scenarios.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>
        This paper assessed the inference performance of pre-trained language models on a multi-core RISC-V
CPU with silicon-enabled vectors version 0.7.1. The first challenge involved building PyTorch with
support for vectorization on RISC-V architectures using OpenBLAS and BLIS as acceleration libraries.
The experimental results showed that the GPT-2 and BERT models perform better when using PyTorch
+ OpenBLAS with RVV in most test cases. However, it is unclear whether this performance gain stems from
vectorization, from other optimizations, or from a wider combination of factors. Even though related
work shows that OpenBLAS targeting RVV can provide up to an 80% performance increase [30], it
also does not guarantee that this gain comes only from using RVV. Unfortunately, to this
point, the lack of performance analysis tools is a major drawback of the commercially available RISC-V
architectures [
        <xref ref-type="bibr" rid="ref29 ref6">6, 29</xref>
        ].
      </p>
      <p>In future work, a deeper and more comprehensive investigation should be carried out to better understand
RVV’s impact on the performance of inference models on RISC-V architectures. For instance,
verifying which model layers can be vectorized by OpenBLAS and BLIS in PyTorch may provide useful
information on the maximum performance improvement achievable solely through
vectorization. It would also be paramount to investigate the effects of the SG2042 NUMA design on
performance, perhaps testing different task scheduling policies and understanding the overheads
involved.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work has been partially supported by the Spoke “FutureHPC &amp; BigData” of the ICSC - Centro
Nazionale di Ricerca in “High Performance Computing, Big Data and Quantum Computing”, funded by
the EU - NextGenerationEU, and by the EuPilot project funded by EuroHPC JU under G.A. 101034126.</p>
      <p>[30] F. Igual, L. Piñuel, S. Catalán, H. Martínez, A. Castelló, E. Quintana-Ortí, Automatic generation of
micro-kernels for performance portability of matrix multiplication on RISC-V vector processors,
in: Proceedings of the SC ’23 Workshops of The International Conference on High Performance
Computing, Network, Storage, and Analysis, SC-W ’23, Association for Computing Machinery,
New York, NY, USA, 2023, pp. 1523-1532. URL: https://doi.org/10.1145/3624062.3624229. doi:10.1145/3624062.3624229.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          , in: J.
          <string-name>
            <surname>Burstein</surname>
          </string-name>
          , C.
          <string-name>
            <surname>Doran</surname>
          </string-name>
          , T. Solorio (Eds.),
          <article-title>Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, Association for Computational Linguistics</article-title>
          ,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . URL: https://doi.org/10.18653/v1/n19-1423. doi:10.18653/V1/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , L. Zhang, W. Han,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Pre-trained models: Past, present and future</article-title>
          ,
          <source>AI Open</source>
          2 (
          <year>2021</year>
          )
          <fpage>225</fpage>
          -
          <lpage>250</lpage>
          . doi:10.1016/J.AIOPEN.2021.08.002.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Greengard</surname>
          </string-name>
          ,
          <article-title>Will RISC-V revolutionize computing?</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>63</volume>
          (
          <year>2020</year>
          )
          <fpage>30</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I.</given-names>
            <surname>Elsadek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. Y.</given-names>
            <surname>Tawfik</surname>
          </string-name>
          ,
          <article-title>Risc-v resource-constrained cores: A survey and energy comparison</article-title>
          ,
          <source>in: 2021 19th IEEE International New Circuits and Systems Conference (NEWCAS)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          . doi:10.1109/NEWCAS50681.2021.9462781.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Asanović</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Patterson</surname>
          </string-name>
          ,
          <article-title>Instruction sets should be free: The case for RISC-V</article-title>
          , EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2014-146 (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J. K. L.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jamieson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jesus</surname>
          </string-name>
          ,
          <article-title>Test-driving RISC-V vector hardware for HPC</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Bienz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Weiland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Baboulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kruse</surname>
          </string-name>
          (Eds.),
          <source>High Performance Computing</source>
          , Springer Nature Switzerland, Cham,
          <year>2023</year>
          , pp.
          <fpage>419</fpage>
          -
          <lpage>432</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <source>Sg2042 technical reference manual</source>
          ,
          <year>2023</year>
          . Available at https://github.com/milkv-pioneer/pioneer-files/blob/main/hardware/SG2042-TRM.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Sequence to sequence learning with neural networks</article-title>
          , in:
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ghahramani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cortes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lawrence</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>27</volume>
          , Curran Associates, Inc.,
          <year>2014</year>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2014/file/a14ac55a4f27472c5d894ec1c3c743d2-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dellapietra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mercer</surname>
          </string-name>
          ,
          <article-title>Class-based n-gram models of natural language</article-title>
          ,
          <source>Computational Linguistics</source>
          <volume>18</volume>
          (
          <year>1992</year>
          )
          <fpage>467</fpage>
          -
          <lpage>479</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Herbert-Voss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Krueger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henighan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Winter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hesse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sigler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Litwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Berner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McCandlish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          , in:
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ranzato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hadsell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Balcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>33</volume>
          , Curran Associates, Inc.,
          <year>2020</year>
          , pp.
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-Y.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <article-title>Pre-trained language models for text generation: A survey</article-title>
          ,
          <source>ACM Comput. Surv.</source>
          <volume>56</volume>
          (
          <year>2024</year>
          ). URL: https://doi.org/10.1145/3649449. doi:10.1145/3649449.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M. V.</given-names>
            <surname>Koroteev</surname>
          </string-name>
          ,
          <article-title>BERT: A review of applications in natural language processing and understanding</article-title>
          ,
          <source>CoRR abs/2103.11943</source>
          (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2103.11943. arXiv:2103.11943.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Karimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Prati</surname>
          </string-name>
          ,
          <article-title>Adversarial training for aspect-based sentiment analysis with BERT</article-title>
          ,
          <source>in: 25th International Conference on Pattern Recognition, ICPR 2020, Virtual Event / Milan, Italy, January 10-15, 2021</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>8797</fpage>
          -
          <lpage>8803</lpage>
          . URL: https://doi.org/10.1109/ICPR48806.2021.9412167. doi:10.1109/ICPR48806.2021.9412167.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>G.</given-names>
            <surname>Brockman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <article-title>Introducing OpenAI</article-title>
          ,
          <year>2015</year>
          . URL: https://openai.com/blog/introducing-openai/.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , et al.,
          <article-title>Language models are unsupervised multitask learners</article-title>
          ,
          <source>OpenAI blog</source>
          <volume>1</volume>
          (
          <year>2019</year>
          )
          <fpage>9</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>K.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <article-title>OpenAI is giving Microsoft exclusive access to its GPT-3 language model</article-title>
          , MIT Technology Review (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>RISC-V</surname>
          </string-name>
          ,
          <article-title>RISC-V "V" vector extension</article-title>
          ,
          <year>2024</year>
          . URL: https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <article-title>Xuantie-910: a commercial multi-core 12-stage pipeline out-of-order 64-bit high performance RISC-V processor with vector extension</article-title>
          ,
          <source>in: Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture</source>
          , ISCA '20, IEEE Press,
          <year>2020</year>
          , pp.
          <fpage>52</fpage>
          -
          <lpage>64</lpage>
          . URL: https://doi.org/10.1109/ISCA45697.2020.00016. doi:10.1109/ISCA45697.2020.00016.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Waterman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Patterson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Asanovic</surname>
          </string-name>
          ,
          <article-title>The RISC-V instruction set manual, Volume I: User-level ISA</article-title>
          , version
          <volume>2</volume>
          (
          <year>2014</year>
          )
          <fpage>1</fpage>
          -
          <lpage>79</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>U.</given-names>
            <surname>Foundation</surname>
          </string-name>
          ,
          <article-title>oneAPI Math Kernel Library (oneMKL) interfaces</article-title>
          , https://github.com/oneapi-src/oneMKL,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chetlur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Woolley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vandermersch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Catanzaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Shelhamer</surname>
          </string-name>
          ,
          <article-title>cuDNN: Efficient primitives for deep learning</article-title>
          ,
          <source>CoRR abs/1410.0759</source>
          (
          <year>2014</year>
          ). URL: http://arxiv.org/abs/1410.0759. arXiv:1410.0759.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fultz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tamazov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lowell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Melesse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nandhimandalam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nasyrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Perminov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Filippov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Natarajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Daga</surname>
          </string-name>
          ,
          <article-title>MIOpen: An open source library for deep learning primitives</article-title>
          ,
          <year>2019</year>
          . arXiv:1910.00078.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>I.</given-names>
            <surname>Colonnelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Birke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Aldinucci</surname>
          </string-name>
          ,
          <article-title>Experimenting with PyTorch on RISC-V</article-title>
          ,
          <source>in: RISC-V Summit Europe 2023</source>
          , Barcelona, Spain,
          <year>2023</year>
          . URL: https://iris.unito.it/retrieve/429bf344-9090-42c3-809c-1b8ac320a930/2023-06-08-Iacopo-COLONNELLI-abstract.pdf, poster.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Paszke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bradbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Killeen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gimelshein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Antiga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Desmaison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>DeVito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Raison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tejani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chilamkurthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Steiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chintala</surname>
          </string-name>
          ,
          <article-title>PyTorch: An imperative style, high-performance deep learning library</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          <volume>32</volume>
          , Curran Associates, Inc.,
          <year>2019</year>
          , pp.
          <fpage>8024</fpage>
          -
          <lpage>8035</lpage>
          . URL: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>S.</given-names>
            <surname>Nassyr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. H.</given-names>
            <surname>Mood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Herten</surname>
          </string-name>
          ,
          <article-title>Programmatically Reaching the Roof: Automated BLIS Kernel Generator for SVE and RVV</article-title>
          , Technical Report, Jülich Supercomputing Center,
          <year>2023</year>
          . doi:10.34734/FZJ-2023-03437.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>T-Head Semiconductor Co., Ltd.</surname>
          </string-name>
          ,
          <article-title>T-Head GNU compiler toolchain</article-title>
          , https://github.com/T-head-Semi/xuantie-gnu-toolchain,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , et al.,
          <article-title>Language models are unsupervised multitask learners</article-title>
          ,
          <source>OpenAI blog</source>
          <volume>1</volume>
          (
          <year>2019</year>
          )
          <fpage>9</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <surname>HF Canonical Model Maintainers</surname>
          </string-name>
          ,
          <source>gpt2 (revision 909a290)</source>
          ,
          <year>2022</year>
          . URL: https://huggingface.co/gpt2. doi:10.57967/hf/0039.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>N.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jamieson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Is RISC-V ready for HPC prime-time: Evaluating the 64-core Sophon SG2042 RISC-V CPU</article-title>
          ,
          <source>in: Proceedings of the SC'23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1566</fpage>
          -
          <lpage>1574</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>