Assessing Large Language Models Inference Performance on a 64-core RISC-V CPU with Silicon-Enabled Vectors

Adriano Marques Garcia¹,*, Giulio Malenza¹, Robert Birke¹ and Marco Aldinucci¹

¹ Computer Science Department, University of Torino, Corso Svizzera 185, 10149 Torino, Italy

Abstract

The rising usage of compute-intensive AI applications with fast response time requirements, such as text generation using large language models, underscores the need for more efficient and versatile hardware solutions. This drives the exploration of emerging architectures like RISC-V, which has the potential to deliver strong performance within tight power constraints. The recent commercial release of processors with silicon-enabled RISC-V Vector (RVV) extensions further amplifies the significance of RISC-V architectures, offering enhanced capabilities for parallel processing and for accelerating tasks critical to large language models and other AI applications. This work evaluates the inference performance of the BERT and GPT-2 language models on the SOPHON SG2042 64-core RISC-V architecture with silicon-enabled RVV v0.7.1. We benchmarked the models with and without RVV, using OpenBLAS and BLIS as BLAS backends for PyTorch to enable vectorization. Enabling RVV in OpenBLAS improved the inference performance by up to 40% in some cases.

Keywords: RISC-V, RVV, PyTorch, LLM, XuanTie C920, SOPHON SG2042, OpenBLAS, BLIS

1. Introduction

In recent years, natural language processing (NLP) advancements have skyrocketed, largely propelled by the development of sophisticated large language models (LLMs) such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer). These models have showcased remarkable capabilities in understanding, generating, and even summarizing human-like language accurately and fluently [1, 2]. This exceptional performance is fuelled by increasingly larger foundational models with billions of parameters.
To cope with, on the one hand, the growing demand for more efficient and versatile NLP systems and, on the other hand, increasingly larger and resource-hungry models, it becomes crucial to assess the inference performance of such models across various hardware architectures. Among these architectures, RISC-V (Reduced Instruction Set Computer-Five) has emerged as a compelling option: an open-source instruction set architecture (ISA) that has gained considerable traction due to its flexibility, scalability, and potential for customization [3]. Besides, the RISC-V architecture is known for its potential to deliver strong performance within tight power constraints, which sets it apart from traditional Complex Instruction Set Computing (CISC) architectures [4, 5, 6].

The main goal of this work is to assess the inference performance of the BERT and GPT-2 language models on the SOPHON SG2042 64-core RISC-V architecture with silicon-enabled vectors (RVV) v0.7.1 [7]. For this goal, we first built the PyTorch library using OpenBLAS and BLIS to leverage support for RVV v0.7.1 instructions. Then, we wrote a text-generation Python script for each model and ran experiments measuring the inference times. We compare the model inference performance with PyTorch without using OpenBLAS/BLIS, using OpenBLAS/BLIS built for generic RISC-V architectures, and using OpenBLAS/BLIS targeting the specific RISC-V architecture of the SOPHON SG2042.

BigHPC2024: Special Track on Big Data and High-Performance Computing, co-located with the 3rd Italian Conference on Big Data and Data Science, ITADATA2024, September 17–19, 2024, Pisa, Italy.
* Corresponding author.
Email: adriano.marquesgarcia@unito.it (A. M. Garcia); giulio.malenza@unito.it (G. Malenza); robert.birke@unito.it (R. Birke); marco.aldinucci@unito.it (M. Aldinucci)
ORCID: 0000-0003-4796-773X (A. M. Garcia); 0009-0006-4862-7429 (G. Malenza); 0000-0003-1144-3707 (R. Birke); 0000-0001-8788-0829 (M. Aldinucci)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

The contributions of this work are as follows: (1) We provide a tutorial on how to enable RISC-V Vectors v0.7.1 support for PyTorch on the SOPHON SG2042. (2) We evaluate how enabling RVV impacts the inference performance of large language models (BERT and GPT-2). (3) We evaluate an experimental version of the BLIS library that leverages RVV 0.7.1 on a real RISC-V architecture. (4) We analyze the scalability of model inference performance when increasing the parallelism of the SOPHON SG2042 64-core RISC-V processor.

2. Background

2.1. Large Language Models (LLMs)

A language model can be defined as a probabilistic model of natural language: a model that predicts the likelihood of a sequence of words given the preceding context [8, 9]. Such models learn a language's structure, grammar, and vocabulary from vast amounts of text data [10]. They then use that knowledge to perform various natural-language processing tasks, such as text generation, translation, summarization, sentiment analysis, and more [11, 12, 13]. The first language model was proposed in 1980, and researchers and industry have been steadily improving these models over the past decades. In 2018, researchers from Google introduced the BERT model [1], which was notable for its dramatic improvement over previous state-of-the-art models. BERT adopted the transformer architecture and its attention mechanism, paving the way for the upcoming large language models (LLMs). LLMs experienced a boost in popularity among the general public with the release of ChatGPT, a chat tool based on the series of Generative Pre-trained Transformer (GPT) [2] models developed by OpenAI [14]. GPT-2 [15] is a well-known state-of-the-art language model developed by OpenAI, the second in the series.
It is also the last model fully disclosed by OpenAI before Microsoft acquired the exclusive rights to GPT-3 [16]. It is a causal language model: it predicts the next token (word) in the sequence (sentence) by attending only to the tokens to its left; that is, the model cannot see into the future. Although the latest GPT version is GPT-4 Turbo, GPT-2 still presents impressive text generation capabilities. It is designed to generate human-like text based on the input it receives.

BERT [1] stands for Bidirectional Encoder Representations from Transformers. It is a language model based on the transformer architecture. BERT is not a causal language model but a masked one. It is an encoder-only architecture, lacking a decoder, which means that BERT cannot be prompted or used to generate text out of the box. Although it was not originally designed to predict and generate the next sentences or words of a text, it can still be fine-tuned for this purpose. However, this increases the computational cost, and the quality of the generated text may not be comparable to that of other models, such as GPT-2.

2.2. RISC-V Vectors (RVV) Enabled Silicon

The vector instructions for RISC-V are defined in the open "V" vector ISA extension standardized by RISC-V International [17]. The first stable release, 1.0, has now been ratified. While most earlier versions never made it into silicon [6], one can find off-the-shelf hardware supporting either v0.7.1 (e.g., the SOPHON SG2042 SoC used here) or v1.0 (e.g., the Canaan Kendryte K230 SoC).

The SOPHON SG2042 system-on-chip (SoC) contains 64 RISC-V cores divided into 16 clusters connected through a grid network [7]. Each cluster comprises a XuanTie C920 4-core RISC-V CPU. Each core is equipped with 64KB of L1 instruction and data cache, and each cluster of 4 cores shares 1MB of L2 cache. The unified L2 cache can handle two access requests in parallel within one cycle. The grid interconnect finally offers access to 64MB of level 3 cache shared among all 64 cores.
Four DDR4-3200 memory controllers manage access to the main memory system. For peripherals, the SG2042 is equipped with 32 PCIe Gen4 lanes.

The XuanTie C920 [18] is a homogeneous, high-performance, 64-bit multi-core RISC-V CPU architecture designed by T-Head that supports 1 to 4 cores at a maximum operating frequency of 2 GHz. It targets high-performance applications and implements a 12-stage, out-of-order, multiple-issue superscalar pipeline. Based on the RISC-V ISA, this CPU provides the RV64GCV instruction set [19], which supports the standard vector instruction extension version 0.7.1 (RVV 0.7.1). The vector processing unit's vector registers are 128 bits long and support the FP16, FP32, FP64, INT8, INT16, INT32, and INT64 data types.

3. PyTorch with Vector Support

PyTorch can delegate computation to various low-level libraries to best leverage the capabilities of the underlying hardware, e.g., Intel MKL [20] when running on Intel CPUs, cuDNN [21] when running on NVIDIA GPUs or MIOpen [22] on AMD GPUs, or more generic ones such as the various libraries of the BLAS family. Colonnelli et al. [23] describe the first porting of PyTorch [24] v2.0 to the RISC-V ISA, but the underlying platform offered only limited acceleration capabilities, i.e., only fused multiply-add (FMA) support. Newer RISC-V silicon also supports the RVV standard, providing vector instructions that can process multiple computations in parallel, following the single instruction, multiple data (SIMD) parallel computing paradigm. To leverage this capability, we rely on the recently ported OpenBLAS and BLIS [25] libraries. These BLAS-like libraries can be compiled with and without RVV support based on the defined target. Notice that, due to the different versions of the RVV standard, a careful match between target architecture and a compatible compiler is required to correctly compile vector-enabled code for the SG2042 SoC.
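Whether a core advertises the "v" extension at all can be read from `/proc/cpuinfo` on Linux; note, however, that the ISA string does not reveal the RVV *version* (0.7.1 vs. 1.0), which is precisely the compatibility pitfall discussed above. A small helper sketch (not part of the paper's tooling; the function names are ours):

```python
import re

def isa_extensions(isa_string):
    """Split a RISC-V ISA string such as 'rv64imafdcv' into its
    single-letter base extensions plus any underscore-separated
    multi-letter extensions (e.g. 'zicsr')."""
    m = re.match(r"rv(?:32|64|128)([a-z]*)((?:_[a-z0-9]+)*)$",
                 isa_string.strip().lower())
    if m is None:
        raise ValueError(f"not a RISC-V ISA string: {isa_string!r}")
    return list(m.group(1)) + [e for e in m.group(2).split("_") if e]

def supports_rvv(cpuinfo_text):
    """Return True if any 'isa' line in the given /proc/cpuinfo
    content advertises the 'v' (vector) extension.  This only tells
    you vectors exist -- not which RVV version the silicon implements."""
    for line in cpuinfo_text.splitlines():
        if line.strip().lower().startswith("isa"):
            isa = line.split(":", 1)[1]
            if "v" in isa_extensions(isa):
                return True
    return False
```

On an SG2042 box, `supports_rvv(open("/proc/cpuinfo").read())` should report vector support, but matching the compiler to v0.7.1 still has to be done by hand.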
More in detail, we first compiled OpenBLAS v0.3.26 using the XuanTie GNU Compiler Toolchain v2.8.0 [26] and set TARGET=C910V to enable RVV. For BLIS, we used a modified version [25] based on BLIS v0.9.0, set "rv64iv0p7" as the target, and compiled it using the LLVM-EPI-0.7¹ compiler. Then, we compiled PyTorch v2.3 for Python v3.10.10, using XuanTie's GCC v13.2, OpenMP v4.5, and OpenBLAS/BLIS, enabling only the following build options: USE_OPENMP=1 to leverage the multiple cores on the SG2042 SoC; USE_BLAS=1 and USE_LAPACK=1 to use the BLAS libraries as the main computation backend; USE_KINETO=ON for profiling; and USE_NUMPY=ON for NumPy support. Note that, since the work in [23], the SLEEF vector math library has gained RVV support, but only for RVV v1.0, which is incompatible with the SG2042 SoC.

4. Experimental Methodology and Results

This section presents the model inference performance of BERT and GPT-2 for text generation. We ran the experiments on the Milk-V Pioneer Box, a commercial ready-to-use development platform in the form of a desktop computer equipped with a Pioneer motherboard powered by the SOPHON SG2042. The unit we used in this paper is equipped with 128GB of DDR4 RAM (3200MHz) and a 1TB PCIe 3.0 SSD. The operating system is Linux fedora-riscv 6.1.31.

4.1. Benchmarks

We wrote a simple text generation benchmark application based on the GPT-2 and BERT language models to test and evaluate the performance of LLMs. This application receives a prompt as input and generates/predicts the next 𝑛 tokens (words) as output. For GPT-2, we used the GPT-2 (revision 909a290) model provided by OpenAI, with 124M parameters [27, 28]. For BERT, we used the bert-large-cased pre-trained version² (commit 06fa25d), which is similar in size to GPT-2 but has 24 layers, 4096 intermediate dimensions, 1024 hidden dimensions, 16 attention heads, and 336M parameters [1]. To make the generated text more human-like, we added extra routines to control some aspects, such as sentence length.
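The paper's benchmark scripts are not listed; the sketch below shows the general shape of such a driver under our own assumptions (HuggingFace `transformers` with the `gpt2` checkpoint, and a hypothetical `time_inference` helper), timing only the model call, as the measurements in this section do:

```python
import time

def time_inference(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed_seconds).  Only the
    call itself is timed, mirroring the choice of measuring model
    inference time rather than total application runtime."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

if __name__ == "__main__":
    # Hypothetical driver: assumes the `transformers` package and the
    # `gpt2` checkpoint (124M parameters) are available locally.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    inputs = tok("The quick brown fox ran", return_tensors="pt")
    with torch.no_grad():
        out, secs = time_inference(
            model.generate,
            **inputs,
            max_new_tokens=25,  # the experiments generate 25 tokens
            do_sample=True,     # sampled: output varies unless seeded
        )
    print(tok.decode(out[0], skip_special_tokens=True))
    print(f"inference time: {secs:.2f} s")
```

The number of OpenMP threads used by the BLAS backend is then controlled externally via `OMP_NUM_THREADS`, as in the experiments below.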
¹ https://repo.hca.bsc.es/gitlab/rferrer/llvm-epi
² https://huggingface.co/google-bert/bert-large-cased

For all the experiments, we used the following input prompt for both BERT and GPT-2:

– "The quick brown fox ran"

Here is an example of the generated output text from GPT-2:

– "The quick brown fox ran off. This fox needs to think for a while. This fox needs a rest."

And here is an example of the text generated by BERT:

– "The quick brown fox ran. Run fast fox running hard. Go ahead. Running on faster the fox went forward quickly."

Notice that the generation depends on sampling the next token (word) probabilities via a random number generator. Hence, unless the seed of the random number generator is fixed, each execution leads to different sentences being generated.

4.2. Performance Results

Figures 1 and 2 present the experimental results. The bars in the charts are the average of 10 executions, and the whiskers represent the standard deviation. It is important to note that we consider only the execution time of the model inference, not the total runtime of the application. We used our BERT and GPT-2 applications to generate 25 tokens (words) in all tests.

Figure 1: BERT inference performance using different BLAS runtimes on the SG2042 RISC-V CPU. [Bar chart: inference time (s) vs. OMP threads (1–64) for PyTorch-default, PyTorch+OpenBLAS with/without RVV, and PyTorch+BLIS with/without RVV.]

The chart with the performance results of BERT on the SOPHON SG2042 RISC-V CPU is presented in Figure 1. This chart compares the performance of the BERT benchmark with PyTorch using different BLAS runtimes.
For this, we compiled five different versions of PyTorch: (1) PyTorch without OpenBLAS or BLIS (PyTorch-default); (2) PyTorch + BLIS compiled for a generic target architecture; (3) PyTorch + an experimental BLIS library with support for RVV v0.7.1; (4) PyTorch + OpenBLAS targeting a generic RISC-V architecture (no RVV); (5) PyTorch + OpenBLAS targeting the specific C910 architecture of the SOPHON SG2042 CPU.

For generating 25 tokens on the RISC-V CPU with a single thread, BERT with PyTorch-default is about 2 times slower than PyTorch + OpenBLAS with RVV. With 32 threads, it is 22 times slower. With 24 threads, PyTorch + OpenBLAS with RVV shows a 40% performance gain compared to OpenBLAS without RVV. This performance gain may be brought by SIMD operations alone; however, other optimizations may also be applied when compiling OpenBLAS for a specific architecture rather than a generic one. On the other hand, BLIS shows no significant performance gains over the sequential execution with PyTorch-default.

We wrote test programs in C++ to test OpenBLAS's and BLIS's capabilities in generating RVV instructions for GEMM operations, and both libraries passed the test: they generate RVV 0.7.1 instructions in the assembly code, which are properly executed. In these tests, BLIS with RVV behaves as expected and presents better performance. Our test programs implement the same main operation the models execute (aten::addmm). We also ensured that PyTorch was running the BLIS library as the backend for the main operations and that the parallelism was being set correctly. Therefore, a deeper investigation will be needed to understand the BLIS results.

Figure 2: GPT-2 inference performance with different BLAS runtimes on the SG2042 RISC-V CPU. [Bar chart: inference time (s) vs. OMP threads (1–64) for the same five PyTorch configurations.]
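For reference, aten::addmm, the operation exercised by the test programs, computes out = β·bias + α·(mat1 × mat2); the matrix product is the part PyTorch hands to the BLAS backend's GEMM routine, which is where RVV vectorization can apply. A minimal NumPy sketch of these semantics (an illustration, not the paper's C++ test program):

```python
import numpy as np

def addmm(bias, mat1, mat2, beta=1.0, alpha=1.0):
    """Reference semantics of PyTorch's aten::addmm:
    beta * bias + alpha * (mat1 @ mat2).  The mat1 @ mat2 product is
    the GEMM call that OpenBLAS/BLIS may execute with RVV
    instructions on the SG2042."""
    return beta * bias + alpha * (mat1 @ mat2)

# Toy check with small matrices.
bias = np.ones((2, 2))
mat1 = np.array([[1.0, 2.0], [3.0, 4.0]])
mat2 = np.array([[5.0, 6.0], [7.0, 8.0]])
out = addmm(bias, mat1, mat2)
```

In the models themselves the operands are much larger (e.g., BERT-large's 1024-dimensional hidden states), which is why vectorizing this single primitive dominates the inference time.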
The inference performance results using the GPT-2 model are presented in Figure 2. The behavior of PyTorch-default is somewhat similar to the one observed with BERT: increasing parallelism only leads to performance degradation. This is mainly due to the cost added by sharing data in the upper levels of the memory hierarchy, which is worsened by the mesh-like memory architecture of the SG2042 CPU [7]. PyTorch + BLIS also behaves the same way as in the BERT results. With OpenBLAS, performance scales only up to 24 threads. We hypothesize that this is primarily due to overheads added by data communication, combined with the lighter load imposed by GPT-2 when generating 25 tokens compared to BERT. In GPT-2, PyTorch + OpenBLAS with 1 and 2 threads performs unexpectedly poorly compared to PyTorch-default; further investigation is needed to understand this behavior. However, with a higher number of threads, PyTorch + OpenBLAS with RVV enabled performs best, improving performance by over 30% with 16 threads compared to the sequential PyTorch-default. With 64 threads, OpenBLAS with RVV is twice as fast as OpenBLAS without RVV. For both BERT and GPT-2, we ran experiments using smaller model sizes and observed similar results.

Figure 3: PyTorch Profiler output when running BERT on a single core of the SOPHON SG2042 CPU.

Figure 4: PyTorch Profiler output when running GPT-2 on a single core of the SOPHON SG2042 CPU.

Although the inference performance results show gains using PyTorch + OpenBLAS with RVV, it is unclear whether this gain comes from using RVV or from other optimizations. OpenBLAS and BLIS do not provide tracing mechanisms to detail the type of instructions executed. oneDNN (oneAPI Deep Neural Network Library), for example, is a library that can be used with PyTorch to provide RVV support and can show which layers of the models execute vector instructions.
However, the latest version of oneDNN only supports RVV v1.0 and still cannot vectorize any of the main layers of the models we tested, so we did not use it in this work. We tried an experimental version of oneDNN that leverages RVV v0.7, but it is incompatible with other PyTorch dependencies.

Both language models should be highly vectorizable. Most of the operations involved in these two models are aten::addmm (matrix multiplication and addition) and, for GPT-2, also aten::mm. In addition, for BERT, we notice that a minor percentage of processing time is spent on the aten::gelu (Gaussian Error Linear Units) primitive. For both models, the rest of the primitives comprise many calls to individually negligible operations, as shown in Figures 3 and 4. Since OpenBLAS and BLIS are expected to properly vectorize matrix × matrix operations, and both manage to do so in our tests without involving PyTorch, we are unsure why BLIS presented this unexpected behavior.

5. Related Work

To our knowledge, no other work in the literature has investigated the performance of language models on a RISC-V architecture similar to the one we use in this work. Brown et al. [29] evaluated the parallel workloads from the RAJAPerf benchmarking suite on the SOPHON SG2042 SoC and compared the performance with state-of-the-art ARM and x86 architectures. The authors explored the RVV v0.7.1 capabilities using the XuanTie GCC compiler provided by the vendor to enable single-precision auto-vectorization of the RAJAPerf kernels. They observed that the SG2042 CPU delivers up to ten times more performance per core than the nearest widely-available RISC-V hardware. Still, it was outperformed by 4 to 8 times by the x86 CPUs in multi-threaded scenarios. They added that using custom thread mapping strategies is essential for leveraging performance on this architecture. Lee et al. [6] also tested the RAJAPerf kernels on a RISC-V architecture with silicon-enabled RVV v0.7.1, but on the single-core T-Head C906 CPU.
They compared the RVV performance against the ARM NEON and SVE instruction sets and against the SiFive U74 RISC-V CPU, which does not implement RVV. In some cases, the vectorized code on the C906 outperformed the U74 CPU without vectors by about 80%. However, many other factors may influence this result when comparing two different platforms. Both [29] and [6] point out the challenges of developing and running vectorized code on RISC-V due to the immaturity of tooling and hardware. Igual et al. [30] evaluated general matrix multiplication (GEMM) kernels on the C906 and C910 T-Head RISC-V architectures, both implementing RVV v0.7.1. Evaluating GEMM kernels is important because GEMM is frequently the type of operation found inside language model layers. They evaluated the performance of SGEMM kernels using OpenBLAS targeting RISC-V with RVV, as well as a version of OpenBLAS targeting a generic architecture with no particular support for RVV. They report performance gains of up to 80% when using OpenBLAS with RVV. However, they also do not distinguish specific vector operations from any other optimizations that could be added by OpenBLAS.

The three related works we found [29, 6, 30] evaluate the RVV capabilities of CPUs similar to the one used in this work. However, only [29] investigated the SOPHON SG2042, with its complex NUMA design, and they did not focus on evaluating RVV performance. Although these works evaluated kernels with common operations found in inference model layers, evaluating the whole application adds extra complexity and can be more representative of real-world scenarios.

6. Conclusion

This paper assessed the inference performance of pre-trained language models on a multi-core RISC-V CPU with silicon-enabled vectors version 0.7.1. The first challenge involved building the PyTorch library with support for vectorization on RISC-V architectures using OpenBLAS and BLIS.
The experimental results showed that the GPT-2 and BERT models perform better when using PyTorch + OpenBLAS with RVV in most test cases. However, it is unclear whether vectorization, other optimizations, or a wider combination of factors drives this performance gain. Even though related work shows that OpenBLAS targeting RVV can provide up to an 80% performance increase [30], they also do not guarantee that this gain comes only from using RVV. Unfortunately, to this point, the lack of performance analysis tools is a major drawback of the commercially available RISC-V architectures [6, 29].

In future work, a deeper and more comprehensive investigation should be carried out to better understand the impact of RVV on the performance of inference models on RISC-V architectures. For instance, verifying which model layers can be vectorized by OpenBLAS and BLIS in PyTorch may provide useful information on the maximum performance improvement that could be achieved solely through vectorization. Also, it would be paramount to investigate the effects of the SG2042 NUMA design on performance, perhaps testing different task scheduling policies and understanding what overheads are involved.

Acknowledgments

This work has been partially supported by the Spoke "FutureHPC & BigData" of the ICSC - Centro Nazionale di Ricerca in "High Performance Computing, Big Data and Quantum Computing", funded by the EU - NextGenerationEU, and by the EuPilot project funded by EuroHPC JU under G.A. 101034126.

References

[1] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, Association for Computational Linguistics, 2019, pp. 4171–4186. URL: https://doi.org/10.18653/v1/n19-1423. doi:10.18653/v1/N19-1423.
[2] X. Han, Z. Zhang, N. Ding, Y.
Gu, X. Liu, Y. Huo, J. Qiu, Y. Yao, A. Zhang, L. Zhang, W. Han, M. Huang, Q. Jin, Y. Lan, Y. Liu, Z. Liu, Z. Lu, X. Qiu, R. Song, J. Tang, J. Wen, J. Yuan, W. X. Zhao, J. Zhu, Pre-trained models: Past, present and future, AI Open 2 (2021) 225–250. doi:10.1016/j.aiopen.2021.08.002.
[3] S. Greengard, Will RISC-V revolutionize computing?, Communications of the ACM 63 (2020) 30–32.
[4] I. Elsadek, E. Y. Tawfik, RISC-V resource-constrained cores: A survey and energy comparison, in: 2021 19th IEEE International New Circuits and Systems Conference (NEWCAS), 2021, pp. 1–5. doi:10.1109/NEWCAS50681.2021.9462781.
[5] K. Asanović, D. A. Patterson, Instruction sets should be free: The case for RISC-V, EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2014-146 (2014).
[6] J. K. L. Lee, M. Jamieson, N. Brown, R. Jesus, Test-driving RISC-V vector hardware for HPC, in: A. Bienz, M. Weiland, M. Baboulin, C. Kruse (Eds.), High Performance Computing, Springer Nature Switzerland, Cham, 2023, pp. 419–432.
[7] C. Wei, SG2042 technical reference manual, 2023. Available at https://github.com/milkv-pioneer/pioneer-files/blob/main/hardware/SG2042-TRM.pdf.
[8] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, in: Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, K. Weinberger (Eds.), Advances in Neural Information Processing Systems, volume 27, Curran Associates, Inc., 2014. URL: https://proceedings.neurips.cc/paper_files/paper/2014/file/a14ac55a4f27472c5d894ec1c3c743d2-Paper.pdf.
[9] P. Brown, V. Dellapietra, P. Souza, J. Lai, R. Mercer, Class-based n-gram models of natural language, Computational Linguistics 18 (1992) 467–479.
[10] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C.
Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1877–1901. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
[11] J. Li, T. Tang, W. X. Zhao, J.-Y. Nie, J.-R. Wen, Pre-trained language models for text generation: A survey, ACM Comput. Surv. 56 (2024). URL: https://doi.org/10.1145/3649449. doi:10.1145/3649449.
[12] M. V. Koroteev, BERT: A review of applications in natural language processing and understanding, CoRR abs/2103.11943 (2021). URL: https://arxiv.org/abs/2103.11943. arXiv:2103.11943.
[13] A. Karimi, L. Rossi, A. Prati, Adversarial training for aspect-based sentiment analysis with BERT, in: 25th International Conference on Pattern Recognition, ICPR 2020, Virtual Event / Milan, Italy, January 10-15, 2021, IEEE, 2020, pp. 8797–8803. URL: https://doi.org/10.1109/ICPR48806.2021.9412167. doi:10.1109/ICPR48806.2021.9412167.
[14] G. Brockman, I. Sutskever, Introducing OpenAI, 2015. URL: https://openai.com/blog/introducing-openai/.
[15] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[16] K. Hao, OpenAI is giving Microsoft exclusive access to its GPT-3 language model, MIT Technology Review (2020).
[17] RISC-V International, RISC-V "V" vector extension, 2024. URL: https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc.
[18] C. Chen, X. Xiang, C. Liu, Y. Shang, R. Guo, D. Liu, Y. Lu, Z. Hao, J. Luo, Z. Chen, C. Li, Y. Pu, J. Meng, X. Yan, Y. Xie, X.
Qi, Xuantie-910: A commercial multi-core 12-stage pipeline out-of-order 64-bit high performance RISC-V processor with vector extension, in: Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture, ISCA '20, IEEE Press, 2020, pp. 52–64. URL: https://doi.org/10.1109/ISCA45697.2020.00016. doi:10.1109/ISCA45697.2020.00016.
[19] A. Waterman, Y. Lee, D. Patterson, K. Asanović, The RISC-V instruction set manual, Volume I: User-level ISA, version 2 (2014) 1–79.
[20] U. Foundation, oneAPI Math Kernel Library (oneMKL) interfaces, https://github.com/oneapi-src/oneMKL, 2024.
[21] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, E. Shelhamer, cuDNN: Efficient primitives for deep learning, CoRR abs/1410.0759 (2014). URL: http://arxiv.org/abs/1410.0759. arXiv:1410.0759.
[22] J. Khan, P. Fultz, A. Tamazov, D. Lowell, C. Liu, M. Melesse, M. Nandhimandalam, K. Nasyrov, I. Perminov, T. Shah, V. Filippov, J. Zhang, J. Zhou, B. Natarajan, M. Daga, MIOpen: An open source library for deep learning primitives, 2019. arXiv:1910.00078.
[23] I. Colonnelli, R. Birke, M. Aldinucci, Experimenting with PyTorch on RISC-V, in: RISC-V Summit Europe 2023, Barcelona, Spain, 2023. Poster. URL: https://iris.unito.it/retrieve/429bf344-9090-42c3-809c-1b8ac320a930/2023-06-08-Iacopo-COLONNELLI-abstract.pdf.
[24] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 2019, pp. 8024–8035. URL: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
[25] S. Nassyr, K. H. Mood, A.
Herten, Programmatically Reaching the Roof: Automated BLIS Kernel Generator for SVE and RVV, Technical Report, Jülich Supercomputing Center, 2023. doi:10.34734/FZJ-2023-03437.
[26] T-Head Semiconductor Co., Ltd., T-Head GNU compiler toolchain, https://github.com/T-head-Semi/xuantie-gnu-toolchain, 2024.
[27] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[28] HF Canonical Model Maintainers, gpt2 (revision 909a290), 2022. URL: https://huggingface.co/gpt2. doi:10.57967/hf/0039.
[29] N. Brown, M. Jamieson, J. Lee, P. Wang, Is RISC-V ready for HPC prime-time: Evaluating the 64-core SOPHON SG2042 RISC-V CPU, in: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, 2023, pp. 1566–1574.
[30] F. Igual, L. Piñuel, S. Catalán, H. Martínez, A. Castelló, E. Quintana-Ortí, Automatic generation of micro-kernels for performance portability of matrix multiplication on RISC-V vector processors, in: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, SC-W '23, Association for Computing Machinery, New York, NY, USA, 2023, pp. 1523–1532. URL: https://doi.org/10.1145/3624062.3624229. doi:10.1145/3624062.3624229.