<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Assessing Large Language Models Inference Performance on a 64-core RISC-V CPU with Silicon-Enabled Vectors</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Adriano</forename><surname>Marques</surname></persName>
							<email>adriano.marquesgarcia@unito.it</email>
							<affiliation key="aff0">
								<orgName type="department">Computer Science Department</orgName>
								<orgName type="institution">University of Torino</orgName>
								<address>
									<addrLine>Corso Svizzera 185</addrLine>
									<postCode>10149</postCode>
									<settlement>Torino</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giulio</forename><surname>Malenza</surname></persName>
							<email>giulio.malenza@unito.it</email>
							<affiliation key="aff0">
								<orgName type="department">Computer Science Department</orgName>
								<orgName type="institution">University of Torino</orgName>
								<address>
									<addrLine>Corso Svizzera 185</addrLine>
									<postCode>10149</postCode>
									<settlement>Torino</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Robert</forename><surname>Birke</surname></persName>
							<email>robert.birke@unito.it</email>
							<affiliation key="aff0">
								<orgName type="department">Computer Science Department</orgName>
								<orgName type="institution">University of Torino</orgName>
								<address>
									<addrLine>Corso Svizzera 185</addrLine>
									<postCode>10149</postCode>
									<settlement>Torino</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Marco</forename><surname>Aldinucci</surname></persName>
							<email>marco.aldinucci@unito.it</email>
							<affiliation key="aff0">
								<orgName type="department">Computer Science Department</orgName>
								<orgName type="institution">University of Torino</orgName>
								<address>
									<addrLine>Corso Svizzera 185</addrLine>
									<postCode>10149</postCode>
									<settlement>Torino</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Assessing Large Language Models Inference Performance on a 64-core RISC-V CPU with Silicon-Enabled Vectors</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">E1304E3158CCBB8A99DD7ABB2F67CD13</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:49+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>RISC-V</term>
					<term>RVV</term>
					<term>PyTorch</term>
					<term>LLM</term>
					<term>XuanTie C920</term>
					<term>SOPHON SG2042</term>
					<term>OpenBLAS</term>
					<term>BLIS</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The rising usage of compute-intensive AI applications with fast response time requirements, such as text generation using large language models, underscores the need for more efficient and versatile hardware solutions. This drives the exploration of emerging architectures like RISC-V, which has the potential to deliver strong performance within tight power constraints. The recent commercial release of processors with RISC-V Vector (RVV) silicon-enabled extensions further amplifies the significance of RISC-V architectures, offering enhanced capabilities for parallel processing and accelerating tasks critical to large language models and other AI applications. This work aims to evaluate the BERT and GPT-2 language models inference performance on the SOPHON SG2042 64-core RISC-V architecture with silicon-enabled RVV v0.7.1. We benchmarked the models with and without RVV, using OpenBLAS and BLIS as BLAS backends for PyTorch to enable vectorization. Enabling RVV in OpenBLAS improved the inference performance by up to 40% in some cases.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>In recent years, natural language processing (NLP) advancements have skyrocketed, largely propelled by the development of sophisticated large language models (LLMs) such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer). These models have showcased remarkable capabilities in understanding, generating, and even summarizing human-like language accurately and fluently <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>. This exceptional performance is fuelled by increasingly larger foundational models with billions of parameters.</p><p>To cope with the growing demand for more efficient and versatile NLP systems on the one hand, and with increasingly large and resource-hungry models on the other, it becomes crucial to assess the inference performance of such models across various hardware architectures. Among these architectures, RISC-V (Reduced Instruction Set Computer-Five) has emerged as a compelling option: an open-source instruction set architecture (ISA) that has gained considerable traction due to its flexibility, scalability, and potential for customization <ref type="bibr" target="#b2">[3]</ref>. Moreover, the RISC-V architecture is known for its potential to deliver strong performance within tight power constraints, which sets it apart from traditional Complex Instruction Set Computing (CISC) architectures <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6]</ref>.</p><p>The main goal of this work is to assess the inference performance of the BERT and GPT-2 language models on the SOPHON SG2042 64-core RISC-V architecture with silicon-enabled vectors (RVV) v0.7.1 <ref type="bibr" target="#b6">[7]</ref>. 
For this goal, we first built the PyTorch library using OpenBLAS and BLIS to leverage support for RVV v0.7.1 instructions. Then, we wrote a text-generation Python script for each model and ran experiments measuring the inference times. We compare the model inference performance of PyTorch without OpenBLAS/BLIS, with OpenBLAS/BLIS built for generic RISC-V architectures, and with OpenBLAS/BLIS targeting the specific RISC-V architecture of the SOPHON SG2042.</p><p>The contributions of this work are as follows: (1) We provide a tutorial on how to enable RISC-V Vectors v0.7.1 support for PyTorch on the SOPHON SG2042. (2) We evaluate how enabling RVV impacts the inference performance of large language models (BERT and GPT-2). (3) We evaluate an experimental version of the BLIS library that leverages RVV 0.7.1 on a real RISC-V architecture. (4) We analyze the scalability of model inference performance when increasing the parallelism of the SOPHON SG2042 64-core RISC-V processor.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Large Language Models (LLMs)</head><p>A language model can be defined as a probabilistic model of natural language: it predicts the likelihood of a sequence of words given the preceding context <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9]</ref>. Such models learn a language's structure, grammar, and vocabulary from vast amounts of text data <ref type="bibr" target="#b9">[10]</ref>. They then use that knowledge to perform various natural-language processing tasks, such as text generation, translation, summarization, sentiment analysis, and more <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13]</ref>. Statistical language models date back to the 1980s, and researchers and industry have been steadily improving them over the past decades. In 2018, researchers from Google introduced the BERT model <ref type="bibr" target="#b0">[1]</ref>, which was notable for its dramatic improvement over previous state-of-the-art models. BERT built on the transformer architecture and its attention mechanism, paving the way for the upcoming large language models (LLMs). LLMs experienced a boost in popularity among the general public with the release of ChatGPT, a chat tool based on the series of Generative Pre-trained Transformer (GPT) <ref type="bibr" target="#b1">[2]</ref> models developed by OpenAI <ref type="bibr" target="#b13">[14]</ref>.</p><p>GPT-2 <ref type="bibr" target="#b14">[15]</ref> is a well-known state-of-the-art language model developed by OpenAI, the second in the series. It is also the last model fully disclosed by OpenAI before Microsoft acquired the exclusive rights to GPT-3 <ref type="bibr" target="#b15">[16]</ref>. 
GPT-2 is a causal language model: it predicts the next token (word) in the sequence (sentence) by attending only to the tokens to its left; that is, the model cannot see into the future. Although the latest GPT version is GPT-4 Turbo, GPT-2 still offers impressive text-generation capabilities. It is designed to generate human-like text based on the input it receives.</p><p>BERT <ref type="bibr" target="#b0">[1]</ref> stands for Bidirectional Encoder Representations from Transformers. It is a language model based on the transformer architecture. BERT is not a causal language model but a masked one. It is an encoder-only architecture, lacking a decoder, which means that BERT cannot be prompted or used to generate text. Although it was not originally designed to predict and generate the next sentences or words of a text, it can still be fine-tuned for this purpose. However, doing so increases the computational cost, and the quality of the generated text may not be comparable to that of other models, such as GPT-2.</p></div>
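To make the causal setting concrete, here is a minimal pure-Python sketch of a next-token model. The bigram table and vocabulary are toy illustrations, not the actual BERT/GPT-2 models: each token is sampled from a distribution conditioned only on the left context, and fixing the random seed makes generation reproducible.

```python
import random

# Toy bigram "language model": next-token probabilities conditioned only on
# the previous token, i.e., the left context -- the causal setting GPT-2 uses.
BIGRAM_PROBS = {
    "the":   {"quick": 0.5, "fox": 0.5},
    "quick": {"brown": 1.0},
    "brown": {"fox": 1.0},
    "fox":   {"ran": 0.7, "jumped": 0.3},
}

def sequence_probability(tokens):
    """Probability of a token sequence: product of next-token probabilities."""
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= BIGRAM_PROBS.get(prev, {}).get(cur, 0.0)
    return prob

def generate(prompt, n_tokens, seed=0):
    """Causal generation: sample each next token given only the previous one."""
    rng = random.Random(seed)  # fixed seed -> reproducible output
    tokens = list(prompt)
    for _ in range(n_tokens):
        dist = BIGRAM_PROBS.get(tokens[-1])
        if not dist:
            break  # no continuation known for this token
        words, weights = zip(*dist.items())
        tokens.append(rng.choices(words, weights=weights)[0])
    return tokens

print(sequence_probability(["the", "quick", "brown", "fox"]))  # 0.5 * 1.0 * 1.0
```

A masked model like BERT instead attends to tokens on both sides of a masked position, which is why it suits fill-in-the-blank prediction rather than left-to-right generation.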
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">RISC-V Vectors (RVV) Enabled Silicon</head><p>The vector instructions for RISC-V are defined in the open "V" vector ISA extension standardized by RISC-V International <ref type="bibr" target="#b16">[17]</ref>. The first stable release, 1.0, has now been ratified. While most earlier versions never made it into silicon <ref type="bibr" target="#b5">[6]</ref>, one can find off-the-shelf hardware supporting either v0.7.1 (e.g., the SOPHON SG2042 SoC used here) or v1.0 (e.g., the Canaan Kendryte K230 SoC).</p><p>The SOPHON SG2042 system-on-chip (SoC) contains 64 RISC-V cores divided into 16 clusters connected through a grid network <ref type="bibr" target="#b6">[7]</ref>. Each cluster comprises a XuanTie C920 4-core RISC-V CPU. Each core is equipped with 64KB of L1 instruction and data cache, and each cluster of 4 cores shares 1MB of L2 cache. The unified L2 cache can handle two access requests in parallel within one cycle. The grid interconnect finally offers access to 64MB of level-3 cache shared among all 64 cores. Four DDR4-3200 memory controllers manage access to the main memory system. For peripherals, the SG2042 is equipped with 32 PCIe Gen4 lanes.</p><p>The XuanTie C920 <ref type="bibr" target="#b17">[18]</ref> is a homogeneous high-performance 64-bit multi-core RISC-V CPU architecture designed by T-Head that supports 1 to 4 cores at a maximum operating frequency of 2 GHz. It targets high-performance applications and implements a 12-stage, out-of-order, multiple-issue superscalar pipeline. Based on the RISC-V instruction set architecture (ISA), this CPU provides the RV64GCV instruction set <ref type="bibr" target="#b18">[19]</ref>, which includes the standard vector instruction extension version 0.7.1 (RVV 0.7.1). The vector processing unit's vector registers are 128 bits wide and support the FP16, FP32, FP64, INT8, INT16, INT32, and INT64 data types.</p></div>
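Given the 128-bit vector registers, the number of lanes processed per register for each supported data type follows directly from the element width; a quick back-of-the-envelope check (purely illustrative arithmetic):

```python
# Lanes per 128-bit RVV vector register on the XuanTie C920 (VLEN = 128).
VLEN = 128  # vector register width in bits

ELEM_BITS = {"FP16": 16, "FP32": 32, "FP64": 64,
             "INT8": 8, "INT16": 16, "INT32": 32, "INT64": 64}

lanes = {dtype: VLEN // bits for dtype, bits in ELEM_BITS.items()}
print(lanes["FP32"], lanes["INT8"])  # 4 single-precision lanes, 16 int8 lanes
```

Wider element types thus halve the number of elements each vector instruction touches, which is one reason lower-precision data types are attractive for inference workloads.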
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">PyTorch with Vector Support</head><p>PyTorch can delegate computation to various low-level libraries to best leverage the capabilities of the underlying hardware, e.g., Intel MKL <ref type="bibr" target="#b19">[20]</ref> when running on Intel CPUs, cuDNN <ref type="bibr" target="#b20">[21]</ref> and MIOpen <ref type="bibr" target="#b21">[22]</ref> when running on NVIDIA or AMD GPUs, or more generic ones such as the various libraries from the BLAS family.</p><p>Colonnelli et al. <ref type="bibr" target="#b22">[23]</ref> describe the first porting of PyTorch <ref type="bibr" target="#b23">[24]</ref> v2.0 to the RISC-V ISA, but the underlying platform offered only limited acceleration capabilities, i.e., only fused multiply-add (FMA) support. Newer RISC-V silicon also supports the RVV standard, providing vector instructions that can process multiple computations in parallel following the single instruction, multiple data (SIMD) parallel computing paradigm. To leverage this capability, we rely on the recently ported OpenBLAS and BLIS <ref type="bibr" target="#b24">[25]</ref> libraries. These BLAS-like libraries can be compiled with or without RVV support depending on the defined target. Notice that, due to the different versions of the RVV standard, a careful match between the target architecture and a compatible compiler is required to correctly compile vector-enabled code for the SG2042 SoC.</p><p>In more detail, we first compiled OpenBLAS v0.3.26 using the XuanTie GNU Compiler Toolchain v2.8.0 <ref type="bibr" target="#b25">[26]</ref> and set TARGET=C910V to enable RVV. For BLIS, we used a modified version <ref type="bibr" target="#b24">[25]</ref> based on BLIS v0.9.0, set "rv64iv0p7" as the target, and compiled it using the LLVM-EPI-0.7<ref type="foot" target="#foot_0">1</ref> compiler. 
Then, we compiled PyTorch v2.3 for Python v3.10.10, using XuanTie's GCC v13.2, OpenMP v4.5, and OpenBLAS/BLIS, enabling only the following build options: USE_OPENMP=1 to leverage the multiple cores of the SG2042 SoC; USE_BLAS=1 and USE_LAPACK=1 to use the BLAS libraries as the main computation backend; USE_KINETO=ON for profiling; and USE_NUMPY=ON for NumPy support. Note that, since the work in <ref type="bibr" target="#b22">[23]</ref>, the SLEEF vector math library now supports only RVV v1.0, which is incompatible with the SG2042 SoC.</p></div>
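The build options above are passed to the PyTorch source build as environment variables. The sketch below only assembles the environment and command; the `python setup.py develop` invocation is the standard PyTorch source-build entry point, but running it on the SG2042 additionally requires the XuanTie toolchain described earlier, which this snippet does not set up.

```python
import os

# Build options from this section, collected as environment variables for the
# PyTorch source build. This only assembles the configuration; it does not
# perform the (hours-long) build itself.
build_env = dict(
    os.environ,
    USE_OPENMP="1",    # exploit the 64 cores of the SG2042 SoC
    USE_BLAS="1",      # delegate heavy kernels to OpenBLAS/BLIS
    USE_LAPACK="1",
    USE_KINETO="ON",   # enable the PyTorch profiler
    USE_NUMPY="ON",    # NumPy interoperability
)
build_cmd = ["python", "setup.py", "develop"]
print(" ".join(f"{k}={build_env[k]}" for k in
               ("USE_OPENMP", "USE_BLAS", "USE_LAPACK",
                "USE_KINETO", "USE_NUMPY")))
```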
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental Methodology and Results</head><p>This section presents the model inference performance of BERT and GPT-2 for text generation. We ran the experiments on the Milk-V Pioneer Box, a commercial ready-to-use development platform in the form of a desktop computer equipped with a Pioneer motherboard powered by the SOPHON SG2042. The unit used in this paper is equipped with 128GB of DDR4 RAM (3200MHz) and a 1TB PCIe 3.0 SSD. The operating system is Linux fedora-riscv 6.1.31.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Benchmarks</head><p>We wrote a simple text generation benchmark application based on the GPT-2 and BERT language models to test and evaluate LLM inference performance. This application receives a prompt as input and generates/predicts the next 𝑛 tokens (words) as output. For GPT-2, we used the GPT-2 (Revision 909a290) model provided by OpenAI, with 124M parameters <ref type="bibr" target="#b26">[27,</ref><ref type="bibr" target="#b27">28]</ref>. For BERT, we used the bert-large-cased pre-trained version<ref type="foot" target="#foot_1">2</ref> (commit 06fa25d), which is similar in size to GPT-2 but has 24 layers, 4096 intermediate dimensions, 1024 hidden dimensions, 16 attention heads, and 336M parameters <ref type="bibr" target="#b0">[1]</ref>. To make the generated text more human-like, we added extra routines to control some aspects, such as sentence length.</p><p>For all the experiments, we used the following input prompt both for BERT and GPT-2:</p><p>-"The quick brown fox ran" Here is an example of the generated output text from GPT-2:</p><p>-"The quick brown fox ran off. This fox needs to think for a while. This fox needs a rest." And here is an example of the text generated by BERT:</p><p>-"The quick brown fox ran. Run fast fox running hard. Go ahead. Running on faster the fox went forward quickly."</p><p>Notice that the generation depends on sampling the next token (word) probabilities via a random number generator. Hence, unless the seed of the random number generator is fixed, each execution leads to different sentences being generated.</p></div>
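The reported measurements time only the model inference call, averaged over repeated runs with a standard deviation. A minimal harness in that spirit (the `infer` callable is a hypothetical stand-in for the actual PyTorch model invocation):

```python
import statistics
import time

def time_inference(infer, n_runs=10):
    """Return (mean, stdev) of the wall-clock time of `infer` over n_runs,
    timing only the inference call rather than the whole application."""
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        infer()  # model inference under test
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)

# Stand-in workload; the actual benchmark times BERT/GPT-2 generating 25 tokens.
mean_s, stdev_s = time_inference(lambda: sum(i * i for i in range(100_000)))
```

Using `time.perf_counter` (a monotonic, high-resolution clock) rather than wall-clock `time.time` avoids artifacts from system clock adjustments during long runs.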
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Performance Results</head><p>Figures <ref type="figure" target="#fig_1">1 and 2</ref> present the experimental results. The bars in the charts are the average of 10 executions, and the whiskers represent the standard deviation. It is important to note that we consider only the execution time of the model inference, not the total runtime of the application. We used our BERT and GPT-2 applications to generate 25 tokens (words) in all tests. The performance results of BERT on the SOPHON SG2042 RISC-V CPU are presented in Figure <ref type="figure" target="#fig_0">1</ref>. This chart compares the performance of the BERT benchmark with PyTorch using different BLAS runtimes. For this, we compiled five different versions of PyTorch: (1) PyTorch without OpenBLAS or BLIS (PyTorch-default); (2) PyTorch + BLIS compiled for a generic target architecture; (3) PyTorch + an experimental BLIS library with support for RVV v0.7.1; (4) PyTorch + OpenBLAS targeting a generic RISC-V architecture (no RVV); (5) PyTorch + OpenBLAS targeting the specific C910 architecture of the SOPHON SG2042 CPU.</p><p>For generating 25 tokens on the RISC-V CPU with a single thread, BERT with PyTorch-default is about 2 times slower than PyTorch + OpenBLAS with RVV. With 32 threads, it is 22 times slower. With 24 threads, PyTorch + OpenBLAS with RVV shows a 40% performance gain compared to OpenBLAS without RVV. This performance gain may stem from SIMD operations. However, other optimizations may also be applied when OpenBLAS is compiled for a specific architecture rather than a generic one.</p><p>On the other hand, BLIS shows no significant performance gains over the sequential execution with PyTorch-default. We wrote test programs in C++ to test the ability of OpenBLAS and BLIS to generate RVV instructions for GEMM operations, and both libraries passed the test. 
They generate RVV 0.7.1 instructions in the assembly code, which are properly executed. In these tests, BLIS with RVV behaves as expected and delivers better performance. Our test programs implement the same main operation the models execute (aten::addmm). We also ensured that PyTorch was running the BLIS library as the backend for the main operations and that the parallelism was being set correctly. Therefore, a deeper investigation will be needed to understand the BLIS results. The inference performance results using the GPT-2 model are presented in Figure <ref type="figure" target="#fig_1">2</ref>. The behavior of PyTorch-default is somewhat similar to the one observed with BERT: increasing parallelism only leads to performance degradation. This is mainly due to the cost added by sharing data in the upper levels of the memory hierarchy, which is worsened by the mesh-like memory architecture of the SG2042 CPU <ref type="bibr" target="#b6">[7]</ref>. PyTorch + BLIS also behaves in exactly the same way as in the BERT results. With OpenBLAS, performance scales only up to 24 threads. We hypothesize this is primarily due to overheads added by data communication, plus the lighter load imposed by GPT-2 for generating 25 tokens compared to BERT. In GPT-2, PyTorch + OpenBLAS with 1 and 2 threads performs unexpectedly poorly compared to PyTorch-default. Further investigation is needed to understand this behavior. However, with a higher number of threads, PyTorch + OpenBLAS with RVV enabled performs best, improving performance by over 30% with 16 threads compared to the sequential PyTorch-default. With 64 threads, OpenBLAS with RVV is twice as fast as OpenBLAS without RVV. For both BERT and GPT-2, we ran experiments using smaller model sizes and observed similar results. Although the inference performance results show gains using PyTorch + OpenBLAS with RVV, it is unclear whether this gain comes from using RVV or from other optimizations. 
OpenBLAS and BLIS do not provide tracing mechanisms to detail the type of instructions executed. oneDNN (oneAPI Deep Neural Network Library), for example, is a library that can be used with PyTorch to provide RVV support and can show which layers of the models execute vector instructions. However, the latest version of oneDNN only supports RVV v1.0 and still cannot vectorize any of the main layers of the models we tested, so we did not use it in this work. We tried using an experimental version of oneDNN that leverages RVV v0.7, but it is incompatible with other PyTorch dependencies.</p><p>Both language models should be highly vectorizable. Most of the operations involved in these two models are aten::addmm (matrix multiplication and addition) and, for GPT-2, also aten::mm. In addition, for BERT, we notice that a minor percentage of processing time is spent on the aten::gelu (Gaussian Error Linear Units) primitive. For both models, the rest of the primitives comprise many calls to smaller operations, as shown in Figures <ref type="figure" target="#fig_3">3 and 4</ref>. Since OpenBLAS and BLIS are expected to properly vectorize matrix × matrix operations, and both manage to do so in our tests without involving PyTorch, we are unsure why BLIS presented this unexpected behavior.</p></div>
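The dominant primitive, aten::addmm, computes out = beta·bias + alpha·(A @ B), i.e., a GEMM plus a bias term, which is exactly the operation the underlying BLAS call must vectorize. A pure-Python reference of its semantics (an illustrative sketch of the math, not the BLAS kernels themselves; beta = alpha = 1 matches the PyTorch defaults):

```python
def addmm(bias, a, b, beta=1, alpha=1):
    """Reference semantics of aten::addmm: beta*bias + alpha*(a @ b),
    with a of shape (n, k), b of shape (k, m), bias of shape (n, m)."""
    n, k, m = len(a), len(b), len(b[0])
    # Start from the scaled bias, then accumulate the matrix product.
    out = [[beta * bias[i][j] for j in range(m)] for i in range(n)]
    for i in range(n):
        for p in range(k):
            aip = alpha * a[i][p]
            for j in range(m):
                out[i][j] += aip * b[p][j]
    return out

# 2x2 check: bias of all ones plus A @ B.
print(addmm([[1, 1], [1, 1]], [[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# → [[20, 23], [44, 51]]
```

The inner loops over `p` and `j` are precisely what a vectorized BLAS replaces with RVV fused multiply-add instructions over register-wide lanes.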
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Related Work</head><p>To our knowledge, no other work in the literature has investigated the performance of language models on a RISC-V architecture similar to the one we use in this work. Brown et al. <ref type="bibr" target="#b28">[29]</ref> evaluated the parallel workloads from the RAJAPerf benchmarking suite on the SOPHON SG2042 SoC and compared the performance with state-of-the-art ARM and x86 architectures. The authors explored the RVV v0.7.1 capabilities using the XuanTie GCC compiler provided by the vendor to enable single-precision autovectorization on the RAJAPerf kernels. They observed that the SG2042 CPU delivers up to ten times more performance per core than the nearest widely available RISC-V hardware. Still, it was outperformed 4 to 8 times by the x86 CPUs in multi-threaded scenarios. They added that custom thread mapping strategies are essential for leveraging performance on this architecture.</p><p>Lee et al. <ref type="bibr" target="#b5">[6]</ref> also tested the RAJAPerf kernels on a RISC-V architecture with silicon-enabled RVV v0.7.1, in their case the T-Head C906 single-core CPU. They compared the RVV performance against the ARM NEON and SVE instruction sets and against the SiFive U74 RISC-V CPU, which does not implement RVV. In some cases, the vectorized code on the C906 outperformed the vector-less U74 CPU by about 80%. However, many other factors may influence this result when comparing two different platforms. Both <ref type="bibr" target="#b28">[29]</ref> and <ref type="bibr" target="#b5">[6]</ref> point out the challenges of developing and running vectorized code on RISC-V due to the immaturity of tooling and hardware.</p><p>Igual et al. <ref type="bibr" target="#b29">[30]</ref> evaluated general matrix multiplication (GEMM) kernels on the C906 and C910 T-Head RISC-V architectures, both implementing RVV v0.7.1. 
Evaluating GEMM kernels is important because GEMM is the operation most frequently found inside language model layers. They evaluated the performance of the SGEMM kernels using OpenBLAS targeting RISC-V with RVV, as well as a version of OpenBLAS targeting a generic architecture with no particular support for RVV. They report performance gains of up to 80% when using OpenBLAS with RVV. However, they also do not distinguish specific vector operations from any other optimizations that OpenBLAS could add.</p><p>The three related works we found <ref type="bibr" target="#b28">[29,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b29">30]</ref> evaluate the RVV capabilities of CPUs similar to the one in this work. However, only <ref type="bibr" target="#b28">[29]</ref> investigated the SOPHON SG2042, with its complex NUMA design, and they did not focus on evaluating RVV performance. Although these works evaluated kernels with common operations found in inference model layers, evaluating the whole application adds extra complexity and can be more representative of real-world scenarios.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>This paper assessed the inference performance of pre-trained language models on a multi-core RISC-V CPU with silicon-enabled vectors version 0.7.1. The first challenge involved building PyTorch with support for vectorization on RISC-V architectures using OpenBLAS and BLIS. The experimental results showed that the GPT-2 and BERT models perform better when using PyTorch + OpenBLAS with RVV in most test cases. However, it is unclear whether vectorization, other optimizations, or a wider combination of factors drives this performance gain. Even though related work shows that OpenBLAS targeting RVV can provide up to an 80% performance increase <ref type="bibr" target="#b29">[30]</ref>, it also does not guarantee that this performance gain comes only from using RVV. Unfortunately, to this point, the lack of performance analysis tools is a major drawback of the commercially available RISC-V architectures <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b28">29]</ref>.</p><p>In future work, a deeper and more comprehensive investigation should be carried out to better understand the impact of RVV on the performance of inference models on RISC-V architectures. For instance, verifying which model layers OpenBLAS and BLIS can vectorize in PyTorch may provide useful information on the maximum performance improvement achievable solely through vectorization. It would also be paramount to investigate the effects of the SG2042 NUMA design on performance, perhaps by testing different task scheduling policies and understanding the overheads involved.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: BERT inference performance using different BLAS runtimes on the SG2042 RISC-V CPU.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: GPT-2 inference performance with different BLAS runtimes on the SG2042 RISC-V CPU.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: PyTorch Profiler output when running BERT on a single core of the SOPHON SG2042 CPU.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: PyTorch Profiler output when running GPT-2 on a single core of the SOPHON SG2042 CPU.</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://repo.hca.bsc.es/gitlab/rferrer/llvm-epi</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://huggingface.co/google-bert/bert-large-cased</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work has been partially supported by the Spoke "FutureHPC &amp; BigData" of the ICSC -Centro Nazionale di Ricerca in "High Performance Computing, Big Data and Quantum Computing", funded by the EU -NextGenerationEU and the EuPilot project funded by EuroHPC JU under G.A. 101034126.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">BERT: pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="DOI">10.18653/V1/N19-1423</idno>
		<ptr target="https://doi.org/10.18653/v1/n19-1423" />
	</analytic>
	<monogr>
		<title level="m">Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">J</forename><surname>Burstein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Doran</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Solorio</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Pre-trained models: Past, present and future</title>
		<author>
			<persName><forename type="first">X</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Huo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="DOI">10.1016/J.AIOPEN.2021.08.002</idno>
	</analytic>
	<monogr>
		<title level="j">AI Open</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="225" to="250" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Will risc-v revolutionize computing?</title>
		<author>
			<persName><forename type="first">S</forename><surname>Greengard</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Communications of the ACM</title>
		<imprint>
			<biblScope unit="volume">63</biblScope>
			<biblScope unit="page" from="30" to="32" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Risc-v resource-constrained cores: A survey and energy comparison</title>
		<author>
			<persName><forename type="first">I</forename><surname>Elsadek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">Y</forename><surname>Tawfik</surname></persName>
		</author>
		<idno type="DOI">10.1109/NEWCAS50681.2021.9462781</idno>
	</analytic>
	<monogr>
		<title level="m">19th IEEE International New Circuits and Systems Conference (NEWCAS)</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="1" to="5" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Instruction sets should be free: The case for risc-v</title>
		<author>
			<persName><forename type="first">K</forename><surname>Asanović</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">A</forename><surname>Patterson</surname></persName>
		</author>
		<idno>UCB/EECS-2014-146</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
		<respStmt>
			<orgName>EECS Department, University of California, Berkeley, Tech. Rep</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Test-driving risc-v vector hardware for hpc</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">K L</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jamieson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Jesus</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">High Performance Computing</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Bienz</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Weiland</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Baboulin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Kruse</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham, Switzerland</addrLine></address></meeting>
		<imprint>
			<publisher>Springer Nature</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="419" to="432" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Wei</surname></persName>
		</author>
		<ptr target="https://github.com/milkv-pioneer/pioneer-files/blob/main/hardware/SG2042-TRM.pdf" />
		<title level="m">Sg2042 technical reference manual</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Sequence to sequence learning with neural networks</title>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper_files/paper/2014/file/a14ac55a4f27472c5d894ec1c3c743d2-Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">Z</forename><surname>Ghahramani</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Welling</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Cortes</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Lawrence</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Weinberger</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="volume">27</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Class-based n-gram models of natural language</title>
		<author>
			<persName><forename type="first">P</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Dellapietra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Souza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Mercer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="page" from="467" to="479" />
			<date type="published" when="1992">1992</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Language models are few-shot learners</title>
		<author>
			<persName><forename type="first">T</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Herbert-Voss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Henighan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ziegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Winter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hesse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sigler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Litwin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Berner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mccandlish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Larochelle</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Ranzato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Hadsell</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Balcan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Lin</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="1877" to="1901" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Pre-trained language models for text generation: A survey</title>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-Y</forename><surname>Nie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-R</forename><surname>Wen</surname></persName>
		</author>
		<idno type="DOI">10.1145/3649449</idno>
		<ptr target="https://doi.org/10.1145/3649449" />
	</analytic>
	<monogr>
		<title level="j">ACM Comput. Surv</title>
		<imprint>
			<biblScope unit="volume">56</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">BERT: A review of applications in natural language processing and understanding</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">V</forename><surname>Koroteev</surname></persName>
		</author>
		<idno>CoRR abs/2103.11943</idno>
		<ptr target="https://arxiv.org/abs/2103.11943" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Adversarial training for aspect-based sentiment analysis with BERT</title>
		<author>
			<persName><forename type="first">A</forename><surname>Karimi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Rossi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Prati</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICPR48806.2021.9412167</idno>
		<ptr target="https://doi.org/10.1109/ICPR48806.2021.9412167" />
	</analytic>
	<monogr>
		<title level="m">25th International Conference on Pattern Recognition, ICPR 2020, Virtual Event</title>
				<meeting><address><addrLine>Milan, Italy</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2021">January 10-15, 2021</date>
			<biblScope unit="page" from="8797" to="8803" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Brockman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<ptr target="https://openai.com/blog/introducing-openai/" />
		<title level="m">Introducing openai</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Language models are unsupervised multitask learners</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Luan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">OpenAI blog</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page">9</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Openai is giving microsoft exclusive access to its gpt-3 language model</title>
		<author>
			<persName><forename type="first">K</forename><surname>Hao</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
			<publisher>MIT Technology Review</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">RISC-V vector extension</title>
		<ptr target="https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Xuantie-910: a commercial multi-core 12-stage pipeline out-of-order 64-bit high performance risc-v processor with vector extension</title>
		<author>
			<persName><forename type="first">C</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Xiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Hao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Pu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Meng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Qi</surname></persName>
		</author>
		<idno type="DOI">10.1109/ISCA45697.2020.00016</idno>
		<ptr target="https://doi.org/10.1109/ISCA45697.2020.00016" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture, ISCA &apos;20</title>
				<meeting>the ACM/IEEE 47th Annual International Symposium on Computer Architecture, ISCA &apos;20</meeting>
		<imprint>
			<publisher>IEEE Press</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="52" to="64" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">The risc-v instruction set manual, Volume I: User-Level ISA</title>
		<author>
			<persName><forename type="first">A</forename><surname>Waterman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Patterson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Asanovic</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">version 2.0</title>
		<imprint>
			<biblScope unit="page" from="1" to="79" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<orgName>UXL Foundation</orgName>
		</author>
		<ptr target="https://github.com/oneapi-src/oneMKL" />
		<title level="m">oneapi math kernel library (onemkl) interfaces</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">cudnn: Efficient primitives for deep learning</title>
		<author>
			<persName><forename type="first">S</forename><surname>Chetlur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Woolley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Vandermersch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Catanzaro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Shelhamer</surname></persName>
		</author>
		<idno>CoRR abs/1410.0759</idno>
		<ptr target="http://arxiv.org/abs/1410.0759" />
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Khan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fultz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tamazov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lowell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Melesse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nandhimandalam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Nasyrov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Perminov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Shah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Filippov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Natarajan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Daga</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1910.00078</idno>
		<title level="m">Miopen: An open source library for deep learning primitives</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Experimenting with pytorch on risc-v</title>
		<author>
			<persName><forename type="first">I</forename><surname>Colonnelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Birke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Aldinucci</surname></persName>
		</author>
		<ptr target="https://iris.unito.it/retrieve/429bf344-9090-42c3-809c-1b8ac320a930/2023-06-08-Iacopo-COLONNELLI-abstract.pdf" />
	</analytic>
	<monogr>
		<title level="m">RISC-V Summit Europe 2023</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Pytorch: An imperative style, high-performance deep learning library</title>
		<author>
			<persName><forename type="first">A</forename><surname>Paszke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Massa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lerer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bradbury</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Killeen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Gimelshein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Antiga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Desmaison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kopf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Devito</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Raison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tejani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chilamkurthy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Steiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chintala</surname></persName>
		</author>
		<ptr target="http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 32</title>
				<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="8024" to="8035" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<title level="m" type="main">Programmatically Reaching the Roof: Automated BLIS Kernel Generator for SVE and RVV</title>
		<author>
			<persName><forename type="first">S</forename><surname>Nassyr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">H</forename><surname>Mood</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Herten</surname></persName>
		</author>
		<idno type="DOI">10.34734/FZJ-2023-03437</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
		<respStmt>
			<orgName>Jülich Supercomputing Center</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<author>
			<orgName>T-Head Semiconductor Co., Ltd.</orgName>
		</author>
		<ptr target="https://github.com/T-head-Semi/xuantie-gnu-toolchain" />
		<title level="m">T-head gnu compiler toolchain</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Language models are unsupervised multitask learners</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Luan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">OpenAI blog</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page">9</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<author>
			<orgName>HF Canonical Model Maintainers</orgName>
		</author>
		<idno type="DOI">10.57967/hf/0039</idno>
		<ptr target="https://huggingface.co/gpt2" />
		<title level="m" type="main">gpt2</title>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note>revision 909a290</note>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Is risc-v ready for hpc prime-time: Evaluating the 64-core sophon sg2042 risc-v cpu</title>
		<author>
			<persName><forename type="first">N</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jamieson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the SC&apos;23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis</title>
				<meeting>the SC&apos;23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1566" to="1574" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Automatic generation of micro-kernels for performance portability of matrix multiplication on risc-v vector processors</title>
		<author>
			<persName><forename type="first">F</forename><surname>Igual</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Piñuel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Catalán</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Martínez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Castelló</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Quintana-Ortí</surname></persName>
		</author>
		<idno type="DOI">10.1145/3624062.3624229</idno>
		<ptr target="https://doi.org/10.1145/3624062.3624229" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the SC &apos;23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, SC-W &apos;23</title>
				<meeting>the SC &apos;23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, SC-W &apos;23<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1523" to="1532" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
