<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Complexifying BERT using LoRA Adapters</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fabio Tamburini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FICLIT - University of Bologna</institution>
          ,
          <addr-line>Via Zamboni, 32, Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper presents the first results of a pilot study on transforming a real-valued pre-trained transformer encoder into a complex-valued one. Following recent findings about pre-training with LoRA, the main idea is to employ complex-valued LoRA adapters to do the trick and continue the pre-training of a given Italian model in order to set up the adapters. After pre-training, the proposed complex-valued model has been evaluated on a standardised benchmark for Italian natural-language understanding, obtaining very encouraging results.</p>
      </abstract>
      <kwd-group>
        <kwd>Complex-valued Transformers</kwd>
        <kwd>Language-Model Pre-Training</kwd>
        <kwd>LoRA Adapters</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Italian</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Yang et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] concentrate on the development of a complex-valued transformer for speech, signal and audio data that are naturally complex-valued after the Fourier Transform.
      </p>
      <p>
        Wang et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], working on positional embeddings and proposing a solution for modelling both the global absolute positions of words and their order relationships, introduced a small complex-valued transformer architecture to test their ideas.
      </p>
      <p>
        The works from Eilers and Jiang [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and Li et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
have the goal of providing a complete model for building complex-valued transformer encoders, describing possible building blocks for doing so and testing different configurations and parameters.
      </p>
      <p>As we said before, all these works pre-train their proposals from scratch, and none of them proposes to use adapters as we will describe in the next section.</p>
    </sec>
    <sec id="sec-2">
      <title>3. The Proposed Model</title>
      <p>The starting point for our work is the BERT model. BERT
(Bidirectional Encoder Representations from
Transformers) is a language representation model introduced by
Google in 2018. It is designed to pre-train deep
bidirectional representations by jointly conditioning on both
left and right context in all layers, making it deeply
bidirectional.</p>
      <p>Even if the present work is devoted to “complexifying” the BERT architecture for Italian, all the steps presented in the following sections can be used for any pre-trained version of BERT in different languages. Moreover, these steps form, in principle, building blocks to complexify any transformer architecture.</p>
      <sec id="sec-2-1">
        <title>3.1. Complex Numbers</title>
        <p>Complex numbers are an extension of the real number system. They consist of two parts: a real part and an imaginary part. The imaginary part is defined using the imaginary unit i, where i² = −1. A complex number is typically written in the form z = a + bi, where a and b are real numbers. Given z, ℛ(z) and ℐ(z) return, respectively, the real and imaginary part of z.</p>
        <p>The development of complex numbers allows for a
more complete understanding of algebraic equations,
especially those that have no real solutions and are crucial
in various fields such as engineering, physics, and applied
mathematics, providing tools for analysing waveforms,
electrical circuits, and quantum mechanics.</p>
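        <p>These definitions can be checked numerically. A minimal numpy sketch (the values of z and of the matrix A are illustrative only):</p>

```python
import numpy as np

z = 3.0 + 4.0j
assert (1j) ** 2 == -1                   # i^2 = -1
assert z.real == 3.0 and z.imag == 4.0   # R(z) and I(z)
assert z.conjugate() == 3.0 - 4.0j       # conjugation flips the sign of the imaginary part

# Conjugate (Hermitian) transpose of a complex-valued matrix: conj(A) then transpose
A = np.array([[1 + 2j, 3 - 1j],
              [0 + 1j, 2 + 0j]])
A_dag = A.conj().T                       # usually written A†
```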
        <p>
          All the standard algebraic operations on real numbers can be extended or defined also on the complex field C. Moreover, the complex conjugate of a complex number is obtained by changing the sign of its imaginary part: for a complex number z = a + bi, its complex conjugate is z̄ = a − bi. In the context of matrices, the conjugate transpose (also known as the Hermitian transpose) involves taking the transpose of a matrix and then taking the complex conjugate of each element; given a complex-valued matrix A, it is usually denoted as A†.
        </p>
      </sec>
      <sec id="sec-2-1b">
        <title>3.2. LoRA Adapters</title>
        <p>
          When fine-tuning a pre-trained language model, the goal is to adjust the model parameters to better fit a specific task. However, large language models have millions or billions of parameters, making this process resource-intensive. LoRA [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] addresses this by introducing a low-rank decomposition approach to fine-tuning.
        </p>
        <p>Suppose we have a pre-trained model with weight matrices W in various layers. For simplicity, consider a single weight matrix W ∈ R^(d×k). LoRA approximates the update to the weight matrix ΔW using a low-rank factorisation. Instead of directly updating W, as W′ = W + ΔW, we decompose the update as ΔW = B · A, where B ∈ R^(d×r) and A ∈ R^(r×k), with r ≪ min(d, k). A and B are the learnable parameters, while W usually remains fixed.</p>
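        <p>The decomposition above can be sketched in a few lines of numpy; the dimensions d and k and the rank r are illustrative, not those of any actual model:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 6, 2                  # matrix dims and LoRA rank, r << min(d, k)

W = rng.normal(size=(d, k))        # frozen pre-trained weight
B = np.zeros((d, r))               # learnable, zero-initialised so the update starts at 0
A = rng.normal(size=(r, k))        # learnable

delta_W = B @ A                    # low-rank update, rank at most r
W_prime = W + delta_W              # effective fine-tuned weight

full_params = d * k                # parameters of a full update of W
lora_params = r * (d + k)          # parameters trained by LoRA instead
```

<p>Note the parameter saving: r·(d + k) trainable values instead of d·k, which is where the efficiency of the method comes from.</p>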
        <p>LoRA adapters provide an efficient method for fine-tuning large models by leveraging low-rank approximations. This approach reduces the number of trainable parameters and the computational cost while maintaining the model’s performance, making it a practical solution for adapting large-scale pre-trained models to specific tasks.</p>
        <p>
          Moreover, Lialin et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] showed that we can safely
apply LoRA also for pre-training transformer encoders from scratch, obtaining performances comparable to those of the original models.
        </p>
        <p>Given these premises, the main idea introduced by this work is to define A and B as complex-valued matrices used to adapt a generic weight matrix W of the pre-trained real-valued model to produce complex-valued outputs. All the W matrices will be kept frozen and the standard LoRA forward update with input vector x will become y = (W + B · A†) x.</p>
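        <p>A minimal numpy sketch of this complex-valued forward update (sizes and data are illustrative); with the adapters set to zero the layer reduces exactly to the original real-valued computation:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 6, 6, 2                      # illustrative sizes; W is d x k, adapter rank r

W = rng.normal(size=(d, k))            # frozen real-valued pre-trained weight
B = rng.normal(size=(d, r)) + 1j * rng.normal(size=(d, r))   # complex-valued adapter
A = rng.normal(size=(k, r)) + 1j * rng.normal(size=(k, r))   # complex-valued adapter

x = rng.normal(size=(k,))              # real-valued input vector
y = (W + B @ A.conj().T) @ x           # y = (W + B A†) x, complex-valued output

# zero adapters -> the layer degenerates to the original real-valued model
y_zero = (W + np.zeros((d, r)) @ np.zeros((k, r)).conj().T) @ x
```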
      </sec>
      <sec id="sec-2-2">
        <title>3.3. Embeddings</title>
        <p>The BERT embedding layer is responsible for converting input tokens into dense vectors that can be processed by subsequent layers. It consists of three main components: the Token Embeddings, which map each token to a fixed-size vector representation; the Segment Embeddings, which add a segment identifier to each token to distinguish between different segments (e.g., sentences); and the Positional Embeddings, which mark positional information to capture the order of tokens. These three embeddings are learned during the pre-training phase and summed to form the final input embedding, which is then passed to the transformer encoder layers for further processing.</p>
        <p>Each component represents the corresponding embeddings as a real-valued matrix that can be made complex-valued by summing a complex-valued LoRA adapter as described in Section 3.2.</p>
      </sec>
      <sec id="sec-2-3">
        <title>3.6. Complex Layer Normalisation</title>
        <p>
          As suggested in Eilers and Jiang [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], Li et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], normalising real and imaginary parts separately could lead to poor normalisations and very elliptical distributions. Inspired by the work of Eilers and Jiang [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], we normalised a generic complex vector z ∈ C^d by first computing its mean μ(z) = (1/d) Σ_{j=1..d} z_j and the covariance matrix C(z) = [Var(ℛ(z)), Cov(ℛ(z), ℐ(z)); Cov(ℛ(z), ℐ(z)), Var(ℐ(z))], where Var and Cov indicate the real-valued variance and covariance functions, and then producing a normalised output vector z′ = γ · √(C⁻¹(z)) · [ℛ(z − μ(z)); ℐ(z − μ(z))] + β, where γ and β are two vectors of the same dimension of z for applying an affine transformation to the normalised vector.
        </p>
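        <p>The normalisation can be sketched as follows, assuming biased variance/covariance estimates and a small eps added for numerical stability (both assumptions of this sketch, not prescriptions of the paper):</p>

```python
import numpy as np

def complex_layer_norm(z, gamma, beta, eps=1e-5):
    """Jointly whiten real/imag parts with the 2x2 covariance matrix C(z)."""
    z = z - z.mean()                            # subtract the complex mean mu(z)
    R, I = z.real, z.imag
    C = np.cov(np.stack([R, I]), bias=True) + eps * np.eye(2)   # 2x2 covariance
    w, U = np.linalg.eigh(C)                    # C = U diag(w) U^T, w > 0
    C_inv_sqrt = U @ np.diag(w ** -0.5) @ U.T   # C^{-1/2}
    RI = C_inv_sqrt @ np.stack([R, I])          # whitened real/imag components
    zn = RI[0] + 1j * RI[1]
    return gamma * zn + beta                    # learnable affine transformation

rng = np.random.default_rng(3)
# deliberately elliptical input: imaginary part has much larger variance
z = rng.normal(size=64) + 1j * (3 * rng.normal(size=64) + 1)
out = complex_layer_norm(z, gamma=np.ones(64), beta=np.zeros(64))
```

<p>After whitening, real and imaginary parts have unit variance and near-zero covariance, i.e. the elliptical distribution has been made circular.</p>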
      </sec>
      <sec id="sec-2-4">
        <title>3.4. Multi-head Self-Attention</title>
        <p>Self-attention is a mechanism in neural networks that allows each element of an input sequence to focus on, or “attend to”, other elements in the same sequence. In the context of BERT and other transformer models, self-attention helps capture the relationships and dependencies between words, regardless of their distance from each other in the text. The self-attention mechanism can be succinctly expressed in matrix form as:</p>
        <p>Q = X · W_Q, K = X · W_K, V = X · W_V</p>
        <p>Attention(Q, K, V) = softmax((Q · Kᵀ) / √d_k) · V</p>
        <p>
          where X ∈ R^(n×d) is the input embedding matrix, W_Q, W_K, W_V ∈ R^(d×d_k) are projection matrices, d is the input embedding size and d_k = d/#heads. The output matrix, once the contributions of the different heads are concatenated and further projected into the initial dimension d, contains the context-aware representations for each word in the input sequence, incorporating information from all other words as determined by their relevance.
        </p>
        <p>
          In order to convert the real-valued self-attention mechanism to manage complex-valued inputs, it is sufficient to modify the three projection matrices W_Q, W_K, W_V using a complex-valued LoRA adapter as shown before and modify the attention computation as Attention(Q, K, V) = softmax(|Q · K†| / √d_k) · V. The complex-valued Query and Key matrices are multiplied and the modulus of each complex-valued component of the resulting matrix is computed (as suggested in Eilers and Jiang [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], Li et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]), normalised by √d_k and transformed into a probability distribution by the softmax function, to be used as attention weights for the complex-valued matrix V.
        </p>
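        <p>A numpy sketch of the modified attention computation (sequence length, head size and data are illustrative); the modulus makes the attention scores real, so the standard softmax applies unchanged:</p>

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
n, dk = 4, 8                                  # sequence length, head size

# complex-valued queries, keys and values (random, for illustration only)
Q = rng.normal(size=(n, dk)) + 1j * rng.normal(size=(n, dk))
K = rng.normal(size=(n, dk)) + 1j * rng.normal(size=(n, dk))
V = rng.normal(size=(n, dk)) + 1j * rng.normal(size=(n, dk))

scores = np.abs(Q @ K.conj().T) / np.sqrt(dk)  # |Q K†| / sqrt(d_k), real-valued
attn = softmax(scores)                         # rows are probability distributions
out = attn @ V                                 # complex-valued attention output
```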
      </sec>
      <sec id="sec-2-5">
        <title>3.5. Linear Layers</title>
        <p>A linear layer, also known as a fully connected layer or dense layer, is a fundamental building block in transformer networks. It performs a linear transformation on the input data by applying a weight matrix and adding a bias vector. Mathematically, it can be described as y = W · x + b, where x is the input vector, W the weight matrix and b the bias vector.</p>
        <p>As before, to transform a real-valued linear layer into a complex-valued one, it is sufficient to apply a LoRA adapter to the weight matrix and add a further complex-valued bias vector c to the result, mathematically: y = x · (W + B · A†) + (b + c).</p>
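        <p>A numpy sketch of the complex-valued linear layer (all names and sizes are illustrative): the frozen real-valued weight W and bias b are kept, while the complex adapter B·A† and the extra complex bias c carry the new imaginary information.</p>

```python
import numpy as np

rng = np.random.default_rng(5)
d_in, d_out, r = 6, 4, 2

W = rng.normal(size=(d_in, d_out))     # frozen real-valued weight
b = rng.normal(size=(d_out,))          # original real-valued bias
B = rng.normal(size=(d_in, r)) + 1j * rng.normal(size=(d_in, r))   # complex adapter
A = rng.normal(size=(d_out, r)) + 1j * rng.normal(size=(d_out, r)) # complex adapter
c = rng.normal(size=(d_out,)) + 1j * rng.normal(size=(d_out,))     # added complex bias

x = rng.normal(size=(d_in,)) + 1j * rng.normal(size=(d_in,))       # complex input
y = x @ (W + B @ A.conj().T) + (b + c)  # y = x (W + B A†) + (b + c)
```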
      </sec>
      <sec id="sec-2-6">
        <title>3.7. Activation Function</title>
        <p>
          In BERT, the primary activation function used is the
Gaussian Error Linear Unit (GELU). We extended this function
to complex-valued inputs in a simple way following Li
et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] as:
GELU_C(z) = GELU(ℛ(z)) + i · GELU(ℐ(z))
where z ∈ C^d is a generic complex-valued vector. With regard to the pooling layer, we applied the same principle to the tanh activation.
        </p>
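        <p>A numpy sketch of this extension, using the common tanh approximation of the real-valued GELU (an assumption of the sketch; any real-valued GELU implementation can be split over the two parts in the same way):</p>

```python
import numpy as np

def gelu(x):
    # tanh approximation of the real-valued GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def complex_gelu(z):
    # GELU_C(z) = GELU(Re(z)) + i GELU(Im(z)), applied elementwise
    return gelu(z.real) + 1j * gelu(z.imag)

z = np.array([1.0 + 1.0j, -2.0 + 0.5j, 5.0 - 3.0j])
out = complex_gelu(z)
```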
      </sec>
      <sec id="sec-2-7">
        <title>3.8. Training Heads and Loss Functions</title>
        <p>In BERT, the term “training heads” refers to the additional layers added on top of the base BERT model for solving specific tasks. These heads are tailored to the type of problem BERT is being fine-tuned to solve. The most common training heads include the Masked Language Model (MLM) and Next Sentence Prediction (NSP) heads used for BERT pre-training, and the Sequence/Token Classification heads trained alongside the base BERT model during fine-tuning, enabling the model to be adapted to various NLU tasks by leveraging its robust contextual embeddings.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experiments</title>
      <p>
        All the experiments presented in this work rely on the
same base Italian BERT model used as baseline in Basile
et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], namely “dbmdz/bert-base-italian-xxl-uncased”
(abbreviated as ‘ItalianBERT_XXL’ as in the cited paper),
available in the Huggingface model repository (https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased).
      </p>
      <sec id="sec-3-1">
        <title>4.1. Datasets for Pre-training and Evaluation</title>
        <p>Pre-Training. The dataset we used for continuing the
pre-training of the proposed model in order to set up
the complex-valued LoRA parameters is similar to that
used for pre-training the basic model from DBMDZ. It
is formed by the 1/3/2022 dump of the Italian Wikipedia
available on the Huggingface datasets repository and an
equivalent “BookCorpus” we built using Italian ebooks.</p>
        <p>During the pre-training phase we adopted the same hyperparameters used for training BERT, namely a learning rate of 1e-4, with a linear schedule with warmup, and a batch size of 512.</p>
        <p>In the proposed model, all these training heads are
configured in the same way as a single LoRA-adapted
linear layer, as described in Section 3.5, applying the
modulus function for transforming the complex-valued
output into a real-valued one and inject it into a standard
real-valued Cross Entropy loss function.</p>
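        <p>This head configuration can be sketched as follows: complex-valued logits are mapped to real values through the modulus and then fed to a standard cross-entropy loss (the number of classes and the data are illustrative, not those of any actual head):</p>

```python
import numpy as np

def cross_entropy(logits, target):
    # standard real-valued cross entropy for a single example
    e = np.exp(logits - logits.max())
    probs = e / e.sum()
    return -np.log(probs[target])

rng = np.random.default_rng(4)
num_classes = 5

# complex-valued output of a LoRA-adapted classification head
z = rng.normal(size=num_classes) + 1j * rng.normal(size=num_classes)

real_logits = np.abs(z)            # modulus turns complex outputs into real logits
loss = cross_entropy(real_logits, target=2)
```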
        <p>
          Evaluation. The performance evaluation for the proposed complex-valued model has been performed by relying on the Unified Interactive Natural Understanding of the Italian Language (UINAUIL) dataset collection, a benchmark of six tasks for Italian Natural Language Understanding [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Table 1 lists the datasets contained in UINAUIL with a short task description and dataset dimensions.
        </p>
        <p>
          In our evaluation experiments we adopted the hyperparameters proposed in Basile et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] for maintaining comparability, but our models are bigger and more complex and may need more training epochs and/or different learning rates to achieve full convergence during the fine-tuning phase for evaluation. For example, we were forced to reduce the learning rate to 1e-5 for each model evaluated on the TE benchmark to favour convergence. Again, we clarify that the goal of this work is not to beat other systems in the leaderboards, but to show the effectiveness of this approach for complexifying transformer architectures, and we think that the results confirm our initial research question.
        </p>
        <p>It is important to clarify that the goal of this work is not to produce a powerful model achieving the best scores in the leaderboards; instead, we relied on a standardised dataset to verify whether our complex-valued model is able to produce reliable embeddings that can be used for solving downstream tasks through fine-tuning, exhibiting performances similar to standard real-valued models (in this case, the cited ‘ItalianBERT_XXL’).</p>
        <p>
          All models have been fine-tuned for exactly 2 epochs, with a learning rate of 1e-4, as in the cited experiments from Basile et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], and with a batch size of 32 (unique exception: the task TE, which did not converge with a batch size bigger than 4).
        </p>
        <p>Having complexified the BERT matrices by adding LoRA adapters, we have no guarantee, in principle, that the system will not converge back to the original BERT-based model by setting all adapters to zero and nullifying all imaginary parts in the complex-valued model. We checked this in various ways and, as shown in Figure 1, some randomly chosen complex-valued components of token embeddings for the CmplxBERTLoRA_16 model cover the entire complex space in a uniform way, supporting the idea that the pre-training phase consistently adapted the starting real-valued model to produce reliable complex-valued embeddings.</p>
        <sec id="sec-3-2-1">
          <title>4.2. Results</title>
          <p>
            The influential paper from Reimers and Gurevych [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]
makes clear to the community that reporting a single score for each DNN training/evaluation session could be heavily affected by the system's random initialisation, and that we should instead report the mean and standard deviation of various runs with the same setting, in order to get a more accurate picture of the real system performance and make more reliable comparisons between systems. For these reasons, any result proposed in this paper is presented as the mean and standard deviation of the relevant metric over 5 runs with different random initialisations. We have also recomputed, using the same
protocol, the baseline results from Basile et al. [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] and
introduced a further baseline that always assigns the
highest frequency class.
          </p>
          <p>Table 2 shows the number of parameters for all the
models tested in our experiments, split between trainable
and non-trainable.</p>
          <p>Table 3 shows the performance results of the various
models in solving the UINAUIL tasks: our proposed
models exhibit performances in line with the original model
and sometimes better, especially for small-to-mid LoRA
ranks, with r equal to 16, 32 and 64.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Discussion and Conclusions</title>
      <p>We also did some experiments with a real-valued LoRA model containing about the same number of parameters as CmplxBERTLoRA_8, adding real-valued adapters of rank 16, to investigate whether a complex-valued transformer is able to produce better results than an equivalent real-valued one, but such experiments did not show any relevant performance differences between the two models.</p>
      <p>This work presented a relevant set of experiments testing the idea of complexifying a Transformer encoder architecture like BERT by using complex-valued LoRA adapters. The obtained results on Italian models are very encouraging, showing clearly that this technique is effective in transforming a real-valued pre-trained model into a complex-valued one while maintaining the same level of performance.</p>
      <p>This pilot study presents only the first step towards proposing building blocks based on LoRA adapters for complexifying any kind of transformer, either for representation learning, for text generation, or for both processes together. All the complex-valued models were pre-trained on various GPUs to speed up the experiments, but a general CmplxBERTLoRA model can be trained on a single 12/16GB GPU without problems, while the pre-training of a complex-valued BERT model from scratch would have required at least 4 NVIDIA A100 64GB GPUs to obtain results in reasonable time. Using LoRA for ‘complexifying’ a model mitigates the need for complex and expensive computational infrastructures not easily available to every scholar.</p>
      <p>We have to say that the UINAUIL benchmark is not without problems: the TE dataset is very small and such large models struggle to reliably converge to a reasonable minimum during training, leading to very unstable results.</p>
      <p>Code and models are available on GitHub (https://github.com/ftamburin/CmplxBERTLoRA).</p>
      <p>FactA is very problematic as well: classes are strongly
skewed and the Max_Freq_Baseline, always choosing the
highest-frequency class, is able to achieve an accuracy
of 0.967! For all these reasons, we think that these two
benchmarks should be excluded from any real evaluation.</p>
      <p>Table 3 presents the experiment results when testing the considered models on the UINAUIL tasks, as mean and standard deviation of 5 runs. The official metric is marked with an arrow pointing in the direction of the best values. The best result for each task is marked in boldface, while the underlined value is the best result obtained by our complex-valued model.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Arjovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Unitary evolution recurrent neural networks</article-title>
          ,
          <source>in: Proceedings of the 33rd International Conference on International Conference on Machine Learning - ICML'16</source>
          , JMLR.org,
          <year>2016</year>
          , p.
          <fpage>1120</fpage>
          -
          <lpage>1128</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Trouillon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Welbl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          , E. Gaussier, G. Bouchard,
          <article-title>Complex embeddings for simple link prediction</article-title>
          ,
          <source>in: Proceedings of the 33rd International Conference on International Conference on Machine Learning - ICML'16</source>
          , JMLR.org,
          <year>2016</year>
          , p.
          <fpage>2071</fpage>
          -
          <lpage>2080</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Trabelsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bilaniuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Serdyuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Subramanian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mehri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rostamzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Pal</surname>
          </string-name>
          , Deep Complex Networks,
          <source>in: Proc. of the International Conference on Learning Representations, ICLR</source>
          <year>2018</year>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Q.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H. H.</given-names>
            <surname>Tsai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          , Complex Transformer:
          <article-title>A Framework for Modeling Complex-Valued Sequence</article-title>
          ,
          <source>in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP</source>
          <year>2020</year>
          ),
          <year>2020</year>
          , pp.
          <fpage>4232</fpage>
          -
          <lpage>4236</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Eilers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <article-title>Building Blocks for a ComplexValued Transformer Architecture</article-title>
          ,
          <source>in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <source>IEEE Signal Processing Society</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lioma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Simonsen</surname>
          </string-name>
          ,
          <article-title>Encoding word order in complex embeddings</article-title>
          ,
          <source>in: Proceedings of the International Conference on Learning Representations</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <article-title>A survey of quantum-cognitively inspired sentiment analysis models</article-title>
          ,
          <source>ACM Comput. Surv</source>
          .
          <volume>56</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Tamburini</surname>
          </string-name>
          ,
          <article-title>A quantum-like approach to word sense disambiguation</article-title>
          , in: R. Mitkov, G. Angelova (Eds.),
          <source>Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP</source>
          <year>2019</year>
          ), INCOMA Ltd.,
          <string-name>
            <surname>Varna</surname>
          </string-name>
          , Bulgaria,
          <year>2019</year>
          , pp.
          <fpage>1176</fpage>
          -
          <lpage>1185</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Corrado,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          , in: C.
          <string-name>
            <surname>Burges</surname>
          </string-name>
          , et al. (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          <volume>26</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          , Inc.,
          <year>2013</year>
          , pp.
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , L. u. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          , in: I. Guyon,
          <string-name>
            <given-names>U. V.</given-names>
            <surname>Luxburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
,
<string-name>
  <given-names>R.</given-names>
  <surname>Garnett</surname>
</string-name>
(Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>30</volume>
,
<publisher-name>Curran Associates, Inc.</publisher-name>
,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
,
<string-name>
  <given-names>W.</given-names>
  <surname>Chen</surname>
</string-name>
,
<article-title>LoRA: Low-rank adaptation of large language models</article-title>
, in:
<source>Proceedings of the International Conference on Learning Representations</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>V.</given-names>
            <surname>Lialin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shivagunde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Muckatira</surname>
          </string-name>
,
<string-name>
  <given-names>A.</given-names>
  <surname>Rumshisky</surname>
</string-name>
,
          <article-title>ReLoRA: High-Rank Training Through Low-Rank Updates</article-title>
, in:
<source>Proceedings of the International Conference on Learning Representations</source>
          , Vienna, Austria,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bioglio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <article-title>UINAUIL: A unified benchmark for Italian natural language understanding</article-title>
, in:
<string-name>
  <given-names>D.</given-names>
  <surname>Bollegala</surname>
</string-name>
,
<string-name>
  <given-names>R.</given-names>
  <surname>Huang</surname>
</string-name>
,
<string-name>
  <given-names>A.</given-names>
  <surname>Ritter</surname>
</string-name>
(Eds.),
<source>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)</source>
,
<publisher-name>Association for Computational Linguistics</publisher-name>
          , Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>348</fpage>
          -
          <lpage>356</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lioma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
,
<article-title>Adapting Pre-trained Language Models for Quantum Natural Language Processing</article-title>
          ,
          <year>2023</year>
. arXiv:2302.13812.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
,
<string-name>
  <given-names>I.</given-names>
  <surname>Gurevych</surname>
</string-name>
          ,
<article-title>Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging</article-title>
, in:
<source>Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
,
<publisher-name>ACL</publisher-name>
          , Copenhagen, Denmark,
          <year>2017</year>
          , pp.
          <fpage>338</fpage>
          -
          <lpage>348</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>