<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Complexifying BERT using LoRA Adapters</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fabio Tamburini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FICLIT - University of Bologna</institution>
          ,
          <addr-line>Via Zamboni, 32, Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper presents the first results of a pilot study on transforming a real-valued pre-trained transformer encoder into a complex-valued one. Following recent findings about pre-training with LoRA, the main idea is to employ complex-valued LoRA adapters to do the trick and continue the pre-training of a given Italian model in order to set up the adapters. After pre-training, the proposed complex-valued model has been evaluated on a standardised benchmark for Italian natural-language understanding, obtaining very encouraging results.</p>
      </abstract>
      <kwd-group>
        <kwd>Complex-valued Transformers</kwd>
        <kwd>Language-Model Pre-Training</kwd>
        <kwd>LoRA Adapters</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Italian</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Yang et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] concentrate on the development of a complex-valued transformer for speech, signal and audio data that are naturally complex-valued after the Fourier Transform.
      </p>
      <p>
        Wang et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], working on positional embeddings and proposing a solution for modelling both the global absolute positions of words and their order relationships, introduced a small complex-valued transformer architecture to test their ideas.
      </p>
      <p>
        The works from Eilers and Jiang [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and Li et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
have the goal of providing a complete model for building complex-valued transformer encoders, describing possible building blocks for doing so and testing different configurations and parameters.
      </p>
      <p>As we said before, all these works pre-train their proposals from scratch, and none of them proposes to use adapters as we will describe in the next section.</p>
    </sec>
    <sec id="sec-2">
      <title>3. The Proposed Model</title>
      <p>The starting point for our work is the BERT model. BERT
(Bidirectional Encoder Representations from
Transformers) is a language representation model introduced by
Google in 2018. It is designed to pre-train deep
bidirectional representations by jointly conditioning on both
left and right context in all layers, making it deeply
bidirectional.</p>
      <p>Even if the present work is devoted to “complexifying” the BERT architecture for Italian, all the steps presented in the following sections can be used for any pre-trained version of BERT in different languages. Moreover, these steps form, in principle, building blocks to complexify any transformer architecture.</p>
      <sec id="sec-2-1">
        <title>3.1. Complex Numbers</title>
        <p>Complex numbers are an extension of the real number system. They consist of two parts: a real part and an imaginary part. The imaginary part is defined using the imaginary unit i, where i² = −1. A complex number is typically written in the form z = a + bi, where a and b are real numbers. Given z, ℛ(z) and ℐ(z) return, respectively, the real and imaginary part of z.</p>
        <p>The development of complex numbers allows for a
more complete understanding of algebraic equations,
especially those that have no real solutions and are crucial
in various fields such as engineering, physics, and applied
mathematics, providing tools for analysing waveforms,
electrical circuits, and quantum mechanics.</p>
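        <p>These definitions can be checked numerically. A minimal numpy sketch (the values of z and of the matrix A are illustrative only):</p>

```python
import numpy as np

z = 3.0 + 4.0j
assert (1j) ** 2 == -1                   # i^2 = -1
assert z.real == 3.0 and z.imag == 4.0   # R(z) and I(z)
assert z.conjugate() == 3.0 - 4.0j       # conjugation flips the sign of the imaginary part

# Conjugate (Hermitian) transpose of a complex-valued matrix: conj(A) then transpose
A = np.array([[1 + 2j, 3 - 1j],
              [0 + 1j, 2 + 0j]])
A_dag = A.conj().T                       # usually written A†
```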
        <p>
          All the standard algebraic operations on real numbers can be extended or defined also on the complex field C. Moreover, the complex conjugate of a complex number is obtained by changing the sign of its imaginary part: for a complex number z = a + bi, its complex conjugate is z̄ = a − bi. In the context of matrices, the conjugate transpose (also known as the Hermitian transpose) involves taking the transpose of a matrix and then taking the complex conjugate of each element; given a complex-valued matrix A, it is usually denoted as A†.
        </p>
      </sec>
      <sec id="sec-2-1b">
        <title>3.2. LoRA Adapters</title>
        <p>
          When fine-tuning a pre-trained language model, the goal is to adjust the model parameters to better fit a specific task. However, large language models have millions or billions of parameters, making this process resource-intensive. LoRA [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] addresses this by introducing a low-rank decomposition approach to fine-tuning.
        </p>
        <p>Suppose we have a pre-trained model with weight matrices W in various layers. For simplicity, consider a single weight matrix W ∈ R^(d×k). LoRA approximates the update to the weight matrix ΔW using a low-rank factorisation. Instead of directly updating W, as W′ = W + ΔW, we decompose the update as ΔW = B · A, where B ∈ R^(d×r) and A ∈ R^(r×k), with r ≪ min(d, k). A and B are the learnable parameters, while W usually remains fixed.</p>
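        <p>The decomposition above can be sketched in a few lines of numpy; the dimensions d and k and the rank r are illustrative, not those of any actual model:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 6, 2                  # matrix dims and LoRA rank, r << min(d, k)

W = rng.normal(size=(d, k))        # frozen pre-trained weight
B = np.zeros((d, r))               # learnable, zero-initialised so the update starts at 0
A = rng.normal(size=(r, k))        # learnable

delta_W = B @ A                    # low-rank update, rank at most r
W_prime = W + delta_W              # effective fine-tuned weight

full_params = d * k                # parameters of a full update of W
lora_params = r * (d + k)          # parameters trained by LoRA instead
```

<p>Note the parameter saving: r·(d + k) trainable values instead of d·k, which is where the efficiency of the method comes from.</p>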
        <p>LoRA adapters provide an efficient method for fine-tuning large models by leveraging low-rank approximations. This approach reduces the number of trainable parameters and the computational cost while maintaining the model’s performance, making it a practical solution for adapting large-scale pre-trained models to specific tasks.</p>
        <p>
          Moreover, Lialin et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] showed that we can safely
apply LoRA also for pre-training transformer encoders from scratch, obtaining performances comparable to those of the original models.
        </p>
        <p>Given these premises, the main idea introduced by this work is to define A and B as complex-valued matrices used to adapt a generic weight matrix W of the pre-trained real-valued model to produce complex-valued outputs. All the W matrices will be kept frozen and the standard LoRA forward update with input vector x will become y = (W + B · A†) x.</p>
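        <p>A minimal numpy sketch of this complex-valued forward update (sizes and data are illustrative); with the adapters set to zero the layer reduces exactly to the original real-valued computation:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 6, 6, 2                      # illustrative sizes; W is d x k, adapter rank r

W = rng.normal(size=(d, k))            # frozen real-valued pre-trained weight
B = rng.normal(size=(d, r)) + 1j * rng.normal(size=(d, r))   # complex-valued adapter
A = rng.normal(size=(k, r)) + 1j * rng.normal(size=(k, r))   # complex-valued adapter

x = rng.normal(size=(k,))              # real-valued input vector
y = (W + B @ A.conj().T) @ x           # y = (W + B A†) x, complex-valued output

# zero adapters -> the layer degenerates to the original real-valued model
y_zero = (W + np.zeros((d, r)) @ np.zeros((k, r)).conj().T) @ x
```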
      </sec>
      <sec id="sec-2-2">
        <title>3.3. Embeddings</title>
        <p>The BERT embedding layer is responsible for converting input tokens into dense vectors that can be processed by subsequent layers. It consists of three main components: the Token Embeddings, which map each token to a fixed-size vector representation; the Segment Embeddings, which add a segment identifier to each token to distinguish between different segments (e.g., sentences); and the Positional Embeddings, which mark positional information to capture the order of tokens. These three embeddings are learned during the pre-training phase and summed to form the final input embedding, which is then passed to the transformer encoder layers for further processing.</p>
        <p>Each component represents the corresponding embeddings as a real-valued matrix that can be made complex-valued by summing a complex-valued LoRA adapter as described in Section 3.2.</p>
      </sec>
      <sec id="sec-2-3">
        <title>3.6. Complex Layer Normalisation</title>
        <p>
          As suggested in Eilers and Jiang [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], Li et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], normalising real and imaginary parts separately could lead to poor normalisations and very elliptical distributions. Inspired by the work of Eilers and Jiang [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], we normalised a generic complex vector z ∈ C^d by first computing its mean μ(z) = (1/d) Σ_{j=1..d} z_j and the covariance matrix C(z) = [Var(ℛ(z)), Cov(ℛ(z), ℐ(z)); Cov(ℛ(z), ℐ(z)), Var(ℐ(z))], where Var and Cov indicate the real-valued variance and covariance functions, and then producing a normalised output vector z′ = γ · √(C⁻¹(z)) · [ℛ(z − μ(z)); ℐ(z − μ(z))] + β, where γ and β are two vectors of the same dimension of z for applying an affine transformation to the normalised vector.
        </p>
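        <p>The normalisation can be sketched as follows, assuming biased variance/covariance estimates and a small eps added for numerical stability (both assumptions of this sketch, not prescriptions of the paper):</p>

```python
import numpy as np

def complex_layer_norm(z, gamma, beta, eps=1e-5):
    """Jointly whiten real/imag parts with the 2x2 covariance matrix C(z)."""
    z = z - z.mean()                            # subtract the complex mean mu(z)
    R, I = z.real, z.imag
    C = np.cov(np.stack([R, I]), bias=True) + eps * np.eye(2)   # 2x2 covariance
    w, U = np.linalg.eigh(C)                    # C = U diag(w) U^T, w > 0
    C_inv_sqrt = U @ np.diag(w ** -0.5) @ U.T   # C^{-1/2}
    RI = C_inv_sqrt @ np.stack([R, I])          # whitened real/imag components
    zn = RI[0] + 1j * RI[1]
    return gamma * zn + beta                    # learnable affine transformation

rng = np.random.default_rng(3)
# deliberately elliptical input: imaginary part has much larger variance
z = rng.normal(size=64) + 1j * (3 * rng.normal(size=64) + 1)
out = complex_layer_norm(z, gamma=np.ones(64), beta=np.zeros(64))
```

<p>After whitening, real and imaginary parts have unit variance and near-zero covariance, i.e. the elliptical distribution has been made circular.</p>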
      </sec>
      <sec id="sec-2-4">
        <title>3.4. Multi-head Self-Attention</title>
        <p>Self-attention is a mechanism in neural networks that allows each element of an input sequence to focus on, or “attend to”, other elements in the same sequence. In the context of BERT and other transformer models, self-attention helps capture the relationships and dependencies between words, regardless of their distance from each other in the text. The self-attention mechanism can be succinctly expressed in matrix form as:</p>
        <p>Q = X · W_Q, K = X · W_K, V = X · W_V</p>
        <p>Attention(Q, K, V) = softmax((Q · Kᵀ) / √d_k) · V</p>
        <p>
          where X ∈ R^(n×d) is the input embedding matrix, W_Q, W_K, W_V ∈ R^(d×d_k) are projection matrices, d is the input embedding size and d_k = d/#heads. The output matrix, once the contributions of the different heads are concatenated and further projected into the initial dimension d, contains the context-aware representations for each word in the input sequence, incorporating information from all other words as determined by their relevance.
        </p>
        <p>
          In order to convert the real-valued self-attention mechanism to manage complex-valued inputs, it is sufficient to modify the three projection matrices W_Q, W_K, W_V using a complex-valued LoRA adapter as shown before and modify the attention computation as Attention(Q, K, V) = softmax(|Q · K†| / √d_k) · V. The complex-valued Query and Key matrices are multiplied and the modulus of each complex-valued component of the resulting matrix is computed (as suggested in Eilers and Jiang [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], Li et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]), normalised by √d_k and transformed into a probability distribution by the softmax function, to be used as attention weights for the complex-valued matrix V.
        </p>
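        <p>A numpy sketch of the modified attention computation (sequence length, head size and data are illustrative); the modulus makes the attention scores real, so the standard softmax applies unchanged:</p>

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
n, dk = 4, 8                                  # sequence length, head size

# complex-valued queries, keys and values (random, for illustration only)
Q = rng.normal(size=(n, dk)) + 1j * rng.normal(size=(n, dk))
K = rng.normal(size=(n, dk)) + 1j * rng.normal(size=(n, dk))
V = rng.normal(size=(n, dk)) + 1j * rng.normal(size=(n, dk))

scores = np.abs(Q @ K.conj().T) / np.sqrt(dk)  # |Q K†| / sqrt(d_k), real-valued
attn = softmax(scores)                         # rows are probability distributions
out = attn @ V                                 # complex-valued attention output
```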
      </sec>
      <sec id="sec-2-5">
        <title>3.5. Linear Layers</title>
        <p>A linear layer, also known as a fully connected layer or dense layer, is a fundamental building block in transformer networks. It performs a linear transformation on the input data by applying a weight matrix and adding a bias vector. Mathematically, it can be described as y = W · x + b, where x is the input vector, W the weight matrix and b the bias vector.</p>
        <p>As before, to transform a real-valued linear layer into a complex-valued one, it is sufficient to apply a LoRA adapter to the weight matrix and add a further complex-valued bias vector c to the result, mathematically: y = x · (W + B · A†) + (b + c).</p>
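        <p>A numpy sketch of the complex-valued linear layer (all names and sizes are illustrative): the frozen real-valued weight W and bias b are kept, while the complex adapter B·A† and the extra complex bias c carry the new imaginary information.</p>

```python
import numpy as np

rng = np.random.default_rng(5)
d_in, d_out, r = 6, 4, 2

W = rng.normal(size=(d_in, d_out))     # frozen real-valued weight
b = rng.normal(size=(d_out,))          # original real-valued bias
B = rng.normal(size=(d_in, r)) + 1j * rng.normal(size=(d_in, r))   # complex adapter
A = rng.normal(size=(d_out, r)) + 1j * rng.normal(size=(d_out, r)) # complex adapter
c = rng.normal(size=(d_out,)) + 1j * rng.normal(size=(d_out,))     # added complex bias

x = rng.normal(size=(d_in,)) + 1j * rng.normal(size=(d_in,))       # complex input
y = x @ (W + B @ A.conj().T) + (b + c)  # y = x (W + B A†) + (b + c)
```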
      </sec>
      <sec id="sec-2-6">
        <title>3.7. Activation Function</title>
        <p>
          In BERT, the primary activation function used is the
Gaussian Error Linear Unit (GELU). We extended this function
to complex-valued inputs in a simple way following Li
et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] as:
GELU_C(z) = GELU(ℛ(z)) + i · GELU(ℐ(z))
where z ∈ C^d is a generic complex-valued vector. With regard to the pooling layer, we applied the same principle to the tanh activation.
        </p>
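        <p>A numpy sketch of this extension, using the common tanh approximation of the real-valued GELU (an assumption of the sketch; any real-valued GELU implementation can be split over the two parts in the same way):</p>

```python
import numpy as np

def gelu(x):
    # tanh approximation of the real-valued GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def complex_gelu(z):
    # GELU_C(z) = GELU(Re(z)) + i GELU(Im(z)), applied elementwise
    return gelu(z.real) + 1j * gelu(z.imag)

z = np.array([1.0 + 1.0j, -2.0 + 0.5j, 5.0 - 3.0j])
out = complex_gelu(z)
```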
      </sec>
      <sec id="sec-2-7">
        <title>3.8. Training Heads and Loss Functions</title>
        <p>In BERT, the term “training heads” refers to the additional layers added on top of the base BERT model for solving specific tasks. These heads are tailored to the type of problem BERT is being fine-tuned to solve. The most common training heads include the Masked Language Model (MLM) and Next Sentence Prediction (NSP) heads used for BERT pre-training, and the Sequence/Token Classification heads trained alongside the base BERT model during fine-tuning, enabling the model to be adapted to various NLU tasks by leveraging its robust contextual embeddings.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experiments</title>
      <p>
        All the experiments presented in this work rely on the
same base Italian BERT model used as baseline in Basile
et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], namely “dbmdz/bert-base-italian-xxl-uncased”
(abbreviated as ‘ItalianBERT_XXL’ as in the cited paper),
available in the Huggingface model repository (https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased).
      </p>
      <sec id="sec-3-1">
        <title>4.1. Datasets for Pre-training and Evaluation</title>
        <p>Pre-Training. The dataset we used for continuing the
pre-training of the proposed model in order to set up
the complex-valued LoRA parameters is similar to that
used for pre-training the basic model from DBMDZ. It
is formed by the 1/3/2022 dump of the Italian Wikipedia
available on the Huggingface datasets repository and an
equivalent “BookCorpus” we built using Italian ebooks.</p>
        <p>During the pre-training phase we adopted the same hyperparameters used for training BERT, namely a learning rate of 1e-4, with a linear schedule with warmup, and a batch size of 512.</p>
        <p>In the proposed model, all these training heads are
configured in the same way as a single LoRA-adapted
linear layer, as described in Section 3.5, applying the
modulus function for transforming the complex-valued
output into a real-valued one and inject it into a standard
real-valued Cross Entropy loss function.</p>
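        <p>This head configuration can be sketched as follows: complex-valued logits are mapped to real values through the modulus and then fed to a standard cross-entropy loss (the number of classes and the data are illustrative, not those of any actual head):</p>

```python
import numpy as np

def cross_entropy(logits, target):
    # standard real-valued cross entropy for a single example
    e = np.exp(logits - logits.max())
    probs = e / e.sum()
    return -np.log(probs[target])

rng = np.random.default_rng(4)
num_classes = 5

# complex-valued output of a LoRA-adapted classification head
z = rng.normal(size=num_classes) + 1j * rng.normal(size=num_classes)

real_logits = np.abs(z)            # modulus turns complex outputs into real logits
loss = cross_entropy(real_logits, target=2)
```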
        <p>
          Evaluation. The performance evaluation for the proposed complex-valued model has been performed by relying on the Unified Interactive Natural Understanding of the Italian Language (UINAUIL) dataset collection, a benchmark of six tasks for Italian Natural Language Understanding [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Table 1 lists the datasets contained in UINAUIL with a short task description and dataset dimensions.
        </p>
        <p>
          In our evaluation experiments we adopted the hyperparameters proposed in Basile et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] for maintaining comparability, but our models are bigger and more complex and may need more training epochs and/or different learning rates to achieve full convergence during the fine-tuning phase for evaluation. For example, we were forced to reduce the learning rate to 1e-5 for each model evaluated on the TE benchmark to favour convergence. Again, we clarify that the goal of this work is not to beat other systems in the leaderboards, but to show the effectiveness of this approach for complexifying transformer architectures, and we think that the results confirm our initial research question.
        </p>
        <p>It is important to clarify that the goal of this work is not to produce a powerful model achieving the best scores in the leaderboards; instead, we relied on a standardised dataset to verify whether our complex-valued model is able to produce reliable embeddings that can be used for solving downstream tasks through fine-tuning, exhibiting performances similar to standard real-valued models (in this case, the cited ‘ItalianBERT_XXL’).</p>
        <p>
          All models have been fine-tuned for exactly 2 epochs, with a learning rate of 1e-4, as in the cited experiments from Basile et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], and with a batch size of 32 (unique exception: the task TE, which did not converge with a batch size bigger than 4).
        </p>
        <p>Having complexified the BERT matrices by adding LoRA adapters, we have no guarantee, in principle, that the system will not converge back to the original BERT-based model by setting all adapters to zero and nullifying all imaginary parts in the complex-valued model. We checked this in various ways and, as shown in Figure 1, some randomly chosen complex-valued components of token embeddings for the CmplxBERTLoRA_16 model cover the entire complex space in a uniform way, supporting the idea that the pre-training phase consistently adapted the starting real-valued model to produce reliable complex-valued embeddings.</p>
        <sec id="sec-3-2-1">
          <title>4.2. Results</title>
          <p>
            The influential paper from Reimers and Gurevych [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]
makes clear to the community that reporting a single score for each DNN training/evaluation session could be heavily affected by the system's random initialisation, and that we should instead report the mean and standard deviation of various runs with the same setting, in order to get a more accurate picture of the real system performance and make more reliable comparisons between systems. For these reasons, any result proposed in this paper is presented as the mean and standard deviation of the relevant metric over 5 runs with different random initialisations. We have also recomputed, using the same
protocol, the baseline results from Basile et al. [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] and
introduced a further baseline that always assigns the
highest frequency class.
          </p>
          <p>Table 2 shows the number of parameters for all the
models tested in our experiments, split between trainable
and non-trainable.</p>
          <p>Table 3 shows the performance results of the various
models in solving the UINAUIL tasks: our proposed
models exhibit performances in line with the original model
and sometimes better, especially for small-to-mid LoRA
ranks, with r equal to 16, 32 and 64.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Discussion and Conclusions</title>
      <p>We also did some experiments with a real-valued LoRA model containing about the same number of parameters as CmplxBERTLoRA_8, adding real-valued adapters of rank 16, to investigate whether a complex-valued transformer is able to produce better results than an equivalent real-valued one, but such experiments did not show any relevant performance differences between the two models.</p>
      <p>This work presented a relevant set of experiments testing the idea of complexifying a Transformer encoder architecture like BERT by using complex-valued LoRA adapters. The obtained results on Italian models are very encouraging, showing clearly that this technique is effective in transforming a real-valued pre-trained model into a complex-valued one while maintaining the same level of performance.</p>
      <p>This pilot study presents only the first step towards proposing building blocks based on LoRA adapters for complexifying any kind of transformer, either for representation learning, for text generation, or for both processes together. All the complex-valued models were pre-trained on various GPUs to speed up the experiments, but a general CmplxBERTLoRA model can be trained on a single 12/16GB GPU without problems, while the pre-training of a complex-valued BERT model from scratch would have required at least 4 NVIDIA A100 64GB GPUs to obtain results in reasonable time. Using LoRA for ‘complexifying’ a model mitigates the need for complex and expensive computational infrastructures not easily available to every scholar.</p>
      <p>We have to say that the UINAUIL benchmark is not without problems: the TE dataset is very small and such large models struggle to reliably converge to a reasonable minimum during training, leading to very unstable results.</p>
      <p>Code and models are available on GitHub (https://github.com/ftamburin/CmplxBERTLoRA).</p>
      <p>FactA is very problematic as well: classes are strongly
skewed and the Max_Freq_Baseline, always choosing the
highest-frequency class, is able to achieve an accuracy
of 0.967! For all these reasons, we think that these two
benchmarks should be excluded from any real evaluation.</p>
      <p>Table 3 presents the experiment results when testing the considered models on the UINAUIL tasks, as mean and standard deviation of 5 runs. The official metric is marked with an arrow pointing in the direction of the best values. The best result for each task is marked in boldface, while the underlined value is the best result obtained by our complex-valued model.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Arjovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Unitary evolution recurrent neural networks</article-title>
          ,
          <source>in: Proceedings of the 33rd International Conference on International Conference on Machine Learning - ICML'16</source>
          , JMLR.org,
          <year>2016</year>
          , p.
          <fpage>1120</fpage>
          -
          <lpage>1128</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Trouillon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Welbl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          , E. Gaussier, G. Bouchard,
          <article-title>Complex embeddings for simple link prediction</article-title>
          ,
          <source>in: Proceedings of the 33rd International Conference on International Conference on Machine Learning - ICML'16</source>
          , JMLR.org,
          <year>2016</year>
          , p.
          <fpage>2071</fpage>
          -
          <lpage>2080</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Trabelsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bilaniuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Serdyuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Subramanian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mehri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rostamzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Pal</surname>
          </string-name>
          , Deep Complex Networks,
          <source>in: Proc. of the International Conference on Learning Representations, ICLR</source>
          <year>2018</year>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Q.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H. H.</given-names>
            <surname>Tsai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          , Complex Transformer:
          <article-title>A Framework for Modeling Complex-Valued Sequence</article-title>
          ,
          <source>in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP</source>
          <year>2020</year>
          ),
          <year>2020</year>
          , pp.
          <fpage>4232</fpage>
          -
          <lpage>4236</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Eilers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <article-title>Building Blocks for a ComplexValued Transformer Architecture</article-title>
          ,
          <source>in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <source>IEEE Signal Processing Society</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lioma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Simonsen</surname>
          </string-name>
          ,
          <article-title>Encoding word order in complex embeddings</article-title>
          ,
          <source>in: Proceedings of the International Conference on Learning Representations</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <article-title>A survey of quantum-cognitively inspired sentiment analysis models</article-title>
          ,
          <source>ACM Comput. Surv</source>
          .
          <volume>56</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Tamburini</surname>
          </string-name>
          ,
          <article-title>A quantum-like approach to word sense disambiguation</article-title>
          , in: R. Mitkov, G. Angelova (Eds.),
          <source>Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP</source>
          <year>2019</year>
          ), INCOMA Ltd.,
          <string-name>
            <surname>Varna</surname>
          </string-name>
          , Bulgaria,
          <year>2019</year>
          , pp.
          <fpage>1176</fpage>
          -
          <lpage>1185</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Corrado,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          , in: C.
          <string-name>
            <surname>Burges</surname>
          </string-name>
          , et al. (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          <volume>26</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          , Inc.,
          <year>2013</year>
          , pp.
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , L. u. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          , in: I. Guyon,
          <string-name>
            <given-names>U. V.</given-names>
            <surname>Luxburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
,
<string-name>
  <given-names>R.</given-names>
  <surname>Garnett</surname>
</string-name>
(Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>30</volume>
,
<publisher-name>Curran Associates, Inc.</publisher-name>
,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
,
<string-name>
  <given-names>W.</given-names>
  <surname>Chen</surname>
</string-name>
,
<article-title>LoRA: Low-rank adaptation of large language models</article-title>
, in:
<source>Proceedings of the International Conference on Learning Representations</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>V.</given-names>
            <surname>Lialin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shivagunde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Muckatira</surname>
          </string-name>
,
<string-name>
  <given-names>A.</given-names>
  <surname>Rumshisky</surname>
</string-name>
,
          <article-title>ReLoRA: High-Rank Training Through Low-Rank Updates</article-title>
, in:
<source>Proceedings of the International Conference on Learning Representations</source>
          , Vienna, Austria,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bioglio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <article-title>UINAUIL: A unified benchmark for Italian natural language understanding</article-title>
, in:
<string-name>
  <given-names>D.</given-names>
  <surname>Bollegala</surname>
</string-name>
,
<string-name>
  <given-names>R.</given-names>
  <surname>Huang</surname>
</string-name>
,
<string-name>
  <given-names>A.</given-names>
  <surname>Ritter</surname>
</string-name>
(Eds.),
<source>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)</source>
,
<publisher-name>Association for Computational Linguistics</publisher-name>
          , Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>348</fpage>
          -
          <lpage>356</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lioma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
,
<article-title>Adapting Pre-trained Language Models for Quantum Natural Language Processing</article-title>
          ,
          <year>2023</year>
. arXiv:2302.13812.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
,
<string-name>
  <given-names>I.</given-names>
  <surname>Gurevych</surname>
</string-name>
          ,
<article-title>Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging</article-title>
, in:
<source>Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
,
<publisher-name>ACL</publisher-name>
          , Copenhagen, Denmark,
          <year>2017</year>
          , pp.
          <fpage>338</fpage>
          -
          <lpage>348</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>