<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Exploring Text-Embedding Retrieval Models for the Italian Language</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yuri Noviello</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Tamburini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FICLIT - University of Bologna</institution>
          ,
          <addr-line>Via Zamboni, 32</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>11</volume>
      <issue>2023</issue>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Text retrieval systems have become essential in the field of natural language processing (NLP), serving as the backbone for applications such as search engines, document indexing, and information retrieval. With the rise of generative AI, particularly Retrieval-Augmented Generation (RAG) systems, the demand for robust text retrieval models has increased. However, existing large language models (LLMs) and datasets are often insufficiently optimized for Italian, limiting their performance in Italian text retrieval tasks. This paper addresses this gap by proposing both a data collection and specialized models tailored for Italian text retrieval. Through extensive experimentation, we analyze the improvements and limitations in retrieval performance, paving the way for more effective Italian NLP applications.</p>
      </abstract>
      <kwd-group>
        <kwd>Italian embedding</kwd>
        <kwd>text embedding</kwd>
        <kwd>retrieval model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, text retrieval systems have emerged as a cornerstone of the natural language processing (NLP) field. These systems are crucial in various applications, including search engines, document indexing, and information retrieval tasks. Their primary function is to fetch relevant pieces of text from large corpora, enabling efficient and accurate information access. This capability is crucial for numerous industries, including the legal, medical, and customer service sectors, where timely and precise information retrieval can significantly impact decision-making processes.</p>
      <p>
        With the advent of generative AI, the importance of text retrieval systems has only amplified. Advanced systems, particularly chatbots based on Retrieval-Augmented Generation (RAG) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], have become essential tools for various purposes. RAG systems combine retrieval mechanisms with generative models to produce contextually relevant and accurate responses in conversational AI applications. This integration has enhanced the capabilities of chatbots, making them more efficient in providing precise information and engaging in meaningful dialogues.
      </p>
      <p>Despite the impressive performance of recent large language models (LLMs) as conversational agents in Italian contexts, there remains a notable gap in the resources and models specifically designed for Italian text retrieval tasks. This shortfall highlights a significant area for improvement and development within the Italian NLP community.</p>
      <p>To address this gap, our work aims to propose both novel datasets and specialized models optimized for Italian text retrieval. By focusing exclusively on the Italian language, we strive to enhance the performance of retrieval tasks.</p>
      <p>The primary contribution of this paper is the introduction of a comprehensive Italian text retrieval system, encompassing both a curated dataset collection and specialized language models. Through extensive experimentation and rigorous evaluation, we demonstrate the effectiveness of our approach, setting the stage for more advanced and reliable Italian text retrieval solutions applicable across diverse tasks.</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Related Works</title>
      <p>
        The development of text embedding models has seen significant advancements over the years, evolving from simple word representations to sophisticated contextual embeddings. Early models like Word2Vec [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and GloVe [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] set the foundation by capturing semantic relationships between words through fixed-size vector representations. These models, however, lacked the ability to understand context, leading to the development of more advanced techniques.
      </p>
      <p>
        Transformers have revolutionized the field of NLP by introducing mechanisms to capture context and relationships across entire sentences. BERT (Bidirectional Encoder Representations from Transformers [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) marked a significant milestone, providing deep contextualized word embeddings by considering both left and right context, and it has served as the basis for various large language models (LLMs), such as GPT-3 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and T5 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which further extend the capabilities of transformers by scaling up model size and training data.
      </p>
      <p>
        Sentence Transformers, an extension of the transformer architecture [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], focus on generating embeddings for whole sentences rather than individual words. Models like SBERT (Sentence-BERT) enhance the performance of sentence-level tasks, such as semantic textual similarity and information retrieval, by fine-tuning BERT specifically for sentence embeddings. This approach has demonstrated significant improvements in capturing the semantic meaning of sentences, but specific training corpora, annotated with sentence similarity scores, must be provided for setting up the system.
      </p>
      <p>In the realm of multilingual models, the multilingual E5 family has emerged as a robust solution for handling multiple languages within a single model architecture [8]. These models are pre-trained on a multilingual corpus, enabling them to perform effectively across different linguistic contexts. The multilingual E5 models leverage the strengths of transformer architectures to provide high-quality embeddings for numerous languages, including less-resourced ones. This makes them particularly valuable for tasks requiring cross-lingual understanding and retrieval.</p>
      <p>The continuous evolution of text embedding models, from standard embeddings to advanced transformer-based approaches, highlights the dynamic nature of NLP research. Each progression addresses the limitations of its predecessors, contributing to more accurate and context-aware representations, which are crucial for a wide array of applications in natural language understanding and information retrieval.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Data</title>
      <p>The quality and abundance of the data are among the main aspects of obtaining high-quality text embedding models. The data used in this work for training the models were adapted from the following datasets: MIRACL [9], SQuAD-it [10], MLDR [11] and WikipediaQA-ita [12]. Among these, only the Multilingual Long-Document Retrieval (MLDR) dataset was used as-is, as it already contains 2,151 examples of Italian triplets in the form of query-positive passage-negative passage. The following sections detail the processing of the other datasets.</p>
      <sec id="sec-2-1">
        <title>3.1. MIRACL-it</title>
        <p>The Multilingual Information Retrieval Across a Continuum of Languages (MIRACL) dataset is widely used for building multilingual information retrieval models, such as the multilingual E5 models family [8]. Although the dataset encompasses 18 different languages, it does not include any Italian data. Given the dataset's high quality, particularly in defining hard negatives through manual annotation, we decided to translate the dataset into Italian using automated methods. In particular, we focused on the English section of the dataset, which is organized as shown in Table 1.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption><p>English data organization of MIRACL</p></caption>
          <table>
            <thead>
              <tr><th>Split</th><th>Query</th><th>Passage</th></tr>
            </thead>
            <tbody>
              <tr><td>train</td><td>2,863</td><td>29,416</td></tr>
              <tr><td>dev</td><td>799</td><td>8,350</td></tr>
              <tr><td>corpus</td><td>-</td><td>32,893,221</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The translation process aimed to preserve these qualities while adapting the content to Italian, thereby creating a robust resource for training and evaluating Italian text retrieval models. To translate the dataset, we experimented with two different approaches: a large language model (LLM) translation via the PaLM 2 API [13] and an open-source offline translation via Argos Translate [14]. The translation quality was evaluated to ensure that the Italian version maintained the dataset's integrity and usefulness for training effective retrieval models.</p>
        <sec id="sec-2-1-1">
          <title>3.1.1. Datasets translation using PaLM 2</title>
          <p>We performed the translation of the whole training and development English sets of MIRACL using the PaLM 2 API [13]. Due to budget constraints, we did not translate the entire corpus, as it would have required approximately €10,000, given the huge number of documents. We used the following prompt to obtain the Italian translation:
Translate the following text in Italian.
Write the translation only:
{text}</p>
          <p>We used the same prompt for both queries and documents. For documents, we used the model text-bison-32k@002, and for queries, we relied on text-bison@002. This resulted in a total of 37,351 API calls, as some documents are associated with multiple queries.</p>
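          <p>As an illustration, the sketch below shows how such a call can be issued through the Vertex AI SDK; the project id is a placeholder, and this is a minimal reconstruction under those assumptions rather than the exact script used for the API calls.</p>
          <preformat>
# Sketch: translating MIRACL text with the PaLM 2 API via the Vertex AI SDK.
# Assumes a Google Cloud project with Vertex AI enabled; model names follow Sec. 3.1.1.
import vertexai
from vertexai.language_models import TextGenerationModel

vertexai.init(project="my-gcp-project", location="us-central1")  # placeholder project id

PROMPT = "Translate the following text in Italian.\nWrite the translation only:\n{text}"

def translate(text: str, model_name: str) -> str:
    model = TextGenerationModel.from_pretrained(model_name)
    response = model.predict(PROMPT.format(text=text), temperature=0.0)
    return response.text

# Queries use the smaller model, documents the 32k-context variant.
italian_query = translate("What is the capital of Italy?", "text-bison@002")
italian_doc = translate("Rome is the capital city of Italy...", "text-bison-32k@002")
          </preformat>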
        </sec>
        <sec id="sec-2-1-2">
          <title>3.1.2. Open-source offline translation using Argos Translate</title>
          <p>Argos Translate is an open-source library that uses OpenNMT for translation and supports multiple language model packages [14]. We utilized the English-to-Italian model to translate the training and development sets of MIRACL, including the entire corpus.</p>
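          <p>A minimal sketch of this offline pipeline, following the documented Argos Translate API, is shown below; the package installation step is only required once.</p>
          <preformat>
# Sketch: offline English-to-Italian translation with Argos Translate.
import argostranslate.package
import argostranslate.translate

# Download and install the en->it package once (requires network on first run).
argostranslate.package.update_package_index()
available = argostranslate.package.get_available_packages()
pkg = next(p for p in available if p.from_code == "en" and p.to_code == "it")
argostranslate.package.install_from_path(pkg.download())

italian = argostranslate.translate.translate("Rome is the capital of Italy.", "en", "it")
print(italian)
          </preformat>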
        </sec>
        <sec id="sec-2-1-3">
          <title>3.1.3. Translations quality evaluation</title>
          <p>The translation performed by PaLM 2, as reported in the Technical Report [13] and confirmed by our empirical tests, is considered high-quality. To measure the quality of the translation performed by Argos Translate, we used the SOTA automatic metric BLEURT [15], with the PaLM 2 translations as references. Since we do not have the entire corpus translated by the LLM, we conducted the evaluation only on the overlapping portion of the translated datasets, resulting in a corpus of 33,689 documents. The average BLEURT score of 0.625 indicates that Argos Translate produced a decent translation, validating its use as a cost-effective alternative for text embedding model fine-tuning and evaluation.</p>
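          <p>The scoring step can be reproduced along the lines of the following sketch, which assumes the reference BLEURT implementation and a downloaded checkpoint (BLEURT-20 here, as an assumption):</p>
          <preformat>
# Sketch: scoring Argos Translate output against PaLM 2 references with BLEURT.
# Assumes the google-research/bleurt package is installed and a checkpoint downloaded.
from bleurt import score

scorer = score.BleurtScorer("BLEURT-20")  # path to the downloaded checkpoint

references = ["Roma è la capitale d'Italia."]     # PaLM 2 translations
candidates = ["Roma è la capitale dell'Italia."]  # Argos Translate translations

scores = scorer.score(references=references, candidates=candidates)
avg = sum(scores) / len(scores)  # the paper reports an average of 0.625
print(f"average BLEURT: {avg:.3f}")
          </preformat>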
        </sec>
      </sec>
      <sec id="sec-2-1a">
        <title>3.2. SQuAD-it</title>
        <p>SQuAD-it is obtained through semi-automatic translation of the SQuAD dataset into Italian; it contains more than 60,000 question-answer pairs. For these experiments, we considered only the question and context attributes of each example. Then, since we need triplets in the form of query-positive passage-negative passage, we performed hard negative mining. We used the standard BM25 algorithm [16] to extract the top-10 similar documents for each query, excluding positive passages for the given query. This process ensured that the dataset was suitably challenging for training robust retrieval models.</p>
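        <p>The following sketch illustrates this mining strategy with the rank_bm25 package (an implementation choice for illustration only; the paper specifies the BM25 algorithm [16] but not a particular library):</p>
        <preformat>
# Sketch: BM25 hard-negative mining, here with the rank_bm25 package.
from rank_bm25 import BM25Okapi

corpus = ["passage one ...", "passage two ...", "passage three ..."]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

def mine_hard_negatives(query: str, positive_ids: set, k: int = 10) -> list:
    """Return the ids of the top-k BM25 passages, excluding known positives."""
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if i not in positive_ids][:k]

negatives = mine_hard_negatives("which passage ...?", positive_ids={0})
        </preformat>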
      </sec>
      <sec id="sec-2-2">
        <title>3.3. WikipediaQA-ita</title>
        <p>WikipediaQA-ita is a dataset synthetically generated using a custom model from ReDiX Informatica; it has been created in Italian and specifically designed for RAG fine-tuning. It contains more than 100,000 question-answer pairs. Similar to SQuAD-it, we considered only the question and context attributes for each example and applied the same hard negative mining strategy using the BM25 algorithm.</p>
      </sec>
    </sec>
    <sec id="sec-2b">
      <title>4. Methodology</title>
      <sec id="sec-2b-1">
        <title>4.1. Contrastive learning on labeled data</title>
        <p>This work implements a dual-encoder model that uses a combination of supervised loss functions to achieve effective learning. The dual-encoder model encodes queries and passages separately to produce their respective embeddings:
q = Encoder_query(query)   (1)
p = Encoder_passage(passage)   (2)</p>
        <p>The similarity score between a query q and a passage p is computed as the dot product of their embeddings:
s(q, p) = q · p   (3)</p>
        <p>The embeddings are normalized before computing the dot product:
q̂ = q / ‖q‖ and p̂ = p / ‖p‖   (4)
so that the similarity score becomes the cosine similarity:
s(q, p) = q̂ · p̂   (5)</p>
        <p>For a batch of queries and passages, the contrastive loss encourages higher similarity scores for matching query-passage pairs and lower scores for non-matching pairs. The loss function is defined as:
L_cont = -(1/N) Σ_{i=1}^{N} log [ exp(s(q_i, p_i)/τ) / Σ_{j=1}^{N} exp(s(q_i, p_j)/τ) ]   (6)
where N is the batch size, τ is the temperature parameter, and s(q_i, p_i) represents the similarity score for the matching query-passage pair.</p>
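        <p>In PyTorch terms, Eq. (6) with in-batch negatives reduces to a cross-entropy over the scaled similarity matrix, as in the following minimal sketch (names and the temperature value are illustrative):</p>
        <preformat>
# Sketch: in-batch contrastive loss of Eq. (6) in PyTorch.
import torch
import torch.nn.functional as F

def contrastive_loss(q: torch.Tensor, p: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """q, p: (N, d) query/passage embeddings; row i of p is the positive for row i of q."""
    q = F.normalize(q, dim=-1)            # Eq. (4): unit-norm embeddings
    p = F.normalize(p, dim=-1)
    sim = q @ p.T / tau                   # Eq. (5): cosine similarities, scaled by temperature
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(sim, targets)  # -log softmax of the matching pair, batch-averaged

loss = contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
        </preformat>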
      </sec>
      <sec id="sec-2b-2">
        <title>4.2. Fine-tuning procedure</title>
        <p>We performed our experiments using the following base models:
1. Minerva-1B [17],
2. Qwen2-1.5B [18],
3. Gemma-2B [19].
We relied on the foundational versions of these models.</p>
        <p>To speed up the computation, we implemented a LoRA fine-tuning procedure. As a pooling strategy, we used EOS (End-Of-Sequence) pooling and normalized the embeddings. While we did not apply any prefix to passages, we added the following prefix to queries:
Given a search query, retrieve relevant
passages that answer the query.\nQuery:</p>
        <p>We also experimented with using an Italian text prefix but found no significant difference in performance. Therefore, we opted for an English prefix to maintain consistency with other open-source models.</p>
        <p>The fine-tuning process was executed on a weighted mixture of the datasets reported in Table 2. During this phase, the tokenization of the dataset documents was truncated at 512 tokens. We trained the model in mixed precision for 3 epochs, using a learning rate of 10^-5.</p>
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption><p>Fine-tuning datasets organization</p></caption>
          <table>
            <thead>
              <tr><th>Source</th><th>Sample</th></tr>
            </thead>
            <tbody>
              <tr><td>MIRACL-it</td><td>100%</td></tr>
              <tr><td>MLDR-it</td><td>100%</td></tr>
              <tr><td>SQuAD-it</td><td>20%</td></tr>
              <tr><td>WikipediaQA-ita</td><td>10%</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>For each model, we conducted two fine-tuning experiments: one using the dataset with MIRACL data translated with PaLM 2 and another using the dataset translated with Argos Translate.</p>
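        <p>The following sketch illustrates this setup with the transformers and peft libraries; the LoRA hyper-parameters and target modules are assumptions for illustration, not the values used in our experiments:</p>
        <preformat>
# Sketch: LoRA adapter plus EOS pooling for a decoder-only base model.
# Illustrative configuration only; hyper-parameters are assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model

name = "Qwen/Qwen2-1.5B"  # one of the three base models
tokenizer = AutoTokenizer.from_pretrained(name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = get_peft_model(
    AutoModel.from_pretrained(name),
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]),
)

PREFIX = "Given a search query, retrieve relevant passages that answer the query.\nQuery: "

def embed(texts, is_query=False):
    if is_query:
        texts = [PREFIX + t for t in texts]           # prefix queries only, not passages
    texts = [t + tokenizer.eos_token for t in texts]  # ensure a final EOS token to pool
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    hidden = model(**batch).last_hidden_state         # (batch, seq_len, dim)
    last = batch["attention_mask"].sum(dim=1) - 1     # last real token (right padding)
    eos = hidden[torch.arange(hidden.size(0)), last]  # EOS pooling
    return F.normalize(eos, dim=-1)                   # normalized embeddings
        </preformat>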
      </sec>
      <sec id="sec-2b-3">
        <title>4.3. Evaluation procedure</title>
        <p>For the evaluation, we considered only the datasets for which we already had the relevance judgments (Qrels) in the TREC standard format [20], namely MIRACL-it and MLDR-it. This setup allows for a comprehensive evaluation of retrieval systems for the Italian language, encompassing both small/medium and large documents.</p>
        <p>As with the training procedure, we evaluated each model using both the dataset with MIRACL data translated with PaLM 2 and the dataset translated with Argos Translate. To ensure consistency, we conducted evaluations only on the overlapping portions of the datasets between the two translations.</p>
        <p>After creating the embeddings for both the test queries and documents, we used FAISS [21] to retrieve relevant documents. Finally, we employed the original implementation of TREC-eval for metrics computation. We evaluated the models using the following metrics:
1. MRR@10 (Mean Reciprocal Rank): measures the average of the reciprocal ranks of the first relevant document retrieved.
2. Recall@100: measures the proportion of relevant documents retrieved among the top 100 results.
3. nDCG@10 (Normalized Discounted Cumulative Gain): measures the ranking quality by comparing the order of results to the ideal ranking, emphasizing higher ranks.</p>
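        <p>The retrieval step, together with the first of these metrics, can be sketched as follows (exact inner-product search over normalized embeddings; random vectors stand in for real embeddings):</p>
        <preformat>
# Sketch: retrieval with FAISS [21] and MRR@10 over normalized embeddings.
import faiss
import numpy as np

d = 768
doc_emb = np.random.rand(1000, d).astype("float32")  # stand-ins for real embeddings
query_emb = np.random.rand(10, d).astype("float32")
faiss.normalize_L2(doc_emb)                          # inner product == cosine similarity
faiss.normalize_L2(query_emb)

index = faiss.IndexFlatIP(d)                         # exact inner-product search
index.add(doc_emb)
_, ranked_ids = index.search(query_emb, 100)         # top-100 for Recall@100

def mrr_at_10(ranked_ids, relevant):
    """relevant[i]: set of relevant doc ids for query i (the Qrels)."""
    total = 0.0
    for i, ids in enumerate(ranked_ids):
        for rank, doc_id in enumerate(ids[:10], start=1):
            if doc_id in relevant[i]:
                total += 1.0 / rank
                break
    return total / len(ranked_ids)
        </preformat>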
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Discussion and Analysis</title>
      <p>We propose a comparison of the performance of different models on our Italian benchmark. For this analysis, we considered the Multilingual Sentence Transformers models [22] and the multilingual versions of the E5 models family. The scores are reported in Table 3.</p>
      <sec id="sec-3-1">
        <title>5.1. Argos vs PaLM</title>
        <p>By observing the performance on the MIRACL sets translated with PaLM 2 and Argos Translate, we found that every model achieved better results on the dataset translated with the PaLM 2 API. This behavior can be attributed to the higher translation quality provided by PaLM 2, which likely offers clearer sentence structures for the models to process.</p>
        <p>However, since the difference in the results is very marginal, we can state that the machine translation provided by Argos Translate is a valid and cost-effective alternative for text embedding modeling.</p>
        <p>On the contrary, we did not find any significant correlation between the models trained with the different translation versions, given their small difference in scores, except for the MLDR-it evaluation of gemma-2B-Argos, which will be discussed later. This indicates that while translation quality can impact performance, the overall difference may not be substantial enough to render one method vastly superior to the other in practical applications for this specific task.</p>
      </sec>
      <sec id="sec-3-2">
        <title>5.2. Multilingual Sentence Transformers</title>
        <p>Generally, the performance of the Multilingual Sentence Transformers is similar when evaluated on the MIRACL-it sets. However, there is a notably significant performance gap on the MLDR-it dataset. We attribute the very poor performance of the paraphrase-multi-MiniLM-L12-v2 model to its small maximum input token length of 128 tokens, which is unsuitable for datasets containing long documents. As expected, both our proposed models and the E5 models outperform all the Multilingual Sentence Transformers across all metrics on every dataset.</p>
      </sec>
      <sec id="sec-3-3">
        <title>5.3. Multilingual E5 Models</title>
        <p>The Multilingual E5 models achieved very high scores in the evaluation of both datasets. In particular, the multilingual-E5-large model achieved the best MRR@10, Recall@100, and nDCG@10 scores on both translations of the MIRACL dataset. As expected, multilingual-E5-large outperformed the base version, although the performance gap narrows with longer documents (MLDR-it).</p>
      </sec>
      <sec id="sec-3-4">
        <title>5.4. Proposed Models</title>
        <p>By observing the scores obtained by our proposed models, it appears that the models based on Minerva-1B achieved lower scores compared to the others, suggesting that it may not be the most suitable foundation model for this type of task.</p>
        <p>The results obtained by the Gemma-2B and Qwen2-1.5B based models are very similar, except for the low MRR@10 and nDCG@10 scores obtained by gemma-2B-Argos on the MLDR-it dataset, which could indicate worse training stability caused by data translated with Argos Translate. However, the model achieved the best Recall@100 score on the same dataset, suggesting that this behavior may be caused by random noise during fine-tuning.</p>
        <p>Finally, our proposed models achieved both the first and second best scores for each metric associated with the MLDR-it test set, demonstrating their effectiveness in handling long document retrieval tasks.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Conclusions</title>
      <p>This work presents a comprehensive study on models and datasets focused on Information Retrieval (IR) for Italian documents. The primary contribution of this paper lies in illustrating a strategy for fine-tuning Large Language Models (LLMs) to achieve effective semantic representations of Italian texts. Additionally, we provide original models and datasets that serve as a starting point to bridge the performance gap between models designed for Italian and those optimized for other languages.</p>
      <p>Our results demonstrate that the proposed models achieve performance comparable with state-of-the-art models for medium-sized documents and even surpass them when dealing with datasets containing very long documents. This suggests that our tailored approach to Italian text retrieval is not only viable but also highly effective.</p>
      <sec id="sec-4-1">
        <title>6.1. Limitations and Future works</title>
        <p>One of the main limitations of this study is the limited availability of hardware resources. Our fine-tuning process involved a significantly smaller number of dataset examples, well below 50,000, compared to the multilingual E5 models, which were pre-trained on over 2 billion text pairs and fine-tuned on more than 1 million.</p>
        <p>Additionally, we were unable to evaluate the proposed models on the complete MIRACL corpus, as it would have required more than 100 hours of computation per model. This restriction has highlighted a key area for potential improvement in our research. Future work could benefit significantly from experiments involving larger quantities of Italian data and the application of more advanced model architectures.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>7. Online Resources</title>
      <p>The fine-tuned adapters and the datasets have been made available online (Models: https://huggingface.co/collections/yuri-no/italian-retrieval-llm-adapters-667ab367ce13150b7c774078; Datasets: https://huggingface.co/collections/yuri-no/italian-retrieval-datasets-667acdccf922286634ef603b).</p>
    </sec>
    <sec id="sec-6">
      <title>8. Implementation Details</title>
      <p>All the experiments were executed on a Compute Engine Virtual Machine with 2 NVIDIA L4 GPUs.</p>
      <sec id="sec-6-1">
        <title>8.1. Translation</title>
        <p>While the offline translation relies on the model proposed by Argos Translate, to speed up computation we directly utilized the API of CTranslate2 [23].</p>
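        <p>A minimal sketch of this setup is shown below; the model directory and SentencePiece tokenizer paths are placeholders for an OpenNMT/Argos model converted to the CTranslate2 format:</p>
        <preformat>
# Sketch: batched offline translation through CTranslate2 [23].
# Paths below are placeholders for a converted OpenNMT/Argos model.
import ctranslate2
import sentencepiece as spm

translator = ctranslate2.Translator("en_it_ct2_model/", device="cuda")
sp = spm.SentencePieceProcessor(model_file="en_it_ct2_model/sentencepiece.model")

sentences = ["Rome is the capital of Italy."]
tokens = [sp.encode(s, out_type=str) for s in sentences]
results = translator.translate_batch(tokens)
translations = [sp.decode(r.hypotheses[0]) for r in results]
        </preformat>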
      </sec>
      <sec id="sec-6-2">
        <title>8.2. Fine-tuning</title>
        <p>The fine-tuning experiments were conducted using an adaptation of the code from the Tevatron Toolkit [24]. The primary modifications included excluding the "title" attribute from document encoding, to simulate a realistic scenario, and filtering out queries not associated with negative passages.</p>
      </sec>
      <sec id="sec-6-3">
        <title>8.3. Evaluation</title>
          <p>Similar to the fine-tuning process, the evaluation was
conducted without considering the "title" attribute for
documents. Each model was evaluated according to the
instructions provided by the authors. For creating
embeddings with the Multilingual Sentence Transformers,
we relied on the sentence-transformers
implementation. For all other models, we used the transformers
library [25].</p>
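          <p>For the Sentence Transformers baselines, embedding creation reduces to a few lines; the hub model name below is our assumption for the checkpoint referred to as paraphrase-multi-MiniLM-L12-v2 in Section 5.2:</p>
          <preformat>
# Sketch: creating normalized embeddings with the sentence-transformers implementation.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode(
    ["Qual è la capitale d'Italia?", "Roma è la capitale d'Italia."],
    normalize_embeddings=True,  # cosine similarity via dot product downstream
)
          </preformat>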
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>We would like to thank Dinova Srl for funding this research and providing access to the Google Cloud Virtual Machines used in this project. Their support has been essential for this work.</p>
    </sec>
    <sec id="sec-8">
      <title>Credit author statement</title>
      <p>YN: Conceptualization, Investigation, Software, Formal Analysis. FT: Methodology, Supervision, Writing - Review &amp; Editing.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          , W.-t. Yih,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <article-title>Retrieval-augmented generation for knowledge-intensive NLP tasks</article-title>
          ,
          <source>in: Proceedings of the 34th International Conference on Neural Information Processing Systems</source>
          , NIPS '20, Curran Associates Inc.,
          Red Hook, NY, USA,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Corrado,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Efficient estimation of word representations in vector space</article-title>
          ,
          <source>Proceedings of Workshop at ICLR</source>
          <year>2013</year>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          , C. Manning, GloVe:
          <article-title>Global vectors for word representation</article-title>
          , in: A.
          <string-name>
            <surname>Moschitti</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Pang</surname>
          </string-name>
          , W. Daelemans (Eds.),
          <source>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Doha, Qatar,
          <year>2014</year>
          , pp.
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          . URL: https://aclanthology.org/D14-1162. doi:10.3115/v1/D14-1162.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          , in: J.
          <string-name>
            <surname>Burstein</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Doran</surname>
          </string-name>
          , T. Solorio (Eds.),
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <source>Association for Computational Linguistics</source>
          , Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Herbert-Voss</surname>
          </string-name>
          , G. Krueger,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henighan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Winter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hesse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          , E. Sigler,
          <string-name>
            <given-names>M.</given-names>
            <surname>Litwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Berner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McCandlish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          , in: H.
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Hadsell</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Balcan</surname>
          </string-name>
          , H. Lin (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>33</volume>
          , Curran Associates, Inc.,
          <year>2020</year>
          , pp.
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>67</lpage>
          . URL: http://jmlr.org/papers/v21/20-074.html.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          , Sentence-BERT:
          <article-title>Sentence embeddings using Siamese BERT-networks</article-title>
          , in: K. Inui,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <surname>X.</surname>
          </string-name>
          Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3982-3992.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P.-E. Mazaré, M. Lomeli, L. Hosseini, H. Jégou, The Faiss library (2024). arXiv:2401.08281.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] Y. Yang, D. Cer, A. Ahmad, M. Guo, J. Law, N. Constant, G. Hernandez Abrego, S. Yuan, C. Tar, Y.-h. Sung, B. Strope, R. Kurzweil, Multilingual universal sentence encoder for semantic retrieval, in: A. Celikyilmaz, T.-H. Wen (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 87-94. URL: https://aclanthology.org/2020.acl-demos.12. doi:10.18653/v1/2020.acl-demos.12.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] OpenNMT, CTranslate2, https://github.com/OpenNMT/CTranslate2, 2019.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] L. Gao, X. Ma, J. Lin, J. Callan, Tevatron: An efficient and flexible toolkit for neural retrieval, in: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '23, Association for Computing Machinery, New York, NY, USA, 2023, pp. 3120-3124. URL: https://doi.org/10.1145/3539618.3591805. doi:10.1145/3539618.3591805.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural language processing, Association for Computational Linguistics, 2020, pp. 38-45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>