<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Extending Monolingual Asymmetric Semantic Search Models for Multilingual Query Processing using Knowledge Distillation</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Iryna</forename><surname>Yurchuk</surname></persName>
							<email>a.yurchuk@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Taras Shevchenko National University of Kyiv</orgName>
								<address>
									<addrLine>Volodymyrska Street, 60</addrLine>
									<postCode>01033</postCode>
									<settlement>Kyiv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Danylo</forename><surname>Boiko</surname></persName>
							<email>danylo.boiko@knu.ua</email>
							<affiliation key="aff0">
								<orgName type="institution">Taras Shevchenko National University of Kyiv</orgName>
								<address>
									<addrLine>Volodymyrska Street, 60</addrLine>
									<postCode>01033</postCode>
									<settlement>Kyiv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Extending Monolingual Asymmetric Semantic Search Models for Multilingual Query Processing using Knowledge Distillation</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">509FF72F1CBA4609D4C609806384E9A5</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:49+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>asymmetric semantic search</term>
					<term>multilingual embedding</term>
					<term>knowledge distillation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Semantic search is a key task in today's world, where the amount of data is growing rapidly. This work focuses on cases when long answers must be found for a short query (known as asymmetric search). The teacher model with a vocabulary of 30,522 tokens and the student model with a vocabulary of 119,547 tokens serve as the basis for training a multilingual asymmetric semantic search model using multilingual knowledge distillation. The authors used the reciprocal rank (RR), the mean average precision (MAP), and the normalized discounted cumulative gain (NDCG) to evaluate the obtained model.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Semantic search is a set of search algorithms that work based on understanding the meaning of text. This approach effectively handles synonyms, abbreviations, and spelling errors, unlike keyword search engines that rely on exact lexical matches to find documents. It is useful for grading and assessing academic work <ref type="bibr" target="#b0">[1]</ref>, integrating search functionalities into e-commerce platforms <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3]</ref>, information retrieval in the petrochemical sector using a fusion of video transcript data with other data sources <ref type="bibr" target="#b3">[4]</ref>, helping Human Resources employees to target relevant people for their events and trainings <ref type="bibr" target="#b4">[5]</ref>, performing social search <ref type="bibr" target="#b5">[6]</ref>, IoT systems <ref type="bibr" target="#b6">[7]</ref>, and many other fields <ref type="bibr" target="#b7">[8]</ref>.</p><p>Ontologies and knowledge graphs are classical approaches to the design and implementation of semantic search systems, but recently these approaches have been enriched or replaced by statistical algorithms <ref type="bibr" target="#b8">[9]</ref> and AI techniques <ref type="bibr" target="#b9">[10]</ref> for query expansion, generating query-specific training data, semi-supervised learning, etc.</p><p>This work aims to study the possibility of using the bi-encoder architecture to implement asymmetric semantic search when storage contains indexed data in English and it is necessary to search for information in Ukrainian.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background and Algorithms</head><p>The intention behind semantic search is to transform passages into a multidimensional vector space. During the search, the query is similarly embedded in this vector space, which makes it possible to identify the desired number of most relevant matches. This approach ensures that even if the wording is different, the meaning remains the same and the system can provide accurate retrievals (Figure <ref type="figure" target="#fig_0">1</ref>). There are two main types of semantic search that work with different types of data. For symmetric search, queries and passages in the corpus have approximately the same length and content. In turn, asymmetric search typically uses short queries (e.g., a question or a few keywords) and longer passages answering those queries.</p><p>Nowadays, many trained asymmetric semantic search models based on both bi-encoder and cross-encoder architectures <ref type="bibr" target="#b10">[11]</ref> are available for English. Bi-encoder models are efficient for large-scale retrieval due to their ability to use precomputed passage embeddings from storage, making them ideal for speed-critical tasks, while cross-encoder models are known for higher accuracy by directly capturing similarity between query and passages, suited for accuracy-sensitive scenarios like reranking where computational cost is less of a concern.</p><p>To use asymmetric semantic search for less common languages or even a mix of them, we need to train new models, which requires a lot of data and computational power. Fortunately, there is a way to facilitate this using multilingual knowledge distillation <ref type="bibr" target="#b11">[12]</ref> (Figure <ref type="figure" target="#fig_1">2</ref>). This approach requires a teacher model for the source language and a set of pairs (each pair includes a sentence in the source language and its translation). 
A new student model attempts to approximate the output of the teacher model for both source and target sentences using the mean squared error (MSE) loss. The student model could have the structure and weights of the teacher model, or it could be a different network architecture, since the student model learns the representations of the teacher model. This allows the student model to achieve robust generalization across languages.</p><p>It is critical that identical sentences in different languages have similar vector representations. That is why the vector space properties of the source language, obtained from the teacher model, must be applied to other languages.</p><p>From the information above, it is evident that the input parameters of semantic search models are queries and passages. In the field of multilingual search, there are three main paradigms:</p><p>• Multilingual-to-monolingual: this approach accepts queries in multiple languages and compares them with passages in a single language. • Monolingual-to-multilingual: conversely, the idea behind this paradigm is to accept queries in a single language and return passages in multiple languages. • Multilingual-to-multilingual: provides the highest search adaptability and allows users to formulate queries and peruse passages across multiple languages. However, this flexibility may reduce the average accuracy due to the large number of possible language pairs. This paper explores the scenario when storage contains indexed data in English and it becomes necessary to search for information in other languages, particularly Ukrainian. 
We will take the original model (data in the index must be produced by this model) trained for asymmetric semantic search, using the bi-encoder architecture, and add capabilities to handle multilingual queries (in our case, bilingual, i.e., English and Ukrainian).</p><p>Using knowledge distillation will significantly reduce the training time since we will mimic the teacher model on translated queries. To avoid unnecessary training on passages and to ensure consistency between previously indexed and new documents, we will continue to use the teacher model to create embeddings for new passages. The output of both models can be evaluated using a similarity metric to find the most relevant passages (Figure <ref type="figure" target="#fig_2">3</ref>). To determine how similar two vectors are, we can use different metrics (cosine similarity, Euclidean distance, and dot product). Measuring Euclidean distance for high-dimensional vectors becomes impractical, as they will be very far apart simply because of the vastness of the space they inhabit. Instead, we can use cosine similarity, which measures the angle between two vectors, considering their direction and ignoring their magnitude, or the dot product, which measures overall congruence, taking magnitude into account.</p><p>In practice, we can generate sentence embeddings using specially designed modifications of the Bidirectional Encoder Representations from Transformers (BERT) model, known as Sentence Transformers (SBERT). It provides two types of state-of-the-art asymmetric semantic search models, one tuned for cosine similarity, the other for dot product. Cosine similarity tuned models prefer to retrieve shorter passages, while dot product tuned models prefer to retrieve longer passages.</p></div>
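The multilingual knowledge distillation objective described above can be sketched in a few lines: the student is trained so that its embeddings of both an English sentence and its Ukrainian translation match the teacher's embedding of the English sentence under the MSE loss. The sketch below is a minimal illustration using numpy, with toy linear "encoders" standing in for the real transformer models; all names, shapes, and data are invented for the example, not the authors' implementation.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two batches of embeddings."""
    return float(np.mean((a - b) ** 2))

def distillation_loss(teacher_embed, student_embed, en_batch, uk_batch):
    """Multilingual distillation objective: the student must reproduce the
    teacher's embedding of the English sentence for BOTH the English source
    and its Ukrainian translation."""
    target = teacher_embed(en_batch)  # the teacher only ever sees English
    return mse(student_embed(en_batch), target) + mse(student_embed(uk_batch), target)

# Toy linear "encoders" standing in for the transformer models.
rng = np.random.default_rng(0)
W_teacher = rng.normal(size=(8, 4))
W_student = W_teacher.copy()  # a perfectly distilled student

def teacher_embed(batch):
    return batch @ W_teacher

def student_embed(batch):
    return batch @ W_student

en = rng.normal(size=(3, 8))  # 3 "English" sentence representations
uk = en.copy()                # toy assumption: translations map to the same inputs

print(distillation_loss(teacher_embed, student_embed, en, uk))  # → 0.0
```

In the real setup, the loss would be minimized by gradient descent over the student's transformer weights while the teacher stays frozen.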
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Datasets</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">MS MARCO</head><p>Microsoft Machine Reading Comprehension (MS MARCO) <ref type="bibr" target="#b12">[13]</ref> is a large-scale dataset created by Microsoft for training and evaluating information retrieval systems. It is widely used to benchmark the performance of various models on tasks such as reading comprehension, question answering, and passage ranking.</p><p>The dataset comprises over a million anonymous questions offering a collection of short, real-world, natural language queries split into the train (808,731 queries), development (101,093 queries), and evaluation (101,092 queries) subsets. We will use the train and development subsets to train and evaluate the student model respectively, so it makes sense to look at their length and word patterns (Figure <ref type="figure" target="#fig_3">4</ref>). Furthermore, the dataset contains 8,841,823 passages that are required to provide natural language answers (a question may have multiple answers or no answers at all). Although passages are not involved in the knowledge distillation for multilingual-to-monolingual models, it still seems reasonable to analyze them to figure out what the teacher model was trained on (Figure <ref type="figure" target="#fig_4">5</ref>). The length distribution of the combination of queries from the train and development subsets is markedly skewed, with the majority being quite brief, typically under 100 characters and predominantly concentrated between 15-25 characters. The frequency drops significantly for longer queries, with very few extending beyond 100 characters. The most common bigrams (two-word sequences) indicate that the majority of queries are fact-oriented. These patterns reveal that a significant number of queries are short factual questions. The length distribution of the passages is right-skewed with a noticeable peak around 200-300 characters. 
The character type distribution shows that letters are mostly used, with a median of about 250 characters per passage and a wide interquartile range reflecting considerable variation. Digits and punctuation marks are used less frequently, but there are outliers indicating some passages with a large number of such characters.</p><p>The scale and real-world nature of the dataset make it attractive for training and evaluating machine learning models, but the original MS MARCO contains only English search queries. To train and evaluate models for other languages, the dataset must be translated using one of the available approaches. For Ukrainian, the OPUS-MT English to East Slavic neural machine translation model <ref type="bibr" target="#b13">[14]</ref> demonstrated high-quality results in a relatively short time.</p></div>
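The bigram statistics used above to characterize the queries can be reproduced with a short script. A minimal sketch using only Python's standard library; the example queries are invented stand-ins, not actual MS MARCO data:

```python
from collections import Counter

def bigram_counts(queries):
    """Count two-word sequences (bigrams) across a list of queries."""
    counts = Counter()
    for query in queries:
        words = query.lower().split()
        counts.update(zip(words, words[1:]))  # consecutive word pairs
    return counts

# Invented MS MARCO-style queries for illustration.
queries = [
    "what is semantic search",
    "what is knowledge distillation",
    "how long is a marathon",
]
print(bigram_counts(queries).most_common(1))  # → [(('what', 'is'), 2)]
```

Applied to the full train and development subsets, `most_common` surfaces the fact-oriented question patterns visible in Figure 4.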
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">TREC 19</head><p>In 2019, the National Institute of Standards and Technology (NIST), in collaboration with Microsoft, organized the Text Retrieval Conference (TREC) deep learning track <ref type="bibr" target="#b14">[15]</ref> benchmark competition. This event aimed to foster research on information retrieval using deep learning techniques. As the official evaluation dataset, organizers provided a list of 200 queries and a pool of documents and passages, labeled by NIST assessors using multi-graded judgments.</p><p>Although the deliberate selection of the queries at an intuitive level implies high-quality data and a lack of outliers, it still seems reasonable to directly check the length and word patterns (Figure <ref type="figure" target="#fig_5">6</ref>). The length distribution peaks at 20-25 characters and is right-skewed, indicating that most queries are concise, with few queries extending beyond 60 characters. The most frequent bigrams begin with common phrases such as "what," "definition," and "how," suggesting that queries are structured as questions, often seeking definitions or factual information.</p><p>Looking at the queries from both datasets, it is clear that TREC 19 queries are generally shorter and more focused on concrete evidence, while MS MARCO queries might be significantly longer and more complex. It is also important to note that TREC 19 does not have outliers, unlike MS MARCO, where some queries can exceed 400 characters.</p><p>These days, TREC 19 has become a recognized benchmark for various information retrieval and deep learning tasks such as document and passage retrieval. 
In the task of full search, we can evaluate up to 1,000 passages for each query based on their estimated likelihood of containing the answer.</p><p>To evaluate the trained model, all 200 queries were manually translated into Ukrainian by a native speaker, which guarantees the veracity of the results.</p></div>
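Scoring up to 1,000 candidate passages per query with a bi-encoder reduces to ranking precomputed passage embeddings by their similarity to the query embedding. A minimal numpy sketch assuming cosine similarity and toy 2-dimensional embeddings (the real embeddings are 768-dimensional):

```python
import numpy as np

def top_k(query_vec, passage_matrix, k=1000):
    """Return indices and cosine scores of the k passages most similar to
    the query, best first."""
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_matrix / np.linalg.norm(passage_matrix, axis=1, keepdims=True)
    scores = p @ q                   # cosine similarity for every passage
    order = np.argsort(-scores)[:k]  # highest scores first
    return order, scores[order]

passages = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy embeddings
query = np.array([1.0, 0.1])
indices, scores = top_k(query, passages, k=2)
print(indices.tolist())  # → [0, 2]
```

Because passage embeddings are precomputed by the teacher model, only the query needs to be encoded at search time.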
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Training and Evaluation</head><p>We will use the DistilBERT base multilingual (cased) <ref type="bibr" target="#b15">[16]</ref> as a student model. This model was trained on the concatenation of Wikipedia in 104 different languages (including English and Ukrainian), has 6 layers, 768 dimensions, and 12 heads, totaling 134 million parameters (compared to 177 million parameters for the BERT base multilingual).</p><p>The MS MARCO DistilBERT base v4 <ref type="bibr" target="#b16">[17]</ref> was chosen as a teacher model. It embeds text into a 768-dimensional vector space and can be used for clustering and semantic search. This model was fine-tuned on the original MS MARCO passage ranking dataset and optimized to generate embeddings for queries and passages.</p><p>We trained the student model for 5 epochs with a batch size of 24, 10,000 warm-up steps, and a learning rate of 2e-5. The entire training process on the train and development (used for intermediate evaluation) subsets took about 8 hours using the Apple M3 Max chip (16-core CPU, 40-core GPU). To measure the difference between computed and target query embeddings, we used the MSE loss (Figure <ref type="figure" target="#fig_6">7</ref>). The teacher model has a vocabulary of 30,522 tokens, the same as the original BERT base. This vocabulary includes common words, sub-words, and special tokens to deal with English. The student model, on the other hand, is a distilled version of the BERT multilingual. Its extended vocabulary of 119,547 tokens covers symbols from many different languages, allowing it to efficiently process and understand text in different linguistic contexts.</p><p>Any dataset for semantic search or information retrieval systems has selection bias. It can be related to the data source, the date of publication, and the personal preferences of the publisher. 
Achieving high performance on cross-lingual tasks depends on the ability to seamlessly map sentences from different languages into a single vector space. The similarity between alphabets also significantly affects accuracy. Languages with similar alphabets, such as English and German, typically produce more accurate results than languages with dissimilar alphabets, such as English and Ukrainian.</p><p>Using the original English and machine-translated Ukrainian search queries, we achieved a reasonable level of accuracy (about 93%) for English to Ukrainian and Ukrainian to English translation tasks, indicating a high level of proficiency in both directions (Figure <ref type="figure" target="#fig_7">8</ref>). It is very important to monitor performance on the subset that did not participate in training to avoid overfitting. MSE on the evaluation subset steadily decreases, indicating that the student model learns to more accurately mimic the results of the teacher model (Figure <ref type="figure" target="#fig_8">9</ref>). Evaluating the performance of information retrieval systems is a critical step in improving their efficiency. NIST assessors labeled TREC 19 using multi-graded judgments, making it easy to measure all the necessary and widely used state-of-the-art metrics.</p><p>Reciprocal rank (RR) calculates the score for the first relevant passage in the ranked list. This metric is very important when we need to evaluate the occurrence of the most relevant passage:</p><formula xml:id="formula_0">RR = 1 / rel,<label>(1)</label></formula><p>where rel is the rank of the first relevant passage in the list. Mean average precision (MAP) evaluates both the relevance of the suggested passages and the position of the most relevant passages at the top. For each query, the average precision (AP) is determined by calculating the arithmetic mean of the precision scores for every position in which a relevant passage was found. 
This metric focuses on the ability to distinguish relevant and irrelevant items:</p><formula xml:id="formula_1">MAP@k = (1 / Q) Σ_{i=1}^{Q} AP@k_i,<label>(2)</label></formula><p>where Q is the total number of queries. Normalized discounted cumulative gain (NDCG) measures the ability of machine learning algorithms to sort passages by relevance. NDCG is determined by dividing the discounted cumulative gain (DCG) by the ideal DCG, representing the best version of the ranking. This is useful in many scenarios where we expect passages to be sorted by relevance:</p><formula xml:id="formula_2">NDCG@k = DCG@k / IDCG@k = (Σ_{i=1}^{k} rel_i / log_2(i + 1)) / (Σ_{i=1}^{k} rel*_i / log_2(i + 1)),<label>(3)</label></formula><p>where k is the number of items considered in the calculation and rel*_i is the relevance of the i-th passage in the ideal ordering. Both MAP and NDCG reflect ranking quality, but account for rank reduction in different ways. MAP gives more weight to relevant passages at the top of the list because this metric is based on precision, making it more sensitive to changes in early positions. NDCG assigns logarithmically decreasing weights to passages further down the ranking, so the contribution of lower-ranked passages decreases relatively slowly.</p><p>In the case of models based on the bi-encoder architecture, it makes sense to evaluate up to 100 passages using the above metrics (Table <ref type="table" target="#tab_0">1</ref>). If the obtained performance is not sufficient, a reranking model based on the cross-encoder architecture can be used to improve results. The bilingual model trained using knowledge distillation inherited the capabilities of the monolingual teacher model and shows reasonable results for both languages. The obtained 768-dimensional vectors are optimized to work with cosine similarity as expected.</p><p>We can notice a slight decrease in performance for English, which is not critical since we gained the ability to search passages using multiple languages. A higher score does not necessarily mean higher performance in production. 
At some point, models can become too specialized to MS MARCO and its selection bias.</p><p>It is impossible to create a perfect dataset for semantic search. Manual creation of a subset even for evaluation is quite expensive and unfortunately always has some selection bias. This is a long-recognized problem, but there is no good solution for it, especially at the scale of millions of queries and passages.</p><p>The number of languages may not be limited to two, but it is better to add only justified languages and keep a balance between performance and multilingual search capabilities. Otherwise, adding new languages degrades performance because the capacity of the model remains the same.</p><p>The performance of the trained model was also affected by the quality of the data, which in our case is a combination of original and machine-translated queries. Cloud solutions such as Google Translate or DeepL could marginally improve the results but would not compare to manual translation by native speakers. The time taken to create pairs using the neural machine translation model was about 50% longer than the time required to train the student model. Intuitively, this makes the knowledge distillation less of a training task and more of a translation task.</p></div>
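The metrics in equations (1)-(3) can be implemented directly. The following is a minimal pure-Python sketch (not the authors' evaluation code), assuming binary relevance for RR and AP@k and graded relevance for NDCG@k:

```python
import math

def reciprocal_rank(relevances):
    """Eq. (1): 1/rank of the first relevant passage, 0 if none is found."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def average_precision(relevances, k):
    """AP@k: mean of the precision values at each relevant position."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevances[:k], start=1):
        if rel > 0:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def ndcg(relevances, k):
    """Eq. (3): DCG@k normalized by the DCG of the ideal ordering."""
    dcg = sum(rel / math.log2(rank + 1)
              for rank, rel in enumerate(relevances[:k], start=1))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg > 0 else 0.0

ranked = [0, 1, 0, 1]  # relevance labels in the order the model ranked them
print(reciprocal_rank(ranked))  # → 0.5
```

MAP@k (eq. 2) is then simply the arithmetic mean of `average_precision` over all queries.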
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions</head><p>For multilingual processing of asymmetric semantic search queries, the DistilBERT base multilingual (cased) as a student model and the MS MARCO DistilBERT base v4 as a teacher model can be used in the scenario when storage contains indexed data in English and information must be searched for in Ukrainian, with the reciprocal rank (RR), the mean average precision (MAP), and the normalized discounted cumulative gain (NDCG) used for evaluation.</p><p>The authors achieved a reasonable level of accuracy (about 93%) for English to Ukrainian and Ukrainian to English translation tasks, indicating a high level of proficiency in both directions. In the future, this result could be improved through higher-quality data, such as a combination of original queries and manual translations by native speakers. The resulting model is useful in industries such as finance, healthcare, and e-commerce, where huge data sets are prevalent and asymmetric semantic search plays a key role in quickly retrieving relevant information.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Search in the vector space.</figDesc><graphic coords="2,135.82,104.94,328.58,178.65" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: The idea of multilingual knowledge distillation.</figDesc><graphic coords="2,110.20,511.31,380.25,138.65" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Pipeline of using the trained and original model.</figDesc><graphic coords="3,111.10,412.08,378.43,119.45" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Length and word patterns of the queries from the train and development subsets.</figDesc><graphic coords="4,82.33,249.10,435.61,186.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Length and character type distribution of the passages.</figDesc><graphic coords="4,83.68,538.32,433.30,186.04" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Length and word patterns of the queries from the evaluation dataset.</figDesc><graphic coords="5,86.80,441.45,426.95,181.70" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: MSE loss on the train and evaluation subsets.</figDesc><graphic coords="6,127.82,377.99,344.54,208.50" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 8 :</head><label>8</label><figDesc>Figure 8: Accuracy of translation between English and Ukrainian.</figDesc><graphic coords="7,126.20,146.32,348.19,213.70" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 9 :</head><label>9</label><figDesc>Figure 9: MSE on the evaluation subset.</figDesc><graphic coords="7,125.13,449.75,350.34,210.70" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Performance of the original and trained model on TREC 19</figDesc><table><row><cell>Model</cell><cell cols="3">TREC 2019 EN-EN</cell><cell cols="3">TREC 2019 UK-EN</cell></row><row><cell></cell><cell>RR</cell><cell>MAP@100</cell><cell>NDCG@100</cell><cell>RR</cell><cell>MAP@100</cell><cell>NDCG@100</cell></row><row><cell>msmarco-distilbert-base-v4</cell><cell>0.96</cell><cell>0.35</cell><cell>0.59</cell><cell></cell><cell></cell><cell></cell></row><row><cell>msmarco-distilbert-multilingual-en-uk</cell><cell>0.92</cell><cell>0.29</cell><cell>0.53</cell><cell>0.78</cell><cell>0.25</cell><cell>0.46</cell></row></table></figure>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Declaration on Generative AI</head><p>The authors have not employed any Generative AI tools.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">An Intelligent Semantic Vector Search Model for Grading and Assessing Students</title>
		<author>
			<persName><forename type="first">T</forename><surname>Kgosietsile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">U</forename><surname>Okike</surname></persName>
		</author>
		<idno type="DOI">10.1109/i-COSTE60462.2023.10500811</idno>
	</analytic>
	<monogr>
		<title level="m">2023 International Conference on Sustainable Technology and Engineering (i-COSTE)</title>
				<meeting><address><addrLine>Nadi, Fiji</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1" to="6" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Semantic Search for Sustainable Platforms Using Transformers</title>
		<author>
			<persName><forename type="first">S</forename><surname>Shirol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kulkarni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Agarwal</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICETCI58599.2023.10331079</idno>
	</analytic>
	<monogr>
		<title level="m">2023 International Conference on Emerging Techniques in Computational Intelligence (ICETCI)</title>
				<meeting><address><addrLine>Hyderabad, India</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="112" to="118" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Deep Learning-based Semantic Search Techniques for Enhancing Product Matching in E-commerce</title>
		<author>
			<persName><forename type="first">F</forename><surname>Aamir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Sherafgan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Arbab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jamil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">N</forename><surname>Bhatti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Hameed</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICMI60790.2024.10586148</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE 3rd International Conference on Computing and Machine Intelligence (ICMI)</title>
				<meeting><address><addrLine>Mt Pleasant, MI, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024. 2024</date>
			<biblScope unit="page" from="1" to="9" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Unveiling Deeper Petrochemical Insights: Navigating Contextual Question Answering with the Power of Semantic Search and LLM Fine-Tuning</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">P</forename><surname>Saikia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Mukherjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mahapatra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nandy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Das</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICCCIS60361.2023.10425564</idno>
	</analytic>
	<monogr>
		<title level="m">2023 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS)</title>
				<meeting><address><addrLine>Greater Noida, India</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="881" to="886" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Employee Data</title>
		<author>
			<persName><forename type="first">D</forename><surname>Sheth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">R</forename><surname>Gupta</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICCICA52458.2021.9697114</idno>
	</analytic>
	<monogr>
		<title level="m">2021 International Conference on Computational Intelligence and Computing Applications (ICCICA)</title>
				<meeting><address><addrLine>Nagpur, India</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="1" to="4" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Semantic Social Searching-An Ontology Based Approach</title>
		<author>
			<persName><forename type="first">I</forename><surname>Sindhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Shamsi</surname></persName>
		</author>
		<idno type="DOI">10.1109/IMCERT57083.2023.10075145</idno>
	</analytic>
	<monogr>
		<title level="m">International Multi-disciplinary Conference in Emerging Research Trends (IMCERT)</title>
				<meeting><address><addrLine>Karachi, Pakistan</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1" to="4" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Semantic Search System For Real Time Occupancy</title>
		<author>
			<persName><forename type="first">P</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">S</forename><surname>Acharya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Beliatis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Presser</surname></persName>
		</author>
		<idno type="DOI">10.1109/IoTaIS53735.2021.9628719</idno>
	</analytic>
	<monogr>
		<title level="m">2021 IEEE International Conference on Internet of Things and Intelligence Systems (IoTaIS)</title>
				<meeting><address><addrLine>Bandung, Indonesia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="49" to="55" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">An Analysis of the Technical Trend of Semantic Search in Natural Language Processing</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zheng</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICNISC60562.2023.00033</idno>
	</analytic>
	<monogr>
		<title level="m">9th Annual International Conference on Network and Information Systems for Computers (ICNISC)</title>
				<meeting><address><addrLine>Wuhan, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="51" to="53" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Blending Weighted TF-IDF &amp; BERT Improving Semantic Search</title>
		<author>
			<persName><surname>Md</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICARC54489.2022.9753875</idno>
	</analytic>
	<monogr>
		<title level="m">2022 2nd International Conference on Advanced Research in Computing (ICARC)</title>
				<meeting><address><addrLine>Belihuloya, Sri Lanka</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="154" to="159" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Ameliorating Semantic Search Through Advanced AI Techniques</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">V</forename></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dhabliya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mathur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">B</forename><surname>Rao</surname></persName>
		</author>
		<idno type="DOI">10.1109/SMARTGENCON60755.2023.10442780</idno>
	</analytic>
	<monogr>
		<title level="m">2023 3rd International Conference on Smart Generation Computing, Communication and Networking (SMART GENCON)</title>
				<meeting><address><addrLine>Bangalore, India</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1" to="6" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">FBC: Fusing Bi-Encoder and Cross-Encoder for Long-Form Text Matching</title>
		<author>
			<persName><forename type="first">J</forename><surname>Liao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Duan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<idno type="DOI">10.3233/faia230426</idno>
	</analytic>
	<monogr>
		<title level="m">Frontiers in Artificial Intelligence and Applications</title>
				<meeting><address><addrLine>Krakow, Poland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1473" to="1480" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation</title>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.emnlp-main.365</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</title>
				<meeting>the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="4512" to="4525" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">MS MARCO</title>
		<ptr target="https://microsoft.github.io/msmarco" />
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">OPUS-MT – Building open translation services for the World</title>
		<author>
			<persName><forename type="first">J</forename><surname>Tiedemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Thottingal</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2020.eamt-1.61" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 22nd Annual Conference of the European Association for Machine Translation</title>
				<meeting>the 22nd Annual Conference of the European Association for Machine Translation<address><addrLine>Lisboa, Portugal</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="479" to="480" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">TREC Deep Learning Track: Reusable Test Collections in the Large Data Regime</title>
		<author>
			<persName><forename type="first">N</forename><surname>Craswell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Yilmaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Campos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Voorhees</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Soboroff</surname></persName>
		</author>
		<idno type="DOI">10.1145/3404835.3463249</idno>
	</analytic>
	<monogr>
		<title level="m">SIGIR &apos;21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="2369" to="2375" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</title>
		<author>
			<persName><forename type="first">V</forename><surname>Sanh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Debut</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chaumond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wolf</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.1910.01108</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks</title>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D19-1410</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</title>
				<meeting>the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)<address><addrLine>Hong Kong, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="3973" to="3983" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
