<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">SEUPD@CLEF: Team Kalu on improving Search Engine Performance with Query Expansion and Re-Ranking Approach</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Kimia</forename><surname>Abedini</surname></persName>
							<email>kimia.abedini@studenti.unipd.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Padua</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Akan</forename><surname>Akysh</surname></persName>
							<email>akan.akysh@studenti.unipd.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Padua</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Arwa</forename><surname>Fahoud</surname></persName>
							<email>arwa.fahoud@studenti.unipd.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Padua</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Nicola</forename><surname>Ferro</surname></persName>
							<email>nicola.ferro@unipd.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Padua</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">SEUPD@CLEF: Team Kalu on improving Search Engine Performance with Query Expansion and Re-Ranking Approach</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">8DAD8909CA492B5770D46325D7934EAA</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:02+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Information Retrieval</term>
					<term>Search Engines</term>
					<term>Retrieve Documents</term>
					<term>Query Expansion</term>
					<term>LongEval</term>
					<term>CLEF</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This report provides a detailed description of the search engine system designed by Team KALU for the Conference and Labs of the Evaluation Forum (CLEF) LongEval Lab 2024 Task 1. The team, composed of students from the University of Padua, developed this system to efficiently index, search, and retrieve documents. We begin by outlining the problem, then describe our system, which works on a collection of documents written in French, and explain the various methodologies we implemented. Finally, we present our experimental results and analyze them according to the techniques we employed.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Search engines have transformed people's access to information. These systems have proven to be fundamental tools in people's daily lives, providing a vast amount of data related to various aspects, from academic research and global news to everyday queries like shopping and local weather. However, the exponential growth of data available online presents a significant challenge for search engines in terms of storage, indexing, and retrieval. This paper addresses this issue by presenting an information retrieval system capable of adjusting to evolving data while sustaining high performance.</p><p>Our approach implements a search engine to address Task 1 of the "LongEval" lab proposed by CLEF 2024, which aims to search a corpus of documents and retrieve those most relevant to a predefined set of queries gathered from Qwant <ref type="bibr" target="#b0">[1]</ref>.</p><p>The paper is structured as follows: Section 2 reviews the related work we started from; Section 3 outlines our approach; Section 4 explains our experimental setup; Section 5 discusses our main findings; Section 6 presents a statistical analysis of our submitted runs; and Section 7 concludes with reflections and future directions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>We used the paper by the LongEval organizers <ref type="bibr" target="#b1">[2]</ref> to understand the task and the datasets provided by CLEF. This helped us understand how the documents and queries were collected and what the main objectives of the task were. The paper also provided baseline performances, which we used to benchmark our system's development.</p><p>Based on the work of the CLOSE <ref type="bibr" target="#b2">[3]</ref> and FADERIC <ref type="bibr" target="#b3">[4]</ref> teams, we chose to utilize query expansion techniques, exploring various methods, such as different Large Language Models (LLMs) and different prompts. We also built our re-ranking approach on the work of Jina AI <ref type="bibr" target="#b4">[5]</ref>, which developed the jina-reranker model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head><p>In this section, we describe the steps we took to create the different components that comprise the search engine, along with the configurations used for each component. • ParsedDocument class represents a parsed document to be indexed. It has two attributes: ID, the unique identifier of the document, and body, the document's content. The class provides functionality to set and retrieve these attributes. • DocumentParser class is an abstract class providing basic functionality to iterate over the elements of a ParsedDocument, reading and parsing its content. • LongEvalDocumentParser class is the DocumentParser specific to the LongEval corpus. It implements a parser for documents in the TREC format: it reads each document, replaces all tags with spaces, and returns a ParsedDocument containing the ID and body of the document.</p></div>
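The parsing step can be illustrated with a minimal Python sketch (the actual implementation is a Java class hierarchy; the TREC field names DOCNO and TEXT and the helper name parse_trec are illustrative assumptions):

```python
import re

def parse_trec(text):
    """Toy TREC-format parser: yields (doc_id, body) pairs, replacing any
    remaining tags inside the body with spaces, as our parser does."""
    for block in re.findall(r"<DOC>(.*?)</DOC>", text, re.S):
        doc_id = re.search(r"<DOCNO>\s*(.*?)\s*</DOCNO>", block).group(1)
        raw = re.search(r"<TEXT>(.*?)</TEXT>", block, re.S).group(1)
        body = " ".join(re.sub(r"<[^>]+>", " ", raw).split())  # tags -> spaces
        yield doc_id, body
```

Each yielded pair corresponds to one ParsedDocument (ID plus body) handed to the indexer.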
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Analyzer</head><p>The Analyzer applies further processing steps to both the documents and the queries. It performs tokenization, stemming, and stop word filtering.</p><p>• Tokenization is the process of splitting text into tokens (each token can be thought of as a single word). We used the StandardTokenizer <ref type="bibr" target="#b5">[6]</ref>, one of the tokenizers provided by Apache Lucene. • Stemming is the process of removing suffixes or prefixes from words to obtain their root form. In this system we used the FrenchLightStemmer <ref type="bibr" target="#b6">[7]</ref>. • Stop word removal is the process of eliminating words that appear frequently in most documents and carry little meaningful information for search queries. Our stopword list includes words extracted with Luke <ref type="bibr" target="#b7">[8]</ref>, a tool provided by Lucene <ref type="bibr" target="#b8">[9]</ref>, as well as the stopword list available on Kaggle <ref type="bibr" target="#b9">[10]</ref>. Removing stopwords improved the search engine's performance by reducing the index size and, consequently, the time needed to search the index. We also used the FrenchElisionFilter <ref type="bibr" target="#b10">[11]</ref>, which addresses elision in French: certain characters, such as 'l', 'd', 's', 't', 'n', and 'm', followed by an apostrophe, are contracted by eliminating the character and its apostrophe.</p></div>
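The analysis chain can be sketched in a few lines of Python (the real system uses Lucene's Java analyzers; the stopword subset and the 2–15 length bounds here are illustrative, and stemming is omitted for brevity):

```python
import re

STOPWORDS = {"le", "la", "les", "de", "des", "du", "un", "une", "et"}  # tiny illustrative subset
ELISION = re.compile(r"^[ldstnm]'")  # l', d', s', t', n', m' followed by an apostrophe

def analyze(text):
    """Tokenize, lowercase, strip French elisions, drop stopwords, length-filter."""
    tokens = re.findall(r"[a-zàâçéèêëîïôùûüœ']+", text.lower())
    tokens = [ELISION.sub("", t) for t in tokens]          # elision filter
    return [t for t in tokens if t not in STOPWORDS and 2 <= len(t) <= 15]
```

For example, analyze("L'antivirus gratuit et les logiciels") drops the elided "l'" and the stopwords, keeping only the content-bearing tokens.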
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Indexer</head><p>Creating an index is one of the main processes: we generate a searchable database, known as an index, for the parsed documents. The index holds crucial information about the documents, such as the words and phrases they contain, their frequencies, and their locations within each document. Indexing facilitates swift document retrieval by enabling users to search based on keywords or phrases. To accomplish this task, we developed the following components:</p><p>• analyzer: the analyzer to be used, in our case LongEvalAnalyzer.</p><p>• similarity: an object used to score the relevance of a document based on the query terms it contains. It can be either Lucene's default implementation, based on a variant of the Term Frequency-Inverse Document Frequency (TF-IDF) model, or the modern alternative BM25, which we used. • ramBufferSizeMB: the size in megabytes of the RAM buffer for indexing the documents.</p><p>• indexPath: the path to the directory where the generated index should be stored.</p><p>• docsPath: the path to the documents directory.</p><p>• dpCls: an object of the DocumentParser class, responsible for parsing the documents in the collection.</p><p>We used BM25 similarity because, unlike TF-IDF, where term frequency linearly affects the score, BM25 introduces a saturation point that prevents the term frequency component from indefinitely influencing the score. BM25 scores a document based on the query terms appearing in it using the following formula:</p><formula xml:id="formula_0">\mathrm{BM25}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}</formula><p>where D is the document being scored, Q is the query consisting of words q_1, q_2, \ldots, q_n, f(q_i, D) is the frequency of term q_i in document D, IDF(q_i) is the inverse document frequency of term q_i, |D| is the length of the document, and avgdl is the average document length in the collection. Finally, k_1 and b are free parameters; we left k_1 = 1.2 and b = 0.75, as in most applications.</p><p>After setting the configuration of the indexer, it does the following:</p><p>• The indexer walks through the documents directory to find documents (specifically .txt files) and processes each file for indexing. • It uses the DocumentParser class, which parses the documents and creates structured ParsedDocument objects, including document identifiers and bodies. • Each parsed document is converted into a Lucene Document object and added to the index. The document's ID and body are stored as fields within the Lucene Document.</p><p>The configurations used for the fields added to the Lucene Document are:</p><p>• IDField: the ID field stores only the document ID, without term frequencies or positions, which are unnecessary for unique identifiers. The original value of the field is stored directly in the index, so it can be retrieved when querying.</p><p>• BodyField: the body field is configured to store the terms resulting from splitting the body of the document into tokens, along with their frequencies, without storing the original text, in order to reduce the size of the index.</p></div>
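As a sanity check of the formula, BM25 can be computed with a small self-contained Python function (a toy version of what Lucene's BM25Similarity does internally; the smoothed IDF variant shown here is an assumption, chosen because it is a common form):

```python
import math

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avgdl, k1=1.2, b=0.75):
    """Score one document against a query, with the same k1 and b as our runs."""
    dl = len(doc_terms)  # |D|
    score = 0.0
    for q in query_terms:
        f = doc_terms.count(q)                              # f(q_i, D)
        if f == 0:
            continue
        df = doc_freqs.get(q, 0)                            # document frequency of q_i
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))
    return score
```

Because of the saturation term in the denominator, doubling f(q_i, D) less than doubles the score, which is exactly the behaviour that distinguishes BM25 from plain TF-IDF.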
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Searcher</head><p>The Searcher's task is to scan the indexed documents, analyze user queries, and retrieve relevant information. It then presents a ranked list of documents that satisfy the user's information need.</p><p>Our implementation does so by accepting the following parameters:</p><p>• analyzer: in this case, an instance of our Analyzer.</p><p>• similarity: we decided to use the BM25Similarity <ref type="bibr" target="#b11">[12]</ref> function for the first stage of document retrieval due to its efficiency and higher effectiveness compared to other methods. • Run options: parameters for the index path, the topics path, the run path and run name, the expected number of topics, and the maximum number of documents retrieved (in our case 1000). • Search options: parameters for query boosting, the query boosting value, the number of documents to be re-ranked, the score calculation mode, the query expansion mode, and the LLM used for generating the expansions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.1.">Query Expansion</head><p>Query expansion plays a valuable role in improving the performance of our search engine, depending on how it is used. We generate multiple expansions for each query with a Python script. The script reads the *.trec topic file and generates synonymous phrases using the Meta Llama 3 <ref type="bibr" target="#b12">[13]</ref> and Mistral-7B <ref type="bibr" target="#b13">[14]</ref> models, both of which are open source. The prompt used for generation is as follows:</p><p>Instruction: Please provide {num_expansion} synonyms in French for the given keyword that convey similar meanings. The output should be a list of words separated by commas without any further punctuation. The keyword is {word}.</p><p>Following this procedure, we slightly cleaned the generated file. Since we occasionally encountered low-quality or missing generations, we implemented a method in the search component that automatically switches to a second model if the first one fails. Sample results for the prompt are shown in Table 1.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.2.">Query Boosting</head><p>When executing a query, Lucene assigns a score to each matching document based on its relevance. Query boosting enables modification of these scores for particular documents or groups. Through experimentation, we found that mixing query expansion and Boolean queries sometimes produced poorer outcomes. However, by introducing Lucene's BoostQuery <ref type="bibr" target="#b14">[15]</ref>, we observed improvements in our evaluation metrics. We experimented with three approaches:</p><p>• Multiplying by the number of expansions we have.</p><p>• Using a fine-tuned parameter of 14.68, multiplied by the number of expansions we have. • 14.68 × (num_expansion + total_expansion/num_expansion − 1), where num_expansion is the number of queries we selected in query expansion and total_expansion is the total number of queries we expanded.
Our reasoning for this approach was to prioritize exact terms when a word has many meanings. However, our findings suggest that this strategy does not produce the anticipated results.</p><p>To summarize, we settled on the second approach, using a Boolean query in which we added at most three expansions with the SHOULD clause and boosted the main query with MUST.</p></div>
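The three boost variants can be written out as a small helper (the function and mode names are hypothetical; only the 14.68 constant and the formulas come from our experiments):

```python
def boost_value(mode, num_expansion, total_expansion, tuned=14.68):
    """Return the boost applied to the main query under each strategy we tried."""
    if mode == "count":   # multiply by the number of expansions
        return float(num_expansion)
    if mode == "tuned":   # fine-tuned constant times the number of expansions
        return tuned * num_expansion
    if mode == "ratio":   # 14.68 * (num_expansion + total_expansion / num_expansion - 1)
        return tuned * (num_expansion + total_expansion / num_expansion - 1)
    raise ValueError(mode)
```

The resulting value would be passed to Lucene's BoostQuery wrapping the main query clause.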
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.3.">Document Re-Ranking</head><p>Document re-ranking is the process of taking the initially ranked list of documents (or items) and re-evaluating their relevance or importance based on new information, constraints, or preferences. Our approaches to document re-ranking are:</p><p>• Secondary ranking function: apply a secondary ranking function that considers additional criteria or constraints. • Score adjustment: modify the scores of individual documents based on another score.</p><p>For the ranking function, we considered two different approaches: using Bidirectional Encoder Representations from Transformers (BERT) models and using LLMs. In both cases, the objective is to compute the embeddings of the words and determine the cosine similarity between the query and the document.</p><p>After some research, we attempted to find a fast SBERT <ref type="bibr" target="#b15">[16]</ref> model and a well-tuned LLM to assess performance. We opted for jina-reranker-v1-turbo-en, designed for rapid reranking while maintaining competitive performance, and built on the JinaBERT <ref type="bibr" target="#b16">[17]</ref> model. For the LLM, we chose sentence-croissant-llm-base, engineered to produce French text embeddings; it was fine-tuned from the recently pre-trained LLM croissantllm/CroissantLLMBase <ref type="bibr" target="#b17">[18]</ref>.</p><p>In the end, we found that employing LLMs for re-ranking is computationally expensive and produces nearly identical results.
Consequently, we decided to use jina-reranker-v1-turbo-en, re-ranking the first 200 documents and leaving the rest unchanged.</p><p>For score adjustment we used two modes:</p><p>• Simple mode: change the score of the document directly based on the secondary ranking function.</p><p>• Harmonic mode: combine the BM25 score with the secondary ranking function's score using the harmonic mean:</p><formula xml:id="formula_1">H = \frac{2}{\frac{1}{x_1} + \frac{1}{x_2}}</formula><p>Based on the results, the harmonic mode performs better.</p></div>
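The two score-adjustment modes can be sketched as follows (combine_scores is a hypothetical name; x1 and x2 in the harmonic-mean formula correspond to the BM25 and re-ranker scores):

```python
def combine_scores(bm25, rerank, mode="harmonic"):
    """Combine the first-stage BM25 score with the secondary (re-ranker) score."""
    if mode == "simple":
        return rerank                          # take the re-ranker score directly
    return 2.0 / (1.0 / bm25 + 1.0 / rerank)   # harmonic mean H = 2 / (1/x1 + 1/x2)
```

The harmonic mean is dominated by the smaller of the two inputs, so a document must score well under both BM25 and the re-ranker to keep a high final score, which may explain why this mode performed better.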
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental Setup</head><p>The experimental setup for our Information Retrieval (IR) system uses the LongEval collection, the official training collection for the 2024 LongEval IR Lab (https://clef-longeval.github.io/). The collection contains French-language web pages and queries, along with their English translations. We used the French data for our experiments.</p><p>To assess the performance of our IR system, we used the trec_eval executable to evaluate the results under various configurations. We monitored improvements in the following evaluation metrics produced by trec_eval:</p><p>• num_ret: number of documents retrieved for each query.</p><p>• num_rel: number of relevant documents for each query.</p><p>• num_rel_ret: number of relevant documents retrieved for each query.</p><p>• map: Mean Average Precision, indicating the average relevance of retrieved documents across all queries. • rprec: R-Precision, calculated at the rank corresponding to the number of relevant documents for each query. • p@5 &amp; p@10: precision scores computed at the top 5 and top 10 retrieved documents for each query. • nDCG: Normalized Discounted Cumulative Gain, a metric evaluating ranked lists by considering item relevance.</p><p>Our project's Git repository is publicly available at https://bitbucket.org/upd-dei-stud-prj/seupd2324-kalu/src/master, and the code is openly accessible for replication. We used a MacBook Pro with an M2 Max chip (12-core CPU, 30-core GPU, 32 GB RAM) to compute our runs.</p></div>
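For intuition about two of the metrics reported by trec_eval, precision at k and average precision can be computed as follows (a simplified sketch; trec_eval itself handles details such as run trimming and unjudged documents):

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at the ranks where relevant documents appear."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked_ids, start=1):
        if d in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0
```

MAP, the measure we optimized for, is simply the mean of average_precision over all queries in the topic set.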
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results and Discussion</head><p>In this section, we present some of the most significant results obtained during the development phase. We consider five primary milestones that, after multiple trials, substantially improved our Mean Average Precision (MAP) score and the overall number of relevant documents retrieved. Several models were evaluated, focusing on re-ranking and query expansion techniques. Initially, we found that using the FrenchLightStemFilter <ref type="bibr" target="#b18">[19]</ref> as the stemmer and adjusting the length filter from 2 to 15 (reflecting the tendency of French to have longer words) yielded very positive results <ref type="bibr" target="#b2">[3]</ref>. We then introduced four models: a base model; re-ranking 100 documents with the simple score combination mode; re-ranking 100 documents with the simple mode using Mistral query expansion with a threshold of three words; and re-ranking 100 documents with the simple mode using Llama query expansion with three words. The third model, utilizing Mistral query expansion, achieved the highest MAP of 0.044, surpassing the base model. Subsequently, two more models were introduced to compare score combination modes and the handling of empty expansion cases: by handling empty cases when necessary, utilizing stopwords, and using the harmonic score combination, we achieved a higher MAP of 0.0487 compared to the base model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Results on training data</head><p>To sum up, the most successful approaches involved re-ranking with Mistral query expansion (with Llama 3 as fallback), a threshold of three, and the inclusion of stopwords; among these methods, the harmonic mean performed best.</p><p>Additional models were tested, including summarizing texts with an LLM and integrating the summaries into the original texts before indexing, or completely replacing the original text with a summary generated by flan-t5-3b-summarizer <ref type="bibr" target="#b19">[20]</ref>. However, these approaches provided results similar to the simpler methods and required significantly more time to execute. Furthermore, various boosting methods were explored, but most decreased the MAP on the training dataset (see Section 3.4.2). We also considered discarding the use of an LLM for re-ranking due to its poor performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Results on Test data</head><p>In this section, we provide the results obtained by running our systems on the two available test collections, short-term and long-term, reported in Table 4 and Table 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Statistical Analysis</head><p>In this section, we conduct a statistical analysis of the retrieval effectiveness of our five runs submitted to CLEF. This evaluation aims to assess each run's performance and determine how well the system retrieves and ranks relevant documents.</p><p>We compared the Normalized Discounted Cumulative Gain (nDCG) and Mean Average Precision (MAP) of each run to understand the performance differences among them, considering both short-term and long-term evaluations. The analysis uses box plots, two-way ANOVA, and the Tukey HSD test.</p><p>We first used box plots, which represent a distribution of data concisely. We then applied two-way Analysis of Variance (ANOVA) tests to explore the differences observed in both short-term and long-term evaluations. In addition, we used the Tukey Honest Significant Difference (HSD) test, a post-hoc analysis for ANOVA that compares group means while controlling for multiple comparisons, to ensure reliable identification of significant differences.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">Box Plot</head><p>Box plots are graphical tools that represent a distribution of data concisely. In our case, we plot the distribution of the scores achieved by our submitted systems on each query of the different test sets, with respect to nDCG and MAP. Analysing the nDCG performance of all runs on the short-term set, we observe that run1 achieves lower nDCG scores, indicating its inferior effectiveness in capturing and ranking relevant documents, while the other four runs exhibit approximately similar levels of performance. The same nDCG pattern holds on the long-term set. From the box plots, we can also observe the distribution of MAP scores for each run. Analysing the MAP performance of all runs on the short-term and long-term sets, we observe that run1 has the lowest MAP scores, indicating its inferior accuracy; run2 and run3 have approximately the same MAP scores in the short- and long-term evaluations, while run4 and run5 have the highest MAP scores, with a slight difference from run2 and run3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">Two-way ANOVA</head><p>In a two-way ANOVA test, we check whether the factors Topic and System influence the results, testing on both the MAP and nDCG measures. From the results of the two-way ANOVA tests, we can conclude that both factors (System and Topic) are important in influencing the performance measures (nDCG and MAP) in both short-term and long-term evaluations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusions and Future Work</head><p>In this work, we presented our approach to the CLEF LongEval Lab 2024 task, which aimed to develop an effective and efficient search engine for web documents. Our approach combined different techniques, including query expansion, re-ranking, and the use of large language models for different purposes. Our experiments showed good results for our approach, with better effectiveness and efficiency than the baseline system provided by CLEF. Combining two scores in the re-ranking phase also improved retrieval performance. We found several areas in which our approach could be improved further.</p><p>One promising direction is to use text summarization and title extraction techniques in the parsing step. While we experimented with this approach, it did not generate significant improvements and raised efficiency concerns. However, we believe that refining this technique or exploring alternative approaches could lead to better results.</p><p>Another idea is to embed documents using different methods <ref type="bibr" target="#b20">[21]</ref> for re-ranking, or to embed text chunks and their summaries: chunking text documents into small pieces is an interesting technique that can increase the accuracy and quality of the system and could help capture nuanced semantic relationships between documents. Additionally, including context-awareness when calling LLMs to generate synonyms might have a positive impact on the overall retrieval performance. Furthermore, we could fine-tune our re-ranker with SBERT using the training data and implement a custom re-ranker specific to this task.
By leveraging the strengths of different models and techniques, we hope to achieve even better results and push the boundaries of what is possible in LongEval information retrieval.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Workflow of the IR system implemented by KALU.</figDesc><graphic coords="2,117.13,128.96,361.02,221.45" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>3.1.</head><label>1</label><figDesc>Parser: the Parser processes the collection of documents, extracting valuable information and filtering out irrelevant data. We also use the parser to extract the ID and body of the documents. Our parser consists of three classes: ParsedDocument, DocumentParser, and LongEvalDocumentParser.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Document structure without parsing.</figDesc><graphic coords="2,207.38,478.59,180.50,219.16" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Analyzer Process.</figDesc><graphic coords="3,117.13,421.60,361.00,199.03" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Standard Recall Levels vs Interpolated Precision</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: box plot for both the short-term and long-term set runs (nDCG performance)</figDesc><graphic coords="11,72.00,65.61,221.13,177.45" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head></head><label></label><figDesc>(a) long-term set runs (b) short-term set runs</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: box plot for both the short-term and long-term set runs (Map performance)</figDesc><graphic coords="11,72.00,366.56,221.13,177.45" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: Tukey's HSD test for all the five runs in long-term evaluation</figDesc><graphic coords="12,72.00,544.04,221.13,165.94" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head>Figure 8 :</head><label>8</label><figDesc>Figure 8: Tukey's HSD test for all the five runs in short-term evaluation</figDesc><graphic coords="13,72.00,65.61,221.13,165.94" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Sample results for query expansion using the given prompt. Query Boosting is a technique used to adjust the score of documents retrieved by a search query, allowing for customization of document relevance based on specific criteria.</figDesc><table><row><cell>Query</cell><cell>Meta Llama 3</cell><cell>Mistral-7B</cell></row><row><cell>anti-virus gratuit</cell><cell>1. programmes antivirus libres 2. solutions antivirus gratuites.</cell><cell>1. logiciel antivirus gratuit 2. solution antivirus gratuit</cell></row><row><cell>bardeau</cell><cell>1. clôture 2. écran 3. barrage</cell><cell>1. plaque de bois 2. tableau 3. pannée</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Parameters used in the 5 different runs submitted to CLEF</figDesc><table><row><cell>Parameter</cell><cell>Run 1</cell><cell>Run 2</cell><cell>Run 3</cell><cell>Run 4</cell><cell>Run 5</cell></row><row><cell>Token Filter</cell><cell>Porter</cell><cell>FrenchLight</cell><cell>FrenchLight</cell><cell>FrenchLight</cell><cell>FrenchLight</cell></row><row><cell>Tokenizer</cell><cell>Standard</cell><cell>Standard</cell><cell>Standard</cell><cell>Standard</cell><cell>Standard</cell></row><row><cell>Length Filter</cell><cell>2-15</cell><cell>2-15</cell><cell>2-15</cell><cell>2-15</cell><cell>2-15</cell></row><row><cell>Stop Filter</cell><cell>"None"</cell><cell>"stoplist-fr"</cell><cell>"stoplist-fr"</cell><cell>"stoplist-fr"</cell><cell>"stoplist-fr"</cell></row><row><cell>Lower Case Filter</cell><cell>Yes</cell><cell>Yes</cell><cell>Yes</cell><cell>Yes</cell><cell>Yes</cell></row><row><cell>Similarity</cell><cell>BM25</cell><cell>BM25</cell><cell>BM25</cell><cell>BM25</cell><cell>BM25</cell></row><row><cell>Query Expansion</cell><cell>No</cell><cell>Yes</cell><cell>Yes</cell><cell>Yes</cell><cell>Yes</cell></row><row><cell>Query Expansion Model</cell><cell>-</cell><cell>Llama 3</cell><cell>Mistral*</cell><cell>Mistral*</cell><cell>Mistral*</cell></row><row><cell>Boolean Clause Main Query Mode</cell><cell>"SHOULD"</cell><cell>"SHOULD"</cell><cell>"SHOULD"</cell><cell>"SHOULD"</cell><cell>"MUST"</cell></row><row><cell>Re-ranking</cell><cell>No</cell><cell>Yes</cell><cell>Yes</cell><cell>Yes</cell><cell>Yes</cell></row><row><cell>Score Combination Mode</cell><cell>-</cell><cell>Simple</cell><cell>Simple</cell><cell>Harmonic</cell><cell>Harmonic</cell></row><row><cell>Num. of Re-ranked Documents</cell><cell>-</cell><cell>100</cell><cell>100</cell><cell>200</cell><cell>200</cell></row></table><note>* If the model failed, it would switch to another one.</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Results for systems (top-1000 documents), on the French Train collection and the train query set of LongEval.</figDesc><table><row><cell>metrics</cell><cell>run1</cell><cell>run2</cell><cell>run3</cell><cell>run4</cell><cell>run5</cell></row><row><cell>num_q</cell><cell>597</cell><cell>599</cell><cell>599</cell><cell>599</cell><cell>597</cell></row><row><cell>num_ret</cell><cell cols="5">584902 598281 599000 599000 587943</cell></row><row><cell>num_rel</cell><cell>4344</cell><cell>4362</cell><cell>4362</cell><cell>4362</cell><cell>4350</cell></row><row><cell>num_rel_ret</cell><cell>3542</cell><cell>3553</cell><cell>3583</cell><cell>3583</cell><cell>3578</cell></row><row><cell>map</cell><cell cols="5">0.1853 0.2286 0.2313 0.2366 0.2374</cell></row><row><cell>gm_map</cell><cell cols="5">0.0484 0.0659 0.0744 0.0817 0.0841</cell></row><row><cell>Rprec</cell><cell cols="3">0.1756 0.2271 0.2281</cell><cell>0.23</cell><cell>0.2308</cell></row><row><cell>recall_10</cell><cell>0.231</cell><cell cols="4">0.2855 0.2857 0.2915 0.2925</cell></row><row><cell>recall_100</cell><cell>0.5666</cell><cell>0.574</cell><cell cols="3">0.5826 0.6174 0.6192</cell></row><row><cell>recall_1000</cell><cell>0.8123</cell><cell>0.812</cell><cell>0.821</cell><cell>0.821</cell><cell>0.8225</cell></row><row><cell>ndcg</cell><cell cols="5">0.3692 0.4065 0.4097 0.4161 0.4173</cell></row><row><cell>ndcg_rel</cell><cell cols="2">0.2875 0.3274</cell><cell>0.329</cell><cell cols="2">0.3344 0.3354</cell></row><row><cell>Rndcg</cell><cell>0.2243</cell><cell>0.269</cell><cell cols="3">0.2701 0.2745 0.2754</cell></row><row><cell>ndcg_cut_10</cell><cell cols="4">0.1954 0.2464 0.2458 0.2511</cell><cell>0.252</cell></row><row><cell>ndcg_cut_100</cell><cell cols="4">0.3162 0.3556 0.3591 0.3727</cell><cell>0.374</cell></row><row><cell>ndcg_cut_1000</cell><cell cols="5">0.3692 0.4065 0.4097 0.4161 0.4173</cell></row><row><cell>map_cut_10</cell><cell cols="5">0.1271 0.1691 0.1702 0.1729 0.1735</cell></row><row><cell>map_cut_100</cell><cell cols="5">0.1807 0.2244 0.2273 0.2331 0.2339</cell></row><row><cell>map_cut_1000</cell><cell cols="5">0.1853 0.2286 0.2313 0.2366 0.2374</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>Scores of all five systems on the short-term test collection.</figDesc><table><row><cell>Evaluation measure</cell><cell>run1</cell><cell>run2</cell><cell>run3</cell><cell>run4</cell><cell>run5</cell></row><row><cell>MAP</cell><cell>0.1578</cell><cell>0.1855</cell><cell>0.1875</cell><cell>0.1922</cell><cell>0.1922</cell></row><row><cell>nDCG</cell><cell>0.2984</cell><cell>0.3225</cell><cell>0.3240</cell><cell>0.3302</cell><cell>0.3302</cell></row><row><cell>nDCG@10</cell><cell>0.1886</cell><cell>0.2247</cell><cell>0.2264</cell><cell>0.2297</cell><cell>0.2297</cell></row><row><cell>P@10</cell><cell>0.1472</cell><cell>0.1745</cell><cell>0.1752</cell><cell>0.1789</cell><cell>0.1789</cell></row><row><cell>Recall</cell><cell>0.5884</cell><cell>0.5867</cell><cell>0.5884</cell><cell>0.5884</cell><cell>0.5884</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5</head><label>5</label><figDesc>Scores of all five systems on the long-term test collection.</figDesc><table><row><cell>Evaluation measure</cell><cell>run1</cell><cell>run2</cell><cell>run3</cell><cell>run4</cell><cell>run5</cell></row><row><cell>MAP</cell><cell>0.1067</cell><cell>0.1400</cell><cell>0.1395</cell><cell>0.1430</cell><cell>0.1434</cell></row><row><cell>nDCG</cell><cell>0.2193</cell><cell>0.2502</cell><cell>0.2494</cell><cell>0.2535</cell><cell>0.2542</cell></row><row><cell>nDCG@10</cell><cell>0.1479</cell><cell>0.1921</cell><cell>0.1912</cell><cell>0.1931</cell><cell>0.1936</cell></row><row><cell>P@10</cell><cell>0.1145</cell><cell>0.1407</cell><cell>0.1397</cell><cell>0.1413</cell><cell>0.1417</cell></row><row><cell>Recall</cell><cell>0.4142</cell><cell>0.4136</cell><cell>0.4131</cell><cell>0.4131</cell><cell>0.4142</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 6</head><label>6</label><figDesc>Recap of our runs submitted to CLEF.</figDesc><table><row><cell>runs</cell><cell>language</cell><cell>type</cell></row><row><cell>run1</cell><cell>French</cell><cell>Base</cell></row><row><cell>run2</cell><cell>French</cell><cell>Query Expansion using Llama 3</cell></row><row><cell>run3</cell><cell>French</cell><cell>Re-ranking simple mode</cell></row><row><cell>run4</cell><cell>French</cell><cell>Harmonic Re-ranking using Should</cell></row><row><cell>run5</cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 7</head><label>7</label><figDesc>Two-way ANOVA results for short-term nDCG</figDesc><table><row><cell>Source</cell><cell>df</cell><cell>SS</cell><cell>MS</cell><cell>F</cell><cell>PR(&gt;F)</cell></row><row><cell>Columns(Systems)</cell><cell>4.0</cell><cell>0.2785</cell><cell>0.0696</cell><cell>21.3414</cell><cell>3.594524e-17</cell></row><row><cell>Rows(Topics)</cell><cell>403.0</cell><cell>82.8188</cell><cell>0.2055</cell><cell>62.9957</cell><cell>0</cell></row><row><cell>Error</cell><cell>1612.0</cell><cell>5.2587</cell><cell>0.0033</cell><cell>-</cell><cell>-</cell></row><row><cell>Total</cell><cell>2019.0</cell><cell>88.3559</cell><cell>-</cell><cell>-</cell><cell>-</cell></row><row><cell cols="6">Table 8. Two-way ANOVA results for short-term MAP</cell></row><row><cell>Source</cell><cell>df</cell><cell>SS</cell><cell>MS</cell><cell>F</cell><cell>PR(&gt;F)</cell></row><row><cell>Columns(Systems)</cell><cell>4.0</cell><cell>0.3358</cell><cell>0.0840</cell><cell>24.6253</cell><cell>8.216775e-20</cell></row><row><cell>Rows(Topics)</cell><cell>403.0</cell><cell>67.4513</cell><cell>0.1674</cell><cell>49.0915</cell><cell>0</cell></row><row><cell>Error</cell><cell>1612.0</cell><cell>5.4960</cell><cell>0.0034</cell><cell>-</cell><cell>-</cell></row><row><cell>Total</cell><cell>2019.0</cell><cell>73.2831</cell><cell>-</cell><cell>-</cell><cell>-</cell></row><row><cell cols="6">Table 9. Two-way ANOVA results for long-term nDCG</cell></row><row><cell>Source</cell><cell>df</cell><cell>SS</cell><cell>MS</cell><cell>F</cell><cell>PR(&gt;F)</cell></row><row><cell>Columns(Systems)</cell><cell>4.0</cell><cell>1.2535</cell><cell>0.3134</cell><cell>11.8346</cell><cell>1.4110e-09</cell></row><row><cell>Rows(Topics)</cell><cell>1511.0</cell><cell>103.3381</cell><cell>0.0684</cell><cell>2.5829</cell><cell>1.8578e-142</cell></row><row><cell>Error</cell><cell>6044.0</cell><cell>160.0363</cell><cell>0.0265</cell><cell>-</cell><cell>-</cell></row><row><cell>Total</cell><cell>7559.0</cell><cell>264.6278</cell><cell>-</cell><cell>-</cell><cell>-</cell></row><row><cell cols="6">Table 10. Two-way ANOVA results for long-term MAP</cell></row><row><cell>Source</cell><cell>df</cell><cell>SS</cell><cell>MS</cell><cell>F</cell><cell>PR(&gt;F)</cell></row><row><cell>Columns(Systems)</cell><cell>4.0</cell><cell>1.4099</cell><cell>0.3525</cell><cell>19.2277</cell><cell>9.8898e-16</cell></row><row><cell>Rows(Topics)</cell><cell>1511.0</cell><cell>71.7684</cell><cell>0.0475</cell><cell>2.5909</cell><cell>1.9168e-143</cell></row><row><cell>Error</cell><cell>6044.0</cell><cell>110.7997</cell><cell>0.0183</cell><cell>-</cell><cell>-</cell></row><row><cell>Total</cell><cell>7559.0</cell><cell>183.9780</cell><cell>-</cell><cell>-</cell><cell>-</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<orgName>Qwant</orgName>
		</author>
		<ptr target="https://about.qwant.com/en/" />
		<title level="m">About Qwant</title>
				<imprint>
			<date type="published" when="2023-05-20">2023-05-20</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Galuščáková</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Deveaud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Gonzalez-Saez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mulhem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Piroi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Popel</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2303.03229</idno>
		<title level="m">LongEval-Retrieval: French-English dynamic test collection for continuous web search evaluation</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">SEUPD@CLEF: Team CLOSE on temporal persistence of IR systems&apos; performance</title>
		<author>
			<persName><forename type="first">G</forename><surname>Antolini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Boscolo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cazzaro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Martinelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Safavi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Shami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR WORKSHOP PROCEEDINGS</title>
				<imprint>
			<publisher>CEUR-WS</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">3497</biblScope>
			<biblScope unit="page" from="2368" to="2395" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">SEUPD@CLEF: Team FADERIC on a query expansion and reranking approach for the LongEval task</title>
		<author>
			<persName><forename type="first">E</forename><surname>Bolzonello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Marchiori</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Moschetta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Trevisiol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Zanini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR WORKSHOP PROCEEDINGS</title>
				<imprint>
			<publisher>CEUR-WS</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">3497</biblScope>
			<biblScope unit="page" from="2252" to="2280" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Günther</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Mohr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Abdessalem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Abel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">K</forename><surname>Akram</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Guzman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Mastrapas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sturua</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Werk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Xiao</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.19923</idno>
		<title level="m">Jina embeddings 2: 8192-token general-purpose text embeddings for long documents</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<orgName>Apache Lucene</orgName>
		</author>
		<ptr target="https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/analysis/standard/StandardTokenizer.html" />
		<title level="m">StandardTokenizer</title>
				<imprint>
			<date type="published" when="2024-05-20">2024-05-20</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<orgName>Apache Lucene</orgName>
		</author>
		<ptr target="https://lucene.apache.org/core/6_2_0/analyzers-common/org/apache/lucene/analysis/fr/FrenchLightStemmer.html" />
		<title level="m">FrenchLightStemmer</title>
				<imprint>
			<date type="published" when="2024-04-20">2024-04-20</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<orgName>Apache Lucene</orgName>
		</author>
		<ptr target="https://lucene.apache.org/core/8_11_0/luke/index.html" />
		<title level="m">Luke</title>
				<imprint>
			<date type="published" when="2024-05-20">2024-05-20</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<orgName>Apache Software Foundation</orgName>
		</author>
		<ptr target="https://lucene.apache.org/" />
		<title level="m">Apache Lucene</title>
				<imprint>
			<date type="published" when="2023-05-20">2023-05-20</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<orgName>Kaggle</orgName>
		</author>
		<ptr target="https://www.kaggle.com/datasets/heeraldedhia/stop-words-in-28-languages?select=french.txt" />
		<title level="m">Stop words in 28 languages: French stop list</title>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<orgName>Apache Lucene</orgName>
		</author>
		<ptr target="https://lucene.apache.org/core/7_3_1/analyzers-common/org/apache/lucene/analysis/util/ElisionFilter.html" />
		<title level="m">Lucene ElisionFilter</title>
				<imprint>
			<date type="published" when="2023-04-20">2023-04-20</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<orgName>Apache Lucene</orgName>
		</author>
		<ptr target="https://lucene.apache.org/core/7_0_1/core/org/apache/lucene/search/similarities/BM25Similarity.html" />
		<title level="m">Lucene BM25Similarity</title>
				<imprint>
			<date type="published" when="2024-04-20">2024-04-20</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<ptr target="https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md" />
		<title level="m">Llama 3 model card</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note>AI@Meta</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title/>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Q</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sablayrolles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mensch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bamford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Chaplot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>De Las Casas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Bressand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lengyel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lample</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Saulnier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">R</forename><surname>Lavaud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-A</forename><surname>Lachaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Stock</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">L</forename><surname>Scao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lavril</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lacroix</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">E</forename><surname>Sayed</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.06825</idno>
	</analytic>
	<monogr>
		<title level="m">Mistral 7B</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<orgName>Apache Lucene</orgName>
		</author>
		<ptr target="https://lucene.apache.org/core/7_3_1/core/org/apache/lucene/search/BoostQuery.html" />
		<title level="m">Lucene BoostQuery</title>
				<imprint>
			<date type="published" when="2024-04-20">2024-04-20</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Sentence-BERT: Sentence embeddings using Siamese BERT-networks</title>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/1908.10084" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</title>
				<meeting>the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Günther</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Mohr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Abdessalem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Abel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">K</forename><surname>Akram</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Guzman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Mastrapas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sturua</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Werk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Xiao</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.19923</idno>
		<title level="m">Jina embeddings 2: 8192-token general-purpose text embeddings for long documents</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Faysse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fernandes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">M</forename><surname>Guerreiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Loison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Alves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Corro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Boizard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Alves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">H</forename><surname>Martins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">B</forename><surname>Casademunt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Yvon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">F T</forename><surname>Martins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Viaud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hudelot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Colombo</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2402.00786</idno>
		<title level="m">CroissantLLM: A truly bilingual French-English language model</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<orgName>Apache Software Foundation</orgName>
		</author>
		<ptr target="https://solr.apache.org/guide/6_6/language-analysis.html#LanguageAnalysis-FrenchLightStemFilter" />
		<title level="m">Apache Solr FrenchLightStemFilter</title>
				<imprint>
			<date type="published" when="2024-04-20">2024-04-20</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Clive</surname></persName>
		</author>
		<ptr target="https://huggingface.co/jordiclive/flan-t5-3b-summarizer" />
		<title level="m">Multi-purpose summarizer (fine-tuned google/flan-t5-xl on several summarization datasets)</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note>Apache 2.0 and BSD-3-Clause License. Fine-tuned on various summarization datasets including xsum, wikihow, cnn_dailymail/3.0.0, samsum, scitldr/AIC, billsum, TLDR. Designed for academic and general usage with control over summary type by varying the instruction prepended to the source document</note>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Muennighoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tazi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Magne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2210.07316</idno>
		<idno type="arXiv">arXiv:2210.07316</idno>
		<ptr target="https://arxiv.org/abs/2210.07316" />
		<title level="m">MTEB: Massive text embedding benchmark</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
