Assembling four Open Web Search Components

Linda Erben, Maria Hampel, Malte-Christian Kuns, Vincent Melisch, Wilhelm Pertsch, Lina Razouk, Reiner Stolle, Robert Thomas Thoss, Tuan Giang (Technische Universität Dresden, Dresden, Germany)

Julius Gonsior, Anja Reusch (Database Research Group, Technische Universität Dresden, Dresden, Germany; supervision of projects)

International Workshop on Open Web Search

March 28 2024 Glasgow Scotland

ISSN 1613-0073

Keywords: Information Retrieval, Open Web Search, Genre Classification, Text Snippet Extraction, Query Expansion, Text Features

In this work, we present the submission of TU Dresden to WOWS 2024. Four student teams developed approaches for Genre Classification, Text Snippet Extraction, Query Expansion, and Text Features. Each implemented component integrates seamlessly into the open web search ecosystem. We present each approach alongside a short evaluation of possible use cases, and hope that our submission provides viable building blocks for future research to build upon.

Introduction

This report describes the submission of the team at TU Dresden to the Workshop on Open Web Search (WOWS) 2024 [1]. The work was conducted during a university-organized hackathon targeted at students. Details about the setup are included in Appendix A. Four teams, each consisting of two to three students, contributed four components to the open web search ecosystem. We hope that our submitted components will facilitate future research on Information Retrieval (IR).

In summary, this paper discusses the following four components: Sec. 2 reports the work of the group Genre Classification, which categorizes web pages based on the intent of the page, such as Discussion or Shopping. In Sec. 3, we detail our submission for the extraction of text snippets. Here, the goal is to divide long documents into shorter ones and return a list of the best snippets. Sec. 4 provides details on the work of the group Query Expansion, which employed Large Language Models (LLMs) to generate more related information or variants for a given query. The results for the extraction of text features are presented in Sec. 5. The goal of this component was to quantify syntactic or semantic features of natural language, such as the readability of a web page. Finally, Sec. 6 draws the conclusions of all our submissions.

Genre Classification

Methods

Rule-Based Classifier

The rule-based classifier makes use of a vocabulary list of relevant terms per genre. We first remove stop words from the document and extract its 75 most frequent terms. We then intersect these terms with each genre-specific vocabulary list; the most probable genre is the one with the largest intersection. We use Snorkel AI [4] for the implementation.

The rule-based classifier can be adapted to a precision-oriented method in which the most probable genre must exceed the second most probable genre by a threshold; otherwise, the classifier abstains.
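The vocabulary-overlap rule and its precision-oriented variant can be sketched in a few lines of plain Python. This is only an illustration and does not use Snorkel: the toy vocabularies, the stop-word list, the whitespace tokenization, and the interpretation of the threshold as an absolute difference in overlapping terms (`margin`) are all assumptions made for the example.

```python
from collections import Counter

ABSTAIN = "abstain"

# Hypothetical toy vocabularies; the real lists come with the Genre-KI-04 dataset.
VOCAB = {
    "shop": {"price", "cart", "order", "shipping"},
    "discussion": {"forum", "reply", "thread", "post"},
}
# Hypothetical, very small stop-word list.
STOP_WORDS = {"the", "a", "to", "and", "of", "in", "is", "your"}


def classify(text, vocab=VOCAB, top_n=75, margin=0):
    """Return the genre whose vocabulary overlaps most with the top_n frequent terms.

    With margin > 0, the classifier abstains unless the best genre beats the
    runner-up by at least `margin` overlapping terms (precision-oriented mode).
    Requires at least two genres in `vocab`.
    """
    # Naive tokenization (assumption): lowercase, whitespace split, letters only.
    tokens = [t for t in text.lower().split() if t.isalpha() and t not in STOP_WORDS]
    frequent = {term for term, _ in Counter(tokens).most_common(top_n)}
    overlap = {genre: len(frequent & terms) for genre, terms in vocab.items()}
    ranked = sorted(overlap.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, second_score) = ranked[0], ranked[1]
    if best_score == 0 or best_score - second_score < margin:
        return ABSTAIN
    return best
```

Raising `margin` trades coverage for precision: more documents receive the abstain label, but the remaining decisions rest on a clearer vocabulary overlap.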

Multi-Layer Perceptron Classifier

As a typical machine-learning-based method, we used a neural network for classification. As features, the web pages were converted into tf-idf vectors. We use the Python library scikit-learn [5] to implement the Multi-Layer Perceptron classifier. After an empirical hyperparameter search, we used a network with a single hidden layer of 50 neurons, the ReLU activation function, the Adam variant of stochastic gradient descent for optimization, and a constant learning rate of 0.001.
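A configuration along these lines could look as follows in scikit-learn. This is a minimal sketch rather than the submitted code: the four training pages are hypothetical stand-ins for the labelled web pages, and `max_iter` and `random_state` are assumptions that the text does not specify.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data standing in for the labelled web pages.
pages = [
    "add to cart price order shipping",
    "buy now price discount cart",
    "forum thread reply post comment",
    "reply to this thread in the forum",
]
genres = ["shop", "shop", "discussion", "discussion"]

model = make_pipeline(
    TfidfVectorizer(),                 # tf-idf features
    MLPClassifier(
        hidden_layer_sizes=(50,),      # single hidden layer of 50 neurons
        activation="relu",
        solver="adam",
        learning_rate_init=0.001,
        max_iter=500,                  # assumption: not stated in the text
        random_state=0,                # assumption: fixed only for reproducibility
    ),
)
model.fit(pages, genres)
predictions = model.predict(["cheap price and free shipping", "new reply in thread"])
```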

Experiments

Dataset

For evaluation we used the Genre-KI-04 dataset [2]. This includes vocabulary lists, and the following classification categories: articles, discussion, download, help, link lists, portrait (non private), portrait (private), and shop. Details about the genres can be found in the original paper.

Text Snippet Extraction

Since sophisticated neural ranking models such as cross-encoders generally require a lot of computational effort, a customary retrieval pipeline first retrieves a number of documents (e.g., 1000) using a fast but imprecise retrieval method and then re-ranks those documents using a more precise weighting model [6]. Cross-encoders, as introduced by Nogueira and Cho [6], are an example of the latter; they calculate scores for query-document pairs. Apart from their comparatively high computational cost, cross-encoders have another disadvantage: their limited input size. This weakness is typically mitigated by truncating the document once the maximum number of input tokens is reached. The problem with this procedure is that content that does not appear at the beginning of a document is not taken into account by the cross-encoder. As a result, the ranking of documents may be biased towards those that address the query early on.

In this part, we therefore present a simple method for extracting a number of snippets, i.e., smaller chunks of the document that fit into the cross-encoder, as an additional component in a larger retrieval pipeline. Instead of simply truncating documents after a fixed number of tokens, we search for the most relevant passages (ranked snippets) in the document. These ranked snippets are passed to the cross-encoder with the goal of a more precise ranking. We show the benefits of this method on two exemplary datasets that contain long documents.

Methodology

The re-ranking process with ranked snippets consists of five steps. An example of those steps for the re-ranking of $n = 3$ documents ($d_1$, $d_2$, and $d_3$) is shown alongside the explanation.

First, we subdivide all $n$ documents into snippets. The maximum length of those snippets may be chosen arbitrarily; we defaulted to 250 tokens, which is the passage size used by Dalton et al. [7]. The actual length of the snippets may vary since the division process aims to retain context by not splitting sentences. For example, we may start with three documents $d_1$, $d_2$, $d_3$. After the first step, each of these documents is divided into several snippets:

$$\left(s_1^1, s_1^2, \dots, s_1^{l_1}\right), \quad \left(s_2^1, s_2^2, \dots, s_2^{l_2}\right), \quad \left(s_3^1, s_3^2, \dots, s_3^{l_3}\right),$$

where $s_i^j$ denotes the $j$-th snippet of document $d_i$ for $j \in \{1, \dots, l_i\}$ and $l_i$ is the number of snippets of $d_i$.
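The first step can be illustrated with a small sentence-preserving splitter. This is a sketch under simplifying assumptions, not the adapted passage chunker described in Appendix B.1: sentences are detected with a regular expression, tokens are whitespace-separated words, and the function name `split_into_snippets` is hypothetical.

```python
import re


def split_into_snippets(text, max_tokens=250):
    """Greedily pack whole sentences into snippets of at most max_tokens tokens.

    Tokens are whitespace-separated words here (an assumption). A single
    sentence longer than max_tokens becomes its own, oversized snippet.
    """
    # Split after sentence-final punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    snippets, current, current_len = [], [], 0
    for sentence in sentences:
        n_tokens = len(sentence.split())
        # Close the current snippet if this sentence would exceed the limit.
        if current and current_len + n_tokens > max_tokens:
            snippets.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    if current:
        snippets.append(" ".join(current))
    return snippets
```

Because sentences are never cut, snippet lengths vary below the limit, which matches the varying snippet lengths mentioned above.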

In Step 2, we pre-rank all extracted snippets with respect to the query. To accomplish this, we view the set of all snippets of a document as a corpus. From this corpus, we create a ranking for the query using one of the following weighting models: term frequency (TF), BM25, or PL2. We do not use a cross-encoder for this pre-ranking because a document may yield many snippets, depending on its length, and ranking all of them with a cross-encoder can drastically slow down the re-ranking process. After this pre-ranking step, our example snippets might be ranked in the following way:

$$s_3^3 > s_2^2 > s_2^4 > s_3^2 > s_3^1 > s_1^3 > s_3^4 > s_1^1 > s_2^3 > s_1^5 > \dots$$

In Step 3, we obtain the top $k$ snippets of each document from the pre-ranking, which are later ranked using a cross-encoder. This step ensures that the cross-encoder only needs to rank $n \cdot k$ snippets for $n$ documents instead of all snippets. To reduce computational cost, we defaulted to $k = 3$. In our example, this step results in the following selection:

$$s_1^3, s_1^1, s_1^5, \quad s_2^2, s_2^4, s_2^3, \quad s_3^3, s_3^2, s_3^1.$$

Here, $s_3^4$ is not selected as one of the top snippets of $d_3$ since it is the $(k + 1)$-th snippet of $d_3$ in the ranking, despite being ranked relatively high.
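Steps 2 and 3 can be sketched as follows, using raw term frequency as the weighting model. The helper names `tf_score` and `top_k_snippets` are hypothetical, and the whitespace tokenization is an assumption; the actual pre-ranking was implemented with PyTerrier (see Appendix B.1).

```python
from collections import Counter


def tf_score(query, snippet):
    """Sum of raw query-term frequencies in the snippet (simple TF weighting)."""
    counts = Counter(snippet.lower().split())
    return sum(counts[term] for term in query.lower().split())


def top_k_snippets(query, doc_snippets, k=3):
    """Map each document id to its k best snippets according to tf_score."""
    selected = {}
    for doc_id, snippets in doc_snippets.items():
        # Each document's snippets are ranked on their own, as a small corpus.
        ranked = sorted(snippets, key=lambda s: tf_score(query, s), reverse=True)
        selected[doc_id] = ranked[:k]
    return selected
```

For n documents, the result contains at most n * k snippets, which bounds the work of the later cross-encoder step.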

In Step 4, the top $k$ snippets of all documents are ranked using a cross-encoder (CE). As in Step 2, this determines which snippets best match the query, but the ranking is now more precise since a CE is used instead of the simple weighting models of Step 2. An exemplary ranking for our snippets might be:

$$s_2^4 > s_3^3 > s_2^2 > s_1^5 > \dots$$

The final document ranking follows from this snippet ranking in Step 5: documents are ordered by their best-ranked snippet, so the document that provided the best snippet is ranked first. Our example documents are therefore ranked in the following way:

$$d_2 > d_3 > d_1.$$

It should be noted that the goal of this section is to rank documents with regard to a query, not only passages. Therefore, the result is a ranking of documents. Details on our implementation can be found in Appendix B.1.
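Steps 4 and 5 reduce to scoring the selected snippets and ordering documents by their best snippet. The sketch below keeps the scorer abstract: `score_fn` is a placeholder for a cross-encoder, and the numeric scores in the usage example are invented solely to reproduce the example ranking above.

```python
def rank_documents(query, selected_snippets, score_fn):
    """Rank documents by the best score_fn(query, snippet) among their snippets."""
    best = {
        doc_id: max(score_fn(query, snippet) for snippet in snippets)
        for doc_id, snippets in selected_snippets.items()
        if snippets  # documents without snippets are skipped
    }
    return sorted(best, key=best.get, reverse=True)


# Usage with hypothetical scores; snippet "s4_2" is the 4th snippet of d2, etc.
toy_scores = {
    "s4_2": 0.9, "s3_3": 0.8, "s2_2": 0.7, "s5_1": 0.6,
    "s3_1": 0.5, "s1_1": 0.4, "s3_2": 0.3, "s2_3": 0.2, "s1_3": 0.1,
}
selected = {
    "d1": ["s3_1", "s1_1", "s5_1"],
    "d2": ["s2_2", "s4_2", "s3_2"],
    "d3": ["s3_3", "s2_3", "s1_3"],
}
ranking = rank_documents("query", selected, lambda q, s: toy_scores[s])
```

With these invented scores, `ranking` is `["d2", "d3", "d1"]`, matching the example.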

Evaluation

In this section, we conduct tests to study the possible improvements of our cross-encoder re-ranking of top k snippets. As baselines, we use BM25 and the dense retriever MonoT5. All further ranking is performed on the top 20 documents retrieved by these two systems. We evaluate the re-ranking with our TF-ranked snippets. For this, we load the previously saved top 3 snippets for each document. To re-rank the documents, we follow the "weakest link" principle, selecting the minimum TF score among the top 3 snippets. This results in the methods BM25+TF-SP and MonoT5+TF-SP. We denote by +CE that the 3 snippets are further re-ranked by a cross-encoder. In addition, we compare the performance of these systems to the cross-encoder's performance when only evaluating the first snippet of each document (which resembles the naïve application of a cross-encoder). These results are denoted by BM25+CE and MonoT5+CE.

To measure the performance of the approaches, we use normalized discounted cumulative gain at 10 (NDCG@10) and mean reciprocal rank (MRR). We conduct our tests on the ClueWeb12 [8] and ClueWeb09 [9] datasets, which differ in document size: ClueWeb12 has an average document size of 5641.7 tokens, and ClueWeb09 has an average document size of 1132.6 tokens. The results for the two datasets are plotted in Fig. 2. Our approach of cross-encoder re-ranking with TF-pre-ranked snippets achieves the best performance in both metrics across all tested datasets (see Appendix B.2 for diagrams of other evaluated datasets). The impact of our TF-ranked snippet pre-selection is relatively high on ClueWeb12 with its long documents, while it is marginal on ClueWeb09. This highlights the importance of snippet pre-selection for longer documents.

On average, a document consists of approximately 6 snippets in ClueWeb09 and approximately 23 snippets in ClueWeb12. Our naïve snippet generation approach assumes that information is spread evenly throughout a document. Under this assumption, a cross-encoder that takes only the first snippet as input captures a larger share of the relevant information when the document size is closer to the cross-encoder input size. This also explains why MonoT5 scores better on the shorter dataset, especially in comparison to BM25, since MonoT5 also suffers from a limited input size. These results indicate a need to address the problem of limited input size, especially for long documents like those in ClueWeb12. Comparing Figs. 4b and 4c shows that information is not always spread evenly over a document, as we assumed for our snippet generation. This motivates a more advanced approach to snippet generation.
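For reference, the two metrics can be computed per query as sketched below. Two simplifications are assumptions of this sketch: the gain is linear in the relevance label, and the ideal ranking is derived only from the labels of the ranked documents rather than from all judged documents, so values can differ from those of standard evaluation tools.

```python
import math


def ndcg_at_k(relevances, k=10):
    """NDCG@k for one query; relevances are graded labels in ranked order."""
    def dcg(values):
        # Rank 1 is discounted by log2(2) = 1, rank 2 by log2(3), and so on.
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(values))
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0


def mean_reciprocal_rank(runs):
    """MRR over queries; each run is a list of relevance labels in ranked order."""
    total = 0.0
    for labels in runs:
        # Position (1-based) of the first relevant document, if any.
        rank = next((i + 1 for i, rel in enumerate(labels) if rel > 0), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(runs) if runs else 0.0
```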

[Figure 2: bar charts of Performance (axis range 0 to 0.6) for the systems BM25, MonoT5, BM25+TF-SP, MonoT5+TF-SP, BM25+TF-SP+CE, MonoT5+TF-SP+CE, BM25+CE, and MonoT5+CE. Panels: (a) ClueWeb09 (2011), (b) ClueWeb12 (2013).]

Summary

Overall, our results show that selecting the top-k pre-ranked snippets is a viable approach to tackle the problem of input size restrictions in Transformer-based retrieval systems. Cross-encoders in particular can benefit from this approach since they are inefficient on large documents. Further testing will be required to improve efficiency and reduce context loss with snippets. It would also be beneficial to test multiple pre-ranking systems and values of k for the top-k snippet selection. The code for this part can be found in the accompanying repository (footnote 2).

Query Expansion and User Query Variants using Large Language Models

Query Expansion and User Query Variants are two common methods to increase the recall of an IR system [10,11,12]. Both methods modify the query to include more related keywords, thereby causing the IR system to score relevant documents higher. In addition to conventional techniques such as Kullback-Leibler Divergence (KL) [13,14] or Relevance Model 3 (RM3) [15], recent approaches have adopted Large Language Models (LLMs). In this part, we employ various prompts to generate improved and expanded queries using LLMs [16,17], in particular GPT-3.5 (footnote 3), Llama 2 [18], and FLAN-UL2 [19].

Methodology

LLMs have previously been used for the task of query expansion, and studies have been conducted using various methods and language models [16,20,21,22]. Wang et al. [21] employ query2doc, a method where the LLM generates a document for a given query, which is then used as Pseudo-Relevance Feedback (PRF). Previous work demonstrates improvements across different datasets. In order to weigh the original query more heavily, multiple concatenations of the original query $q$ with a single instance of the LLM's output may be used [16,21]. The resulting expanded query is of the form $q' = \mathrm{concat}(\{q\} * n, \mathrm{LLM}_{\mathrm{out}})$, where $n$ is the number of times $q$ is repeated, and $\mathrm{LLM}_{\mathrm{out}}$ is the LLM-generated version of $q$. We adopt this approach in our work with $n = 5$ and employ modified versions of the prompt types suggested by Jagerman et al. [16]. While GPT-3.5 and FLAN were prompted to generate five similar queries, Llama was asked to answer the query. Apart from this difference, the prompts in all experiments are similar and comparable.

Table 1: Model parameters. For GPT-3.5, the parameter min. tokens is unavailable in the API.
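The construction of the expanded query is simple enough to state directly in code. The function name `expand_query` and the use of a single space as separator are assumptions of this sketch.

```python
def expand_query(query, llm_output, n=5):
    """Build q' = concat({q} * n, LLM_out): n copies of the query, then the LLM text."""
    return " ".join([query] * n + [llm_output])


# Usage with a hypothetical LLM response for the query "car".
expanded = expand_query("car", "buy repair vehicle", n=5)
```

Repeating the query n times raises the weight of its original terms relative to the terms contributed by the LLM when a bag-of-words model such as BM25 scores the expanded query.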

Initially, the query, along with the prompt, is fed into the LLM, and its response is concatenated with the original query ($n = 5$). For evaluation, the Recall@1000 of the original and modified queries is compared on the given dataset using BM25. The specific LLMs in use are GPT-3.5 (footnotes 4 and 5), Llama 2 [18], and FLAN-UL2 [19]. Llama 2 and FLAN-UL2 were run locally. Table 1 shows the model configurations that we used in our experiments. The temperature values were chosen empirically such that the model outputs are roughly similar. The lower and upper token limits prevent generation edge cases such as empty responses or endless output, while still allowing for expressive responses. Local models had to fit GPU memory constraints; hence, we had to employ quantized versions of the models. We conducted experiments for the prompt types presented above: CoT, Q2E/FS, and Q2E/ZS. While FLAN-UL2 and GPT-3.5 can be prompted without further changes, Llama 2 requires the chat prompt to follow a pre-defined format, our version of which can be found in the project's repository (footnote 6). We use BM25 as the retrieval system in the default configuration of the TIRA framework [23]. The query expansion baselines consist of an unmodified BM25, BM25 with Kullback-Leibler Divergence (KL), and BM25 with RM3.
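Recall@1000, the metric used for the comparison, can be computed per query as in the following sketch. The function name and the convention of returning 0.0 for queries without relevant documents are assumptions.

```python
def recall_at_k(ranked_doc_ids, relevant_doc_ids, k=1000):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0  # assumption: queries without relevant documents score 0
    return len(relevant & set(ranked_doc_ids[:k])) / len(relevant)
```

Averaging this value over all queries of a dataset, once for the original and once for the expanded queries, yields the per-dataset scores that are compared.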

Evaluation

We measure recall, aggregated over 18 datasets, per model and per prompt type. The datasets cover a range of diverse topics and were provided as part of TIRA [24] / TIREx [25]. The aggregated results are shown in Table 2. Avg easy excludes datasets that are evidently (cord19, longeval) or presumably (medline) difficult. This highlights the difficulties LLMs experience on specific datasets, especially domain-specific ones: when those are excluded, CoT with OpenAI GPT-3.5 Turbo (GPT) performs 0.03 points better than the best baseline. Note that the two Avg rows cannot be compared to one another, as the baseline scores also shift due to the exclusion of generally low-performing datasets. Detailed results for each dataset can be found in Appendix C.1.

For our query expansion approaches, it is evident that the choice of prompt has a large impact on recall. The combination of CoT and GPT consistently yields the highest recall in absolute numbers. With other prompt types such as Q2E/ZS and Q2E/FS, GPT also frequently achieves the highest recall per dataset, albeit less frequently than with CoT. In this regard, our results are consistent with those reported in [16]. Although CoT generally performs best, it yields poorer results than the baselines on datasets such as cord19 or the longeval datasets. In these cases, Q2E/ZS and Q2E/FS emerge as better choices, but are still commonly outperformed by the baseline models. Q2E/FS is less effective, presumably because the LLM mimics the relatively short example queries given in the prompt, resulting in short expansions with few new keywords. Q2E/ZS behaves similarly. Although its responses are longer than those of Q2E/FS, as the LLMs do not have to conform to short examples, they are overall less extensive than those of CoT, likely resulting in inferior effectiveness.

Considering the longeval datasets and cord19, it is evident that they contain either very general or highly specific queries. Nonspecific queries risk being muddled by the correspondingly more general and, in the case of CoT, extensive responses from the LLMs. This effect might be reversed by conveying the user intent to the LLM, indicating, for instance, whether a user issuing the query "car" intends to buy one or have one repaired. For domain-specific queries, it is plausible that the models were trained with insufficient knowledge of the subject, resulting in subpar effectiveness.

While our main evaluation is conducted using recall@1000, we also evaluated nDCG@10. The results for this metric are detailed in Table 4 in the appendix. Overall, our conclusions for nDCG are similar to those for recall. The generations for each model and each prompt are publicly available in our repository (footnote 7).

Table 3: Implemented Text Features with the respective formulas. Syllable count and word count were implemented using the tools provided by the text feature libraries Textstat and spaCy.

Feature: RIX [34]. Formula: $\frac{\text{Long Word Count}}{\text{Sentence Count}}$

Summary

In this part, we generated different query expansions using three LLMs and three prompt templates. We were able to demonstrate that LLMs are capable of improving the recall of user queries. The combination of the CoT prompt with GPT proves to be the most promising, improving recall scores by up to 15%. Future research could focus on further templates for using the generated expansions, since we only evaluated the format in which the query $q$ is repeated five times and followed by the LLM response (q q q q q, response).

Text Features

Text Features are quantified metrics describing syntactic or semantic features of natural language. An example is the readability of a text, which is useful for returning user-dependent search results. A search engine targeted at school children should return results with a high readability score, whereas a search engine with domain experts as its target audience will also include texts with low readability scores. Additionally, text features could be used to filter out noisy websites. This Open Web Search component (footnote 8) incorporates two tools for computing text features, namely Textstat [26] and textdescriptives [27], which builds on spaCy [28]. spaCy's text feature analysis is more comprehensive than that of Textstat, but less efficient. Because of spaCy's pipeline design, many annotations are computed in the background, of which only a few are required for calculating the text features. This overhead results in a longer runtime, which should be taken into account. Table 3 displays the implemented text features.

Beyond integrating the text feature components, we examine a potential correlation between text features and the documents that ranked retrieval models evaluate as relevant. For easier exploration of a document corpus, we provide an interactive Jupyter Notebook that shows correlation graphs between ranked retrieval and text features and is applicable to arbitrary datasets, as well as an analysis of these correlations.
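As an illustration of such a feature, the RIX index divides the number of long words by the number of sentences. The sketch below is a plain-Python approximation and not the Textstat or spaCy implementation: the regular expressions for sentences and words (ASCII letters only) and the threshold of seven or more characters for a long word are assumptions of this example.

```python
import re


def rix(text, long_word_length=7):
    """RIX = number of long words / number of sentences."""
    # Sentences: non-empty chunks between sentence-final punctuation marks.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words: runs of ASCII letters (assumption; ignores digits and umlauts).
    words = re.findall(r"[A-Za-z]+", text)
    long_words = [w for w in words if len(w) >= long_word_length]
    return len(long_words) / len(sentences) if sentences else 0.0
```

A higher RIX value indicates more long words per sentence and therefore a text that is harder to read.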

Evaluation

To verify the capability to differentiate between levels of readability, we used unit tests. The test data consists of multi-sentence snippets from web pages. Depending on the demographic the source was directed at, these were assigned to the following difficulty categories (number of test documents in parentheses): children (3), teenagers (3), academic (3), and simple language (2). In initial tests, a project member assessed the reading level of the excerpts and compared these assessments to the classifications provided by the automated measures. This supports the correct usage of the text feature libraries, at least for the readability scores.

Compared to human assessments, the automated text measures often overestimated the reading level, possibly because their indexes fail to capture the complexities of human reading abilities. Large-scale dataset computations further highlighted the discrepancies between predicted and human-classified reading levels, corroborating these findings. Despite the observed differences in assessment, the data suggested an inversely proportional relationship between comprehension levels and readability measures.

Experiment Design

The experiments were run on the "antique/test" [35] dataset from the ir_datasets collection [36]. Based on TIRA [37] ranked retrieval models were used to create top-10 results.

Correlations between ranked retrieval and text feature readability

A primary objective of our project was to investigate whether the ranking of relevant documents by ranked retrieval models correlates to document Readability.

Readability of Top 10

First, we looked at the top 10 retrieved documents for all queries across multiple retrieval models; the resulting distributions are displayed in Figure 3. The majority of results, assessed using the Flesch Reading Ease, indicate comprehension levels at or below an eighth-grade level, implying a high degree of readability. This high degree of readability was observed consistently across multiple retrieval models. Comparing the retrieved documents with all documents in the collection, we found that some retrieval models, such as SBERT or MonoT5, indeed return documents with a higher readability than the rest of the corpus, suggesting a potential relationship between relevance and readability, whereas other retrieval models such as BM25 do not share this characteristic.
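The comparison between retrieved documents and the whole collection can be sketched as a difference of means. The function name `readability_gap` and the data layout (a run mapping query ids to ranked document ids, and a dictionary of precomputed readability scores) are assumptions; the analysis in the accompanying notebook may be organized differently.

```python
from statistics import mean


def readability_gap(run, scores, k=10):
    """Mean readability of the top-k retrieved documents minus the corpus mean.

    run maps each query id to its ranked list of document ids; scores maps each
    document id to a precomputed readability score (e.g. Flesch Reading Ease).
    """
    retrieved = [scores[doc_id] for ranking in run.values() for doc_id in ranking[:k]]
    return mean(retrieved) - mean(list(scores.values()))
```

A positive gap means that the retrieval model returns documents that are, on average, more readable than the collection as a whole.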

A. Hackathon

The paper's work was carried out by students from TU Dresden as part of a one-week hackathon.

The hackathon was open to students in the Computer Science program and related fields, and they could earn ECTS credit points for lab work. The hackathon was advertised on the mailing lists of the introductory Information Retrieval courses from the past three years. Interested students could fill in a survey indicating their preferred timeframe for the hackathon.

After a date was decided, 10 students signed up for the hackathon, three from the Bachelor's program and seven from the Master's program. The university supervisors prepared four topics, which were advertised beforehand, and the students signed up for their preferred topics. The text features topic was assigned to the three Bachelor's students. The Master's students were provided with a peer-reviewed research paper as additional material, which they were required to read and understand before the hackathon.

On the first day of the hackathon, an invited member of the Open Web Search project provided a brief introduction to the Open Web Search ecosystem and TIRA/TIREx. Following this, the teams worked on their components, with supervisors providing guidance through daily check-ins. On the fifth and final day of the hackathon, each team gave a short presentation. Following the hackathon, the students were asked to prepare a report on their work, which served as the basis for this paper.

In retrospect, the short time frame of one week motivated the students to work diligently on their projects. However, at the end of the week, the students had several open ideas for future work that they could not finish in time. Therefore, more time, even just a few more days, might be beneficial for the next iteration of the hackathon. The groups consisted of two to three members. The small group size facilitated the organization within each group and kept the management overhead small. The topics of the hackathon were aligned with the basics taught in the Information Retrieval course, but also required reading additional literature and research.

B. Text Snippet Extraction

B.1. Implementation

To implement the described re-ranking steps, we used several Python libraries, detailed below to facilitate reproducibility. For snippet extraction in Step 1, we adapted the SpacyPassageChunker class from the corpus_processing package, as provided by Dalton et al. [7], to allow for variable snippet sizes. The class requires spaCy [38]; we used version 3.3.0 for our implementation. The snippet pre-ranking in Step 2 was implemented using PyTerrier [39], version 0.10.0. For Step 4, we used the ms-marco-MiniLM-L-6-v2 model, which has been published on HuggingFace.co [40]. To embed the model into our project, we used the transformers library [40], version 4.38.2, and the PyTorch library [41], version 2.2.0. The results of the preparation steps are accessible via TIRA [24] / TIREx [25].

B.2. Results on other evaluated datasets

Figure 2: Experimental results on different datasets. Blue bars denote NDCG@10, while red bars indicate MRR.

Figure 4: Experimental results on other datasets.

Jagerman et al. [16] follow a similar approach but extend the experiments to include alternative LLMs and additional prompt types.

Table 1

Model          Temperature  min. Tokens  max. Tokens  Quantization  Parameters
Llama 2        1.1          10           200          4 bit         7B
FLAN-UL2       0.5          10           200          8 bit         20B
GPT-3.5 Turbo  0.5          -            200          -             175B

The prompt types are: Chain of Thoughts (CoT), where the model is prompted to document its thought process; Query to Expansion with Zero-Shot prompting (Q2E/ZS), where the model should reformulate the query directly; and Query to Expansion with Few-Shot prompting (Q2E/FS), where three examples of the desired query format are provided to the model. For the exact prompt format used, see Appendix C.3. It should be noted that the prompt for Q2E/ZS differs between the models.

Table 2: Recall@1000 evaluation results. The best value across different configurations is bolded. Grey values failed to outperform the best baseline effectiveness. Avg denotes the arithmetic mean score over all 18 datasets. Avg easy excludes the cord19, longeval and medline datasets.

          Baseline            CoT                  Q2E/FS               Q2E/ZS
          BM25  KL    RM3     FLAN  Llama  GPT     FLAN  Llama  GPT     FLAN  Llama  GPT
Avg       0.66  0.67  0.69    0.68  0.67   0.69    0.66  0.67   0.67    0.67  0.67   0.68
Avg easy  0.72  0.73  0.75    0.76  0.76   0.78    0.73  0.74   0.73    0.74  0.76   0.76

Footnote 1: https://github.com/tira-io/workshop-on-open-web-search-tu-dresden-01
Footnote 2: https://github.com/tira-io/workshop-on-open-web-search-tu-dresden-02
Footnote 3: https://platform.openai.com/docs/models/gpt-3-5-turbo
Footnote 4: https://platform.openai.com/docs/models/gpt-3-5-turbo
Footnote 5: https://platform.openai.com/docs/api-reference/
Footnote 6: https://github.com/tira-io/workshop-on-open-web-search-tu-dresden-03
Footnote 7: https://github.com/tira-io/workshop-on-open-web-search-tu-dresden-03/tree/main/src/generated
Footnote 8: https://github.com/tira-io/workshop-on-open-web-search-tu-dresden-04

Acknowledgments

We would like to express our gratitude to the Open Search Foundation for organizing the WOWS 2024 and especially Maik Fröbe, who supported us and our student teams in organizing and conducting our Hackathon which made this submission possible.

In addition, the authors gratefully acknowledge the computing time made available to them on the high-performance computer at the NHR Center of TU Dresden. This center is jointly supported by the Federal Ministry of Education and Research and the state governments participating in the NHR (www.nhr-verein.de/unsere-partner).

C. Query Expansion

Table 7

Prompt formats used for Meta Llama 2 7B Chat (Llama). Note the necessity for a system prompt and the additional formatting sequences due to the instruction fine-tuning of Llama-Chat. Prompts were modified to fit Llama's behaviour.

1st International Workshop on Open Web Search (WOWS) SFarzana MFröbe MGranitzer GHendriksen DHiemstra MPotthast SZerhoudi Advances in Information Retrieval. 46th European Conference on IR Research (ECIR 2024) Lecture Notes in Computer Science Springer 2024 Genre classification of web pages SMeyer Zu Eissen BStein Advances in Artificial Intelligence SBiundo TFrühwirth GPalm

Berlin Heidelberg; Berlin, Heidelberg

Springer 2004. 2004 A taxonomy of web search AZBroder 10.1145/792550.792552 doi:10.1145/792550.792552 SIGIR Forum 36 2002 Snorkel: Rapid training data creation with weak supervision ARatner SHBach HEhrenberg JFries SWu CRé Proceedings of the VLDB endowment. International conference on very large data bases the VLDB endowment. International conference on very large data bases NIH Public Access 2017 11 269 Scikit-learn: Machine learning in Python FPedregosa GVaroquaux AGramfort VMichel BThirion OGrisel MBlondel PPrettenhofer RWeiss VDubourg JVanderplas APassos DCournapeau MBrucher MPerrot EDuchesnay Journal of Machine Learning Research 12 2011 Passage Re-ranking with BERT RNogueira KCho 10.48550/arXiv.1901.04085 2020 JDalton CXiong JCallan 10.48550/arXiv.2003.13624 TREC CAsT 2019: The Conversational Assistance Track Overview 2020 The lemur project and its clueweb12 dataset JCallan Invited talk at the SIGIR 2012 Workshop on Open-Source Information Retrieval 2012 JCallan MHoy CYoo LZhao Clueweb09 data set 2009 Relevance feedback in information retrieval JJ RJr The SMART retrieval system: experiments in automatic document processing 1971 To see, or not to see-is that the query? 
RRKor Age Proceedings of the 14th annual international ACM SIGIR conference on Research and development in information retrieval the 14th annual international ACM SIGIR conference on Research and development in information retrieval 1991 Query improvement in information retrieval using genetic algorithms-a report on the experiments of the trec project JYang RRKor Age ERasmussen Proceedings of the Text REtrieval Conference (TREC-1) the Text REtrieval Conference (TREC-1) 1993 On Information and Sufficiency SKullback RALeibler 10.1214/aoms/1177729694 The Annals of Mathematical Statistics 22 1951 Kullback-leibler divergence revisited FRaiber OKurland 10.1145/3121050.3121062 doi:10.1145/3121050.3121062 Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR '17 the ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR '17

New York, NY, USA

Association for Computing Machinery 2017 Relevance-based language models VLavrenko WBCroft ACM SIGIR Forum

New York, NY, USA

ACM 2017 51

Query expansion by prompting large language models RJagerman HZhuang ZQin XWang MBendersky arXiv:2305.03653 2023

Can generative llms create query variants for test collections? an exploratory study MAlaofi LGallagher MSanderson FScholer PThomas doi:10.1145/3539618.3591960 Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '23

New York, NY, USA

Association for Computing Machinery 2023

Llama 2: Open foundation and fine-tuned chat models HTouvron LMartin KStone PAlbert AAlmahairi YBabaei NBashlykov SBatra PBhargava SBhosale DBikel LBlecher CCFerrer MChen GCucurull DEsiobu JFernandes JFu WFu BFuller CGao VGoswami NGoyal AHartshorn SHosseini RHou HInan MKardas VKerkez MKhabsa IKloumann AKorenev PSKoura M.-ALachaux TLavril JLee DLiskovich YLu YMao XMartinet TMihaylov PMishra IMolybog YNie APoulton JReizenstein RRungta KSaladi ASchelten RSilva EMSmith RSubramanian XETan BTang RTaylor AWilliams JXKuan PXu ZYan IZarov YZhang AFan MKambadur SNarang ARodriguez RStojnic SEdunov TScialom arXiv:2307.09288 2023

Ul2: Unifying language learning paradigms YTay MDehghani VQTran XGarcia JWei XWang HWChung DBahri TSchuster SZheng The Eleventh International Conference on Learning Representations 2022

Neural text generation for query expansion in information retrieval VClaveau doi:10.1145/3486622.3493957 WI-IAT 2021 - 20th IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, Proceedings of the WI-IAT Conference

Melbourne, Australia

IEEE 2021

Query2doc: Query expansion with large language models LWang NYang FWei Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing 2023

Rocketqav2: A joint training method for dense passage retrieval and passage re-ranking RRen YQu JLiu WXZhao QShe HWu HWang J.-RWen Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing 2021

The information retrieval experiment platform MFröbe JHReimer SMacavaney NDeckers SReich JBevendorff BStein MHagen MPotthast doi:10.1145/3539618.3591888 Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '23 ACM 2023

Continuous Integration for Reproducible Shared Tasks with TIRA MFröbe MWiegmann NKolyada BGrahm TElstner FLoebe MHagen BStein MPotthast doi:10.1007/978-3-031-28241-6_20 Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023) Lecture Notes in Computer Science JKamps LGoeuriot FCrestani MMaistro HJoho BDavis CGurrin UKruschwitz ACaputo

Berlin Heidelberg New York

Springer 2023

The Information Retrieval Experiment Platform MFröbe JHReimer SMacavaney NDeckers SReich JBevendorff BStein MHagen MPotthast doi:10.1145/3539618.3591888 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023) H.-HChen W.-JEDuh H.-HHuang MPKato JMothe BPoblete ACM 2023

textstat SBansal 2016

TextDescriptives: A Python package for calculating a large variety of metrics from text LHansen LROlsen KEnevoldsen doi:10.21105/joss.05153 Journal of Open Source Software 8 5153 2023

spaCy: Industrial-strength Natural Language Processing in Python MHonnibal IMontani SVan Landeghem ABoyd doi:10.5281/zenodo.1212303 2020

Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel JPKincaid RPFishburneJr RLRogers BSChissom 1975

In defense of the fog index JBogert The Bulletin of the Association for Business Communication 48 1985

Smog grading-a new readability formula GHMcLaughlin Journal of reading 12 1969

Automated readability index RSenter EASmith 1967 DTIC document

A computer readability formula designed for machine scoring MColeman TLLiau Journal of Applied Psychology 60 283 1975

Lix and rix: Variations on a little-known readability index JAnderson Journal of Reading 26 1983

Antique: A non-factoid question answering benchmark HHashemi MAliannejadi HZamani BCroft ECIR 2020

Simplified data wrangling with ir_datasets SMacavaney AYates SFeldman DDowney ACohan NGoharian doi:10.1145/3404835.3463254 SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event FDiaz CShah TSuel PCastells RJones TSakai

Canada

ACM July 11-15, 2021

Continuous Integration for Reproducible Shared Tasks with TIRA MFröbe MWiegmann NKolyada BGrahm TElstner FLoebe MHagen BStein MPotthast doi:10.1007/978-3-031-28241-6_20 Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023) Lecture Notes in Computer Science JKamps LGoeuriot FCrestani MMaistro HJoho BDavis CGurrin UKruschwitz ACaputo

Berlin Heidelberg New York

Springer 2023

spaCy: Industrial-strength Natural Language Processing in Python IMontani MHonnibal ABoyd SVLandeghem HPeters doi:10.5281/zenodo.10009823 2023

Declarative experimentation in information retrieval using pyterrier CMacdonald NTonellotto Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval 2020

Transformers: State-of-the-Art Natural Language Processing TWolf LDebut VSanh JChaumond CDelangue AMoi PCistac TRault RLouf MFuntowicz JDavison SShleifer PVon Platen CMa YJernite JPlu CXu TLeScao SGugger MDrame QLhoest ARush doi:10.18653/v1/2020.emnlp-demos.6 Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations QLiu DSchlangen Association for Computational Linguistics 2020

PyTorch: An Imperative Style, High-Performance Deep Learning Library APaszke SGross FMassa ALerer JBradbury GChanan TKilleen ZLin NGimelshein LAntiga ADesmaison AKopf EYang ZDevito MRaison ATejani SChilamkurthy BSteiner LFang JBai SChintala Advances in Neural Information Processing Systems 32 HWallach HLarochelle ABeygelzimer FBuc EFox RGarnett Curran Associates, Inc 2019