<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the CLEF 2024 SimpleText Task 1: Retrieve Passages to Include in a Simplified Summary</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Éric SanJuan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stéphane Huet</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaap Kamps</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liana Ermakova</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Avignon Université</institution>
          ,
          <addr-line>LIA</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Université de Bretagne Occidentale</institution>
          ,
          <addr-line>HCTI</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Amsterdam</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper presents an overview of the CLEF 2024 SimpleText Task 1 on Content Selection, asking systems to retrieve scientific abstracts in response to a query prompted by a popular science article. Overall, the SimpleText track provides an evaluation platform for the automatic simplification of scientific texts. We discuss the details of the task set-up: first, the SimpleText Corpus with over 4 million academic papers and abstracts; second, the Topics based on 40 popular science articles in the news and the 114 Queries prompted by them; third, the Formats of requests and results, together with the Evaluation labels and Evaluation measures used; and fourth, the Results of the runs submitted by our participants.</p>
      </abstract>
      <kwd-group>
<kwd>information retrieval</kwd>
        <kwd>scientific documents</kwd>
        <kwd>text simplification</kwd>
        <kwd>scientific information retrieval</kwd>
        <kwd>non-expert queries</kwd>
        <kwd>press outlets</kwd>
<kwd>relevance judgments (qrels)</kwd>
        <kwd>popularized science</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Task 1: Retrieve Passages to Include in a Simplified Summary</title>
<p>This section details Task 1: Content Selection, i.e., retrieving passages to include in a simplified summary.</p>
      <sec id="sec-2-1">
<title>2.1–2.2. Description and Data</title>
<p>Given a popular science article targeted at a general audience, this task aims at retrieving, from a large corpus of academic abstracts and bibliographic metadata, passages that can help readers understand the article. Relevant passages should relate to any of the topics in the source article. We use popular science articles as a source for the types of topics the general public is interested in and as a validation of the reading level that is suitable for them. The main corpus is a large set of scientific abstracts plus associated metadata covering the fields of computer science and engineering. We reuse the collection of academic abstracts from the Citation Network Dataset (12th version, released in 2020)1 [5]. This collection was extracted from DBLP, ACM, MAG (Microsoft Academic Graph), and other sources. It includes 4,232,520 abstracts in English, published before 2020.</p>
<p>Search requests are based on popular press articles targeted at a general audience, drawn from The Guardian and Tech Xplore. Each of these popular science articles represents a general topic that has to be analyzed to retrieve relevant scientific information from the corpus.</p>
<p>We provide the URL to the original article, the title, and the textual content of each popular science article as a general topic. Each general topic was also enriched with one or more specific keyword queries manually extracted from its content, creating a familiar information retrieval task of ranking passages or abstracts in response to a query. Available training data from 2023 includes 29 (train) and 34 (test) queries, with the latter set having an extensive recall base due to the large number of submissions in 2023 [6].</p>
<p>In 2024, we added between 2 and 5 new queries (with IDs of the form G*.C*) for each of the 20 articles from the Guardian. These topics were generated by ChatGPT 4, with a prompt asking it to list the main subtopics related to computer science; they were manually inspected to check that they are linked to the original article and are not redundant. They are longer, containing around ten words and focusing on a specific point related to the article. An example of a keyword query is “system on chip” (T06.1) and an example of a long query is “How AI systems, especially virtual assistants, can perpetuate gender stereotypes?” (G01.C1).</p>
<p>The C1 queries were generated based on the following prompt: “In the attached article from the Guardian, list the main sub topics related to computer science and for each topic find at least five related references to scientific publications before 2019 that would have been relevant to be cited in this article. Just provide the references, don’t try to get the full text.” We then used the first subtopic as the query. We also considered using the ChatGPT results as a complete run, but few references were returned, many were not indexed in computer science, and some did not even exist. This emphasizes the real difficulty of the task of retrieving references to be included in a popular science article.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.3. Baselines</title>
<p>An ElasticSearch index is provided to participants with access through an API. A JSON dump of the index is also available to participants. This index can be queried online, e.g.
https://clef.termwatch.eu/dblp1/_search?q=biases&amp;size=1000 for the query “biases.”</p>
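For illustration, such a request can be built as in the following minimal Python sketch; only the endpoint and the `q`/`size` parameters come from the example above (they are the standard Elasticsearch URI-search parameters), and the helper name is ours:

```python
from urllib.parse import urlencode

# Minimal sketch (not an official client) of querying the provided
# Elasticsearch index over HTTP; "q" and "size" are standard
# Elasticsearch URI-search parameters.
BASE = "https://clef.termwatch.eu/dblp1/_search"

def build_search_url(query: str, size: int = 1000) -> str:
    """Build a URI-search request against the DBLP abstract index."""
    return f"{BASE}?{urlencode({'q': query, 'size': size})}"

url = build_search_url("biases")
# requests.get(url).json() would then return the matching abstracts.
```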
        <p>We additionally provided two supplementary baselines leveraging bag-of-words models and sparse
vector document representations. The first baseline (denoted by “meili” in the results tables) was
generated using the Meilisearch system2 and relies on a bucket sort approach. The second baseline
(denoted by “boolean”) was constructed using a simple boolean model powered by PostgreSQL GIN
text indexing.</p>
<p>For each topic, the organizers manually checked each proposed keyword query against the Elasticsearch-powered baseline run, ensuring that it retrieved at least five relevant documents. As a consequence, the boolean system, which retrieves all abstracts containing every query keyword, is expected to artificially achieve high recall levels at a depth of 5. However, this approach suffers from two limitations: it misses relevant abstracts that do not contain all keywords, and it retrieves irrelevant abstracts that happen to contain all query keywords.</p>
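Both limitations follow directly from the conjunctive matching rule, which can be sketched as follows (the toy collection and function name are ours, for illustration only):

```python
# Minimal sketch of the conjunctive boolean baseline described above:
# an abstract matches only if it contains ALL query keywords.
def boolean_retrieve(query, abstracts):
    terms = set(query.lower().split())
    return [doc_id for doc_id, text in abstracts.items()
            if terms <= set(text.lower().split())]

docs = {
    "d1": "gender bias in virtual assistants",
    "d2": "bias mitigation in ranking models",
    "d3": "virtual reality headsets",
}
print(boolean_retrieve("virtual bias", docs))  # ['d1']
```

Note that "d2" is missed for the query "virtual bias" even though it is about bias, illustrating the first limitation.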
        <p>In the case of the long C1 queries, we manually extracted the largest subset of terms that retrieved
at least five relevant documents. For these queries, the boolean approach is essentially a manual run,
which is indicated by an asterisk (*) in the results tables.</p>
<p>Despite their effectiveness, neural models are computationally expensive, requiring significant training data and processing power. Consequently, most participants rely on a hybrid document retrieval approach with a two-stage process:
1. Initial retrieval: this phase employs a more traditional and less resource-intensive method, such as tf-idf vectorization, to identify a set of potentially relevant documents.
2. Re-ranking: the documents retrieved in the first stage are then re-ranked using the more nuanced dense representations provided by neural models. This step refines the initial retrieval results based on the semantic understanding of the neural models.</p>
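The two stages above can be sketched as follows; this is a toy illustration in which a simple lexical-overlap function stands in for a neural cross-encoder, and all data and names are our own assumptions:

```python
import math
from collections import Counter

# Sketch of the two-stage pipeline: a cheap tf-idf first stage, then a
# re-ranking of the top-k candidates. "dense_score" stands in for a
# neural cross-encoder; everything here is toy data.
def tfidf_score(query, text, docs):
    N = len(docs)
    df = Counter(t for d in docs.values() for t in set(d.split()))
    tf = Counter(text.split())
    return sum(tf[t] * math.log(N / df[t]) for t in query.split() if t in tf)

def retrieve_then_rerank(query, docs, dense_score, k=2):
    # Stage 1: keep the k best documents by tf-idf.
    candidates = sorted(docs, key=lambda d: tfidf_score(query, docs[d], docs),
                        reverse=True)[:k]
    # Stage 2: reorder those candidates with the (stub) dense scorer.
    return sorted(candidates, key=lambda d: dense_score(query, docs[d]),
                  reverse=True)

docs = {"d1": "ai bias study", "d2": "chip design", "d3": "ai chip bias survey"}
overlap = lambda q, t: len(set(q.split()) & set(t.split())) / len(set(t.split()))
print(retrieve_then_rerank("ai bias", docs, overlap))  # ['d1', 'd3']
```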
<p>In previous editions, participants relied on the provided ElasticSearch baseline for the initial retrieval phase. To enhance run diversity and address resource limitations, the organizers this year provided access to two vector databases containing pre-computed paragraph embeddings (for titles and abstracts). These vector databases make it possible to compare the efficiency of scientific document retrieval techniques using asymmetric sparse document retrieval (based on tf-idf) and symmetric dense passage retrieval (based on pre-computed embeddings).</p>
<p>The two embedding vectors were computed with the MS MARCO-trained MiniLM sentence encoder (all-MiniLM-L6-v2)3. These embeddings, along with a search API based on them, have been released to participants. Documents are ranked based on the dot product between the query and the abstract (vir_abstract) or the title (vir_title), using the pg_vector4 PostgreSQL extension and an ivfflat dense vector index (k-means vector clustering with √|D| centroids, where |D| is the number of indexed vectors).</p>
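The ranking principle is just a dot product between a query embedding and the stored document embeddings. A toy sketch (real vectors would come from all-MiniLM-L6-v2; the three-dimensional vectors below are made up for illustration):

```python
# Toy sketch of ranking by dot product between a query embedding and
# pre-computed document embeddings, mirroring the pg_vector set-up.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rank_by_dot(query_vec, doc_vecs, k=2):
    return sorted(doc_vecs, key=lambda d: dot(query_vec, doc_vecs[d]),
                  reverse=True)[:k]

doc_vecs = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.2, 0.8, 0.1],
    "d3": [0.4, 0.4, 0.4],
}
print(rank_by_dot([1.0, 0.0, 0.0], doc_vecs))  # ['d1', 'd3']
```

An ivfflat index approximates this exhaustive scan by probing only the k-means clusters closest to the query.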
<p>These dense vector and boolean baselines can be accessed online through a CGI API5 with three parameters:
• corpus: title, abstract, or bool
• phrase: text passage used as the query
• length: number of results to be retrieved
2https://www.meilisearch.com/
3https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
4https://github.com/pgvector/pgvector
5https://clef.termwatch.eu/stvir_test</p>
<p>For a non-boolean query, this API generates the vector embedding of the query on the fly before retrieving results using SQL syntax. For example, for the query “Exploring the use of AI to improve success rates and speed in the pharmaceutical research field”, the top 100 documents whose abstracts are most similar to the query (based on dot product) can be retrieved in JSON format using the following syntax:
https://clef.termwatch.eu/stvir_test?corpus=abstract&amp;phrase=Exploring the use of AI to improve success rates and speed in the pharmaceutical research field&amp;length=100</p>
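Such a request can be assembled programmatically; the sketch below uses only the three documented parameters (the helper name is ours, and `urlencode` percent-encodes the spaces that the example above leaves literal):

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Sketch of building a request to the CGI baseline API with its three
# documented parameters: corpus, phrase, length.
BASE = "https://clef.termwatch.eu/stvir_test"

def build_vector_query(corpus, phrase, length):
    return f"{BASE}?{urlencode({'corpus': corpus, 'phrase': phrase, 'length': length})}"

url = build_vector_query(
    "abstract",
    "Exploring the use of AI to improve success rates and speed "
    "in the pharmaceutical research field",
    100,
)
```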
        <p>In addition to the dot product similarity measure, we also experimented with cosine distance. However,
this alternative approach yielded comparable results.</p>
<p>The boolean and dense vector baselines are provided as a PostgreSQL database containing four tables:
1. Complete documents (JSON): full documents in JSON format, enabling access to all content.
2. Textual content (boolean search): titles and abstracts of documents, facilitating efficient boolean search operations.
3. Title embeddings: pre-computed dense vector representations (embeddings) of the document titles.
4. Truncated abstract embeddings: pre-computed dense vector representations (embeddings) of the first 110 tokens of each document’s abstract.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.4. Formats</title>
<p>
          Ad-hoc passage retrieval Participants should retrieve, for each topic and each query, DBLP abstracts related to the query and relevant to be inserted as a citation in the paper associated with the topic. We encourage participants to take into account passage complexity as well as its credibility/influentialness.
Open passage retrieval (optional) Participants are encouraged to extract supplementary relevant queries from the titles or content of the articles and to provide results based on these supplementary queries.
Output format Results should be provided in a TREC-style JSON format with the following fields:
1. run_id: Run ID starting with &lt;team_id&gt;_&lt;task_id&gt;_&lt;method_used&gt;, e.g. UBO_Task1_TFIDF
2. manual: Whether the run is manual {0,1}
3. topic_id: Topic ID
4. query_id: Query ID used to retrieve the document (if one of the queries provided for the topic was used; 0 otherwise)
5. doc_id: ID of the retrieved document (to be extracted from the JSON output)
6. rel_score: Relevance score of the passage (on a [0–1] scale)
7. comb_score: General score that may combine relevance with other aspects such as readability or citation measures (on a [0–1] scale)
8. passage: Text of the selected passage
        </p>
<p>For each query, at most 100 distinct DBLP references (doc_id field) may be returned, and the total length of the passages should not exceed 1,000 tokens. The idea of taking complexity into account is to favor passages that are easier to understand for non-experts, while the credibility score aims at guiding them on the expertise of the authors and the value of the publication with respect to the article topic. For example, complexity can be evaluated using readability scores and credibility using bibliometrics.</p>
        <p>Here is an output format example:</p>
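A sketch of a single record with the eight fields above; all values are invented for illustration, except the run_id pattern and the query ID G01.C1, which appear earlier in the text:

```python
import json

# Hedged illustration of one result record in the requested TREC-style
# JSON format; field values are invented.
record = {
    "run_id": "UBO_Task1_TFIDF",  # <team_id>_<task_id>_<method_used>
    "manual": 0,                   # 0 = automatic run
    "topic_id": "G01",
    "query_id": "G01.C1",
    "doc_id": "1234567",
    "rel_score": 0.87,             # topical relevance, [0-1] scale
    "comb_score": 0.65,            # relevance combined with e.g. readability
    "passage": "Text of the selected passage",
}
print(json.dumps(record, indent=2))
```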
      </sec>
      <sec id="sec-2-4">
        <title>2.5. Evaluation</title>
<p>To assess topical relevance, we assigned a 0–2 score to each retrieved document based on its content alignment with the original article. To expand the training data for relevance judgments (qrels), we pooled all documents retrieved at depth 10 from all submitted systems. This approach significantly increased the size of the qrels by 9,990 documents, with a particular focus on the newly introduced long queries for the Guardian corpus and on the T06–T11 queries that previously lacked relevance assessments.</p>
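Depth-k pooling simply takes the union of the top-k documents over all submitted runs and sends it for assessment; a minimal sketch with toy run data:

```python
# Sketch of depth-k pooling as used to build the qrels: the union of
# the top-k documents over all submitted runs.
def pool(runs, depth=10):
    return {doc for ranking in runs.values() for doc in ranking[:depth]}

runs = {
    "runA": ["d1", "d2", "d3"],
    "runB": ["d2", "d4", "d5"],
}
print(sorted(pool(runs, depth=2)))  # ['d1', 'd2', 'd4']
```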
        <p>Table 2 summarizes the test collection constructed for the CLEF 2024 SimpleText Task 1, in relation
to the earlier years (note that earlier topics have been reused in the “train” data).</p>
<p>While generating the long C1 queries using state-of-the-art LLMs, we were surprised by the inability of these models, specifically ChatGPT 4, to find relevant references in the computer science domain suitable for inclusion in large-audience tech articles. This raises questions about the inherent difficulty of the task and the potential necessity of combining multiple retrieval systems to improve recall. This need was addressed by both participants and organizers this year.</p>
<p>Many participants employed multiple LLMs, not for initial retrieval, but as rerankers within their systems. Additionally, several participants utilized different implementations of BM25 than the one provided by the organizers for the retrieval stage. These novel end-to-end retrieval approaches, coupled with the 4 new baselines provided, resulted in an unexpectedly high number of unassessed documents among the top ten retrieved documents per run. This phenomenon included queries from previous editions. For instance, among queries G01–G10, there were 3,843 new documents not returned in the top ten of previous editions. Notably, 954 of these documents appeared relevant to at least one existing topic, and 576 were relevant to one of the newly introduced long C1 queries. This confirms the task’s inherent difficulty but also demonstrates the potential to achieve high recall levels at depth 10.</p>
        <p>In addition to topical relevance, we took into account other key aspects of the track, such as the
text complexity and the credibility of the retrieved results. These evaluations were performed using
automatic metrics.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Scientific Passage Retrieval Approaches</title>
<p>In this section, we discuss the range of scientific text retrieval approaches applied by the participants of the track. A total of 11 teams submitted 42 runs.</p>
      <p>AB/DPV Varadi and Bartulović [7] submitted 1 run for Task 1. They used our ElasticSearch API and
took into account an FKGL readability score for their combined score.</p>
<p>Sharingans Ali et al. [8] also submitted 1 run. They experimented with the ColBERT neural ranker and used GPT 3.5 to select the most informative and concise passages for inclusion in the summary.</p>
      <p>Tomislav/Rowan Mann and Mikulandric [9] submitted a total of 2 runs. They took the top 100 results retrieved by ElasticSearch. Then, they used cosine similarity on TF-IDF vectors as the relevance score and the FKGL score as the combined score.</p>
      <p>Petra/Regina Elagina and Vučić [10] submitted 1 run, for the first 3 queries only, with the same
approach as the previous system.</p>
<p>AIIRLab Largey et al. [11] submitted a total of 5 runs and proposed several models. First, since input queries are short keyword terms, they used query expansion with LLaMA 3 and reranked the top 5,000 results retrieved by TF-IDF with a bi-encoder or a cross-encoder. Second, they applied LLaMA 3 as a pairwise re-ranker. Third, they leveraged ElasticSearch with fine-tuned cross-encoders.</p>
      <p>UBO Vendeville et al. [12] submitted a total of 1 run. They used PyTerrier6 to retrieve documents from TF-IDF scores. Then, the MonoT5 reranker provided by PyTerrier was employed to reorder all extracted documents.</p>
      <p>UAmsterdam Bakker et al. [13] submitted a total of 6 runs for Task 1. First, they focused on regular information retrieval effectiveness with 2 vanilla baseline runs on an Anserini index, using either BM25 or BM25+RM3, and 2 other runs generated with neural cross-encoder rerankings of these runs by an MS MARCO-trained ranker. Second, 2 further runs filter out the most complex abstracts per request, using the median FKGL readability measure.</p>
      <p>Elsevier Capari et al. [14] submitted a total of 10 runs. Their approaches mainly centered on creating
a ranking model. They started by assessing the performance of several models on a proprietary test
collection of scientific papers. Then, the top-performing model was fine-tuned on a large set of unlabeled
documents using the Generative Pseudo Labeling approach. They also experimented with generating
new search queries.</p>
      <p>LIA submitted a total of 5 runs as baselines for Task 1. All five have been included in the pool of
results for qrel evaluation.</p>
      <p>Ruby This team (No paper received) submitted a total of 1 run for Task 1. Their approach relies on
ElasticSearch and a TF-IDF score.</p>
<p>Arampatzis This team (no paper received) submitted a total of 9 runs for Task 1. As these runs are very similar, the tables below only report the evaluation of their first run.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Released database</title>
        <p>This section details the results of the task, for both the train and test data.</p>
<p>All data and results have been organized within a relational database, which will be released to all active participants. This release will facilitate:
• computation of diverse scores;
• addressing qrel issues;
• easy generation of supplementary runs.</p>
<p>One particular benefit of the relational database is the ability to easily extend the qrels based on dense vector similarity and similarity thresholds. This capability is especially relevant given the observation that seemingly identical abstracts in the DBLP dataset appear with different relevance labels.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Train results</title>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Test results</title>
<p>† Evaluated on comb_score.
The results distinguish the long Guardian queries, on topics that can be a source of public debate (like privacy, quantum computing, bitcoins...), from those established on the short Tech Xplore queries, which are more specific and related to a scientific paper in peer-reviewed venues (indoor positioning systems, RISC-V architecture for space computing, underwater WiFi developed using LEDs and lasers...). Rankings on these two subsets are very similar, which shows the consistency of relevance results across queries.</p>
<p>5. Analysis
This section provides further analysis of the submitted runs, and the task as a whole.</p>
<p>We complement the evaluation above by taking into consideration other aspects essential for Task 1. Table 7 highlights credibility and text complexity. We used simple automatic metrics to provide an overview of the importance and the complexity of the articles. First, the average number of bibliographic references among the top 10 results of each query is provided. Second, we provide several metrics computed by the Python library readability7: the average size of vocabulary per abstract, the average ratio of words considered long (i.e., with at least 7 characters), the average ratio of words considered complex (i.e., absent from the Dale-Chall word list of 3,000 words recognized by 80% of fifth graders), and the average and median FKGL readability metrics.</p>
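For reference, the FKGL metric combines average sentence length and average syllables per word. A rough sketch of the computation (the vowel-group syllable counter is a crude heuristic of our own, so scores only approximate those of the readability library):

```python
import re

# Rough sketch of the Flesch-Kincaid Grade Level (FKGL) computation;
# the syllable counter is a crude vowel-group heuristic.
def count_syllables(word):
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words) - 15.59)

print(round(fkgl("The cat sat on the mat."), 2))  # -1.45
```

A score around 15 thus corresponds to text readable by a student with roughly 15 years of schooling, i.e. university level.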
<p>A large majority of runs have a similar FKGL of 15, corresponding to university-level texts, which can be expected since the documents deal with advanced scientific topics. However, AIIRLab runs obtained with bi- or cross-encoders and ordered according to comb scores exhibit significantly higher FKGL readability scores. This difference is related to longer sentences being retrieved with this score than with the relevance score (average length of 31 words vs. 23 words).</p>
<p>Only one run (Sharingans_Task1_marco-GPT3) provided a rephrased extract from the retrieved abstracts, while the other runs gave the abstracts in full. This is reflected in the table by a smaller vocabulary size in their passages.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Discussion and Conclusions</title>
<p>This concludes the results for the CLEF 2024 SimpleText Task 1: Content Selection, retrieving passages to include in a simplified summary. Our main findings are the following. First, the relevance tables are dominated by neural rankers, in particular cross-encoders and LLaMA 3 used as a pairwise reranker. Second, a majority of participants relied on ElasticSearch search results. While neural models in later processing steps leveraged these results, other IR systems turned out to be competitive: for instance, LIA_vir_title, operating with sentence embeddings, and UAms_Task1_Anserini_rm3, using an Anserini index, obtained high relevance evaluations. Third, as expected, the ranking of systems differs according to the considered criterion. Runs filtered against readability measures tend to have shorter sentences, with a more or less pronounced drop in relevance. Remarkably, LLaMA 3 used as a reranker seems to help select not only more relevant documents but also more concise sentences.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This track would not have been possible without the great support of numerous individuals. We want to
thank in particular the colleagues and the students who participated in data construction and evaluation.
Please visit the SimpleText website for more details on the track.8</p>
<p>Liana Ermakova is funded by the French National Research Agency (ANR) Automatic Simplification of Scientific Texts project (ANR-22-CE23-0019-01),9 and the MaDICS research group.10
8https://simpletext-project.com/
9https://anr.fr/Project-ANR-22-CE23-0019
10https://www.madics.fr/ateliers/simpletext/</p>
      <p>[5] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, Z. Su, Arnetminer: Extraction and mining of academic social networks, in: KDD’08, 2008, pp. 990–998.
[6] E. SanJuan, S. Huet, J. Kamps, L. Ermakova, Overview of the CLEF 2023 SimpleText task 1: Passage selection for a simplified summary, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th to 21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 2823–2834. URL: https://ceur-ws.org/Vol-3497/paper-238.pdf.
[7] D. P. Varadi, A. Bartulović, SimpleText 2024: Scientific Text Made Simpler Through the Use of AI, in: [15], 2024.
[8] S. M. Ali, H. Sajid, O. Aijaz, O. Waheed, F. Alvi, A. Samad, Improving Scientific Text Comprehension: A Multi-Task Approach with GPT-3.5 Turbo and Neural Ranking, in: [15], 2024.
[9] R. Mann, T. Mikulandric, CLEF 2024 SimpleText Tasks 1-3: Use of LLaMA-2 for text simplification, in: [15], 2024.
[10] R. Elagina, P. Vučić, AI Contributions to Simplifying Scientific Discourse in SimpleText 2024, in: [15], 2024.
[11] N. Largey, R. Maarefdoust, S. Durgin, B. Mansouri, AIIR Lab Systems for CLEF 2024 SimpleText: Large Language Models for Text Simplification, in: [15], 2024.
[12] B. Vendeville, L. Ermakova, P. De Loor, UBO NLP report on the SimpleText track at CLEF 2024, in: [15], 2024.
[13] J. Bakker, G. Yüksel, J. Kamps, University of Amsterdam at the CLEF 2024 SimpleText Track, in: [15], 2024.
[14] A. Capari, H. Azarbonyad, G. Tsatsaronis, Z. Afzal, Enhancing Scientific Document Simplification through Adaptive Retrieval and Generative Models, in: [15], 2024.
[15] G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024: Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org, 2024.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vezzani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bonato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Azarbonyad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
<article-title>Overview of the CLEF 2024 SimpleText task 2: Identify and explain difficult concepts</article-title>
          ,
          <source>in: [15]</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Laimé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>McCombie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF 2024 SimpleText task 3: Simplify scientific text</article-title>
          ,
          <source>in: [15]</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
<string-name>
            <given-names>J.</given-names>
            <surname>D'Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kabongo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. B.</given-names>
            <surname>Giglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF 2024 SimpleText Task 4: SOTA? Tracking the State-of-the-Art in Scholarly Publications</article-title>
          , in: [15],
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
<string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>SanJuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Azarbonyad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vezzani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>D'Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF 2024 SimpleText track: Improving access to scientific texts for everyone</article-title>
, in:
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulhem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Quénot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schwab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Soulier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G. S.</given-names>
            <surname>de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF</source>
          <year>2024</year>
          ), Lecture Notes in Computer Science, Springer,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>