University of Amsterdam at the CLEF 2024 SimpleText Track

Jan Bakker, Göksenin Yüksel and Jaap Kamps
University of Amsterdam, Amsterdam, The Netherlands

Abstract
This paper reports on the University of Amsterdam's participation in the CLEF 2024 SimpleText track. Our overall goal is to investigate and remove barriers that prevent the general public from accessing scientific literature, hoping to promote science literacy among the general public. Our specific focus is to investigate the relation between the topical relevance and the text complexity of scientific text, as well as to develop text simplification approaches for scientific text. Our main findings are the following. First, for lay person scientific passage retrieval, both lexical and zero-shot neural retrieval models perform well, with only a marginal loss of performance for complexity-aware models that avoid retrieving passages with low readability. Second, for spotting complex concepts, relatively simple approaches based on corpus statistics show competitive precision but low recall. Third, for scientific text simplification, different models generate different simplifications, all with reasonable overlap with the human reference simplifications. Fourth, document or abstract level text simplification incorporates discourse structure and makes sentence deletions, which holds great promise to improve the output quality and succinctness for lay users of scientific text.

Keywords
Information Storage and Retrieval, Natural Language Processing, Scientific Text Simplification, Text Complexity, Passage Retrieval

1. Introduction

While the advent of the internet and social media has given us access to an unprecedented volume of information, it also comes with unprecedented risks due to potential misinformation and disinformation spreading easily. The traditional antidote against misinformation is scientifically grounded information, and everyone agrees on the value and importance of science literacy. In practice, however, lay persons avoid consulting scientific sources, due to their presumed complexity. Hence, removing any access barriers for lay persons to consult scientific text is of paramount importance.

The CLEF 2024 SimpleText track investigates the barriers that ordinary citizens face when accessing scientific literature head-on, by making available corpora and tasks to address different aspects of the problem. For details on the exact track setup, we refer to the track overview paper in the CLEF 2024 LNCS proceedings [1] as well as the detailed task overviews in the CEUR proceedings [2, 3, 4, 5]. We conduct an extensive analysis of the three tasks of the track: Task 1 on Content Selection, Task 2 on Complexity Spotting, and Task 3 on Text Simplification. We submitted in total six runs for Task 1, focusing both on retrieval effectiveness for popular requests and on the text complexity of the retrieved abstracts. We submitted three baseline runs for Task 2, focusing on straightforwardly locating rare terms, and on matching scientific text to definitions of terminology. We submitted ten runs for Task 3, exploring three different text simplification models (GPT-2, Wiki, Cochrane) and three levels of simplification (sentence, paragraph, and document or abstract).

The rest of this paper is structured as follows. Next, in Section 2 we discuss our experimental setup and the specific runs submitted. Section 3 discusses the results of our runs and provides a detailed analysis of the corpus and results for each task.
We end in Section 4 by discussing our results and outlining the lessons learned.

2. Experimental Setup

In this section, we detail our approach for the three CLEF 2024 SimpleText track tasks.

2.1. Experimental Data

For details of the exact task setup and results we refer the reader to the detailed overview of the track in [6]. The basic ingredients of the track are:

Corpus The CLEF 2024 SimpleText Corpus consists of 4.9 million bibliographic records, including 4.2 million abstracts, and detailed information about authors/affiliations/citations.

Context There are 40 popular science articles, with 20 from The Guardian^1 and 20 from Tech Xplore.^2

Requests For Task 1, there are 176 requests: 109 requests are based on The Guardian and 67 on Tech Xplore. Abstracts retrieved for these requests form the corpus for the remaining Tasks 2 and 3. This expands the topic set of earlier years, consisting of 64 verbose questions on the Guardian articles, with short 1-4 word queries.

Train Data For Task 1, there are relevance judgments for 64 requests (corresponding to 20 Guardian articles, G01-G20, and 5 Tech Xplore articles, T01-T05), with 61 queries having 10 or more relevant abstracts. For Task 2, there are 576 train sentences with ground truth on complex terms/concepts for a total of 2,579 terms, and 317 test sentences (4.5 per query). For Task 2.3, an additional set of 3,815 other sentences is provided. For Task 3, there are 958 train sentences with human simplifications, matching to 175 train abstracts with human simplifications. There are 4,797 test sentences, and a matching set of 182 test abstracts.

Test Data For Task 1, the ultimate test collection consists of 30 queries G1.C1-G10.C1 (10 on the Guardian) and T06-T11 (20 on Tech Xplore), with a total of 4,854 judgments (128.5 per query). All 30 queries have 29 or more relevant abstracts. For Task 2, there are 313 test sentences with ground truth on complex terms/concepts for a total of 1,440 terms (4.6 per query). For Task 3, there are 578 test sentences with human simplifications, matching to 103 test abstracts with human simplifications.

2.2. Official Submissions

We created runs for all three tasks of the track, which we will discuss in order.

Task 1 This task asks to retrieve passages to include in a simplified summary. We submitted six runs in total, shown in Table 1. We first submitted four baseline runs focusing on regular information retrieval effectiveness. Two are vanilla baseline runs on an Anserini index, using either BM25 or BM25+RM3 with default settings [7].^3 The other two runs are neural cross-encoder rerankings of these runs, based on zero-shot application of an MS MARCO trained ranker, reranking the top 100 of either the BM25 or the BM25+RM3 baseline run.^4 We submitted two further runs that filter on median FKGL, both for the top 100 and the top 1K cross-encoder reranking, following the Complexity Aware Ranking approach of [8]. These runs simply filter out the most complex abstracts per request, using a standard readability measure, aiming to remove up to 50% of the results, with the remaining abstracts kept in the same relevance order as in the original run. As the train data is limited, and none of the approaches above are specific to scientific text, we also experimented with domain adaptation approaches in post-submission experiments.

^1 https://www.theguardian.com/science
^2 https://techxplore.com/
^3 https://github.com/castorini/pyserini
^4 https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2
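As an illustration, the following is a minimal sketch of such a two-stage pipeline with a median-FKGL complexity filter, assuming a locally built Pyserini index of the abstracts with the raw JSON records stored; the index path and the "abstract" field name are illustrative placeholders, not our exact configuration.

```python
# Minimal sketch of the Task 1 pipeline: BM25 first-stage retrieval (Pyserini),
# zero-shot cross-encoder reranking, and a median-FKGL complexity filter.
# The index path and the "abstract" field name are illustrative assumptions.
import json
import statistics

import textstat
from pyserini.search.lucene import LuceneSearcher
from sentence_transformers import CrossEncoder

searcher = LuceneSearcher("indexes/simpletext-abstracts")  # hypothetical index path
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def retrieve(query: str, k: int = 100, complexity_filter: bool = True):
    # First stage: BM25 with default Anserini/Pyserini settings.
    hits = searcher.search(query, k=k)
    docs = []
    for hit in hits:
        record = json.loads(searcher.doc(hit.docid).raw())
        docs.append((hit.docid, record.get("abstract", "")))

    # Second stage: zero-shot reranking with an MS MARCO trained cross-encoder.
    scores = reranker.predict([(query, text) for _, text in docs])
    ranking = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)

    if not complexity_filter:
        return [(docid, score) for (docid, _), score in ranking]

    # Complexity-aware filter: drop abstracts above the median FKGL of the
    # ranking (removing up to 50% of the results), keeping the rest in order.
    fkgl = {docid: textstat.flesch_kincaid_grade(text) for (docid, text), _ in ranking}
    median_fkgl = statistics.median(fkgl.values())
    return [(docid, score) for (docid, _), score in ranking if fkgl[docid] <= median_fkgl]
```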
Table 1
CLEF 2024 SimpleText Track Submissions

Task  Run                              Description
1     UAms_Task1_Anserini_bm25         BM25 baseline (Anserini, stemming)
1     UAms_Task1_Anserini_rm3          RM3 baseline (Anserini, stemming)
1     UAms_Task1_CE100                 Cross-encoder top 100
1     UAms_Task1_CE1K                  Cross-encoder top 1,000
1     UAms_Task1_CE100_CAR             Cross-encoder top 100 + Complexity filter
1     UAms_Task1_CE1K_CAR              Cross-encoder top 1,000 + Complexity filter
2.1   UAms_Task2-1_RareIDF             Up to 5 rarest terms on idf from test-large 2023
2.3   UAms_Task2-3_Anserini_bm25       BM25 baseline (Anserini, stemming)
2.3   UAms_Task2-3_Anserini_rm3        RM3 baseline (Anserini, stemming)
3.1   UAms_Task3-1_GPT2                GPT-2 sentence level
3.1   UAms_Task3-1_GPT2_Check          GPT-2 sentence level, source checked
3.2   UAms_Task3-2_GPT2_Check_Snt      GPT-2 sentence level, source checked, merged into abstracts
3.2   UAms_Task3-2_GPT2_Check_Abs      GPT-2 abstract level, source checked
3.1   UAms_Task3-1_Wiki_BART_Snt       Wiki-Auto trained BART sentence level simplification
3.1   UAms_Task3-1_Cochrane_BART_Snt   Cochrane trained BART sentence level simplification
3.2   UAms_Task3-2_Wiki_BART_Par       Wiki-Auto trained BART paragraph level simplification
3.2   UAms_Task3-2_Cochrane_BART_Par   Cochrane trained BART paragraph level simplification
3.2   UAms_Task3-2_Wiki_BART_Doc       Wiki-Auto trained BART document level simplification
3.2   UAms_Task3-2_Cochrane_BART_Doc   Cochrane trained BART document level simplification

Task 2 This task asks to identify and explain difficult concepts. We submitted three runs, also shown in Table 1. For Task 2.1 on complexity spotting, we submitted a single run. As sentences have a limited number of words, we observed that naive baseline approaches can already obtain reasonable performance. Hence, our submission uses idf-based term weighting to locate the rarest terms. Specifically, we used all train and test sentences combined as a reference corpus to calculate document (or rather sentence) frequencies, and used these to rank each term in the source sentence by increasing DF (or, equivalently, decreasing IDF); a minimal sketch of this baseline is shown below. For Task 2.3, we developed an approach to rank definitions or explanations for a given sentence and term pair. However, the provided test data contained only unmatched sets of scientific sentences and other sentences. Hence, we submitted two runs that only look at the textual similarity with the large set of provided "other" sentences.
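A minimal sketch of this IDF-based term spotting, assuming the reference sentences are available as a plain list of strings; the tokenization and helper names are our own illustrative choices, not the exact implementation of the submitted run.

```python
# Minimal sketch of the Task 2.1 baseline: rank the terms of a sentence by their
# sentence frequency in a reference corpus and return the rarest ones. Our run
# used the combined train and test sentences of the task as the reference corpus.
import math
import re
from collections import Counter

TOKEN = re.compile(r"[a-z0-9][a-z0-9-]+")

def tokens(text: str) -> list[str]:
    return TOKEN.findall(text.lower())

def sentence_frequencies(reference_sentences: list[str]) -> Counter:
    # Count in how many reference sentences each term occurs.
    df = Counter()
    for sentence in reference_sentences:
        df.update(set(tokens(sentence)))
    return df

def rarest_terms(sentence: str, df: Counter, n_sentences: int, k: int = 5) -> list[str]:
    # Rank unique terms by increasing sentence frequency (decreasing IDF).
    unique_terms = list(dict.fromkeys(tokens(sentence)))
    scored = [(term, math.log(n_sentences / (1 + df[term]))) for term in unique_terms]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in scored[:k]]

# Example usage with a toy reference corpus.
reference = ["the model is trained on images", "the labels are used for steering"]
df = sentence_frequencies(reference)
print(rarest_terms("The model outputs optimal labels for steering.", df, len(reference)))
```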
Task 3 This task asks to simplify scientific text. We submitted the ten runs shown in Table 1. Our first set of experiments continues our earlier experiments with a GPT-2 model trained in an unsupervised way. First, we use the basic pretrained model on sentence level input. Second, we check all output against the source to avoid hallucination, and submit this checked version. Third, we merge the sentence level simplifications to create abstract level simplifications. Fourth, we run the model on long abstract level input, to create direct abstract level simplifications. All four of these runs use the exact same GPT-2 text simplification model. Our second set of experiments is with different BART trained models, either trained on Wiki-Auto or on aligned lay summaries from Cochrane (a home-grown Cochrane-Auto). This leads to six runs, using either Wiki or Cochrane train data, and using either sentence level, paragraph level, or document (abstract) level input. Each of these six runs uses a different model, due to the different train input matching the output settings.

Table 2
Evaluation of SimpleText Task 1 (train data).

Run                          MRR     P@5     P@10    P@20    NDCG@5  NDCG@10 NDCG@20 Bpref   MAP
UAms_Task1_Anserini_bm25     0.6503  0.4688  0.3906  0.2818  0.4468  0.3931  0.3405  0.4198  0.2439
UAms_Task1_Anserini_rm3      0.6043  0.4187  0.3609  0.2677  0.4003  0.3581  0.3220  0.4157  0.2297
UAms_Task1_CE100             0.6655  0.4813  0.4312  0.3214  0.4570  0.4206  0.3811  0.3275  0.2235
UAms_Task1_CE1K              0.6603  0.4531  0.4078  0.3089  0.4304  0.3998  0.3668  0.4299  0.2484
UAms_Task1_CE100_CAR         0.6709  0.4687  0.3937  0.2396  0.4530  0.3972  0.3163  0.3144  0.1922
UAms_Task1_CE1K_CAR          0.6403  0.4219  0.3672  0.2484  0.4032  0.3646  0.3092  0.3411  0.1904
GPL Base†                    0.3301  0.1594  0.1719  0.1562  0.1560  0.1625  0.1708  0.3945  0.1062
GPL Domain Adapt†            0.4478  0.2719  0.2453  0.1958  0.2530  0.2380  0.2286  0.4012  0.1469
GPL Domain Adapt Remining†   0.5459  0.3125  0.2953  0.2141  0.3034  0.2874  0.2519  0.3978  0.1613
† Post-submission experiment.

3. Experimental Results

In this section, we present the results of our experiments, in three self-contained subsections following the CLEF 2024 SimpleText track tasks.

3.1. Task 1: Content Selection

We discuss our results for Task 1, asking to retrieve passages to include in a simplified summary.

3.1.1. Retrieval effectiveness

Table 2 shows the performance of the Task 1 submissions on the train data. Let us first observe how different our runs are from the pooled runs, as those were based exclusively on the organizer's provided Elasticsearch index and the particular keyword query. Due to the different tokenization and indexing choices in our Anserini index, the fraction of unjudged documents in the top 10 is high. First, the BM25 run has 36.6% and the BM25+RM3 run has 41.6% unjudged in the top 10. Second, the cross-encoder rerankings have 27.5% (CE top 100) and 30.8% (CE top 1K) unjudged, slightly lower due to similar neural rerankers contributing to the pool in earlier years. Third, the complexity-aware filtered runs have 34.4% (CAR top 100) and 35.3% (CAR top 1K). Fourth, the domain adapted runs have no less than 50.9-72.2% unjudged in the top 10. In this light, the scores of the domain adapted runs on the train data are truly impressive.

We make a number of observations on the performance on the train set. First, the two Anserini baselines using BM25 with or without RM3 query expansion perform very reasonably, with an NDCG@10 of 0.36-0.39 on the train data. The RM3 model underperforms vanilla BM25 on all measures on train, but has a higher fraction of unjudged documents. The Anserini index used differs from the organizer's provided Elasticsearch index that dominates the pool of the train data. Second, the zero-shot reranking with a cross-encoder leads to an improvement of retrieval effectiveness over the BM25 first stage ranker, with the top 100 reranking scoring 0.42 NDCG@10 on train. The bpref measure is less sensitive to pooling bias, and the highest bpref score of the top 1K reranking demonstrates the effectiveness of these runs. Third, we observe a favorable outcome for the domain adaptation of the models. The base scores are lower than GPL domain adaptation, and our novel remining strategy for continuous domain adaptation improves over GPL, the state-of-the-art for domain adaptation.
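The measures reported in Tables 2 and 3 (MRR, precision, NDCG, bpref, MAP) are standard trec_eval measures; the following is a minimal sketch of computing them from a qrels file and a TREC-format run file with pytrec_eval. The file names are placeholders, and the exact official evaluation settings may differ.

```python
# Sketch of computing the measures reported in Tables 2 and 3 with pytrec_eval.
# File names are placeholders; qrels and run files are in standard TREC format.
from collections import defaultdict

import pytrec_eval

def read_qrels(path):
    qrels = defaultdict(dict)
    with open(path) as f:
        for line in f:
            qid, _, docid, rel = line.split()
            qrels[qid][docid] = int(rel)
    return qrels

def read_run(path):
    run = defaultdict(dict)
    with open(path) as f:
        for line in f:
            qid, _, docid, _, score, _ = line.split()
            run[qid][docid] = float(score)
    return run

qrels = read_qrels("qrels.train.txt")
run = read_run("UAms_Task1_CE100.run")
measures = {"recip_rank", "map", "bpref", "P.10", "ndcg_cut.10"}
evaluator = pytrec_eval.RelevanceEvaluator(qrels, measures)
per_query = evaluator.evaluate(run)

# Average each measure over the queries (e.g. P_10, ndcg_cut_10).
for measure in sorted(next(iter(per_query.values()))):
    mean = sum(scores[measure] for scores in per_query.values()) / len(per_query)
    print(f"{measure:>12}: {mean:.4f}")
```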
Table 3
Evaluation of SimpleText Task 1 (test data).

Run                          MRR     P@5     P@10    P@20    NDCG@5  NDCG@10 NDCG@20 Bpref   MAP
UAms_Task1_Anserini          0.7187  0.5600  0.5500  0.4078  0.3867  0.3750  0.3507  0.3994  0.1973
UAms_Task1_Anserini_rm3      0.7878  0.5933  0.5700  0.3611  0.4039  0.3924  0.3282  0.4010  0.1824
UAms_Task1_CE100             0.6618  0.4800  0.5300  0.4044  0.3419  0.3654  0.3452  0.2657  0.1579
UAms_Task1_CE1K              0.5950  0.5133  0.5333  0.4033  0.3571  0.3672  0.3505  0.4031  0.1939
UAms_Task1_CE100_CAR         0.6420  0.5333  0.4700  0.3133  0.3435  0.3199  0.2741  0.2657  0.1321
UAms_Task1_CE1K_CAR          0.6611  0.5467  0.5133  0.2911  0.3800  0.3603  0.2778  0.2676  0.1348
GPL Base†                    0.3752  0.2333  0.2100  0.1611  0.1823  0.1642  0.1465  0.3192  0.0654
GPL Domain Adapt†            0.5169  0.2733  0.2667  0.2233  0.2389  0.2240  0.2075  0.3600  0.0983
GPL Domain Adapt Remining†   0.5011  0.3133  0.3033  0.2467  0.2560  0.2412  0.2285  0.3732  0.1084
† Post-submission experiment.

Table 3 shows the performance of the Task 1 submissions on the test data. We submitted four runs focusing purely on standard retrieval effectiveness, and two runs addressing text complexity. On the test data, our submissions were pooled, except for the complexity-filtered runs: we observe 7.7% (CAR top 100) and 6.0% (CAR top 1K) unjudged documents in the top 10 of each submission. The domain adapted runs have no less than 39.0-60.0% unjudged in the top 10, as they were not pooled.

We make a number of observations. First, we observe again that the two Anserini baselines using BM25 with or without RM3 query expansion perform very reasonably, with an NDCG@10 of 0.38-0.39 on the test data. The RM3 model now outperforms vanilla BM25 on all measures except MAP on test. Second, the zero-shot reranking with a cross-encoder does not lead to an improvement in retrieval effectiveness over the BM25 first stage ranker on the test data. Again, the bpref measure is less sensitive to pooling bias, and the highest bpref score of the top 1K reranking demonstrates the effectiveness of these runs. Third, the complexity-aware ranking runs filtering out the most complex abstracts show competitive performance. Although these runs intentionally avoid complex, but topically relevant, results, they obtain higher precision scores and similar NDCG scores, and are almost on par with the runs retrieving complex results. Fourth, recall that the domain adapted runs did not contribute to the pool and have high fractions of unjudged documents (no less than 39.0-60.0% unjudged in the top 10). In this light, again, the scores of the domain adapted runs are quite impressive. We observe again the relative score increase from the base ranking, to standard GPL domain adaptation, to the GPL remining approach. Our novel remining strategy for continuous domain adaptation again improves over GPL, the state-of-the-art for domain adaptation.

3.1.2. Analysis

This section analyzes various aspects of the submitted runs, where we pay particular attention to two aspects of core interest to the task and to the overall use case of the track, in which a lay user is accessing complex scientific text.

Credibility The first aspect of interest is the credibility of the retrieved information.
One may assume that any scientific paper published after peer review has passed a number of quality control steps, and hence that all retrieved abstracts have high credibility. However, it is well known that lay users have difficulty separating authoritative from non-authoritative publications, as they are not able to discern the same cues as experts. For example, they are unaware of the reputation of the authors [9]. How authoritative are the results retrieved for our lay user?

Table 4
Analysis of SimpleText Task 1 output (over all 176 queries)

                             Queries  Top   Year           Citations     Length             FKGL
Run                                         Avg     Med    Avg    Med    Avg      Med       Avg   Med
UAms_Anserini_bm25           176      10    2012.9  2015   16.5   3.0    1355.9   1249.0    14.5  14.3
UAms_Anserini_rm3            176      10    2013.2  2015   16.8   3.0    1376.6   1272.5    14.5  14.4
UAms_CE100                   176      10    2012.6  2015   20.5   3.0    1192.5   1115.0    14.5  14.4
UAms_CE100_CAR               176      10    2012.6  2015   18.0   3.0    1151.4   1081.0    12.5  12.8
UAms_CE1K                    176      10    2012.5  2015   19.4   3.0    1147.0   1061.0    14.5  14.4
UAms_CE1K_CAR                176      10    2012.3  2015   18.5   3.0    1083.2   1009.0    12.4  12.7
GPL Base                     176      10    2011.8  2014   13.1   2.0    910.5    970.5     14.3  14.3
GPL Domain Adapt             176      10    2011.9  2014   13.7   2.0    970.3    971.5     14.3  14.2
GPL Domain Adapt Remining    176      10    2011.7  2014   21.3   2.0    953.9    980.0     14.2  14.2

Table 4 shows the year of publication of the top 10 results retrieved for our lay user's popular science query. The systems retrieve publications with a median recency of 2015, ensuring that our lay user is consulting recent information that is not yet outdated or revised by more recent publications. This is an encouraging result, as standard bibliometric literature ranking approaches have a strong bias towards older publications, given the fact that citations accumulate over time. But does this mean the results are not noteworthy and lack importance?

Table 4 also shows the number of citations of the top 10 results retrieved. We observe that our approach retrieves results with significantly higher average numbers of citations when compared to the baseline lexical rankers, with a gain from roughly 17 to 21 citations on average. The GPL runs use a different baseline, but the difference between standard GPL, similar to the non-adapted baseline, and the novel remining approach is striking, with the latter obtaining the highest average citation score. This higher average citation count is reassuring, as it signals high levels of authoritativeness of the retrieved results. As citations are sparse and skewed, the median number of citations is only 2-3 throughout. This also signals that our approach is able to attract very highly cited publications into the top 10 results, leading to the significant increase of the average.
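The statistics in Table 4 can be computed directly from the run files and the corpus metadata; below is a small sketch under the assumption that each bibliographic record exposes publication year, citation count, and abstract text (the field names "year", "n_citations", and "abstract" are placeholders for the actual corpus fields).

```python
# Sketch of the per-run analysis behind Table 4: average and median publication
# year, citation count, abstract length, and FKGL over the top-10 results per
# request. The metadata field names are illustrative placeholders.
import statistics

import textstat

def top10_statistics(run, metadata):
    """run: {qid: [docid, ...]} ranked lists; metadata: {docid: record dict}."""
    years, citations, lengths, fkgl = [], [], [], []
    for ranking in run.values():
        for docid in ranking[:10]:
            record = metadata[docid]
            years.append(record["year"])
            citations.append(record["n_citations"])
            lengths.append(len(record["abstract"]))
            fkgl.append(textstat.flesch_kincaid_grade(record["abstract"]))
    summary = {}
    for name, values in [("year", years), ("citations", citations),
                         ("length", lengths), ("fkgl", fkgl)]:
        summary[name] = {"avg": statistics.mean(values),
                         "med": statistics.median(values)}
    return summary
```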
Readability The second aspect of interest is the readability of the retrieved information. We have seen above that the approaches are effective in retrieving relevant scientific papers. However, although topically relevant, these papers may contain very advanced scientific information that is not easy to understand and interpret for lay users. Recall that this was the motivation to use complexity-aware retrieval approaches [8]. Can complexity-aware search help retrieve relevant and accessible scientific text?

Table 4 shows the Flesch-Kincaid Grade Level (FKGL) readability score of the top 10 results retrieved for our lay user's popular science query. We observe that the lexical and neural rankers retrieve topically relevant information without taking the text complexity into account. Both lexical and neural rankers retrieve information with an FKGL of 14-15, corresponding to university level text complexity. The same holds for the domain adapted runs. This is not surprising, as we have an extensive scientific corpus with an average text complexity of 14-15 reflecting this. Earlier we observed that our complexity-aware retrieval systems obtained almost the same retrieval effectiveness. Hence this complexity-aware approach was able to rank a similar number of topically relevant documents in the top 10 as the standard lexical and neural ranking approaches. But is the complexity-aware approach able to rank more accessible content for our lay user issuing a popular science query? Table 4 indeed shows favorable readability levels for the complexity-aware search, with an FKGL of 12-13 corresponding to the exit level of compulsory education. Hence the complexity-aware search approach is able to retrieve relevant and accessible content for our lay user. The retrieved source abstracts have a similar readability level as targeted by the text simplification systems discussed in Section 3.3.

Table 5
Evaluation of SimpleText Task 2 (test data).

Run                    Recall   Terms "d"   Overall Average Recall   Precision
UAms_Task2-1_RareIDF   0.0854   0.0942      0.0259                   0.0894

Table 6
Evaluation of SimpleText Task 2: submission UAms_Task2-1_RareIDF, only unique terms in the train (including validation) and test data.

        Precision                      Recall                         F1 Score
Run     1     2     3     4     5      1     2     3     4     5      1     2     3     4     5
Train   0.16  0.14  0.13  0.13  0.12   0.04  0.07  0.10  0.13  0.15   0.06  0.09  0.11  0.11  0.12
Test    0.18  0.16  0.14  0.13  0.12   0.05  0.08  0.10  0.12  0.14   0.07  0.10  0.11  0.12  0.12

3.2. Task 2: Complexity Spotting

We continue with Task 2, asking to identify and explain difficult concepts.

3.2.1. Results

Task 2.1 Table 5 shows the performance of the Task 2 submission on the test data. At the time of writing, these scores were released as (preliminary) scores without much further explanation. The official results seem to focus entirely on recall aspects, that is, on retrieving all terms annotated by the experts. Our simple approach is not expected to do well in terms of recall. We conduct a more precision-oriented evaluation below as additional analysis.

Task 2.3 No train data was released for Task 2.3, nor were any test results available at the time of writing. We hope and expect that these results will be released in time for the CLEF conference in Grenoble.

3.2.2. Analysis

Table 6 shows the performance of the Task 2 submission on the train and test data. Due to the very limited data available, we treat the spotting of any annotated term as correct here. We included the complexity level as a graded score, in order to filter the Boolean measures on a minimal relevance score.^5 On the train and test data of earlier years, performance peaked around spotting 3 terms per sentence. Due to the many experts annotating the same set of sentences, we see that both recall and F1 increase over the ranks, and the highest scores are obtained when spotting 5 rare terms per sentence. Overall, our simple approach achieves an MRR of 0.2542 (train) and 0.2741 (test) and, taking the difficulty level into account, an NDCG@5 of 0.1446 (train) and 0.1469 (test).

^5 Tables not shown as they exhibit the same qualitative pattern, but at the obviously lower score level.

Table 7 shows an example sentence with references. In this example, our approach predicts 5 terms, which are matched against the annotated references. The top ranked candidate matches one of the references annotated as difficult ("d"). There is a striking number of 16 references, with about 11 unique reference terms.
Some references occur in variants (e.g., "simulated F1 car" is rated "d", whereas "F1 car" is rated "e"). Several references do not literally occur in the source sentence: we observe differences in case ("ResNet-18" vs. "resnet-18"), plural/singular ("labels" vs. "label", "images" vs. "image"), and verb tense ("is fed" vs. "to be fed", "outputs" vs. "to output").

Table 8 shows the frequency of terms spotted per sentence on the train data. We observe a striking variation, with 53 sentences having a single complex term, and 12 sentences having more than 15 complex terms. This variation makes the prediction of all terms nigh impossible, and makes averaging over terms an unreliable indicator of the per-sentence performance. Evaluation over the sets of top retrieved terms, as we did in Table 6, indeed shows reasonable performance for our basic approach.

Table 7
Example of SimpleText Task 2.1: source and references.

Sentence    G06.2_2810968146_2
Source      The model is a ResNet-18 variant, which is fed in images from the front of a simulated F1 car, and outputs optimal labels for steering, throttle, braking.
Reference   ['ResNet-18 variant', 'braking', 'braking', 'f1 car', 'front', 'image', 'model', 'optimal label', 'resnet-18', 'simulated F1 car', 'steering', 'steering', 'throttle', 'throttle', 'to be fed', 'to output']
Difficulty  ['d', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'd', 'd', 'e', 'e', 'e', 'e', 'e', 'm']
Source "d"  The model is a ResNet-18 variant, which is fed in images from the front of a simulated F1 car, and outputs optimal labels for steering, throttle, braking.
Source "m"  The model is a ResNet-18 variant, which is fed in images from the front of a simulated F1 car, and outputs optimal labels for steering, throttle, braking.
Source "e"  The model is a ResNet-18 variant, which is fed in images from the front of a simulated F1 car, and outputs optimal labels for steering, throttle, braking.
Prediction  ['resnet-18', 'throttle', 'braking', 'f1', 'fed']

Table 8
Example of SimpleText Task 2.1: Frequency of terms spotted.

Terms/Sentence      1   2   3   4    5   6   7   8   9   10  11  12  13  14  15  16  17  18  29
Frequency (train)   53  99  90  100  44  55  23  22  16  20  3   5   4   4   1   7   2   2   1
Frequency (test)    18  31  61  65   45  32  26  16  10  3   2   4

Table 9
Example of SimpleText Task 2.1: Spotted term or concept.

Source                 Number of Terms   Occurs in Sentence   Not in Sentence
Train                  2,579             2,098                481
Train (case folding)   2,579             2,334                245
Test                   1,440             1,312                128
Test (case folding)    1,440             1,347                93

Table 10
CLEF 2024 SimpleText Task 2: Top 1 Semantic Match

        Rouge                              BERTScore
Run     1       2       L       Lsum       P      R      F1
Train   0.3729  0.0946  0.3723  0.3733     0.92   0.93   0.92
Test    0.3825  0.0957  0.3810  0.3825     0.93   0.93   0.92

The recall of our approach is relatively low, as the baseline rarest-term approach cannot find multi-word phrases. In addition, many of the ground truth terms do not literally appear in the sentence, and require case folding, morphological normalization, or even more complex transformations to correctly align with the exact orthography of the scientific text. Table 9 quantifies how often the spotted term or phrase literally occurs in the sentence. We observe that the fraction of reference terms not literally occurring in the sentence varies from 6.5% to 18.7%. Many cases concern morphological normalization that is useful to conflate similar concepts across different sentences (base form of verbs, singular form of nouns, etc.). However, the evaluation measures treat such cases as a failed match, and recall-oriented measures should therefore be treated with care.
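The analysis in Table 9 amounts to a simple literal-occurrence check, with and without case folding; a minimal sketch is shown below, where the input format (sentence plus list of reference terms) follows the task's term annotations.

```python
# Sketch of the analysis in Table 9: how many reference terms occur literally in
# their source sentence, with and without case folding.
def literal_occurrence(pairs, case_fold=False):
    """pairs: iterable of (sentence, reference_terms) tuples."""
    in_sentence = not_in_sentence = 0
    for sentence, terms in pairs:
        haystack = sentence.lower() if case_fold else sentence
        for term in terms:
            needle = term.lower() if case_fold else term
            if needle in haystack:
                in_sentence += 1
            else:
                not_in_sentence += 1
    return in_sentence, not_in_sentence

# Example from Table 7: "resnet-18" only matches "ResNet-18" after case folding.
example = [("The model is a ResNet-18 variant.", ["resnet-18", "model"])]
print(literal_occurrence(example, case_fold=False))  # (1, 1)
print(literal_occurrence(example, case_fold=True))   # (2, 0)
```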
Table 11
Results for CLEF 2024 SimpleText: Task 3.1 sentence-level (top) and Task 3.2 abstract-level (bottom) text simplification on the train set

Run                       N    FKGL   SARI    BLEU    Compr.  Splits  Lev.sim  Copies  Add.  Del.  Lex.compl.
Source                    893  14.30  19.18   38.95   1.00    1.00    1.00     1.00    0.00  0.00  8.72
References                893  11.70  100.00  100.00  0.84    1.07    0.72     0.04    0.21  0.37  8.63
UAms_GPT2_Check           714  11.87  35.21   27.35   1.02    1.22    0.87     0.11    0.17  0.14  8.59
UAms_GPT2                 714  11.21  34.73   23.69   1.28    1.47    0.79     0.05    0.28  0.12  8.56
UAms_Wiki_BART_Snt        714  12.34  34.19   37.18   0.83    0.99    0.88     0.29    0.02  0.19  8.64
UAms_Cochrane_BART_Snt    714  13.74  26.70   36.69   0.94    0.99    0.95     0.56    0.03  0.08  8.67

Source                    175  14.30  19.53   39.95   1.00    1.00    1.00     1.00    0.00  0.00  8.88
Reference                 175  11.80  100.00  100.00  0.80    1.04    0.70     0.00    0.20  0.40  8.75
UAms_GPT2_Check_Abs       119  12.75  36.68   16.48   0.59    0.66    0.60     0.01    0.11  0.50  8.61
UAms_GPT2_Check_Snt       119  11.88  35.97   28.86   1.00    1.22    0.85     0.01    0.18  0.15  8.71
UAms_Cochrane_BART_Par    119  16.15  35.12   26.23   0.70    0.59    0.70     0.04    0.08  0.36  8.72
UAms_Wiki_BART_Doc        119  16.45  33.36   28.35   1.01    0.83    0.81     0.00    0.18  0.15  8.73
UAms_Cochrane_BART_Doc    119  14.78  33.23   9.55    0.40    0.40    0.52     0.03    0.01  0.61  8.76
UAms_Wiki_BART_Par        119  13.26  30.31   36.76   0.89    1.00    0.88     0.01    0.03  0.13  8.81
Columns: number of sentences/abstracts (N), FKGL, SARI, BLEU, and the reference-free statistics compression ratio, sentence splits, Levenshtein similarity, exact copies, additions proportion, deletions proportion, and lexical complexity score.

Table 10 evaluates the top 1 rarest term as returned by our baseline approach, and compares it to the entire list of reference terms. As our term is a unigram, we score well on Rouge-1 but not on Rouge-2 (we retained hyphenated multi-word terms, hence the score is not zero). With BERTScore we can see the semantic relatedness of our top 1 term and the reference terms, ignoring the exact orthography. The scores are very encouraging, with over 90% precision, recall, and F1. Note that this evaluation is restricted to our first spotted term, and scores 1.0 in case this term is part of any of the experts' reference terms.

3.3. Task 3: Text Simplification

We continue with Task 3, asking to simplify scientific text.

3.3.1. Evaluation

Table 11 shows the results on the train data, both in terms of text statistics and in terms of evaluation against the human reference simplifications.^6 We make a number of observations. First, looking at the GPT-2 models, we see that both sentence level and abstract level text simplification considerably bring down the FKGL measure, and obtain reasonable SARI and BLEU scores against the reference simplifications. The abstract level simplification leads to deletions of entire sentences, with 50% fewer tokens than the source, but still outperforms the sentence level simplification that retains all sentences. Second, the BART models trained on Wiki-Auto and on Cochrane-Auto lay summaries significantly outperform the GPT-2 model on BLEU, with scores of 0.37, signaling high n-gram overlap with the human reference simplifications.

^6 Some of the differences in the number of sentences/abstracts are due to those sources not being included in the test source file. This particularly concerns very short fragments from biomedical literature added as additional train data, but not part of the SimpleText corpus.
Table 12
Results for CLEF 2024 SimpleText: Task 3.1 sentence-level (top) and Task 3.2 abstract-level (bottom) text simplification on the test set

Run                       N    FKGL   SARI    BLEU    Compr.  Splits  Lev.sim  Copies  Add.  Del.  Lex.compl.
Source                    578  13.65  12.02   19.76   1.00    1.00    1.00     1.00    0.00  0.00  8.80
Reference                 578  8.86   100.00  100.00  0.70    1.06    0.60     0.01    0.27  0.54  8.51
UAms_GPT2_Check           578  11.47  29.91   15.10   1.02    1.23    0.87     0.14    0.17  0.14  8.68
UAms_GPT2                 578  10.91  29.73   13.07   1.30    1.50    0.79     0.06    0.29  0.12  8.63
UAms_Wiki_BART_Snt        578  12.13  27.45   21.56   0.85    0.99    0.89     0.32    0.02  0.16  8.73
UAms_Cochrane_BART_Snt    578  13.22  18.45   19.21   0.95    0.99    0.96     0.59    0.02  0.07  8.77

Source                    103  13.64  12.81   21.36   1.00    1.00    1.00     1.00    0.00  0.00  8.88
Reference                 103  8.91   100.00  100.00  0.67    1.04    0.60     0.00    0.23  0.53  8.66
UAms_GPT2_Check_Abs       103  12.85  36.47   13.12   0.91    0.92    0.59     0.00    0.18  0.45  8.73
UAms_Cochrane_BART_Doc    103  14.46  33.51   9.39    0.65    0.58    0.54     0.04    0.06  0.53  8.80
UAms_Cochrane_BART_Par    103  16.53  31.58   15.40   1.08    0.80    0.67     0.04    0.15  0.32  8.81
UAms_GPT2_Check_Snt       103  11.57  30.71   15.24   1.54    1.70    0.78     0.00    0.27  0.13  8.77
UAms_Wiki_BART_Doc        103  15.68  26.50   15.11   1.51    1.14    0.76     0.01    0.25  0.11  8.79
UAms_Wiki_BART_Par        103  13.11  23.92   19.49   1.39    1.37    0.81     0.01    0.11  0.10  8.86
Columns as in Table 11.

For abstract level simplification, it is encouraging to see that the Cochrane model trained on scientific data slightly outperforms the Wiki-Auto trained model. Third, the paragraph and document level models trained on Wiki-Auto and Cochrane again do not outperform the sentence level simplifications, under the conditions of the task's train data. The train data is derived from the sentence level scientific text simplification references from the earlier years of the track. Proper document level text simplification approaches lead to considerable deletions, and perform reasonably given their far more succinct output.

Table 12 shows the Task 3 results for both sentence-level (top) and abstract-level (bottom) scientific text simplification on the test data. We again make a number of observations. First, looking at the GPT-2 models, we again see low FKGL scores indicating favorable readability, with reasonable SARI and BLEU scores. The abstract level simplification clearly outperforms the merged sentence level simplifications, despite a far more succinct output. Second, looking at the BART models trained on Wiki-Auto and on Cochrane-Auto lay summaries, we see that the Cochrane model trained on scientific data clearly outperforms the Wiki-Auto trained model on SARI for document level text simplification. Third, the paragraph and document level models trained on Wiki-Auto and Cochrane again do not outperform the sentence level simplifications, under the conditions of the task's test data based on aggregated human reference sentence simplifications. These models take discourse structure into account, and may merge or reorder sentences, and are less focused on single sentence wordsmithing or on promoting sentence splits.
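The automatic measures in Tables 11 and 12 (SARI, BLEU, FKGL, and the reference-free statistics such as compression ratio and deletion proportion) can be computed with standard tooling; below is a minimal sketch using the EASSE toolkit and sacrebleu, assuming aligned lists of source sentences, system outputs, and a single human reference per item. The exact settings of the official evaluation scripts may differ.

```python
# Sketch of the automatic evaluation behind Tables 11 and 12, using EASSE for
# SARI, FKGL, and quality estimation, and sacrebleu for BLEU.
import sacrebleu
from easse.fkgl import corpus_fkgl
from easse.quality_estimation import corpus_quality_estimation
from easse.sari import corpus_sari

def evaluate_simplifications(sources, outputs, references):
    scores = {
        "FKGL": corpus_fkgl(outputs),
        "SARI": corpus_sari(orig_sents=sources, sys_sents=outputs,
                            refs_sents=[references]),
        "BLEU": sacrebleu.corpus_bleu(outputs, [references]).score,
    }
    # Reference-free statistics: compression ratio, sentence splits, Levenshtein
    # similarity, exact copies, additions/deletions proportion, lexical complexity.
    scores.update(corpus_quality_estimation(sources, outputs))
    return scores
```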
3.3.2. Analysis

In this section, we analyze the output of our systems by realigning the simplified text predictions to the source sentences.

Table 13
Example of SimpleText Task 3 prediction versus source: deletions, insertions, and whole sentence insertions (sentence boundaries of the source marked by |)

Topic: G07.1   Document: 2111507945
Output: The growth of social media provides a convenient communication scheme way for people | to communicate , but at the same time it becomes a hotbed of misinformation . | The This wide spread of misinformation over social media is injurious to public | interest . It is difficult to separate fact from fiction when talking about social media . | We design a framework , which integrates combines | collective intelligence and machine intelligence , to help identify misinformation . | The basic idea is : ( 1 ) automatically index the expertise of users according to their microblog contents posts ; and ( | 2 ) match the experts with the same information given to suspected misinformation . | By sending the suspected misinformation to appropriate experts , we can collect gather the assessments of experts relevant | data to judge the credibility of the information , and help refute misinformation . | In this paper , we focus on look at expert finding for misinformation identification | . We ask experts to identify the source of the misinformation , and how it is spread . | We propose a tag-based method approach to index indexing the expertise of microblog users with social tags | . Our approach will allow us to identify which posts are most relevant and which are not . | Experiments on a real world dataset demonstrate show the effectiveness of our method approach for expert finding with respect to misinformation identification in microblogs .

Controlled Creativity Text simplification models are based on generative large language models. For example, one of the models we used is a GPT-2 based model [10] called Keep it Simple (KiS). The model is based on GPT-2 medium, using a straightforward unsupervised training task with an explicit loss in terms of fluency, saliency, and simplicity. Such models are used in generative mode, generating the output in a fairly unconstrained way in order to ensure none of the input is lost (in particular for longer input). As a result, there is also a chance that the model continues to generate output after the source has been fully simplified. This can cause the model to overgenerate and produce spurious content.

Table 13 shows an example output simplification, combining the input sentences belonging to the abstract of document 2111507945 retrieved for query G07.1. We show here deletions and insertions relative to the source input sentences (in this case 8 in total). Many simplifications are revisions of the input, but we also observe that sometimes an entire sentence is inserted. Modern generative models such as ours generate the simplification token by token, which may lead to additional output being generated at the end of the input. Recall that the example shown in Table 13 merges 8 separate input sentences from the train data (sentence boundaries indicated by |); here such whole sentence insertions occur at the end of three of the inputs.

Spurious Content We analyze the frequency of spurious content in our runs. For human readers, detecting such sentences by simply inspecting the output is hard, as they are very reasonable completions generated with awareness of the preceding context. We experimented with unsupervised approaches to tackle such spurious generation, by post-processing the output in relation to the original input. Similar to the edits shown in Table 13, we align input and output, and remove any output sentence that has been inserted without grounding in the input.
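A minimal sketch of such a grounding check is shown below: any output sentence whose best word overlap with some input sentence falls below a threshold is treated as ungrounded and removed. The naive sentence splitter and the 0.3 threshold are illustrative choices, not the exact settings of our runs.

```python
# Sketch of the post-processing that drops spurious (ungrounded) output sentences.
import re

def sentences(text: str) -> list[str]:
    # Naive sentence splitter; a proper tokenizer (e.g. nltk) can be substituted.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def overlap(candidate: str, source: str) -> float:
    # Fraction of the candidate's words that also occur in the source sentence.
    cand, src = set(candidate.lower().split()), set(source.lower().split())
    return len(cand & src) / max(1, len(cand))

def remove_ungrounded(source_text: str, output_text: str, threshold: float = 0.3) -> str:
    source_sents = sentences(source_text)
    kept = [out for out in sentences(output_text)
            if source_sents and max(overlap(out, src) for src in source_sents) >= threshold]
    return " ".join(kept)
```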
Table 14 quantifies how often such spurious generation occurs.

Table 14
Analysis of SimpleText Task 3: Spurious generation for sentence-level (top) and abstract-level (bottom) scientific text simplification

Run                        # Input Sentences/Abstracts   Spurious Content
                                                         Number   Fraction
UAms-1_GPT2                4,797                         1,390    0.29
UAms-1_GPT2_Check          4,797                             3    0.00
UAms-1_Wiki_BART_Snt       4,797                            14    0.00
UAms-1_Cochrane_BART_Snt   4,797                            25    0.01
UAms-2_GPT2_Check_Snt        782                           111    0.14
UAms-2_GPT2_Check_Abs        782                             1    0.00
UAms-2_Wiki_BART_Par         782                            46    0.06
UAms-2_Wiki_BART_Doc         782                            74    0.09
UAms-2_Cochrane_BART_Par     782                            28    0.04
UAms-2_Cochrane_BART_Doc     782                             2    0.00

We make a number of observations. First, spurious generation is not infrequent. Some systems have a marginal number of cases, which may be a result of imperfect alignment due to short sentences or changing word orders. Other systems have many cases, up to 1,390 sentences or 29% (and 111 abstracts or 14%) of the input for the unconstrained GPT-2 model. Second, in the GPT-2 sentence level case, we remove this additional content in a post-processing step, ensuring that all output is grounded in input sentences. This effectively removes spurious content from the runs, and also leads to better performance in Table 12. Third, while our post-processing already has a favorable effect on the evaluation measures, we feel that it has great benefits not reflected by these scores. Our post-processing is specifically, and only, removing spurious generation (or "hallucination") from the output. These results highlight and quantify the severity of this problem in generative text simplification models such as our GPT-2 model. At the same time, it offers a practical approach to tackle this undesirable aspect head-on.

4. Discussion and Conclusions

This paper detailed the University of Amsterdam's participation in the CLEF 2024 SimpleText track. We conducted a range of experiments for each of the three tasks of the track. For Task 1 on Content Selection, we observed very solid performance for zero-shot neural reranking, as well as competitive effectiveness for complexity-aware rankers that purposely avoid retrieving results with a high text complexity. For Task 2 on Complexity Spotting, we submitted preliminary approaches based on standard term weighting, and observed that naive approaches can help locate difficult terms. For Task 3 on Text Simplification, we experimented with a range of models and approaches, and observed that sentence-level simplification approaches can be very effective in reducing the complexity of scientific text, and that paragraph and abstract level simplifications lead to far shorter output, including whole sentence deletions.

Acknowledgments

This research was conducted as part of the final research projects of the Master in Artificial Intelligence at the University of Amsterdam. We thank the track and task organizers for their amazing service and effort in making realistic benchmarks for scientific text simplification available. Jaap Kamps is partly funded by the Netherlands Organization for Scientific Research (NWO CI # CISC.CC.016, NWO NWA # 1518.22.105), the University of Amsterdam (AI4FinTech program), and ICAI (AI for Open Government Lab). Views expressed in this paper are not necessarily shared or endorsed by those funding the research.

References
[1] L. Ermakova, et al., Overview of the CLEF 2024 SimpleText track: Improving access to scientific texts, in: L. Goeuriot, et al. (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Springer, 2024.
[2] E. SanJuan, et al., Overview of the CLEF 2024 SimpleText task 1: Retrieve passages to include in a simplified summary, in: G. Faggioli, et al. (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[3] G. M. D. Nunzio, et al., Overview of the CLEF 2024 SimpleText task 2: Identify and explain difficult concepts, in: G. Faggioli, et al. (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[4] L. Ermakova, et al., Overview of the CLEF 2024 SimpleText task 3: Simplify scientific text, in: G. Faggioli, et al. (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[5] J. D'Souza, et al., Overview of the CLEF 2024 SimpleText task 4: Track the state-of-the-art in scholarly publications, in: G. Faggioli, et al. (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[6] L. Ermakova, T. Miller, A. Bosser, V. M. Palma-Preciado, G. Sidorov, A. Jatowt, Overview of JOKER - CLEF-2023 track on automatic wordplay analysis, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, A. Giachanou, D. Li, M. Aliannejadi, M. Vlachos, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction - 14th International Conference of the CLEF Association, CLEF 2023, Thessaloniki, Greece, September 18-21, 2023, Proceedings, volume 14163 of Lecture Notes in Computer Science, Springer, 2023, pp. 397-415. URL: https://doi.org/10.1007/978-3-031-42448-9_26. doi:10.1007/978-3-031-42448-9_26.
[7] J. Lin, X. Ma, S. Lin, J. Yang, R. Pradeep, R. F. Nogueira, Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations, in: F. Diaz, C. Shah, T. Suel, P. Castells, R. Jones, T. Sakai (Eds.), SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, ACM, 2021, pp. 2356-2362. URL: https://doi.org/10.1145/3404835.3463238. doi:10.1145/3404835.3463238.
[8] L. Ermakova, J. Kamps, Complexity-aware scientific literature search: Searching for relevant and accessible scientific text, in: G. M. D. Nunzio, F. Vezzani, L. Ermakova, H. Azarbonyad, J. Kamps (Eds.), Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024, ELRA and ICCL, Torino, Italia, 2024, pp. 16-26. URL: https://aclanthology.org/2024.determit-1.2.
[9] J. Kamps, The impact of author ranking in a library catalogue, in: G. Kazai, C. Eickhoff, P. Brusilovsky (Eds.), Proceedings of the 4th ACM Workshop on Online Books, Complementary Social Media and Crowdsourcing, BooksOnline 2011, Glasgow, United Kingdom, October 24, 2011, ACM, 2011, pp. 35-40. URL: https://doi.org/10.1145/2064058.2064067. doi:10.1145/2064058.2064067.
[10] P. Laban, T. Schnabel, P. N. Bennett, M. A. Hearst, Keep it simple: Unsupervised simplification of multi-paragraph text, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL/IJCNLP 2021), Association for Computational Linguistics, 2021, pp. 6365-6378. URL: https://doi.org/10.18653/v1/2021.acl-long.498.