<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>University of Amsterdam at the CLEF 2024 SimpleText Track</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Bakker</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Göksenin Yüksel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaap Kamps</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Amsterdam</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper reports on the University of Amsterdam's participation in the CLEF 2024 SimpleText track. Our overall goal is to investigate and remove barriers that prevent the general public from accessing scientific literature, hoping to promote science literacy. Our specific focus is to investigate the relation between the topical relevance and the text complexity of scientific text, as well as to develop text simplification approaches for scientific text. Our main findings are the following. First, for lay person scientific passage retrieval, both lexical and zero-shot retrieval models perform well, with only a marginal loss of performance for complexity-aware models that avoid retrieving passages with low readability. Second, for spotting complex concepts, relatively simple approaches based on corpus statistics show competitive precision but low recall. Third, for scientific text simplification, different models generate different simplifications, all with reasonable overlap with human reference simplifications. Fourth, document or abstract level text simplification incorporates discourse structure and makes sentence deletions, which holds great promise for improving the output quality and succinctness for lay users of scientific text.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Storage and Retrieval</kwd>
        <kwd>Natural Language Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Experimental Setup</title>
      <sec id="sec-2-1">
        <title>2.1. Experimental Data</title>
        <p>In this section, we detail our approach for the three CLEF 2024 SimpleText track tasks.
For details of the exact task setup and results we refer the reader to the detailed overview of the track
in [6]. The basic ingredients of the track are:
Corpus The CLEF 2024 SimpleText Corpus consists of 4.9 million bibliographic records, including 4.2
million abstracts, and detailed information about authors/affiliations/citations.</p>
        <p>Context There are 40 popular science articles, with 20 from The Guardian
(https://www.theguardian.com/science) and 20 from Tech Xplore (https://techxplore.com/).
Requests For Task 1, there are 176 requests: 109 requests are based on The Guardian and 67 on
Tech Xplore. Abstracts retrieved for these requests form the corpus for the remaining Tasks 2
and 3. This expands the topic set of earlier years, which had 64 verbose questions on the Guardian
articles, with short 1-4 word queries.</p>
        <p>Train Data For Task 1, there are relevance judgments for 64 requests (corresponding to 20 Guardian
articles, G01–G20, and 5 Tech Xplore articles, T01–T05), with 61 queries having 10 or more
relevant abstracts.</p>
        <p>For Task 2, there are 576 train sentences with ground truth on complex terms/concepts for a total
of 2,579 terms, and 317 test sentences (4.5 per query). For Task 2.3, an additional set of 3,815 other
sentences is provided.</p>
        <p>For Task 3, there are 958 train sentences with human simplifications, matching to 175 train
abstracts with human simplifications. There are 4,797 test sentences, and a matching set of 182
test abstracts.</p>
        <p>Test Data For Task 1, the ultimate test collection consists of 30 queries: G1.C1–G10.C1 (10 on The
Guardian) and T06–T11 (20 on Tech Xplore), with a total of 4,854 judgments (128.5 per query). All 30
queries have 29 or more relevant abstracts.</p>
        <p>For Task 2, there are 313 test sentences with ground truth on complex terms/concepts for a total
of 1,440 terms (4.6 per query).</p>
        <p>For Task 3, there are 578 test sentences with human simplifications, matching to 103 test abstracts
with human simplifications.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Official Submissions</title>
        <p>We created runs for all three tasks of the track, which we discuss in order.</p>
        <sec id="sec-2-2-1">
          <title>Task 1 This task asks to retrieve passages to include in a simplified summary.</title>
          <p>We submitted six runs in total, shown in Table 1. We first submitted four baseline runs focusing on
regular information retrieval effectiveness. Two are vanilla baseline runs on an Anserini index, using
either BM25 or BM25+RM3 with default settings [7] (https://github.com/castorini/pyserini). The other
two runs are neural cross-encoder rerankings of these runs, based on zero-shot application of an
MSMARCO trained ranker (https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2), reranking the
top 100 of either the BM25 or the BM25+RM3 baseline run.</p>
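          <p>As an illustration, a first-stage BM25 ranking with a zero-shot cross-encoder reranking step can be
wired together as in the sketch below, using Pyserini and Sentence-Transformers; the index path and the
JSON field name are assumptions for illustration, not the exact submitted configuration.</p>
          <preformat>
# Sketch: BM25 first stage (Anserini/Pyserini) plus zero-shot cross-encoder reranking.
# Assumptions: a local Lucene index at "indexes/simpletext-abstracts" whose stored JSON
# documents have a "contents" field; neither is the official track setup.
import json

from pyserini.search.lucene import LuceneSearcher
from sentence_transformers import CrossEncoder

searcher = LuceneSearcher("indexes/simpletext-abstracts")  # hypothetical index path
searcher.set_bm25(k1=0.9, b=0.4)   # default-style BM25; searcher.set_rm3() adds RM3 expansion
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")  # zero-shot MS MARCO ranker

def retrieve(query, k=100):
    """Return (docid, score) pairs: BM25 top-k reranked by the cross-encoder."""
    hits = searcher.search(query, k=k)
    texts = [json.loads(searcher.doc(hit.docid).raw())["contents"] for hit in hits]
    scores = reranker.predict([(query, text) for text in texts])
    order = sorted(range(len(hits)), key=lambda i: scores[i], reverse=True)
    return [(hits[i].docid, float(scores[i])) for i in order]
          </preformat>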
          <p>We submitted two further runs that filter for median FKGL, both for the top 100 and
top 1K cross-encoder rerankings, following the Complexity-Aware Ranking approach of [8]. These runs
simply filter out the most complex abstracts per request, using a standard readability measure. Each run
aims to remove up to 50% of the results, with the remaining abstracts in the same relevance order
as in the original run.</p>
          <p>[Table 1 fragment, Task 3 runs: GPT-2 sentence level; GPT-2 sentence level, source checked;
GPT-2 sentence level, source checked, merged into abstracts; GPT-2 abstract level, source checked;
Wiki-Auto and Cochrane trained BART simplification at sentence, paragraph, and document level.]</p>
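          <p>A minimal sketch of this complexity-aware filtering step is shown below, assuming the reranked
results are available as (docid, score, abstract text) tuples and using the textstat implementation of
FKGL; the exact submitted implementation may differ in details.</p>
          <preformat>
# Sketch: drop abstracts above the per-request median FKGL, keeping the relevance order.
import statistics

import textstat  # provides flesch_kincaid_grade(); any readability library could be substituted

def complexity_aware_filter(ranking):
    """ranking: list of (docid, score, abstract_text) in relevance order."""
    fkgl = {docid: textstat.flesch_kincaid_grade(text) for docid, _, text in ranking}
    median = statistics.median(fkgl.values())
    # Drop the less readable half (FKGL above the median); order of survivors is unchanged.
    too_complex = {docid for docid, grade in fkgl.items() if grade > median}
    return [item for item in ranking if item[0] not in too_complex]
          </preformat>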
          <p>As the train data is limited, and none of the approaches above are specific to scientific text, we also
experimented with domain adaptation approaches in post-submission experiments.</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Task 2 This task asks to identify and explain difficult concepts.</title>
          <p>We submitted three runs, also shown in Table 1. For Task 2.1 on complexity spotting, we submitted a
single run. As sentences have a limited number of words, we observed that naive baseline approaches
can already obtain reasonable performance. Hence, our submission uses idf-based term weighting
to locate the rarest terms. Specifically, we used all train and test sentences combined as a reference
corpus to calculate document (or rather sentence) frequencies, and use this to rank each term in the
source sentence by increasing DF (or decreasing IDF).</p>
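          <p>The sketch below illustrates this baseline, assuming the combined train and test sentences are
available as a list of strings; the simple tokenization is an assumption for illustration.</p>
          <preformat>
# Sketch: DF/IDF-based complexity spotting over a sentence collection.
import re
from collections import Counter

def sentence_frequencies(sentences):
    """Count in how many sentences each lower-cased term occurs."""
    df = Counter()
    for sentence in sentences:
        df.update(set(re.findall(r"[a-z0-9-]+", sentence.lower())))
    return df

def spot_terms(sentence, df, k=5):
    """Rank the terms of a sentence by increasing sentence frequency (rarest first)."""
    terms = set(re.findall(r"[a-z0-9-]+", sentence.lower()))
    return sorted(terms, key=lambda term: df.get(term, 0))[:k]
          </preformat>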
          <p>For Task 2.3, we developed an approach to rank definitions or explanations for a given sentence and
term pair. However, the provided test data contained only unmatched sets of scientific sentences and
other sentences. Hence we submitted two runs that only look at the textual similarity of the large set of
provided 'other' sentences.</p>
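          <p>The sketch below illustrates such a similarity-based ranking; since the similarity model of the
submitted runs is not detailed here, a plain TF-IDF cosine similarity is assumed purely for illustration.</p>
          <preformat>
# Sketch: rank candidate 'other' sentences by textual similarity to a (sentence, term) pair.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_candidates(sentence, term, other_sentences, k=10):
    """Return the k most similar 'other' sentences with their similarity scores."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(other_sentences + [f"{term} {sentence}"])
    query_vec = matrix[len(other_sentences)]          # the combined term + sentence query
    sims = cosine_similarity(query_vec, matrix[:len(other_sentences)]).ravel()
    order = sims.argsort()[::-1][:k]
    return [(other_sentences[i], float(sims[i])) for i in order]
          </preformat>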
        </sec>
        <sec id="sec-2-2-3">
          <title>Task 3 This task asks to simplify scientific text.</title>
          <p>We submitted the twelve runs shown in Table 1. Our first set of experiments continues the earlier
experiments with a GPT-2 model trained in an unsupervised way. First, we use the basic pretrained
model on sentence level input. Second, we check all output against the source to avoid hallucination,
and submit this checked version. Third, we merge the sentence level simplifications to create abstract
level simplifications. Fourth, we run the model on long abstract level input, to create direct abstract
level simplifications. All these four runs use the exact same GPT-2 text simplification model.</p>
          <p>Our second set of experiments is with different BART trained models, either trained on Wiki-Auto
or on aligned lay summaries from Cochrane (a home-grown Cochrane-Auto). This leads to six runs,
using either Wiki-Auto or Cochrane train data, and using either sentence level, paragraph level, or document
(abstract) level input. Each of these six runs uses a different model, due to the different train input
matching the output settings. (Post-submission experiments are marked with † in the tables.)</p>
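          <p>Generation with any of these fine-tuned models follows the usual sequence-to-sequence recipe, as in
the sketch below; the checkpoint path is a hypothetical local path standing in for one of the fine-tuned
BART models described above, and the decoding settings are assumptions.</p>
          <preformat>
# Sketch: generate a simplification with a fine-tuned seq2seq (BART-style) model.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "models/bart-cochrane-auto-sentence"  # hypothetical local fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def simplify(text, max_new_tokens=256):
    """Simplify a sentence, paragraph, or abstract, depending on the model's training."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    output = model.generate(**inputs, num_beams=4, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
          </preformat>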
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Results</title>
      <p>In this section, we will present the results of our experiments, in three self-contained subsections
following the CLEF 2024 SimpleText Track tasks.</p>
      <sec id="sec-3-1">
        <title>3.1. Task 1: Content Selection</title>
        <p>We discuss our results for Task 1, asking to retrieve passages to include in a simplified summary.
3.1.1. Retrieval Effectiveness
Table 2 shows the performance of the Task 1 submissions on the train data. Let us first observe how
different our runs are from the pooled runs, as those were based exclusively on the organizers' provided
Elasticsearch index and the particular keyword query. Due to the different tokenization and indexing
choices in our Anserini index, the fraction of unjudged documents in the top 10 is high. First, the BM25
run has 36.6% and the BM25+RM3 run has 41.6% unjudged in the top 10. Second, the cross-encoder
rerankings have 27.5% (CE top 100) and 30.8% (CE top 1K) unjudged, slightly lower due to similar neural
rerankers contributing to the pool in earlier years. Third, the complexity-aware filtered runs have 34.4%
(CAR top 100) and 35.3% (CAR top 1K). Fourth, the domain adapted runs have no less than 50.9–72.2%
unjudged in the top 10. In this light, the scores of the domain adapted runs on the train data are truly
impressive.</p>
        <p>We make a number of observations on the performance on the train set. First, the two Anserini
baselines using BM25 with or without RM3 query expansion perform very reasonably, with an NDCG@10
of 0.36-0.39 on the train data. The RM3 model underperforms the vanilla BM25 on all measures for train,
but has a higher fraction of unjudged documents. The Anserini index used differs from the organizers'
provided Elasticsearch index that dominates the pool of the train data. Second, the zero-shot reranking
with a cross-encoder leads to an improvement of retrieval effectiveness over the BM25 first stage ranker,
with the top 100 reranking scoring 0.42 NDCG@10 on train. The bpref measure is less sensitive to
pooling bias, and the highest bpref score of the top 1K reranking demonstrates the effectiveness of
these runs. Third, we observe a favorable outcome for the domain adaptation of the models. The base
scores are lower than GPL domain adaptation, and our novel remining strategy for continuous domain
adaptation improves over GPL, the state-of-the-art for domain adaptation.</p>
        <p>Table 3 shows the performance of the Task 1 submissions on the test data. We submitted four runs
focusing purely on standard retrieval effectiveness, and two runs addressing text complexity. On the
test data, our submissions were pooled, except for the combined score runs: we observe 7.7% (CAR top
100) and 6.0% (CAR top 1K) unjudged documents in the top 10 of each submission. The domain
adapted runs, which were not pooled, have no less than 39.0–60.0% unjudged in the top 10.</p>
        <p>We make a number of observations. First, we observe again that the two Anserini baselines using
BM25 with or without RM3 query expansion perform very reasonably, with an NDCG@10 of 0.38-0.39
on the test data. The RM3 model now outperforms the vanilla BM25 on all measures except MAP for
test.</p>
        <p>Second, the zero-shot reranking with a cross-encoder does not lead to an improvement of retrieval
effectiveness over the BM25 first stage ranker on the test data. Again, the bpref measure is less sensitive
to pooling bias, and the highest bpref score of the top 1K reranking demonstrates the effectiveness of
these runs.</p>
        <p>Third, the complexity-aware ranking runs filtering out the most complex abstracts show competitive
performance. Although these runs intentionally avoid complex, but topically relevant, results, they
obtain higher precision scores and similar NDCG scores, and are almost on par with the runs retrieving
complex results.</p>
        <p>Fourth, recall that the domain adapted runs have not contributed to the pool and have high fractions
of unjudged documents (no less than 39.0–60.0% unjudged in the top 10). In this light, again, the scores
of the domain adapted runs are quite impressive. We observe again the relative score increase from base
ranking, to standard GPL domain adaptation, and the GPL remining approach. We observe again that
our novel remining strategy for continuous domain adaptation improves over GPL, the state-of-the-art
for domain adaptation.
3.1.2. Analysis
This section analyzes various aspects of the submitted runs, where we pay particular attention to two
aspects of core interest to the task and the overall use case of the track in which a lay user is accessing
complex scientific text.</p>
        <p>Credibility The first aspect of interest is the credibility of the retrieved information. One may
assume that any scientific paper published after peer review has passed a number of quality control
steps, and hence that all retrieved abstracts have high credibility. However, it is
well-known that lay users have difficulty separating authoritative from non-authoritative publications,
as they are not able to discern the same cues as experts. For example, they are unaware of the reputation
of the authors [9]. How authoritative are the results retrieved for our lay user?</p>
        <p>Readability The second aspect of interest is the readability of the retrieved information. We have
seen above that the approaches are effective for retrieving relevant scientific papers. However, although
topically relevant, these papers may contain very advanced scientific information that is not easy to
understand and interpret for lay users. Recall that this was the motivation to use complexity-aware
retrieval approaches [8]. Can complexity-aware search help retrieve relevant and accessible scientific
text?</p>
        <p>Table 4 shows the Flesch-Kincaid Grade Level (FKGL) readability score of the top 10 results retrieved
for our lay user's popular science query. We observe that the lexical and neural rankers retrieve topically
relevant information without taking the text complexity into account. Both lexical and neural rankers
retrieve information with an FKGL of 14-15, corresponding to university level text complexity. The same
holds for the domain adapted runs. This is not surprising, as we have an extensive scientific corpus
with an average text complexity of 14-15 reflecting this.</p>
        <p>Earlier we observed that our complexity-aware retrieval systems obtained almost the same
retrieval effectiveness. Hence this complexity-aware approach was able to rank
a similar number of topically relevant documents in the top 10 as standard lexical and neural ranking
approaches. But is the complexity-aware approach able to rank more accessible content for our lay user
issuing a popular science query?</p>
        <p>Table 4 indeed shows favorable readability levels for the complexity-aware search, with an FKGL of
12-13 corresponding to the exit level of compulsory education. Hence the complexity-aware search
approach is able to retrieve relevant and accessible content for our lay user. The retrieved source abstracts
have a similar readability level as targeted by the text simplification systems discussed in Section 3.3.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Task 2: Complexity Spotting</title>
        <p>We continue with Task 2, asking to identify and explain difficult concepts.
3.2.1. Results
Task 2.1 Table 5 shows the performance of the Task 2 submission on the test data. At the time of
writing, these scores were released as (preliminary) scores without much further explanation.</p>
        <p>The official results seem to focus entirely on recall aspects, i.e., retrieving all terms annotated by the
experts. Our simple approach is not expected to do well in terms of recall. We conduct a more
precision-oriented evaluation below as additional analysis.</p>
        <p>Task 2.3 There is no train data released for Task 2.3, nor were any test results made available at the time of
writing. We hope and expect that these results will be released in time for the CLEF conference in
Grenoble.
3.2.2. Analysis
Table 6 shows the performance of the Task 2 submission on the train and test data. Due to the very
limited data available, we treat the task here as spotting any terms. We included the complexity level as a
graded score, in order to filter the Boolean measures on a minimal relevance score (tables not shown, as
they exhibit the same qualitative pattern, but at an obviously lower score level). On the train and test
data of earlier years, performance peaked around spotting 3 terms per sentence. Due to the many experts
annotating the same set of sentences, we see that both recall and F1 increase over ranks, and the highest
scores are obtained when spotting 5 rare terms per sentence. Overall, our simple approach achieves an MRR
of 0.2542 (train) and 0.2741 (test) and, taking the difficulty level into account, an NDCG@5 of 0.1446
(train) and 0.1469 (test).</p>
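        <p>The MRR reported above can be computed as in the sketch below, assuming per-sentence lists of
ranked predicted terms and sets of annotated reference terms; exact string matching is assumed purely
for illustration.</p>
        <preformat>
# Sketch: mean reciprocal rank of the first predicted term that matches a reference term.
def mean_reciprocal_rank(ranked_terms, references):
    """ranked_terms: ranked predictions per sentence; references: matching sets of terms."""
    total = 0.0
    for predictions, refs in zip(ranked_terms, references):
        for rank, term in enumerate(predictions, start=1):
            if term in refs:
                total += 1.0 / rank
                break
    return total / len(ranked_terms)
        </preformat>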
        <p>Table 7 shows an example sentence with references. In this example, our approach predicts 5 terms
that match one of the annotated references. The top ranked candidate matches one of the references
annotated as difficult ("d"). There is a striking number of 16 references, with about 11 unique reference
terms. Some references occur in variants (e.g., "simulated F1 car" is rated "d", whereas "F1 car" is rated
"e"). Several references do not literally occur in the source sentence: we observe differences in case
("ResNet-18" vs. "resnet-18"), plural/singular ("labels" vs. "label", "images" vs. "image"), and verb tense
("is fed" vs. "to be fed", "outputs" vs. "to output").</p>
        <p>Table 8 shows the frequency of spotted terms on the train data. We observe a striking variation,
with 53 sentences having a single complex term, and 12 sentences having more than 15 complex terms. This
variation makes predicting all terms nigh impossible, and makes averaging over terms an
unreliable indicator of the per-sentence performance. Evaluation over the sets of top retrieved terms, as
we did in Table 6, indeed shows reasonable performance for our basic approach.</p>
        <p>[Table 7 fragment, sentence G06.2_2810968146_2. Source: "The model is a ResNet-18 variant, which is fed in images from the front of a simulated F1 car,
and outputs optimal labels for steering, throttle, braking." Reference terms: ['ResNet-18 variant', 'braking', 'braking', 'f1 car', 'front', 'image', 'model', 'optimal label',
'resnet18', 'simulated F1 car', 'steering', 'steering', 'throttle', 'throttle', 'to be fed', 'to output'], with difficulty labels
['d', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'd', 'd', 'e', 'e', 'e', 'e', 'e', 'm']. Predicted terms: ['resnet-18', 'throttle', 'braking', 'f1', 'fed'].]</p>
        <p>The recall of our approach is relatively low, as the baseline rarest-term approach cannot find
multiword phrases. In addition, many of the ground truth terms do not literally appear in the sentence, and
require case folding, morphological normalization, or even more complex transformations to correctly
align with the exact orthography of the scientific text.</p>
        <p>Table 9 quantifies how often the spotted term or phrase occurs literally in the sentences. We
observe a fraction varying from 6.5% to 18.7%. Many of these cases concern morphological normalization
that is useful to conflate similar concepts across different sentences (base forms of verbs, singular forms
of nouns, etc.). However, the evaluation measures treat such cases as failed matches, so recall-oriented
measures should be treated with care.</p>
        <p>[Table fragment, not reconstructed: BERTScore R 0.93 / 0.93, F1 0.92 / 0.92.]</p>
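        <p>The literal-occurrence statistic behind Table 9 can be approximated as in the sketch below, assuming
a list of (sentence, reference term) pairs from the ground truth; the exact matching rules used for the
table are not reproduced here.</p>
        <preformat>
# Sketch: fraction of reference terms that occur verbatim in their source sentence,
# optionally after case folding.
def literal_match_rate(pairs, case_fold=False):
    hits = 0
    for sentence, term in pairs:
        s, t = (sentence.lower(), term.lower()) if case_fold else (sentence, term)
        hits += t in s
    return hits / len(pairs)
        </preformat>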
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Task 3: Text Simplification</title>
        <p>We continue with Task 3, asking to simplify scientific text.
3.3.1. Evaluation
Table 11 shows the results on the train data, both in terms of text statistics and in terms of evaluation
against the human reference simplifications. (Some of the differences in the number of sentences/abstracts
are due to sources not included in the test source file; this particularly concerns very short fragments
from biomedical literature added as additional train data, but not part of the SimpleText corpus.) We
make a number of observations. First, looking at the GPT-2 models, we see that both sentence level and
abstract level text simplification considerably bring down the FKGL measure, and obtain reasonable
SARI and BLEU scores against the reference simplifications. The abstract level simplification leads to
deletions of entire sentences, with 50% fewer tokens than the source, but still outperforms the sentence
level simplification that retains all sentences. Second, the BART models trained on Wiki-Auto and on
Cochrane-Auto lay summaries significantly outperform the GPT-2 model on BLEU, with scores of 0.37,
signaling high n-gram overlap with the human reference simplifications. For abstract level simplification
it is encouraging to see that the Cochrane model trained on scientific data slightly outperforms the
Wiki-Auto trained model. Third, the paragraph and document level models trained on Wiki-Auto and
Cochrane again do not outperform the sentence level simplifications, under the conditions of the task's
train data. The train data is derived from the sentence level scientific text simplification references from
the earlier years of the track. Proper document level text simplification approaches lead to considerable
deletions, and perform reasonably given their far more succinct output.</p>
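        <p>For reference, the sketch below shows how such scores can be obtained with standard tooling,
assuming one human reference per source text; the official evaluation scripts may differ in details such
as tokenization and multi-reference handling.</p>
        <preformat>
# Sketch: SARI, BLEU, and FKGL for a simplification run.
import evaluate   # Hugging Face evaluate library
import textstat

sari = evaluate.load("sari")
bleu = evaluate.load("sacrebleu")

def score_run(sources, outputs, references):
    """sources, outputs, references: parallel lists of strings (one reference per source)."""
    return {
        "SARI": sari.compute(sources=sources, predictions=outputs,
                             references=[[ref] for ref in references])["sari"],
        "BLEU": bleu.compute(predictions=outputs,
                             references=[[ref] for ref in references])["score"],
        "FKGL": sum(textstat.flesch_kincaid_grade(out) for out in outputs) / len(outputs),
    }
        </preformat>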
        <p>Table 12 shows the Task 3 results for both sentence-level (top) and abstract-level (bottom) scientific
text simplifications. We again make a number of observations. First, looking at the GPT-2 models, we
see again low FKGL scores indicating favorable readability, with reasonable SARI and BLEU scores. The
abstract level simplification clearly outperforms the merged sentence level simplifications, despite a far
more succinct output. Second, looking at the BART models trained on Wiki-Auto and on Cochrane-Auto
lay summaries, we see that the Cochrane model trained on scientific data clearly outperforms
the Wiki-Auto trained model on SARI for document level text simplification. Third, the paragraph
and document level models trained on Wiki-Auto and Cochrane again do not outperform the sentence
level simplifications, under the conditions of the task's test data based on aggregated human reference
sentence simplifications. These models take discourse structure into account, or may merge or reorder
sentences, and are less focused on single sentence wordsmithing, or promoting sentence splits.
3.3.2. Analysis
In this section, we analyze the output of our systems by realigning the simplified text predictions
to the source sentences.</p>
        <p>[Table 13: example abstract-level simplification of document 2111507945 (query G07.1), showing the
word-level deletions and insertions relative to the eight source input sentences; not reproduced here.]</p>
        <p>Controlled Creativity Text simplification models are based on generative large language models.
For example, one of the models we used is a GPT-2 model [10] called Keep it Simple (KiS). The model
is based on GPT-2 medium, using a straightforward unsupervised training task with an explicit loss in
terms of fluency, saliency, and simplicity. Such models are used in generative mode, generating the
output in a fairly unconstrained way in order to ensure none of the input is lost (in particular for longer
input). As a result, there is also a chance that the model continues to generate output after the source
has been fully simplified. This can cause the model to overgenerate and produce spurious content.</p>
        <p>Table 13 shows an example output simplification, combining the input sentences belonging to the
abstract of document 2111507945 retrieved for query G07.1. We show deletions and insertions
relative to the source input sentences (in this case 8 in total). Many simplifications are revisions of
the input, but we also observe that sometimes an entire sentence is inserted. Modern
models such as ours generate the simplification freely, which may lead to additional output being
generated at the end. Recall that the example in Table 13 merges 8 separate input sentences from the
train data, making this occur multiple times, at the end of three of the inputs.</p>
        <p>Spurious Content We analyze the frequency of spurious content in our runs. For human readers,
detecting such sentences by simply inspecting the output is hard, as they are very reasonable completions
generated with awareness of the preceding context. We experimented with unsupervised approaches to
tackle the generation of spurious content, by post-processing the output in relation to the original
input. Similar to the edits shown in the table, we align input and output, and remove any output sentence
that has been inserted without grounding in the input.</p>
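        <p>A minimal sketch of such a grounding check is shown below; it uses a simple token-overlap heuristic
with an assumed threshold, whereas the actual post-processing realigns output sentences to the input as
described above.</p>
        <preformat>
# Sketch: drop output sentences that share too little vocabulary with the source input.
import re

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def remove_spurious(source, output, min_overlap=0.4):
    """Keep only output sentences sufficiently grounded in the source (threshold is an assumption)."""
    source_tokens = tokens(source)
    kept = []
    for sentence in re.split(r"[.!?]+\s+", output.strip()):
        overlap = len(tokens(sentence).intersection(source_tokens)) / max(len(tokens(sentence)), 1)
        if overlap >= min_overlap:
            kept.append(sentence)
    return " ".join(kept)
        </preformat>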
        <p>Table 14 quantifies how often such spurious generation occurs. We make a number of observations.
First, the spurious generation is not infrequent. Some systems have a marginal number of cases,
which may be a result of imperfect alignment due to short sentences or changing word orders. Other
systems have many cases, up to 1,390 sentences or 29% (and 111 abstracts or 14%) of the input for the
unconstrained GPT2 model.</p>
        <p>Second, in the GPT-2 sentence level case, we remove this additional content in a post-processing step,
ensuring all the output is grounded in input sentences. This effectively removes spurious content
from the runs, and also leads to better performance in Table 12.</p>
        <p>Third, while our post-processing already has a favorable effect on the evaluation measures, we feel
that it has great benefits not reflected by these scores. Our post-processing is specifically, and only,
removing spurious generation (or "hallucination") from the output. These results highlight and quantify
the severity of this problem in generative text simplification models such as our GPT-2 model. At the
same time, it offers a practical approach to tackle this undesirable aspect head-on.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion and Conclusions</title>
      <p>This paper detailed the University of Amsterdam’s participation in the CLEF 2024 SimpleText track. We
conducted a range of experiments, for each of the three tasks of the track.</p>
      <p>For Task 1 on Content Selection, we observed very solid performance for zero-shot neural reranking,
as well as competitive effectiveness for complexity-aware rankers that purposely avoid retrieving results
with a high text complexity.</p>
      <p>For Task 2 on Complexity Spotting, we submitted preliminary approaches based on standard term
weighting, and observed that naive approaches can help locate difficult terms.</p>
      <p>For Task 3 on Text Simplification, we experimented with a range of models and approaches, and
observed that sentence-level simplification approaches can be very effective in reducing the complexity of
scientific text, and that paragraph and abstract level simplifications lead to far shorter output, including
whole sentence deletions.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This research was conducted as part of the final research projects of the Master in Artificial Intelligence at the
University of Amsterdam. We thank the track and task organizers for their amazing service and effort in making
realistic benchmarks for scientific text simplification available. Jaap Kamps is partly funded by the Netherlands
Organization for Scientific Research (NWO CI # CISC.CC.016, NWO NWA # 1518.22.105), the University of
Amsterdam (AI4FinTech program), and ICAI (AI for Open Government Lab). Views expressed in this paper are
not necessarily shared or endorsed by those funding the research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the CLEF 2024 SimpleText track: Improving access to scientific texts</article-title>
          , in: L.
          <string-name>
            <surname>Goeuriot</surname>
          </string-name>
          , et al. (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF</source>
          <year>2024</year>
          ), Lecture Notes in Computer Science, Springer,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] E. SanJuan, et al., Overview of the CLEF 2024 SimpleText task 1: Retrieve passages to include in a simplified summary, in: G. Faggioli, et al. (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), CEUR Workshop Proceedings, CEUR-WS.org, 2024.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] G. M. D. Nunzio, et al., Overview of the CLEF 2024 SimpleText task 2: Identify and explain difficult concepts, in: G. Faggioli, et al. (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), CEUR Workshop Proceedings, CEUR-WS.org, 2024.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] L. Ermakova, et al., Overview of the CLEF 2024 SimpleText task 3: Simplify scientific text, in: G. Faggioli, et al. (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), CEUR Workshop Proceedings, CEUR-WS.org, 2024.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] J. D'Souza, et al., Overview of the CLEF 2024 SimpleText task 4: Track the state-of-the-art in scholarly publications, in: G. Faggioli, et al. (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), CEUR Workshop Proceedings, CEUR-WS.org, 2024.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] L. Ermakova, T. Miller, A. Bosser, V. M. Palma-Preciado, G. Sidorov, A. Jatowt, Overview of JOKER - CLEF-2023 track on automatic wordplay analysis, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, A. Giachanou, D. Li, M. Aliannejadi, M. Vlachos, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction - 14th International Conference of the CLEF Association, CLEF 2023, Thessaloniki, Greece, September 18-21, 2023, Proceedings, volume 14163 of Lecture Notes in Computer Science, Springer, 2023, pp. 397–415. URL: https://doi.org/10.1007/978-3-031-42448-9_26. doi:10.1007/978-3-031-42448-9_26.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] J. Lin, X. Ma, S. Lin, J. Yang, R. Pradeep, R. F. Nogueira, Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations, in: F. Diaz, C. Shah, T. Suel, P. Castells, R. Jones, T. Sakai (Eds.), SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, ACM, 2021, pp. 2356–2362. URL: https://doi.org/10.1145/3404835.3463238. doi:10.1145/3404835.3463238.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] L. Ermakova, J. Kamps, Complexity-aware scientific literature search: Searching for relevant and accessible scientific text, in: G. M. D. Nunzio, F. Vezzani, L. Ermakova, H. Azarbonyad, J. Kamps (Eds.), Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024, ELRA and ICCL, Torino, Italia, 2024, pp. 16–26. URL: https://aclanthology.org/2024.determit-1.2.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] J. Kamps, The impact of author ranking in a library catalogue, in: G. Kazai, C. Eickhoff, P. Brusilovsky (Eds.), Proceedings of the 4th ACM Workshop on Online Books, Complementary Social Media and Crowdsourcing, BooksOnline 2011, Glasgow, United Kingdom, October 24, 2011, ACM, 2011, pp. 35–40. URL: https://doi.org/10.1145/2064058.2064067. doi:10.1145/2064058.2064067.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] P. Laban, T. Schnabel, P. N. Bennett, M. A. Hearst, Keep it simple: Unsupervised simplification of multi-paragraph text, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL/IJCNLP 2021), Association for Computational Linguistics, 2021, pp. 6365–6378. URL: https://doi.org/10.18653/v1/2021.acl-long.498.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>