<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>University of Amsterdam at the CLEF 2024 SimpleText Track</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Bakker</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Göksenin Yüksel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaap Kamps</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Amsterdam</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper reports on the University of Amsterdam's participation in the CLEF 2024 SimpleText track. Our overall goal is to investigate and remove barriers that prevent the general public from accessing scientific literature, hoping to promote science literacy. Our specific focus is to investigate the relation between the topical relevance and the text complexity of scientific text, as well as to develop text simplification approaches for scientific text. Our main findings are the following. First, for lay person scientific passage retrieval, both lexical and zero-shot retrieval models perform well, with only a marginal loss of performance for complexity-aware models that avoid retrieving passages with low readability. Second, for spotting complex concepts, relatively simple approaches based on corpus statistics show competitive precision but low recall. Third, for scientific text simplification, different models generate different simplifications, all with reasonable overlap with human reference simplifications. Fourth, document or abstract level text simplification incorporates discourse structure and makes sentence deletions, which holds great promise for improving the output quality and succinctness for lay users of scientific text.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Storage and Retrieval</kwd>
        <kwd>Natural Language Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Experimental Setup</title>
      <sec id="sec-2-1">
        <title>2.1. Experimental Data</title>
        <p>In this section, we detail our approach for the three CLEF 2024 SimpleText track tasks.
For details of the exact task setup and results we refer the reader to the detailed overview of the track
in [6]. The basic ingredients of the track are:
Corpus The CLEF 2024 SimpleText Corpus consists of 4.9 million bibliographic records, including 4.2
million abstracts, and detailed information about authors/affiliations/citations.</p>
        <p>Context There are 40 popular science articles, with 20 from The Guardian
(https://www.theguardian.com/science) and 20 from Tech Xplore (https://techxplore.com/).
Requests For Task 1, there are 176 requests: 109 requests are based on The Guardian and 67 on
Tech Xplore. Abstracts retrieved for these requests form the corpus for the remaining Tasks 2
and 3. This expands the topic set of earlier years, which had 64 verbose questions on the Guardian
articles, with short 1-4 word queries.</p>
        <p>Train Data For Task 1, there are relevance judgments for 64 requests (corresponding to 20 Guardian
articles, G01–G20, and 5 Tech Xplore articles, T01–T05), with 61 queries having 10 or more
relevant abstracts.</p>
        <p>For Task 2, there are 576 train sentences with ground truth on complex terms/concepts for a total
of 2,579 terms, and 317 test sentences (4.5 per query). For Task 2.3, an additional set of 3,815 other
sentences is provided.</p>
        <p>For Task 3, there are 958 train sentences with human simplifications, matching to 175 train
abstracts with human simplifications. There are 4,797 test sentences, and a matching set of 182
test abstracts.</p>
        <p>Test Data For Task 1, the ultimate test collection consists of 30 queries: G1.C1–G10.C1 (10 on The
Guardian) and T06–T11 (20 on Tech Xplore), with a total of 4,854 judgments (128.5 per query). All 30
queries have 29 or more relevant abstracts.</p>
        <p>For Task 2, there are 313 test sentences with ground truth on complex terms/concepts for a total
of 1,440 terms (4.6 per query).</p>
        <p>For Task 3, there are 578 test sentences with human simplifications, matching to 103 test abstracts
with human simplifications.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Official Submissions</title>
        <p>We created runs for all three tasks of the track, which we discuss in order.</p>
        <sec id="sec-2-2-1">
          <title>Task 1 This task asks to retrieve passages to include in a simplified summary.</title>
          <p>We submitted six runs in total, shown in Table 1. We first submitted four baseline runs focusing on
regular information retrieval effectiveness. Two are vanilla baseline runs on an Anserini index, using
either BM25 or BM25+RM3 with default settings [7] (https://github.com/castorini/pyserini). The other
two runs are neural cross-encoder rerankings of these runs, based on zero-shot application of an
MSMARCO trained ranker (https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2), reranking the
top 100 of either the BM25 or the BM25+RM3 baseline run.</p>
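          <p>As an illustration, a first-stage BM25 ranking with a zero-shot cross-encoder reranking step can be
wired together as in the sketch below, using Pyserini and Sentence-Transformers; the index path and the
JSON field name are assumptions for illustration, not the exact submitted configuration.</p>
          <preformat>
# Sketch: BM25 first stage (Anserini/Pyserini) plus zero-shot cross-encoder reranking.
# Assumptions: a local Lucene index at "indexes/simpletext-abstracts" whose stored JSON
# documents have a "contents" field; neither is the official track setup.
import json

from pyserini.search.lucene import LuceneSearcher
from sentence_transformers import CrossEncoder

searcher = LuceneSearcher("indexes/simpletext-abstracts")  # hypothetical index path
searcher.set_bm25(k1=0.9, b=0.4)   # default-style BM25; searcher.set_rm3() adds RM3 expansion
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")  # zero-shot MS MARCO ranker

def retrieve(query, k=100):
    """Return (docid, score) pairs: BM25 top-k reranked by the cross-encoder."""
    hits = searcher.search(query, k=k)
    texts = [json.loads(searcher.doc(hit.docid).raw())["contents"] for hit in hits]
    scores = reranker.predict([(query, text) for text in texts])
    order = sorted(range(len(hits)), key=lambda i: scores[i], reverse=True)
    return [(hits[i].docid, float(scores[i])) for i in order]
          </preformat>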
          <p>We submitted two further runs that filter for median FKGL, both for the top 100 and
top 1K cross-encoder rerankings, following the Complexity-Aware Ranking approach of [8]. These runs
simply filter out the most complex abstracts per request, using a standard readability measure. Each run
aims to remove up to 50% of the results, with the remaining abstracts in the same relevance order
as in the original run.</p>
          <p>[Table 1 fragment, Task 3 runs: GPT-2 sentence level; GPT-2 sentence level, source checked;
GPT-2 sentence level, source checked, merged into abstracts; GPT-2 abstract level, source checked;
Wiki-Auto and Cochrane trained BART simplification at sentence, paragraph, and document level.]</p>
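          <p>A minimal sketch of this complexity-aware filtering step is shown below, assuming the reranked
results are available as (docid, score, abstract text) tuples and using the textstat implementation of
FKGL; the exact submitted implementation may differ in details.</p>
          <preformat>
# Sketch: drop abstracts above the per-request median FKGL, keeping the relevance order.
import statistics

import textstat  # provides flesch_kincaid_grade(); any readability library could be substituted

def complexity_aware_filter(ranking):
    """ranking: list of (docid, score, abstract_text) in relevance order."""
    fkgl = {docid: textstat.flesch_kincaid_grade(text) for docid, _, text in ranking}
    median = statistics.median(fkgl.values())
    # Drop the less readable half (FKGL above the median); order of survivors is unchanged.
    too_complex = {docid for docid, grade in fkgl.items() if grade > median}
    return [item for item in ranking if item[0] not in too_complex]
          </preformat>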
          <p>As the train data is limited, and none of the approaches above are specific to scientific text, we also
experimented with domain adaptation approaches in post-submission experiments.</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Task 2 This task asks to identify and explain difficult concepts.</title>
          <p>We submitted three runs, also shown in Table 1. For Task 2.1 on complexity spotting, we submitted a
single run. As sentences have a limited number of words, we observed that naive baseline approaches
can already obtain reasonable performance. Hence, our submission uses idf-based term weighting
to locate the rarest terms. Specifically, we used all train and test sentences combined as a reference
corpus to calculate document (or rather sentence) frequencies, and use this to rank each term in the
source sentence by increasing DF (or decreasing IDF).</p>
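          <p>The sketch below illustrates this baseline, assuming the combined train and test sentences are
available as a list of strings; the simple tokenization is an assumption for illustration.</p>
          <preformat>
# Sketch: DF/IDF-based complexity spotting over a sentence collection.
import re
from collections import Counter

def sentence_frequencies(sentences):
    """Count in how many sentences each lower-cased term occurs."""
    df = Counter()
    for sentence in sentences:
        df.update(set(re.findall(r"[a-z0-9-]+", sentence.lower())))
    return df

def spot_terms(sentence, df, k=5):
    """Rank the terms of a sentence by increasing sentence frequency (rarest first)."""
    terms = set(re.findall(r"[a-z0-9-]+", sentence.lower()))
    return sorted(terms, key=lambda term: df.get(term, 0))[:k]
          </preformat>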
          <p>For Task 2.3, we developed an approach to rank definitions or explanations for a given sentence and
term pair. However, the provided test data contained only unmatched sets of scientific sentences and
other sentences. Hence we submitted two runs that only look at the textual similarity of the large set of
provided 'other' sentences.</p>
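          <p>The sketch below illustrates such a similarity-based ranking; since the similarity model of the
submitted runs is not detailed here, a plain TF-IDF cosine similarity is assumed purely for illustration.</p>
          <preformat>
# Sketch: rank candidate 'other' sentences by textual similarity to a (sentence, term) pair.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_candidates(sentence, term, other_sentences, k=10):
    """Return the k most similar 'other' sentences with their similarity scores."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(other_sentences + [f"{term} {sentence}"])
    query_vec = matrix[len(other_sentences)]          # the combined term + sentence query
    sims = cosine_similarity(query_vec, matrix[:len(other_sentences)]).ravel()
    order = sims.argsort()[::-1][:k]
    return [(other_sentences[i], float(sims[i])) for i in order]
          </preformat>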
        </sec>
        <sec id="sec-2-2-3">
          <title>Task 3 This task asks to simplify scientific text.</title>
          <p>We submitted the twelve runs shown in Table 1. Our first set of experiments continues the earlier
experiments with a GPT-2 model trained in an unsupervised way. First, we use the basic pretrained
model on sentence level input. Second, we check all output against the source to avoid hallucination,
and submit this checked version. Third, we merge the sentence level simplifications to create abstract
level simplifications. Fourth, we run the model on long abstract level input, to create direct abstract
level simplifications. All these four runs use the exact same GPT-2 text simplification model.</p>
          <p>Our second set of experiments is with different BART trained models, either trained on Wiki-Auto
or on aligned lay summaries from Cochrane (a home-grown Cochrane-Auto). This leads to six runs,
using either Wiki-Auto or Cochrane train data, and using either sentence level, paragraph level, or document
(abstract) level input. Each of these six runs uses a different model, due to the different train input
matching the output settings. (Post-submission experiments are marked with † in the tables.)</p>
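          <p>Generation with any of these fine-tuned models follows the usual sequence-to-sequence recipe, as in
the sketch below; the checkpoint path is a hypothetical local path standing in for one of the fine-tuned
BART models described above, and the decoding settings are assumptions.</p>
          <preformat>
# Sketch: generate a simplification with a fine-tuned seq2seq (BART-style) model.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "models/bart-cochrane-auto-sentence"  # hypothetical local fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def simplify(text, max_new_tokens=256):
    """Simplify a sentence, paragraph, or abstract, depending on the model's training."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    output = model.generate(**inputs, num_beams=4, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
          </preformat>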
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Results</title>
      <p>In this section, we will present the results of our experiments, in three self-contained subsections
following the CLEF 2024 SimpleText Track tasks.</p>
      <sec id="sec-3-1">
        <title>3.1. Task 1: Content Selection</title>
        <p>We discuss our results for Task 1, asking to retrieve passages to include in a simplified summary.
3.1.1. Retrieval Effectiveness
Table 2 shows the performance of the Task 1 submissions on the train data. Let us first observe how
different our runs are from the pooled runs, as those were based exclusively on the organizers' provided
Elasticsearch index and the particular keyword query. Due to the different tokenization and indexing
choices in our Anserini index, the fraction of unjudged documents in the top 10 is high. First, the BM25
run has 36.6% and the BM25+RM3 run has 41.6% unjudged in the top 10. Second, the cross-encoder
rerankings have 27.5% (CE top 100) and 30.8% (CE top 1K) unjudged, slightly lower due to similar neural
rerankers contributing to the pool in earlier years. Third, the complexity-aware filtered runs have 34.4%
(CAR top 100) and 35.3% (CAR top 1K). Fourth, the domain adapted runs have no less than 50.9–72.2%
unjudged in the top 10. In this light, the scores of the domain adapted runs on the train data are truly
impressive.</p>
        <p>We make a number of observations on the performance on the train set. First, the two Anserini
baselines using BM25 with or without RM3 query expansion perform very reasonably, with an NDCG@10
of 0.36-0.39 on the train data. The RM3 model underperforms the vanilla BM25 on all measures for train,
but has a higher fraction of unjudged documents. The Anserini index used differs from the organizers'
provided Elasticsearch index that dominates the pool of the train data. Second, the zero-shot reranking
with a cross-encoder leads to an improvement of retrieval effectiveness over the BM25 first stage ranker,
with the top 100 reranking scoring 0.42 NDCG@10 on train. The bpref measure is less sensitive to
pooling bias, and the highest bpref score of the top 1K reranking demonstrates the effectiveness of
these runs. Third, we observe a favorable outcome for the domain adaptation of the models. The base
scores are lower than GPL domain adaptation, and our novel remining strategy for continuous domain
adaptation improves over GPL, the state-of-the-art for domain adaptation.</p>
        <p>Table 3 shows the performance of the Task 1 submissions on the test data. We submitted four runs
focusing purely on standard retrieval effectiveness, and two runs addressing text complexity. On the
test data, our submissions were pooled, except for the combined score runs: we observe 7.7% (CAR top
100) and 6.0% (CAR top 1K) unjudged documents in the top 10 of each submission. The domain
adapted runs, which were not pooled, have no less than 39.0–60.0% unjudged in the top 10.</p>
        <p>We make a number of observations. First, we observe again that the two Anserini baselines using
BM25 with or without RM3 query expansion perform very reasonably, with an NDCG@10 of 0.38-0.39
on the test data. The RM3 model now outperforms the vanilla BM25 on all measures except MAP for
test.</p>
        <p>Second, the zero-shot reranking with a cross-encoder does not lead to an improvement of retrieval
effectiveness over the BM25 first stage ranker on the test data. Again, the bpref measure is less sensitive
to pooling bias, and the highest bpref score of the top 1K reranking demonstrates the effectiveness of
these runs.</p>
        <p>Third, the complexity-aware ranking runs filtering out the most complex abstracts show competitive
performance. Although these runs intentionally avoid complex, but topically relevant, results, they
obtain higher precision scores and similar NDCG scores, and are almost on par with the runs retrieving
complex results.</p>
        <p>Fourth, recall that the domain adapted runs have not contributed to the pool and have high fractions
of unjudged documents (no less than 39.0–60.0% unjudged in the top 10). In this light, again, the scores
of the domain adapted runs are quite impressive. We observe again the relative score increase from base
ranking, to standard GPL domain adaptation, and the GPL remining approach. We observe again that
our novel remining strategy for continuous domain adaptation improves over GPL, the state-of-the-art
for domain adaptation.
3.1.2. Analysis
This section analyzes various aspects of the submitted runs, where we pay particular attention to two
aspects of core interest to the task and the overall use case of the track in which a lay user is accessing
complex scientific text.</p>
        <p>Credibility The first aspect of interest is the credibility of the retrieved information. One may
assume that any scientific paper published after peer review has passed a number of quality control
steps, and hence that all retrieved abstracts have high credibility. However, it is
well-known that lay users have difficulty separating authoritative from non-authoritative publications,
as they are not able to discern the same cues as experts. For example, they are unaware of the reputation
of the authors [9]. How authoritative are the results retrieved for our lay user?</p>
        <p>Readability The second aspect of interest is the readability of the retrieved information. We have
seen above that the approaches are effective for retrieving relevant scientific papers. However, although
topically relevant, these papers may contain very advanced scientific information that is not easy to
understand and interpret for lay users. Recall that this was the motivation to use complexity-aware
retrieval approaches [8]. Can complexity-aware search help retrieve relevant and accessible scientific
text?</p>
        <p>Table 4 shows the Flesch-Kincaid Grade Level (FKGL) readability score of the top 10 results retrieved
for our lay user's popular science query. We observe that the lexical and neural rankers retrieve topically
relevant information without taking the text complexity into account. Both lexical and neural rankers
retrieve information with an FKGL of 14-15, corresponding to university level text complexity. The same
holds for the domain adapted runs. This is not surprising, as we have an extensive scientific corpus
with an average text complexity of 14-15 reflecting this.</p>
        <p>Earlier we observed that our complexity-aware retrieval systems obtained almost the same
retrieval effectiveness. Hence this complexity-aware approach was able to rank
a similar number of topically relevant documents in the top 10 as standard lexical and neural ranking
approaches. But is the complexity-aware approach able to rank more accessible content for our lay user
issuing a popular science query?</p>
        <p>Table 4 indeed shows favorable readability levels for the complexity-aware search, with an FKGL of
12-13 corresponding to the exit level of compulsory education. Hence the complexity-aware search
approach is able to retrieve relevant and accessible content for our lay user. The retrieved source abstracts
have a similar readability level as targeted by the text simplification systems discussed in Section 3.3.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Task 2: Complexity Spotting</title>
        <p>We continue with Task 2, asking to identify and explain difficult concepts.
3.2.1. Results
Task 2.1 Table 5 shows the performance of the Task 2 submission on the test data. At the time of
writing, these scores were released as (preliminary) scores without much further explanation.</p>
        <p>The official results seem to focus entirely on recall aspects, i.e., retrieving all terms annotated by the
experts. Our simple approach is not expected to do well in terms of recall. We conduct a more
precision-oriented evaluation below as additional analysis.</p>
        <p>Task 2.3 There is no train data released for Task 2.3, nor were any test results made available at the time of
writing. We hope and expect that these results will be released in time for the CLEF conference in
Grenoble.
3.2.2. Analysis
Table 6 shows the performance of the Task 2 submission on the train and test data. Due to the very
limited data available, we treat the task here as spotting any terms. We included the complexity level as a
graded score, in order to filter the Boolean measures on a minimal relevance score (tables not shown, as
they exhibit the same qualitative pattern, but at an obviously lower score level). On the train and test
data of earlier years, performance peaked around spotting 3 terms per sentence. Due to the many experts
annotating the same set of sentences, we see that both recall and F1 increase over ranks, and the highest
scores are obtained when spotting 5 rare terms per sentence. Overall, our simple approach achieves an MRR
of 0.2542 (train) and 0.2741 (test) and, taking the difficulty level into account, an NDCG@5 of 0.1446
(train) and 0.1469 (test).</p>
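        <p>The MRR reported above can be computed as in the sketch below, assuming per-sentence lists of
ranked predicted terms and sets of annotated reference terms; exact string matching is assumed purely
for illustration.</p>
        <preformat>
# Sketch: mean reciprocal rank of the first predicted term that matches a reference term.
def mean_reciprocal_rank(ranked_terms, references):
    """ranked_terms: ranked predictions per sentence; references: matching sets of terms."""
    total = 0.0
    for predictions, refs in zip(ranked_terms, references):
        for rank, term in enumerate(predictions, start=1):
            if term in refs:
                total += 1.0 / rank
                break
    return total / len(ranked_terms)
        </preformat>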
        <p>Table 7 shows an example sentence with references. In this example, our approach predicts 5 terms
that match one of the annotated references. The top ranked candidate matches one of the references
annotated as difficult ("d"). There is a striking number of 16 references, with about 11 unique reference
terms. Some references occur in variants (e.g., "simulated F1 car" is rated "d", whereas "F1 car" is rated
"e"). Several references do not literally occur in the source sentence: we observe differences in case
("ResNet-18" vs. "resnet-18"), plural/singular ("labels" vs. "label", "images" vs. "image"), and verb tense
("is fed" vs. "to be fed", "outputs" vs. "to output").</p>
        <p>Table 8 shows the frequency of spotted terms on the train data. We observe a striking variation,
with 53 sentences having a single complex term, and 12 sentences having more than 15 complex terms. This
variation makes predicting all terms nigh impossible, and makes averaging over terms an
unreliable indicator of the per-sentence performance. Evaluation over the sets of top retrieved terms, as
we did in Table 6, indeed shows reasonable performance for our basic approach.</p>
        <p>[Table 7 fragment, sentence G06.2_2810968146_2. Source: "The model is a ResNet-18 variant, which is fed in images from the front of a simulated F1 car,
and outputs optimal labels for steering, throttle, braking." Reference terms: ['ResNet-18 variant', 'braking', 'braking', 'f1 car', 'front', 'image', 'model', 'optimal label',
'resnet18', 'simulated F1 car', 'steering', 'steering', 'throttle', 'throttle', 'to be fed', 'to output'], with difficulty labels
['d', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'd', 'd', 'e', 'e', 'e', 'e', 'e', 'm']. Predicted terms: ['resnet-18', 'throttle', 'braking', 'f1', 'fed'].]</p>
        <p>The recall of our approach is relatively low, as the baseline rarest-term approach cannot find
multiword phrases. In addition, many of the ground truth terms do not literally appear in the sentence, and
require case folding, morphological normalization, or even more complex transformations to correctly
align with the exact orthography of the scientific text.</p>
        <p>Table 9 quantifies how often the spotted term or phrase occurs literally in the sentences. We
observe a fraction varying from 6.5% to 18.7%. Many of these cases concern morphological normalization
that is useful to conflate similar concepts across different sentences (base forms of verbs, singular forms
of nouns, etc.). However, the evaluation measures treat such cases as failed matches, so recall-oriented
measures should be treated with care.</p>
        <p>[Table fragment, not reconstructed: BERTScore R 0.93 / 0.93, F1 0.92 / 0.92.]</p>
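        <p>The literal-occurrence statistic behind Table 9 can be approximated as in the sketch below, assuming
a list of (sentence, reference term) pairs from the ground truth; the exact matching rules used for the
table are not reproduced here.</p>
        <preformat>
# Sketch: fraction of reference terms that occur verbatim in their source sentence,
# optionally after case folding.
def literal_match_rate(pairs, case_fold=False):
    hits = 0
    for sentence, term in pairs:
        s, t = (sentence.lower(), term.lower()) if case_fold else (sentence, term)
        hits += t in s
    return hits / len(pairs)
        </preformat>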
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Task 3: Text Simplification</title>
        <p>We continue with Task 3, asking to simplify scientific text.
3.3.1. Evaluation
Table 11 shows the results on the train data, both in terms of text statistics and in terms of evaluation
against the human reference simplifications. (Some of the differences in the number of sentences/abstracts
are due to sources not included in the test source file; this particularly concerns very short fragments
from biomedical literature added as additional train data, but not part of the SimpleText corpus.) We
make a number of observations. First, looking at the GPT-2 models, we see that both sentence level and
abstract level text simplification considerably bring down the FKGL measure, and obtain reasonable
SARI and BLEU scores against the reference simplifications. The abstract level simplification leads to
deletions of entire sentences, with 50% fewer tokens than the source, but still outperforms the sentence
level simplification that retains all sentences. Second, the BART models trained on Wiki-Auto and on
Cochrane-Auto lay summaries significantly outperform the GPT-2 model on BLEU, with scores of 0.37,
signaling high n-gram overlap with the human reference simplifications. For abstract level simplification
it is encouraging to see that the Cochrane model trained on scientific data slightly outperforms the
Wiki-Auto trained model. Third, the paragraph and document level models trained on Wiki-Auto and
Cochrane again do not outperform the sentence level simplifications, under the conditions of the task's
train data. The train data is derived from the sentence level scientific text simplification references from
the earlier years of the track. Proper document level text simplification approaches lead to considerable
deletions, and perform reasonably given their far more succinct output.</p>
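        <p>For reference, the sketch below shows how such scores can be obtained with standard tooling,
assuming one human reference per source text; the official evaluation scripts may differ in details such
as tokenization and multi-reference handling.</p>
        <preformat>
# Sketch: SARI, BLEU, and FKGL for a simplification run.
import evaluate   # Hugging Face evaluate library
import textstat

sari = evaluate.load("sari")
bleu = evaluate.load("sacrebleu")

def score_run(sources, outputs, references):
    """sources, outputs, references: parallel lists of strings (one reference per source)."""
    return {
        "SARI": sari.compute(sources=sources, predictions=outputs,
                             references=[[ref] for ref in references])["sari"],
        "BLEU": bleu.compute(predictions=outputs,
                             references=[[ref] for ref in references])["score"],
        "FKGL": sum(textstat.flesch_kincaid_grade(out) for out in outputs) / len(outputs),
    }
        </preformat>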
        <p>Table 12 shows the Task 3 results for both sentence-level (top) and abstract-level (bottom) scientific
text simplifications. We again make a number of observations. First, looking at the GPT-2 models, we
see again low FKGL scores indicating favorable readability, with reasonable SARI and BLEU scores. The
abstract level simplification clearly outperforms the merged sentence level simplifications, despite a far
more succinct output. Second, looking at the BART models trained on Wiki-Auto and on Cochrane-Auto
lay summaries, we see that the Cochrane model trained on scientific data clearly outperforms
the Wiki-Auto trained model on SARI for document level text simplification. Third, the paragraph
and document level models trained on Wiki-Auto and Cochrane again do not outperform the sentence
level simplifications, under the conditions of the task's test data based on aggregated human reference
sentence simplifications. These models take discourse structure into account, or may merge or reorder
sentences, and are less focused on single sentence wordsmithing, or promoting sentence splits.
3.3.2. Analysis
In this section, we analyze the output of our systems by realigning the simplified text predictions
to the source sentences.</p>
        <p>[Table 13: example abstract-level simplification of document 2111507945 (query G07.1), showing the
word-level deletions and insertions relative to the eight source input sentences; not reproduced here.]</p>
        <p>Controlled Creativity Text simplification models are based on generative large language models.
For example, one of the models we used is a GPT-2 model [10] called Keep it Simple (KiS). The model
is based on GPT-2 medium, using a straightforward unsupervised training task with an explicit loss in
terms of fluency, saliency, and simplicity. Such models are used in generative mode, generating the
output in a fairly unconstrained way in order to ensure none of the input is lost (in particular for longer
input). As a result, there is also a chance that the model continues to generate output after the source
has been fully simplified. This can cause the model to overgenerate and produce spurious content.</p>
        <p>Table 13 shows an example output simplification, combining the input sentences belonging to the
abstract of document 2111507945 retrieved for query G07.1. We show deletions and insertions
relative to the source input sentences (in this case 8 in total). Many simplifications are revisions of
the input, but we also observe that sometimes an entire sentence is inserted. Modern
models such as ours generate the simplification freely, which may lead to additional output being
generated at the end. Recall that the example in Table 13 merges 8 separate input sentences from the
train data, making this occur multiple times, at the end of three of the inputs.</p>
        <p>Spurious Content We analyze the frequency of spurious content in our runs. For human readers,
detecting such sentences by simply inspecting the output is hard, as they are very reasonable completions
generated with awareness of the preceding context. We experimented with unsupervised approaches to
tackle the generation of spurious content, by post-processing the output in relation to the original
input. Similar to the edits shown in the table, we align input and output, and remove any output sentence
that has been inserted without grounding in the input.</p>
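        <p>A minimal sketch of such a grounding check is shown below; it uses a simple token-overlap heuristic
with an assumed threshold, whereas the actual post-processing realigns output sentences to the input as
described above.</p>
        <preformat>
# Sketch: drop output sentences that share too little vocabulary with the source input.
import re

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def remove_spurious(source, output, min_overlap=0.4):
    """Keep only output sentences sufficiently grounded in the source (threshold is an assumption)."""
    source_tokens = tokens(source)
    kept = []
    for sentence in re.split(r"[.!?]+\s+", output.strip()):
        overlap = len(tokens(sentence).intersection(source_tokens)) / max(len(tokens(sentence)), 1)
        if overlap >= min_overlap:
            kept.append(sentence)
    return " ".join(kept)
        </preformat>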
        <p>Table 14 quantifies how often such spurious generation occurs. We make a number of observations.
First, the spurious generation is not infrequent. Some systems have a marginal number of cases,
which may be a result of imperfect alignment due to short sentences or changing word orders. Other
systems have many cases, up to 1,390 sentences or 29% (and 111 abstracts or 14%) of the input for the
unconstrained GPT2 model.</p>
        <p>Second, in the GPT-2 sentence level case, we remove this additional content in a post-processing step,
ensuring all the output is grounded in input sentences. This effectively removes spurious content
from the runs, and also leads to better performance in Table 12.</p>
        <p>Third, while our post-processing already has a favorable effect on the evaluation measures, we feel
that it has great benefits not reflected by these scores. Our post-processing is specifically, and only,
removing spurious generation (or "hallucination") from the output. These results highlight and quantify
the severity of this problem in generative text simplification models such as our GPT-2 model. At the
same time, it offers a practical approach to tackle this undesirable aspect head-on.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion and Conclusions</title>
      <p>This paper detailed the University of Amsterdam’s participation in the CLEF 2024 SimpleText track. We
conducted a range of experiments, for each of the three tasks of the track.</p>
      <p>For Task 1 on Content Selection, we observed very solid performance for zero-shot neural reranking,
as well as competitive effectiveness for complexity-aware rankers that purposely avoid retrieving results
with a high text complexity.</p>
      <p>For Task 2 on Complexity Spotting, we submitted preliminary approaches based on standard term
weighting, and observed that naive approaches can help locate difficult terms.</p>
      <p>For Task 3 on Text Simplification, we experimented with a range of models and approaches, and
observed that sentence-level simplification approaches can be very effective in reducing the complexity of
scientific text, and that paragraph and abstract level simplifications lead to far shorter output, including
whole sentence deletions.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This research was conducted as part of the final research projects of the Master in Artificial Intelligence at the
University of Amsterdam. We thank the track and task organizers for their amazing service and effort in making
realistic benchmarks for scientific text simplification available. Jaap Kamps is partly funded by the Netherlands
Organization for Scientific Research (NWO CI # CISC.CC.016, NWO NWA # 1518.22.105), the University of
Amsterdam (AI4FinTech program), and ICAI (AI for Open Government Lab). Views expressed in this paper are
not necessarily shared or endorsed by those funding the research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the CLEF 2024 SimpleText track: Improving access to scientific texts</article-title>
          , in: L.
          <string-name>
            <surname>Goeuriot</surname>
          </string-name>
          , et al. (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF</source>
          <year>2024</year>
          ), Lecture Notes in Computer Science, Springer,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] E. SanJuan, et al., Overview of the CLEF 2024 SimpleText task 1: Retrieve passages to include in a simplified summary, in: G. Faggioli, et al. (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), CEUR Workshop Proceedings, CEUR-WS.org, 2024.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] G. M. D. Nunzio, et al., Overview of the CLEF 2024 SimpleText task 2: Identify and explain difficult concepts, in: G. Faggioli, et al. (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), CEUR Workshop Proceedings, CEUR-WS.org, 2024.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] L. Ermakova, et al., Overview of the CLEF 2024 SimpleText task 3: Simplify scientific text, in: G. Faggioli, et al. (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), CEUR Workshop Proceedings, CEUR-WS.org, 2024.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] J. D'Souza, et al., Overview of the CLEF 2024 SimpleText task 4: Track the state-of-the-art in scholarly publications, in: G. Faggioli, et al. (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), CEUR Workshop Proceedings, CEUR-WS.org, 2024.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] L. Ermakova, T. Miller, A. Bosser, V. M. Palma-Preciado, G. Sidorov, A. Jatowt, Overview of JOKER - CLEF-2023 track on automatic wordplay analysis, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, A. Giachanou, D. Li, M. Aliannejadi, M. Vlachos, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction - 14th International Conference of the CLEF Association, CLEF 2023, Thessaloniki, Greece, September 18-21, 2023, Proceedings, volume 14163 of Lecture Notes in Computer Science, Springer, 2023, pp. 397–415. URL: https://doi.org/10.1007/978-3-031-42448-9_26. doi:10.1007/978-3-031-42448-9_26.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] J. Lin, X. Ma, S. Lin, J. Yang, R. Pradeep, R. F. Nogueira, Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations, in: F. Diaz, C. Shah, T. Suel, P. Castells, R. Jones, T. Sakai (Eds.), SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, ACM, 2021, pp. 2356–2362. URL: https://doi.org/10.1145/3404835.3463238. doi:10.1145/3404835.3463238.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] L. Ermakova, J. Kamps, Complexity-aware scientific literature search: Searching for relevant and accessible scientific text, in: G. M. D. Nunzio, F. Vezzani, L. Ermakova, H. Azarbonyad, J. Kamps (Eds.), Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024, ELRA and ICCL, Torino, Italia, 2024, pp. 16–26. URL: https://aclanthology.org/2024.determit-1.2.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] J. Kamps, The impact of author ranking in a library catalogue, in: G. Kazai, C. Eickhoff, P. Brusilovsky (Eds.), Proceedings of the 4th ACM Workshop on Online Books, Complementary Social Media and Crowdsourcing, BooksOnline 2011, Glasgow, United Kingdom, October 24, 2011, ACM, 2011, pp. 35–40. URL: https://doi.org/10.1145/2064058.2064067. doi:10.1145/2064058.2064067.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] P. Laban, T. Schnabel, P. N. Bennett, M. A. Hearst, Keep it simple: Unsupervised simplification of multi-paragraph text, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL/IJCNLP 2021), Association for Computational Linguistics, 2021, pp. 6365–6378. URL: https://doi.org/10.18653/v1/2021.acl-long.498.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>