<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>University of Amsterdam at the CLEF 2022 SimpleText Track</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Femke Mostert</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ashmita Sampatsing</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mink Spronk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Rau</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author" corresp="yes">
          <string-name>Jaap Kamps</string-name>
          <email>kamps@uva.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Amsterdam</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>This paper reports on the University of Amsterdam's participation in the CLEF 2022 SimpleText track. The overall goal of removing barriers that prevent the general public from accessing scientific literature is of great importance to help users make sense of a world of misinformation and shallow opinions. We perform preliminary studies within the track's setup, analyzing the text complexity of searching a large set of academic abstracts in the context of popular science topics emerging in the news, with a specific focus on the relation between the topical relevance and the text complexity of the retrieved information. Our main findings are the following. First, we analyzed a large corpus of scientific abstracts and confirmed that these are highly complex on average, but that the variation is large and many abstracts with accessible readability levels exist. Second, we ran retrieval experiments and found that standard search ignores readability, yet filtering on the desirable reading level still retains competitive performance while avoiding retrieving relevant but incomprehensible results. Third, we ran complexity spotting experiments and found that straightforward lexical complexity or term frequency measures are strong indicators, but have to be combined with the importance of the concept in the broader context of the information request. Fourth, we ran a GPT-2 based text simplification model in a zero-shot way, resulting in conservative rewriting of abstracts, able to significantly reduce the text complexity. More generally, our results demonstrate that text complexity is an essential aspect to consider for improving non-expert access to scientific information, and open up new routes to develop effective scientific information access technology tailored to the needs of the general public.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Storage and Retrieval</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Scientific Information Access</kwd>
        <kwd>Text Simplification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The advent of the internet and social media has been revolutionary in changing every aspect
of information creation and information consumption. In the early years, many have lauded
all the positives resulting from this, such as breaking down traditional barriers in access to
information, as well as providing the means to publish anything by anyone [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. One of the
prime examples of this is Wikipedia. This is leading in principle to ultimate democratization by
giving every global citizen a voice, and ultimate equality where quality of arguments rather
than the provenance of the author decides the outcome.
      </p>
      <p>In recent years, the emphasis has shifted to the negatives resulting from this, as not only
individual citizens have massively joined this open sphere, but also many other commercial or
political actors have discovered its potential to influence citizens to suit their financial or political
incentives, even leading to social and societal risks. Hence, one of the greatest challenges of
today is how users can navigate in a world of misinformation and disinformation. This is a very
hard and complex problem, and rather than believing in fairy tales and some magic wand that
will make it disappear, we focus on a well-established antidote: objective, scientific information
in the academic literature and associated data.</p>
      <p>
        Every citizen agrees on the importance of objective scientific evidence, yet at the same time
they predominantly rely on shallow secondary information on the web and in social media.
One of the main reasons for not accessing scientific information directly is that they presume
the scientific literature is too difficult. The CLEF 2022 SimpleText track investigates head-on
the barriers that ordinary citizens face when accessing scientific literature, by making available
corpora and tasks to address different aspects of the problem. For details on the exact track
setup, we refer to the Track Overview paper in the CLEF 2022 LNCS proceedings [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], as well as the
detailed task overviews in the CEUR proceedings [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ].
      </p>
      <p>We conduct an extensive analysis of the corpus of scientific abstracts and the three tasks of
the track: Task 1 on content selection and avoiding complexity; Task 2 on complexity spotting in
extracted sentences from scientific abstracts; and Task 3 on text simplification proper rewriting
sentences from these abstracts. The rest of this paper is structured as follows. Next, in Section 2
we discuss our experimental setup and the specific runs submitted. Section 3 discusses the
results of our runs and provides a detailed analysis of the corpus and results for each task. We
end in Section 4 by discussing our results and outlining the lesson learned.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Experimental Setup</title>
      <p>In this section, we will detail our approach for the three CLEF 2022 SimpleText track tasks.</p>
      <p>
        For details of the exact task setup and results we refer the reader to the detailed overview of
the track in Ermakova et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The basic ingredients of the track are:
Corpus: The CLEF 2022 SimpleText corpus consists of 4.9 million bibliographic records,
including 4.2 million abstracts, with detailed information about authors, affiliations, and citations.</p>
      <p>Context: There are 40 popular science articles, with 20 from The Guardian
(https://www.theguardian.com/science) and 20 from TechXplore (https://techxplore.com/).</p>
      <p>Requests: For Task 1, there are 114 requests with 1-4 queries per context article; 47 requests
are based on The Guardian and 67 on TechXplore. For Task 2, there are 453 train and
116,764 test sentences. For Task 3, there are 648 train and 116,764 test sentences.</p>
      <p>Assessments: For Task 1, there are qrels for 72 requests (67 requests with at least 1 marginally
relevant abstract, 22 requests with at least 5 relevant abstracts).</p>
      <p>We created runs for all the three tasks of the track, which we will discuss in order.</p>
      <sec id="sec-2-1">
        <p>Task 1 This task requires ranking scientific abstracts in response to a non-expert, general
query prompted by a popular science article. We submitted two runs.</p>
        <p>
          The first run, labeled UAms-MF in [
          <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
          ], is a manual run selecting relevant and accessible
results from the top 5 of a vanilla Elastic Search run.
        </p>
        <p>
          Our second run, labeled UAms in [
          <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
          ], is an automatic run using a reading level/text
complexity score as a filter. Specifically, per request, from the top 100 results of a vanilla Elastic
Search run, we remove 50% of the abstracts with the highest text complexity based on the
popular Flesch readability level score.
        </p>
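        <p>As an illustrative sketch (not the submitted code), the filtering step can be implemented
as follows; the function name and tuple layout are our own, and we assume the Flesch
reading-ease convention in which higher scores indicate easier text.</p>
        <preformat>
```python
def filter_by_readability(ranked_results, keep_fraction=0.5):
    """Drop the most complex abstracts from a ranked result list.

    ranked_results: list of (doc_id, reading_ease) in retrieval order,
    where reading_ease follows the Flesch convention (higher = easier).
    Keeps the keep_fraction most readable abstracts, preserving the
    original retrieval order among the survivors.
    """
    n_keep = max(1, int(len(ranked_results) * keep_fraction))
    # Identify the most readable documents (highest reading ease).
    keep_ids = {doc for doc, _ in
                sorted(ranked_results, key=lambda r: -r[1])[:n_keep]}
    # Preserve the retrieval (relevance) order among the kept documents.
    return [(doc, ease) for doc, ease in ranked_results if doc in keep_ids]
```
        </preformat>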
        <p>Task 2 This task asks what concept needs to be explained or rewritten in a given sentence,
extracted from a scientific abstract.</p>
        <p>
          Based on preliminary experiments, our submission also labeled UAms in [
          <xref ref-type="bibr" rid="ref2 ref4">2, 4</xref>
          ], uses IDF-based
term weighting to locate the rarest terms, combined with a simple way to boost
particular syntactic categories. Specifically, we used all train and test sentences combined as a
reference corpus to calculate document (or rather sentence) frequencies, and used these to rank
each term in the source sentence by increasing DF (or decreasing IDF). We include ad-hoc boost
factors for particular parts of speech, promoting nouns and demoting verbs and adjectives.
        </p>
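        <p>A minimal sketch of this idea, with hypothetical boost factors and a plain regex
tokenizer standing in for the actual preprocessing of our run:</p>
        <preformat>
```python
import re
from collections import Counter

def build_sentence_freq(corpus_sentences):
    # Sentence frequency: in how many sentences does each term occur?
    df = Counter()
    for sent in corpus_sentences:
        df.update(set(re.findall(r"[a-z]+", sent.lower())))
    return df

# Hypothetical part-of-speech boost factors: promote nouns, demote
# verbs and adjectives (the exact values in our run differ).
POS_BOOST = {"NOUN": 2.0, "VERB": 0.5, "ADJ": 0.5}

def rank_difficult_terms(sentence, df, pos_tags=None, top_k=2):
    """Rank terms by increasing sentence frequency (rarest first),
    dividing by a POS boost; pos_tags maps a term to its tag."""
    terms = list(dict.fromkeys(re.findall(r"[a-z]+", sentence.lower())))
    def score(term):
        boost = POS_BOOST.get((pos_tags or {}).get(term, ""), 1.0)
        return df.get(term, 0) / boost  # lower = rarer or boosted
    return sorted(terms, key=score)[:top_k]
```
        </preformat>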
      </sec>
      <sec id="sec-2-2">
        <title>Task 3 Rewrite a sentence from a scientific abstract.</title>
        <p>
          This is a post-submission run, hence not evaluated in the track and task overview papers [
          <xref ref-type="bibr" rid="ref2 ref5">2, 5</xref>
          ].
We use a standard text simplification model, based on the GPT-2 based keep it simple (KiS) model
of Laban et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. We run a pretrained version of this model, available from HuggingFace
(https://huggingface.co/philippelaban/keep_it_simple), in a
zero-shot way on both the train and test corpus.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Results</title>
      <p>In this section, we will present the results of our experiments, in four self-contained subsections
following the CLEF 2022 SimpleText Track corpus and tasks.</p>
      <sec id="sec-3-1">
        <title>3.1. Corpus, Context and Requests</title>
        <p>We start with a preliminary analysis of the complexity of the scientific abstracts, in relation to
the context and requests. To quantify the complexity, we use the Flesch-Kincaid Grade Level
(FKGL) measure based on lexical and grammatical complexity. This is a simple measure
based on word length and sentence length, which may not be the most accurate for a single
abstract but is a reliable approximation when averaging over larger sets of data. The FKGL score
is calibrated to correspond to the readability level suitable for a given school level in the U.S.
school system, as shown in Table 1. While literacy levels vary in the population, even among
adults, one may assume that an average layperson would have finished compulsory education,
corresponding to a high school diploma at a grade level of 12.</p>
        <p>3.1.1. Complexity of the Corpus</p>
        <p>We down-sampled the corpus by taking every 500th article, resulting in a sample of
8,513 non-empty abstracts. As shown in Table 2, the average (median) length of the abstracts is
951 (905) tokens, and the average (median) complexity of the abstracts is 14.55 (14.4) FKGL.</p>
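        <p>For reference, the FKGL measure itself is simple to compute. The sketch below uses the
standard formula with a rough vowel-group syllable heuristic, which is an approximation of
(and not necessarily identical to) the syllable counter used in our analysis.</p>
        <preformat>
```python
import re

def count_syllables(word):
    # Rough heuristic: count groups of consecutive vowels; at least 1.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text):
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words)) - 15.59)
```
        </preformat>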
        <p>How complex are scientific abstracts? We can immediately confirm that scientific literature
is indeed complex: the scale is in U.S. grade levels in years, with 12 being the exit level
of compulsory education (high school diploma), hence the observed complexity of 14-15 is well above this level.</p>
        <sec id="sec-3-1-1">
          <p>A complexity of 14-15 translates to students halfway through undergraduate or college education.</p>
          <p>What is the target level of complexity? Recall that the track also provides 40 popular science
articles from The Guardian and TechXplore, which are written by professional science journalists
for a general audience. As also shown in Table 2, the average (median) length of these articles
is 5,504 (5,540) tokens, and the average (median) complexity of the articles is 12.53 (12.7) FKGL,
confirming that a FKGL around 12, translating to the readability level of a high school diploma,
is appropriate for general citizens.</p>
          <p>Is every single abstract too complex for an average citizen? Figure 1 (left) shows the
distribution of FKGL readability levels, which shows a striking variation ranging from 5 (elementary
school, 10 year old children) to 25 (graduate school domain expert). Figure 1 (right) visualizes
this extreme variation, plotted against the length of the abstracts. There is in fact a weak
correlation between text complexity and length (r=0.1059, highly significant, regression line
with slope 0.0007 in red), but for any length we find abstracts on any level of readability.</p>
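          <p>A self-contained sketch of the computation behind the correlation and regression line in
Figure 1 (right); the function name is ours.</p>
          <preformat>
```python
def pearson_and_slope(xs, ys):
    """Pearson correlation r and least-squares regression slope,
    e.g. xs = abstract lengths, ys = FKGL scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    r = cov / (var_x * var_y) ** 0.5
    slope = cov / var_x  # slope of the fitted line y = a + slope * x
    return r, slope
```
          </preformat>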
          <p>Our analysis confirms the presumption that scientific literature is complex, and a large
fraction of abstracts would be very challenging for a layperson. However, our analysis also
reveals that a large fraction of abstracts is within the readability levels of most adult citizens.</p>
          <p>3.1.2. Complexity of the Requests</p>
          <p>What subset of abstracts is selected by a general query based on the popular science newspaper
articles? We use the default Elastic Search engine, retrieve the top 100 scientific articles for
each request, and analyze the text complexity of each retrieved abstract. Over the 114 queries,
this results in a sample of 11,400 abstracts. As also shown in Table 2, the average (median)
length of the retrieved abstracts is 948 (928) tokens, and the average (median) complexity of
the abstracts is 13.79 (14.4) FKGL. Hence, the retrieved abstracts are comparable to the corpus
statistics, both in terms of length and text complexity, and the distribution of FKGL (not
shown) is very similar.</p>
          <p>Figure 2 shows the distribution of FKGL readability levels over retrieval rank (left-hand
side), and over each individual query (right-hand side). In both cases we see that the standard
retrieval engine is completely blind to the text complexity, and exclusively focused on the
topical relevance of the abstract. As a result, for any rank and any topic, we again see a
striking variation in FKGL, ranging from 10 (starting high school, 15 year old children) to 20
(doctoral/PhD candidate).</p>
          <p>There are three conclusions based on our analysis. First, a negative result is that we can
confirm that scientific abstracts on average are complex, validating the presumption
that laypersons avoid scientific information. Second, a positive result is that we also found that
the variation of complexity is dramatic, and a large fraction of abstracts is within the readability
level of an average educated adult citizen. Third, standard search engines are optimized for
topic relevance and are completely ignoring secondary aspects such as the reading level of the
text.</p>
          <p>Generally, this last finding can immediately explain why laypersons have a disappointing
experience when searching academic literature, and explain why they avoid academic sources
even when they care about objective evidence. However, note that this finding is also actionable
to dramatically improve existing search technology, by explicitly factoring readability into the
ranking model, and retrieving all and only relevant abstracts that are not prohibitively complex
for an interested outsider.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Task 1: Content Selection</title>
        <p>We continue with Task 1, asking to retrieve scientific articles in response to a query based on a
popular science article. We submitted two runs, one based on a manual selection of the top 5
results of the default elastic search engine, and an automatic run filtering the top 100 results for
the 50% of abstracts with the lowest text complexity based on the readability measure.</p>
        <p>Table 3 shows the results of our official submissions for Task 1. All scores are based on
the top 5 abstracts retrieved per query, based on a pool of abstracts retrieved by at least 2
submissions, and evaluated on a scale ranging from 0 (irrelevant articles) to 5 (when the abstract
and keywords are relevant to the query and the content of the original article). We see that
the set of documents in the top 5 of the manual run has a higher level of relevance than the
readability level filtered automatic run, and also finds more relevant abstracts per query.</p>
        <p>Using the final qrels, we can also evaluate using familiar search measures, with the resulting
evaluation being shown in Table 4. We also include here the evaluation of the standard Elastic
Search for the designated queries (regular queries without quotes). The qrels include 72 topics,
but our manual run does not include results for topics in which the top 5 abstracts as returned
by the Elastic Search API were deemed non-relevant for the context of the article. Hence we also
evaluate the manual run over the intersection of 52 topics, for which the manual run includes
at least 1 result.</p>
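        <p>For concreteness, the NDCG@10 scores discussed below can be computed as follows; this
sketch uses the standard log2-discount formulation, and the exact evaluation tooling used by the
track may differ in details.</p>
        <preformat>
```python
import math

def ndcg_at_k(gains, all_gains, k=10):
    """gains: graded relevance of the retrieved documents in rank order;
    all_gains: relevance grades of all judged documents for the topic."""
    def dcg(grades):
        # Discounted cumulative gain over the top-k ranks.
        return sum(g / math.log2(rank + 2)
                   for rank, g in enumerate(grades[:k]))
    ideal = dcg(sorted(all_gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0
```
        </preformat>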
        <p>We see that the manual run, retrieving between 1 and 5 results per query, has superior early
precision, higher than the Elastic Search baseline. We also see that our automatic run obtains
very reasonable performance with an NDCG@10 of 38%. The performance in comparison to
the original Elastic baseline (scoring an NDCG@10 of 43%) may look unimpressive, as in terms
of relevance ranking we do not outperform the baseline. However, recall that our automatic run
had a different aim, radically filtering the abstracts to a reading level agreeable to the intended
layperson user. Hence we also include the Flesch-Kincaid Grade Level (FKGL) readability scores,
and observe that the automatic run is able to return abstracts that are on average 2 years or
school levels lower. That is, the reading level of the automatic run is around 12, corresponding
to exit level compulsory education, or high school diploma, which would be accessible to the
target audience of educated citizens. The baseline approach, in contrast, suggests a reading
level requiring college or university education.</p>
        <p>Assuming that users can select from a ranked list, it is of interest to analyze if, and how
many, relevant and highly relevant abstracts are in the runs. Table 5 evaluates the runs with
Boolean quantization on various levels of relevance. In the top part of the table, we evaluate
on all levels of relevance (with “1” meaning marginally relevant). Here we see very reasonable
early precision scores, in particular for the manual run. The high average precision
suggests good performance even at higher recall levels. In the middle part of the table, we
evaluate on relevance level 2 and higher (with “2” meaning relevant), confirming the superiority
of the manual run to return those abstracts with higher levels of relevance. In the bottom part of
the table, we evaluate on relevance level 4 and higher (with “4” meaning relevant to the article
context), and observe that our automatic run not only returns abstracts that are easier to read, it
also outperforms the baseline system in returning those abstracts of direct interest to the article
context.</p>
        <p>There are three conclusions based on our analysis. First, for every query and every level
of relevance, there exist retrieved documents at a variety of readability levels. Second, a
straightforward filter on readability level is able to retrieve abstracts that have a readability
suitable for an educated citizen. Third, filtering on readability level leads to a small loss of
retrieval effectiveness (as some relevant abstracts have high levels of text complexity), but still
obtains a very reasonable performance in particular for retrieving highly relevant articles.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Task 2: Complexity Spotting</title>
        <p>We continue with Task 2, asking to locate the most difficult concepts in a sentence extracted
from a potentially relevant abstract, retrieved in response to a general query prompted by a
popular science article. We submitted a single run, using an IDF based approach to find the
least common term in the sentence, while boosting the most likely part of speech (i.e., nouns or
noun phrases).</p>
        <p>Table 6 shows the results of our official submission to Task 2, retrieving 263k terms for the
117k sentences in the test corpus. We see that a large fraction of highlighted terms (89%) has
correct term limits, which is reassuring as we focused on selecting unigram tokens or words
rather than complex concepts or phrases. Term difficulty is judged on a scale ranging from 0
(term needs no explanation) to 7 (impossible to understand). A fair fraction of terms selected
(8%) has a high level of difficulty (3 or higher) and the majority of those (5%) a very high
level of difficulty (5 or higher), with lower fractions exactly hitting the term limits (5% and 4%
respectively).</p>
        <p>At the time of writing, no ground truth is released for Task 2. Fortunately, the organizers
released train data in an earlier stage of the track, consisting of 282 sentences with 453 manually
extracted complex concepts. By treating every sentence as a “query” and every extracted concept
as a “document identifier” (the string of characters after tokenizing and removing white-space),
we can calculate standard ranked list measures as shown in Table 7. We observe very reasonable
precision scores, both for selecting the longest and for selecting the least common terms, with
the IDF approach being more effective than simple word length. Qualitative inspection
reveals that this is particularly caused by abbreviations, which tend to be short character strings.</p>
        <p>The gold standard also selects multi-term concepts, which are systematically missed by our
single term or word-based approaches. Still, NDCG (35 to 40%) and MAP (around 30%) remain
very reasonable. The NDCG scores reflect the graded concept difficulty level of the ground
truth, normalized against the ideal ranking of the most difficult term first, and show impressive
performance for these straightforward approaches, with a clearer advantage for the IDF approach
over the length-based approach. Qualitative inspection reveals that the concepts annotated as
the most difficult often do not have the highest lexical complexity, but factor in the importance or
centrality of the concept for understanding the sentence and abstract at hand.</p>
        <p>As there are very few terms selected per sentence, leading to a necessarily low precision at
10 or even at 5, we also calculate set based precision, recall, and F1 for the first three results in
Table 8. We see here that the set based F1 on the train data is highest after selecting 2 terms per
sentence.</p>
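        <p>The set-based measures of Table 8 reduce to the usual definitions over the first few
selected terms per sentence; a sketch with our own function name:</p>
        <preformat>
```python
def set_prf_at_k(predicted, gold, k):
    """Set-based precision, recall, and F1 over the first k predicted
    terms, against the gold set of extracted concepts."""
    pred = set(predicted[:k])
    gold = set(gold)
    tp = len(pred.intersection(gold))  # true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```
        </preformat>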
        <p>There are three conclusions based on our analysis. First, although spotting complex terms is
a hard problem in general, straightforward approaches obtain very reasonable performance in
the context of a single sentence. Second, using text statistics such as inverse word frequencies
performs better than using only local features such as word length, and is particularly helpful
to locate the most difficult terms. Third, lexical complexity as used in readability measures
is not enough to locate the ground truth concepts, and we have to combine complexity with
importance of the concept in the broader context of the information request.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Task 3: Text Simplification</title>
        <p>
          We continue with Task 3, asking to perform text simplification proper, by rewriting a sentence
extracted from a potentially relevant abstract, retrieved in response to a general query prompted
by a popular science article. We only perform post-submission experiments, based on the
zero-shot application of an existing neural text simplification model from [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], called the Keep it
Simple (KiS) model. The model is based on GPT-2 medium, using a straightforward unsupervised
training task with an explicit loss in terms of fluency, saliency, and simplicity. We are interested
in this model as it is fully trained in an unsupervised way, and could be retrained or fine-tuned
for this corpus or other academic texts without the need for large amounts of human-labeled training data.
        </p>
        <p>Table 9 shows the results of applying the KiS model zero-shot on the train and test data in
terms of the generated output. To give an indication of whether the output is indeed simplified,
we analyze the Flesch-Kincaid Grade Level (FKGL) of input and output sentences and the
resulting compression in token length. We make the following three observations. First, we
see a mean and median level of 15 in the scientific abstracts, which we lower by about 3 levels
or years of education, with FKGL 12 corresponding to a high school diploma level. Second,
we also look at the percentage of sentences where the FKGL is lowered, and see that this is
the case in around 80% of sentences. Note that here, the dummy “no change” approach fails
miserably, as not a single sentence is simplified. Third, in terms of sentence length, we see no
significant compression, as the generated sentences are on par with the input sentences. This
may be related to the corpus, as the input sentences from scientific abstracts tend to be not
very long with a mean length of 25.8 (train) and 24.2 (test) tokens, and a median length of 24
and 23 respectively. This very significant reduction in text complexity is an encouraging result
showing the promise to realize the general aims of the track.</p>
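        <p>The two indicators in Table 9 (mean FKGL reduction and the fraction of sentences whose
grade level goes down) can be computed from source/output pairs as follows; the scorer is passed
in, so any FKGL implementation can be plugged in, and the function name is ours.</p>
        <preformat>
```python
def simplification_stats(pairs, grade_level):
    """pairs: list of (source, output) sentence pairs;
    grade_level: a callable scoring a sentence, e.g. an FKGL function.
    Returns (mean grade-level reduction, fraction of sentences lowered)."""
    deltas = [grade_level(src) - grade_level(out) for src, out in pairs]
    # Fraction of sentences where the output grade level went down.
    lowered = sum(d > 0 for d in deltas) / len(deltas)
    return sum(deltas) / len(deltas), lowered
```
        </preformat>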
        <p>Table 10 shows examples of the generated output for the first three sentences in the train
corpus. On the one hand, we see no major issues in language aspects – every generated sentence
is grammatical and coherent – and no issues with generating uncontrolled untrue information
not contained in the input sentence. On the other hand, we see only very conservative changes,
mostly very light editing in terms of deletions and substitutions. While the examples and
earlier analysis show that we are moving in the right direction, the ground truth is a far
more significant simplification. Hence, developing dedicated text simplification approaches for
scientific text remains an important open problem.</p>
        <p>The first three source sentences in the train corpus (the inputs of Table 10) are: "Current
academic and industrial research is interested in autonomous vehicles."; "Drones are increasingly
used in the civilian and commercial domain and need to be autonomous."; and "Governments set
guidelines on the operation ceiling of civil drones. So, road-tracking based navigation is
attracting interest."</p>
        <p>For Task 3, we only performed post-submission experiments and our runs were not judged
in terms of Lexical complexity, Syntax complexity, and Information loss. Hence we only show
the automatic evaluation based on the human reference simplifications. Table 11 shows the
automatic evaluation scores for Task 3, using standard SARI and Bleu scores. At the time of
writing, no test ground truth is available, so we only report scores on the train data. Note that,
as we apply a zero-shot model that is neither trained nor fine-tuned in any way on the CLEF
SimpleText data, the evaluation on the train corpus is still an independent evaluation of the
model’s quality. On the train corpus, with a single human simplified reference sentence, the
KiS model obtains a Bleu score of 0.2809 and a SARI score of 0.3984. To put this number
into perspective, the original paper reports scores in the range of 0.26 to 0.43 on a Wikipedia
corpus [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Hence, a 40% SARI score is promising in terms of effectiveness.
        </p>
        <p>We also include the dummy “no change” approach making no changes whatsoever, defaulting
to returning the input sentence as is. Unlike in machine translation where this would result in
very low, if any, token overlap, this proves a competitive approach in terms of the SARI and
Bleu scores, as naturally the reference simplification will retain many tokens and n-grams of
the original sentence. Recall from above that this no-change approach simplifies not a single
sentence, resulting in 0% of sentences scoring lower on the FKGL scores. This clearly indicates
that we need to evaluate multiple aspects to capture the essence of text simplification.</p>
        <p>There are three conclusions based on our analysis. First, an off-the-shelf text simplification
model based on GPT-2 is able to rewrite the sentences from academic abstracts with competitive
SARI and Bleu scores against high quality human text simplifications. Second, although the
model’s revisions are conservative, there are no errors introduced and the output is fluent and
without loss of information. Third, in terms of readability level, the simplification reduces
the level from 15 (college level, undergraduate studies) to 12 (high school diploma, exit level
compulsory education), suggesting a readability level suitable for a large fraction of educated
citizens.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion and Conclusions</title>
      <p>This paper detailed the University of Amsterdam’s participation in the CLEF 2022 SimpleText
track.</p>
      <p>We performed an extensive analysis of the corpus in terms of the text complexity, leading to
the following findings. First, a negative result is that we can confirm that scientific abstracts
on average are complex, validating the presumption of laypersons avoiding
scientific information. Second, a positive result is that we also found that the variation of
complexity is dramatic, and a large fraction of abstracts is within the readability level of an
average educated adult citizen. Third, standard search engines are optimized for topic relevance
and are completely ignoring secondary aspects such as the reading level of the text.</p>
      <p>Next, we made two submissions to the first task retrieving academic abstracts in response
to a query based on a popular science article, and found the following. First, for every query
and every level of relevance, there exist retrieved documents at a variety of readability levels.
Second, a straightforward filter on readability level is able to retrieve abstracts that have a
readability suitable for educated citizens. Third, filtering on readability level leads to a small loss
of retrieval effectiveness (as some relevant abstracts have high levels of text complexity), but
still obtains a very reasonable performance in particular for retrieving highly relevant articles.</p>
      <p>We also participated in the second task, asking to spot difficult terms in sentences from
academic abstracts, and made the following observations. First, although spotting complex terms
is a hard problem in general, straightforward approaches obtain very reasonable performance in
the context of a single sentence. Second, using text statistics such as inverse word frequencies
performs better than using only local features such as word length, and are particularly helpful
to locate the most dificult terms. Third, lexical complexity as used in readability measures
is not enough to locate the ground truth concepts, and we have to combine complexity with
importance of the concept in the broader context of the information request.</p>
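      <p>The two term-difficulty signals compared above can be sketched as follows. This is a minimal illustration, not our exact submission: the background word-frequency table is a hypothetical stand-in for frequencies estimated from a large reference corpus, and terms are scored either by length (local feature) or by inverse log frequency (corpus feature).</p>

```python
import math
import re

# Hypothetical background frequencies (occurrences per million tokens);
# in practice these would be estimated from a large reference corpus.
FREQ_PER_MILLION = {
    "the": 60000, "by": 20000, "network": 120, "learns": 80,
    "weights": 50, "neural": 40, "backpropagation": 2,
}

def score_by_length(term: str) -> float:
    # Local feature: longer words tend to be harder.
    return float(len(term))

def score_by_rarity(term: str) -> float:
    # Corpus feature: negative log relative frequency; rarer terms score higher.
    # Unseen terms default to 1 occurrence per million.
    per_million = FREQ_PER_MILLION.get(term, 1)
    return -math.log(per_million / 1_000_000)

def spot_difficult(sentence: str, k: int = 2, scorer=score_by_rarity):
    """Return the k highest-scoring candidate terms in the sentence."""
    terms = set(re.findall(r"[a-z]+", sentence.lower()))
    return sorted(terms, key=scorer, reverse=True)[:k]
```

      <p>With the rarity scorer, rare domain terms such as “backpropagation” surface first, mirroring the observation that inverse word frequencies are particularly helpful for locating the most difficult terms.</p>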
      <p>Finally, we performed exploratory experiments with a large neural text simplification model
based on GPT-2, and arrived at the following conclusions. First, an off-the-shelf text
simplification model based on GPT-2 is able to rewrite the sentences from academic abstracts with
competitive SARI and BLEU scores against high-quality human text simplifications. Second,
although the model’s revisions are conservative, no errors are introduced and the output
remains fluent, without loss of information. Third, in terms of readability level, the
simplification reduces the level from 15 (college level, undergraduate studies) to 12
(high-school diploma, the exit level of compulsory education), suggesting a readability level
suitable for a large fraction of educated citizens.</p>
      <p>This research was conducted as part of the final research projects of the Bachelor in Artificial
Intelligence at the University of Amsterdam. We thank the coordinator, Dr. Sander van Splunter, for his
support and flexibility in working around the CLEF deadlines. This research is funded in part by the
Netherlands Organization for Scientific Research (NWO CI # CISC.CC.016) and the Innovation Exchange
Amsterdam (POC grant). Views expressed in this paper are not necessarily shared or endorsed by those
funding the research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Grossman</surname>
          </string-name>
          ,
          <article-title>Time's person of the year: You</article-title>
          , TIME
          <volume>168</volume>
          (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          , E. SanJuan, J. Kamps,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ovchinnikova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nurbakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Araújo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hannachi</surname>
          </string-name>
          , É. Mathurin,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bellot</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF 2022 SimpleText Lab: Automatic simplification of scientific texts</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. D. S.</given-names>
            <surname>Martino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Esposti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>CLEF'22: Proceedings of the Thirteenth International Conference of the CLEF Association, Lecture Notes in Computer Science</source>
          , Springer,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>SanJuan</surname>
          </string-name>
          , S. Huet,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF 2022 SimpleText Task 1: Passage selection for a simplified summary</article-title>
          ,
          <source>in: [8]</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ovchinnikova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nurbakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Araújo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hannachi</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF 2022 SimpleText Task 2: Complexity spotting in scientific abstracts</article-title>
          ,
          <source>in: [8]</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ovchinnikova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nurbakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Araújo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hannachi</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF 2022 SimpleText Task 3: Query biased simplification of scientific texts</article-title>
          ,
          <source>in: [8]</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Laban</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schnabel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Bennett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hearst</surname>
          </string-name>
          ,
          <article-title>Keep it simple: Unsupervised simplification of multi-paragraph text</article-title>
          , in: ACL/IJCNLP'21:
          <article-title>Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th</article-title>
          <source>International Joint Conference on Natural Language Processing, Association for Computational Linguistics</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>6365</fpage>
          -
          <lpage>6378</lpage>
          . URL: https://doi.org/10.18653/v1/2021.acl-long.498.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pavlick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Callison-Burch</surname>
          </string-name>
          ,
          <article-title>Optimizing statistical machine translation for text simplification</article-title>
          ,
          <source>Trans. Assoc. Comput. Linguistics</source>
          <volume>4</volume>
          (
          <year>2016</year>
          )
          <fpage>401</fpage>
          -
          <lpage>415</lpage>
          . URL: https://doi.org/10.1162/tacl_a_00107. doi:10.1162/tacl_a_00107.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          , M. Potthast (Eds.),
          <source>Proceedings of the Working Notes of CLEF</source>
          <year>2022</year>
          :
          <article-title>Conference and Labs of the Evaluation Forum</article-title>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>