=Paper= {{Paper |id=Vol-3740/paper-213 |storemode=property |title=Extended Overview of the CLEF 2024 LongEval Lab on Longitudinal Evaluation of Model Performance |pdfUrl=https://ceur-ws.org/Vol-3740/paper-213.pdf |volume=Vol-3740 |authors=Rabab Alkhalifa,Hsuvas Borkakoty,Romain Deveaud,Alaa El-Ebshihy,Luis Espinosa-Anke,Tobias Fink,Petra Galuščáková,Gabriela González-Sáez,Lorraine Goeuriot,David Iommi,Maria Liakata,Harish Tayyar Madabushi,Pablo Medina-Alias,Philippe Mulhem,Florina Piroi,Martin Popel,Arkaitz Zubiaga |dblpUrl=https://dblp.org/rec/conf/clef/AlkhalifaBDEAFG24 }} ==Extended Overview of the CLEF 2024 LongEval Lab on Longitudinal Evaluation of Model Performance== https://ceur-ws.org/Vol-3740/paper-213.pdf
                         Extended overview of the CLEF 2024 LongEval Lab on
                         Longitudinal Evaluation of Model Performance
                         Notebook for the LongEval Lab at CLEF 2024

                         Rabab Alkhalifa1,2,† , Hsuvas Borkakoty3,† , Romain Deveaud4,† , Alaa El-Ebshihy5,6,† ,
                         Luis Espinosa-Anke3,7,† , Tobias Fink5,6,† , Petra Galuščáková9,† , Gabriela Gonzalez-Saez8,† ,
                         Lorraine Goeuriot8,† , David Iommi5,† , Maria Liakata1,10,11,† , Harish Tayyar Madabushi12,† ,
                         Pablo Medina-Alias12,† , Philippe Mulhem8,† , Florina Piroi5,6,† , Martin Popel13,† and
                         Arkaitz Zubiaga1,†
                         1 Queen Mary University of London, UK
                         2 Imam Abdulrahman Bin Faisal University, SA
                         3 Cardiff University, UK
                         4 Qwant, France
                         5 Research Studios Austria, Data Science Studio, Vienna, AT
                         6 TU Wien, Austria
                         7 AMPLYFI, UK
                         8 Univ. Grenoble Alpes, CNRS, Grenoble INP¹, LIG, Grenoble, France
                         9 University of Stavanger, Stavanger, Norway
                         10 Alan Turing Institute, UK
                         11 University of Warwick, UK
                         12 University of Bath, UK
                         13 Charles University, Prague, Czech Republic


                                      Abstract
                                      We describe the second edition of the LongEval CLEF 2024 shared task. This lab evaluates the temporal persistence
                                      of Information Retrieval (IR) systems and Text Classifiers. Task 1 requires IR systems to run on corpora acquired
                                      at several timestamps, and evaluates the drop in system quality (NDCG) along these timestamps. Task 2 tackles
                                      binary sentiment classification at different points in time, and evaluates the performance drop for different
                                      temporal gaps. Overall, 37 teams registered for Task 1 and 25 for Task 2. Ultimately, 14 and 4 teams participated
                                      in Task 1 and Task 2, respectively.

                                      Keywords
                                      Evaluation, Temporal Persistence, Temporal Generalisability, Information Retrieval, Text Classification




                         ¹ Institute of Engineering Univ. Grenoble Alpes.
                         CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
                         * Corresponding author.
                         † These authors contributed equally.
                         $ raalkhalifa@iau.edu.sa (R. Alkhalifa); borkakotyh@cardiff.ac.uk (H. Borkakoty); r.deveaud@qwant.com (R. Deveaud);
                         alaa.el-ebshihy@researchstudio.at (A. El-Ebshihy); espinosa-ankel@cardiff.ac.uk (L. Espinosa-Anke);
                         tobias.fink@researchstudio.at (T. Fink); petra.galuscakova@uis.no (P. Galuščáková);
                         gabriela-nicole.gonzalez-saez@univ-grenoble-alpes.fr (G. Gonzalez-Saez); lorraine.goeuriot@univ-grenoble-alpes.fr
                         (L. Goeuriot); david.iommi@researchstudio.at (D. Iommi); m.liakata@qmul.ac.uk (M. Liakata); htm43@bath.ac.uk
                         (H. T. Madabushi); Philippe.Mulhem@imag.fr (P. Mulhem); florina.piroi@researchstudio.at (F. Piroi); popel@ufal.mff.cuni.cz
                         (M. Popel); a.zubiaga@qmul.ac.uk (A. Zubiaga)
                          0000-0002-2875-5400 (R. Alkhalifa); 0000-0003-3262-0127 (H. Borkakoty); 0000-0003-2676-7405 (R. Deveaud);
                         0000-0001-6644-2360 (A. El-Ebshihy); 0000-0001-6830-9176 (L. Espinosa-Anke); 0000-0002-1045-8352 (T. Fink);
                         0000-0001-6328-7131 (P. Galuščáková); 0000-0003-0878-5263 (G. Gonzalez-Saez); 0000-0001-7491-1980 (L. Goeuriot);
                         0000-0002-4270-5709 (D. Iommi); 0000-0001-5765-0416 (M. Liakata); 0000-0001-5260-3653 (H. T. Madabushi);
                         0009-0001-4202-8664 (P. Medina-Alias); 0000-0002-3245-6462 (P. Mulhem); 0000-0001-7584-6439 (F. Piroi);
                         0000-0002-3628-8419 (M. Popel); 0000-0003-4583-3623 (A. Zubiaga)
                                   © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


CEUR Workshop Proceedings, ceur-ws.org, ISSN 1613-0073

1. Introduction
Outside the strict scientific context, the European Artificial Intelligence Act¹, adopted by the European
Commission in 2024, stresses in Article 17, section (d), that providers must comply with “examination,
test and validation procedures to be carried out before, during and after the development of the high-risk
AI system, and the frequency with which they have to be carried out”. Without focusing here on the
degree of risk of Information Retrieval or Classification systems, this Act clearly states that AI systems
must cope with evolution. Time is a dimension that is often overlooked when conducting Information
Retrieval (IR) experiments, especially when static datasets are used. The advantage of such datasets
is that they make it easy to evaluate and test systems. Some datasets, like CORD19, contain
documents collected at different points in time, so that the set of documents differs from one
collection time to another. Recent research [1] has demonstrated that models trained on data pertaining
to a particular time period struggle to maintain their performance when applied to test data that is
distant in time. On the other hand, [2] showed that neural systems, especially transformer-based ones,
are not always very sensitive to corpus evolution.
   With the aim of keeping model quality persistent over time, the
objective of the LongEval lab is twofold: (i) to explore the extent to which temporal differences,
as reflected in the evolution of evaluation datasets, result in the deterioration of the performance
of information retrieval and classification systems, and (ii) to propose improved methods that mitigate
the performance drop by making models more robust over time.
   The LongEval lab [3] took place as part of the Conference and Labs of the Evaluation Forum (CLEF)
2024, and consisted of two separate tasks: (i) Task 1, described in Section 2, focused on information
retrieval, and (ii) Task 2, described in Section 3, focused on text classification for sentiment analysis.
Both tasks provided labeled datasets enabling analysis and evaluation of models over data evolving in
time (what we call “longitudinally evolving data”). In this paper, we add details to [4] by focusing on
the dataset statistics and by analysing in detail the overall participant runs and results for each task.


2. Task 1 - Retrieval
The retrieval task of LongEval 2024 explores the effect of changes in datasets on the retrieval of text
documents. More specifically, we focus on a setup in which the datasets are evolving, as in the
LongEval 2023 Retrieval Task data [3]. This means that one dataset can be acquired from another by
adding, removing (and replacing) a limited number of documents and queries. The two main scenarios
considered focus either on a single system or on several systems, as detailed below:

A single system in an evolving setup
We explore how one selected system behaves when evaluated on several collections, which evolve
over time. The context of this task is retrieval performance for Web search.
When considering the evolution of Web data over time, we face a case in which the documents, the
queries and also the relevance continuously evolve. We thus study how Web search engines
deal with this situation. The considered scenario is similar to classical ad-hoc search, applied
to evolving datasets. The evaluation in this scenario considers the Web search case, in which the
top documents are the most important elements, and should take into account the evolving
nature of the data. Evaluation should ideally reflect the changes in the collection and especially signal
substantial changes that could lead to a performance drop. This would make it possible to re-train the
search engine model only when it is really necessary, and enable much more efficient overall training.
   As described earlier, there is no consensus about the stability over time of the performance of neural
IR systems, but it seems to be lower than in the case of statistical models.
Moreover, the performance strongly depends on the data used for training the neural model. One
objective of the task is to explore the behavior of neural systems in the evolving data scenario.

¹ https://www.europarl.europa.eu/doceo/document/TA-9-2024-0138_EN.html

Comparison of multiple systems in an evolving setup
While the first scenario explores a single system, comparing this system with multiple other systems
across evolving collections should provide more information about system stability and robustness.

2.1. Description of the task
Compared to the LongEval 2023 Dataset [3], in 2024 we take larger lags between the training and the
test sets. More precisely, the task is composed of:
      • One training set, which contains Web documents, actual user queries, and assessments, acquired
         at timestamp 𝑡;
      • Two test sets, acquired later than 𝑡 at times 𝑡′ and 𝑡″, composed of Web documents and user
         queries.
The task datasets were created over sequential time periods, which allows making observations at different
timestamps 𝑡 and, most importantly, comparing the performance across different timestamps 𝑡 and
𝑡′. The IR task thus aims to assess the performance difference between 𝑡′ and 𝑡″, where 𝑡″ occurs after 𝑡′,
given that the training set, acquired at 𝑡, was collected a few months before 𝑡′.

2.2. Dataset
As for LongEval 2023, in 2024 the data for this task were provided by the French search engine Qwant.
They consist of queries issued by the users of this search engine; cleaned Web documents, which
were either 1) selected to correspond to the queries or 2) added as additional noise; and relevance judgments,
which were created using a click model. The dataset is fully described in [5]. We provided training data,
which included 599 train queries, with corresponding 9,785 relevance assessments and 2,049,729 Web
pages. All training data were collected during January 2023. The test corpus is composed of two
subsets: Lag6, acquired in June 2023 (i.e., 6 months later than the training set), and Lag8, acquired in
August 2023 (i.e., 8 months later than the training set). The test dataset contains 4,321,642
documents (June: 1,790,028; August: 2,531,614) and 1,925 test queries (June: 407; August: 1,518). The
datasets are accessible through the lab’s webpage² and from the TU Wien Research Data Repository³.
  The data collected from the Qwant search engine is in French. To help participants, the
LongEval dataset for the Retrieval task also contains automatic translations into English of both queries
and documents. Note, however, that the translations provided by LongEval are only applied to
the first 500 characters of each sentence of the initial French documents downloaded.
  The document and query overlap ratios between the collections are given in Table 1 and Table 2.
We see from these tables that there is a substantial overlap between the Train and the Test collection
documents and (due to the larger size of the August query set) a substantial overlap between the Train /
June queries and the August queries.
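
The overlap ratios reported in Tables 1 and 2 are directional: the value in a cell is the share of the row collection that also occurs in the column collection, which is why the tables are not symmetric. The following is a minimal sketch of that computation. It assumes that items are matched by an identifier (for instance a document id or a query string), which the text does not specify; the function names and the toy identifiers are invented for illustration and are not LongEval data.

```python
def overlap_ratio(row_ids, col_ids):
    """Fraction of items in `row_ids` that also appear in `col_ids`."""
    row, col = set(row_ids), set(col_ids)
    if not row:
        raise ValueError("row collection is empty")
    return len(row & col) / len(row)


def overlap_matrix(collections):
    """Directional overlap ratio for every (row, column) pair of collections."""
    return {
        (r, c): overlap_ratio(collections[r], collections[c])
        for r in collections
        for c in collections
    }


# Toy identifiers, invented for illustration only.
toy = {
    "Train": {"d1", "d2", "d3", "d4"},
    "Lag6": {"d2", "d3", "d5"},
    "Lag8": {"d1", "d2", "d3", "d5", "d6"},
}
matrix = overlap_matrix(toy)
print(matrix[("Train", "Lag8")], matrix[("Lag8", "Train")])
```

In this toy setting the Train row shares three of its four items with Lag8, while Lag8 shares three of its five items with Train, so the two directions give different ratios, as in the tables.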

Table 1
Ratio of documents shared between the LongEval 2024 train and test collections, row vs. column, e.g. 0.93 means
that 93% of documents in the row collection are also included in the column collection.
                                                  Train 2024   June (Lag6)   August (Lag8)
                                  Train 2024            1.00          0.67            0.93
                                  June (Lag6)           0.77          1.00            0.97
                                  August (Lag8)         0.75          0.69            1.00


² https://clef-longeval.github.io/
³ https://doi.org/10.48436/xr350-79683

Table 2
Ratio of the queries shared between the LongEval 2024 train and test collections, rows vs. columns, e.g. 0.99
means that 99% of queries in the row collection are also included in the column collection.
                                            Train 2024   June (Lag6)   August (Lag8)
                            Train 2024            1.00          0.22            0.42
                            June (Lag6)           0.32          1.00            0.56
                            August (Lag8)         0.17          0.15            1.00

Figure 1: Overview of the systems using a neural approach (green) vs. other (yellow). Panels: (a) Lag6 dataset, (b) Lag8 dataset.

 To evaluate the submissions we use one set of relevance judgments: the judgments acquired by the
Qwant click model. For the evaluation, we use the NDCG measure at 10 (calculated for each dataset), as
well as the drop between the Lag6 and Lag8 collections. This allows us to check to what extent the IR
systems cope with the evolution of the data. We also plan to use manual assessments, acquired through the
interface described in Section 2.8.
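
As a reminder of how the NDCG measure at a cutoff of 10 behaves, the sketch below computes it for a single query and averages it over queries. It is only an illustration under stated assumptions: it uses the linear-gain formulation (gain divided by log2 of rank + 1), which may differ from the exact variant and tooling used to produce the official lab scores, and the names `ndcg_at_k`, `mean_ndcg`, `toy_qrels` and `toy_run` are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of a list of gains given in rank order."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranking, qrels, k=10):
    """NDCG@k for one query.

    ranking: document ids in the order returned by the system.
    qrels:   dict mapping document id -> graded relevance (missing = 0).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranking[:k]]
    # The ideal ranking lists all judged documents by decreasing relevance.
    ideal_dcg = dcg(sorted(qrels.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def mean_ndcg(run, qrels_by_query, k=10):
    """Average NDCG@k over all judged queries; unanswered queries score 0."""
    scores = [
        ndcg_at_k(run.get(qid, []), qrels, k)
        for qid, qrels in qrels_by_query.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy judgments and run, invented for illustration only.
toy_qrels = {"q1": {"a": 2, "b": 1}, "q2": {"c": 1}}
toy_run = {"q1": ["b", "x", "a"], "q2": ["c"]}
print(round(mean_ndcg(toy_run, toy_qrels), 3))
```

Computing such an average once per test collection yields one score per lag for each run, which is the quantity compared between Lag6 and Lag8.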

2.3. Submissions
Fourteen teams submitted their systems to the Retrieval task. Each team was allowed to submit up to 10
systems. In total, 73 runs were submitted. Two teams submitted their runs on the wrong
test dataset, so we do not include their submission results in our further analysis.

2.4. Absolute Scores
For the Retrieval task of the LongEval lab, we computed two sets of scores for each of the lags in the
test collection, namely NDCG and MAP. Table 3 gives an overview of these scores for each run on the Lag6
and Lag8 datasets. For each run, the columns of the table indicate which language was used (English,
French, or both), whether neural approaches were involved (values yes/no), and whether a
combination of several approaches was used (values yes/no). In addition, we show NDCG score
histograms for these runs, in decreasing order, for each dataset, showing whether a run uses any neural
approach (green for yes, yellow for no) in Figure 1, and whether the run uses a combination of more
than a single approach (orange for yes, cyan for no) in Figure 2. This information was acquired from
the participants through a questionnaire that they had to fill in for each submitted run. Figure 3
shows which language each run used.
From Table 3 we see that the systems which did best on the Lag6 data are also among the top on
Lag8, where the scores of the nine top-ranked systems are comparable to each other. For instance, the best
system on Lag6 according to the NDCG measure (dam_run_4) is still among the top-ranked systems on Lag8.
Similarly, mouse_run_8, one of the best systems on Lag8 according to the NDCG measure, is ranked
second on Lag6. This finding holds for the MAP measure as well.
Figure 2: Overview of the systems which use a single approach (orange) and which use a combination of
multiple approaches (cyan). Panels: (a) Lag6 dataset, (b) Lag8 dataset.

Figure 3: Overview of the systems which use French (blue), which use English translations (red), and which use
both (purple). Panels: (a) Lag6 dataset, (b) Lag8 dataset.

   Here, we describe the methods used in the top-3 runs, according to the NDCG evaluation measure,
for both the Lag6 and Lag8 datasets. For the Lag6 dataset, the top-3 systems are:


      1. dam_run_4 from the DAM team: This system uses BM25 as a first-stage retrieval model, enhanced
         with proximity search, query expansion via synonyms, and the MBNET model [6], which combines
         BERT and XLNET, for re-ranking the results.

      2. mouse_run_8 from the MOUSE team: This system also uses BM25 as a first-stage retrieval model,
         enhanced with an LLM-based re-ranking model using the Cohere API⁴. It utilizes the Llama 3
         model [7] for query expansion.

      3. mouse_run_10 from the MOUSE team: Similar to mouse_run_8, this system uses BM25 as a first-stage
         retrieval model, but it is enhanced with a deep neural re-ranking model using PyGaggle⁵. It
         also employs the Llama 3 model for query expansion.

For the Lag8 dataset, the top-3 systems are:

      1. mouse_run_9 from the MOUSE team: This system uses BM25 as a first-stage retrieval model, enhanced
         with a deep neural re-ranking model using PyGaggle⁵. It uses the Mixtral model [8] for
         query expansion.

      2. mouse_run_8 from the MOUSE team: Described above.

      3. mouse_run_10 from the MOUSE team: Described above.
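
The top runs listed above share a two-stage design: a BM25 first stage selects candidates, and a second, costlier model re-orders them. The sketch below illustrates only this general pattern and is not the implementation of any participant. The BM25 parameters (k1 = 1.2, b = 0.75), the Lucene-style IDF, the whitespace tokenizer and the `overlap_scorer` function are assumptions made for the example; `overlap_scorer` is a trivial placeholder standing in for a neural re-ranker, and query expansion is omitted.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 index over a dict of {doc_id: text}."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.tf = {d: Counter(tokenize(text)) for d, text in docs.items()}
        self.length = {d: sum(tf.values()) for d, tf in self.tf.items()}
        self.avg_length = sum(self.length.values()) / len(docs)
        df = Counter(term for tf in self.tf.values() for term in tf)
        n = len(docs)
        # Lucene-style IDF, which stays non-negative for very frequent terms.
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    def score(self, query, doc_id):
        tf = self.tf[doc_id]
        norm = 1 - self.b + self.b * self.length[doc_id] / self.avg_length
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + self.k1 * norm)
        return total

    def search(self, query, top_n=100):
        """First stage: ids of the top_n documents with a non-zero BM25 score."""
        scored = [(self.score(query, d), d) for d in self.tf]
        scored = [(s, d) for s, d in scored if s > 0]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [d for _, d in scored[:top_n]]


def rerank(query, candidates, docs, scorer, top_k=10):
    """Second stage: re-order the first-stage candidates with another scorer."""
    return sorted(candidates, key=lambda d: -scorer(query, docs[d]))[:top_k]


def overlap_scorer(query, text):
    """Placeholder for a neural re-ranker: share of query terms in the text."""
    q, d = set(tokenize(query)), set(tokenize(text))
    return len(q & d) / len(q) if q else 0.0


# Toy French documents, invented for illustration only.
docs = {
    "d1": "recette de crêpes faciles",
    "d2": "recette de gâteau au chocolat",
    "d3": "horaires de train paris lyon",
}
index = BM25(docs)
candidates = index.search("recette crêpes")
final = rerank("recette crêpes", candidates, docs, overlap_scorer)
print(candidates, final)
```

Keeping the re-ranker behind a plain `scorer(query, text)` function means the first stage can stay unchanged while the second stage is swapped, which mirrors how the runs above differ mainly in their re-ranking and query expansion components.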


⁴ https://docs.cohere.com/docs/rerank-2
⁵ https://github.com/castorini/pygaggle

Generally, most of the solutions chosen by the participants in the LongEval Retrieval task apply a
multi-stage retrieval approach. Often, the first stage involves lexical retrieval (e.g., BM25) and
query expansion methods like PL2 or BO1. Query expansion is also done by employing Large Language
Models, like Mistral or Llama 3. Re-ranking is done using either neural methods or sentence-based
transformers. Listwise re-rankers and fusion have also been used to re-rank the retrieved results.
Notably, the temporal aspect of the LongEval test collection has been used by some participants to
include past query relevance information into query reformulation, either from click logs or from the
documents deemed relevant in the previous
  Considering Figures 1, 2 and 3, we see that the shape of the distribution of the NDCG values is
similar for the Lag6 and Lag8 datasets. However, the systems perform better on Lag6 than
on Lag8, with a maximum NDCG of about 0.4 on Lag6 versus about 0.3 on Lag8.


Table 3: NDCG and MAP scores for Lag6 and Lag8. Results are sorted according to the NDCG scores on
         Lag6.
                                                                         NDCG              MAP
    Run Id                    Neural    Comb.          Language       Lag6     Lag8    Lag6     Lag8
    dam_run_4 [9]                 yes         no          French      0.396    0.294   0.249    0.171
    mouse_run_8 [10]              yes        yes          French      0.395    0.298   0.248    0.174
    mouse_run_10 [10]             yes        yes          French      0.393    0.298   0.246    0.175
    iris_run_4 [11]               yes        yes          French      0.392    0.293   0.244    0.171
    mouse_run_9 [10]              yes        yes          French      0.392    0.298   0.245    0.175
    iris_run_1 [11]               yes        yes          French      0.392    0.294   0.244    0.171
    iris_run_2 [11]               yes        yes          French      0.392    0.293   0.242    0.170
    iris_run_3 [11]               yes        yes          French      0.391    0.293   0.243    0.171
    iris_run_5 [11]               yes                     French      0.390    0.294   0.240    0.171
    mouse_run_7 [10]              yes         no          French      0.386    0.288   0.236    0.163
    dam_run_3 [9]                  no         no          French      0.385    0.285   0.235    0.162
    quokkas_run_2                  no         no          French      0.379    0.276   0.225    0.150
    quokkas_run_1                  no         no          French      0.374    0.274   0.221    0.148
    lfzzo_run_7                    no         no          French      0.373    0.269   0.221    0.145
    lfzzo_run_8                    no         no          French      0.372    0.269   0.221    0.144
    lfzzo_run_9                    no         no          French      0.372    0.268   0.221    0.143
    lfzzo_run_10                   no         no          French      0.372    0.269   0.219    0.145
    lfzzo_run_6                    no         no          French      0.371    0.270   0.218    0.145
    dam_run_5 [9]                 yes         no          French      0.370    0.279   0.220    0.156
    mouse_run_6 [10]              yes         no          French      0.367    0.286   0.215    0.162
    cir_run_3 [12]                 no         no          English     0.354    0.242   0.226    0.136
    snu_run_1 [13]                yes        yes          English     0.334    0.251   0.197    0.142
    ows_run_1 [13]                 no         no          English     0.333    0.243   0.199    0.139
    kalu_run_2 [14]               yes         no          French      0.330    0.254   0.192    0.143
    kalu_run_3 [14]               yes         no          French      0.330    0.254   0.192    0.143
    kalu_run_5 [14]               yes         no          French      0.324    0.249   0.188    0.140
    kalu_run_4 [14]               yes         no          French      0.323    0.250   0.186    0.140
    cir_run_4 [12]                 no         no          English     0.320    0.229   0.172    0.117
    wonder_run_3                   no         no   French,English     0.313    0.235   0.163    0.116
    cir_run_2 [12]                yes         no          English     0.308    0.230   0.173    0.123
    mouse_run_3 [10]              yes        yes          English     0.306    0.235   0.171    0.126
    ows_run_2 [15]                 no         no          English     0.306    0.229   0.197    0.140
    dam_run_2 [9]                 yes         no          English     0.304    0.231   0.169    0.121
    mouse_run_4 [10]              yes       yes           English   0.304    0.232    0.167   0.124
    mouse_run_5 [10]              yes       yes           English   0.304    0.232    0.166   0.124
    wonder_run_4                   no        no           French    0.299    0.223    0.155   0.107
    kalu_run_1 [14]                no        no           French    0.298    0.219    0.158   0.107
    galapagos_run_4 [16]          yes       yes           English   0.295    0.220    0.189   0.131
    ows_run_3 [15]                yes       yes           English   0.294    0.224    0.188   0.135
    dam_run_1 [9]                  no        no           English   0.294    0.221    0.156   0.112
    galapagos_run_5 [16]          yes       yes           English   0.293    0.221    0.187   0.132
    mouse_run_2 [10]              yes        no           English   0.291    0.225    0.152   0.115
    mouse_run_1 [10]              yes        no           English   0.291    0.225    0.153   0.114
    ows_run_7 [15]                yes       yes           English   0.290    0.213    0.180   0.123
    cir_run_5 [12]                 no        no           English   0.285    0.212    0.148   0.104
    ows_run_6 [15]                yes       yes           English   0.284    0.216    0.173   0.126
    cir_run_1 [12]                 no        no           English   0.282    0.211    0.145   0.103
    snu_run_2 [13]                yes       yes           English   0.282    0.213    0.177   0.127
    lfzzo_run_4                    no        no           English   0.280    0.209    0.142   0.102
    lfzzo_run_2                    no        no           English   0.280    0.207    0.142   0.099
    wonder_run_2                   no        no           English   0.279    0.207    0.137   0.099
    lfzzo_run_3                    no        no           English   0.277    0.209    0.139   0.102
    lfzzo_run_1                    no        no           English   0.276    0.207    0.140   0.100
    lfzzo_run_5                    no        no           English   0.274    0.207    0.137   0.101
    seekx_run_1                    no        no           French    0.274    0.201    0.145   0.095
    seekx_run_2                    no        no           French    0.274    0.202    0.144   0.096
    seekx_run_4                    no        no           English   0.273    0.202    0.139   0.098
    wonder_run_5                   no        no           English   0.273    0.203    0.137   0.098
    wonder_run_1                   no        no           English   0.272    0.203    0.136   0.098
    seekx_run_5                    no        no           English   0.264    0.193    0.133   0.091
    galapagos_run_2 [16]          yes       yes           English   0.261    0.198    0.162   0.115
    galapagos_run_1 [16]          yes       yes           English   0.258    0.196    0.157   0.111
    galapagos_run_3 [16]          yes       yes           English   0.253    0.192    0.151   0.107
    ows_run_4 [15]                yes       yes           English   0.246    0.204    0.128   0.114
    ows_run_5 [15]                 no       yes           English   0.240    0.177    0.124   0.085
    seekx_run_3                    no        no           French    0.236    0.174    0.120   0.079
    AVERAGE                                                         0.318    0.238    0.183   0.129



2.5. Changes in the Scores
The main part of the retrieval task is to study the changes in the performance scores between the
collections. Since the collections were created using the same approach and procedure and have a relatively
high overlap in terms of both queries and documents (see Tables 1 and 2), we provide the Relative
NDCG Drop (RND) values of systems between the collections Lag6 and Lag8. RND(𝑟), for a system 𝑟, is
defined as:
\[
\mathrm{RND}(r) = \frac{\mathrm{NDCG}_{\mathrm{Lag6}}(r) - \mathrm{NDCG}_{\mathrm{Lag8}}(r)}{\mathrm{NDCG}_{\mathrm{Lag6}}(r)}
\]

   With this definition, small RND values mean that systems are more robust against changes, and large RND
values mean that the systems are not able to generalize well between Lag6 and Lag8. What we see in
Table 4 is that the systems which are most robust to the evolution of the test collections (low RND
values) are not the best ones: for instance, ows_run_4 is the most robust system but the third worst one
in Table 3. The best systems in terms of NDCG on Lag6, dam_run_4 and mouse_run_8, have RND values
of 0.258 and 0.245, respectively, which means that they are fairly robust, but much less so than the most
robust ones. This shows that the very best systems do cope to some extent with the evolution of the corpus,
but that there is room for improving the robustness of the best systems. We also see that the least robust
system, cir_run_3, does not rely on neural IR models: this finding suggests that neural
models may be more robust against changes than non-neural ones.
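
As a worked illustration of the RND definition, the sketch below recomputes the drop for four runs from the three-decimal NDCG scores listed in Table 4 and orders them from most to least robust. The function name `relative_ndcg_drop` is hypothetical and chosen for this example only.

```python
def relative_ndcg_drop(ndcg_lag6, ndcg_lag8):
    """RND = (NDCG_Lag6 - NDCG_Lag8) / NDCG_Lag6."""
    if ndcg_lag6 <= 0:
        raise ValueError("NDCG on Lag6 must be positive")
    return (ndcg_lag6 - ndcg_lag8) / ndcg_lag6


# (Lag6, Lag8) NDCG scores copied from Table 4 (values rounded to three decimals).
scores = {
    "ows_run_4": (0.246, 0.204),
    "mouse_run_8": (0.395, 0.298),
    "dam_run_4": (0.396, 0.294),
    "cir_run_3": (0.354, 0.242),
}
rnd = {run: relative_ndcg_drop(*pair) for run, pair in scores.items()}

# A lower RND means a smaller relative loss between Lag6 and Lag8.
most_robust_first = sorted(rnd, key=rnd.get)
for run in most_robust_first:
    print(f"{run}: {rnd[run]:.3f}")
```

Because the inputs are already rounded, the recomputed values can differ slightly in the third decimal from the RND column of Table 4 (for ows_run_4, about 0.171 rather than 0.169); the ordering of these four runs is unaffected.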


        Table 4: Changes in the NDCG scores. Lines are ordered by ascending RND values.
                                                   NDCG          RND
                            System              Lag6     Lag8
                            ows_run_4           0.246    0.204   0.169
                            mouse_run_6         0.367    0.286   0.220
                            kalu_run_4          0.323    0.250   0.224
                            mouse_run_1         0.291    0.225   0.226
                            mouse_run_2         0.291    0.225   0.229
                            kalu_run_2          0.330    0.254   0.230
                            kalu_run_5          0.324    0.249   0.230
                            mouse_run_3         0.306    0.235   0.231
                            kalu_run_3          0.330    0.254   0.232
                            mouse_run_5         0.304    0.232   0.235
                            mouse_run_4         0.304    0.232   0.235
                            ows_run_6           0.284    0.216   0.238
                            galapagos_run_1     0.258    0.196   0.239
                            ows_run_3           0.294    0.224   0.239
                            mouse_run_9         0.392    0.298   0.240
                            galapagos_run_2     0.261    0.198   0.241
                            dam_run_2           0.304    0.231   0.241
                            mouse_run_10        0.393    0.298   0.243
                            galapagos_run_3     0.253    0.192   0.243
                            lfzzo_run_3         0.277    0.209   0.243
                            snu_run_2           0.282    0.213   0.245
                            mouse_run_8         0.395    0.298   0.245
                            dam_run_5           0.370    0.279   0.245
                            lfzzo_run_5         0.274    0.207   0.245
                            wonder_run_3        0.313    0.235   0.247
                            iris_run_5          0.390    0.294   0.248
                            galapagos_run_5     0.293    0.221   0.248
                            dam_run_1           0.294    0.221   0.249
                            snu_run_1           0.334    0.251   0.250
                            iris_run_3          0.391    0.293   0.251
                            lfzzo_run_1         0.276    0.207   0.251
                            ows_run_2           0.306    0.229   0.251
                            iris_run_2          0.392    0.293   0.251
                            iris_run_1          0.392    0.294   0.251
                            lfzzo_run_4         0.280    0.209   0.252
                            iris_run_4          0.392    0.293   0.252
                            cir_run_2           0.308    0.230   0.252
                            cir_run_1           0.282    0.211   0.252
                            wonder_run_1        0.272    0.203   0.253
                            wonder_run_4        0.299    0.223   0.253
                            mouse_run_7         0.386    0.288   0.255
                            galapagos_run_4     0.295    0.220   0.256
                             wonder_run_5         0.273   0.203    0.257
                             cir_run_5            0.285   0.212    0.257
                             dam_run_4            0.396   0.294    0.258
                             wonder_run_2         0.279   0.207    0.258
                             dam_run_3            0.385   0.285    0.258
                             seekx_run_4          0.273   0.202    0.260
                             ows_run_5            0.240   0.177    0.261
                             lfzzo_run_2          0.280   0.207    0.261
                             seekx_run_2          0.274   0.202    0.263
                             seekx_run_1          0.274   0.201    0.264
                             ows_run_7            0.290   0.213    0.264
                             seekx_run_3          0.236   0.174    0.265
                             kalu_run_1           0.298   0.219    0.265
                             seekx_run_5          0.264   0.193    0.267
                             quokkas_run_1        0.374   0.274    0.268
                             quokkas_run_2        0.379   0.276    0.271
                             ows_run_1            0.333   0.243    0.272
                             lfzzo_run_6          0.371   0.270    0.273
                             lfzzo_run_10         0.372   0.269    0.277
                             lfzzo_run_8          0.372   0.269    0.277
                             lfzzo_run_7          0.373   0.269    0.280
                             lfzzo_run_9          0.372   0.268    0.281
                             cir_run_4            0.320   0.229    0.284
                             cir_run_3            0.354   0.242    0.316
                             AVERAGE              0.305   0.228    0.251



2.6. Run Rankings
Another point of view studied is how the submitted runs compare to each other, either in terms of
the absolute NDCG scores achieved on the collections, or in terms of NDCG changes between the
collections. We also calculated the Pearson correlation between the results of the runs on the two
collections (not shown here): the correlation is high in terms of NDCG scores (0.99) and similarly
high in terms of ranking order (0.98). This corresponds to the relatively high overlaps of the
documents and also the queries between the Lag6 and Lag8 collections (Table 1 and Table 2). This
observation does not hold for the correlation between the ranking according to the NDCG score achieved
and the ranking of the performance change, which is relatively low: the Pearson correlation is 0.07
for the Lag6 dataset and -0.05 for the Lag8 dataset.
   Last, we calculated a combination of both rankings (ranking in terms of absolute values and ranking
in terms of change). For this, we first calculated a Borda count of the ranking in terms of absolute
values and a Borda count of the ranking in terms of relative change, and then we simply summed these
two Borda counts; the result is displayed in the last column of Table 5. We see that in terms of this
measure the top performing systems (on the Lag6 and Lag8 datasets) are ranked higher, although they
rank lower in terms of the NDCG change.
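
   The combined score can be sketched as follows. This is a minimal illustration, not the organisers'
code: it assumes that each ranking awards n - rank + 1 points to a system (n being the number of ranked
systems) and that the points from the Lag6, Lag8, and NDCG-change rankings are summed. This assumption
reproduces the values in the last column of Table 5; the function names are hypothetical.

```python
from typing import Iterable


def borda_points(rank: int, n_systems: int) -> int:
    """Points for one ranking: the best rank (1) receives n_systems points."""
    return n_systems - rank + 1


def combined_borda(ranks: Iterable[int], n_systems: int) -> int:
    """Sum the Borda points of one system over several rankings."""
    return sum(borda_points(r, n_systems) for r in ranks)


# Ranks copied from Table 5: (NDCG Lag6, NDCG Lag8, NDCG change), 66 systems.
table5_ranks = {
    "dam_run_4": (1, 4, 45),
    "mouse_run_9": (5, 1, 15),
    "seekx_run_3": (66, 66, 54),
}
combined = {name: combined_borda(r, 66) for name, r in table5_ranks.items()}
print(combined)  # {'dam_run_4': 151, 'mouse_run_9': 180, 'seekx_run_3': 15}
```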


Table 5: Ranking of the submitted systems by NDCG scores (columns 2-3) and by changes in NDCG scores
         between the Lag6 and Lag8 datasets (column 4, RND). Column 5 shows the sum of the Borda counts
         applied to the rankings on the Lag6 and Lag8 datasets and the Borda count of the ranking of the
         change between the Lag8 and Lag6 datasets. The darker color means better performance.
    System              NDCG Lag6          NDCG Lag8          RND               Borda
    dam_run_4           1                  4                  45                151
mouse_run_8       2    2    22   175
mouse_run_10      3    3    18   177
iris_run_4        4    7    36   154
mouse_run_9       5    1    15   180
iris_run_1        6    5    34   156
iris_run_2        7    8    33   153
iris_run_3        8    9    30   154
iris_run_5        9    6    26   160
mouse_run_7       10   10   41   140
dam_run_3         11   12   47   131
quokkas_run_2     12   14   58   117
quokkas_run_1     13   15   57   116
lfzzo_run_7       14   19   63   105
lfzzo_run_8       15   17   62   107
lfzzo_run_9       16   20   64   101
lfzzo_run_10      17   18   61   105
lfzzo_run_6       18   16   60   107
dam_run_5         19   13   23   146
mouse_run_6       20   11   2    168
cir_run_3         21   27   66   87
snu_run_1         22   23   29   127
ows_run_1         23   26   59   93
kalu_run_2        24   21   9    147
kalu_run_3        24   22   6    149
kalu_run_5        26   25   7    143
kalu_run_4        27   24   3    147
cir_run_4         28   34   65   74
wonder_run_3      29   29   25   118
cir_run_2         30   33   37   101
mouse_run_3       31   28   8    134
ows_run_2         32   35   32   102
dam_run_2         33   32   17   119
mouse_run_4       34   31   11   125
mouse_run_5       35   30   10   126
wonder_run_4      36   39   40   86
kalu_run_1        37   43   55   66
galapagos_run_4   38   42   42   79
ows_run_3         39   38   14   110
dam_run_1         40   41   28   92
galapagos_run_5   41   40   27   93
mouse_run_2       42   37   5    117
mouse_run_1       43   36   4    118
ows_run_7         44   45   53   59
cir_run_5         45   47   44   65
ows_run_6         46   44   12   99
cir_run_1         47   48   38   68
snu_run_2         48   46   21   86
lfzzo_run_4       49   49   35   68
lfzzo_run_2       50   54   50   47
wonder_run_2      51   52   46   52
lfzzo_run_3       52   50   20   79
lfzzo_run_1       53   53   31   64
    lfzzo_run_5         54                 51                 24                72
    seekx_run_1         55                 60                 52                34
    seekx_run_2         56                 59                 51                35
    seekx_run_4         57                 58                 48                38
    wonder_run_5        58                 57                 43                43
    wonder_run_1        59                 56                 39                47
    seekx_run_5         60                 63                 56                22
    galapagos_run_2     61                 61                 16                63
    galapagos_run_1     62                 62                 13                64
    galapagos_run_3     63                 64                 19                55
    ows_run_4           64                 55                 1                 81
    ows_run_5           65                 65                 49                22
    seekx_run_3         66                 66                 54                15



2.7. Queries Overview
We further investigate performance on the provided queries. For space reasons, we only investigate
a selected subset of queries from each collection. We used a pooling strategy to select these queries,
which are also used for the manual assessment process (described in Section 2.8). We first selected the
top five performing runs according to their average NDCG performance on both collections. We then
calculated the per-query performance of these runs for each collection (i.e. Lag6 and Lag8) and sorted
the queries based on their NDCG performance for the five runs. Then, we divided the query set of each
collection into four sets and randomly selected queries from each set: five per set for Lag6 and 10 per
set for Lag8. In total, we selected 20 queries from the Lag6 collection and 40 from the Lag8 collection.
We selected more queries from the Lag8 collection since, as shown in Table 2, the number of queries in
the Lag8 collection is higher than in the Lag6 collection.
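
   The selection procedure described above can be sketched as follows. This is an illustration under
stated assumptions: the function name, the fixed random seed, the handling of an uneven split (the last
set absorbs the remainder), and the synthetic scores are all hypothetical and not taken from the lab's
implementation.

```python
import random


def select_queries(query_scores, per_stratum, n_strata=4, seed=0):
    """Sort queries by score, split them into strata, and sample from each."""
    rng = random.Random(seed)
    ordered = sorted(query_scores, key=query_scores.get)
    size = len(ordered) // n_strata
    selected = []
    for i in range(n_strata):
        # The last stratum absorbs the remainder when the split is uneven.
        end = (i + 1) * size if i < n_strata - 1 else len(ordered)
        stratum = ordered[i * size:end]
        selected.extend(rng.sample(stratum, min(per_stratum, len(stratum))))
    return selected


# Synthetic per-query NDCG values (e.g. the mean over the top five runs).
scores = {f"q{i}": i / 100 for i in range(100)}
lag6_like = select_queries(scores, per_stratum=5)    # 4 sets x 5 = 20 queries
lag8_like = select_queries(scores, per_stratum=10)   # 4 sets x 10 = 40 queries
```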
   An overview of the scores achieved for the selected queries in each collection is displayed in
Figure 4. The figure shows the minimum performance (by any submitted run), the 25% quantile, the 75%
quantile, and the maximum achieved NDCG score. Due to a relatively large number of runs, the range of
the scores achieved is typically quite large, and for some of the queries it even ranges between 0 and
0.8. It can also be noticed that the variation (corresponding to the size of the boxplot) of the query
performance is higher for the Lag8 collection than for the Lag6 collection.
   Some of the worst performing queries are very general (“birdsong”, “taxes”, and “used car” for
instance) and can thus be expected to be ambiguous. This is in contrast with the top performing queries
(e.g. “camping concarneau”, “Prune rabbit”, and “point bordeaux vision”), which refer to more specific
information needs. Some other top performing queries have high variation in the results, e.g. the query
“origami bird”, for which it is not specified whether the user is looking for information about origami
birds or for tutorials on how to make them.

2.8. Manual relevance judgments acquisition
The evaluation results of LongEval IR task presented above rely on automatic assessments generated
from click models [5]. In addition to these click-based relevance assessments, we have set up an
annotation tool to acquire further relevance assessments by humans. For that, we used the open-source
annotation tool Doctag [17] on the sample of queries selected in Section 2.7 (60 queries in total).
   Doctag provides a customizable and portable platform specifically designed for Information Retrieval
(IR) evaluation. To perform manual relevance judgments using Doctag, annotators utilize its web-based
interface. They access the tool and interact with its annotation functionalities, including the assignment
of labels to indicate document relevance to specific queries. Annotators view the documents and
associate appropriate relevance labels (Fig. 5). The documents to be annotated were selected through
pooling the participants’ runs [18]. For the annotation to remain tractable, we conducted a stratified
sampling and selected 60 queries for evaluation (Section 2.7). We set up dedicated online servers where
Doctag was deployed; through their use we have acquired over 25K manual assessments. In total, 2900
documents from the original dataset were assessed. The average number of assessments per query is
around 429. To perform the manual annotation and assess document relevance for the corresponding
queries, we assigned subsets of the document dataset to a team of 25 annotators. Each annotator was
assigned to a specific server to perform the annotation tasks. This distributed setup allowed for
parallel processing, enabling annotators to work simultaneously and collaborate effectively within
their assigned subsets.

Figure 4: Selected queries performance from Lag6 and Lag8 datasets: (a) Lag6 Dataset, (b) Lag8 Dataset.

Figure 5: Screenshot from Doctag main page. Labels annotation is done associating to each document one label
that expresses the relevance of that document for that topic.

   We have recorded an aggregate of 25,759 judgments. These judgments span across four distinct
categories: ’Relevant’, ’Not Relevant’, ’Partially Relevant’, and ’I Don’t Know’.
   Preliminary analysis of the data indicates a more balanced approach among annotators in categorizing
the query-document pairs. Figure 6 presents the judgment distribution for the top 30 queries in terms of
document count. What we observe in Figure 6 is a more evenly distributed number of relevant (green)
and non-relevant (red) documents for many queries. While some queries still show a high number
of relevant documents (with peaks exceeding 300 relevant documents), the number of non-relevant
documents is also significant, indicating no single dominant category. This balanced distribution of
relevant and non-relevant documents is much more equitable than previous analyses, where non-
relevant judgments predominated.
Figure 6: The distribution of judgment votes for the top 30 queries based on document count. Resulting counts
of ‘Relevant’ (green), ‘Not Relevant’ (red), and ‘Partially Relevant’ (orange) votes are shown.




Figure 7: Violin plots showing the distribution of judgment counts across different categories for all queries.
The plots reveal that the distributions for relevant and not relevant judgments are similar, both with wide ranges
and high densities around the median values.
  Additionally, Figure 7 provides a detailed view of the distribution of judgment counts across all
queries using violin plots. The violin plots reveal that the distributions for relevant and non-relevant
judgments are quite similar, with both categories showing a wide range of counts and high densities
around the median values. The partially relevant category, while also having a substantial number of
judgments, shows a narrower distribution, indicating less variability. The "I don’t know" category has a
very narrow distribution, reflecting its infrequent use among annotators.
  Further evaluation rounds utilizing the collected data are in progress. We will use the annotated
documents and relevance annotations for the queries to construct an aggregated Qrel file. With this
Qrel file, we will run the evaluation using trec_eval (footnote 6) on the participants’ runs. trec_eval
will compare each system’s retrieved results against the ground truth relevance judgments defined in
the Qrel file. This evaluation process will provide valuable insights by comparing the results of the
click model with the manual annotations, thereby assessing the effectiveness and performance of the
information retrieval systems in relation to the specified queries.
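
  As an illustration of what such an aggregated Qrel file could look like, the sketch below turns
individual judgments into lines in the standard TREC qrels layout (query id, iteration, document id,
relevance). Everything beyond that layout is an assumption made for this example and not described in
the paper: the mapping of labels to grades, the majority vote per query-document pair, breaking ties
towards the lower grade, and discarding "I Don't Know" votes.

```python
from collections import Counter, defaultdict

# Assumed mapping from annotation labels to graded relevance (not from the paper).
GRADES = {"Relevant": 2, "Partially Relevant": 1, "Not Relevant": 0}


def build_qrels(judgments):
    """Aggregate (query_id, doc_id, label) votes into qrels lines."""
    votes = defaultdict(Counter)
    for query_id, doc_id, label in judgments:
        if label in GRADES:  # labels outside the mapping are skipped
            votes[(query_id, doc_id)][GRADES[label]] += 1
    lines = []
    for (query_id, doc_id), counter in sorted(votes.items()):
        top = max(counter.values())
        # Majority vote; ties fall back to the lower grade (an assumption).
        grade = min(g for g, c in counter.items() if c == top)
        lines.append(f"{query_id} 0 {doc_id} {grade}")
    return lines


# Hypothetical judgments for illustration only.
judgments = [
    ("q1", "d1", "Relevant"), ("q1", "d1", "Relevant"), ("q1", "d1", "Not Relevant"),
    ("q1", "d2", "I Don't Know"), ("q1", "d2", "Partially Relevant"),
    ("q2", "d1", "Relevant"), ("q2", "d1", "Not Relevant"),
]
qrels = build_qrels(judgments)
print("\n".join(qrels))
```

  The resulting lines can be written to a text file and passed to trec_eval together with a run file.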

2.9. Discussion and conclusion
This task was the second attempt to collectively investigate the impact of the evolution of the data on
search systems’ performance. Having 14 participating teams submitting runs confirmed that this topic
is of interest to the community.
   The dataset released for this task consisted of a sequence of test collections corresponding to
different times. The collections were composed of documents and queries coming from Qwant, and
relevance judgments coming from a click model and manual assessment. While the manual assessment is
ongoing at the time of the paper’s publication, the performance of the participants’ submitted runs was
measured using the click logs.
   Most of the submitted runs rely on multi-stage retrieval approaches, in addition to the use of Large
Language Models for query expansion. The translation of the documents and queries provided by the lab
has a clear impact: the best results were obtained on the original French data.
   Since the subsets had substantial overlaps, the correlation between system rankings was high. As
for the robustness of the systems towards dataset changes, we observed that the systems that are the
most robust to the evolution of the test collection were not the best performing ones.
   Further evaluations will be carried out in the near future with the manual assessment of the pooled
sets. A thorough analysis of the results will be necessary to study the impact of queries on the results
(their nature, topic, difficulty, etc.). Further analysis work will be necessary to fully establish the
robustness of the systems and the specific impact of dataset evolution on the performances.


3. Task 2 - Classification
Stance detection, an essential task in natural language processing (NLP), involves identifying an author’s
position or attitude towards a particular topic or statement. This task goes beyond simple sentiment
analysis by requiring models to discern not just positive or negative sentiments but also the specific
stance (supporting/believer, opposing/denier, or neutral) towards a given target [19, 20].
   Comprehending the evolution of social media stances over time poses a significant challenge, a topic
that has gained recent interest in the AI and NLP communities but remains relatively unexplored. The
performance of social media stance classifiers is intricately linked to temporal shifts in language and
evolving societal attitudes toward the subject matter [21].
   In LongEval 2024, social media stance detection, a multi-label English classification task, takes center
stage, surpassing the complexity of the binary sentiment task in LongEval 2023 [22]. Our primary goal
is to assess the persistence of stance detection models in the dynamic landscape of social media posts.
6
    https://trec.nist.gov/trec_eval/

   The evolving nature of language and social opinions adds an additional layer of complexity to
the challenges faced by text classifiers. Language undergoes continuous changes, reflecting shifts in
societal norms and opinions and the emergence of novel concepts and words. For instance, consider the
evolution of public opinion on climate change over the past two decades:

    • Sentence from 2000: “Global warming is a theory that needs more proof; it’s not urgent.”

    • Sentence from 2010: “Evidence for climate change is mounting, and we need to start taking
      action.”

    • Sentence from 2020: “Climate change is an undeniable crisis that requires immediate global
      action.”

   The context over two decades in the above example shows that language and urgency surrounding
climate change have evolved from skepticism to an accepted crisis. Models not updated with recent
discussions and policy changes might fail to accurately capture the critical tone and terminology used
in current dialogues about the environment. Similarly, the rapid emergence of new vocabulary, as
witnessed with terms like COVID-19 [23], highlights the dynamic nature of language, presenting unique
challenges for text classifiers.

3.1. Description of the task
To assess the extent of the performance drop of models over shorter and longer temporal gaps, we
provided a comprehensive training dataset along with five testing sets. These testing sets include two
practice sets and three development sets. The shared competition aimed to stimulate the development
of classifiers that can effectively handle temporal variations and maintain performance persistence over
different time distances. Participants were expected to submit solutions for two sub-tasks, showcasing
their ability to address the challenges of temporal variations in performance. The shared task was in
turn divided into two sub-tasks:
   Sub-Task 1: Short-Term Persistence: In this sub-task, participants were tasked with developing
models that demonstrated performance persistence over short periods. Specifically, the models needed
to maintain their performance over a temporal gap between the within-time datasets and the short-term
datasets. This involved comparing the performance from the within-practice data (January 2010 to
December 2010) to the short-practice data (January 2014 to December 2014), a time gap of 4 years,
and from the within-dev data (January 2011 to December 2011) to the short-dev data (January 2015
to December 2015), also a time gap of 4 years.
   Sub-Task 2: Long-Term Persistence: This sub-task required participants to develop models that
maintained performance persistence over a longer period of time. The classifiers were expected to
mitigate performance drops over a temporal gap between the within-time datasets and the long-term
datasets. This involved comparing the performance from the within-dev data (January 2011 to
December 2011) to the long-dev data (January 2018 to September 2019), a time gap of approximately 7
to 8 years.
   In addition to the main sub-tasks, participants were also asked to work on models that maintained
performance within the same temporal year as the training set, with the within-practice data covering
January 2010 to December 2010 and the within-dev data covering January 2011 to December 2011,
with no gap between them and the training set (time gap 0).

3.2. Dataset
In this section, we present the process of constructing our final annotated corpus for the task. The
large-scale Climate Change Twitter dataset was originally described in [24]. Our primary focus is
on the climate change stance, the time of the post (created at), and the textual content of the tweets,
which we will refer to as the CC-SD dataset. CC-SD is large-scale, covering a span of 13 years and
containing a diverse set of more than 15 million tweets from various years. The tweets were annotated
using a BERT model, and the CC-SD stance labels fall into three categories: those that express support
for the belief in man-made climate change (believer), those that dispute it (denier), and those that
remain neutral on the topic.
   The total numbers of categorized tweets over the entire time span are as follows: 11,292,424 tweets
as believers, 1,191,386 as deniers, and 3,305,601 as neutral, distributed across the timeline. The
annotation was performed using transfer learning with BERT as distant supervision, based on another
climate change sentiment dataset (footnote 7); its precision can thus be improved through manual
annotation with a human in the loop.
   Data sampling. The dataset is first downsampled to ensure an equal number of instances for each
stance (neutral, denier, believer) within a specified date range, using the minimum stance count across
all selected months and years to avoid bias. This involves randomly sampling the same number of rows
for each stance, year, and month combination, ensuring balanced representation. The downsampled data
is then shuffled and split into training, development, and practice sets, including short- and long-term
coverage, with any intersecting IDs between these sets being removed to maintain data integrity and
prevent data leakage. Finally, a summary of the downsampled data is generated, detailing the number
of rows, date and time of sampling, and statistics per year and month.
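
   The balancing step described above can be sketched as follows. This is a minimal illustration, not
the code used to build the corpus: the row layout (dictionaries with the keys 'id', 'stance', 'year',
and 'month'), the function names, and the fixed random seed are assumptions.

```python
import random
from collections import defaultdict


def downsample_balanced(rows, seed=0):
    """Keep the same number of rows for every (stance, year, month) group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[(row["stance"], row["year"], row["month"])].append(row)
    n_keep = min(len(g) for g in groups.values())  # size of the smallest group
    sampled = []
    for key in sorted(groups):
        sampled.extend(rng.sample(groups[key], n_keep))
    rng.shuffle(sampled)
    return sampled


def remove_overlap(split, other_ids):
    """Drop rows whose id also appears in another split (leakage guard)."""
    return [row for row in split if row["id"] not in other_ids]


# Synthetic, imbalanced input for one month: 5 believer, 2 denier, 3 neutral rows.
counts = {"believer": 5, "denier": 2, "neutral": 3}
rows = [
    {"id": f"{stance}-{i}", "stance": stance, "year": 2010, "month": 1}
    for stance, n in counts.items()
    for i in range(n)
]
balanced = downsample_balanced(rows)  # 2 rows per stance, 6 rows in total
```

   Note that a stance with no tweets at all in some month would not form a group in this sketch, so
such cases would need separate handling.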
   Test set annotation. We annotate our test data using Prolific (footnote 8), which is a high-quality
data collection and annotation platform. The forms that contain the data to annotate are created using
Qualtrics (footnote 9). We run the annotation in several batches, and provide annotation guidelines
stating the task details for the participants to follow. We add several filters, both automatic and
manual, to select the optimal demographic and qualified annotators. Additionally, a qualification task
is enforced: it contains 5 tweets from the training set, which the organisers first annotate, and the
majority annotation is used as the reference. Participants have to correctly answer 4 out of 5
questions to access the actual annotation task. We also provide fields in our form for every annotator
to give their feedback and to point out if any tweet is inappropriate or contains explicit content. We
collect responses from 5 annotators for each tweet, and select the majority annotation from the five
annotations. In some cases, we find equal agreement among the annotators, and for those cases, we run
an extra round of annotation to finalise the agreement. Finally, after the cleanup and majority
annotation process, we manually check the data and divide it into the respective splits.
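
   The two decision rules in this procedure (the 4-out-of-5 qualification threshold and the majority
vote with an extra round on ties) can be sketched as follows. The function names are hypothetical, and
the sketch assumes that every tweet has at least one annotation.

```python
from collections import Counter


def passes_qualification(answers, reference, required=4):
    """True if at least `required` answers match the organisers' reference labels."""
    return sum(a == r for a, r in zip(answers, reference)) >= required


def majority_label(annotations):
    """Return the majority label, or None when the most frequent labels are tied."""
    counts = Counter(annotations).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: the tweet goes to an extra annotation round
    return counts[0][0]


print(majority_label(["believer", "believer", "neutral", "believer", "denier"]))  # believer
print(majority_label(["believer", "believer", "neutral", "neutral", "denier"]))   # None
```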
   The resulting distribution of data is shown in Table 6.

Table 6
Dataset statistics summary of training, practice and testing sets.
                              Dataset                   Time Period                  Size
                                train          January 2009 to December 2011        35739
                           within-practice     January 2010 to December 2010         450
                           short-practice      January 2014 to December 2014         450
                             dev-within        January 2011 to December 2011        1074
                              dev-short        January 2015 to December 2015        1074
                              dev-long         January 2018 to September 2019       1074

   In the Practice phase, participants undertake Pre-Evaluation tasks with datasets from 2010 and
2014, sampled from CC-SD, allowing them to practice within a recent time frame and over a short
duration. These datasets are manually verified. Additionally, human-annotated "within time" and "short
time" practice sets are provided, also sampled from CC-SD, to refine model development before formal
evaluation.
   Subsequently, the Evaluation phase assesses models using datasets from 2011, 2015, and the longer
period of 2018-2019, all sampled from CC-SD. These datasets undergo manual verification and encompass
within-timeframe assessments, short-term predictions, and long-term predictions, offering a
holistic evaluation of model performance across various temporal contexts. By incorporating datasets
covering different years, the evaluation ensures thorough testing and understanding of models’ temporal
persistence and performance.

7
  https://www.kaggle.com/datasets/edqian/twitter-climate-change-sentiment-dataset
8
  https://www.prolific.com/
9
  https://www.qualtrics.com/

3.3. Evaluation
Evaluation metrics for this edition of the task remain consistent with the previous version [3, 25]. All
submissions were assessed using two key metrics: the macro-averaged F1-score on the corresponding
sub-task’s development set and the Relative Performance Drop (RPD), calculated by comparing
performance on "within time" data against results from short- or long-term distant development
sets. Submissions for each sub-task were ranked primarily based on the macro-averaged F1-score.
Additionally, a unified score, the weighted-F1, was computed between the two sub-tasks, encouraging
participants to contribute to both for accurate placement on a collective leaderboard and a deeper
analysis of their system’s performance in various settings.
  Participants were expected to design an experimental architecture to enhance a text classifier’s
temporal performance. Accordingly, the performance of the submissions was evaluated in two ways:

   1. Macro-averaged F1-score: This metric measured the overall F1-score on the testing set for
      the stance classification sub-task. The F1-score combines precision and recall to provide a
      balanced measure of model performance, and macro-averaging takes the unweighted mean of the
      per-class F1-scores. A higher F1-score indicated better performance across all stance classes.

      \[ F_{\text{macro}} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{1} \]

   2. Relative Performance Drop (RPD): This metric quantified the difference in performance
      between the "within-period" data and the short- or long-term distant testing sets. RPD was
      computed as the difference in performance scores between the two sets, relative to the score
      on the "within-period" data. A negative RPD value indicated a drop in performance compared to
      the "within-period" data, while a positive value suggested an improvement.

      \[ RPD = \frac{f^{\text{score}}_{t_j} - f^{\text{score}}_{t_0}}{f^{\text{score}}_{t_0}} \tag{2} \]

      where $t_0$ represents performance when the time gap is 0, and $t_j$ represents performance when
      the time gap is short or long, as introduced in previous work [26].
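
  The two metrics can be sketched in a few lines. This is a minimal illustration rather than the
official scoring script: the function names and the toy labels are hypothetical, and macro-averaging is
implemented here in its standard form, as the unweighted mean of the per-class F1-scores from Eq. (1).

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall, as in Eq. (1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, predicted):
    """Unweighted mean of the per-class F1-scores."""
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(f1(precision, recall))
    return sum(scores) / len(scores)


def rpd(f_within, f_distant):
    """Relative Performance Drop, as in Eq. (2); negative values mean a drop."""
    return (f_distant - f_within) / f_within


# Toy example with the three stance labels.
gold = ["believer", "denier", "neutral", "believer"]
pred = ["believer", "neutral", "neutral", "denier"]
toy_macro = macro_f1(gold, pred)
toy_rpd = rpd(0.60, 0.54)  # a drop of about 10%
```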

  The submissions were ranked primarily based on the macro-averaged F1-score, emphasizing the
overall performance of the stance detection model on the testing sets. The higher the macro-averaged
F1-score, the higher the ranking of the submission.

3.4. Models
In our study, we evaluated several baseline classifiers to assess their performance and temporal per-
sistence when exposed to evolving data. The models we focused on include bert-base-uncased,
roberta-base, and their respective variations with additional continual incremental pretraining from
the climate change corpus.
   To address the challenges posed by evolving data, we implemented continual incremental pretraining
for both bert-base-uncased and roberta-base models. These variations, referred to as ++MLM 2019,
were further pretrained on a climate change corpus that covers data from the initial training year up to
2019 using masked language modeling. This approach aimed to incorporate recent linguistic trends and
contextual information, enhancing the models’ ability to adapt to new and evolving data.
   The dataset is segmented by years, starting from 2006 to various end years (2011, 2013, 2015, 2017,
2019). For each end year, data from all preceding years up to that point is aggregated and preprocessed.
Preprocessing includes filling missing values with the most frequent value in each column, removing
rows with missing values in the ’text’ or ’stance’ columns, and eliminating duplicate entries. Text data is
normalized to lowercase, and entries with fewer than six words are excluded. Post-processing, the data is
merged into a single dataset for each end year, resulting in five datasets representing different temporal
spans. These datasets are subsequently balanced by downsampling to ensure uniform representation
for incremental training.
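
   The cleaning steps and the cumulative yearly spans can be sketched as follows. This is a simplified
illustration under stated assumptions, not the authors' pipeline: rows are assumed to be dictionaries
with the keys 'text', 'stance', and 'year', duplicates are identified by the pair (text, stance), and
the imputation of missing values in other columns and the final balancing step are omitted.

```python
END_YEARS = (2011, 2013, 2015, 2017, 2019)


def preprocess(rows):
    """Drop incomplete rows, lowercase the text, filter short tweets, deduplicate."""
    seen, cleaned = set(), []
    for row in rows:
        text, stance = row.get("text"), row.get("stance")
        if not text or not stance:
            continue  # missing 'text' or 'stance'
        text = text.lower()
        if len(text.split()) < 6:
            continue  # fewer than six words
        if (text, stance) in seen:
            continue  # duplicate entry
        seen.add((text, stance))
        cleaned.append({**row, "text": text})
    return cleaned


def cumulative_spans(rows, start_year=2006, end_years=END_YEARS):
    """One cleaned dataset per end year, covering start_year..end_year."""
    return {
        end: preprocess([r for r in rows if start_year <= r["year"] <= end])
        for end in end_years
    }


# Hypothetical rows for illustration only.
rows = [
    {"text": "Climate change is real and we must act now", "stance": "believer", "year": 2010},
    {"text": "climate change is real and we must act now", "stance": "believer", "year": 2011},
    {"text": "Too short to keep", "stance": "neutral", "year": 2012},
    {"text": "The data on global warming is still being debated", "stance": "denier", "year": 2014},
    {"text": None, "stance": "neutral", "year": 2015},
]
spans = cumulative_spans(rows)
```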
   Using a masked language modeling strategy, the textual data without its label is fed into the models
incrementally in their chronological order, starting with the 2011 sample and ending with the 2019
sample. This approach ensures a balanced and clean dataset, facilitating robust analysis and model
training. Each model was incrementally tested to evaluate its persistence over time, and the best
performance was reported in the results section.

    • bert-base-uncased (Bidirectional Encoder Representations from Transformers) [27] is a foun-
      dational model in NLP that introduced the concept of bidirectional training of transformers
      for language modeling. The bert-base-uncased model is a version of BERT that ignores case
      sensitivity, which helps in learning case-independent features. It also consists of 12 transformer
      layers, 768 hidden units, and 12 attention heads. BERT uses a static masked language modeling
      objective during pretraining, which involves predicting masked words in a sentence based on
      their context.

    • roberta-base (Robustly optimized BERT approach) [28] is a variant of the BERT model designed
      to improve performance by optimizing the pretraining process. It uses dynamic masking, a larger
      batch size, and more data to enhance the training of transformer-based models. The roberta-base
      model consists of 12 transformer layers, 768 hidden units, and 12 attention heads. It is pretrained
      on a diverse range of data to capture rich contextual representations, making it effective for
      various NLP tasks.

    • ++MLM 2019: A masked language modeling strategy used to adapt a language model to new data
      by incrementally pretraining with an unlabeled corpus up to 2019. This method leverages recent
      linguistic trends and contextual updates to improve model adaptation and performance over time.

  This systematic approach allowed us to evaluate and enhance the temporal persistence and robustness
of the baseline models, ensuring they remain effective in the face of evolving language patterns.

3.5. Results
This section presents the results obtained during both the practice and evaluation phases of task 2.

3.6. Practice phase
In this subsection, we present the results of the practice phase of task 2. This practice dataset was
provided to participants to allow them to practice and initialize their text classifiers. Since we did
not receive any submissions, we compared several baseline classifiers to understand the initial
performance on our practice sets. The models evaluated include roberta-base, bert-base-uncased, and
their respective variations with additional continual incremental pretraining on the climate change
corpus from the initial year of training up to 2019 using masked language modeling. The results are
summarized in Table 7.
   As can be seen from Table 7, the ++MLM 2019 variations of both roberta-base and bert-base-uncased
achieve higher f-Within and f-Avg scores than their original counterparts. This suggests that
additional continual pretraining based on recent data, incrementally over time, contributes to better
overall performance, although it does not reduce the RPD reported in Table 7. Notably, the
bert-base-uncased models show the smallest RPD in magnitude (-7.19% for the original model and -7.59%
for its ++MLM 2019 variant), indicating greater resilience to temporal changes than the roberta-base
models.
Table 7
Performance of baseline models on practice data. The columns represent: f-Within - performance within the
same time period, f-Short - performance over short temporal gaps, f-Avg - average performance across all
temporal gaps, and RPD - relative performance drop when applied to temporally distant data.
                        Model                           f-Within     f-Short     f-Avg     RPD
                        roberta-base                    0.586        0.523       0.555     -10.80%
                        roberta-base ++MLM 2019         0.612        0.525       0.569     -14.36%
                        bert-base-uncased               0.577        0.536       0.557     -7.19%
                        bert-base-uncased ++MLM 2019    0.586        0.542       0.564     -7.59%
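
The RPD column of Table 7 can be approximately recomputed from the f-scores. The short Python sketch
below assumes that RPD is the change of the gap-period score relative to f-Within, expressed in percent;
the function name relative_performance_drop is hypothetical. Because the f-scores in the table are
rounded to three decimals, the recomputed values differ from the reported ones by a small margin.

```python
def relative_performance_drop(f_within, f_gap):
    """RPD in percent: change of the gap-period score relative to f-Within.

    Assumed definition, used here only to cross-check the table.
    """
    return 100.0 * (f_gap - f_within) / f_within


# (model, f-Within, f-Short, reported RPD in %) copied from Table 7.
table7 = [
    ("roberta-base", 0.586, 0.523, -10.80),
    ("roberta-base ++MLM 2019", 0.612, 0.525, -14.36),
    ("bert-base-uncased", 0.577, 0.536, -7.19),
    ("bert-base-uncased ++MLM 2019", 0.586, 0.542, -7.59),
]

for model, f_within, f_short, reported in table7:
    recomputed = relative_performance_drop(f_within, f_short)
    print(f"{model:30s} recomputed RPD = {recomputed:6.2f}%  (reported {reported:.2f}%)")
```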


3.7. Evaluation phase
In this subsection, we present the results of the evaluation phase of task 2. Using the development
dataset provided to participants, we evaluated the final performance of the text classifier models. In
the absence of submissions, we compared several baseline classifiers to understand the performance on our
development sets. The models evaluated include roberta-base, bert-base-uncased, and their respective
variations with additional continual incremental pretraining on the climate change corpus up to 2019
using masked language modeling. The results are summarized in Table 8.

Table 8
Performance of baseline models on development sets. The columns represent: f-Within - performance within
the same time period, f-Short - performance over short temporal gaps, f-Long - performance over long temporal
gaps, f-Avg - average performance across all temporal gaps, RPD-Short - relative performance drop over short
temporal gaps, RPD-Long - relative performance drop over long temporal gaps, and RPD-Avg - average relative
performance drop.
        Model                          f-Within   f-Short   f-Long   f-Avg   RPD-Short   RPD-Long   RPD-Avg
        roberta-base                   0.626      0.558     0.529    0.571   -10.81%     -15.46%    -26.26%
        roberta-base ++MLM 2019        0.623      0.594     0.552    0.590   -4.74%      -11.46%    -16.20%
        bert-base-uncased              0.614      0.569     0.536    0.573   -7.26%      -12.64%    -19.89%
        bert-base-uncased ++MLM 2019   0.600      0.571     0.540    0.570   -4.94%      -10.01%    -14.94%


   As shown in Table 8, the ++MLM 2019 variations of both roberta-base and bert-base-uncased achieve
higher f-Short and f-Long scores, as well as smaller relative performance drops, than their standard
counterparts. The ++MLM 2019 variation of roberta-base achieved an f-Avg score of 0.590, an improvement
over the original model’s 0.571. It also showed a considerably smaller RPD-Short (-4.74% versus -10.81%)
and RPD-Long (-11.46% versus -15.46%), indicating better resilience to temporal changes over both short
and long gaps. Similarly, the ++MLM 2019 variation of bert-base-uncased achieved an f-Avg score of 0.570,
slightly lower than the original model’s 0.573. However, it exhibited a smaller RPD-Long (-10.01% versus
-12.64%) and RPD-Avg (-14.94% versus -19.89%), demonstrating improved performance persistence over time.
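
   The relations between the columns of Table 8 can be checked with a few lines of Python. The sketch
below is an observation on the reported numbers, not a definition taken from the text: the reported f-Avg
agrees with the mean of f-Within, f-Short and f-Long, and the reported RPD-Avg agrees, up to rounding,
with the sum of RPD-Short and RPD-Long rather than with their arithmetic mean. The helper name
column_relations is hypothetical.

```python
# Rows copied from Table 8: f-Within, f-Short, f-Long, f-Avg,
# RPD-Short, RPD-Long, RPD-Avg (RPD values in percent).
table8 = {
    "roberta-base": (0.626, 0.558, 0.529, 0.571, -10.81, -15.46, -26.26),
    "roberta-base ++MLM 2019": (0.623, 0.594, 0.552, 0.590, -4.74, -11.46, -16.20),
    "bert-base-uncased": (0.614, 0.569, 0.536, 0.573, -7.26, -12.64, -19.89),
    "bert-base-uncased ++MLM 2019": (0.600, 0.571, 0.540, 0.570, -4.94, -10.01, -14.94),
}


def column_relations(row):
    """Return (mean of the three f-scores, RPD-Short + RPD-Long) for one row."""
    f_within, f_short, f_long, _f_avg, rpd_short, rpd_long, _rpd_avg = row
    mean_f = (f_within + f_short + f_long) / 3
    rpd_sum = rpd_short + rpd_long
    return mean_f, rpd_sum


for model, row in table8.items():
    mean_f, rpd_sum = column_relations(row)
    print(f"{model:30s} mean f = {mean_f:.4f} (f-Avg {row[3]:.3f}), "
          f"RPD sum = {rpd_sum:.2f}% (RPD-Avg {row[6]:.2f}%)")
```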
   These results reinforce the value of continual incremental pretraining with recent data to maintain
and improve model performance in dynamic environments. The ++MLM 2019 variations showed higher f-Short
and f-Long scores and consistently reduced performance degradation over time, supporting the
effectiveness of this approach in enhancing temporal persistence.

3.8. Discussion and conclusion
This section discusses the results of our study on temporally adaptive classification methods, highlighting
the significance of incorporating temporal information into text classification models to mitigate
the performance drops over time that result from relying on an outdated language model. The results
show that classifiers trained on older data exhibit substantial performance drops when applied to newer
data. This is evident from the relative performance drops (RPD) reported, where the ++MLM 2019 variations
showed a marked improvement in mitigating this drop in the evaluation phase (Table 8).
   Previous work by Alkhalifa et al. [26] introduced the Incremental Temporal Alignment (ITA) method as
a superior approach for enhancing the temporal persistence of static word embeddings. This method aligns
closely with the continual incremental pretraining approach evaluated in our results, where the ++MLM
2019 variations of both roberta-base and bert-base-uncased demonstrated improved f-Within and f-Avg
scores and lower RPD values. The ITA method’s emphasis on leveraging incremental updates to word
embeddings aligns with the improvements seen in the ++MLM 2019 models, showcasing their resilience
to evolving data and enhancing their persistence as text classifiers as the context is updated over time.
   The results reinforce several best practices for designing temporally robust and persistent text classi-
fiers. Methods relying on incremental updates generally outperform static embeddings, as corroborated
by the superior performance of the ++MLM 2019 models. Additionally, it is crucial to select robust
baseline models and incrementally update them to accommodate evolving language patterns over time.
   The practical implications of our findings are significant for real-world NLP applications. In dynamic
environments such as stance posts on social media, language evolves rapidly, and temporal adaptation
through an incremental pretraining approach can substantially enhance the longevity and persistence
of text classifiers. These results provide empirical evidence supporting the implementation of temporally
adaptive classification methods in real-world scenarios.


Acknowledgments
This work is supported by the ANR Kodicare bi-lateral project, grant ANR-19-CE23-0029 of the French
Agence Nationale de la Recherche, and by the Austrian Science Fund (FWF, grant I4471-N). This work
is also supported by a UKRI/EPSRC Turing AI Fellowship to Maria Liakata (grant no. EP/V030302/1).
This work has used services provided by the LINDAT/CLARIAH-CZ Research Infrastructure
(https://lindat.cz), supported by the Ministry of Education, Youth and Sports of the Czech Republic
(Project No. LM2023062).


References
 [1] R. Gangi Reddy, B. Iyer, M. A. Sultan, R. Zhang, A. Sil, V. Castelli, R. Florian, S. Roukos, Synthetic
     target domain supervision for open retrieval qa, in: Proceedings of the 44th International ACM
     SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’21, Association
     for Computing Machinery, New York, NY, USA, 2021, p. 1793–1797. URL: https://doi.org/10.1145/
     3404835.3463085. doi:10.1145/3404835.3463085.
 [2] J. Lovón-Melgarejo, L. Soulier, K. Pinel-Sauvagnat, L. Tamine, Studying catastrophic forgetting
     in neural ranking models, Springer-Verlag, Berlin, Heidelberg, 2021, p. 375–390. URL: https:
     //doi.org/10.1007/978-3-030-72113-8_25. doi:10.1007/978-3-030-72113-8_25.
 [3] R. Alkhalifa, I. Bilal, H. Borkakoty, J. Camacho-Collados, R. Deveaud, A. El-Ebshihy, L. Espinosa-
     Anke, G. Gonzalez-Saez, P. Galuščáková, L. Goeuriot, E. Kochkina, M. Liakata, D. Loureiro, H. T.
     Madabushi, P. Mulhem, F. Piroi, M. Popel, C. Servan, A. Zubiaga, Overview of the clef-2023 longeval
     lab on longitudinal evaluation of model performance, in: Experimental IR Meets Multilinguality,
     Multimodality, and Interaction. Proceedings of the Fourteenth International Conference of the
     CLEF Association (CLEF 2023), Lecture Notes in Computer Science (LNCS), Springer, Thessaloniki,
     Greece, 2023.
 [4] R. Alkhalifa, H. Borkakoty, R. Deveaud, A. El-Ebshihy, L. Espinosa-Anke, T. Fink, P. Galuščáková,
     G. Gonzalez-Saez, L. Goeuriot, D. Iommi, M. Liakata, H. T. Madabushi, P. Medina-Alias, P. Mul-
     hem, F. Piroi, M. Popel, A. Zubiaga, Overview of the CLEF 2024 LongEval Lab on Longitudinal
     Evaluation of Model Performance, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier,
     G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR
     Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International
     Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science (LNCS),
     Springer, Heidelberg, Germany, 2024.
 [5] P. Galuščáková, R. Deveaud, G. Gonzalez-Saez, P. Mulhem, L. Goeuriot, F. Piroi, M. Popel, Longeval-
     retrieval: French-english dynamic test collection for continuous web search evaluation, 2023.
     arXiv:2303.03229.
 [6] K. Song, X. Tan, T. Qin, J. Lu, T.-Y. Liu, Mpnet: Masked and permuted pre-training for language
     understanding, Advances in neural information processing systems 33 (2020) 16857–16867.
 [7] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal,
     E. Hambro, F. Azhar, et al., Llama: Open and efficient foundation language models, arXiv preprint
     arXiv:2302.13971 (2023).
 [8] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas,
     E. B. Hanna, F. Bressand, et al., Mixtral of experts, arXiv preprint arXiv:2401.04088 (2024).
 [9] A. Basaglia, A. Stocco, M. Popović, N. Ferro, Seupd@clef: Team dam on reranking using sentence
     embedders, in: [29], 2024.
[10] L. Cazzador, F. L. D. Faveri, F. Franceschini, L. Pamio, S. Piron, N. Ferro, Seupd@clef: Team mouse
     on enhancing search engines effectiveness with large language models, in: [29], 2024.
[11] F. Galli, M. Rigobello, M. Schibuola, R. Zuech, N. Ferro, Seupd@clef: Team iris on temporal
     evolution of query expansion and rank fusion techniques applied to cross-encoder re-rankers, in:
     [29], 2024.
[12] J. Keller, T. Breuer, P. Schaer, Leveraging prior relevance signals in web search, in: [29], 2024.
[13] S. Yoon, J. Kim, S. won Hwang, Analyzing the effectiveness of listwise reranking with positional
     invariance on temporal generalizability, in: [29], 2024.
[14] A. Kimia, A. Akan, F. Arwa, N. Ferro, Seupd@clef: Team kalu on improving search engine
     performance with query expansion and re-ranking approach, in: [29], 2024.
[15] D. Alexander, M. Fröbe, G. Hendriksen, F. Schlatt, M. Hagen, D. Hiemstra, M. Potthast, A. P.
     de Vries, Team openwebsearch at clef 2024: Longeval, in: [29], 2024.
[16] M. Gründel, M. Weber, J. Franke, J. H. Reimer, Team galápagos tortoise at longeval 2024: Neural
     re-ranking and rank fusion for temporal stability, in: [29], 2024.
[17] F. Giachelle, O. Irrera, G. Silvello, Doctag: A customizable annotation tool for ground truth
     creation, in: Advances in Information Retrieval: 44th European Conference on IR Research, ECIR
     2022, Stavanger, Norway, April 10–14, 2022, Proceedings, Part II, volume 13186 of Lecture Notes in
     Computer Science, Springer, 2022, pp. 288–293.
[18] D. Harman, TREC-Style Evaluations, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 97–
     115. URL: https://doi.org/10.1007/978-3-642-36415-0_7. doi:10.1007/978-3-642-36415-0_7.
[19] D. Küçük, F. Can, Stance detection: A survey, ACM Comput. Surv. 53 (2020). URL: https://doi.org/
     10.1145/3369026. doi:10.1145/3369026.
[20] S. M. Mohammad, P. Sobhani, S. Kiritchenko, Stance and sentiment in Tweets, ACM Transac-
     tions on Internet Technology 17 (2017). URL: http://alt.qcri.org/semeval2016/task6/. doi:10.1145/
     3003433. arXiv:1605.01655.
[21] R. Alkhalifa, A. Zubiaga, Capturing stance dynamics in social media: open challenges and research
     directions, International Journal of Digital Humanities (2022) 1–21.
[22] R. Alkhalifa, I. Bilal, H. Borkakoty, J. Camacho-Collados, R. Deveaud, A. El-Ebshihy, L. Espinosa-
     Anke, G. Gonzalez-Saez, P. Galuščáková, L. Goeuriot, E. Kochkina, M. Liakata, D. Loureiro, H. Tay-
     yar Madabushi, P. Mulhem, F. Piroi, M. Popel, C. Servan, A. Zubiaga, Longeval: Longitudinal
     evaluation of model performance at clef 2023, in: J. Kamps, L. Goeuriot, F. Crestani, M. Maistro,
     H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances in Information Retrieval,
     Springer Nature Switzerland, Cham, 2023.
[23] R. Alkhalifa, T. Yoong, E. Kochkina, A. Zubiaga, M. Liakata, QMUL-SDS at checkthat! 2020: Deter-
     mining COVID-19 tweet check-worthiness using an enhanced CT-BERT with numeric expressions,
     CoRR abs/2008.13160 (2020). URL: https://arxiv.org/abs/2008.13160. arXiv:2008.13160.
[24] D. Effrosynidis, A. I. Karasakalidis, G. Sylaios, A. Arampatzis, The climate change twitter dataset,
     Expert Systems with Applications 204 (2022) 117541. URL: https://www.sciencedirect.com/science/
     article/pii/S0957417422008624. doi:10.1016/j.eswa.2022.117541.
[25] R. Alkhalifa, I. M. Bilal, H. Borkakoty, R. Deveaud, A. El-Ebshihy, L. Espinosa-Anke,
     G. Gonzalez-Saez, P. Galuščáková, L. Goeuriot, E. Kochkina, M. Liakata, D. Loureiro, P. Mulhem,
     F. Piroi, M. Popel, C. Servan, H. T. Madabushi, A. Zubiaga, Extended overview
     of the clef-2023 longeval lab on longitudinal evaluation of model performance, 2023. URL:
     https://api.semanticscholar.org/CorpusID:259953335.
[26] R. Alkhalifa, E. Kochkina, A. Zubiaga, Opinions are made to be changed: Temporally adaptive
     stance classification, in: Proceedings of the 2021 Workshop on Open Challenges in Online Social
     Networks, 2021, pp. 27–32.
[27] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers
     for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter
     of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long
     and Short Papers), 2019, pp. 4171–4186.
[28] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,
     Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[29] G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Proceedings of Working Notes of
     CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, Aachen,
     2024.
A. Runs submitted to the IR Task

Table 9
The original names of the runs submitted to the IR task are shown in the second column, while the run IDs
assigned to the systems and used in the paper are shown in the first column.
                   Run Id            Submitted System

                   abyss_run_1       ABYSS_BM25-French-Stop50_40FR_10EN-SnowStem-Dict-Fuzzy-Phrase-Start-Synonyms-RR
                   abyss_run_2       ABYSS_BM25-French-Stop50_40FR_10EN-SnowStem-Fuzzy-Phrase-Start
                   abyss_run_3       ABYSS_BM25-French-Stop50_40FR_10EN-SnowStem-Fuzzy-Phrase-Start-RR
                   cir_run_1         CIR_BM25
                   cir_run_2         CIR_BM25+monoT5
                   cir_run_3         CIR_BM25+qrel_boost
                   cir_run_4         CIR_BM25+RF
                   cir_run_5         CIR_BM25+time_boost
                   galapagos_run_1   galapagos-tortoise-bm25-bo1-pl2-monot5-kmax-avg-k-4
                   galapagos_run_2   galapagos-tortoise-bm25-bo1-pl2-monot5-max
                   galapagos_run_3   galapagos-tortoise-bm25-bo1-pl2-monot5-mean
                   galapagos_run_4   galapagos-tortoise-rank-zephyr
                   galapagos_run_5   galapagos-tortoise-wsum
                   kalu_run_1        KALU_MISTRAL_FRENCH
                   kalu_run_2        KALU_RERANK_HARMONIC_MISTRAL_FRENCH
                   kalu_run_3        KALU_RERANK_HARMONIC_MISTRAL_FRENCH_SHOULD
                   kalu_run_4        KALU_RERANK_SIMPLE_FRENCH_LLAMA
                   kalu_run_5        KALU_RERANK_SIMPLE_MISTRAL_FRENCH
                   ows_run_1         ows_bm25_bo1_keyqueries
                   ows_run_2         ows_bm25_reverted_index
                   ows_run_3         ows_ltr_all
                   ows_run_4         ows_ltr_wows_all_rerank
                   ows_run_5         ows_ltr_wows_base_rerank
                   ows_run_6         ows_ltr_wows_rerank_and_keyquery
                   ows_run_7         ows_ltr_wows_rerank_and_reverted_index
                   quokkas_run_1     Quokkas_french-letter-lightstem
                   quokkas_run_2     Quokkas_french-standard-lightstem
                   dam_run_1         seupd2324-dam_EN-Stop-SnowBall-Poss-Prox(50)
                   dam_run_2         seupd2324-dam_EN-Stop-SnowBall-Poss-Prox(50)-Reranking(200)
                   dam_run_3         seupd2324-dam_FR-Stop-FrenchLight-Elision-ICU-Prox(50)
                   dam_run_4         seupd2324-dam_FR-Stop-FrenchLight-Elision-ICU-Prox(50)-Reranking(150)
                   dam_run_5         seupd2324-dam_FR-Stop-FrenchLight-Elision-ICU-Shingles-Prox(50)-Reranking(150)
                   iris_run_1        seupd2324-iris_FR_GFF@12_w0.162_MMARCO@1000_ADD_w5
                   iris_run_2        seupd2324-iris_FR_GFF@12_w0.162_MMARCO@1000_MAXMIN_ADD_w5
                   iris_run_3        seupd2324-iris_FR_MMARCO@1000_ADD_w5
                   iris_run_4        seupd2324-iris_FR_url_w1.4_GFF@12_w0.162_MMARCO@1000_ADD_w5
                   iris_run_5        seupd2324-iris-FR_Q2K@1_w0.16_MMARCO@1000_MAXMIN_ADD_w5
                   lfzzo_run_1       seupd2324-lfzzo-englishSystem1
                   lfzzo_run_2       seupd2324-lfzzo-englishSystem2
                   lfzzo_run_3       seupd2324-lfzzo-englishSystem3
                   lfzzo_run_4       seupd2324-lfzzo-englishSystem4
                   lfzzo_run_5       seupd2324-lfzzo-englishSystem5
                   lfzzo_run_6       seupd2324-lfzzo-frenchSystem1
                   lfzzo_run_7       seupd2324-lfzzo-frenchSystem2
                   lfzzo_run_8       seupd2324-lfzzo-frenchSystem3
                   lfzzo_run_9       seupd2324-lfzzo-frenchSystem4
                   lfzzo_run_10      seupd2324-lfzzo-frenchSystem5
                   mouse_run_1       seupd2324-mouse_English_Porter_Standard_NoStop_Mixtral-8x7b_NoRerank
                   mouse_run_2       seupd2324-mouse_English_Porter_Standard_stopwords-en_LLama3-70b_NoRerank
                   mouse_run_3       seupd2324-mouse_English_Porter_Standard_top125_LLama3-70b_Cohere-100-w06
                   mouse_run_4       seupd2324-mouse_English_Porter_Standard_top125_LLama3-70b_Pygaggle-Luyu-20-w06
                   mouse_run_5       seupd2324-mouse_English_Porter_Standard_top125_Mixtral-8x7b_Pygaggle-Luyu-20-w06
                   mouse_run_6       seupd2324-mouse_French_FrenchLight_Standard_NoStop_Mixtral-8x7b_NoRerank
                   mouse_run_7       seupd2324-mouse_French_FrenchLight_Standard_stopwords-fr_LLama3-70b_NoRerank
                   mouse_run_8       seupd2324-mouse_French_FrenchLight_Standard_top125_LLama3-70b_Cohere-100-w06
                   mouse_run_9       seupd2324-mouse_French_FrenchLight_Standard_top125_LLama3-70b_Pygaggle-Luyu-20-w06
                   mouse_run_10      seupd2324-mouse_French_FrenchLight_Standard_top125_Mixtral-8x7b_Pygaggle-Luyu-20-w06
                   seekx_run_1       seupd2324-seekx_LetLightFR
                   seekx_run_2       seupd2324-seekx_LetLightStopFR
                   seekx_run_3       seupd2324-seekx_LetLightStopSynFR
                   seekx_run_4       seupd2324-seekx_StanMinEN
                   seekx_run_5       seupd2324-seekx_StanMinSynEN
                   snu_run_1         SNU_LDI_listt5
                   snu_run_2         SNU_LDI_monot5
                   wonder_run_1      WONDER_BASELINE
                   wonder_run_2      WONDER_ENGLISH
                   wonder_run_3      WONDER_ENGLISH_FRENCH
                   wonder_run_4      WONDER_FRENCH
                   wonder_run_5      WONDER_TWOPHASE
                   xplore_run_1      XPLORE_French-BM25-FrenchLight-Stop
                   xplore_run_2      XPLORE_French-BM25-FrenchLight-Stop-SynonymMapper
                   xplore_run_3      XPLORE_French-BM25Default-FrenchLight-Stop
                   xplore_run_4      XPLORE_French-LMDirichlet-FrenchLight-Stop