<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SEUPD@CLEF: Team HIBALL on Incremental Information Retrieval System with RRF and BERT</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Ceccato</string-name>
          <email>andrea.ceccato.6@studenti.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Fabbian</string-name>
          <email>luca.fabbian.1@studenti.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bor-Woei Huang</string-name>
          <email>borwoei.huang@studenti.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Irfan Ullah Khan</string-name>
          <email>irfanullah.khan@studenti.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harjot Singh</string-name>
          <email>harjot.singh@studenti.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <email>ferro@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>This report showcases the work of the HIBALL team from the University of Padua on Task 1 LongEvalRetrieval of CLEF 2023 [1] [2]. Our goal was to create a general-purpose information retrieval system, using the CLEF document corpus and judgments as our reference. We explored various approaches, including algorithmic techniques and machine learning methods, and compared their results. Our best-performing system, a fusion between classical and AI techniques, shows promising outcomes and may serve as a foundation for future developments.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The paper is organized as follows: Section 2 provides an overview of related work; Section 3 describes our approach; Section 4 explains our experimental setup; Section 5 discusses our main findings; finally, Section 6 draws some conclusions and outlines future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Information retrieval has been around for a long time, with the first experiments predating computers [4]. Despite decades of effort, the field still has many open challenges. While some well-established techniques exist [5], they fail to capture the true semantic meaning of the documents involved, and thus new research is published every year [6].</p>
      <p>Traditional approaches are primarily based on text matching, using explicit rules for the indexing, querying, and ranking algorithms. Newer approaches rely on AI for semantic extraction, employing advanced techniques such as Natural Language Processing (NLP).</p>
      <p>In this paper, we tried to get the best of both worlds by combining the two approaches into a hybrid one, and we tested the results on two complete document collections and real queries submitted by users of the Qwant search engine. For query selection and corpora retrieval, we leverage the work done by CLEF [2].</p>
      <p>As a starting point for our code, we used the frrncl/hello-tipster example provided by Professor Nicola Ferro. The code showcases basic IR features, from parsing documents, to indexing the parsed documents, to searching; everything is performed with Lucene as the underlying engine. For the re-ranker, we were inspired by Jiazhi Guo's notebook "Rescore BM25 with BERT" [7]. For the AI experiments regarding sentence similarity models, we took inspiration from blog tutorials such as [8].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Our system is an incremental one made of 3 parts, where each part becomes the grounding of the following one:
1. BASELINE. First, we performed basic parsing, indexing, and searching using Lucene, a Java library for information retrieval. We tried many runs with different setups.
2. RRF (Reciprocal Rank Fusion). Using Python, we merged two standard runs produced by the Lucene pipeline, one with BM25Similarity and one with LMDirichletSimilarity. In this way, we obtained the best recall so far.
3. BERT, re-ranking with machine learning. Finally, we took the results obtained previously, and we ran a second-phase re-ranker using Python notebooks.</p>
      <p>In parallel, we have done other AI experiments with pre-trained models. The only relevant results, which we submitted to CLEF as AI-FIXED and AI-MERGED, were given by distilbert-base-nli-stsb-mean-tokens, a sentence similarity model.</p>
      <sec id="sec-3-1">
        <title>3.1. BASELINE</title>
        <p>Our first component, "BASELINE", is where we performed most of the retrieval heavy lifting. This system, written in Java, employs the Lucene API under the hood. Here we indexed the entire dataset and produced our first-phase ranked list. The system parses the documents with our LongEvalDocumentJsonParser and produces ParsedDocuments as a result. Then the DirectoryIndexer takes those ParsedDocuments, produces an inverted index, and saves it on disk. These steps are called Parsing and Indexing. The last step is done by the Searcher, which loads the index, parses the queries, and finally produces the run. Figure 1 shows these steps.</p>
        <p>Parsing: the first step of our Information Retrieval pipeline is parsing the documents provided by the organizers at CLEF. The CLEF corpus is distributed in both the TREC and the JSON formats. We chose the latter because it is more popular and, thus, both Java and Python have libraries to parse it (more details on the datasets in Section 4.1). We wrote our parser, LongEvalDocumentJsonParser, which uses the JSON parser provided by "com.fasterxml.jackson.core".</p>
        <p>LongEvalDocumentJsonParser converts the JSON documents into ParsedDocuments. These are ready to be used in the subsequent steps of the IR chain. We kept the same IDs provided in the JSON documents by CLEF.</p>
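        <p>As an illustration, a minimal Python sketch of this parsing step (our actual parser is written in Java with Jackson; the file name and exact JSON layout are assumptions here, based on the id and contents fields described in Section 4.1):</p>
        <p>import json

# minimal sketch of the parsing step; file name and layout are illustrative:
# we assume a JSON array of objects with "id" and "contents" fields
with open("collection.json", encoding="utf-8") as f:
    raw_documents = json.load(f)

# keep the CLEF-provided ids, as our LongEvalDocumentJsonParser does
parsed_documents = [(doc["id"], doc["contents"]) for doc in raw_documents]</p>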
        <p>Meanwhile, we also wrote the LongEvalTopicsTSVReader parser for the queries, which were
provided in the TSV (tab-separated-values) format.</p>
        <p>Indexing: the goal of indexing is to store the document collection in a data structure optimized for lookups, so that we can search inside documents as fast as possible. This is done by creating a structure called an inverted index, which maps each term to the documents containing that term. We used the class DirectoryIndexer provided by Professor Ferro as a template, providing as arguments:
• an Analyzer to build TokenStreams from a document;
• a Similarity function for scoring.</p>
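        <p>To make the structure concrete, here is a toy Python sketch of an inverted index (the sample documents are illustrative; in our pipeline Lucene builds and stores this structure for us):</p>
        <p>from collections import defaultdict

# toy inverted index: each (analyzed) term maps to the ids of the
# documents containing it
docs = {"d1": "information retrieval at padua", "d2": "neural retrieval"}

inverted_index = defaultdict(list)
for doc_id, text in docs.items():
    for term in set(text.lower().split()):   # crude stand-in for the Analyzer
        inverted_index[term].append(doc_id)

print(inverted_index["retrieval"])   # ['d1', 'd2']: one dictionary access per term</p>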
        <p>Searching: the Searcher takes as inputs the indexed documents and the topics, and returns a run file containing the document ranking for each query/topic provided. The Searcher also requires an Analyzer and a Similarity function when creating a run; within a pipeline, the Searcher and the Indexer must use the same Analyzer and Similarity function.</p>
        <sec id="sec-3-1-0">
          <title>3.1.1. Analyzer</title>
          <p>The analyzer is the core part of every IR pipeline and, as mentioned previously, is required by both the Indexer and the Searcher. Its goal is to build TokenStreams from a ParsedDocument. In order to reduce boilerplate code, we created our own Analyzer by extending org.apache.lucene.analysis.Analyzer, adding the Tokenizer, Stop List, and Stemmer as parameters.</p>
          <p>• A Tokenizer is the first piece in the pipeline chain. It creates a TokenStream from a Reader. We tested StandardTokenizer, WhitespaceTokenizer, and LetterTokenizer; after each tokenizer we add a LowerCaseFilter to increase the number of tokens matched at search time.
• The Stop List Filter aims to get rid of words with low informative value, like articles, prepositions, pronouns, and conjunctions. Moreover, this helps to reduce the size of the index (∼10GB). We tested the Lucene, Lingpipe, Snowball, Okapi, Glasgow, Atire, Terrier, Smart, Zettair, and Indri stop lists.
• Another way to increase matching is through Stemming. A Stemmer Filter removes prefixes and suffixes, going back to the "root" form of the word. We tested the EnglishMinimal, K, Lovins, and Porter stemmers.</p>
        </sec>
        <sec id="sec-3-1-1">
          <title>3.1.2. Extra: URL matching</title>
          <p>While looking at the data, we realized some queries are just URLs. This may happen because
distracted users mistake the search engine bar for the address bar. In these cases, it’s clear the
search engine should take the user to the exact website they typed.</p>
          <p>This made us think about introducing some special rules just for URL handling. We discovered, however, that most of these URLs are not in our dataset: even when we know the website we should point to, we cannot return it, because we do not have a corresponding document ID. Thus, we decided to discard those results. We collected our experiments on the topic under /code/src/main/ainotebooks/extra-urlmatcher.ipynb.</p>
          <p>We dismissed these experiments only due to the lack of corresponding data. This direction could still lead to some improvements if investigated in the future.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. RRF (Reciprocal Rank Fusion)</title>
        <p>After testing different combinations of Stemmers and Stop Lists, we focused on increasing the recall. In the next phase (see Section 3.3) we perform a re-ranking, so at this stage we only care about finding the most relevant documents and do not aim at a higher precision. To increase the recall we tried two Similarities:
• Okapi BM25 (BM25), which scores a document D for a query Q as
score(D, Q) = \sum_{i=1}^{n} IDF(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{avgdl})} \quad (1)
• LMDirichletSimilarity (Bayesian smoothing using Dirichlet priors), which smooths the document language model with the collection model p(w|C):
p(w|d) = \frac{f(w, d) + \mu \cdot p(w|C)}{|d| + \mu} \quad (2)</p>
        <p>We also tried their combinations, as shown in Table 1, with different rank fusion algorithms:
• Reciprocal Rank Fusion (RRF) [9]:
RRFscore(d \in D) = \sum_{r \in R} \frac{1}{k + r(d)} \quad (3)
• CombSUM:
CombSUMscore(d \in D) = \sum_{r \in R} s_r(d) \quad (4)</p>
        <p>CombSUM is provided by Lucene as the default implementation of MultiSimilarity, but we found no implementation of RRF, so we wrote a Python script and used the fusion function from the trectools library.</p>
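        <p>A minimal sketch of this fusion step, assuming the reciprocal_rank_fusion helper offered by the trectools package (the run file names are illustrative):</p>
        <p>from trectools import TrecRun, fusion

# the two runs produced by the Lucene pipeline (file names are illustrative)
trec_runs = [TrecRun("run_bm25.txt"), TrecRun("run_lmdirichlet.txt")]

# reciprocal rank fusion with the usual constant k = 60
fused_run = fusion.reciprocal_rank_fusion(trec_runs, k=60)</p>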
        <p>where trec_runs is a list of TrecRun objects containing the run files.</p>
      </sec>
      <sec id="sec-3-2a">
        <title>3.3. BERT</title>
        <p>After obtaining as many relevant documents as possible, our goal with the re-ranker was to increase MAP, P@10, and nDCG (see Section 4.2 for the evaluation measures). As our second-phase re-ranker, we chose Bidirectional Encoder Representations from Transformers (BERT) [10], a transformer-based machine learning model.</p>
        <p>BERT comes in two model sizes: the small one, BERT-base, which has around 110 million parameters, and the large one, BERT-large, with 340 million parameters [11].</p>
        <p>BERT-base-uncased model: given the limited resources, we decided to fine-tune the smaller version of the BERT model for our purpose of re-ranking. The initial model can be downloaded from HuggingFace. As input, the model requires a single vector of tokens (input_ids), so for each query we had to concatenate the query itself and the documents.</p>
        <p>The model is able to distinguish the query from the document thanks to token_type_ids.
In the following code snippet, we create an input sequence for BERT where we put 0 for query
tokens and 1 for document tokens.</p>
        <p># make input sequences for BERT
input_ids = query_token_ids + doc_token_ids
token_type_ids = [0 for token_id in query_token_ids]
token_type_ids.extend(1 for token_id in doc_token_ids)
if len(input_ids) &gt; max_input_length:  # truncation to the model's window
    input_ids = input_ids[:max_input_length]
    token_type_ids = token_type_ids[:max_input_length]
attention_mask = [1 for token_id in input_ids]</p>
        <p>We can see from the code above that input sequences exceeding 512 tokens, the input window size of the BERT-base model, are truncated.</p>
        <p>As a final result, the BERT re-ranker estimates a relevance score for a candidate document with respect to a query.</p>
        <p>score = model(input_ids, attention_mask, token_type_ids)[0]</p>
        <p>In detail, the steps performed to get this part done are the following:
• BERT Fine-Tuning</p>
        <p>The first part was to fine-tune the BERT-base-uncased model with our training data.</p>
        <p>Data Pre-processing: for each query, we consider as a training instance a relevant document put together with three other non-relevant documents: one "positive" sample with three "negative" samples.</p>
        <p>The "positive" samples are taken from the CLEF assessments (qrel) file and the
"negative" samples are randomly chosen from the set obtained by removing all the
relevant documents from the run file created in step 3.2.</p>
        <p>To simplify the model (and ultimately consume less computational power), we did not consider the difference between documents graded as "highly relevant" and "relevant". A sketch of this data pre-processing follows.</p>
        <p>We split our training data into training and validation sets with an 80/20 split. We used the Adam optimizer with an initial learning rate of 3 × 10^-5. Due to the time constraints, we trained the network with a very low early-stopping patience, set to 2; the training stopped at the 4th iteration/epoch with a best loss on the validation set of 0.58664.</p>
        <p>• Hyperparameter tuning</p>
        <p>After the model training, we had to tune one more hyperparameter: best_bert_weight, the weight given to the BERT scores in the ranking. For this, we grid-searched the bert_weight to give to the BERT scores and obtained best_bert_weight = 0.012 with a best MAP of 0.2298, as sketched below.</p>
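        <p>A hypothetical sketch of this grid search (rrf60_scores and bert_scores map document ids to scores for one query; mean_average_precision stands for an evaluation helper not shown here):</p>
        <p># grid search over candidate weights, keeping the one with the best MAP
best_map, best_bert_weight = 0.0, None
for bert_weight in [i / 1000 for i in range(1, 101)]:
    weighted = {doc: rrf60_scores[doc] + bert_weight * bert_scores[doc]
                for doc in rrf60_scores}
    current_map = mean_average_precision(weighted)  # assumed evaluation helper
    if current_map &gt; best_map:
        best_map, best_bert_weight = current_map, bert_weight</p>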
        <p>• Re-ranking</p>
        <p>Finally, after the training and hyperparameter tuning, we combine the scores from our trained model (BERT_scores) with the scores from step 3.2 (RRF60_scores) as per:
weighted_scores = RRF60_scores + best_bert_weight * BERT_scores
After computing the weighted_scores of each document, we re-sort all the documents by the new weighted_scores and create the final ranking, as sketched below.</p>
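        <p>With the tuned weight, a minimal sketch of the final re-sorting and of writing the result in the standard TREC run format (file and variable names are illustrative):</p>
        <p># combine the scores as in the formula above and re-sort the documents
weighted_scores = {doc_id: rrf60_scores[doc_id] + best_bert_weight * bert_scores[doc_id]
                   for doc_id in rrf60_scores}
ranking = sorted(weighted_scores.items(), key=lambda kv: kv[1], reverse=True)

# append the ranking for this query to the run file in TREC format
with open("run_hiball_bert.txt", "a") as out:
    for rank, (doc_id, score) in enumerate(ranking, start=1):
        out.write(f"{query_id} Q0 {doc_id} {rank} {score} HIBALL_BERT\n")</p>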
      </sec>
      <sec id="sec-3-3">
        <title>3.4. AI Experiments</title>
        <p>AI-based technologies are game changers in the current era. Tasks like information retrieval are human-related and tolerate some inaccuracy, so it feels natural to test AI tools in such projects.</p>
        <sec id="sec-3-3-1">
          <title>3.4.1. Dismissed experiments</title>
          <p>Generative models: The main idea here was to improve or expand documents and/or queries thanks to generative text models, such as the one backing ChatGPT. Most of the time, when we search, we are not looking for a document including the question itself, but rather for a document containing the answer to our question. If the AI is able to partially answer a question, or to rephrase a sentence, those new terms could make the query much more precise. This proved to be a dead end pretty soon: weaker models such as GPT-2 are totally unable to provide a useful result.</p>
          <p>Switching to a more powerful model requires far more hardware resources than we own. The open-source competitor of ChatGPT, GPT-J, requires a cluster of GPUs and more than 48 + 16GB of dedicated RAM.</p>
          <p>We then used the expanded queries to perform a search with Lucene. The evaluation scored with trec_eval did not show any significant improvement; sometimes it even got worse (query drifting?).</p>
          <p>We do believe this approach could lead to great results; however, we lack the hardware capabilities to investigate it systematically.</p>
          <p>Paraphrase models: These models aim to provide a different, improved version of a sentence. They are the AI answer to stemmers (sort of), in the sense that they provide a standardized and less convoluted way to express the same words.</p>
          <p>Unfortunately, this process is not deterministic, leading to unexpected behaviours: similar sentences may be mapped into completely different sentences; thus, if we map a document and a query using the same AI model, we may increase their distance instead of reducing it. Moreover, most of these models (even commercial ones like https://instatext.io/) are tuned for long sentences, whereas most of our queries are made of 1–3 words.</p>
          <p>This approach produced no relevant results either.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.4.2. Sentence similarity models</title>
          <p>AI-FIXED:
Our biggest experiment employed sentence similarity models. This kind of AI extracts features from text in the form of a dense vector. Similar sentences have similar corresponding vectors; this means we can perform the actual search in a vector space and pick the document vectors closest to the query vector.</p>
          <p>After some research, we ended up with the distilbert-base-nli-stsb-mean-tokens model, which proved to be a good compromise between performance and accuracy. While it is also based on BERT, it follows a different approach: it is not meant for re-ranking, but rather provides a ranking on its own.</p>
          <p>We were able to run the model and index the whole longeval-train-v2 document collection into a 4.5GB file (each document is represented as a 768-dimensional float32 array). This was rather compute-intensive, and required hours, even with an NVIDIA 3060 GPU. Once indexed, we use another notebook to perform the information retrieval. The full process does not rely on Java/Lucene at all, because it employs Faiss, a powerful Python library for working with vector spaces and indexes using hardware acceleration. Faiss is currently the fastest way to look for vector similarity using GPUs, and it is under the hood of popular commercial services such as Amazon Elasticsearch.</p>
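          <p>A minimal sketch of the indexing and search steps with sentence-transformers and Faiss (the documents and query variables are assumed to hold the raw texts; the flat L2 index is one simple choice among Faiss's index types):</p>
          <p>import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# encode the corpus into 768-dimensional float32 vectors
model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
doc_vectors = np.asarray(model.encode(documents), dtype="float32")

# index the vectors with Faiss and retrieve the closest documents per query
index = faiss.IndexFlatL2(doc_vectors.shape[1])
index.add(doc_vectors)
query_vector = np.asarray(model.encode([query]), dtype="float32")
distances, neighbours = index.search(query_vector, 1000)</p>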
          <p>Once every query has been performed, we are finally able to check our results. However, there is a problem: looking at the results from the trec_eval tool, we get very different scores. While the classic approach is rewarded, the AI results are given a P_10 and even a P_30 of zero. We think this happens because the AI finds documents different from the assessed ones, and thus unknown to trec_eval. After all, the train dataset only has about 10 assessments per query, and those documents were retrieved using a different approach.</p>
          <p>AI-MERGED:
The retrieved documents looked so different from those suggested by any other method that we decided to try a fusion.</p>
          <p>We decided to fuse the "AI-FIXED" run with the most promising one so far, the one described in Section 4 as "BASELINE".</p>
          <p>The usual fusion functions did not work as expected, so we ended up writing our own. We chose an almost linear function, because it was easier to tune by hand.</p>
          <p>After some attempts, we ended up with the following weights:
score(d) = score_{BASELINE}(d) + (\frac{1}{d_{AI}} - 0.0026) \cdot 1000 \cdot 0.5 \cdot w \quad (5)
where d_{AI} is the distance between the document and query vectors, and w = 0.25 if the document is short, 1 otherwise.</p>
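          <p>A small sketch of the fusion function (5) applied per document; names are illustrative, and reading the AI score as the inverse of the Faiss distance is our interpretation of the formula:</p>
          <p># hand-tuned, almost linear fusion of the BASELINE and AI-FIXED scores
def merged_score(baseline_score, ai_distance, is_short_document):
    w = 0.25 if is_short_document else 1.0
    return baseline_score + (1.0 / ai_distance - 0.0026) * 1000 * 0.5 * w</p>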
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>In this section, we will provide the setup used for each system/run submitted to CLEF. We will describe the dataset, the evaluation measures, and the hardware we used to train and evaluate our system. The 5 runs submitted to CLEF are:
• HIBALL_BASELINE: described in Sec. 3.1 (with the Lucene StopList and K Stemmer), is our first and most basic system, which requires only Lucene to work;
• HIBALL_RRF60: described in Sec. 3.2, which merges the BM25 and LM1000 similarity runs with Python code;
• HIBALL_BERT: described in Sec. 3.3, trained on GPUs with Colab notebooks;
• HIBALL_AIFIXED: described in Sec. 3.4, which showcases our sentence similarity model, for which we used the GPU of our physical device;
• HIBALL_AIMERGED: described in Sec. 3.4, which merges HIBALL_AIFIXED with HIBALL_BASELINE.</p>
      <sec id="sec-4-1">
        <title>4.1. Data description</title>
        <p>The data provided by CLEF includes a training dataset and a test dataset. The training data can
be found at http://hdl.handle.net/11234/1-5010, while the test data is at http://hdl.handle.net/
11234/1-5139.</p>
        <p>• training data: includes documents, queries, and qrels collected during June 2022. The document corpus has 1.5 million scraped Web pages, provided in both the TREC and JSON formats. We worked on the JSON-format documents. Each document has an id and a contents field. The dataset also provides 672 train queries, in both TREC and TSV formats. In the qrels file, documents are labelled with human-provided assessments: documents are scored as either highly relevant (2), relevant (1), or irrelevant (0) to the queries.
• testing data: contains two sets of data, a short-term dataset collected during July 2022 and a long-term dataset collected during September 2022. Queries and documents keep the same format as the training data.</p>
        <p>We used only the English version of the documents for training, and the short-term dataset for the evaluation.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Evaluation Measures</title>
        <p>We used the trec_eval tool to evaluate the rankings based on the query relevance list (qrels file) and the document rankings (run file). The metrics we focused on are:
• Average Precision (AP), defined as
AP = \frac{1}{RB} \sum_{r \in R} P@r \quad (7)
where RB is the recall base and R is the set of the rank positions of the relevant retrieved documents;
• Mean Average Precision (MAP), defined as the mean of the AP scores for each query in a set Q of queries:
MAP = \frac{1}{|Q|} \sum_{q \in Q} AP(q) \quad (8)
• Precision at document cut-off (P@k), with k = 10, defined as
P@k = \frac{1}{k} \sum_{i=1}^{k} rel(i) \quad (9)
• Recall at document cut-off (R@k), with k = 1000, defined as
R@k = \frac{1}{RB} \sum_{i=1}^{k} rel(i) \quad (10)
where RB is the recall base and rel(i) is the relevance of the document at rank i;
• Discounted Cumulated Gain at cut-off p (DCG@p), defined as
DCG@p = \sum_{i=1}^{p} \frac{rel(i)}{\log_2(i + 1)} \quad (11)
• Normalized Discounted Cumulated Gain at cut-off p (nDCG@p), obtained by dividing DCG@p by the ideal DCG (iDCG@p) to normalize the score in [0, 1].</p>
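        <p>As an illustration of definition (7), a toy Python computation of AP for a single query:</p>
        <p># toy computation of Average Precision for one query
def average_precision(ranked_doc_ids, relevant_doc_ids):
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_doc_ids:
            hits += 1
            precision_sum += hits / rank          # P@r at each relevant rank r
    return precision_sum / len(relevant_doc_ids)  # divide by the recall base RB</p>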
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Repository</title>
        <p>For the Lucene pipeline, we worked with Java. Additionally, we worked on Python notebooks, which have proven to be the best environment for AI prototyping. All the developed code is available in the git repository of the group: https://bitbucket.org/upd-dei-stud-prj/seupd2223-hiball/src/master/.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Hardware</title>
        <p>The hardware used to run the experiments is:
• Model: HP; OS: Windows 11; CPU: Intel i5; RAM: 16GB; Storage: SSD 512GB
• Model: MacBook Air; OS: macOS Ventura 13.3.1; CPU: Apple M1; Storage: SSD 256GB
A GPU is mandatory, since the AI processes rely on hardware acceleration to run in feasible time. For training the re-ranker and producing the re-ranked runs, we used the free tier of Google Colab, which includes:
• Model: Google Compute Engine; RAM: 12.7 GB; GPU: NVIDIA T4 15.0 GB; Storage: SSD 78.2 GB
The Google Colab free tier is not enough to run the AI experiments of Section 3.4, but fortunately we had the opportunity to run our code on a physical device:
• Model: Dell; RAM: 16 GB; GPU: NVIDIA 3060; Storage: SSD 1024 GB</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <p>In this section, we will first introduce some general considerations about each employed technique. We will then provide a result comparison (5.5) based on the trec_eval metrics, and finally provide a statistical analysis of it (5.6).</p>
      <sec id="sec-5-1">
        <title>5.1. BASELINE System</title>
        <p>To choose our BASELINE system, we experimented with all the possible combinations of stemming and StopList filters with the BM25 Similarity.</p>
        <p>The results were not significantly different among them:</p>
        <p>• the MAP values were around 0.1465;
• the P@10 values were around 0.0937;
• the Recall values were around 0.6954;
• and the nDCG values were around 0.2896.</p>
        <p>K-stemming slightly outperforms Lovins-stemming and Porter-stemming, while the "Lucene" and "Lingpipe" Stop Lists have the best overall performance among all the stop lists we tested in combination with the stemmers. For the BASELINE system, we chose the Lucene StopList and the K Stemmer: with this configuration, we obtained the highest MAP and P@10 values.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. RRF (Reciprocal Rank Fusion)</title>
        <p>In the second part of our system development, where the goal was to improve recall, we tried different Similarities and their combinations. From Table 1, we can see that the best recall is achieved with the RRF rank fusion algorithm, and we chose RRF60 for the re-ranking system as it has the higher overall MAP value.</p>
      </sec>
      <sec id="sec-5-2a">
        <title>5.3. BERT</title>
        <p>From the results in Table 2, we can see that our HIBALL_BASELINE, with the Lucene StopListFilter and KStemFilter, slightly improves over the baseline of LongEval-Retrieval [3]. However, from Tables 2, 3, and 4 we can see that the re-ranking system with the BERT model, which builds on the RRF60 system, significantly improves all the measures: MAP, P@10, nDCG, and Recall. The BERT model fine-tuned with the parameters described in Sec. 3.3, together with early stopping, shows good generalization on new data: for example, the results on the TEST data (unknown to the model) in Tab. 4 are similar to those on the TRAIN data (used for training and validation) in Tab. 2.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.4. AI experiments</title>
        <p>We had trouble measuring the final score. Evaluating our results has been, and still is, a challenge, because the train set provided by CLEF does not cover some of the documents retrieved by the AI models, even though the same documents look right to a human eye. Sometimes, the conclusions we drew by manually looking at the results clearly conflicted with those measured by trec_eval. We also realized only too late how successful HIBALL_BERT was, because it did not look so promising at first. Since we fused AI-FIXED with HIBALL_BASELINE, the improvements we obtained are relative to HIBALL_BASELINE and do not look as good when compared with the HIBALL_BERT results.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.5. Result comparison</title>
      </sec>
      <sec id="sec-5-5">
        <title>5.6. Statistical Analysis</title>
        <p>While Table 4 shows the means of the measures, the boxplots better depict the performance distribution over the whole data. Figure 3 shows the results on the HELDOUT set, while Figure 4 shows the results on the TEST set. From the boxplot of AP (Figure 3a), we noticed that HIBALL_BERT outperforms the other systems with more high-precision queries, despite the fact that some queries still have precision close to 0. The distribution of the nDCG boxplot (Figure 3b) shows a similar trend. This tells us that the BERT re-ranker improved the AP and nDCG of many queries, but there are still many queries whose results do not benefit from the BERT re-ranker.</p>
        <p>Unsurprisingly, the boxplots of the results on the TEST set are consistent with those on the HELDOUT set. In Figure 4a, we can see that the HIBALL_BERT system has a much larger third quartile, toward higher AP values, compared with BASELINE. The HIBALL_BERT system has a certain ratio of queries with AP values above 0.6, which significantly raises the overall MAP performance. The other 3 runs, HIBALL_BASELINE, HIBALL_RRF60, and HIBALL_AIMERGED, do not show much difference. A similar behavior can be observed in the nDCG boxplots (Figure 4b). It is worth noticing that, from the HELDOUT set to the TEST set, HIBALL_BERT's third quartile of the P@10 distribution stays at around 0.2 (Figure 4c), whereas the BASELINE model's third quartile of P@10 drops from 0.2 to 0.1. Thus, the IR system with the BERT re-ranker sustains its high P@10 performance on the TEST set.</p>
        <p>ANOVA tests were run to compare the 5 runs. The p-values of the two-way ANOVA tests of AP and nDCG on the HELDOUT set are close to 0 (Tables 5 and 6), while the p-values on the TEST set are exactly 0, as shown in Tables 7 and 8. The close-to-0 p-values are far below the threshold α = 0.05. We confidently reject the null hypothesis, i.e., that the 5 runs are equal, and claim that the runs are different with respect to the measures AP and nDCG. In Figure 5, Tukey's Honestly Significant Difference (HSD) test clearly reveals the differences of the means among the 5 runs on the HELDOUT set. HIBALL_BERT has high AP and nDCG means, while HIBALL_BASELINE, HIBALL_RRF60, and HIBALL_AIMERGED have similar means and variances, which reinforces the argument inferred from the boxplots shown in Figure 3. Once again, Tukey's HSD test in Figure 6 shows the differences of the means among the 5 runs on the TEST set. Therefore, we can safely conclude that the 5 runs are significantly different.</p>
        <p>Figure 6: Tukey's HSD test of the 5 runs on the TEST set: (a) Average Precision; (b) nDCG.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <p>In this report, we started by introducing a relevant problem in the Information Retrieval domain, and we proposed our system as a solution. Then we explained how we incrementally came up with the final system, i.e., starting from a BASELINE system, then trying to improve recall with RRF, and finally the other measures, like MAP, P@10, and nDCG, with BERT. Finally, we spent some time investigating AI models pretrained on general-purpose text. Those provided different results from the classical approach; while those results look promising on manual inspection, they were outside the assessments' radar, and thus we were not able to provide a solid measure for them.</p>
      <p>In conclusion, from our experiments, BERT re-ranking is the most promising way to improve the search engine metrics. There are, however, flaws in our BERT-base training process. Firstly, due to the limits of GPU and RAM memory, we only fed the first 512 tokens to the BERT-base model, which means the model was not trained on sequences large enough to cover each document (4900 tokens on average). Secondly, the training process is time-consuming and was restricted by an insufficiently powerful GPU. We trained the BERT-base model for four epochs only, and we did not distinguish between "highly relevant" and "relevant" document assessments. We could also train the BERT-base model with the full training documents by shifting the input window. Moreover, the BERT-base model could be trained for more epochs to get higher accuracy. To further improve the re-ranker performance, the BERT-large model might be a better option than the BERT-base model.</p>
      <p>We should also find a way to measure whether the pretrained AI experiments were actually on the spot or not, and try to fuse them with the HIBALL_BERT run instead of the HIBALL_BASELINE run.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Alkhalifa</surname>
          </string-name>
          , I. Bilal,
          <string-name>
            <given-names>H.</given-names>
            <surname>Borkakoty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Camacho-Collados</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Deveaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Ebshihy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Espinosa-Anke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gonzalez-Saez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          , E. Kochkina,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liakata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Loureiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. T.</given-names>
            <surname>Madabushi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulhem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Popel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Servan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zubiaga</surname>
          </string-name>
          ,
          <article-title>Overview of the clef-2023 longeval lab on longitudinal evaluation of model performance, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction</article-title>
          .
          <source>Proceedings of the Fourteenth International Conference of the CLEF Association (CLEF 2023), Lecture Notes in Computer Science (LNCS)</source>
          , Springer, Thessaloniki, Greece,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>[2] Clef</source>
          <year>2023</year>
          ,
          <year>2023</year>
          . URL: https://clef-longeval.github.io/tasks/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P. G. R.</given-names>
            <surname>Deveaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gonzalez-Saez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulhem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Popel</surname>
          </string-name>
          , Longevalretrieval:
          <article-title>French-english dynamic test collection for continuous web search evaluation</article-title>
          ,
          <year>2023</year>
          . arXiv:2303.03229.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Bush</surname>
          </string-name>
          , et al.,
          <article-title>As we may think</article-title>
          ,
          <source>The atlantic monthly 176</source>
          (
          <year>1945</year>
          )
          <fpage>101</fpage>
          -
          <lpage>108</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Alves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schwanninger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Barbosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rashid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sawyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rayson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pohl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rummler</surname>
          </string-name>
          ,
          <article-title>An exploratory study of information retrieval techniques in domain analysis</article-title>
          ,
          <source>in: 2008 12th International Software Product Line Conference</source>
          , IEEE,
          <year>2008</year>
          , pp.
          <fpage>67</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, M. Brockschmidt, Codesearchnet challenge: Evaluating the state of semantic code search, arXiv preprint arXiv:1909.09436 (2019).</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Rescore bm25 with bert, 2020. URL: https://www.kaggle.com/code/jerrykuo7727/rescore-bm25-with-bert/notebook.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] K. Stathoulopoulos, How to build a semantic search engine with transformers and faiss, 2021. URL: https://towardsdatascience.com/how-to-build-a-semantic-search-engine-with-transformers-and-faiss-dcbea307a0e8.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] G. V. Cormack, C. L. A. Clarke, S. Buettcher, Reciprocal rank fusion outperforms condorcet and individual rank learning methods, in: SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, New York, NY, USA, 2009, pp. 758-759. URL: http://doi.acm.org/10.1145/1571941.1572114.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, 2018. arXiv:1810.04805.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] Bert (language model), 2023. URL: https://en.wikipedia.org/wiki/BERT_(language_model).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>