=Paper=
{{Paper
|id=Vol-3180/paper-248
|storemode=property
|title=Aldo Nadi at Touché 2022: Argument Retrieval for Comparative Questions
|pdfUrl=https://ceur-ws.org/Vol-3180/paper-248.pdf
|volume=Vol-3180
|authors=Maria Aba,Munzer Azra,Marco Gallo,Odai Mohammad,Ivan Piacere,Giacomo Virginio,Nicola Ferro
|dblpUrl=https://dblp.org/rec/conf/clef/AbaAGMPV022
}}
==Aldo Nadi at Touché 2022: Argument Retrieval for Comparative Questions==
Maria Aba, Munzer Azra, Marco Gallo, Odai Mohammad, Ivan Piacere, Giacomo Virginio and Nicola Ferro
University of Padua, Italy
Abstract
In this paper we present the information retrieval system we developed for the 2022 Touché @ CLEF
Task 2 evaluation campaign. The participation in the task was carried out as a student group project
conducted in the Search Engines course (a.y. 2021/2022) of the Computer Engineering and Data Science
master's degrees at the University of Padua.
The task's aim is to create systems able to retrieve documents that compare two options, e.g.
which is the better pet, a dog or a cat.
Here we describe the architecture of our system, list the software and hardware resources we made
use of, discuss the results obtained with different configurations and, finally, present improvements
that could be applied to our system to enhance its performance.
Keywords
Information retrieval, Comparative questions, Lucene
1. Introduction
Before the era of the internet, information storage and retrieval systems were mostly used by
professionals for medical research, in libraries, by governmental organizations, and in archives.
Access to such information was therefore a hard process, especially for non-search experts.
Recently, with the fast increase in the amount of data and information available online, the
importance of search engines has grown rapidly. Nowadays, people use search engines to locate
and buy goods, choose a vacation destination, select a medical treatment, etc. Search engines
have transitioned from being tools for finding information to tools for building opinions and
making major decisions. All of these aspects, taken together, make retrieval systems essential
for impacting industry and improving the field of information retrieval.
This paper is structured as follows: Section 2 presents related work; Section 3 describes our
approach; Section 4 explains our experimental setup; Section 5 discusses our main findings in
the model selection process; Section 6 discusses the results and analysis of our runs; finally,
CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
maria.aba@studenti.unipd.it (M. Aba); munzer.azra@studenti.unipd.it (M. Azra);
marco.gallo.9@studenti.unipd.it (M. Gallo); odai.mohammad@studenti.unipd.it (O. Mohammad);
ivan.piacere@studenti.unipd.it (I. Piacere); giacomo.virginio@studenti.unipd.it (G. Virginio); ferro@dei.unipd.it (N. Ferro)
http://www.dei.unipd.it/~ferro/ (N. Ferro)
ORCID: 0000-0001-9219-6239 (N. Ferro)
© 2022 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Section 7 draws some conclusions and outlooks for future work.
2. Related Work
The packages described in Sections 3.1 and 3.3 were obtained by expanding a baseline built on the
TIPSTER collection during the lessons of the "Search Engines" course at the University of Padua.
Information about the course is available online.1
3. Methodology
Figure 1 shows the class diagram of our implementation.
Figure 1: Class diagram of the project. The main classes are ParsedTopic, ParsedDocument, TopicParser, XMLTopicParser, DocumentParser, Parser, AnalyzerUtil, BaselineAnalyzer, MainAnalyzer, MultipleCharsFilter, DirectoryIndexer, Searcher, RF, RRF, ArgumentQualityVerifier, ArgumentQualityReranker and Run.
The developed Java system is divided into the following packages, each package representing
a stage: Parse, Analyze, Index, Filter, Search, RF, RRF, and Argument Quality.
3.1 Parse, Analyze, Index
These packages are in charge of creating the index and preparing the topics.
The documents in the DocT5Query-expanded corpus are parsed, their text field analyzed
(with the possibility of using different custom analyzers) and then indexed with the fields ID,
Body and DocT5Query.
1 https://en.didattica.unipd.it/off/2021/LM/IN/IN2547/004PD/INQ0091599/N0
The topics are also parsed so that the fields number, title and objects can be used in the search
using Lucene, with the latter two also analyzed.
3.2 Filter
This class extracts the strings from the objects field of the topics and returns a BooleanQuery.Builder
object, which can later be consumed by the search method by adding it as a MUST clause, so that
only documents containing all the terms in the objects field are retrieved.
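The actual implementation builds a Lucene BooleanQuery with one MUST clause per term; as a minimal stdlib sketch (with hypothetical method and variable names), the filtering logic amounts to keeping only documents whose term set contains every object term:

```java
import java.util.List;
import java.util.Set;

public class ObjectFilterSketch {
    // Keeps only documents containing every term from the topic's objects field,
    // mirroring the effect of adding each term as a MUST clause to the query.
    static boolean matchesAllObjectTerms(Set<String> docTerms, List<String> objectTerms) {
        for (String term : objectTerms) {
            if (!docTerms.contains(term)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> doc = Set.of("dog", "cat", "pet", "best");
        System.out.println(matchesAllObjectTerms(doc, List.of("dog", "cat")));     // true
        System.out.println(matchesAllObjectTerms(doc, List.of("dog", "hamster"))); // false
    }
}
```

In Lucene terms, each object term becomes a `TermQuery` added with `BooleanClause.Occur.MUST`, so a document missing any of the terms is excluded from the result set.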
3.3 Search
This package is responsible for:
1. Calling the Parse and Analyze packages to retrieve and prepare the topics for the search.
2. Defining which type of comparison to perform between topics and documents, which can
be chosen by changing the similarity function.
3. Defining how to use topics in the search.
The topic titles are used to search by similarity with a SHOULD clause; it is also possible to
assign weights to the different document fields among which to search, or to select
just one of the two fields (Contents and DocT5Query), and the MUST clause described in
the Filter class can be added.
4. Writing the results to a file.
3.4 RF
RF is a customized class whose goal is to perform a search using explicit relevance feedback
to carry out query expansion.
RF functions in a similar way to the Searcher class, except that the query used in the search
is built from the tokens present in relevant documents, instead of from the terms in the title
field of the topics file.
The class collects the docID and relevance score of every relevant document in the qrels file.
The tokens and their frequencies in the relevant documents are retrieved by looking the
document up by docID and iterating through its term vector.
The tokens used in the search are boosted by their frequency in the document multiplied by
the square of the relevance score.
Relevance feedback is classically based on the Rocchio algorithm [1], whose formula is:

$$\vec{Q}_m = a \cdot \vec{Q}_O + \left( b \cdot \frac{1}{|D_r|} \sum_{\vec{D}_j \in D_r} \vec{D}_j \right) - \left( c \cdot \frac{1}{|D_{nr}|} \sum_{\vec{D}_k \in D_{nr}} \vec{D}_k \right)$$

where $\vec{Q}_m$ is the modified query vector, $\vec{Q}_O$ is the original query vector, $\vec{D}_i$ is the document
vector for the $i$-th document, $D_r$ is the set of relevant documents, $D_{nr}$ is the set of non-relevant
documents, and $a$, $b$ and $c$ are weight parameters.
In our case the parameters used are $a = 0$, $b = 1$, $c = 0$.
The Rocchio algorithm is, however, defined for binary relevance; since this collection
uses multi-graded relevance, our version of RF is customized to take into account the different
relevance scores used (0 to 3).
The custom formula we used is:

$$\vec{Q}_m = \frac{1}{|D_r|} \sum_{\vec{D}_i \in D_r} k_i^2 \cdot \vec{D}_i$$

where $\vec{Q}_m$ is the modified query vector, $\vec{D}_i$ is the document vector for the $i$-th document, $D_r$ is
the set of relevant documents, and $k_i$ is the relevance score of the $i$-th document.
In this work a total of 491 relevant documents have been used to perform Relevance Feedback.
The results of the search are then outputted as a standard run file.
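The per-token boost described above (term frequency times the squared relevance score, accumulated over the relevant documents) can be sketched as follows; this is a hypothetical stdlib illustration, while the real class reads term frequencies from the Lucene index's term vectors:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RfBoostSketch {
    // For each token, accumulates boost += termFrequency * relevance^2
    // over the relevant documents, following the customized formula above.
    static Map<String, Double> tokenBoosts(List<Map<String, Integer>> docTermFreqs, int[] relevance) {
        Map<String, Double> boosts = new HashMap<>();
        for (int i = 0; i < docTermFreqs.size(); i++) {
            double k2 = (double) relevance[i] * relevance[i]; // squared relevance score
            for (Map.Entry<String, Integer> e : docTermFreqs.get(i).entrySet()) {
                boosts.merge(e.getKey(), e.getValue() * k2, Double::sum);
            }
        }
        return boosts;
    }

    public static void main(String[] args) {
        // Two relevant documents with their term frequencies and relevance scores 2 and 3.
        List<Map<String, Integer>> docs = List.of(
            Map.of("price", 3, "linux", 1),
            Map.of("price", 1, "windows", 2));
        // "price": 3*2^2 + 1*3^2 = 21.0
        System.out.println(tokenBoosts(docs, new int[]{2, 3}).get("price"));
    }
}
```

The boosted tokens are then assembled into a weighted Lucene query in place of the topic title terms.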
3.5 RRF
This package contains a single class, also called RRF.
RRF.java is a customized class whose goal is to perform Reciprocal Rank Fusion
[2] to fuse the results of different runs into a single one.
RRF takes as input a directory path and performs RRF using all the runs in .txt documents
inside that directory.
For each run and for each topic, the documents and their respective rankings are collected.
Then each document receives a new score using the RRF formula.
Given a set of documents $D$ and a set of rankings $R$ for the documents, the formula for RRF is:

$$RRFscore(d \in D) = \sum_{r \in R} \frac{1}{k + r(d)}$$

where $k$ is a fixed number; in this case $k$ is set to 30.
Then, for each topic, documents are ranked (and ordered) based on their RRF score.
The results of the search are then outputted as a standard run file.
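The fusion step can be sketched in a few lines of stdlib Java (hypothetical helper names; the actual class also handles reading the run files):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RrfSketch {
    static final int K = 30; // the fixed constant k used in our runs

    // Each ranking maps docID -> rank (1-based). A document's fused score is
    // the sum of 1 / (K + rank) over the rankings in which it appears.
    static Map<String, Double> fuse(List<Map<String, Integer>> rankings) {
        Map<String, Double> scores = new HashMap<>();
        for (Map<String, Integer> ranking : rankings) {
            for (Map.Entry<String, Integer> e : ranking.entrySet()) {
                scores.merge(e.getKey(), 1.0 / (K + e.getValue()), Double::sum);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, Integer> runA = Map.of("d1", 1, "d2", 2);
        Map<String, Integer> runB = Map.of("d2", 1, "d3", 2);
        // d2 appears in both runs, so its score is 1/32 + 1/31
        System.out.println(fuse(List.of(runA, runB)).get("d2"));
    }
}
```

Documents appearing near the top of several runs accumulate the highest fused scores, which is what makes RRF effective at combining runs of different character.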
3.6 Argument quality
We decided to make use of IBM Project Debater API.
Project Debater is an AI system able to perform various debating tasks at a human
level. IBM makes some services based on this system freely available, for research purposes,
through an API [3].
We were interested in the argument quality service of the API. It accepts a pair of strings
labeled Sentence and Topic, and returns a float score in the range 0-1 based on the relevance
of the sentence to the topic and on the quality of the sentence as a text, i.e. how well it
is written.
Since the rest of our system is designed to already score documents based on the relevance to
the topic, we now just wanted to evaluate the text quality. In order to do so, for each document
in the collection we decided to send Sentence-Topic pairs in which the Sentence was the body
of the document and the Topic was an empty string.
We coded the ArgumentQualityVerifier class which evaluates the written quality of each
document by using the API and then saves the scores to a file.
Then we had to use the obtained scores to rerank the results of the search saved in a run file.
To do so, we defined the ArgumentQualityReranker class, which:
1. loads the quality scores of all the documents from the file into a Map object;
2. iterates over the lines of the old run file and, for each line, multiplies the old score by the one
assigned by the Project Debater API and saves the object representing the new line to a list;
3. sorts the list of new lines by topic number and score and writes them to a new run file.
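Steps 2 and 3 above can be sketched as follows (a stdlib illustration with a hypothetical RunLine record; the real class reads and writes TREC-format run files):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class RerankSketch {
    record RunLine(int topic, String docId, double score) {}

    // Multiplies each line's retrieval score by the document's quality score,
    // then re-sorts by topic number and descending new score.
    static List<RunLine> rerank(List<RunLine> run, Map<String, Double> quality) {
        List<RunLine> out = new ArrayList<>();
        for (RunLine l : run) {
            out.add(new RunLine(l.topic(), l.docId(),
                    l.score() * quality.getOrDefault(l.docId(), 0.0)));
        }
        out.sort(Comparator.comparingInt(RunLine::topic)
                .thenComparing(Comparator.comparingDouble(RunLine::score).reversed()));
        return out;
    }

    public static void main(String[] args) {
        List<RunLine> run = List.of(new RunLine(1, "d1", 10.0), new RunLine(1, "d2", 8.0));
        Map<String, Double> quality = Map.of("d1", 0.2, "d2", 0.9);
        // d2 overtakes d1: 8.0 * 0.9 = 7.2 > 10.0 * 0.2 = 2.0
        System.out.println(rerank(run, quality));
    }
}
```

A well-written document with a slightly lower retrieval score can thus overtake a higher-scored but poorly written one.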
4. Experimental Setup
4.1 Collections
Some of the collections used throughout the process of system development were the ones
provided by CLEF for the Touché 2022 edition, accessible from Task 2’s site. Those include:
• topics-task2.xml which contains the topics.
• The original version of passages.jsonl which contains the documents.
• DocT5Query expanded version of passages.jsonl2 which contains the documents expanded
with queries generated using DocT5Query. [4]
Other collections are:
• Historical stoplists: lucene, smart and terrier;
• Custom stoplists:
– Kueristop - Stoplist formed by the 400 most frequent terms in the Contents field
of the document collection;
– Kueristopv2 - Subset of kueristop, obtained by removing from it the terms appearing in
the Objects field of the topics, except for the very general terms also appearing in the
lucene stoplist ("in" and "the").
• Sentence quality - file containing, for each document in the document Collection, the
pairs of docIds and the score obtained by that document as explained in 3.6.
2 This collection was provided by Team Princess Knight, which participated in Touché; the corpus can be found at:
https://www.tira.io/t/expanded-passages-for-the-touche-22-task-2-argument-retrieval-for-comparative-questions/578
4.2 Evaluation measures
The evaluation measure used is Normalized Discounted Cumulative Gain at depth 5, NDCG@5
in short. [5]
It is the evaluation measure used by Touché to officially evaluate runs.
NDCG@k is calculated as follows:

$$NDCG@k = \frac{DCG@k}{iDCG@k}$$

where

$$DCG@k = \sum_{i=1}^{k} \frac{relevance_i}{\log_2(i + 1)}$$

and iDCG@k is the ideal DCG@k, i.e. the DCG@k for documents ordered by relevance,
highest to lowest.
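A minimal Java sketch of the computation (hypothetical helper names; the official evaluation uses trec_eval):

```java
import java.util.Arrays;

public class NdcgSketch {
    // DCG@k over graded relevance scores listed in ranked order.
    static double dcgAtK(int[] rel, int k) {
        double dcg = 0.0;
        for (int i = 1; i <= Math.min(k, rel.length); i++) {
            dcg += rel[i - 1] / (Math.log(i + 1) / Math.log(2)); // log base 2
        }
        return dcg;
    }

    // NDCG@k = DCG@k / iDCG@k, where the ideal ordering sorts relevance descending.
    static double ndcgAtK(int[] rel, int k) {
        int[] ideal = rel.clone();
        Arrays.sort(ideal);
        for (int i = 0; i < ideal.length / 2; i++) { // reverse into descending order
            int t = ideal[i];
            ideal[i] = ideal[ideal.length - 1 - i];
            ideal[ideal.length - 1 - i] = t;
        }
        return dcgAtK(rel, k) / dcgAtK(ideal, k);
    }

    public static void main(String[] args) {
        // Graded relevance (0-3) of the top-5 retrieved documents for one topic.
        System.out.println(ndcgAtK(new int[]{3, 2, 0, 1, 2}, 5));
    }
}
```

A perfectly ordered ranking yields NDCG@k = 1; swapping a highly relevant document below a less relevant one reduces the score, with early positions weighted most heavily.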
4.3 Git repository
The project's development can be found at its Git repository.3
4.4 Hardware
The specifications of the computer used to perform the runs are the following:
OS: Windows 10 Home 21H2 x64
CPU: AMD Ryzen 5 1600 @ 3.9 GHz
RAM: 16 GB 3000 MHz CL16
GPU: Nvidia GTX 1060 6 GB
HDD: 2 TB 7200 RPM
5. Model Selection
The conventional and ideal approach to evaluating the performance of the runs would have
been to use last year's test collection [6].
However, since we did not have access to last year's corpus, we decided to use this year's
test collection to evaluate our systems, with a qrels file containing relevance judgments manually
performed by us.
The qrels file has been built by gathering, for each of the runs performed, the top 5 ranked
documents for each topic.
3 https://bitbucket.org/upd-dei-stud-prj/seupd2122-kueri/src/master/
Table 1
NDCG@5 and setup for single runs
# NDCG@5 num_q RF Stoplist Filter Stemmer Similarity Weights Reranking
1 0.3830 50 False lucene False None BM25 [1,1] False
2 0.3756 50 False lucene False None LMD [1,1] False
3 0.3313 50 False lucene False None TFIDF [1,1] False
4 0.4140 50 False smart False None BM25 [1,1] False
5 0.4258 50 False terrier False None BM25 [1,1] False
6 0.4366 50 False kueristop False None BM25 [1,1] False
7 0.4548 50 False kueristopv2 False None BM25 [1,1] False
8 0.4015 48 False lucene True None BM25 [1,1] False
9 0.4759 41 False kueristop True None BM25 [1,1] False
10 0.4823 48 False kueristopv2 True None BM25 [1,1] False
11 0.2634 50 False kueristopv2 False None BM25 [0,1] False
12 0.3654 50 False kueristopv2 False None BM25 [1,0] False
13 0.4525 50 False kueristopv2 False None BM25 [1,2] False
14 0.4674 50 False kueristopv2 False None BM25 [2,1] False
15 0.4873 50 False kueristopv2 False Porter BM25 [1,1] False
16 0.8549 50 True kueristopv2 False None BM25 [1,1] False
17 0.8552 50 True kueristopv2 False Porter BM25 [1,1] False
18 0.5867 48 False kueristopv2 True None BM25 [1,1] True
19 0.5392 50 False kueristopv2 False None BM25 [2,1] True
20 0.5714 50 False kueristopv2 False Porter BM25 [1,1] True
21 0.8606 50 True kueristopv2 False None BM25 [1,1] True
22 0.8323 50 True kueristopv2 False Porter BM25 [1,1] True
Table 2
NDCG@5 and setup for rrf runs
# # fused NDCG@5 Reranking
23 10,14,15,16,17 0.7521 False
24 10,14,15,16,17 0.7450 True
The runs' performance has been evaluated using trec_eval; the key measures considered are
NDCG@5, the official measure used by CLEF to rank runs, and num_q, the number of topics
for which documents were retrieved (since some runs retrieved no documents for some of the topics).
All the runs, their characteristics and key measures are reported in Tables 1 and 2. The five
runs with their number in bold are the five submitted runs.
All the runs are performed on indexes obtained using the Standard tokenizer and Lowercase
filter, except for the indexes used in the runs obtained with Relevance Feedback, which use the
Letter tokenizer instead; this is because some of the tokens obtained with the Standard tokenizer
were written in a format that caused errors when used as a query (e.g. "text:text:text" would be
such a token).
5.1 Retrieval Similarity Model
Runs 1 to 3 compare BM25 (with Lucene's default parameters), Dirichlet and TFIDF
similarity as scoring functions, using the lucene stoplist. The run using BM25 was the best performer,
so we decided to use this similarity for all the other experiments.
5.2 Stoplists
Runs 1 and 4 to 7 compare different stoplists: in particular, we compared the lucene, smart and
terrier stoplists and our own custom stoplists kueristop and kueristopv2. The results show that,
among the "generic" stoplists, the larger ones have a bigger impact, but custom stoplists bring
even larger improvements, with kueristopv2 being the best.
5.3 Filter
We then wanted to assess the impact of filtering the runs by all the terms in the objects field.
Runs 8, 9 and 10 are performed by adding the filter to the setup of runs 1, 6 and 7. Run 9 only
retrieved documents for 41 topics, as 9 topics contain, in the objects field, terms that are in the
stoplist (and therefore are not in the index); runs 8 and 10 retrieve documents for 48 queries, because
lucene and kueristopv2 contain the terms "the" and "in", which again appear in the objects field for
two queries.
The runs with filtering have a better NDCG@5 score compared to the runs without, but they
retrieve documents for fewer topics. Retrieving no documents for some topics makes us assess
these runs as worse performing than the ones without filtering. Moreover, the improvement in
NDCG@5 score could be caused in part by the lack of these topics, as the system could have
worse performance on these topics compared to the others. Despite having worse results when
taken singularly, runs using filtering can be used to improve other runs through RRF.
5.4 Field Weight
Runs 11 to 14 use the same setup as the current best performing run, 7, changing the weights of
the Contents and DocT5Query fields respectively. When searching on a single field (weight 0 on
the other field) the score is much worse; increasing the weight of the DocT5Query field slightly
worsens the score, while increasing the weight of the Contents field improves it.
5.5 Stemmer
Run 15 adds a stemmer, specifically the Porter stemmer, to the setup of run 7; this addition
brings a good improvement in performance.
5.6 RF
Runs 16 and 17 are performed using Relevance Feedback, respectively without a stemmer
and with the Porter stemmer. These runs have an NDCG@5 score far higher than the
previous ones; this, however, is due to using the same collection, and in particular the same qrels,
both to obtain the RF runs and to score their performance.
To have a more reliable assessment of performance we could have run the search on an index
built by removing the documents present in the qrels file. However, while this would have prevented
the overfitting problem, we still could not have directly compared the results to the other runs; in fact,
the documents in the qrels file, being the top documents retrieved, should be the most relevant,
which means we should have expected worse results from the runs performed after removing those
documents from the collection.
5.7 RRF
The first rrf run is obtained by fusing a mixture of well performing and slightly different runs: 10,
14, 15, 16 and 17. It presents a very good NDCG@5 score, but since it uses RF runs the score is
not reliable, as these runs may also be affected by overfitting.
5.8 Reranking
Runs 18 to 22 and the second rrf run are obtained by applying reranking to the runs above (10,
14, 15, 16 and 17 and their fusion). Reranking has been performed by multiplying the documents'
scores in the runs by their respective argument quality scores, as described in 3.6.
Compared to their non-reranked counterparts, the results on the RF and rrf runs are
mixed, but again not the most reliable because of the previous overfitting; on the other three runs,
instead, reranking offers a large improvement in performance.
6. Results and Analysis
6.1 Results
The runs' performance is evaluated on this year's relevance qrels, provided by CLEF and built
by performing top-5 pooling on the runs delivered by all participants [7].
In Table 3 we show, for each run, the NDCG@5 score obtained during model selection and
the respective NDCG@5 score obtained with CLEF's relevance qrels.
The scores are close to the ones obtained in model selection, and all the choices made are
confirmed by the final results.
The only runs that differ much from the ones in the model selection are, as expected due to
the mentioned overfitting, the ones that use Relevance Feedback. These runs still have a better
score than the other runs obtained before reranking, but they do not differ from them as much as
they did earlier.
The runs obtained through RRF also suffer a decrease in score, again due to the partial
overfitting deriving from the RF runs.
As expected, reranking improved the performance of all runs, including the RF and RRF runs.
The only result hinted at in the model selection that we did not expect to turn out to be true was
that the Porter stemmer slightly worsened the score when applied to the RF runs.
The best performing run is run 24, the reranked RRF run.
Table 3
Model and CLEF’s qrels scores
# selection NDCG@5 Final NDCG@5
1 0.3830 0.3828
2 0.3756 0.3688
3 0.3313 0.2937
4 0.4140 0.4497
5 0.4258 0.4461
6 0.4366 0.4746
7 0.4548 0.4896
8 0.4015 0.4376
9 0.4759 0.5226
10 0.4823 0.5042
11 0.2634 0.2289
12 0.3654 0.4088
13 0.4525 0.4535
14 0.4674 0.4939
15 0.4873 0.5466
16 0.8549 0.6098
17 0.8552 0.6036
18 0.5867 0.5812
19 0.5392 0.5772
20 0.5714 0.6362
21 0.8606 0.6954
22 0.8323 0.6669
23 0.7521 0.6681
24 0.7450 0.7089
6.2 Statistical Analysis
All the following statistical analyses have been obtained using CLEF's relevance qrels.
In the analysis, to produce better results, when a run retrieves no documents for a topic its
NDCG@5 score is set to 0, while earlier that topic was simply not considered in the run's average
NDCG@5; therefore the results are worse than the ones observed earlier for those runs (8, 9, 10, 18).
First we wanted to check whether the runs were significantly different from one another; to do
so we used Tukey's HSD test [8], with $\alpha = 0.05$ (Figure 2).
In particular, run 24 is highlighted, to show how it significantly differs from all runs not using
RF or reranking (runs 1 to 15).
Then we produced a boxplot showing, for each run, the NDCG@5 score of each topic
on the Y axis, ordered by average NDCG@5 (Figure 3).
All runs have a very similar interquartile range and all runs, except run 19, have a score equal
to 0 for at least one topic.
Following these results, we also got interested in finding out the differences between topics,
to see what type of topic we had poor performance on (Figure 4).
Figure 2: Multiple comparison of Tukey's HSD test (14 runs have means significantly different from run 24).
Figure 3: Runs' boxplot ordered by NDCG@5.
The runs got their worst results on topics 43, 86 and 77, which are respectively:
• Should I prefer a Leica camera over Nikon for portrait photographs?
• I am planning to buy sneakers: Which are better, Adidas or Nike?
• Is it healthier to bake than to fry food?
The problem we found with these topics is, for the first two, that many of the documents
retrieved were ads, and, for the last one, that many of the documents retrieved were just recipes
that bake or fry food.
Given these results, we believe that in future work it might be useful to add to the search keywords
or shingles expressing comparison (e.g. "versus", "compared to", "against"), since comparison
between two items is intrinsic to the task.
Figure 4: Topics' boxplot ordered by decreasing mean NDCG@5.
We also decided to check the difference in performance on the different topics between our best
run and the second and third best, again to find the reason for the dip in performance on
specific topics (Figures 5 and 6).
Figure 5: Difference in performance by topic between runs 24 and 21.
When comparing run 24 to run 21, the performance is noticeably worse, with a difference of
0.5, for topic 9: "Why is Linux better than Windows?"
This topic is, however, one of the worst performing topics across all runs and, at the same
time, has the largest interquartile range.
Going more in depth we find, rather than a weakness of run 24, proof of the strength
of Relevance Feedback: for this specific topic the retrieved documents often show people talking
about only one of the two objects, while Relevance Feedback runs excel because they look for
keywords that are very often used when comparing the two objects, such as "price", "safety",
"open" and "source".
Figure 6: Difference in performance by topic between runs 24 and 23.
The comparison between runs 24 and 23 is particularly interesting, since the first is the
reranked version of the second. A noticeable dip in performance from run 23 to run 24 can be
seen in topics 30 and 26:
• Should I buy an Xbox or a PlayStation?
• Which is a better vehicle: BMW or Audi?
We decided to investigate the reason for the worse performance on topic 30 (Xbox vs.
PlayStation) by manually checking the top-5 documents retrieved for each run and their relevance
scores in the qrels.
The main reason for the difference is that most of the relevant documents for run 23 consist
mainly of short ads (in the format "item on sale - price"), but also contain a very short
phrase that is relevant to the topic. When reranking, these documents suffer a big penalty due
to their low sentence quality.
Instead, in run 24 we found a document that is purely an ad (which was unexpected, as we
thought that reranking with sentence quality was a very good way to push ads down the
ranking); however, this document was a well written ad, consisting of a company advertising
its console-selling business, so its sentence quality score is high.
A problem we found in this in-depth analysis is that some documents (e.g. clueweb12-1810wb-39-31830___3,
clueweb12-1808wb-28-21892___11) were poorly scored: these documents
were comprised only of ads, containing no information at all relevant to the topic, but were still
scored as partially relevant. Finding these two blatant mistakes among only 8 manually
checked documents (two documents are present in both runs) raises concerns about the reliability of
the relevance scores delivered by CLEF.
7. Conclusions and Future Work
We managed to effectively select the search engine model, which offered results close to the ones
obtained with the official qrels, except for the already expected difference due to overfitting in
RF.
We managed to substantially improve the performance of the runs compared to the initial
lucene baseline, with an increase in score of over 85% for our best performing run.
The greatest impact comes from relevance feedback, but reranking and a stoplist customized
to our corpus also offered noticeable improvements.
This is remarkable also because, due to the lack of access to last year's corpus, it was not
possible for us to perform any fine-tuning.
Having access to such test collections would allow us, for example, to fine-tune the BM25
parameters, the field weights and the term boosts in RF, and to experiment with many more
stoplists and stemmers. As an example, a run implementing the Porter stemmer (or a different
stemmer) with fine-tuned weights, giving the Contents field more weight than the DocT5Query one,
would probably best all the other single runs; however, the extra time it took us to manually assess
documents proved to be a strong limiting factor in the expansion of our experiments.
In future work it would be interesting, as mentioned, to add to the search terms used to
compare objects, and to experiment with other "classic" methods, for example shingles, but
mostly with machine learning and deep learning techniques, which have become the standard in
the last decade of information retrieval.
It would also be interesting to have the chance to tackle a similarly built task, but with data in
formats other than full text, with the addition of metadata (for example in this case, since the
corpus was created by crawling the web, having access to metadata from the webpages would
have presented new opportunities, like identifying ads).
References
[1] J. J. Rocchio, Relevance feedback in information retrieval (1965). URL: http://sigir.org/files/
museum/pub-08/XXIII-1.pdf.
[2] G. V. Cormack, C. L. A. Clarke, S. Büttcher, Reciprocal rank fusion outperforms condorcet
and individual rank learning methods (2009). URL: https://plg.uwaterloo.ca/~gvcormac/
cormacksigir09-rrf.pdf.
[3] Project debater for academic use, n.d. URL: https://early-access-program.debater.res.ibm.
com/academic_use.
[4] R. F. Nogueira, W. Yang, J. Lin, K. Cho, Document expansion by query prediction, CoRR
abs/1904.08375 (2019). URL: http://arxiv.org/abs/1904.08375. arXiv:1904.08375.
[5] K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques, ACM Trans.
Inf. Syst. 20 (2002) 422–446. URL: http://doi.acm.org/10.1145/582415.582418. doi:10.1145/
582415.582418.
[6] A. Bondarenko, L. Gienapp, M. Fröbe, M. Beloucif, Y. Ajjour, A. Panchenko, C. Biemann,
B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2021: Argument
Retrieval, in: K. Candan, B. Ionescu, L. Goeuriot, H. Müller, A. Joly, M. Maistro, F. Piroi,
G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and
Interaction. 12th International Conference of the CLEF Association (CLEF 2021), volume
12880 of Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2021,
pp. 450–467. URL: https://link.springer.com/chapter/10.1007/978-3-030-85251-1_28. doi:10.
1007/978-3-030-85251-1\_28.
[7] A. Bondarenko, M. Fröbe, J. Kiesel, S. Syed, T. Gurcke, M. Beloucif, A. Panchenko, C. Biemann,
B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2022: Argument
Retrieval, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. 13th
International Conference of the CLEF Association (CLEF 2022), Lecture Notes in Computer
Science, Springer, Berlin Heidelberg New York, 2022, p. to appear.
[8] Tukey's method, n.d. URL: https://www.itl.nist.gov/div898/handbook/prc/section4/prc471.
htm.