<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Aldo Nadi at Touché 2022: Argument Retrieval for Comparative Questions</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Maria</forename><surname>Aba</surname></persName>
							<email>maria.aba@studenti.unipd.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Padua</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Munzer</forename><surname>Azra</surname></persName>
							<email>munzer.azra@studenti.unipd.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Padua</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Marco</forename><surname>Gallo</surname></persName>
							<email>marco.gallo.9@studenti.unipd.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Padua</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Odai</forename><surname>Mohammad</surname></persName>
							<email>odai.mohammad@studenti.unipd.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Padua</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ivan</forename><surname>Piacere</surname></persName>
							<email>ivan.piacere@studenti.unipd.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Padua</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giacomo</forename><surname>Virginio</surname></persName>
							<email>giacomo.virginio@studenti.unipd.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Padua</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Nicola</forename><surname>Ferro</surname></persName>
							<email>ferro@dei.unipd.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Padua</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="department">Evaluation Forum</orgName>
								<address>
									<addrLine>September 5-8</addrLine>
									<postCode>2022</postCode>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Aldo Nadi at Touché 2022: Argument Retrieval for Comparative Questions</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">68FB8F123C5F71631BBB07D284B20D82</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T03:29+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Information retrieval</term>
					<term>Comparative questions</term>
					<term>Lucene</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper we present the information retrieval system we developed for the 2022 Touché @ CLEF Task 2 evaluation campaign. We participated in the task as a student group project conducted in the Search Engines course (a.y. 2021/2022) of the Computer Engineering and Data Science master's degrees at the University of Padua. The task's aim is to create systems able to retrieve documents that compare two options, e.g. whether a dog or a cat is the better pet.</p><p>Here we describe the architecture of our system, list the software and hardware resources we made use of, discuss the results obtained with different configurations and, finally, present improvements that could be applied to our system to enhance its performance.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Before the era of the internet, information storage and retrieval systems were mostly used by professionals for medical research, in libraries, by governmental organizations, and in archives. Therefore, accessing such information was a hard process, especially for non-experts. Recently, with the rapid increase in the amount of data and information available online, the importance of search engines has grown quickly. Nowadays, people use search engines to locate and buy goods, choose a vacation destination, select a medical treatment, etc. Search engines have transitioned from being tools for finding information to tools for building opinions and making major decisions. Taken together, these aspects make effective retrieval systems essential, both for industry and for the advancement of the field of information retrieval.</p><p>This paper is structured as follows: Section 2 presents related work; Section 3 describes our approach; Section 4 explains our experimental setup; Section 5 discusses our main findings in the model selection process; Section 6 discusses the results and analysis of our runs; finally, Section 7 draws some conclusions and outlooks for future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>The packages mentioned in Sections 3.1 and 3.3 were obtained by expanding a baseline built on the TIPSTER collection during the lessons of the "Search Engines" course at the University of Padua. Information about the course is available online:<ref type="foot" target="#foot_0">1</ref> </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head><p>Figure <ref type="figure" target="#fig_1">1</ref> shows the class diagram of our implementation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Parse, Analyze, Index</head><p>These packages are in charge of creating the index and preparing the topics.</p><p>The documents in the DocT5Query expanded corpus are parsed, their text field is analyzed (with the possibility of using different custom analyzers) and they are then indexed with the fields ID, Body and DocT5Query.</p><p>The topics are also parsed so that the number, title and objects fields can be used in the Lucene search; the latter two are also analyzed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Filter</head><p>This class extracts the strings from the objects field of the topics and returns a BooleanQuery.Builder object, which can later be consumed by the search method by adding it as a MUST clause, so that only documents containing all the terms of the objects field are retrieved.</p></div>
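The filtering behaviour described above, a conjunctive MUST constraint over all object terms, can be sketched outside Lucene as follows. This is an illustrative Python sketch with invented toy documents, not the project's Java code:

```python
# Sketch of the Filter idea: keep only documents containing every object term,
# mimicking a Lucene MUST clause over all terms of the objects field.

def build_object_filter(object_terms):
    """Return a predicate that behaves like a MUST clause over all terms."""
    terms = [t.lower() for t in object_terms]
    def must_match(document_text):
        tokens = set(document_text.lower().split())
        return all(t in tokens for t in terms)
    return must_match

# Toy usage: keep only documents mentioning both compared objects.
docs = ["dogs are loyal pets", "cats and dogs compared", "goldfish care"]
keep = build_object_filter(["cats", "dogs"])
filtered = [d for d in docs if keep(d)]
```

In Lucene the same effect is obtained by adding each term as a MUST clause; the whitespace tokenization above stands in for the analyzer.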
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Search</head><p>This package is responsible for:</p><p>1. Calling the Parse and Analyze packages to retrieve and prepare the topics for the search. 2. Defining which type of comparison to perform between topics and documents; this can be chosen by changing the similarity function. 3. Defining how to use topics in the search. The topics' titles are used to search by similarity with a SHOULD clause; it is also possible to assign weights to the different document fields among which to search, or to select just one of the two fields (Contents and DocT5Query), and the MUST clause described in the Filter class can be added. 4. Writing the results to a file.</p></div>
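The weighted-field search of point 3 amounts to scoring a document by a weighted combination of its per-field scores. A minimal Python sketch of the idea follows; the per-field scores are invented, since in the actual system they come from Lucene's similarity function:

```python
# Sketch of weighted-field scoring: the final document score is a weighted
# sum of per-field similarity scores, with weights such as [2,1] for the
# Contents and DocT5Query fields (cf. Table 1).

def weighted_score(field_scores, weights):
    """Both arguments are dicts keyed by field name."""
    return sum(weights[field] * score for field, score in field_scores.items())

# Hypothetical per-field similarity scores for one document.
score = weighted_score({"Contents": 1.2, "DocT5Query": 0.4},
                       {"Contents": 2, "DocT5Query": 1})
```

Setting a field's weight to 0 is equivalent to searching on the other field only, as done in runs 11 and 12.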
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">RF</head><p>RF is a customized class with the goal of performing a search using explicit relevance feedback to perform query expansion.</p><p>RF functions in a similar way to the Searcher class, except that the query used in the search is built from the tokens present in relevant documents, instead of from the terms in the title field of the topics file.</p><p>The class collects the docID and relevance of every relevant document in the qrels file.</p><p>The tokens and their frequencies in the relevant documents are retrieved by looking up each document by docID and iterating through its term vector.</p><p>The tokens used in the search are boosted by their frequency in the document multiplied by the square of the relevance score.</p><p>Relevance feedback is traditionally based on the Rocchio algorithm <ref type="bibr" target="#b0">[1]</ref>, whose formula is:</p><formula xml:id="formula_1">\vec{Q}_m = a \cdot \vec{Q}_O + b \cdot \frac{1}{|D_r|} \sum_{\vec{D}_j \in D_r} \vec{D}_j - c \cdot \frac{1}{|D_{nr}|} \sum_{\vec{D}_k \in D_{nr}} \vec{D}_k</formula><p>where \vec{Q}_m is the modified query vector, \vec{Q}_O is the original query vector, \vec{D}_i is the document vector of the i-th document, D_r is the set of relevant documents, D_{nr} is the set of non-relevant documents and a, b and c are weight parameters.</p><p>In our case the parameters used are a = 0, b = 1, c = 0. 
The Rocchio algorithm, however, is defined for binary relevance; since this collection uses multi-graded relevance, our version of RF is customized to take into account the different relevance scores used (0 to 3).</p><p>The custom formula we used is:</p><formula xml:id="formula_2">\vec{Q}_m = \frac{1}{|D_r|} \sum_{\vec{D}_i \in D_r} k_i^2 \, \vec{D}_i</formula><p>where \vec{Q}_m is the modified query vector, \vec{D}_i is the document vector of the i-th document, D_r is the set of relevant documents, and k_i is the relevance score of the i-th document.</p><p>In this work a total of 491 relevant documents have been used to perform relevance feedback.</p><p>The results of the search are then outputted as a standard run file.</p></div>
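The custom formula can be sketched as follows. This is an illustrative Python sketch: the term-frequency vectors and relevance grades below are invented, while the actual implementation reads them from Lucene term vectors and the qrels file:

```python
# Sketch of the customised relevance-feedback query expansion:
# each relevant document contributes its terms, boosted by the term frequency
# multiplied by the square of the document's graded relevance (0 to 3),
# averaged over the set of relevant documents.

def modified_query(relevant_docs):
    """relevant_docs: list of (term_frequency_dict, relevance_grade) pairs."""
    q = {}
    n = len(relevant_docs)
    for freqs, k in relevant_docs:
        for term, f in freqs.items():
            q[term] = q.get(term, 0.0) + (k ** 2) * f / n
    return q

# Toy relevant documents with graded relevance scores.
docs = [({"linux": 2, "price": 1}, 3), ({"windows": 1, "price": 2}, 1)]
q_m = modified_query(docs)
```

The resulting weights play the role of the per-term boosts applied when the expanded query is submitted to the searcher.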
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5">RRF</head><p>This package contains a single class, also called RRF. RRF.java is a customized class with the goal of applying Reciprocal Rank Fusion <ref type="bibr" target="#b1">[2]</ref> to fuse the results of different runs into a single one.</p><p>RRF takes as input a directory path and performs RRF using all the runs stored as .txt files inside that directory.</p><p>For each topic, the documents and their respective rankings in each run are collected.</p><p>Each document then receives a new score computed with the RRF formula. Given a set of documents D and a set of rankings R for the documents, the formula for RRF is:</p><formula xml:id="formula_3">RRF_{score}(d \in D) = \sum_{r \in R} \frac{1}{k + r(d)}</formula><p>where k is a fixed number; in this case k is set to 30.</p><p>Then, for each topic, documents are ranked (and ordered) based on their RRF score.</p><p>The results of the search are then outputted as a standard run file.</p></div>
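The RRF scoring step can be sketched as follows; this is an illustrative Python sketch with toy rankings, whereas the real class reads the rankings from run files:

```python
# Sketch of Reciprocal Rank Fusion with k = 30, as used in this work.
# Each run is represented as a dict mapping doc_id to its 1-based rank.

def rrf_scores(rankings, k=30):
    """rankings: list of dicts mapping doc_id to its rank in one run."""
    scores = {}
    for run in rankings:
        for doc_id, rank in run.items():
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores

# Toy rankings for a single topic from two runs.
run_a = {"d1": 1, "d2": 2}
run_b = {"d2": 1, "d3": 2}
fused = sorted(rrf_scores([run_a, run_b]).items(), key=lambda x: -x[1])
```

A document that appears near the top of several runs (like d2 above) accumulates the largest fused score, which is why fusing diverse well-performing runs tends to help.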
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.6">Argument quality</head><p>We decided to make use of the IBM Project Debater API. Project Debater is an AI system designed to perform various debating tasks at a human level. IBM makes some services based on this system freely available, for research purposes, through an API <ref type="bibr" target="#b2">[3]</ref>. We were interested in the argument quality service of the API. It accepts a pair of strings, labeled Sentence and Topic, and returns a float score in the range 0-1 based on the relevance of the sentence to the topic and on the quality of the sentence as text, i.e. how well it is written.</p><p>Since the rest of our system is already designed to score documents based on their relevance to the topic, we only wanted to evaluate the text quality. In order to do so, for each document in the collection we decided to send Sentence-Topic pairs in which the Sentence was the body of the document and the Topic was an empty string.</p><p>We coded the ArgumentQualityVerifier class, which evaluates the written quality of each document by using the API and then saves the scores to a file.</p><p>We then had to use the obtained scores to rerank the results of the search saved in a run file. So we defined the ArgumentQualityReranker class, which:</p><p>1. loads the quality scores of all the documents from the file into a Map object; 2. iterates over the lines of the old run file and, for each line, multiplies the old score by the one assigned by the Project Debater API and saves the object representing the new line to a list; 3. sorts the list of new lines by topic number and score and writes them to a new run file.</p></div>
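The reranking step performed by ArgumentQualityReranker can be sketched as follows. The run entries and quality scores below are invented; in the real system the quality scores come from the Debater API file:

```python
# Sketch of quality-based reranking: multiply each run score by the
# document's quality score, then re-sort the results within each topic.

def rerank(run_lines, quality):
    """run_lines: list of (topic, doc_id, score); quality: doc_id to score."""
    reranked = [(t, d, s * quality[d]) for t, d, s in run_lines]
    # Sort by topic number ascending, then by new score descending.
    reranked.sort(key=lambda x: (x[0], -x[2]))
    return reranked

# Toy run for one topic: d1 ranked first before reranking.
run = [(1, "d1", 10.0), (1, "d2", 9.0)]
quality = {"d1": 0.3, "d2": 0.9}
new_run = rerank(run, quality)
```

In this toy example d2 overtakes d1 after reranking, because its higher quality score outweighs its slightly lower retrieval score.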
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental Setup</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Collections</head><p>Some of the collections used throughout the development of the system were provided by CLEF for the Touché 2022 edition and are accessible from Task 2's site. Those include:</p><p>• topics-task2.xml, which contains the topics.</p><p>• The original version of passages.jsonl, which contains the documents.</p><p>• The DocT5Query expanded version of passages.jsonl<ref type="foot" target="#foot_1">2</ref>, which contains the documents expanded with queries generated using DocT5Query <ref type="bibr" target="#b3">[4]</ref>.</p><p>Other collections are:</p><p>• Historical stoplists: lucene, smart and terrier;</p><p>• Custom stoplists:</p><p>-Kueristop -a stoplist formed by the 400 most frequent terms in the Contents field of the document collection; -Kueristopv2 -a subset of kueristop, obtained by removing from it the terms appearing in the Objects field of the topics, except for the very general terms also appearing in the lucene stoplist ("in" and "the").</p><p>• Sentence quality -a file containing, for each document in the document collection, the pair of its docID and the score obtained by that document as explained in 3.6.</p></div>
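A corpus-specific stoplist like kueristop can be derived mechanically from the collection. A minimal Python sketch of the idea follows, assuming the Contents fields are available as plain strings; the toy collection and whitespace tokenization are simplifications:

```python
# Sketch of building a custom stoplist: take the N most frequent terms
# of the collection's Contents field (N = 400 for kueristop).
from collections import Counter

def build_stoplist(contents, n=400):
    counts = Counter()
    for text in contents:
        counts.update(text.lower().split())
    return {term for term, _ in counts.most_common(n)}

# Toy collection with n = 2 for illustration.
stop = build_stoplist(["the cat and the dog", "the best dog"], n=2)
```

Kueristopv2 would then be obtained by subtracting from this set the terms appearing in the topics' Objects field, keeping only very general terms such as "in" and "the".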
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Evaluation measures</head><p>The evaluation measure used is the Normalized Discounted Cumulative Gain at depth 5, NDCG@5 in short <ref type="bibr" target="#b4">[5]</ref>. It is the evaluation measure used by Touché to officially evaluate runs. NDCG@k is calculated as follows:</p><formula>NDCG@k = \frac{DCG@k}{iDCG@k}</formula><p>where</p><formula xml:id="formula_4">DCG@k = \sum_{i=1}^{k} \frac{relevance_i}{\log_2(i+1)}</formula><p>and iDCG@k is the ideal DCG@k, i.e. the DCG@k obtained when documents are ordered by relevance, highest to lowest.</p></div>
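The measure can be computed directly from its definition. A minimal Python sketch, with relevance grades 0 to 3 as in this collection:

```python
# Sketch of NDCG@k: DCG@k divided by the DCG@k of the ideal ordering.
import math

def dcg_at_k(relevances, k):
    # Position i+1 (1-based) is discounted by log2(position + 1).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A perfectly ordered result list (relevance grades non-increasing) yields NDCG@k = 1, while placing a non-relevant document first lowers the score.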
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Git repository</head><p>The project's development history and source code can be found in its Git repository.<ref type="foot" target="#foot_2">3</ref> </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Hardware</head><p>The specifications of the computer used to perform the runs are the following: </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Model Selection</head><p>The conventional and ideal approach to evaluating the performance of the runs would have been to use last year's test collection <ref type="bibr" target="#b5">[6]</ref>. However, since we did not have access to last year's corpus, we decided to use this year's test collection to evaluate our systems, using a qrels file containing relevance judgments manually made by us.</p><p>The qrels file has been built by gathering, for each of the runs performed, the top 5 ranked documents for each topic. The runs' performance has been evaluated using trec_eval; the key measures considered are NDCG@5, the official measure used by CLEF to rank runs, and num_q, the number of topics for which documents were retrieved (since some runs retrieved no documents for some of the topics).</p><p>All the runs, their characteristics and key measures are reported in Tables <ref type="table" target="#tab_0">1</ref> and 2. The five runs with their number in bold are the five submitted runs.</p><p>All the runs are performed on indexes obtained using the Standard tokenizer and Lowercase filter, except for the indexes used in the runs obtained with relevance feedback, which use the Letter tokenizer instead; this is because some of the tokens obtained with the Standard tokenizer were written in a format that caused errors when used as a query (e.g. "text:text:text" would be such a token).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Retrieval Similarity Model</head><p>Runs 1 to 3 compare BM25 (using Lucene's default parameters), Dirichlet and TFIDF similarity as scoring functions, using the lucene stoplist. The run using BM25 was the best performer, so we decided to use this similarity for all the other experiments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Stoplists</head><p>Runs 1 and 4 to 7 compare different stoplists; in particular, we compared the lucene, smart and terrier stoplists and our own custom stoplists, kueristop and kueristopv2. The results show that, among the "generic" stoplists, the larger ones have a bigger impact, but the custom stoplists bring even greater improvements, with kueristopv2 being the best.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Filter</head><p>We then wanted to assess the impact of filtering the runs by all the terms in the objects field. Runs 8, 9 and 10 are performed by adding the filter to the setup of runs 1, 6 and 7. Run 9 only retrieved documents for 41 topics, as 9 topics contain, in the objects field, terms that are in the stoplist (and therefore are not in the index); runs 8 and 10 retrieve documents for 48 queries, because lucene and kueristopv2 contain the terms "the" and "in", which again appear in the objects field of two queries.</p><p>The runs with filtering have a better NDCG@5 score compared to the runs without; however, they retrieve fewer topics. Retrieving no documents for some topics makes us assess these runs as worse performing compared to the ones without filtering. Moreover, the improvement in NDCG@5 score could be caused in part by the absence of these topics, as the system could perform worse on them than on the others. Despite having worse results when taken singularly, runs using filtering can be used to improve other runs through RRF.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Field Weight</head><p>Runs 11 to 14 use the same setup as the current best performing run, 7, changing the weights of the Contents and DocT5Query fields. When searching on a single field (weight 0 on the other field) the score is much worse; increasing the weight of the DocT5Query field slightly worsens the score, while increasing the weight of the Contents field improves it.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5">Stemmer</head><p>Run 15 adds a stemmer, specifically the Porter stemmer, to the setup of run 7; this addition brings a good improvement in performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.6">RF</head><p>Runs 16 and 17 are instead performed using relevance feedback, respectively without a stemmer and with the Porter stemmer. These runs have an NDCG@5 score far higher than the previous ones; this, however, is due to using the same collection, and in particular the same qrels, both to obtain the RF runs and to score their performance.</p><p>To have a more reliable assessment of performance we could have performed the search on an index built after removing the documents present in the qrels file. However, while this would have prevented the overfitting problem, we still could not have directly compared the results to the other runs; in fact, the documents in the qrels file, being the top documents retrieved, should be the most relevant ones, which means we should have expected worse results from the runs performed after removing those documents from the collection.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.7">RRF</head><p>The first rrf run is obtained by fusing a mixture of well performing and slightly different runs: 10, 14, 15, 16 and 17. It presents a very good NDCG@5 score, but since it uses RF runs the score is not reliable, as these runs may also be affected by overfitting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.8">Reranking</head><p>Runs 18 to 22 and the second rrf run are obtained by applying reranking to the runs above (10, 14, 15, 16, 17 and their fusion). Reranking has been performed by multiplying the documents' scores in the runs by their respective argument quality scores, as described in 3.6.</p><p>Compared to their non-reranked counterparts, the results on the RF and rrf runs are mixed, but again not the most reliable because of the previous overfitting; on the other three runs, instead, reranking offers a substantial improvement in performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Results and Analysis</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Results</head><p>The runs' performance is evaluated on this year's relevance qrels, provided by CLEF, built by performing top-5 pooling on the runs delivered by all participants <ref type="bibr" target="#b6">[7]</ref>. In Table <ref type="table" target="#tab_1">3</ref> we show, for each run, the NDCG@5 score obtained during model selection and the respective NDCG@5 score obtained with CLEF's relevance qrels.</p><p>The scores are close to the ones obtained in model selection and all the choices made are confirmed by the final results.</p><p>The only runs that differ significantly from the model selection are, as expected due to the mentioned overfitting, the ones that use relevance feedback. These runs still have a better score than the other runs obtained before reranking, but they do not differ from them as much as they did earlier.</p><p>The runs obtained through RRF also suffer a decrease in score, again due to the partial overfitting deriving from the RF runs.</p><p>As expected, reranking improved the performance of all runs, including the RF and RRF runs.</p><p>The only result hinted at in the model selection that we did not expect to hold was that the Porter stemmer slightly worsened the score when applied to the RF runs.</p><p>The best performing run is run 24, the reranked RRF run. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Statistical Analysis</head><p>All the following statistical analyses have been obtained using CLEF's relevance qrels.</p><p>In the analysis, to produce better results, when a run retrieves no documents for a topic the NDCG@5 score for that topic is set to 0, while earlier that topic was not considered in the run's average NDCG@5; therefore the results are worse than the ones observed earlier for those runs <ref type="bibr" target="#b7">(8,</ref><ref type="bibr">9,</ref><ref type="bibr">10,</ref><ref type="bibr">18)</ref>.</p><p>First we wanted to check whether the runs were significantly different from each other; in order to do so we used Tukey's HSD test <ref type="bibr" target="#b7">[8]</ref>, with α = 0.05 (Figure <ref type="figure">2</ref>). In particular, run 24 is highlighted to show how it significantly differs from all the runs not using RF or reranking (runs 1 to 15).</p><p>Then we produced a boxplot showing, for each run, the NDCG@5 score of each topic on the Y axis, ordered by average NDCG@5 (Figure <ref type="figure">3</ref>). All runs have a very similar interquartile range and all runs, except run 19, have a score equal to 0 for at least one topic.</p><p>Following these results we also got interested in the differences by topic, to see what type of topic we performed poorly on. The runs got their worst results on topics 43, 86 and 77, which are respectively:</p><p>• Should I prefer a Leica camera over Nikon for portrait photographs?</p><p>• I am planning to buy sneakers: Which are better, Adidas or Nike?</p><p>• Is it healthier to bake than to fry food?</p><p>The problem we found with these topics is, for the first two, that many of the retrieved documents were ads, and, for the last one, that many of the retrieved documents were just recipes that bake or fry food.</p><p>Due to these results we believe that in future work it might be useful to add to the search keywords or shingles expressing 
comparison (e.g. "versus", "compared to", "against"), since the comparison between two items is intrinsic to the task.</p><p>We also decided to check the difference in performance on the various topics between our best run and the second and third best, again to find the reason for the dip in performance on specific topics (Figure <ref type="figure">5</ref>). When comparing run 24 to run 21, the performance is noticeably worse, with a difference of 0.5, for topic 9: "Why is Linux better than Windows?" This topic, however, is one of the worst performing topics among all runs and, at the same time, has the largest interquartile range.</p><p>Going more in depth we find, rather than a weakness of run 24, a proof of the strength of relevance feedback: for this specific topic the retrieved documents often display people talking about only one of the two objects; the relevance feedback runs instead excel because they look for keywords that are very often used when comparing the two objects, such as "price", "safety", "open" and "source". The comparison between run 24 and run 23 is particularly interesting, since the first is the reranked version of the second. A noticeable dip in performance from run 23 to run 24 can be seen in topics 30 and 26:</p><p>• Should I buy an Xbox or a PlayStation? • Which is a better vehicle: BMW or Audi?</p><p>We decided to investigate the reason for the worse performance on topic 30 (Xbox vs. PlayStation) by manually checking the top-5 documents retrieved by each run and their relevance scores in the qrels.</p><p>The main reason for the difference is that most of the relevant documents for run 23 consist mainly of short ads (in the format "item on sale -price"), but also contain a very short phrase that is relevant to the topic. 
These documents suffer a big penalty when reranked, due to their low sentence quality.</p><p>In run 24, instead, we found a document that is purely an ad (which was unexpected, as we thought that reranking by sentence quality was a very good way to push ads down the ranking); however, this document was a well written ad, consisting of a company advertising its console-selling business, and therefore its sentence quality score is high.</p><p>A problem we found in this in-depth analysis is that some documents (e.g. clueweb12-1810wb-39-31830___3, clueweb12-1808wb-28-21892___11) were poorly judged: these documents consisted only of ads, containing no information at all relevant to the topic, yet they were still scored as partially relevant. Finding these two blatant mistakes in only 8 manually checked documents (two documents are present in both runs) raises concerns on the reliability of the relevance scores delivered by CLEF.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusions and Future Work</head><p>We managed to effectively select the search engine model, which offered results close to the ones obtained with the official qrels, except for the already expected difference due to overfitting in RF.</p><p>We managed to substantially improve the performance of the runs compared to the initial Lucene baseline, with an increase in score of over 85% when considering our best performing run.</p><p>The greatest impact comes from relevance feedback, but reranking and a stoplist customized to our corpus also offered noticeable improvements. This is remarkable also because, due to the lack of access to last year's corpus, it was not possible for us to perform any fine-tuning.</p><p>Having access to such test collections would allow us, for example, to fine-tune the BM25 parameters, the field weights and the boosts for terms in RF, and to experiment with many more stoplists and stemmers. As an example, a run implementing the Porter stemmer (or a different stemmer) with fine-tuned weights, giving the Contents field more weight than the DocT5Query one, would probably best all the other single runs; however, the extra time it took us to also manually assess documents proved to be a strong limiting factor in the expansion of our experiments.</p><p>In future work it would be interesting, as mentioned, to add to the search terms used to compare objects, and to experiment with other "classic" methods, for example using shingles, but mostly with machine learning and deep learning techniques, which have become the standard in the last decade of information retrieval.</p><p>It would also be interesting to have the chance to tackle a similarly built task, but with the chance to work with data in formats different from full text, with the addition of metadata (for example in this case, since the corpus was created by crawling the web, having access to metadata from the webpages would have presented new opportunities, like 
identifying ads).</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Class diagram of the project</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head></head><label></label><figDesc>OS: Windows 10 Home 21H2 x64; CPU: AMD Ryzen 5 1600 @ 3.9 GHz; RAM: 16 GB 3000 MHz CL16; GPU: Nvidia GTX 1060 6 GB; HDD: 2 TB 7200 RPM</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 2 :Figure 3 :</head><label>23</label><figDesc>Figure 2: Multiple comparison of Tukey's HSD test</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Topics's Boxplot ordered by NDCG@5</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Difference in performance by topic in runs 24 and 21</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Difference in performance by topic in runs 24 and 23</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>NDCG@5 and setup for single runs</figDesc><table><row><cell>#</cell><cell>NDCG@5</cell><cell>num_q</cell><cell>RF</cell><cell>Stoplist</cell><cell>Filter</cell><cell>Stemmer</cell><cell>Similarity</cell><cell>Weights</cell><cell>Reranking</cell></row><row><cell>1</cell><cell>0.3830</cell><cell>50</cell><cell>False</cell><cell>lucene</cell><cell>False</cell><cell>None</cell><cell>BM25</cell><cell>[1,1]</cell><cell>False</cell></row><row><cell>2</cell><cell>0.3756</cell><cell>50</cell><cell>False</cell><cell>lucene</cell><cell>False</cell><cell>None</cell><cell>LMD</cell><cell>[1,1]</cell><cell>False</cell></row><row><cell>3</cell><cell>0.3313</cell><cell>50</cell><cell>False</cell><cell>lucene</cell><cell>False</cell><cell>None</cell><cell>TFIDF</cell><cell>[1,1]</cell><cell>False</cell></row><row><cell>4</cell><cell>0.4140</cell><cell>50</cell><cell>False</cell><cell>smart</cell><cell>False</cell><cell>None</cell><cell>BM25</cell><cell>[1,1]</cell><cell>False</cell></row><row><cell>5</cell><cell>0.4258</cell><cell>50</cell><cell>False</cell><cell>terrier</cell><cell>False</cell><cell>None</cell><cell>BM25</cell><cell>[1,1]</cell><cell>False</cell></row><row><cell>6</cell><cell>0.4366</cell><cell>50</cell><cell>False</cell><cell>kueristop</cell><cell>False</cell><cell>None</cell><cell>BM25</cell><cell>[1,1]</cell><cell>False</cell></row><row><cell>7</cell><cell>0.4548</cell><cell>50</cell><cell>False</cell><cell>kueristopv2</cell><cell>False</cell><cell>None</cell><cell>BM25</cell><cell>[1,1]</cell><cell>False</cell></row><row><cell>8</cell><cell>0.4015</cell><cell>48</cell><cell>False</cell><cell>lucene</cell><cell>True</cell><cell>None</cell><cell>BM25</cell><cell>[1,1]</cell><cell>False</cell></row><row><cell>9</cell><cell>0.4759</cell><cell>41</cell><cell>False</cell><cell>kueristop</cell><cell>True</cell><cell>None</cell><cell>BM25</cell><cell>[1,1]</cell><cell>False</cell></row><row><cell>10</cell><cell>0.4823</cell><cell>48</cell><cell>False</cell><cell>kueristopv2</cell><cell>True</cell><cell>None</cell><cell>BM25</cell><cell>[1,1]</cell><cell>False</cell></row><row><cell>11</cell><cell>0.2634</cell><cell>50</cell><cell>False</cell><cell>kueristopv2</cell><cell>False</cell><cell>None</cell><cell>BM25</cell><cell>[0,1]</cell><cell>False</cell></row><row><cell>12</cell><cell>0.3654</cell><cell>50</cell><cell>False</cell><cell>kueristopv2</cell><cell>False</cell><cell>None</cell><cell>BM25</cell><cell>[1,0]</cell><cell>False</cell></row><row><cell>13</cell><cell>0.4525</cell><cell>50</cell><cell>False</cell><cell>kueristopv2</cell><cell>False</cell><cell>None</cell><cell>BM25</cell><cell>[1,2]</cell><cell>False</cell></row><row><cell>14</cell><cell>0.4674</cell><cell>50</cell><cell>False</cell><cell>kueristopv2</cell><cell>False</cell><cell>None</cell><cell>BM25</cell><cell>[2,1]</cell><cell>False</cell></row><row><cell>15</cell><cell>0.4873</cell><cell>50</cell><cell>False</cell><cell>kueristopv2</cell><cell>False</cell><cell>Porter</cell><cell>BM25</cell><cell>[1,1]</cell><cell>False</cell></row><row><cell>16</cell><cell>0.8549</cell><cell>50</cell><cell>True</cell><cell>kueristopv2</cell><cell>False</cell><cell>False</cell><cell>BM25</cell><cell>[1,1]</cell><cell>False</cell></row><row><cell>17</cell><cell>0.8552</cell><cell>50</cell><cell>True</cell><cell>kueristopv2</cell><cell>False</cell><cell>Porter</cell><cell>BM25</cell><cell>[1,1]</cell><cell>False</cell></row><row><cell>18</cell><cell>0.5867</cell><cell>48</cell><cell>False</cell><cell>kueristopv2</cell><cell>True</cell><cell>None</cell><cell>BM25</cell><cell>[1,1]</cell><cell>True</cell></row><row><cell>19</cell><cell>0.5392</cell><cell>50</cell><cell>False</cell><cell>kueristopv2</cell><cell>False</cell><cell>None</cell><cell>BM25</cell><cell>[2,1]</cell><cell>True</cell></row><row><cell>20</cell><cell>0.5714</cell><cell>50</cell><cell>False</cell><cell>kueristopv2</cell><cell>False</cell><cell>Porter</cell><cell>BM25</cell><cell>[1,1]</cell><cell>True</cell></row><row><cell>21</cell><cell>0.8606</cell><cell>50</cell><cell>True</cell><cell>kueristopv2</cell><cell>False</cell><cell>False</cell><cell>BM25</cell><cell>[1,1]</cell><cell>True</cell></row><row><cell>22</cell><cell>0.8323</cell><cell>50</cell><cell>True</cell><cell>kueristopv2</cell><cell>False</cell><cell>Porter</cell><cell>BM25</cell><cell>[1,1]</cell><cell>True</cell></row><row><cell cols="10">Table 2: NDCG@5 and setup for rrf runs</cell></row><row><cell>#</cell><cell cols="4"># fused</cell><cell cols="3">NDCG@5</cell><cell cols="2">Reranking</cell></row><row><cell>23</cell><cell cols="4">10,14,15,16,17</cell><cell cols="3">0.7521</cell><cell cols="2">False</cell></row><row><cell>24</cell><cell cols="4">10,14,15,16,17</cell><cell cols="3">0.7450</cell><cell cols="2">True</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 3</head><label>3</label><figDesc>Model and CLEF's qrels scores</figDesc><table><row><cell>#</cell><cell>selection NDCG@5</cell><cell>Final NDCG@5</cell></row><row><cell>1</cell><cell>0.3830</cell><cell>0.3828</cell></row><row><cell>2</cell><cell>0.3756</cell><cell>0.3688</cell></row><row><cell>3</cell><cell>0.3313</cell><cell>0.2937</cell></row><row><cell>4</cell><cell>0.4140</cell><cell>0.4497</cell></row><row><cell>5</cell><cell>0.4258</cell><cell>0.4461</cell></row><row><cell>6</cell><cell>0.4366</cell><cell>0.4746</cell></row><row><cell>7</cell><cell>0.4548</cell><cell>0.4896</cell></row><row><cell>8</cell><cell>0.4015</cell><cell>0.4376</cell></row><row><cell>9</cell><cell>0.4759</cell><cell>0.5226</cell></row><row><cell>10</cell><cell>0.4823</cell><cell>0.5042</cell></row><row><cell>11</cell><cell>0.2634</cell><cell>0.2289</cell></row><row><cell>12</cell><cell>0.3654</cell><cell>0.4088</cell></row><row><cell>13</cell><cell>0.4525</cell><cell>0.4535</cell></row><row><cell>14</cell><cell>0.4674</cell><cell>0.4939</cell></row><row><cell>15</cell><cell>0.4873</cell><cell>0.5466</cell></row><row><cell>16</cell><cell>0.8549</cell><cell>0.6098</cell></row><row><cell>17</cell><cell>0.8552</cell><cell>0.6036</cell></row><row><cell>18</cell><cell>0.5867</cell><cell>0.5812</cell></row><row><cell>19</cell><cell>0.5392</cell><cell>0.5772</cell></row><row><cell>20</cell><cell>0.5714</cell><cell>0.6362</cell></row><row><cell>21</cell><cell>0.8606</cell><cell>0.6954</cell></row><row><cell>22</cell><cell>0.8323</cell><cell>0.6669</cell></row><row><cell>23</cell><cell>0.7521</cell><cell>0.6681</cell></row><row><cell>24</cell><cell>0.7450</cell><cell>0.7089</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://en.didattica.unipd.it/off/2021/LM/IN/IN2547/004PD/INQ0091599/N0</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">This collection was provided by Team Princess Knight, which participated in Touché; the corpus can be found at: https://www.tira.io/t/expanded-passages-for-the-touche-22-task-2-argument-retrieval-for-comparative-questions/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://bitbucket.org/upd-dei-stud-prj/seupd2122-kueri/src/master/</note>
		</body>
		<back>

			<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>N. Ferro, http://www.dei.unipd.it/~ferro/, ORCID: 0000-0001-9219-6239</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">J</forename><surname>Rocchio</surname></persName>
		</author>
		<ptr target="http://sigir.org/files/museum/pub-08/XXIII-1.pdf" />
		<title level="m">Relevance feedback in information retrieval</title>
				<imprint>
			<date type="published" when="1965">1965</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">V</forename><surname>Cormack</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L A</forename><surname>Clarke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Büttcher</surname></persName>
		</author>
		<ptr target="https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf" />
		<title level="m">Reciprocal rank fusion outperforms condorcet and individual rank learning methods</title>
				<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<ptr target="https://early-access-program.debater.res.ibm.com/academic_use" />
		<title level="m">Project Debater for academic use</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Document expansion by query prediction</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">F</forename><surname>Nogueira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</author>
		<idno>CoRR abs/1904.08375</idno>
		<ptr target="http://arxiv.org/abs/1904.08375" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Cumulated gain-based evaluation of IR techniques</title>
		<author>
			<persName><forename type="first">K</forename><surname>Järvelin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kekäläinen</surname></persName>
		</author>
		<idno type="DOI">10.1145/582415.582418</idno>
		<ptr target="http://doi.acm.org/10.1145/582415.582418" />
	</analytic>
	<monogr>
		<title level="j">ACM Trans. Inf. Syst</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="page" from="422" to="446" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Overview of Touché 2021: Argument Retrieval</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bondarenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Gienapp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fröbe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Beloucif</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ajjour</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Panchenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Biemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wachsmuth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hagen</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-85251-1_28</idno>
		<ptr target="https://link.springer.com/chapter/10.1007/978-3-030-85251-1_28" />
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction. 12th International Conference of the CLEF Association (CLEF 2021)</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">K</forename><surname>Candan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Ionescu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Müller</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Joly</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Maistro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Piroi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin Heidelberg New York</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">12880</biblScope>
			<biblScope unit="page" from="450" to="467" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Overview of Touché 2022: Argument Retrieval</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bondarenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fröbe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kiesel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Syed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Gurcke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Beloucif</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Panchenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Biemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wachsmuth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hagen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction. 13th International Conference of the CLEF Association (CLEF 2022)</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<meeting><address><addrLine>Berlin Heidelberg New York</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note>to appear</note>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<ptr target="https://www.itl.nist.gov/div898/handbook/prc/section4/prc471.htm" />
		<title level="m">Tukey&apos;s method</title>
				<imprint/>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
