<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Document retrieval task on controversial topic with Re-Ranking approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Cassetta</string-name>
          <email>andrea.cassetta@studenti.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Piva</string-name>
          <email>alberto.piva.8@studenti.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enrico Vicentini</string-name>
          <email>enrico.vicentini.1@studenti.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLEF 2021 - Conference and Labs of the Evaluation Forum</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper reports the work done by the Shanks team (a group of University of Padua students) for the Argument Retrieval CLEF 2021 Touché Task 1. The task focuses on the problem of retrieving relevant arguments for a given controversial topic from a focused crawl of online debate portals. After some tests, we decided to parse the input documents keeping only the title, the premises and the conclusion of the arguments (as well as the stance, which is necessary to understand the point of view of the argument's author). After indexing the documents, the work concentrates on how retrieval and ranking are performed. Our experiments show that the best results are obtained using a WordNet [1] based query expansion approach and a re-ranking process with two different similarity functions. This report describes in detail how document parsing works and how the indexing and searching parts are developed. An unexpected update of the qrels file did not allow us to re-run all the tests; in the end, however, we also report the results of the runs obtained from parameter tuning on the new qrels.</p>
      </abstract>
      <kwd-group>
        <kwd>Argument Retrieval CLEF 2021 Touché Task 1</kwd>
        <kwd>WordNet synonyms</kwd>
        <kwd>Re-ranking</kwd>
        <kwd>BM25</kwd>
        <kwd>DirichletLM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The rest of this paper is organized as follows: Section 2 reviews related work; Section 3 describes our initial attempts; Section 4 presents our methodology; Section 5 explains our experimental setup, including the software, tools and methods used; Section 6 discusses the results; Section 7 reports a statistical analysis; Section 8 presents a failure analysis; finally, Section 9 draws some conclusions and outlines future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        To create our search engine we built upon some source code that Professor Nicola Ferro
wrote as a set of toy examples, and we changed it as described in the subsequent sections. We have
also read the overview of the Touché task at CLEF 2020 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        After some research, and as suggested by Professor Ferro, we discovered a different way of
re-ranking and merging the results, described by Fox and Shaw in the paper Combination of Multiple Searches [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. They describe an interesting method to increase the performance of a retrieval
system by combining the similarity values from different output runs obtained with Boolean retrieval
methods. The paper also describes how the indexing (and analyzing)
part was done, but we decided to skip that part because the starting dataset is different, and we had
already analyzed our dataset to build the best index possible. To understand
how the result merging is done it is not necessary to analyze how the queries are written,
so we only report that the P-norm queries are written as complex boolean
expressions using AND and OR operators. When all the runs are done, the second part of the
experiment consists in combining the output runs (obviously obtained from the same collection
of data) to reach the best result. Different ways of combining them are, for example, taking
the top N documents retrieved for each run, or modifying the value of N for each run based on
the eleven-point average precision of that run. In TREC-2, their experiments concentrated on
methods of combining runs based on the similarity values of a document to each query for each
of the runs. After some tests, the best choice is to weight each of the separate runs equally,
without favoring any individual run or method; sometimes, however, some runs have to be weighted more
or less depending on their performance. This way of merging the runs helps the retrieval
system make a trade-off between the runs’ errors. During the tests, six
different ways to combine the runs were considered (the last three are sketched in code after this paragraph):
• CombMIN: takes the minimum similarity value; it is used to minimize the probability that a non-relevant document is
highly ranked;
• CombMAX: takes the maximum similarity value; it is used to minimize the number of relevant documents being poorly
ranked;
• CombMED: takes the median similarity value (to mitigate the problems of the previous two methods)
instead of taking a value from only one run;
• CombSUM: takes the sum of the set of similarity values;
• CombANZ: takes the average of the non-zero similarity values, so it ignores
the runs that fail to retrieve a relevant document;
• CombMNZ: gives higher weight to documents retrieved by multiple
retrieval methods.
We have to point out that the first two methods have a specific objective, but they do not care
about the possible problems they can generate on the other retrieved documents. During
the tests, CombMIN performed worse than all the single runs; on the contrary, the CombANZ and
CombMNZ methods performed better than the individual runs, possibly
because they produce the same ranked sequence for the documents retrieved by all five
individual runs.
      </p>
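      <p>As an illustration of how the last three combination functions can be computed, the following Java sketch merges the similarity scores that a single document obtained in several runs. It is only a hypothetical example of the technique, not the code of Fox and Shaw nor of our system.</p>
      <preformat><![CDATA[
import java.util.List;

/** Illustrative fusion helpers: combine the scores one document received in several runs. */
public final class ScoreFusion {

    /** CombSUM: sum of all the similarity values. */
    public static double combSum(List<Double> scores) {
        return scores.stream().mapToDouble(Double::doubleValue).sum();
    }

    /** CombANZ: average of the non-zero similarity values (runs that missed the document are ignored). */
    public static double combAnz(List<Double> scores) {
        long nonZero = scores.stream().filter(s -> s > 0.0).count();
        return nonZero == 0 ? 0.0 : combSum(scores) / nonZero;
    }

    /** CombMNZ: CombSUM multiplied by the number of runs that retrieved the document. */
    public static double combMnz(List<Double> scores) {
        long nonZero = scores.stream().filter(s -> s > 0.0).count();
        return combSum(scores) * nonZero;
    }
}
]]></preformat>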
    </sec>
    <sec id="sec-3">
      <title>3. Initial Attempts</title>
      <sec id="sec-3-1">
        <title>3.1. Parsing Documents</title>
        <p>Before going into the details of our final solution, it is useful to describe the previous approaches
we took into account to solve the problem and why we chose not to explore them further.
Multiple parsers have been developed to parse the documents from the provided collection.
The most trivial parser, called P1, extracts the sourceText and discussionTitle elements from the
corpus documents. The second parser, P2, extracts the elements related to the conclusion, the
premise, the discussion title, and all the text from the sourceText field in between the premise
and the conclusion. The third parser, P0, which in the end we decided to use for our more
advanced experiments, extracts only the discussion title, the premise and the conclusion of
each document. Table 1 shows how the index statistics are affected by each of these parsers.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Query Expansion</title>
        <sec id="sec-3-2-0">
          <title>3.2.1. OpenAI GPT-2</title>
          <p>While developing the software that creates the index, we thought carefully about how
the tokens resulting from the analysis of the topic query could be used.
In the attempt to expand the queries we came across the OpenAI GPT-2 model, a Machine
Learning model that generates synthetic text samples from an arbitrary input.</p>
          <p>The idea was to use this powerful model to perform query expansion: giving the
topic title as input, it would generate a more complete phrase with, hopefully, new words that could help the
searching phase. Unfortunately the output of GPT-2 is not always what we expect. For example,
if we give it the tokenized query title as input, which may consist of only two words, the output
is a dialogue that is not very useful for our task. Another problem is the structure of the queries: since
they are all questions, the GPT-2 model generates an answer for them, which again is
not what we were interested in. The problem persists even if we remove the question mark at
the end of the phrase; the queries still have a question structure. For these reasons, we
decided to set this kind of approach aside.</p>
        </sec>
        <sec id="sec-3-2-1">
          <title>3.2.2. Randomly Weighted Synonyms</title>
          <p>An approach initially devised for query expansion, but which we later decided not to explore
further, was to generate multiple queries for the same topic, each with randomly generated
synonym boost values. For each query, a ranking of 1000 documents was then generated, and
finally all the rankings were merged into one. The first performances obtained by this method
did not encourage us to proceed with its development, because there were many possible paths
to follow from that point and the search time increased considerably.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Minimum body length</title>
        <p>During the process of document exploration, useful to detect which fields are needed and
present in each collection, we noticed that in some documents the sourceText field
consisted of useless text without a single piece of relevant information about its topic. In order to avoid
this kind of document ending up in the inverted file, we tried to include, during the
indexing process, a check on the length of the ParsedDocument’s body. If this field, which we
consider as the union of the conclusion and premise fields, is made up of fewer than a
certain number of tokens (a recurrent trait that the parsing phase highlights for such instances),
we do not consider the document in the indexing phase.</p>
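        <p>A minimal sketch of this check, assuming a naive whitespace token count and an illustrative threshold of 10 tokens (the helper names are not our actual code), could look as follows:</p>
        <preformat><![CDATA[
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Hypothetical "min body length" filter applied at indexing time.
final class MinBodyLengthFilter {

    static final int MIN_BODY_LENGTH = 10; // we experimented with 5, 10 and 15 tokens

    /** Adds the document only if its body (premises + conclusion) has enough tokens. */
    static void indexIfLongEnough(String body, Document luceneDoc, IndexWriter writer) throws IOException {
        if (tokenCount(body) >= MIN_BODY_LENGTH) {
            writer.addDocument(luceneDoc);
        }
    }

    private static int tokenCount(String text) {
        return text.isBlank() ? 0 : text.trim().split("\\s+").length; // naive whitespace count
    }
}
]]></preformat>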
        <p>After doing some tests with different values for the "min body length" (5, 10, 15), we
compared the results of this kind of solution with the results obtained without it and we
discovered that, taking "min body length" equal to 10 as an example, the number of retrieved
documents goes from 48781 to 48764 and the number of relevant documents goes from
1263 to 1257 (the other evaluation measures are not much affected by this change).
Considering this result, we decided not to use this kind of document pre-processing, to
avoid discarding documents that could be considered relevant in the qrels file (and
so for a user) even if they do not seem so.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Re-ranking with discussion ID</title>
        <p>One of our primary goals is to improve the final ranking of the documents. With this in
mind, we tried to improve performance by using re-ranking.</p>
        <p>As a first analysis we observed that, in the documents of the dataset, the posts related to the
same discussion share the same first part of the document ID. Based on the assumption that only
posts from certain discussions are relevant to a query, we tried to index those posts as a single
document, finally obtaining an index with a discussion-based clustering of documents. In
the searching phase, we first searched the query in the normal index, retrieving the classic
ranking of single documents. Secondly, we searched the same query in the second index of
discussion clusters, thus obtaining a ranking of discussions. Finally, the scores of the documents in
the first ranking were increased based on the rank of their respective discussion in the second
ranking. Unfortunately this approach did not provide the desired results, because it assumes
that all posts related to a discussion have the same relevance to the searched topic. In fact, it
turned out that some posts do not contain any useful information to argue the searched topic
but are part of a discussion that is really relevant.</p>
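        <p>A minimal sketch of the boosting step we experimented with, assuming that the discussion identifier is the prefix of the document ID before a separator and that the boost formula is purely illustrative:</p>
        <preformat><![CDATA[
import java.util.Map;
import org.apache.lucene.search.ScoreDoc;

// Sketch of the (later abandoned) discussion-based boost: the score of each document is
// increased according to the rank of its discussion in the second, discussion-level ranking.
final class DiscussionBoost {

    static void boost(ScoreDoc[] documentRanking, Map<String, Integer> discussionRank,
                      Map<Integer, String> luceneDocToDocumentId) {
        for (ScoreDoc sd : documentRanking) {
            String discussion = discussionIdOf(luceneDocToDocumentId.get(sd.doc));
            Integer rank = discussionRank.get(discussion);
            if (rank != null) {
                sd.score += 1.0f / (rank + 1); // higher-ranked discussions give a larger boost
            }
        }
    }

    /** Posts of the same discussion share the first part of the document ID (separator assumed). */
    private static String discussionIdOf(String documentId) {
        int cut = documentId.indexOf('-');
        return cut > 0 ? documentId.substring(0, cut) : documentId;
    }
}
]]></preformat>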
      </sec>
      <sec id="sec-3-5">
        <title>3.5. OpenNLP attempt</title>
        <p>Exploring new solutions, we have even tried to implement a version of the program that uses
the OpenNLP Machine Learning toolkit in order to see which advantages a tokenization able
to distinguish location, personal nouns and so on could provide for the solution.
Following this path we have encountered an error which requires significant changes in the
workflow of what we had done up to that point. For this reason we have decided not to keep
going on with this branch.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>The goal of this task is to retrieve relevant arguments from online debate portals, given a query
on a controversial topic. We checked the dataset and noticed that it is composed
of five JSON files. We also read some documents and noticed that the main
structure is the same for each one, but with some different fields. To use the documents with
Lucene we had to parse them. To do so, we used the Jackson library and
implemented our parser P0, which takes the premises, the conclusions, the document title and
the stance attribute (pro or con) of the documents.</p>
      <sec id="sec-4-1">
        <title>4.1. Indexing</title>
        <p>We built four different components starting from the Lucene default ones: ArgsParser,
ShanksAnalyzer, DirectoryIndexer and the Searcher. There is also a ShanksTouche class which contains the
main method and allows us to set the parameters of the indexing and analysis parts.
After converting the documents into something that Lucene can work on, we focused
on how the indexer is created, and in particular on how the analyzer module
works. In the tokenization phase we used the StandardTokenizer, the LowerCaseFilter,
the EnglishPossessiveFilter and a StopFilter with a stop-word list.</p>
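        <p>A minimal sketch of an analyzer wired with this chain (the stop-word set would be loaded from our custom stop-list; the class name is illustrative):</p>
        <preformat><![CDATA[
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishPossessiveFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// StandardTokenizer -> LowerCaseFilter -> EnglishPossessiveFilter -> StopFilter
public class SketchAnalyzer extends Analyzer {

    private final CharArraySet stopWords;

    public SketchAnalyzer(CharArraySet stopWords) {
        this.stopWords = stopWords; // e.g. our 1362-word custom stop-list
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        StandardTokenizer source = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(source);
        stream = new EnglishPossessiveFilter(stream);
        stream = new StopFilter(stream, stopWords);
        return new TokenStreamComponents(source, stream);
    }
}
]]></preformat>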
        <p>The arguments of the collection are stored in the index using four fields: ID, Title, Body, and
Stance (pro or con).</p>
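        <p>For illustration, an argument could be mapped to these four fields as in the following sketch (field names and the helper signature are illustrative, not necessarily those of our code):</p>
        <preformat><![CDATA[
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

final class ArgumentMapping {

    /** Builds the Lucene document for one argument. */
    static Document toLuceneDocument(String id, String title, String body, String stance) {
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));         // exact identifier, not analyzed
        doc.add(new TextField("title", title, Field.Store.YES));     // analyzed discussion title
        doc.add(new TextField("body", body, Field.Store.YES));       // premises + conclusion
        doc.add(new StringField("stance", stance, Field.Store.YES)); // "pro" or "con"
        return doc;
    }
}
]]></preformat>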
        <sec id="sec-4-1-1">
          <title>4.1.1. Custom Stop-List</title>
          <p>As you can see in Figure 4 we have compared the baseline with the default stop-list and with
our custom stop-list; after that comparison we have decided to use a custom stop-list to better
achieve the project goal. Our stop-list contains 1362 words that are derived from the merging
of other stop-lists (e.g. smart and lucene) typically used. Our custom stop-list reduces memory
usage by approximately 38% and indexing time by almost 20%.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Searching</title>
        <p>In the searching phase of our program we focused our attention on finding strategies to improve
the general quality of the results, experimenting with different approaches such as query
expansion based on WordNet synonyms, or defining queries that score the fields of the
documents differently. The approach we chose involves the use of BooleanQuery; in this
way it is possible to assign a specific weight (boost) to every term of the query.</p>
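        <p>A minimal sketch of such a boosted query, where the original topic terms and their WordNet synonyms are added as optional clauses with different (purely illustrative) boost values:</p>
        <preformat><![CDATA[
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

final class BoostedQueryExample {

    /** Original terms weigh more than their WordNet synonyms; boost values are illustrative. */
    static Query build(Iterable<String> queryTerms, Iterable<String> synonyms) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        for (String term : queryTerms) {
            builder.add(new BoostQuery(new TermQuery(new Term("body", term)), 2.0f),
                        BooleanClause.Occur.SHOULD);
        }
        for (String synonym : synonyms) {
            builder.add(new BoostQuery(new TermQuery(new Term("body", synonym)), 0.5f),
                        BooleanClause.Occur.SHOULD);
        }
        return builder.build();
    }
}
]]></preformat>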
        <p>Finally, we decided to test both the BM25Similarity and the LMDirichletSimilarity,
as well as their combination through a MultiSimilarity, to compute the document scores.</p>
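        <p>The "MULTI" configuration can be sketched as follows (default constructor parameters are shown; the actual runs use the tuned values):</p>
        <preformat><![CDATA[
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.LMDirichletSimilarity;
import org.apache.lucene.search.similarities.MultiSimilarity;
import org.apache.lucene.search.similarities.Similarity;

final class MultiSimilarityConfig {

    /** Scores are computed by combining BM25 and Dirichlet-smoothed LM through MultiSimilarity. */
    static void apply(IndexSearcher searcher) {
        Similarity multi = new MultiSimilarity(
                new Similarity[] { new BM25Similarity(), new LMDirichletSimilarity() });
        searcher.setSimilarity(multi);
    }
}
]]></preformat>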
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Re-Ranking</title>
        <p>During various experiments we noticed that some similarity measures and sets of
parameters favored precision at the expense of recall, and vice versa. With the purpose of
combining the advantages of both cases, we opted for a re-ranking method which exploits
different similarities and query parameters. Our implementation is made of two steps. In the first
one we use a query able to obtain a higher recall value when searching the index for relevant
documents, whose parameters and similarity were decided based on empirical trials. In the
following step we use a second query, with better performance in terms of nDCG@5, to re-evaluate
the returned documents and re-rank them according to the new score. We call maxRecall
the first query and maxNdcg the second one. According to our implementation, this approach
turns out to be effective on the 2020 topic set, but at the expense of the time spent in the search
phase, which increases quite substantially.</p>
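        <p>A minimal sketch of the two-step procedure, assuming Lucene 8, two pre-built queries and two IndexSearcher instances opened on the same index but configured with the respective similarities (all details are illustrative, not our actual code):</p>
        <preformat><![CDATA[
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

final class ReRankSketch {

    /** Step 1: broad search with maxRecall; step 2: re-score the candidates with maxNdcg. */
    static ScoreDoc[] reRank(IndexSearcher recallSearcher, Query maxRecall,
                             IndexSearcher ndcgSearcher, Query maxNdcg, int n) throws IOException {
        TopDocs candidates = recallSearcher.search(maxRecall, n);
        ScoreDoc[] reRanked = candidates.scoreDocs.clone();
        for (ScoreDoc sd : reRanked) {
            // explain() returns the score the second configuration assigns to this exact document
            sd.score = ndcgSearcher.explain(maxNdcg, sd.doc).getValue().floatValue();
        }
        Arrays.sort(reRanked, Comparator.comparingDouble((ScoreDoc d) -> d.score).reversed());
        return reRanked;
    }
}
]]></preformat>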
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Setup</title>
      <p>Our work is based on the following experimental setup:
• Repository: https://bitbucket.org/upd-dei-stud-prj/seupd2021-goldr;
• During development and experimentation we used our own computers, and in the end we ran our code on the TIRA platform;
• Similarity functions: BM25, LMDirichlet, and a "MULTI" configuration combining both;
• Apache Maven, Lucene;
• Java JDK version 11;
• Version control system: git;
• Evaluation tool [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        The collection is a set of 387,740 arguments crawled from debatewise.org, idebate.org,
debatepedia.org and debate.org, plus 48 arguments from Canadian parliament discussions. We used the
50 topics of Touché 2020 Task 1 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to train and refine our search engine.
Furthermore, we developed the source code collaboratively through the Bitbucket platform.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Results and Discussion</title>
      <p>In this section we provide graphical and numerical results about the experiments we conducted
during the development of the project. We also discuss these results to derive some useful
insights.</p>
      <sec id="sec-6-1">
        <title>6.1. Our Baseline</title>
        <p>In order to be able to track performance progress during development, we created our own
baseline. For each of the three parsers P0, P1 and P2 we produced a run using a simple analyzer
consisting of a StandardTokenizer, a LowerCaseFilter and a StopFilter where the stop-list used is the
standard one offered by Lucene. The similarity used is BM25. Figure 1 compares the three
baselines. From these results we decided to develop our approach based on the P0 parser. Figures
2 and 3 show how P0 compares to the other parsers when evaluating performance per topic. The
choice of P0 was motivated not only by the statistics in Table 1 but also by its overall efficacy
as shown in the following figures. Figure 1 shows how differently the three approaches can
behave. P0, which extracts only the discussion title, the premise and the conclusion of each
document (with its stance), shows better performance with respect to the other two, retrieving
more relevant documents across the entire run. We chose to discard P1 because its
performance was inferior to the other two, as shown in Figure 2. To better understand the
behavior of P2 versus P0, Figure 3 compares the per-topic average precision. Approximately
10% of the topics, in particular topics 3, 23, 34 and 43 to 46, tend to give better results with parser
P2.</p>
        <sec id="sec-6-1-1">
          <title>6.1.1. Baseline with the custom stop-list</title>
          <p>The results obtained with our custom stop-list are comparable to those obtained with the
stop-list offered by Lucene.</p>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Parameters Tuning</title>
        <p>Our ultimate goal is to maximize the average value of nDCG@5 across all topics. To find the
optimal parameter values, we performed extensive iterative tests, trying all combinations of
parameters with values belonging to discrete intervals we defined. The same experiments were
conducted using three different similarities: BM25Similarity (BM25), LMDirichletSimilarity
(LMD) and MultiSimilarity (MULTI). The MultiSimilarity we used combines BM25Similarity
and LMDirichletSimilarity. The method we developed is governed by five parameters; when
using re-ranking they become ten. Out of those ten, five parameters concern the query that searches
the index aiming to maximize the overall recall, while the other five affect the re-ranking
process and are chosen to maximize nDCG@5.</p>
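        <p>A sketch of this exhaustive search is given below; evaluateNdcg5() is only a placeholder standing for "run the training topics with these settings and average nDCG@5 against the qrels", and every name in the sketch is illustrative.</p>
        <preformat><![CDATA[
import java.util.List;
import org.apache.lucene.search.similarities.Similarity;

// Hypothetical grid search over similarities and boost values.
final class ParameterTuning {

    static void tune(List<Similarity> similarities, double[] tBoosts, double[] sBoosts, double[] pBoosts) {
        double best = Double.NEGATIVE_INFINITY;
        for (Similarity sim : similarities)
            for (double t : tBoosts)
                for (double s : sBoosts)
                    for (double p : pBoosts) {
                        double ndcg5 = evaluateNdcg5(sim, t, s, p);
                        if (ndcg5 > best) {
                            best = ndcg5; // remember sim, t, s and p as the current optimum
                        }
                    }
    }

    private static double evaluateNdcg5(Similarity sim, double t, double s, double p) {
        return 0.0; // placeholder: search the 50 training topics and average nDCG@5
    }
}
]]></preformat>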
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Optimal parameters</title>
        <p>The optimal sets of parameters we found for the maxnDCG query and for the maxRecall query
are available in Tables 2 and 3. The best similarity measure to maximize nDCG@5, according to
our empirical tests, is LMD. To maximize recall, on the other hand, the best measure is MULTI.
As can be seen in Table 2, the best value of tBoost for maxnDCG is 0. This allowed us to come
to the conclusion that considering the title of the discussion (which is the same for all posts in
it) can be misleading with respect to the relevance of the content of each post that is part of it.
The title, however, is very useful to obtain higher recall. This intuition is what pushed us to
abandon the method of re-ranking based on discussion ID, which is described in Section 3.4.</p>
      </sec>
      <sec id="sec-6-4">
        <title>6.4. maxnDCG and maxRecall</title>
        <p>In this section we compare the two queries maxnDCG and maxRecall with the optimal
parameter values established by the tests. From the graph in Figure 5 we can see that the
precision is significantly higher for maxnDCG, while maxRecall has a better recall over the whole
ranking. From these data we believe we obtained the desired result from the two queries.
In Figures 6, 7 and 8 we compare the Average Precision per topic for the three test runs. We can
notice that there is not much difference between maxRecall and P0. The same is not true for
maxnDCG, which proves to be better than P0 in almost all topics. This scenario is further
confirmed by Figure 8, which shows the dominance of maxnDCG over P0. The
performance can be further compared with the numerical results reported in Table 4. From these
data we can appreciate the actual trade-off between the two approaches. The advantage in terms
of recall for maxRecall is significant compared to maxnDCG; the same advantage applies in
the opposite direction for precision and nDCG.</p>
      </sec>
      <sec id="sec-6-5">
        <title>6.5. Re-Ranking Results</title>
        <p>Here we compare the Re-Ranking approach described in Section 4.3 with the previous ones. Figure 9
shows how re-ranking based on maxNdcg (in red) greatly improves the interpolated precision
of maxRecall (in green). It is also possible to note that it allows for better performance with
respect to maxNdcg alone (in orange). Figure 10 shows the Average Precision value obtained
by re-ranking for each topic. From it, it is possible to identify the most problematic topics, for
which the method developed is not very effective. In particular, the most critical topics are:
{2, 8, 22, 40, 44}. From Figure 11 we can see that the Re-Ranking method improves on almost
every topic with respect to the baseline. In Figure 12 it can be seen that the Re-Ranking method improves
on maxNdcg on many topics, except for topics 10 and 20. Figure 13, as expected, clearly shows
the better performance achieved by Re-Ranking with respect to maxRecall; only the first topic is penalized
by the re-ranking process.</p>
      </sec>
      <sec id="sec-6-6">
        <title>6.6. RUN Submission</title>
        <p>The five runs we decided to submit are the following:
• run-1: Re-Ranking approach;
• run-2: like run-1, but proximity searches are performed only with pairs of subsequent tokens;
• run-3: maxnDCG query with LMDirichletSimilarity;
• run-4: maxnDCG query with MultiSimilarity;
• run-5: maxRecall query with MultiSimilarity.
After our experiments, a new corrected version of the qrels file for the 2020 topics was released:
all the previous results are based on the incorrect version. Since we only learned about the corrected
version when the deadline was approaching, we could not recreate all the graphs and comparisons in
Section 6. However, we managed to find the new optimal parameter sets and to repeat the five runs.</p>
        <p>The new data is provided in Tables 6 and 7.</p>
        <sec id="sec-6-6-1">
          <title>Query Similarity tBoost sBoost pBoost</title>
          <p>maxRecall MULTI 0,3 0,2 0,75
maxnDCG LMD 0,15 0,05 0,75</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Statistical Analysis</title>
      <p>Here we analyze our models with some important statistics, evaluating them in a deeper way
via hypothesis testing. We used ANOVA, Student's t-test and boxplots. All the analyses
focused on the mean of two key metrics, Average Precision and nDCG@5. We analyzed all 5
different retrieval models, the ones described in Section 6.6.</p>
      <p>A boxplot gives a visual representation of the location, symmetry and dispersion of the data, and of the
presence of outliers (points that escape the construction of the boxplot).</p>
      <p>It can be appreciated that run1 and run3 have an almost identical structure in terms of median,
interquartile range (IQR) and whisker length. On the other hand, run4 and run5 exhibit lower performance
on this metric because those runs were tuned to maximize recall or nDCG@5; their bottom
whiskers are in fact closer to zero, meaning that for some topics the system did not retrieve
enough relevant documents. All the runs have outliers: the points above the whiskers,
represented by circles, are topics which perform better than the others, or worse if they are below
the boxplot. Run1 and run3 are skewed to the right and show less variance compared to the other
systems.</p>
      <p>Looking at the boxplot of nDCG@5 reveals the same behavior seen before: run1 and run3
produce higher scores compared to the other runs and they seem identical in performance. Run3
has better results with respect to run4, so we can conclude that using different similarities changes
the results dramatically. Run2 shows a smaller IQR than the others, in particular compared to run1, which
shares the same architecture with the only difference being the proximity parameter. We can say
that run1 is able to achieve higher scores by exploiting the more flexible proximity parameter.</p>
      <p>Further analysis with ANOVA and the t-test will help to understand the possible similarities, or lack
thereof, between the systems.</p>
      <sec id="sec-7-1">
        <title>7.1. Hypothesis testing</title>
        <p>The first tool that we utilize is ANOVA (Analysis of Variance), a statistical test of whether or not
the means of several groups are equal. H0, the null hypothesis that all the means are equal, is
tested against the possibility of rejecting or not rejecting it. As we can see in Figure 16, there are
multiple factors to be taken into account to perform a correct analysis of the F-statistic. The
system sum of squares SS_system and the error sum of squares SS_error are divided by their degrees of freedom to
obtain the mean squares (MS). The F-statistic (F) is equal to the ratio of MS_System to MS_Error.
Having a p-value = 0.1849, we cannot reject H0; to do that we should have had a value lower
than α. The significance level α is set at 0.05.</p>
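        <p>For reference, these are the standard one-way ANOVA relations behind the quantities named above (our notation):
          <disp-formula>
            <tex-math><![CDATA[MS_{system}=\frac{SS_{system}}{df_{system}},\qquad MS_{error}=\frac{SS_{error}}{df_{error}},\qquad F=\frac{MS_{system}}{MS_{error}}]]></tex-math>
          </disp-formula>
        </p>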
        <p>We could conclude that our systems are statistically similar in mean Average Precision, but
performing the two-way ANOVA (ANOVA2) test the situation is reversed.</p>
        <p>
          As can be seen from Figure 18 and Figure 19, the null hypothesis can be rejected because
the runs have statistically significant differences. The topic effect, which can be read in cell [3,2]
of Figure 17, is able to express much more variance, which means that this variability is greater
than that of the systems, as we can expect from topics.
        </p>
        <p>Run 3 vs run 4: as we pointed out previously, the use of the LMDirichlet similarity improves the
results. For this pair the null hypothesis can be rejected. As we will see in the nDCG@5
analysis, run 4 and run 5 are different from the others, and the p-value of their statistic suggests
that the two runs are equivalent in mean. Instead, for the others (run 1, run 3, run 2) we cannot
reject H0.</p>
        <p>Unfortunately we have not been able to simplify the reading of the images. Here is the
mapping between the numbers on the axes and the actual runs:
1 - 2 - 3 - 4 - 5 (image numbers) correspond to run1 - run3 - run2 - run5 - run4 (true values). This applies
to Figures 18, 19, 21 and 23.</p>
        <p>Moving the study to nDCG@5, in the ANOVA table of Figure 20 the p-value is lower than the
confidence level, so we can reject the null hypothesis and claim that there is at least one mean
among the various runs that differs significantly from the others. This table does not tell us
which systems are different on average; it just tells us that there is at least one.</p>
        <p>The Tukey Honestly Significant Difference (HSD) test, as in the AP case, answers that question by
creating confidence intervals for all pairwise differences between the systems we want to
compare, while controlling the family-wise error rate; otherwise, the probability of a Type I error
would be inflated.</p>
        <p>Run4 and run5, Figure 22, are statistically different from the others, while between them we cannot
reject the null hypothesis. It seems that adopting the maxRecall or maxnDCG query approach
does not by itself yield performance benefits, except through different similarities, as in the case of run3, or
through re-ranking (run1 and run2). In accordance with what the boxplot chart suggested for run3 and run4, we
can reject H0, since the p-value is below α = 0.05; the effect of the different similarity is visible.</p>
        <p>We can conclude that our re-ranking approach brings a significant improvement over a standard
technique, even if run3 returns comparable results in this analysis. In the future this
behavior can be investigated more carefully.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Failure analysis</title>
      <p>Looking at the results obtained by running the trec_eval program on our runs, we can see the
performance of our information retrieval systems. In particular, we now focus on finding and
understanding for which topics the systems fail to achieve good performance. Therefore we
decided to apply this kind of evaluation to our best run (shanks-run 1). Looking at it, and using
the map field as the reference parameter for the following analysis, we discovered that the top (map)
performance comes from topics 42, 43 and 1 and the worst from topics 22, 12 and 44. Searching
for the reason why, we understood that the main weaknesses of the process are the lack of an
argument quality evaluation phase and of a more consistent lexical analysis process.</p>
      <p>As examples, we take and compare topics 1, 12 and 44.</p>
      <p>Topic 1: Should teachers get tenure?</p>
      <p>Narrative: highly relevant arguments make a clear statement about tenure for teachers in
schools or universities. Relevant arguments consider tenure more generally, not specifically
for teachers, or, instead of talking about tenure, consider the situation of teachers’ financial
independence.</p>
      <p>The process of stop-wording applied to the topic leads to the parsed query
“teachers tenure”, which, even without the whole phrase construction, explains very well what we are
searching for. The documents about this topic are retrieved well by our system; in fact, if we
look for example at the first 5 positions (the most relevant ones for a user), we can see from
the comparison with the qrels that all of them are judged "highly relevant".</p>
      <p>Topic 12: Should birth control pills be available over the counter?</p>
      <p>Narrative: highly relevant arguments argue for or against the availability of birth control
pills without prescription. Relevant arguments argue with regard to birth control pills and
their side efects only. Arguments only arguing for or against birth control are irrelevant.</p>
      <p>The process of stop-wording applied to the topic leads to the parsed query “birth
control pills available”. This seems a quite explicative phrase, but in the retrieval phase the system
has trouble finding the proper relevant results. The critical issue with this query is the
distinction between high and low quality arguments. The lack of an effective argument quality
process makes the program fail in the document quality evaluation, and as a consequence the
system is unable to put the appropriate documents in the right ranking positions. We can notice this
aspect even by looking at the other topic measures (not only map): if we look for example
at the growth of the recall measure, we can notice that it is mainly concentrated in the tail of the
ranking (when many documents have been retrieved).</p>
      <p>Topic 44: Should election day be a national holiday?</p>
      <p>Narrative: highly relevant arguments explain why making election day a holiday is a good
idea, or why not. Relevant arguments mention the fact or its remedy as one of the problems
that elections have.</p>
      <p>The process of stop-wording applied to the topic leads to the parsed query “election
national holiday”. The results obtained from this kind of search are quite bad: the
system retrieves as relevant some discussions like "Potato day should be a national holiday" or
"Star Wars day should be a national holiday". These mismatches come from the fact that the
system does not recognize "election day" as a single mandatory query term, and so documents
about similar topics that differ only in a few words are wrongly retrieved.</p>
      <p>To overcome these kinds of issues it could be useful to equip the system with an argument
quality evaluator, a sort of word-pattern recognizer that catches the words which must not
be separated, and a way to identify structures like "subject, predicate, object" that assigns higher
weights to the subject and marks it as a mandatory word in the documents to be retrieved.</p>
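      <p>The following hypothetical sketch (not something our submitted system does) shows the mandatory-phrase idea for topic 44: "election day" is kept together as a required clause, so discussions about other "... day" holidays no longer match.</p>
      <preformat><![CDATA[
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;

final class MandatoryPhraseExample {

    /** "election day" must appear as a phrase; the rest of the query remains optional. */
    static Query electionDayQuery(Query restOfTheQuery) {
        return new BooleanQuery.Builder()
                .add(new PhraseQuery("body", "election", "day"), BooleanClause.Occur.MUST)
                .add(restOfTheQuery, BooleanClause.Occur.SHOULD)
                .build();
    }
}
]]></preformat>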
    </sec>
    <sec id="sec-9">
      <title>9. Conclusions and Future Work</title>
      <p>At the end of this experiment we found that the changes we made led to a fairly
good improvement of our information retrieval system with respect to the starting baseline [Table 4].
All the statistics underwent a significant increase; as an example, the
improvements are as follows: MAP +39.3%, P5 +68.8%, nDCG@5 +91.4%.</p>
      <p>Comparing our results with last year's, we noticed that they follow the same trend
for the nDCG@5 measure and are in line with those of last year, even though the applied
strategies seem to be fairly different.</p>
      <p>The whole process we followed highlighted a lot of possible extensions that could be
implemented and explored with more time. As an example, it could be interesting to apply
some sort of location-word detection, personal-noun recognition and compound-word
discernment, and to improve or even add a new ranking phase that allows the
insertion of some kind of argument quality analysis. The possibility of implementing a different
approach based on some kind of machine learning model remains open. As we stated in the
initial attempts of Section 3, we dropped the idea of using GPT-2 because it returned unsatisfying
results, so in the future we could go into more detail with this modern tool and improve the
performance of our solution in this way.</p>
      <p>
        Another possible approach that could be exploited derives from the paper [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. It can be
interesting because it takes different runs as input and combines them to reach the best result.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Fellbaum</surname>
          </string-name>
          ,
          <source>WordNet: An Electronic Lexical Database, Bradford Books</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ajjour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wachsmuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <source>args.me corpus</source>
          ,
          <year>2020</year>
          . URL: https://doi.org/10.5281/zenodo.3734893. doi:10.5281/zenodo.3734893.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondarenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Beloucif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gienapp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ajjour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Biemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wachsmuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          , Overview of Touché 2020:
          <article-title>Argument Retrieval</article-title>
          ,
          <source>CEUR-WS</source>
          <volume>2696</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Fox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shaw</surname>
          </string-name>
          , Combination of Multiple Searches, in: D. K. Harman (Ed.),
          <source>The Second Text REtrieval Conference (TREC-2)</source>
          ,
          <source>National Institute of Standards and Technology (NIST)</source>
          ,
          <source>Special Publication 500-215</source>
          , Washington, USA,
          <year>1993</year>
          , pp.
          <fpage>243</fpage>
          -
          <lpage>252</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Soboroff</surname>
          </string-name>
          ,
          <article-title>Overview of the TREC 2009 Web Track</article-title>
          , in: E. M.
          <string-name>
            <surname>Voorhees</surname>
            ,
            <given-names>L. P.</given-names>
          </string-name>
          Buckland (Eds.),
          <source>The Eighteenth Text REtrieval Conference Proceedings (TREC</source>
          <year>2009</year>
          ),
          <article-title>National Institute of Standards and Technology (NIST</article-title>
          ),
          <source>Special Publication 500-278</source>
          , Washington, USA,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondarenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Beloucif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gienapp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ajjour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Biemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wachsmuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          , Overview of Touché 2020:
          <article-title>Argument Retrieval</article-title>
          , in: L.
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Eickhoff</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Névéol (Eds.),
          <source>Working Notes Papers of the CLEF 2020 Evaluation Labs</source>
          , volume
          <volume>2696</volume>
          <source>of CEUR Workshop Proceedings</source>
          ,
          <year>2020</year>
          . URL: http://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>2696</volume>
          /.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>