=Paper=
{{Paper
|id=Vol-2936/paper-210
|storemode=property
|title=Document retrieval task on controversial topic with Re-Ranking approach
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-210.pdf
|volume=Vol-2936
|authors=Andrea Cassetta,Alberto Piva,Enrico Vicentini
|dblpUrl=https://dblp.org/rec/conf/clef/CassettaPV21
}}
==Document retrieval task on controversial topic with Re-Ranking approach==
Document retrieval task on controversial topic with Re-Ranking approach
Notebook for the Touché Lab on Argument Retrieval at CLEF 2021

Andrea Cassetta, Alberto Piva and Enrico Vicentini
University of Padua, Italy

Abstract
This paper reports the work done by the Shanks team (students of the University of Padua, Italy) for Task 1 of the Touché Lab on Argument Retrieval at CLEF 2021. The task focuses on retrieving relevant arguments for a given controversial topic from a focused crawl of online debate portals. After some tests, the Shanks team decided to parse the input documents keeping only the title, premises and conclusion of the arguments (together with the stance, which is needed to understand the point of view of the argument's author). Once the documents are indexed, the work concentrates on how retrieval and ranking are performed. Our tests show that the best results are obtained with a WordNet-based [1] query expansion approach and a re-ranking process that uses two different similarity functions. This report describes in detail how document parsing works and how the indexing and searching components were developed. An unexpected update of the qrels file did not allow us to re-run all the tests; at the end of the report, however, we also include the results of the runs obtained from parameter tuning on the new qrels.

Keywords
Argument Retrieval CLEF 2021 Touché Task 1, WordNet synonyms, Re-ranking, BM25, DirichletLM

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania.
"Search Engines", course at the master degree in "Computer Engineering", Department of Information Engineering, University of Padua, Italy. Academic Year 2020/2021.
andrea.cassetta@studenti.unipd.it (A. Cassetta); alberto.piva.8@studenti.unipd.it, ORCID 0000-0003-0242-0749 (A. Piva); enrico.vicentini.1@studenti.unipd.it (E. Vicentini)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

1. Introduction

In this report we describe the project developed by the Shanks team for its participation in CLEF 2021 Touché Task 1. The task focuses on retrieving relevant arguments for a given controversial topic from a focused crawl of online debate portals. Our goal is to develop a Java-based information retrieval system that finds and ranks relevant documents from the args.me corpus, a dataset of over 380,000 arguments crawled from five different debate portals [2], for 50 topics (queries). The retrieved results need to be relevant for each input topic the system has to process.

This paper is structured as follows: Section 3 discusses the solutions we took into consideration while building the retrieval system; Section 4 describes the overall workflow of the program; Section 5 explains our experimental setup, including the software, tools and methods used; Section 6 discusses the results; and finally, Section 9 draws some conclusions and outlines future work.

2. Related Works

To create our search engine we built upon source code provided by Professor Nicola Ferro as a set of toy examples, which we modified as described in the following sections. We have also read the overview of the Touché 2020 task [3].
After some research, and as suggested by Professor Ferro, we discovered a different way of re-ranking and merging results, described in the paper "Combination of Multiple Searches" by Fox and Shaw [4]. The paper presents an interesting method to increase the performance of a system by combining the similarity values of different output runs, including runs produced with Boolean retrieval methods. The paper also describes how the indexing (and analysis) part was done, but we decided to skip that part because the starting dataset is different and we had already analysed our dataset to obtain the best possible index. To understand how the results are merged it is not necessary to analyse how the queries are written, so we only note that the P-norm queries are expressed as complex Boolean expressions using AND and OR operators.

Once all the runs are available, the second part of the experiment consists in combining the output runs (obviously obtained from the same collection of data) to reach the best possible result. Different ways of combining them are, for example, taking the top N documents retrieved by each run, or modifying the value of N for each run based on the eleven-point average precision of that run. In TREC-2, their experiments concentrated on methods that combine runs based on the similarity value of a document to the query in each of the runs. After some tests, the best choice turned out to be weighting the separate runs equally, without favouring any individual run or method, although in some cases runs have to be weighted more or less depending on their performance. This way of merging runs helps the retrieval system trade off the errors of the individual runs. Six different ways of combining the runs were considered in their tests:

• CombMIN: minimizes the probability that a non-relevant document is highly ranked;
• CombMAX: minimizes the number of relevant documents that are poorly ranked;
• CombMED: takes the median similarity value (to mitigate the problems of the two previous methods) instead of taking a value from only one run;
• CombSUM: takes the sum of the set of similarity values;
• CombANZ: takes the average of the non-zero similarity values, thus ignoring the runs that fail to retrieve a relevant document;
• CombMNZ: gives higher weight to documents retrieved by multiple retrieval methods.

Note that the first two methods have a specific objective but do not account for the problems they can cause for the other retrieved documents. In their tests CombMIN performed worse than all the single runs, while CombANZ and CombMNZ performed better than the individual runs, possibly because they produce the same ranked sequence for all the documents retrieved by all five individual runs.
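To make the fusion rules above concrete, the following minimal sketch (in Java, since our system is Java based) implements CombSUM and CombMNZ over per-run score maps. It is not part of our system: the class and method names are ours for illustration, and the scores are assumed to be already normalised per run.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Minimal sketch of two of the fusion rules from Fox and Shaw [4]:
 * CombSUM (sum of the similarity scores of a document across runs) and
 * CombMNZ (CombSUM multiplied by the number of runs that retrieved the
 * document). Each input map goes from a document id to its (normalised)
 * score in one run; documents not retrieved by a run are simply absent.
 */
public final class CombFusion {

    /** CombSUM: sum the scores a document obtains in the individual runs. */
    public static Map<String, Double> combSum(List<Map<String, Double>> runs) {
        Map<String, Double> fused = new HashMap<>();
        for (Map<String, Double> run : runs) {
            run.forEach((docId, score) -> fused.merge(docId, score, Double::sum));
        }
        return fused;
    }

    /** CombMNZ: CombSUM times the number of runs that retrieved the document. */
    public static Map<String, Double> combMnz(List<Map<String, Double>> runs) {
        Map<String, Double> sums = combSum(runs);
        Map<String, Integer> hits = new HashMap<>();
        for (Map<String, Double> run : runs) {
            run.keySet().forEach(docId -> hits.merge(docId, 1, Integer::sum));
        }
        Map<String, Double> fused = new HashMap<>();
        sums.forEach((docId, sum) -> fused.put(docId, sum * hits.get(docId)));
        return fused;
    }

    private CombFusion() { }
}
```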
3. Initial Attempts

Before going into the details of our final solution, it is useful to describe the approaches we previously took into account to solve the problem and why we chose not to explore them further.

3.1. Parsing Documents

Multiple parsers were developed to parse the documents of the provided collection. The most trivial parser, called P1, extracts the sourceText and discussionTitle elements from the corpus documents. The second parser, P2, extracts the elements related to the conclusion, the premise and the discussion title, plus all the text of the sourceText field between the premise and the conclusion. The third parser, P0, which in the end we decided to use for our more advanced experiments, extracts only the discussion title, the premise and the conclusion of each document. Table 1 shows how the index statistics are affected by each of these parsers.

Table 1: Statistics of the three indexes generated using the three implemented document parsers. The analyzer is always the same. Time ratio is obtained by dividing the time taken by each parser by the time taken by the fastest parser.

Parser | Term Count | Storage (MB) | Time (seconds) | Time ratio
P0     | 1078017    | 195          | 99             | 1.00
P1     | 1196289    | 1745         | 507            | 5.12
P2     | 1153517    | 706          | 249            | 2.52

3.2. Query Expansion

While developing the software that creates the index, we thought carefully about how to use the tokens resulting from the analysis of the topic query.

3.2.1. OpenAI GPT-2

In the attempt to expand the queries we came across the OpenAI GPT-2 model, a machine learning model that generates synthetic text samples from an arbitrary input. The idea was to use this powerful model for query expansion: given the topic title as input, it would generate a more complete phrase, hopefully containing new words that could help the searching phase. Unfortunately, the output of GPT-2 is not always what we expect. For example, if we feed it the tokenized query title, which may consist of only two words, the output is a dialogue that is not very useful for our task. Another problem is the structure of the queries: since they are all questions, GPT-2 generates an answer for them, which again is not what we were interested in. The problem persists even if we remove the question mark at the end of the phrase, because the queries still have a question structure. For these reasons, we decided to set this kind of approach aside.

3.2.2. Randomly Weighted Synonyms

An approach initially devised for query expansion, which we later decided not to explore further, was to generate multiple queries for the same topic, each with randomly generated synonym boost values. For each query a ranking of 1000 documents was generated, and finally all the rankings were merged into one. The first performance figures obtained with this method did not encourage us to proceed with its development, because there were many possible paths to follow from that point and the search time increased considerably.

3.3. Minimum body length

During the exploration of the documents, useful to detect which fields are needed and present in each collection, we noticed that in some documents the sourceText field consisted of useless text without a single relevant piece of information about its topic. To prevent this kind of document from ending up in the inverted file, we tried to include, during the indexing process, a check on the length of the ParsedDocument's body. If this field, which we consider as the union of the conclusion and premise fields, is made up of fewer than a certain number of tokens (a recurrent situation that the parsing phase highlights for such instances), the document is not considered in the indexing phase.
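A minimal sketch of such a length check is shown below, assuming the body of a parsed document is available as a plain string; the class and method names are illustrative and do not correspond to our actual code.

```java
/**
 * Illustrative "min body length" check: documents whose body contains fewer
 * than minBodyLength whitespace-separated tokens would be skipped during
 * indexing (e.g. minBodyLength = 10 in the tests reported below).
 */
public final class MinBodyLengthFilter {

    public static boolean hasEnoughTokens(String body, int minBodyLength) {
        if (body == null) {
            return false;
        }
        // trim() avoids counting a spurious empty token for blank bodies
        String trimmed = body.trim();
        if (trimmed.isEmpty()) {
            return false;
        }
        return trimmed.split("\\s+").length >= minBodyLength;
    }

    private MinBodyLengthFilter() { }
}
```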
After running some tests with different values of the minimum body length (5, 10, 15), we compared the results of this solution with the results obtained without it. Taking a minimum body length of 10 as an example, the number of retrieved documents drops from 48781 to 48764 and the number of relevant documents drops from 1263 to 1257 (the other evaluation measures are not much affected by this change). Considering this result, we decided not to use this kind of document pre-processing, to avoid discarding documents that might be considered relevant in the qrels file (and so by a user) even if they do not seem so.

3.4. Re-ranking with discussion ID

One of our primary goals is to improve the final ranking of the documents, and with this in mind we tried to improve performance by using re-ranking. As a first analysis we observed that, in the documents of the dataset, posts related to the same discussion share the same first part of the document ID. Based on the assumption that only posts from certain discussions are relevant to a query, we tried to index those posts as a single document, obtaining an index with a discussion-based clustering of documents. In the searching phase, we first searched the query in the normal index, retrieving the classic ranking of single documents. We then searched the same query in the second index of discussion clusters, obtaining a ranking of discussions. Finally, the scores of the documents in the first ranking were increased based on the rank of their respective discussion in the second ranking. Unfortunately, this approach did not provide the desired results because it assumes that all posts related to a discussion have the same relevance to the searched topic. In fact, we found that some posts contain no useful information to argue the searched topic while still being part of a discussion that is really relevant.

3.5. OpenNLP attempt

Exploring new solutions, we also tried to implement a version of the program that uses the OpenNLP machine learning toolkit, in order to see which advantages a tokenization able to distinguish locations, personal nouns and so on could bring to the solution. Following this path we encountered an error that would have required significant changes to the workflow built up to that point, so we decided not to continue along this branch.

4. Methodology

The goal of this task is to retrieve relevant arguments from online debate portals, given a query on a controversial topic. We inspected the dataset and noticed that it is composed of five JSON files. We also read some documents and noticed that the main structure is the same for each of them, with only some fields differing. To use the documents with Lucene we had to parse them. To do so, we used the Jackson library and implemented our parser P0, which takes the premises, the conclusions, the document title and the stance attribute (pro or con) of the documents.

4.1. Indexing

We built four different components starting from the Lucene default ones: ArgsParser, ShanksAnalyzer, DirectoryIndexer and the Searcher. There is also a ShanksTouche class which contains the main method and allows us to set up the parameters for the indexing and analysis parts. After converting the documents into something Lucene can work on, we focused on how the indexer is created, and in particular on how the analyzer module works. In the tokenization phase we used the StandardTokenizer, the LowerCaseFilter, the EnglishPossessiveFilter and a StopFilter with a stop-word list. The arguments of the collection are stored in the index using four fields: ID, Title, Body, and Stance (pro or con).
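The following sketch shows how such a filter chain can be assembled as a Lucene Analyzer (Lucene 8.x imports assumed); it is an illustrative approximation, not the exact ShanksAnalyzer implementation, and the stop-word set is passed in by the caller.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishPossessiveFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

/**
 * Sketch of an analyzer with the chain described above:
 * StandardTokenizer -> LowerCaseFilter -> EnglishPossessiveFilter -> StopFilter.
 * The stop-word set would be loaded from the (custom) stop-list.
 */
public class ExampleArgsAnalyzer extends Analyzer {

    private final CharArraySet stopWords;

    public ExampleArgsAnalyzer(CharArraySet stopWords) {
        this.stopWords = stopWords;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        StandardTokenizer tokenizer = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(tokenizer);
        stream = new EnglishPossessiveFilter(stream);
        stream = new StopFilter(stream, stopWords);
        return new TokenStreamComponents(tokenizer, stream);
    }
}
```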
4.1.1. Custom Stop-List

As can be seen in Figure 4, we compared the baseline with the default stop-list and with our custom stop-list; after that comparison we decided to use the custom stop-list to better achieve the project goal. Our stop-list contains 1362 words, obtained by merging other commonly used stop-lists (e.g. SMART and Lucene). Our custom stop-list reduces memory usage by approximately 38% and indexing time by almost 20%.

4.2. Searching

In the searching phase of our program we focus on finding strategies to improve the general quality of the results, experimenting with different approaches such as query expansion based on WordNet synonyms or queries that score the fields of the documents differently. The approach we chose involves the use of BooleanQuery, which makes it possible to assign a specific weight (boost) to every term of the query. Finally, we decided to use and test both BM25Similarity and LMDirichletSimilarity, as well as their combination through a MultiSimilarity, to compute the document scores.

4.3. Re-Ranking

During our experiments we noticed that some similarity measures and parameter sets favoured precision at the expense of recall, and vice versa. With the purpose of combining the advantages of both cases, we opted for a re-ranking method that exploits different similarities and query parameters. Our implementation consists of two steps. In the first one we use a query able to obtain a high recall value when searching the index for relevant documents, whose parameters and similarity were decided through empirical trials. In the following step we use a second query, with better performance in terms of nDCG, to re-evaluate the returned documents and re-rank them according to the new score. We call the first query maxRecall and the second one maxNdcg. According to our implementation, this approach turns out to be effective on the 2020 topic set, but at the expense of the time spent in the search phase, which increases quite substantially.
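The following is a minimal sketch of this two-step search (Lucene 8.x assumed): a first searcher with MultiSimilarity retrieves the candidate set for the maxRecall query, then each candidate is re-scored under an LMDirichlet searcher and the maxNdcg query via explain(). The class name and the use of explain() to obtain the second score are illustrative choices, not necessarily how our Searcher implements it.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.LMDirichletSimilarity;
import org.apache.lucene.search.similarities.MultiSimilarity;
import org.apache.lucene.search.similarities.Similarity;

/** Sketch of the two-step maxRecall / maxNdcg re-ranking described above. */
public final class ReRankSketch {

    public static List<ScoreDoc> search(IndexReader reader, Query maxRecallQuery,
                                        Query maxNdcgQuery, int k) throws IOException {
        // Step 1: recall-oriented search over the whole index (k candidates, e.g. 1000).
        IndexSearcher recallSearcher = new IndexSearcher(reader);
        recallSearcher.setSimilarity(new MultiSimilarity(
                new Similarity[] { new BM25Similarity(), new LMDirichletSimilarity() }));
        ScoreDoc[] candidates = recallSearcher.search(maxRecallQuery, k).scoreDocs;

        // Step 2: re-score each candidate with the nDCG-oriented model and query.
        IndexSearcher ndcgSearcher = new IndexSearcher(reader);
        ndcgSearcher.setSimilarity(new LMDirichletSimilarity());
        List<ScoreDoc> reRanked = new ArrayList<>();
        for (ScoreDoc hit : candidates) {
            // explain() recomputes the score of this document under the second model
            float newScore = ndcgSearcher.explain(maxNdcgQuery, hit.doc)
                                         .getValue().floatValue();
            reRanked.add(new ScoreDoc(hit.doc, newScore));
        }
        reRanked.sort(Comparator.comparingDouble((ScoreDoc d) -> d.score).reversed());
        return reRanked;
    }

    private ReRankSketch() { }
}
```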
5. Experimental Setup

Our work is based on the following experimental setup:

• Repository: https://bitbucket.org/upd-dei-stud-prj/seupd2021-goldr;
• During development and experimentation we used our own computers, and in the end we ran our code on Tira;
• Similarities used for the evaluation: BM25, LMDirichlet and a "MULTI" combining both;
• Apache Maven, Lucene;
• Java JDK version 11;
• Version control system: git;
• trec_eval tool [5].

The collection is a set of 387,740 arguments crawled from debatewise.org, idebate.org, debatepedia.org and debate.org, plus 48 arguments from Canadian parliament discussions. We used the 50 topics from Touché 2020 Task 1 [6] to train and refine our search engine. Furthermore, we developed the source code collaboratively through the BitBucket platform.

6. Results and Discussion

In this section we provide graphical and numerical results about the experiments we conducted during the development of the project. We also discuss these results to derive some useful insights.

6.1. Our Baseline

In order to track performance progress during development, we created our own baseline. For each of the three parsers P0, P1, P2 we produced a run using a simple analyzer consisting of a StandardTokenizer, a LowerCaseFilter and a StopFilter with the standard stop-list offered by Lucene. The similarity used is BM25. Figure 1 compares the three baselines. From these results we decided to develop our approach based on the P0 parser. Figures 2 and 3 show how P0 compares to the other parsers in terms of per-topic performance. The choice of P0 was motivated not only by the statistics in Table 1 but also by its overall efficacy, as shown in the following figures.

Figure 1 shows how differently the three approaches can behave. P0, which extracts only the discussion title, the premise and the conclusion of each document (with its stance), shows better performance than the other two, retrieving more relevant documents across the entire run. We discarded P1 because its performance was inferior to the other two, as shown in Figure 2. To better understand the behaviour of P2 compared to P0, Figure 3 compares the per-topic average precision. Approximately 10% of the topics, in particular topics 3, 23, 34 and 43 to 46, tend to give better results with parser P2.

Figure 1: Plot of the interpolated precision against recall of the three baselines.
Figure 2: Per topic Average Precision Difference = AP_P0 − AP_P1.
Figure 3: Per topic Average Precision Difference = AP_P0 − AP_P2.

6.1.1. Baseline with the custom stop-list

The results obtained with our custom stop-list are comparable to those obtained with the stop-list offered by Lucene.

Figure 4: Plot of the interpolated precision against recall of the two P0 baselines obtained with the Lucene stop-list and with the custom stop-list.

6.2. Parameters Tuning

Our ultimate goal is to maximize the average value of nDCG@5 across all topics. To find the optimal parameter values, we performed extensive iterative tests, trying all combinations of parameters with values belonging to discrete intervals we defined. The same experiments were conducted using three different similarities: BM25Similarity (BM25), LMDirichletSimilarity (LMD) and MultiSimilarity (MULTI). The MultiSimilarity we used combines BM25Similarity and LMDirichletSimilarity. The method we developed is governed by five parameters, which become ten when re-ranking is used. Out of those ten, five parameters concern the query that searches the index to maximize overall recall, while the other five affect the re-ranking process and are chosen to maximize nDCG@5.
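To make the tuning procedure concrete, the sketch below shows an exhaustive sweep over discrete parameter values. The candidate grids and the evaluation callback (e.g. the mean nDCG@5 computed with trec_eval on the 2020 topics) are placeholders, not the exact intervals we explored; the parameter names follow Tables 2 and 3.

```java
import java.util.function.ToDoubleFunction;

/** Sketch of the exhaustive parameter sweep used for tuning (placeholder grids). */
public final class ParameterSweep {

    /** Value holder for one parameter combination (names as in Tables 2 and 3). */
    public static final class Params {
        public final double tBoost, sBoost, pBoost;
        public final int pDist;

        public Params(double tBoost, double sBoost, double pBoost, int pDist) {
            this.tBoost = tBoost;
            this.sBoost = sBoost;
            this.pBoost = pBoost;
            this.pDist = pDist;
        }
    }

    public static Params bestParams(ToDoubleFunction<Params> evaluate) {
        double[] tBoosts = { 0.0, 0.15, 0.3, 1.75, 3.5 };   // placeholder grid
        double[] sBoosts = { 0.05, 0.15, 0.2 };
        double[] pBoosts = { 0.75, 1.75 };
        int[] pDists = { 12, 15, 17 };

        Params best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (double t : tBoosts) {
            for (double s : sBoosts) {
                for (double p : pBoosts) {
                    for (int d : pDists) {
                        Params candidate = new Params(t, s, p, d);
                        // evaluate would run the search and return e.g. mean nDCG@5
                        double score = evaluate.applyAsDouble(candidate);
                        if (score > bestScore) {
                            bestScore = score;
                            best = candidate;
                        }
                    }
                }
            }
        }
        return best;
    }

    private ParameterSweep() { }
}
```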
6.3. Optimal parameters

The optimal sets of parameters we found for the maxRecall query and for the maxnDCG query are reported in Tables 2 and 3. According to our empirical tests, the best similarity measure to maximize nDCG@5 is LMD, while the best measure to maximize recall is MULTI. As can be seen in Table 2, the best value of tBoost for maxnDCG is 0. This led us to the conclusion that considering the title of the discussion (which is the same for all posts in it) can be misleading with respect to the relevance of the content of each individual post. The title, however, is very useful to obtain higher recall. This intuition is what pushed us to abandon the method of re-ranking based on discussion ID, described in Section 3.4.

Table 2: The optimal parameters obtained by training on the 2020 topics (topic description not considered).

Query     | Similarity | tBoost | sBoost | pBoost | pDist
maxRecall | MULTI      | 3.50   | 0.15   | 1.75   | 12
maxnDCG   | LMD        | 0.00   | 0.05   | 0.75   | 15

Table 3: The optimal parameters obtained by training on the 2020 topics (considering the topic description).

Query     | Similarity | tBoost | sBoost | pBoost | pDist
maxRecall | MULTI      | 3.50   | 0.15   | 1.75   | 12
maxnDCG   | LMD        | 0.30   | 0.05   | 1.75   | 15

6.4. maxnDCG and maxRecall

In this section we compare the two queries maxnDCG and maxRecall with the optimal parameter values established by the tests. From the graph in Figure 5 we can see that precision is significantly higher for maxnDCG, while maxRecall has a better recall over the whole ranking. From these data we believe we have obtained the desired result from the two queries. In Figures 6, 7 and 8 we compare the per-topic Average Precision of the three test runs. We can notice that there is not much difference between maxRecall and P0. The same is not true for maxnDCG, which proves to be better than P0 on almost all topics. This scenario is further confirmed by Figure 8, which shows the dominance of AP_maxnDCG over AP_maxRecall. The performance can be further compared with the numerical results reported in Table 4. These data show the actual trade-off between the two approaches: the advantage in terms of recall for maxRecall is significant compared to maxnDCG, and the same advantage applies in the opposite direction for precision and nDCG.

Figure 5: Plot of the interpolated precision against recall for the maxnDCG query versus maxRecall.
Figure 6: Per topic Average Precision Difference = AP_maxRecall − AP_P0.
Figure 7: Per topic Average Precision Difference = AP_maxnDCG − AP_P0.
Figure 8: Per topic Average Precision Difference = AP_maxnDCG − AP_maxRecall.

6.5. Re-Ranking Results

Here we compare the Re-Ranking approach described in Section 4.3 with the previous ones. Figure 9 shows how re-ranking based on maxNdcg (in red) greatly improves the interpolated precision of maxRecall (in green). It is also possible to note that it allows for better performance with respect to maxNdcg alone (in orange). Figure 10 shows the Average Precision value obtained by re-ranking for each topic. From it, it is possible to identify the most problematic topics, for which the developed method is not very effective; in particular, the most critical topics are {2, 8, 22, 40, 44}. From Figure 11 we can see that the Re-Ranking method improves on almost every topic w.r.t. the baseline. In Figure 12 it can be seen that the Re-Ranking method improves on maxNdcg on many topics, except for topics 10 and 20.
Figure 13, predictably, clearly shows the better performance achieved by Re-Ranking w.r.t. maxRecall; only the first topic is penalized by the re-ranking process.

Figure 9: Plot of the interpolated precision against recall comparing baseline, maxRecall, maxnDCG and Re-Ranking.
Figure 10: Per topic Average Precision for the Re-Ranking approach.
Figure 11: Per topic Average Precision Difference = AP_re-ranking − AP_P0.
Figure 12: Per topic Average Precision Difference = AP_re-ranking − AP_maxnDCG.
Figure 13: Per topic Average Precision Difference = AP_re-ranking − AP_maxRecall.

Table 4: Some numerical results comparing the performance of Re-Ranking, maxnDCG, maxRecall and the baseline.

RUN                      | num_rel_ret | map    | P_5    | recall_1000 | nDCG   | nDCG@5
P0-RERANK                | 1263        | 0.1750 | 0.6490 | 0.6631      | 0.4979 | 0.5495
P0-LMD-MAX_nDCG@5        | 1103        | 0.1717 | 0.6490 | 0.5803      | 0.4764 | 0.5495
P0-MULTI-MAX_RECALL_1000 | 1263        | 0.1276 | 0.4082 | 0.6631      | 0.4283 | 0.3117
P0-BM25-BASELINE         | 1250        | 0.1256 | 0.3796 | 0.6563      | 0.4216 | 0.2871

6.6. RUN Submission

The five runs we decided to submit are the following:

• run-1: Re-Ranking approach.
• run-2: like run-1, but proximity searches use only pairs of subsequent tokens.
• run-3: maxnDCG query with LMDirichletSimilarity.
• run-4: maxnDCG query with MultiSimilarity.
• run-5: maxRecall query with MultiSimilarity.

Table 5 shows the numerical results we obtained for each run. Overall, the first run is the best one.

Table 5: Numerical statistics from the trec_eval evaluations of the 5 runs on the 2020 topic set.

RUN          | num_rel_ret | map    | P_5    | recall_1000 | nDCG   | nDCG@5
shanks-run-1 | 1263        | 0.1750 | 0.6490 | 0.6631      | 0.4979 | 0.5495
shanks-run-2 | 1255        | 0.1735 | 0.6408 | 0.6588      | 0.4960 | 0.5432
shanks-run-3 | 1103        | 0.1717 | 0.6490 | 0.5803      | 0.4764 | 0.5495
shanks-run-4 | 1199        | 0.1497 | 0.4816 | 0.6312      | 0.4536 | 0.3866
shanks-run-5 | 1263        | 0.1276 | 0.4082 | 0.6631      | 0.4283 | 0.3117

SUBMISSION UPDATE: NEW EXPERIMENTAL RESULTS AFTER A NEW CORRECTED VERSION OF THE QRELS WAS RELEASED

All the previous results are based on an incorrect version of the qrels file for the 2020 topics. Since we only learned of the new corrected version when the deadline was approaching, we could not recreate all the graphs and comparisons of Section 6. However, we managed to find the new optimal parameter sets and repeat the five runs. The new data is provided in Tables 6 and 7.

Table 6: The optimal parameters obtained by training on the 2020 topics (WITH CORRECTED QRELS).

Query     | Similarity | tBoost | sBoost | pBoost | pDist
maxRecall | MULTI      | 0.30   | 0.20   | 0.75   | 12
maxnDCG   | LMD        | 0.15   | 0.05   | 0.75   | 17

Table 7: Numerical statistics from the 5 runs on the 2020 topic set (WITH CORRECTED QRELS).
RUN          | num_rel_ret | map    | P_5    | recall_1000 | nDCG   | nDCG@5
shanks-run-1 | 795         | 0.3146 | 0.6245 | 0.8705      | 0.6521 | 0.6407
shanks-run-2 | 790         | 0.3126 | 0.5959 | 0.8671      | 0.6521 | 0.6213
shanks-run-3 | 770         | 0.3141 | 0.6245 | 0.8502      | 0.6479 | 0.6407
shanks-run-4 | 788         | 0.2546 | 0.4327 | 0.865       | 0.5839 | 0.4391
shanks-run-5 | 795         | 0.2565 | 0.4449 | 0.8705      | 0.588  | 0.4513

7. Statistical Analysis

Here we analyse our models with some important statistics, to evaluate them in a deeper way via hypothesis testing. We used ANOVA, the Student's t-test and boxplots. All the analyses focus on the mean of two key metrics, Average Precision and nDCG@5, over the five retrieval models described in Section 6.6.

A boxplot gives a visual representation of the location, symmetry, dispersion and presence of outliers in the data (points that fall outside the construction of the boxplot).

Figure 14: Boxplots of Average Precision for the 5 runs.

It can be appreciated that run1 and run3 have an almost identical structure: the same median, interquartile range (IQR) and whisker length. On the other hand, run4 and run5 perform worse on this metric because those runs were tuned to maximize recall or nDCG@5; their bottom whiskers are in fact closer to zero, meaning that for some topics the system did not retrieve enough relevant documents. All the runs have outliers: the points above the whiskers, represented by circles, are topics which perform better than the others (or worse, if they are below the boxplot). Run1 and run3 are skewed to the right and show less variance compared to the other systems.

Looking at the boxplot of nDCG@5 reveals the same behaviour seen before: run1 and run3 produce higher scores compared to the other runs and seem identical in performance. Run3 has better results w.r.t. run4, so we can conclude that using different similarities changes the results dramatically. Run2 shows a smaller IQR than the others, in particular compared to run1, which shares the same architecture with the only difference being the proximity parameter. We can say that run1 is able to achieve higher scores by exploiting the more flexible proximity parameter. Further analysis with ANOVA and the t-test will help to understand the possible similarities between the systems.

Figure 15: Boxplots of nDCG@5 for the 5 runs.

7.1. Hypothesis testing

The first tool that we use is ANOVA (Analysis of Variance), a statistical test of whether or not the means of several groups are equal. H0, the null hypothesis that all the means are equal, is tested and either rejected or not rejected. As we can see in Figure 16, there are multiple factors to be taken into account to perform a correct analysis of the F-statistic. The system sum of squares (SS_system) and the error sum of squares (SS_error) are divided by their degrees of freedom to obtain the mean squares (MS). The F-statistic (F) is equal to the ratio of MS_System to MS_Error. Having a p-value of 0.1849 we cannot reject H0; to do so we would have needed a value lower than the significance level α, which is set at 0.05. We could conclude that our systems are statistically similar in mean Average Precision, but performing the two-way ANOVA reverses the situation.

Figure 16: ANOVA1 results.
Figure 17: ANOVA2 results.
Figure 18: Pairwise comparisons, HSD adjustment.
Figure 19: Multiple comparison chart for AP.

As can be seen from Figures 18 and 19, the null hypothesis can be rejected because the runs show statistically significant differences.
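As an illustration of the one-way ANOVA step, the sketch below uses the OneWayAnova class from Apache Commons Math on per-topic Average Precision arrays; it is not the tool we used to produce Figures 16-19, and the values shown are placeholders.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.commons.math3.stat.inference.OneWayAnova;

/**
 * Illustrative one-way ANOVA over per-topic AP values of several runs:
 * the null hypothesis H0 of equal means is rejected when p < alpha = 0.05.
 */
public class AnovaSketch {

    public static void main(String[] args) {
        // Placeholder values: each array would hold one AP value per topic for a run.
        double[] run1 = { 0.31, 0.28, 0.35 };
        double[] run2 = { 0.30, 0.27, 0.33 };
        double[] run3 = { 0.18, 0.22, 0.25 };

        List<double[]> groups = Arrays.asList(run1, run2, run3);
        OneWayAnova anova = new OneWayAnova();
        double fValue = anova.anovaFValue(groups);
        double pValue = anova.anovaPValue(groups);
        boolean reject = anova.anovaTest(groups, 0.05); // true => reject H0

        System.out.printf("F = %.4f, p = %.4f, reject H0: %b%n", fValue, pValue, reject);
    }
}
```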
The topic effect, which can be read in cell [3,2] of Figure 17, expresses much more variance: this variability is greater than that of the systems, as we can expect from topics. Run 3 vs run 4: as we pointed out previously, the use of the LMDirichlet similarity improves the results, and for this pair the null hypothesis can be rejected. As we will also see in the nDCG@5 analysis, run 4 and run 5 are different from the others, and the p-value of their statistic suggests that those two runs are equivalent in mean. For the others (run 1, run 3, run 2), instead, we cannot reject H0. Unfortunately we have not been able to simplify the reading of the images, so here is the conversion between the numbers on the axes and the actual runs: 1 - 2 - 3 - 4 - 5 (image numbers) correspond to run1 - run3 - run2 - run5 - run4 (true values). This applies to Figures 18, 19, 21 and 23.

Moving the study to nDCG@5, in the ANOVA table of Figure 20 the p-value is lower than the confidence level, so we can reject the null hypothesis and claim that there is at least one mean among the various runs that differs significantly from the others. This table does not tell us which systems are different on average, only that there is at least one. The Tukey Honestly Significant Difference (HSD) test, as in the AP case, answers that question by creating confidence intervals for all pairwise differences between the systems we want to compare, while controlling the family-wise error rate; otherwise, the probability of a type I error would be magnified. Run4 and run5 (Figure 22) are statistically different from the others, while between the two of them we cannot reject the null hypothesis. It seems that adopting the maxRecall or maxnDCG query approach does not yield performance benefits except through a different similarity, as in the case of run3, or through re-ranking (run1 and run2). In accordance with what the boxplot chart suggested for run3 and run4, we can reject H0, since the p-value is below α = 0.05: the effect of the different similarity is visible. We can conclude that our Re-ranking approach is much more significant than a standard technique, even if run3 returns comparable results in this analysis. In the future this behaviour can be investigated more carefully.

Figure 20: nDCG@5 - ANOVA1 results.
Figure 21: nDCG@5 - Multiple pairwise comparison of the group means; the first column lists the pairs of systems, the last one the related p-value.
Figure 22: Multiple comparison chart for nDCG@5.
Figure 23: Pairwise comparisons, HSD adjustment.
Figure 24: Multiple comparison chart for nDCG@5.

8. Failure analysis

Looking at the results obtained by running the trec_eval program on our runs, we can see the performance of our information retrieval systems. In particular, we now focus on finding and understanding for which topics the systems fail to achieve good performance. We therefore decided to apply this kind of evaluation to our best run (shanks-run-1). Using the map field as the reference measure for the following analysis, we discovered that the top (map) performance comes from topics 42, 43 and 1, and the worst from topics 22, 12 and 44. Searching for the reason why, we understood that the main weakness of the process is the lack of an argument quality evaluation phase and of a more consistent lexical analysis process. As an example we take and compare topics 1, 12 and 44.

Topic 1: Should teachers get tenure?
Narrative: highly relevant arguments make a clear statement about tenure for teachers in schools or universities. Relevant arguments consider tenure more generally, not specifically for teachers, or, instead of talking about tenure, consider the situation of teachers' financial independence.

The stopwording process applied to this topic leads to the parsed query "teachers tenure", which, even without the whole phrase construction, explains very well what we are searching for. The documents for this topic are well retrieved by our system: if we look, for example, at the first 5 positions (the most relevant ones for a user browsing the results), the comparison with the qrels shows that all of them are judged "highly relevant".

Topic 12: Should birth control pills be available over the counter?
Narrative: highly relevant arguments argue for or against the availability of birth control pills without prescription. Relevant arguments argue with regard to birth control pills and their side effects only. Arguments only arguing for or against birth control are irrelevant.

The stopwording process applied to this topic leads to the parsed query "birth control pills available". This seems a quite explicative phrase, but in the retrieval phase the system has trouble finding the proper relevant results. The critical issue with this query is the distinction between high and low quality arguments. The lack of an effective argument quality process makes the program fail the document quality evaluation, so the system is unable to put the appropriate documents in the right ranking positions. We can notice this aspect also by looking at the other measures (not only the map): if we look, for example, at the growth of recall, we notice that it is mainly concentrated in the tail of the ranking (when many documents have been retrieved).

Topic 44: Should election day be a national holiday?
Narrative: highly relevant arguments explain why making election day a holiday is a good idea, or why not. Relevant arguments mention the fact or its remedy as one of the problems that elections have.

The stopwording process applied to this topic leads to the parsed query "election national holiday". The results obtained from this kind of search are quite bad: the system retrieves as relevant discussions like "Potato day should be a national holiday" or "Star Wars day should be a national holiday". These mismatches come from the fact that the system does not recognize "election day" as a single mandatory query term, and so documents about similar topics that differ only by some words are wrongly retrieved. To overcome this kind of issue it could be useful to equip the system with an argument quality evaluator, a word pattern recognizer that catches the words which must not be separated, and a way to identify structures like "subject, predicate, object" that assigns higher weights to the subject and marks it as a mandatory word in the documents to be retrieved.

9. Conclusions and Future Work

At the end of this experiment we found that the changes we made lead to a fairly good improvement with respect to the starting baseline of our information retrieval system (Table 4). All the statistics have undergone a significant increase; for example: MAP +39.3%, P5 +68.8%, nDCG@5 +91.4%.
Comparing our results with last year's, we noticed that they follow the same trend for the nDCG@5 measure and are in line with them, even though the applied strategies appear to be fairly different.

The whole process we followed highlighted a lot of possible extensions that could be implemented and explored with more time. For example, it could be interesting to apply some form of location word detection, personal noun recognition and compound word discernment, and to improve or even add a new ranking phase that allows the insertion of some kind of argument quality analysis. The possibility of implementing a different approach based on some kind of machine learning model remains open: as stated in the initial attempts of Section 3, we dropped the idea of using GPT-2 because it returned unsatisfying results, so in the future we could go into more detail with this modern tool and improve the performance of our solution in this way. Another possible approach derives from the paper by Fox and Shaw [4]; it is interesting because it takes different runs as input and combines them to reach the best result.

References

[1] C. Fellbaum, WordNet: An Electronic Lexical Database, Bradford Books, 1998.
[2] Y. Ajjour, H. Wachsmuth, J. Kiesel, M. Potthast, M. Hagen, B. Stein, args.me corpus, 2020. URL: https://doi.org/10.5281/zenodo.3734893. doi:10.5281/zenodo.3734893.
[3] A. Bondarenko, M. Fröbe, M. Beloucif, L. Gienapp, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2020: Argument Retrieval, CEUR-WS 2696 (2020).
[4] E. A. Fox, J. Shaw, Combination of Multiple Searches, in: D. K. Harman (Ed.), The Second Text REtrieval Conference (TREC-2), National Institute of Standards and Technology (NIST), Special Publication 500-215, Washington, USA, 1993, pp. 243–252.
[5] C. L. A. Clarke, N. Craswell, I. Soboroff, Overview of the TREC 2009 Web Track, in: E. M. Voorhees, L. P. Buckland (Eds.), The Eighteenth Text REtrieval Conference Proceedings (TREC 2009), National Institute of Standards and Technology (NIST), Special Publication 500-278, Washington, USA, 2010.
[6] A. Bondarenko, M. Fröbe, M. Beloucif, L. Gienapp, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2020: Argument Retrieval, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes Papers of the CLEF 2020 Evaluation Labs, volume 2696 of CEUR Workshop Proceedings, 2020. URL: http://ceur-ws.org/Vol-2696/.