<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Document retrieval task on controversial topic with Re-Ranking approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Cassetta</string-name>
          <email>andrea.cassetta@studenti.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Piva</string-name>
          <email>alberto.piva.8@studenti.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enrico Vicentini</string-name>
          <email>enrico.vicentini.1@studenti.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLEF 2021 - Conference and Labs of the Evaluation Forum</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper reports the work done by the Shanks team (a group of University of Padua students) for the Argument Retrieval CLEF 2021 Touché Task 1. The task focuses on the problem of retrieving relevant arguments for a given controversial topic from a focused crawl of online debate portals. After some tests, we decided to parse the input documents keeping only the title, the premises and the conclusion of the arguments (as well as the stance, which is necessary to understand the point of view of the argument's author). After indexing the documents, the work concentrates on how retrieval and ranking are performed. Our experiments show that the best results are obtained using a WordNet [1] based query expansion approach and a re-ranking process with two different similarity functions. This report describes in detail how document parsing works and how the indexing and searching parts are developed. An unexpected update of the qrels file did not allow us to re-run all the tests; in the end, however, we also report the results of the runs obtained from parameter tuning on the new qrels.</p>
      </abstract>
      <kwd-group>
        <kwd>Argument Retrieval CLEF 2021 Touché Task 1</kwd>
        <kwd>WordNet synonyms</kwd>
        <kwd>Re-ranking</kwd>
        <kwd>BM25</kwd>
        <kwd>DirichletLM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The rest of this paper is organized as follows: Section 2 reviews related work; Section 3 describes our initial attempts; Section 4 presents our methodology; Section 5 explains our experimental setup, including the software, tools and methods used; Section 6 discusses the results; Section 7 reports a statistical analysis; Section 8 presents a failure analysis; finally, Section 9 draws some conclusions and outlines future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        To create our search engine we built upon some source code that Professor Nicola Ferro
wrote as a set of toy examples, and we changed it as described in the subsequent sections. We have
also read the overview of the Touché task at CLEF 2020 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        After some research, and as suggested by Professor Ferro, we discovered a different way of
re-ranking and merging the results, described by Fox and Shaw in the paper Combination of Multiple Searches [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. They describe an interesting method to increase the performance of a retrieval
system by combining the similarity values from different output runs obtained with Boolean retrieval
methods. The paper also describes how the indexing (and analyzing)
part was done, but we decided to skip that part because the starting dataset is different, and we had
already analyzed our dataset to build the best index possible. To understand
how the result merging is done it is not necessary to analyze how the queries are written,
so we only report that the P-norm queries are written as complex boolean
expressions using AND and OR operators. When all the runs are done, the second part of the
experiment consists in combining the output runs (obviously obtained from the same collection
of data) to reach the best result. Different ways of combining them are, for example, taking
the top N documents retrieved for each run, or modifying the value of N for each run based on
the eleven-point average precision of that run. In TREC-2, their experiments concentrated on
methods of combining runs based on the similarity values of a document to each query for each
of the runs. After some tests, the best choice is to weight each of the separate runs equally,
without favoring any individual run or method; sometimes, however, some runs have to be weighted more
or less depending on their performance. This way of merging the runs helps the retrieval
system make a trade-off between the runs’ errors. During the tests, six
different ways to combine the runs were considered (the last three are sketched in code after this paragraph):
• CombMIN: takes the minimum similarity value; it is used to minimize the probability that a non-relevant document is
highly ranked;
• CombMAX: takes the maximum similarity value; it is used to minimize the number of relevant documents being poorly
ranked;
• CombMED: takes the median similarity value (to mitigate the problems of the previous two methods)
instead of taking a value from only one run;
• CombSUM: takes the sum of the set of similarity values;
• CombANZ: takes the average of the non-zero similarity values, so it ignores
the runs that fail to retrieve a relevant document;
• CombMNZ: gives higher weight to documents retrieved by multiple
retrieval methods.
We have to point out that the first two methods have a specific objective, but they do not care
about the possible problems they can generate on the other retrieved documents. During
the tests, CombMIN performed worse than all the single runs; on the contrary, the CombANZ and
CombMNZ methods performed better than the individual runs, possibly
because they produce the same ranked sequence for the documents retrieved by all five
individual runs.
      </p>
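      <p>As an illustration of how the last three combination functions can be computed, the following Java sketch merges the similarity scores that a single document obtained in several runs. It is only a hypothetical example of the technique, not the code of Fox and Shaw nor of our system.</p>
      <preformat><![CDATA[
import java.util.List;

/** Illustrative fusion helpers: combine the scores one document received in several runs. */
public final class ScoreFusion {

    /** CombSUM: sum of all the similarity values. */
    public static double combSum(List<Double> scores) {
        return scores.stream().mapToDouble(Double::doubleValue).sum();
    }

    /** CombANZ: average of the non-zero similarity values (runs that missed the document are ignored). */
    public static double combAnz(List<Double> scores) {
        long nonZero = scores.stream().filter(s -> s > 0.0).count();
        return nonZero == 0 ? 0.0 : combSum(scores) / nonZero;
    }

    /** CombMNZ: CombSUM multiplied by the number of runs that retrieved the document. */
    public static double combMnz(List<Double> scores) {
        long nonZero = scores.stream().filter(s -> s > 0.0).count();
        return combSum(scores) * nonZero;
    }
}
]]></preformat>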
    </sec>
    <sec id="sec-3">
      <title>3. Initial Attempts</title>
      <sec id="sec-3-1">
        <title>3.1. Parsing Documents</title>
        <p>Before going into the details of our final solution, it is useful to describe the previous approaches
we took into account to solve the problem and why we chose not to explore them further.
Multiple parsers have been developed to parse the documents from the provided collection.
The most trivial parser, called P1, extracts the sourceText and discussionTitle elements from the
corpus documents. The second parser, P2, extracts the elements related to the conclusion, the
premise, the discussion title, and all the text from the sourceText field in between the premise
and the conclusion. The third parser, P0, which in the end we decided to use for our more
advanced experiments, extracts only the discussion title, the premise and the conclusion of
each document. Table 1 shows how the index statistics are affected by each of these parsers.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Query Expansion</title>
        <sec id="sec-3-2-0">
          <title>3.2.1. OpenAI GPT-2</title>
          <p>While developing the software that creates the index, we thought carefully about how
the tokens resulting from the analysis of the topic query could be used.
In the attempt to expand the queries we came across the OpenAI GPT-2 model, a Machine
Learning model that generates synthetic text samples from an arbitrary input.</p>
          <p>The idea was to use this powerful model to perform query expansion: giving the
topic title as input, it would generate a more complete phrase with, hopefully, new words that could help the
searching phase. Unfortunately the output of GPT-2 is not always what we expect. For example,
if we give it the tokenized query title as input, which may consist of only two words, the output
is a dialogue that is not very useful for our task. Another problem is the structure of the queries: since
they are all questions, the GPT-2 model generates an answer for them, which again is
not what we were interested in. The problem persists even if we remove the question mark at
the end of the phrase; the queries still have a question structure. For these reasons, we
decided to set this kind of approach aside.</p>
        </sec>
        <sec id="sec-3-2-1">
          <title>3.2.2. Randomly Weighted Synonyms</title>
          <p>An approach initially devised for query expansion, but which we later decided not to explore
further, was to generate multiple queries for the same topic, each with randomly generated
synonym boost values. For each query, a ranking of 1000 documents was then generated, and
finally all the rankings were merged into one. The first performances obtained by this method
did not encourage us to proceed with its development, because there were many possible paths
to follow from that point and the search time increased considerably.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Minimum body length</title>
        <p>During the process of document exploration, useful to detect which fields are needed and
present in each collection, we noticed that in some documents the sourceText field
consisted of useless text without a single piece of relevant information about its topic. In order to avoid
this kind of document ending up in the inverted file, we tried to include, during the
indexing process, a check on the length of the ParsedDocument’s body. If this field, which we
consider as the union of the conclusion and premise fields, is made up of fewer than a
certain number of tokens (a recurrent trait that the parsing phase highlights for such instances),
we do not consider the document in the indexing phase.</p>
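        <p>A minimal sketch of this check, assuming a naive whitespace token count and an illustrative threshold of 10 tokens (the helper names are not our actual code), could look as follows:</p>
        <preformat><![CDATA[
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Hypothetical "min body length" filter applied at indexing time.
final class MinBodyLengthFilter {

    static final int MIN_BODY_LENGTH = 10; // we experimented with 5, 10 and 15 tokens

    /** Adds the document only if its body (premises + conclusion) has enough tokens. */
    static void indexIfLongEnough(String body, Document luceneDoc, IndexWriter writer) throws IOException {
        if (tokenCount(body) >= MIN_BODY_LENGTH) {
            writer.addDocument(luceneDoc);
        }
    }

    private static int tokenCount(String text) {
        return text.isBlank() ? 0 : text.trim().split("\\s+").length; // naive whitespace count
    }
}
]]></preformat>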
        <p>After doing some tests with different values for the "min body length" (5, 10, 15), we
compared the results of this kind of solution with the results obtained without it and we
discovered that, taking "min body length" equal to 10 as an example, the number of retrieved
documents goes from 48781 to 48764 and the number of relevant documents goes from
1263 to 1257 (the other evaluation measures are not much affected by this change).
Considering this result, we decided not to use this kind of document pre-processing, to
avoid discarding documents that could be considered relevant in the qrels file (and
so for a user) even if they do not seem so.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Re-ranking with discussion ID</title>
        <p>One of our primary goals is to improve the final ranking of the documents. With this in
mind, we tried to improve performance by using re-ranking.</p>
        <p>As a first analysis we observed that, in the documents of the dataset, the posts related to the
same discussion share the same first part of the document ID. Based on the assumption that only
posts from certain discussions are relevant to a query, we tried to index those posts as a single
document, finally obtaining an index with a discussion-based clustering of documents. In
the searching phase, we first searched the query in the normal index, retrieving the classic
ranking of single documents. Secondly, we searched the same query in the second index of
discussion clusters, thus obtaining a ranking of discussions. Finally, the scores of the documents in
the first ranking were increased based on the rank of their respective discussion in the second
ranking. Unfortunately this approach did not provide the desired results, because it assumes
that all posts related to a discussion have the same relevance to the searched topic. In fact, it
turned out that some posts do not contain any useful information to argue the searched topic
but are part of a discussion that is really relevant.</p>
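        <p>A minimal sketch of the boosting step we experimented with, assuming that the discussion identifier is the prefix of the document ID before a separator and that the boost formula is purely illustrative:</p>
        <preformat><![CDATA[
import java.util.Map;
import org.apache.lucene.search.ScoreDoc;

// Sketch of the (later abandoned) discussion-based boost: the score of each document is
// increased according to the rank of its discussion in the second, discussion-level ranking.
final class DiscussionBoost {

    static void boost(ScoreDoc[] documentRanking, Map<String, Integer> discussionRank,
                      Map<Integer, String> luceneDocToDocumentId) {
        for (ScoreDoc sd : documentRanking) {
            String discussion = discussionIdOf(luceneDocToDocumentId.get(sd.doc));
            Integer rank = discussionRank.get(discussion);
            if (rank != null) {
                sd.score += 1.0f / (rank + 1); // higher-ranked discussions give a larger boost
            }
        }
    }

    /** Posts of the same discussion share the first part of the document ID (separator assumed). */
    private static String discussionIdOf(String documentId) {
        int cut = documentId.indexOf('-');
        return cut > 0 ? documentId.substring(0, cut) : documentId;
    }
}
]]></preformat>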
      </sec>
      <sec id="sec-3-5">
        <title>3.5. OpenNLP attempt</title>
        <p>Exploring new solutions, we have even tried to implement a version of the program that uses
the OpenNLP Machine Learning toolkit in order to see which advantages a tokenization able
to distinguish location, personal nouns and so on could provide for the solution.
Following this path we have encountered an error which requires significant changes in the
workflow of what we had done up to that point. For this reason we have decided not to keep
going on with this branch.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>The goal of this task is to retrieve relevant arguments from online debate portals, given a query
on a controversial topic. We checked the dataset and noticed that it is composed
of five JSON files. We also read some documents and noticed that the main
structure is the same for each one, but with some different fields. To use the documents with
Lucene we had to parse them. To do so, we used the Jackson library and
implemented our parser P0, which takes the premises, the conclusions, the document title and
the stance attribute (pro or con) of the documents.</p>
      <sec id="sec-4-1">
        <title>4.1. Indexing</title>
        <p>We built four different components starting from the Lucene default ones: ArgsParser,
ShanksAnalyzer, DirectoryIndexer and the Searcher. There is also a ShanksTouche class which contains the
main method and allows us to set the parameters of the indexing and analysis parts.
After converting the documents into something that Lucene can work on, we focused
on how the indexer is created, and in particular on how the analyzer module
works. In the tokenization phase we used the StandardTokenizer, the LowerCaseFilter,
the EnglishPossessiveFilter and a StopFilter with a stop-word list.</p>
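        <p>A minimal sketch of an analyzer wired with this chain (the stop-word set would be loaded from our custom stop-list; the class name is illustrative):</p>
        <preformat><![CDATA[
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishPossessiveFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// StandardTokenizer -> LowerCaseFilter -> EnglishPossessiveFilter -> StopFilter
public class SketchAnalyzer extends Analyzer {

    private final CharArraySet stopWords;

    public SketchAnalyzer(CharArraySet stopWords) {
        this.stopWords = stopWords; // e.g. our 1362-word custom stop-list
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        StandardTokenizer source = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(source);
        stream = new EnglishPossessiveFilter(stream);
        stream = new StopFilter(stream, stopWords);
        return new TokenStreamComponents(source, stream);
    }
}
]]></preformat>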
        <p>The arguments of the collection are stored in the index using four fields: ID, Title, Body, and
Stance (pro or con).</p>
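        <p>For illustration, an argument could be mapped to these four fields as in the following sketch (field names and the helper signature are illustrative, not necessarily those of our code):</p>
        <preformat><![CDATA[
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

final class ArgumentMapping {

    /** Builds the Lucene document for one argument. */
    static Document toLuceneDocument(String id, String title, String body, String stance) {
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));         // exact identifier, not analyzed
        doc.add(new TextField("title", title, Field.Store.YES));     // analyzed discussion title
        doc.add(new TextField("body", body, Field.Store.YES));       // premises + conclusion
        doc.add(new StringField("stance", stance, Field.Store.YES)); // "pro" or "con"
        return doc;
    }
}
]]></preformat>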
        <sec id="sec-4-1-1">
          <title>4.1.1. Custom Stop-List</title>
          <p>As you can see in Figure 4 we have compared the baseline with the default stop-list and with
our custom stop-list; after that comparison we have decided to use a custom stop-list to better
achieve the project goal. Our stop-list contains 1362 words that are derived from the merging
of other stop-lists (e.g. smart and lucene) typically used. Our custom stop-list reduces memory
usage by approximately 38% and indexing time by almost 20%.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Searching</title>
        <p>In the searching phase of our program we focused our attention on finding strategies to improve
the general quality of the results, experimenting with different approaches such as query
expansion based on WordNet synonyms, or defining queries that score the fields of the
documents differently. The approach we chose involves the use of BooleanQuery; in this
way it is possible to assign a specific weight (boost) to every term of the query.</p>
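        <p>A minimal sketch of such a boosted query, where the original topic terms and their WordNet synonyms are added as optional clauses with different (purely illustrative) boost values:</p>
        <preformat><![CDATA[
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

final class BoostedQueryExample {

    /** Original terms weigh more than their WordNet synonyms; boost values are illustrative. */
    static Query build(Iterable<String> queryTerms, Iterable<String> synonyms) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        for (String term : queryTerms) {
            builder.add(new BoostQuery(new TermQuery(new Term("body", term)), 2.0f),
                        BooleanClause.Occur.SHOULD);
        }
        for (String synonym : synonyms) {
            builder.add(new BoostQuery(new TermQuery(new Term("body", synonym)), 0.5f),
                        BooleanClause.Occur.SHOULD);
        }
        return builder.build();
    }
}
]]></preformat>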
        <p>Finally, we decided to test both the BM25Similarity and the LMDirichletSimilarity,
as well as their combination through a MultiSimilarity, to compute the document scores.</p>
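        <p>The "MULTI" configuration can be sketched as follows (default constructor parameters are shown; the actual runs use the tuned values):</p>
        <preformat><![CDATA[
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.LMDirichletSimilarity;
import org.apache.lucene.search.similarities.MultiSimilarity;
import org.apache.lucene.search.similarities.Similarity;

final class MultiSimilarityConfig {

    /** Scores are computed by combining BM25 and Dirichlet-smoothed LM through MultiSimilarity. */
    static void apply(IndexSearcher searcher) {
        Similarity multi = new MultiSimilarity(
                new Similarity[] { new BM25Similarity(), new LMDirichletSimilarity() });
        searcher.setSimilarity(multi);
    }
}
]]></preformat>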
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Re-Ranking</title>
        <p>During various experiments we noticed that some similarity measures and sets of
parameters favored precision at the expense of recall, and vice versa. With the purpose of
combining the advantages of both cases, we opted for a re-ranking method which exploits
different similarities and query parameters. Our implementation is made of two steps. In the first
one we use a query able to obtain a higher recall value when searching the index for relevant
documents, whose parameters and similarity were decided based on empirical trials. In the
following step we use a second query, with better performance in terms of nDCG@5, to re-evaluate
the returned documents and re-rank them according to the new score. We call maxRecall
the first query and maxNdcg the second one. According to our implementation, this approach
turns out to be effective on the 2020 topic set, but at the expense of the time spent in the search
phase, which increases quite substantially.</p>
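        <p>A minimal sketch of the two-step procedure, assuming Lucene 8, two pre-built queries and two IndexSearcher instances opened on the same index but configured with the respective similarities (all details are illustrative, not our actual code):</p>
        <preformat><![CDATA[
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

final class ReRankSketch {

    /** Step 1: broad search with maxRecall; step 2: re-score the candidates with maxNdcg. */
    static ScoreDoc[] reRank(IndexSearcher recallSearcher, Query maxRecall,
                             IndexSearcher ndcgSearcher, Query maxNdcg, int n) throws IOException {
        TopDocs candidates = recallSearcher.search(maxRecall, n);
        ScoreDoc[] reRanked = candidates.scoreDocs.clone();
        for (ScoreDoc sd : reRanked) {
            // explain() returns the score the second configuration assigns to this exact document
            sd.score = ndcgSearcher.explain(maxNdcg, sd.doc).getValue().floatValue();
        }
        Arrays.sort(reRanked, Comparator.comparingDouble((ScoreDoc d) -> d.score).reversed());
        return reRanked;
    }
}
]]></preformat>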
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Setup</title>
      <p>Our work is based on the following experimental setup:
• Repository: https://bitbucket.org/upd-dei-stud-prj/seupd2021-goldr;
• During development and experimentation we used our own computers, and in the end we ran our code on the TIRA platform;
• Similarity functions: BM25, LMDirichlet, and a "MULTI" configuration combining both;
• Apache Maven, Lucene;
• Java JDK version 11;
• Version control system: git;
• Evaluation tool [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        The collection is a set of 387,740 arguments crawled from debatewise.org, idebate.org,
debatepedia.org and debate.org, plus 48 arguments from Canadian parliament discussions. We used the
50 topics of Touché 2020 Task 1 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to train and refine our search engine.
Furthermore, we developed the source code collaboratively through the Bitbucket platform.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Results and Discussion</title>
      <p>In this section we provide graphical and numerical results about the experiments we conducted
during the development of the project. We also discuss these results to derive some useful
insights.</p>
      <sec id="sec-6-1">
        <title>6.1. Our Baseline</title>
        <p>In order to be able to track performance progress during development, we created our own
baseline. For each of the three parsers P0, P1 and P2 we produced a run using a simple analyzer
consisting of a StandardTokenizer, a LowerCaseFilter and a StopFilter where the stop-list used is the
standard one offered by Lucene. The similarity used is BM25. Figure 1 compares the three
baselines. From these results we decided to develop our approach based on the P0 parser. Figures
2 and 3 show how P0 compares to the other parsers when evaluating performance per topic. The
choice of P0 was motivated not only by the statistics in Table 1 but also by its overall efficacy
as shown in the following figures. Figure 1 shows how differently the three approaches can
behave. P0, which extracts only the discussion title, the premise and the conclusion of each
document (with its stance), shows better performance with respect to the other two, retrieving
more relevant documents across the entire run. We chose to discard P1 because its
performance was inferior to the other two, as shown in Figure 2. To better understand the
behavior of P2 versus P0, Figure 3 compares the per-topic average precision. Approximately
10% of the topics, in particular topics 3, 23, 34 and 43 to 46, tend to give better results with parser
P2.</p>
        <sec id="sec-6-1-1">
          <title>6.1.1. Baseline with the custom stop-list</title>
          <p>The results obtained with our custom stop-list are comparable to those obtained with the
stop-list offered by Lucene.</p>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Parameters Tuning</title>
        <p>Our ultimate goal is to maximize the average value of nDCG@5 across all topics. To find the
optimal parameter values, we performed extensive iterative tests, trying all combinations of
parameters with values belonging to discrete intervals we defined. The same experiments were
conducted using three different similarities: BM25Similarity (BM25), LMDirichletSimilarity
(LMD) and MultiSimilarity (MULTI). The MultiSimilarity we used combines BM25Similarity
and LMDirichletSimilarity. The method we developed is governed by five parameters; when
using re-ranking they become ten. Out of those ten, five parameters concern the query that searches
the index aiming to maximize the overall recall, while the other five affect the re-ranking
process and are chosen to maximize nDCG@5.</p>
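        <p>A sketch of this exhaustive search is given below; evaluateNdcg5() is only a placeholder standing for "run the training topics with these settings and average nDCG@5 against the qrels", and every name in the sketch is illustrative.</p>
        <preformat><![CDATA[
import java.util.List;
import org.apache.lucene.search.similarities.Similarity;

// Hypothetical grid search over similarities and boost values.
final class ParameterTuning {

    static void tune(List<Similarity> similarities, double[] tBoosts, double[] sBoosts, double[] pBoosts) {
        double best = Double.NEGATIVE_INFINITY;
        for (Similarity sim : similarities)
            for (double t : tBoosts)
                for (double s : sBoosts)
                    for (double p : pBoosts) {
                        double ndcg5 = evaluateNdcg5(sim, t, s, p);
                        if (ndcg5 > best) {
                            best = ndcg5; // remember sim, t, s and p as the current optimum
                        }
                    }
    }

    private static double evaluateNdcg5(Similarity sim, double t, double s, double p) {
        return 0.0; // placeholder: search the 50 training topics and average nDCG@5
    }
}
]]></preformat>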
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Optimal parameters</title>
        <p>The optimal sets of parameters we found for the maxnDCG query and for the maxRecall query
are available in Tables 2 and 3. The best similarity measure to maximize nDCG@5, according to
our empirical tests, is LMD. To maximize recall, on the other hand, the best measure is MULTI.
As can be seen in Table 2, the best value of tBoost for maxnDCG is 0. This allowed us to come
to the conclusion that considering the title of the discussion (which is the same for all posts in
it) can be misleading with respect to the relevance of the content of each post that is part of it.
The title, however, is very useful to obtain higher recall. This intuition is what pushed us to
abandon the method of re-ranking based on discussion ID, which is described in Section 3.4.</p>
      </sec>
      <sec id="sec-6-4">
        <title>6.4. maxnDCG and maxRecall</title>
        <p>In this section we compare the two queries maxnDCG and maxRecall with the optimal
parameter values established by the tests. From the graph in Figure 5 we can see that the
precision is significantly higher for maxnDCG, while maxRecall has a better recall over the whole
ranking. From these data we believe we obtained the desired result from the two queries.
In Figures 6, 7 and 8 we compare the Average Precision per topic for the three test runs. We can
notice that there is not much difference between maxRecall and P0. The same is not true for
maxnDCG, which proves to be better than P0 in almost all topics. This scenario is further
confirmed by Figure 8, which shows the dominance of maxnDCG over P0. The
performance can be further compared with the numerical results reported in Table 4. From these
data we can appreciate the actual trade-off between the two approaches. The advantage in terms
of recall for maxRecall is significant compared to maxnDCG; the same advantage applies in
the opposite direction for precision and nDCG.</p>
      </sec>
      <sec id="sec-6-5">
        <title>6.5. Re-Ranking Results</title>
        <p>Here we compare the Re-Ranking approach described in Section 4.3 with the previous ones. Figure 9
shows how re-ranking based on maxNdcg (in red) greatly improves the interpolated precision
of maxRecall (in green). It is also possible to note that it allows for better performance with
respect to maxNdcg alone (in orange). Figure 10 shows the Average Precision value obtained
by re-ranking for each topic. From it, it is possible to identify the most problematic topics, for
which the method developed is not very effective. In particular, the most critical topics are:
{2, 8, 22, 40, 44}. From Figure 11 we can see that the Re-Ranking method improves on almost
every topic with respect to the baseline. In Figure 12 it can be seen that the Re-Ranking method improves
on maxNdcg on many topics, except for topics 10 and 20. Figure 13, as expected, clearly shows
the better performance achieved by Re-Ranking with respect to maxRecall; only the first topic is penalized
by the re-ranking process.</p>
      </sec>
      <sec id="sec-6-6">
        <title>6.6. RUN Submission</title>
        <p>The five runs we decided to submit are the following:
• run-1: Re-Ranking approach;
• run-2: like run-1, but proximity searches are performed only with pairs of subsequent tokens;
• run-3: maxnDCG query with LMDirichletSimilarity;
• run-4: maxnDCG query with MultiSimilarity;
• run-5: maxRecall query with MultiSimilarity.
After our experiments, a new corrected version of the qrels file for the 2020 topics was released:
all the previous results are based on the incorrect version. Since we only learned about the corrected
version when the deadline was approaching, we could not recreate all the graphs and comparisons in
Section 6. However, we managed to find the new optimal parameter sets and to repeat the five runs.</p>
        <p>The new data is provided in Tables 6 and 7.</p>
        <sec id="sec-6-6-1">
          <title>Query Similarity tBoost sBoost pBoost</title>
          <p>maxRecall MULTI 0,3 0,2 0,75
maxnDCG LMD 0,15 0,05 0,75</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Statistical Analysis</title>
      <p>Here we analyze our models with some important statistics, evaluating them in a deeper way
via hypothesis testing. We used ANOVA, Student's t-test and boxplots. All the analyses
focused on the mean of two key metrics, Average Precision and nDCG@5. We analyzed all 5
different retrieval models, the ones described in Section 6.6.</p>
      <p>A boxplot gives a visual representation of the location, symmetry and dispersion of the data, and of the
presence of outliers (points that escape the construction of the boxplot).</p>
      <p>It can be appreciated that run1 and run3 have an almost identical structure in terms of median,
interquartile range (IQR) and whisker length. On the other hand, run4 and run5 exhibit lower performance
on this metric because those runs were tuned to maximize recall or nDCG@5; their bottom
whiskers are in fact closer to zero, meaning that for some topics the system did not retrieve
enough relevant documents. All the runs have outliers: the points above the whiskers,
represented by circles, are topics which perform better than the others, or worse if they are below
the boxplot. Run1 and run3 are skewed to the right and show less variance compared to the other
systems.</p>
      <p>Looking at the boxplot of nDCG@5 reveals the same behavior seen before: run1 and run3
produce higher scores compared to the other runs and they seem identical in performance. Run3
has better results with respect to run4, so we can conclude that using different similarities changes
the results dramatically. Run2 shows a smaller IQR than the others, in particular compared to run1, which
shares the same architecture with the only difference being the proximity parameter. We can say
that run1 is able to achieve higher scores by exploiting the more flexible proximity parameter.</p>
      <p>Further analysis with ANOVA and the t-test will help to understand the possible similarities, or lack
thereof, between the systems.</p>
      <sec id="sec-7-1">
        <title>7.1. Hypothesis testing</title>
        <p>The first tool that we utilize is ANOVA (Analysis of Variance), a statistical test of whether or not
the means of several groups are equal. H0, the null hypothesis that all the means are equal, is
tested against the possibility of rejecting or not rejecting it. As we can see in Figure 16, there are
multiple factors to be taken into account to perform a correct analysis of the F-statistic. The
system sum of squares SS_system and the error sum of squares SS_error are divided by their degrees of freedom to
obtain the mean squares (MS). The F-statistic (F) is equal to the ratio of MS_System to MS_Error.
Having a p-value = 0.1849, we cannot reject H0; to do that we should have had a value lower
than α. The significance level α is set at 0.05.</p>
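        <p>For reference, these are the standard one-way ANOVA relations behind the quantities named above (our notation):
          <disp-formula>
            <tex-math><![CDATA[MS_{system}=\frac{SS_{system}}{df_{system}},\qquad MS_{error}=\frac{SS_{error}}{df_{error}},\qquad F=\frac{MS_{system}}{MS_{error}}]]></tex-math>
          </disp-formula>
        </p>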
        <p>We could conclude that our systems are statistically similar in mean Average Precision, but
performing the two-way ANOVA (ANOVA2) test the situation is reversed.</p>
        <p>
          As can be seen from Figure 18 and Figure 19, the null hypothesis can be rejected because
the runs have statistically significant differences. The topic effect, which can be read in cell [3,2]
of Figure 17, is able to express much more variance, which means that this variability is greater
than that of the systems, as we can expect from topics.
        </p>
        <p>Run 3 vs run 4: as we pointed out previously, the use of the LMDirichlet similarity improves the
results. For this pair the null hypothesis can be rejected. As we will see in the nDCG@5
analysis, run 4 and run 5 are different from the others, and the p-value of their statistic suggests
that the two runs are equivalent in mean. Instead, for the others (run 1, run 3, run 2) we cannot
reject H0.</p>
        <p>Unfortunately we have not been able to simplify the reading of the images. Here is the
mapping between the numbers on the axes and the actual runs:
1 - 2 - 3 - 4 - 5 (image numbers) correspond to run1 - run3 - run2 - run5 - run4 (true values). This applies
to Figures 18, 19, 21 and 23.</p>
        <p>Moving the study to nDCG@5, in the ANOVA table of Figure 20 the p-value is lower than the
confidence level, so we can reject the null hypothesis and claim that there is at least one mean
among the various runs that differs significantly from the others. This table does not tell us
which systems are different on average; it just tells us that there is at least one.</p>
        <p>The Tukey Honestly Significant Difference (HSD) test, as in the AP case, answers that question by
creating confidence intervals for all pairwise differences between the systems we want to
compare, while controlling the family-wise error rate; otherwise, the probability of a Type I error
would be inflated.</p>
        <p>Run4 and run5, Figure 22, are statistically different from the others, while between them we cannot
reject the null hypothesis. It seems that adopting the maxRecall or maxnDCG query approach
does not by itself yield performance benefits, except through different similarities, as in the case of run3, or
through re-ranking (run1 and run2). In accordance with what the boxplot chart suggested for run3 and run4, we
can reject H0, since the p-value is below α = 0.05; the effect of the different similarity is visible.</p>
        <p>We can conclude that our re-ranking approach brings a significant improvement over a standard
technique, even if run3 returns comparable results in this analysis. In the future this
behavior can be investigated more carefully.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Failure analysis</title>
      <p>Looking at the results obtained by running the trec_eval program on our runs, we can see the
performance of our information retrieval systems. In particular, we now focus on finding and
understanding for which topics the systems fail to achieve good performance. Therefore we
decided to apply this kind of evaluation to our best run (shanks-run 1). Looking at it, and using
the map field as the reference parameter for the following analysis, we discovered that the top (map)
performance comes from topics 42, 43 and 1 and the worst from topics 22, 12 and 44. Searching
for the reason why, we understood that the main weaknesses of the process are the lack of an
argument quality evaluation phase and of a more consistent lexical analysis process.</p>
      <p>As examples, we take and compare topics 1, 12 and 44.</p>
      <p>Topic 1: Should teachers get tenure?</p>
      <p>Narrative: highly relevant arguments make a clear statement about tenure for teachers in
schools or universities. Relevant arguments consider tenure more generally, not specifically
for teachers, or, instead of talking about tenure, consider the situation of teachers’ financial
independence.</p>
      <p>The process of stop-wording applied to the topic leads to the parsed query
“teachers tenure”, which, even without the whole phrase construction, explains very well what we are
searching for. The documents about this topic are retrieved well by our system; in fact, if we
look for example at the first 5 positions (the most relevant ones for a user), we can see from
the comparison with the qrels that all of them are judged "highly relevant".</p>
      <p>Topic 12: Should birth control pills be available over the counter?</p>
      <p>Narrative: highly relevant arguments argue for or against the availability of birth control
pills without prescription. Relevant arguments argue with regard to birth control pills and
their side efects only. Arguments only arguing for or against birth control are irrelevant.</p>
      <p>The process of stop-wording applied to the topic leads to the parsed query “birth
control pills available”. This seems a quite explicative phrase, but in the retrieval phase the system
has trouble finding the proper relevant results. The critical issue with this query is the
distinction between high and low quality arguments. The lack of an effective argument quality
process makes the program fail in the document quality evaluation, and as a consequence the
system is unable to put the appropriate documents in the right ranking positions. We can notice this
aspect even by looking at the other topic measures (not only map): if we look for example
at the growth of the recall measure, we can notice that it is mainly concentrated in the tail of the
ranking (when many documents have been retrieved).</p>
      <p>Topic 44: Should election day be a national holiday?</p>
      <p>Narrative: highly relevant arguments explain why making election day a holiday is a good
idea, or why not. Relevant arguments mention the fact or its remedy as one of the problems
that elections have.</p>
      <p>The process of stop-wording applied to the topic leads to the parsed query “election
national holiday”. The results obtained from this kind of search are quite bad: the
system retrieves as relevant some discussions like "Potato day should be a national holiday" or
"Star Wars day should be a national holiday". These mismatches come from the fact that the
system does not recognize "election day" as a single mandatory query term, and so documents
about similar topics that differ only in a few words are wrongly retrieved.</p>
      <p>To overcome these kinds of issues it could be useful to equip the system with an argument
quality evaluator, a sort of word-pattern recognizer that catches the words which must not
be separated, and a way to identify structures like "subject, predicate, object" that assigns higher
weights to the subject and marks it as a mandatory word in the documents to be retrieved.</p>
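      <p>The following hypothetical sketch (not something our submitted system does) shows the mandatory-phrase idea for topic 44: "election day" is kept together as a required clause, so discussions about other "... day" holidays no longer match.</p>
      <preformat><![CDATA[
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;

final class MandatoryPhraseExample {

    /** "election day" must appear as a phrase; the rest of the query remains optional. */
    static Query electionDayQuery(Query restOfTheQuery) {
        return new BooleanQuery.Builder()
                .add(new PhraseQuery("body", "election", "day"), BooleanClause.Occur.MUST)
                .add(restOfTheQuery, BooleanClause.Occur.SHOULD)
                .build();
    }
}
]]></preformat>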
    </sec>
    <sec id="sec-9">
      <title>9. Conclusions and Future Work</title>
      <p>At the end of this experiment we found that the changes we made led to a fairly
good improvement of our information retrieval system with respect to the starting baseline [Table 4].
All the statistics underwent a significant increase; as an example, the
improvements are as follows: MAP +39.3%, P5 +68.8%, nDCG@5 +91.4%.</p>
      <p>Comparing our results with last year's, we noticed that they follow the same trend
for the nDCG@5 measure and are in line with those of last year, even though the applied
strategies seem to be fairly different.</p>
      <p>The whole process we followed highlighted a lot of possible extensions that could be
implemented and explored with more time. As an example, it could be interesting to apply
some sort of location-word detection, personal-noun recognition and compound-word
discernment, and to improve or even add a new ranking phase that allows the
insertion of some kind of argument quality analysis. The possibility of implementing a different
approach based on some kind of machine learning model remains open. As we stated in the
initial attempts of Section 3, we dropped the idea of using GPT-2 because it returned unsatisfying
results, so in the future we could go into more detail with this modern tool and improve the
performance of our solution in this way.</p>
      <p>
        Another possible approach that could be exploited derives from the paper [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. It can be
interesting because it takes different runs as input and combines them to reach the best result.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Fellbaum</surname>
          </string-name>
          ,
          <source>WordNet: An Electronic Lexical Database, Bradford Books</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ajjour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wachsmuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <source>args.me corpus</source>
          ,
          <year>2020</year>
          . URL: https://doi.org/10.5281/zenodo.3734893. doi:10.5281/zenodo.3734893.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondarenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Beloucif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gienapp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ajjour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Biemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wachsmuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          , Overview of Touché 2020:
          <article-title>Argument Retrieval</article-title>
          ,
          <source>CEUR-WS</source>
          <volume>2696</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Fox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shaw</surname>
          </string-name>
          , Combination of Multiple Searches, in: D. K. Harman (Ed.),
          <source>The Second Text REtrieval Conference (TREC-2)</source>
          ,
          <source>National Institute of Standards and Technology (NIST)</source>
          ,
          <source>Special Publication 500-215</source>
          , Washington, USA,
          <year>1993</year>
          , pp.
          <fpage>243</fpage>
          -
          <lpage>252</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Soboroff</surname>
          </string-name>
          ,
          <article-title>Overview of the TREC 2009 Web Track</article-title>
          , in: E. M.
          <string-name>
            <surname>Voorhees</surname>
            ,
            <given-names>L. P.</given-names>
          </string-name>
          Buckland (Eds.),
          <source>The Eighteenth Text REtrieval Conference Proceedings (TREC</source>
          <year>2009</year>
          ),
          <article-title>National Institute of Standards and Technology (NIST</article-title>
          ),
          <source>Special Publication 500-278</source>
          , Washington, USA,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondarenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Beloucif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gienapp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ajjour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Biemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wachsmuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          , Overview of Touché 2020:
          <article-title>Argument Retrieval</article-title>
          , in: L.
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Eickhoff</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Névéol (Eds.),
          <source>Working Notes Papers of the CLEF 2020 Evaluation Labs</source>
          , volume
          <volume>2696</volume>
          <source>of CEUR Workshop Proceedings</source>
          ,
          <year>2020</year>
          . URL: http://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>2696</volume>
          /.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>