SEUPD@CLEF: Team 6musk on Argument Retrieval
for Controversial Questions by Using Pairs Selection
and Query Expansion
Notebook for the Touché Lab on Argument Retrieval at CLEF 2022

Lorenzo Cappellotto1 , Matteo Lando1 , Daniel Lupu1 , Marco Mariotto1 ,
Riccardo Rosalen1 and Nicola Ferro1
1
    University of Padua, Italy


                                         Abstract
                                         This report describes the work done for Touché Task 1: Argument Retrieval for Controversial
                                         Questions at CLEF 2022 by team 6musk (whose members are all students of the University of Padua).
                                         This year's task focuses on the problem of retrieving pairs of sentences to support users who search
                                         for arguments to be used in conversations.

                                         Keywords
                                         Touché 2022, Argument Retrieval, Search Engines, Controversial Questions, Pairs of Sentences




1. Introduction
This report aims at providing a brief explanation of the Information Retrieval system built for
Touché lab 2022 [1]. We participated in Touché Task 11 : Argument Retrieval for Controversial
Questions. This year’s task focused on the retrieval of pairs of sentences, instead of whole
documents, from the collection of arguments used for the previous iteration of the task. The
corpus used for the development of the system is a pre-processed version of the args.me corpus
[2] (version 2020-04-01) that was also used during the previous year’s edition of the Touché
task.
   The added complexity of this year's version of Touché Task 1 required us to think about how
to specialize a basic document retrieval system to support this new required functionality.
After developing a basic system for last year's task, we focused on this year's challenges
and on more elaborate ideas to solve them. In a first iteration, we decided to divide the
system into two distinct phases: the first retrieves the most relevant documents for a topic,
and the second selects the two most argumentative sentences for each topic and each

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
lorenzo.cappellotto@studenti.unipd.it (L. Cappellotto); matteo.lando@studenti.unipd.it (M. Lando);
daniel.lupu@studenti.unipd.it (D. Lupu); marco.mariotto.1@studenti.unipd.it (M. Mariotto);
riccardo.rosalen@studenti.unipd.it (R. Rosalen); ferro@dei.unipd.it (N. Ferro)
http://www.dei.unipd.it/~ferro/ (N. Ferro)
ORCID: 0000-0001-9219-6239 (N. Ferro)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                       CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073




                  1
                      https://webis.de/events/touche-22/shared-task-1.html
document retrieved. Once the full pipeline was in place, we tried to upgrade this second phase
with more effective sentence-selection strategies. Our focus was not on producing a top-performing
specialized system, but on developing general ideas for approaching sentence selection and query
expansion.
   The paper is organized as follows: Section 2 introduces related work; Section 3 describes our
approach; Section 4 details the implementation; Section 5 explains our experimental setup;
Section 6 discusses our main findings; finally, Section 7 draws some conclusions and outlines
future work.


2. Related Work
On Argument Search and Retrieval
Argumentation is a daily occurrence in individual and collaborative decision making. We follow
the definition of an argument proposed by [Walton et al. 2008]: a conclusion (claim)
supported by a set of premises (reasons), conveying a stance on a controversial topic
[Freeley and Steinberg, 2009]. Sometimes a premise is left implicit (an enthymeme), and the
mechanism for drawing the conclusion from the premises is informal.
   The Web is the most important and extensive source of information, and almost everyone relies
on Google at some point to fill gaps in their knowledge. However, while search engines
usually provide fast and correct answers to factual questions, the situation is less straightforward
when multiple controversial opinions exist [3]. Moreover, fake news worsens their effectiveness,
forcing users to check the credibility of sources [4] (e.g., the specific website they are visiting).
As an example, determining the stance of Twitter users towards messages may constitute an
indirect way to identify the truthfulness of discussed rumors [5].
The corpus we use is preprocessed from the argument search engine args.me2 , which clearly
identifies sentences (both premises and conclusions) and the premise's stance towards the
conclusion; crawling the Web to extract such data is a research field of its own. Automatically
detecting argumentative content in natural-language text, i.e., argument mining, can help to
determine people's opinions on a given topic and why they hold them, and to extract insights on
public matters such as politics [6]. More details about argument mining can be found in the
survey by Lawrence and Reed [7].

Touché 2021
Our starting point was the Overview of Touché 2021: Argument Retrieval paper [8], which
provided valuable insights on what argumentation is, how its quality can be assessed, and
which retrieval approaches were successful in past editions (this year's being the third).
   We developed our system taking into account two winning techniques: query expansion
through WordNet synonyms/antonyms, and DirichletLM as the best similarity function (among
BM25, DPH, and TF-IDF). Furthermore, we used last year's qrels to identify the best combinations
based on our experiments.



   2
       https://www.args.me/index.html
Query Expansion
Query Expansion (QE) is a technique that automatically expands the initial query issued by the
user, usually with a human-curated thesaurus. It helps to reduce ambiguity and to increase
recall. We added synonyms to the original query in two different ways, namely with WordNet
(vocabulary-based) and Word2vec (corpus-based). The interested reader can find more about
QE in the comprehensive survey by Azad and Deepak [9], which provides an overview of
core methodologies, use cases, and state-of-the-art approaches. Current techniques adopt
transformer-based models, which show more promising results than classical ones.

On Stance Detection and Sentiment Analysis
This year we face an additional layer of complexity: we have to retrieve two sentences with the
same stance, and identify whether that stance is "pro" or "con" the given topic. The Webis Group3
already started tackling the task of predicting whether two arguments share the same stance
through the Same Side Stance Classification shared task [10], which achieved promising results
and showed both that the task is feasible and that there is room for improvement.
Retrieving a relevant pair of sentences is closely related both to stance detection, for which
a tutorial can be found in [11]4 , and to sentiment analysis. For example, Alshari et al. [12]
used a Word2vec model to expand SentiWordNet, a lexical dictionary for sentiment analysis, to
learn the polarity of words in the corpus, and evaluated their approach on the IMDB dataset with
two classifiers (logistic regression and SVM). To estimate the polarity of each non-opinion word
in the vocabulary, they computed the score of a given word based on the polarity of the closest
term present in SentiWordNet [13], with encouraging performance in identifying "positive" and
"negative" reviews. Also, current research on stance detection aims at finding the position of a
person from a piece of text they produce, focusing on social media portals [14].


3. Methodology
In this section we describe how the system was developed, starting from examples studied during
the Search Engines course. Furthermore, we highlight which parts we focused on most and,
in more detail, how we tried to improve the base system.

3.1. Base system, our first steps into IR
As stated in the introduction, the first development iteration aimed at obtaining a fully
functioning pipeline for retrieving pairs of sentences. In this sense, we spent the first part of
the development phase building a basic, functioning pipeline to retrieve documents, as in last
year's edition of Touché Task 1. After concluding this step, we turned to this year's
challenges and focused on upgrading the base system to retrieve sentences and on increasing
performance by trying both off-the-shelf components and custom ones.
The three main components of the basic pipeline are the following:


    3
        https://webis.de/
    4
        https://dkucuk.github.io/stancedetection/
3.1.1. ToucheParser
Since the corpus was provided as a large .csv file, the first important step was to parse it
while keeping memory usage in mind. ToucheParser is a class that specializes the abstract
DocumentParser, an Iterable class used to run through the corpus and to index each document
using the DocumentIndexer.
In particular, using a BufferedReader and the Jackson CsvMapper, each line of the corpus
is transformed into a properly fielded instance of ParsedDocument, which can then easily be
used to index the document with Lucene5 .
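The streaming parse-and-iterate flow can be sketched in plain Java. This is a simplified stand-in for illustration only: the actual system relies on Jackson's CsvMapper for robust CSV handling, and the field layout of ParsedDocument here is invented.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Iterator;

// Simplified sketch of ToucheParser: stream the corpus one line at a time
// (constant memory) and hand back one ParsedDocument per line.
public class ToucheParserSketch implements Iterable<ToucheParserSketch.ParsedDocument> {

    public record ParsedDocument(String id, String conclusion, String premise) {}

    private final BufferedReader reader;

    public ToucheParserSketch(BufferedReader reader) { this.reader = reader; }

    @Override
    public Iterator<ParsedDocument> iterator() {
        return new Iterator<>() {
            private String next = readLine();

            private String readLine() {
                try { return reader.readLine(); }
                catch (IOException e) { throw new RuntimeException(e); }
            }

            @Override public boolean hasNext() { return next != null; }

            @Override public ParsedDocument next() {
                // Naive split for the sketch: a real CSV parser must handle quoted commas.
                String[] f = next.split(",", 3);
                next = readLine();
                return new ParsedDocument(f[0], f[1], f[2]);
            }
        };
    }

    public static void main(String[] args) {
        String csv = "d1,We should act,Because it helps\n" +
                     "d2,We should not,Because it costs";
        var parser = new ToucheParserSketch(new BufferedReader(new StringReader(csv)));
        for (ParsedDocument d : parser) {
            System.out.println(d.id() + " -> " + d.conclusion());
        }
    }
}
```

Each ParsedDocument would then be passed to the indexer, so the whole corpus never needs to reside in memory at once.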

3.1.2. DirectoryIndexer
Once we correctly parsed the lines from the corpus file and created an Iterable class (ToucheParser)
to obtain one ParsedDocument at a time, we proceeded to index all the documents using
the Lucene API.
During indexing, our class expected to receive 365,408 documents, previously extracted
by ToucheParser. The initially adopted StandardAnalyzer extends
org.apache.lucene.analysis.Analyzer. It is a simple analyzer
for which we specified a StandardTokenizerFactory and a LowerCaseFilterFactory to
make all text lowercase. In addition, the BM25Similarity similarity function was used
as a starting point to match the documents with the topics. (The next sections show
how we changed the analyzer and the similarity function.)

3.1.3. BasicSearcher
After checking the integrity of the index using Luke (Lucene Index Toolbox) and verifying that all
fields contained the expected content, we created a basic searcher to retrieve documents.
In particular, after opening an IndexSearcher and using ToucheTopicsReader to read each
topic, the searcher constructs a query from the title of the topic only. The searcher matches it
against the text and the conclusion of the documents in the index, providing the best-matching
documents for each topic (at most 1000). For the initial searcher, StandardAnalyzer and
BM25Similarity were used as analyzer and similarity function, to be coherent with what we
used for indexing.
As shown in the results section, after this part we were able to run the system using
last year's topics and relevance judgments to check the base performance of our system.

3.2. Different approaches for Sentence Selection
The second planned iteration of the development required upgrading the searcher(s) to select
pairs of sentences in a meaningful way, instead of single documents.
For this phase of the development, we decided to upgrade BasicSearcher to trivially retrieve
the conclusion and the first premise of each initially selected document or, if no conclusion
was present, to add another premise to the output (e.g., choosing the first two premises of the
document). This sparked the idea for two different approaches with a common idea at the core,
   5
       https://lucene.apache.org/
doing sentence selection from the documents initially matched by last year's version of
the system. In our view, this is beneficial since pairs of sentences from the same document
intuitively have higher coherence than arbitrary pairs of sentences from the entire collection.
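The trivial selection with its fallback can be sketched as follows; the method and field names are illustrative, not the system's actual API.

```java
import java.util.List;
import java.util.Optional;

// Sketch of the trivial pair selection described above: take the conclusion and
// the first premise, or the first two premises when no conclusion is present.
public class PairSelection {

    public static List<String> selectPair(Optional<String> conclusion, List<String> premises) {
        if (conclusion.isPresent()) {
            return List.of(conclusion.get(), premises.get(0));
        }
        // Fallback: no conclusion available, so pair the first two premises.
        return List.of(premises.get(0), premises.get(1));
    }

    public static void main(String[] args) {
        System.out.println(selectPair(Optional.of("C"), List.of("P1", "P2"))); // [C, P1]
        System.out.println(selectPair(Optional.empty(), List.of("P1", "P2"))); // [P1, P2]
    }
}
```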

3.2.1. SentencesSearcher
SentencesSearcher is the first of the two new solutions. The development of the class
required the generalization of BasicSearcher into an abstract class AbstractSearcher to
follow the DRY principle and avoid repetitions. AbstractSearcher was used to implement
the shared portion of code and later became the superclass for all the rest of the searchers.
Until this point, the sentences were not indexed in a way that allowed fast matching against
the topics, so it became necessary to index them. To separate document retrieval
from sentence retrieval, we chose to create a new index. As with the searchers, we
made use of an abstract class, AbstractDirectoryIndexer, containing the shared
code (DRY principle) for DirectoryIndexerDocument, the indexer for the documents, and
DirectoryIndexerSentences, the indexer for the sentences.
To sum up, the searcher works by retrieving for each topic the required number of documents
using the index constructed by DirectoryIndexerDocument and then, for each document,
providing the two sentences from that document that best match the topic title using the index
constructed by DirectoryIndexerSentences.

3.2.2. ConclusionSearcher
After developing SentencesSearcher, we decided to push further on the idea of coherence
between sentences of the same document. ConclusionSearcher works similarly
to SentencesSearcher: for each topic, it retrieves the required number of documents
using the index constructed by DirectoryIndexerDocument and then chooses the two
sentences by always selecting the first conclusion of the document together with the premise (of
the same document) that best matches the text of the conclusion. To implement this second
step, we constructed a query that considers only the relevant premises in the index created by
DirectoryIndexerSentences and matches them against the text of the conclusion.
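The second step can be illustrated with a toy version that scores each premise by plain term overlap with the conclusion. The real system performs this matching through the Lucene sentence index, so this is only a sketch of the idea, with invented example text.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy version of ConclusionSearcher's second step: pick, among a document's
// premises, the one sharing the most terms with the conclusion.
public class BestPremise {

    static Set<String> terms(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
    }

    public static String bestMatch(String conclusion, List<String> premises) {
        Set<String> c = terms(conclusion);
        String best = premises.get(0);
        long bestScore = -1;
        for (String p : premises) {
            long score = terms(p).stream().filter(c::contains).count();
            if (score > bestScore) { bestScore = score; best = p; }
        }
        return best;
    }

    public static void main(String[] args) {
        String conclusion = "School uniforms should be mandatory";
        List<String> premises = List.of(
            "Taxes fund many public services",
            "Uniforms in school reduce peer pressure and should be kept");
        System.out.println(bestMatch(conclusion, premises));
    }
}
```

A Lucene similarity score replaces the raw overlap count in the actual implementation.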


3.3. Increasing performance
Once the full system (Figure 1) for this year's edition of Task 1 was completed and different
sentence-selection methods were explored, we decided to focus on improving the performance
of the overall system. The way in which we divided the system was beneficial for
experimenting with ideas, since we were able to evaluate them using last year's relevance judgments.
We tried query expansion because it was listed among last year's winning approaches
and we wanted to see its effectiveness on the new topics.
The following paragraphs detail the various off-the-shelf components we tried and the custom
query-expansion techniques used to improve performance.
3.3.1. Trying out off-the-shelf components
To try different off-the-shelf components it was necessary to develop our own analyzer, a
subclass of org.apache.lucene.analysis.Analyzer.
The main components explored were:
    • Stoplists. We experimented with the 'glasgow.txt' and 'smart.txt' stoplists, consisting
      of 319 and 571 common words, respectively.
    • Stemming. We experimented with PorterStemFilter and KStemFilter, the versions of
      the Porter stemmer and the Krovetz stemmer found in the Lucene library.
    • Character n-grams and word n-grams (shingles). Here too the filters are provided
      by Lucene, respectively NGramTokenFilter (with 3 characters) and
      ShingleFilter (applied using 3 words).

The components were chosen using a heuristic approach: starting from the baseline model, we
added one component at a time, selecting at each iteration the best-performing one among
those available.
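This heuristic can be sketched as a greedy forward selection. The evaluation function and component names below are illustrative stand-ins for real retrieval runs scored against the qrels.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.ToDoubleFunction;

// Greedy forward selection: starting from the baseline, repeatedly add the
// component that most improves the evaluation score; stop when nothing helps.
public class GreedyComponentSelection {

    public static List<String> select(Set<String> available,
                                      ToDoubleFunction<Set<String>> evaluate) {
        List<String> chosen = new ArrayList<>();
        Set<String> remaining = new HashSet<>(available);
        double best = evaluate.applyAsDouble(Set.copyOf(chosen));
        while (!remaining.isEmpty()) {
            String bestComponent = null;
            double bestScore = best;
            for (String c : remaining) {
                Set<String> candidate = new HashSet<>(chosen);
                candidate.add(c);
                double score = evaluate.applyAsDouble(candidate);
                if (score > bestScore) { bestScore = score; bestComponent = c; }
            }
            if (bestComponent == null) break; // no component improves the score
            chosen.add(bestComponent);
            remaining.remove(bestComponent);
            best = bestScore;
        }
        return chosen;
    }

    public static void main(String[] args) {
        // Toy evaluation: stoplist and stemmer help, shingles do not (invented scores).
        ToDoubleFunction<Set<String>> toyEval = s ->
                0.30 + (s.contains("stoplist") ? 0.10 : 0)
                     + (s.contains("kstem") ? 0.05 : 0)
                     - (s.contains("shingle") ? 0.02 : 0);
        System.out.println(select(Set.of("stoplist", "kstem", "shingle"), toyEval));
        // -> [stoplist, kstem]
    }
}
```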




Figure 1: Complete pipeline with the different parts



3.3.2. Query expansion based on WordNet
WordNet is the most popular thesaurus6 . Different versions can be found online; for convenience
reasons explained in Section 4, we adopted a Prolog version of WordNet 3.1, which can be
found at https://github.com/ekaf/wordnet-prolog. WordNet is a network in which words are linked
to other, semantically related words. The new searcher, implemented in
WordNetSearcher, extends ConclusionSearcher in the following sense: the query
(topic title) is expanded using all synonyms from WordNet and searched among all documents
(using the Dirichlet similarity function); for each document among the top 1000 retrieved by the
   6
       https://wordnet.princeton.edu/
index searcher, the conclusion is again expanded, using at most 2 synonyms for each term, and
searched among all sentences of this document to find the best possible matching premise (the
other selected sentence is necessarily the conclusion). The value 2 is a heuristic
and can be changed as needed: it is hard to define a precise value for the maximum number
of synonyms without qrels.
   To speed up the search, one can avoid expanding the full conclusion with all
terms in WordNet and instead apply a keep-only filter: for instance, one can decide to keep
only synonyms appearing in the topic description or in the topic narrative (or both), which
likely contain synonyms of words in the topic title. In the end, we opted to fully expand
the conclusion, even though the search then took almost an hour. See Section 7 for details on
speeding things up and on alternative designs.
   One final note: we performed the search of the sentences as in ConclusionSearcher,
since the conclusion is likely to be expanded further than the query
title, being in general longer (a mixed approach could be tried as well).
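The capped per-term expansion can be sketched as follows; the synonym table below is a hand-made stand-in for the WordNet-backed map used by the system.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of conclusion expansion: each term keeps at most MAX_SYNONYMS
// entries from a synonym table (here a toy stand-in for WordNet).
public class CappedExpansion {

    static final int MAX_SYNONYMS = 2; // heuristic cap discussed in the text

    public static List<String> expand(List<String> terms, Map<String, List<String>> synonyms) {
        List<String> expanded = new ArrayList<>();
        for (String t : terms) {
            expanded.add(t);
            List<String> syns = synonyms.getOrDefault(t, List.of());
            expanded.addAll(syns.subList(0, Math.min(MAX_SYNONYMS, syns.size())));
        }
        return expanded;
    }

    public static void main(String[] args) {
        Map<String, List<String>> syn = Map.of(
            "car", List.of("automobile", "auto", "motorcar"));
        System.out.println(expand(List.of("fast", "car"), syn));
        // -> [fast, car, automobile, auto]
    }
}
```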

3.3.3. Query expansion based on Word2vec
Word2vec [15] is a natural language processing technique to produce word embeddings, ideally from a
large corpus of text. Each word is represented as a vector of numbers, and cosine similarity
between vectors captures the semantic relationship between words. Word2vec is a family of neural
network models with one hidden layer, trained either to predict the surrounding context window of
a given word (Skip-gram) or the target word from a context window (CBOW, continuous bag of words).
CBOW better captures syntactic relationships between words, while Skip-gram better captures
semantic ones. For example, given the term "day", CBOW may retrieve "days", whereas Skip-gram may
also retrieve "night", which is semantically close but not syntactically. Skip-gram is less
sensitive to high-frequency words and less likely to overfit, because it looks at single words at
each iteration, whereas CBOW trains on a window of words, meaning that it sees frequent words more
often. For these reasons, we opted for Skip-gram.
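The similarity measure underlying all Word2vec comparisons is plain cosine similarity between embedding vectors; the three-dimensional vectors in the example are made up for illustration (real embeddings have hundreds of dimensions, 512 in our case).

```java
// Cosine similarity between two word embeddings: the dot product of the
// vectors divided by the product of their Euclidean norms.
public class CosineSimilarity {

    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        double[] day   = {0.9, 0.1, 0.3};   // invented embeddings
        double[] night = {0.8, 0.2, 0.4};
        double[] tax   = {-0.1, 0.9, -0.5};
        System.out.printf("day~night: %.2f%n", cosine(day, night)); // close to 1
        System.out.printf("day~tax:   %.2f%n", cosine(day, tax));   // much lower
    }
}
```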


4. Implementation
4.1. Query expansion based on WordNet
The package org.apache.lucene.wordnet provided by Lucene allows adding a synonym
filter from a WordNet-like database in Prolog format. It loads the database into a SynonymMap,
a fast hash map used to retrieve synonyms for any given lowercase word. After
creating this map, we can add a SynonymTokenFilter to our analyzer by passing the map to
its constructor. We implemented an analyzer called WordNetQueryExpander which does
more than set up the synonym filter. In order, the first step sets
up a lowercase filter, then a stop filter, and finally the synonym filter. The constructor of the
analyzer accepts a parameter set that enables us to keep only synonym words contained
in set; as mentioned, this has been used to speed up searching.
The second step applies this keep-only filter when set is not null, and then applies the Krovetz
stemmer.
   We tried two combinations of filters for this analyzer (the synonym filter is always
applied):

    • lowercase filter, stop filter (Glasgow list), synonym filter, and Krovetz
      stemmer (the default);
    • lowercase filter and synonym filter.

Both were tried using BM25 similarity and Dirichlet similarity (see Section 6 for a discussion of the
final results).
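The keep-only filter can be sketched in isolation as follows; the topic text and candidate synonyms are illustrative.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch of the keep-only filter: expanded synonyms survive only when they
// also appear in the topic description (or narrative), which likely contains
// synonyms of the title terms.
public class KeepOnlyFilter {

    public static List<String> keepOnly(List<String> expandedTerms, String topicDescription) {
        Set<String> allowed = Arrays.stream(topicDescription.toLowerCase().split("\\W+"))
                                    .collect(Collectors.toSet());
        return expandedTerms.stream()
                            .filter(t -> allowed.contains(t.toLowerCase()))
                            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> expanded = List.of("teacher", "instructor", "educator");
        String description = "Find arguments on whether every instructor and teacher should";
        System.out.println(keepOnly(expanded, description)); // [teacher, instructor]
    }
}
```

In the analyzer this set-membership test is realized as a token filter, so it runs inside the normal Lucene analysis chain.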

4.2. Query expansion based on Word2vec
Word2vec generates synonyms after training on the corpus, which we preprocessed to remove
noise and improve performance (both training and searching time). The training data for the
Word2vec model was built by removing all duplicates, i.e., documents with the same "sourceId"
sub-field, resulting in 58,962 documents. From "args_processed_04_01.csv" we took the "sourceText"
sub-field, replacing with a whitespace all words with three or more equal consecutive characters,
words containing numbers, and the substring "xa0". All terms were lowercased,
and only those between 3 and 14 characters were kept.
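These cleaning rules can be sketched as follows (the example sentence is invented; the real pipeline replaces rejected words with whitespace inside the original text rather than collecting a list).

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the training-text preprocessing: drop words with three or more
// equal consecutive characters, words containing digits or the "xa0" encoding
// artifact, lowercase everything, and keep only terms of 3 to 14 characters.
public class TrainingTextCleaner {

    public static List<String> clean(String sourceText) {
        List<String> kept = new ArrayList<>();
        for (String w : sourceText.split("\\s+")) {
            String t = w.toLowerCase();
            if (t.matches(".*(.)\\1\\1.*")) continue;  // "sooooo", "!!!", ...
            if (t.matches(".*\\d.*")) continue;        // words with numbers
            if (t.contains("xa0")) continue;           // encoding artifact
            if (t.length() < 3 || t.length() > 14) continue;
            kept.add(t);
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(clean("Sooooo the 2nd argumentxa0really holds"));
        // -> [the, holds]
    }
}
```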
  The model was built using the deeplearning4j library as follows:

Word2Vec vec = new Word2Vec.Builder()
        .stopWords(list)          // stopword list (see below)
        .minWordFrequency(20)
        .layerSize(512)
        .windowSize(10)
        .iterate(iter)            // sentence iterator over the preprocessed corpus
        .tokenizerFactory(t)
        .build();

    • stopWords(list) sets a stopword list based on 'smart.txt' plus approximately 50 of the
      most common terms from the context field.
    • minWordFrequency(20) removes from the vocabulary all words with fewer
      than 20 occurrences.
    • tokenizerFactory(new StandardTokenizer()) uses the default tokenizer of the
      deeplearning4j library.
    • windowSize is the number of words used to compute the numeric vector of a given
      word. Since we had already preprocessed "sourceText" and removed many stop words, we
      kept a relatively low value (10).
    • layerSize is the number of features in the word vector. Default values are around 100;
      we used 512 to try to improve quality, with no apparent downsides. Printing out
      synonyms for query terms, we observed a closer semantic meaning with higher values of this
      parameter (512 vs. 256).

Word2vec run description In the first phase, query expansion was performed on the topic
title using two stoplists ('glasgow.txt' and 'smart.txt') but no stemming, because, as can be seen
from row 1 and rows 2 and 3 (student-students) in Table 1, the expansion already acts somewhat
like lemmatization. Antonyms may also be added (row 6). At search time, only synonyms with a
similarity higher than a heuristically found value (0.53 in the range 0-1) were added, and at most
5 per term of the original query. For the second phase, we used the conclusion field of the
retrieved document as input for another query to find the most similar premise in the document (as in
ConclusionSearcher); the Krovetz stemmer was applied and no stoplist.
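The search-time selection just described can be sketched as follows; the Neighbour record is a stand-in for the model's nearest-neighbour output, populated with the values of Table 1 plus two invented entries below the threshold.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the search-time synonym selection: keep only Word2vec neighbours
// whose similarity exceeds the heuristic 0.53 threshold, at most five per term.
public class Word2vecExpansion {

    public record Neighbour(String word, double similarity) {}

    static final double THRESHOLD = 0.53;
    static final int MAX_PER_TERM = 5;

    public static List<String> select(List<Neighbour> neighbours) {
        return neighbours.stream()
                .filter(n -> n.similarity() > THRESHOLD)
                .sorted(Comparator.comparingDouble(Neighbour::similarity).reversed())
                .limit(MAX_PER_TERM)
                .map(Neighbour::word)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Neighbour> nearest = List.of(          // Table 1 values + invented tail
            new Neighbour("teacher", 0.76), new Neighbour("student", 0.66),
            new Neighbour("students", 0.66), new Neighbour("classroom", 0.65),
            new Neighbour("schools", 0.61), new Neighbour("pupils", 0.55),
            new Neighbour("desks", 0.40));
        System.out.println(select(nearest));
        // -> [teacher, student, students, classroom, schools]
    }
}
```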


Table 1
Example of synonyms/antonyms generated with Word2vec
                                       Query Term   Synonym     Similarity
                                  1    teachers     teacher     0.76
                                  2    teachers     student     0.66
                                  3    teachers     students    0.66
                                  4    teachers     classroom   0.65
                                  5    teachers     schools     0.61
                                  6    legal        illegal     0.60


5. Experimental Setup
This section contains more details on which hardware was used to run the experiments, alongside
which tools and measures were used to evaluate the performance.
The hardware used to create the indices, search the topics and evaluating the performance is:

    • OS: MacOS Big Sur
    • CPU: AMD Ryzen 7 3800X 3.9 Ghz
    • RAM: 16GB DDR4 3200MHZ
    • Storage: SSD M.2 PCIe NVMe 512GB

The development of the system was based on the experimental collections created for this year's
and last year's editions of Touché Task 1. The evaluation tools used during development are
trec_eval7 , to compute the measures, and Luke, a GUI that lets you look inside the index Lucene
creates, used to check index health and coherence.
The evaluation measures used to check the system during the various phases of development
are:

    • nDCG_cut_5, or Normalized Discounted Cumulated Gain at cut 5: the main metric used for
      evaluating the systems in Touché Task 1.
    • MAP, or Mean Average Precision: a single number giving an overall view of the
      system's performance; it offers a different perspective w.r.t. nDCG_cut_5 since it is based
      on binary relevance. Moreover, the more qrels available, the more accurate it is, and we
      had the ground truth of the previous two editions.
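For reference, nDCG@5 can be sketched as follows. This is a simplification: the ideal ranking here is computed from the retrieved gains only, whereas trec_eval builds it from all judged documents for the topic; the gain values in the example are invented.

```java
import java.util.Arrays;

// Sketch of nDCG at cut 5: discounted cumulated gain of the ranking,
// normalized by the DCG of the ideal (descending-gain) ranking.
public class Ndcg {

    public static double dcgAt(int[] gains, int cut) {
        double dcg = 0;
        for (int i = 0; i < Math.min(cut, gains.length); i++) {
            dcg += gains[i] / (Math.log(i + 2) / Math.log(2)); // discount: log2(rank + 1)
        }
        return dcg;
    }

    public static double ndcgAt5(int[] rankedGains) {
        int[] ideal = rankedGains.clone();
        Arrays.sort(ideal);
        // reverse into descending order for the ideal ranking
        for (int i = 0; i < ideal.length / 2; i++) {
            int tmp = ideal[i];
            ideal[i] = ideal[ideal.length - 1 - i];
            ideal[ideal.length - 1 - i] = tmp;
        }
        double idealDcg = dcgAt(ideal, 5);
        return idealDcg == 0 ? 0 : dcgAt(rankedGains, 5) / idealDcg;
    }

    public static void main(String[] args) {
        // Graded relevance of the top 5 retrieved sentences (invented).
        System.out.printf("%.4f%n", ndcgAt5(new int[]{2, 0, 1, 2, 0}));
    }
}
```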

  The git repository containing the source code of the system is available at the link seupd2122-
6musk.




   7
       https://github.com/usnistgov/trec_eval/
6. Results
6.1. Performance on This Year’s Relevance Judgments
This year another class of judgments was provided besides the quality and relevance ones: in order
to establish whether a pair of sentences contradict each other, the Touché organizers supplied
coherence qrels. We want to understand whether our search strategies actually differ:
comparing performance scores between runs always requires statistical analysis; to do so, we
compare the per-topic distributions of nDCG values among the 5 runs we submitted to CLEF.
The runs, chosen based on last year's performance, are:
   1. seupd2122-6musk-kstem-stop-shingle3 (uses 3.2.1 and off-the-shelf components described
      in 3.3.1)
   2. seupd2122-6musk-stop-kstem-basic (described in 3.1.3)
   3. seupd2122-6musk-stop-kstem-concsearch (described in 3.2.2)
   4. seupd2122-6musk-stop-wordnet-kstem-dirichlet (described in 3.3.2)
   5. seupd2122-6musk-word2vec-sentences-kstem (combines searcher from 3.2.1 and expan-
      sion technique described in 3.3.3)
We have the following results for NDCG on each type of judgment:

Table 2
nDCG@5 for all judgments
 Run                                              nDCG@5 qual       nDCG@5 rel      nDCG@5 coh
 seupd2122-6musk-kstem-stop-shingle3                 0.7258           0.6378           0.3699
 seupd2122-6musk-stop-kstem-basic                    0.2876           0.1767           0.1962
 seupd2122-6musk-stop-kstem-concsearch               0.7244           0.5881           0.3415
 seupd2122-6musk-stop-wordnet-kstem-dirichlet        0.7299           0.6055           0.3622
 seupd2122-6musk-word2vec-sentences-kstem            0.7183           0.5822           0.3374

  This table only shows the means computed by the trec_eval tool over all topics; to get a better
idea of the whole distribution, boxplots are useful:




Figure 2: nDCG@5 for quality qrels
Figure 3: nDCG@5 for relevance qrels
Figure 4: nDCG@5 for coherence qrels


  Informally, we can say that runs 1, 3, 4, and 5 all perform better than run 2, which intuitively
makes sense, since the core of that searcher is very basic. Runs 3, 4, and 5 are almost equal; in
particular, run 4 shows a lower interquartile range. To verify these statements we apply
both one-way ANOVA and pairwise Student's t-tests.
Table 3
F statistic and p-values
                                        qrels        F stat    p-value
                                        quality      31.85     2.3e-21
                                        relevance    29.38     5.8e-20
                                        coherence     4.68     0.001

  Table 3 confirms that, for each qrels file, the means of the nDCG distributions are not all
equal among the runs, as the boxplots anticipated (since the p-values are less than 0.01, we reject
the null hypothesis that all means are equal). To better understand pairwise differences, we show
the p-values of the pairwise t-tests:




Figure 5: p-values of pairwise t-tests, quality qrels
Figure 6: p-values of pairwise t-tests, relevance qrels
Figure 7: p-values of pairwise t-tests, coherence qrels
  So run 2 is in every case worse than all the other runs, since the p-values are < 0.01, as we
suspected. We cannot infer further differences between runs from the tables. Note that run 2
and run 3 use the same analyzer and differ only in the type of searcher, so we can say that
searcher 3.2.2 has a very substantial impact on the final result w.r.t. 3.1.3.
We can also point out that even though run 1 and run 5 use very different first phases (and both
use 3.2.1 as the second phase), they are not statistically different. Perhaps this suggests
another interesting test: running the same first phase as in runs 2 and 3, but in combination
with the sentences searcher. This would have enabled us to see the difference in performance
given by 3.2.1 alone with respect to the other two sentence-selection methods.


7. Conclusions and Future Work
Comparing our results on document retrieval with last year's overview paper [8] allowed us
to verify that the system we were developing performed satisfactorily, so that we could
proceed with sentence selection and retrieval. With the release of the qrels, we will be able
to discern which of the multiple approaches worked better and to fine-tune parameters (e.g.,
similarity functions, Word2vec training parameters, and the WordNet maximum number of synonyms).
   An evaluation measure for Touché 2022 Task 1 is the coherence between sentences; we may
want to further improve it by means of character n-grams, word n-grams, and skip-grams, which
overcome data sparsity. Since sentences, especially conclusions, are often short phrases that
should be meaningful claims, finding matching text between them may increase
syntactic similarity and, consequently, coherence. For this reason, we are interested in seeing
how these techniques could perform.
   In the run file we have to print the stance w.r.t. the query and, as can be seen from the args.me
search engine's API8 , the one available in the premises field is towards the conclusion. So
we do not have immediate information on whether a retrieved argument is "pro" or "con" the original
query, and we may try to address this issue through sentiment analysis, i.e., the application of
natural language processing (NLP) to understand whether the explicit or implicit opinion in a
sentence is positive, negative, or neutral. Through the use of Part-of-Speech (PoS) tagging, for
example with Apache OpenNLP 9 , and a human-validated sentiment lexicon, we can define a polarity
score for the premises and compare it with the stance towards the conclusion to derive the final
returned stance. Since "each sentence in the pair must ideally be the most representative/most
important of its corresponding argument", sentiment analysis may also lead to an improvement in
this regard.
   As a further development, we would like to retrieve sentences from different documents
to encourage diversity, as suggested by the Touché organizers. To ensure coherence between
sentences and the query text, we may want to identify keywords, weight them, and try synonym
expansion with terms from the description or narrative fields of the topic file.




    8
        https://www.args.me/api-en.html
    9
        https://opennlp.apache.org/
References
 [1] A. Bondarenko, M. Fröbe, J. Kiesel, S. Syed, T. Gurcke, M. Beloucif, A. Panchenko, C. Bie-
     mann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2022: Argu-
     ment Retrieval, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction.
     13th International Conference of the CLEF Association (CLEF 2022), Lecture Notes in
     Computer Science, Springer, Berlin Heidelberg New York, 2022, p. to appear.
 [2] Y. Ajjour, H. Wachsmuth, J. Kiesel, M. Potthast, M. Hagen, B. Stein, Data Acquisition for
     Argument Search: The args.me corpus, in: C. Benzmüller, H. Stuckenschmidt (Eds.), 42nd
     German Conference on Artificial Intelligence (KI 2019), Springer, Berlin Heidelberg New
     York, 2019, pp. 48–59. doi:10.1007/978-3-030-30179-8\_4.
 [3] M. S. Shahshahani, J. Kamps, Argument retrieval from web, in: International Conference
     of the Cross-Language Evaluation Forum for European Languages, Springer, 2020, pp.
     75–81.
 [4] M. Samadi, P. Talukdar, M. Veloso, M. Blum, Claimeval: Integrated and flexible framework
     for claim evaluation using credibility of sources, in: Thirtieth AAAI Conference on
     Artificial Intelligence, 2016.
 [5] G. Gorrell, E. Kochkina, M. Liakata, A. Aker, A. Zubiaga, K. Bontcheva, L. Derczynski,
     SemEval-2019 task 7: RumourEval, determining rumour veracity and support for rumours,
     in: Proceedings of the 13th International Workshop on Semantic Evaluation, Association
     for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 845–854. URL:
     https://aclanthology.org/S19-2147. doi:10.18653/v1/S19-2147.
 [6] H. Hüning, L. Mechtenberg, S. Wang, Detecting arguments and their positions in experi-
     mental communication data, Available at SSRN 4052402 (2022).
 [7] J. Lawrence, C. Reed, Argument mining: A survey, Computational Linguistics 45 (2020)
     765–818.
 [8] A. Bondarenko, L. Gienapp, M. Fröbe, M. Beloucif, Y. Ajjour, A. Panchenko, C. Biemann,
     B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2021: Argument
     Retrieval, in: K. Candan, B. Ionescu, L. Goeuriot, H. Müller, A. Joly, M. Maistro, F. Piroi,
     G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and
     Interaction. 12th International Conference of the CLEF Association (CLEF 2021), volume
     12880 of Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2021, pp.
     450–467. URL: https://link.springer.com/chapter/10.1007/978-3-030-85251-1_28. doi:10.
     1007/978-3-030-85251-1\_28.
 [9] H. K. Azad, A. Deepak, Query expansion techniques for information retrieval: a survey,
     Information Processing & Management 56 (2019) 1698–1735.
[10] B. Stein, Y. Ajjour, R. El Baff, K. Al-Khatib, P. Cimiano, H. Wachsmuth, Same side stance
     classification, Preprint (2021).
[11] D. Küçük, F. Can, A tutorial on stance detection, in: Proceedings of the Fifteenth ACM
     International Conference on Web Search and Data Mining, 2022, pp. 1626–1628.
[12] E. M. Alshari, A. Azman, S. Doraisamy, N. Mustapha, M. Alksher, Senti2vec: An effective
     feature extraction technique for sentiment analysis based on word2vec, Malaysian Journal
     of Computer Science 33 (2020) 240–251.
[13] S. Baccianella, A. Esuli, F. Sebastiani, Sentiwordnet 3.0: An enhanced lexical resource
     for sentiment analysis and opinion mining, in: Proceedings of the Seventh International
     Conference on Language Resources and Evaluation (LREC’10), 2010.
[14] A. ALDayel, W. Magdy, Stance detection on social media: State of the art and trends,
     Information Processing & Management 58 (2021) 102597.
[15] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations
     in vector space, 2013. URL: https://arxiv.org/abs/1301.3781. doi:10.48550/ARXIV.1301.
     3781.