CLEARNESS: Coreference Resolution for Generating and Ranking Arguments Extracted from Debate Portals for Queries

Johannes Weidmann, Lorik Dumani and Ralf Schenkel
Trier University, Behringstraße 13, 54286 Trier, Germany

Abstract
Argumentation has always been used by humans to convince other people of certain viewpoints, e.g., to push through personal interests or to resolve conflicts. The research field of computational argumentation deals with the extraction, analysis, retrieval, and generation of arguments in natural language texts. Modern argument search engines are able to generate appropriate arguments for controversial topics, very often based on posts taken from debate portals. However, an important issue is that posts in these portals are often quite long and incomprehensible. Apart from that, long posts in debate portals cannot be arguments by definition, as not every text is argumentative. In this paper, we build on preliminary work on argument search engines and present CLEARNESS (Coreference resoLution for gEnerating and rAnking aRgumeNts Extracted from portalS for querieS), an approach to generate arguments in response to a query. The arguments we focus on consist of two essential elements: a claim, which is a point of view on a topic, and a premise, which provides the reasons or evidence backing up the claim. While previous works address issues like the generation of claims or the creation of abstract summaries of texts, we pursue a high-precision retrieval approach. We extract fine-grained premises from argumentative texts and modify these through coreference resolution to obtain an isolated text that is, although short, both coherent and complete. We first build up a database by extracting arguments from the Args corpus, which collects arguments from a number of popular debate portals. In these argumentative texts, we identify all coreferences and resolve them. Next, we examine classic and state-of-the-art approaches to rank arguments in response to a query. Additionally, we study the ranking behavior when utilizing query expansion. Lastly, we investigate the performance with respect to relevance and coreference resolution.

Keywords
computational argumentation, coreference resolution, argument generation, argument retrieval

1. Introduction
Debates, be they verbal or non-verbal, have always been an integral part of human interaction and continue to play a significant role in societies, cultures, and politics to this day. They are essential for understanding a given topic and for making informed decisions, as they stimulate critical thinking and facilitate problem-solving. Additionally, new perspectives encourage questioning one's own worldview and integrating new ideas into one's own opinion.

LWDA'23: Lernen, Wissen, Daten, Analysen. October 09-11, 2023, Marburg, Germany
s4joweid@uni-trier.de (J. Weidmann); dumani@uni-trier.de (L. Dumani); schenkel@uni-trier.de (R. Schenkel)
ORCID: 0009-0008-7095-4332 (J. Weidmann); 0000-0001-9567-1699 (L. Dumani); 0000-0001-5379-5191 (R. Schenkel)
However, the prerequisite for participating in a debate is to research information on the topic of discussion. This is typically done through various media sources such as newspapers, websites, television, or social media. Although nowadays there are almost limitless possibilities to inform and educate oneself, it is also becoming increasingly challenging to verify and filter relevant information. Social media has created a gigantic amount of information, most of which cannot be fact-checked. Fake news, the spread of manipulative false news, has become a global problem. For example, according to a statistic on fake news in the USA [1], during the Covid-19 pandemic, 80 % of people encountered fake news, but only 26 % were able to identify it as false.

Another problem is that people are engaging with topics with less concentration. Research teams analyzed different media and examined how long a topic remained popular on Twitter, Reddit, and other platforms. While in 2013, on average, a hashtag remained in the top 50 list for 17.5 hours, by 2016, it only stayed there for an average of 11.9 hours [2]. Interest in individual topics tends to decrease over time. Concurrently, the desire to constantly jump from one topic to another is increasing. The decline in attention span is exacerbated nowadays by short videos on YouTube, Instagram, or TikTok. These findings are highly concerning and can potentially harm our culture of discourse in the future.

Therefore, it is all the more necessary to support people in forming informed opinions in times of social media, accompanied by an influx of information and uncertainty about the truthfulness of facts. We therefore consider the field of Computational Argumentation (CA) to be essential in assisting individuals in this process. CA is a subfield of NLP that, among others, deals with the extraction, analysis, and generation of arguments in natural language texts [3]. In particular, the findings of CA contribute to creating argument generation systems that support people in researching a topic or forming an opinion.

Argument search involves gathering pertinent premises and claims related to a specific, often contentious topic. An argument [4] can be defined to consist of two components, a claim and a premise. The claim is a controversial statement of which the arguer wants to persuade or dissuade the audience. A premise serves as evidence or a clue to increase or decrease the acceptance of the claim. The polarity of a premise, i.e., whether it supports or opposes the claim, is called its stance. An example of a claim could be "Teachers should get tenure"; two supporting premises could be 𝑝1 = "Teacher tenure provides stability within schools" and 𝑝2 = "It protects teachers' academic freedom". The objective of an argument search engine is to furnish users with substantiated statements that aid in acquiring knowledge about their subject of interest and potentially facilitate their decision-making process [5]. An example of an argument search engine is Args [6, 7]. This platform provides the user with arguments on a topic, divided according to their stance on a particular issue. Args and other platforms use posts from debate portals as their underlying dataset of arguments.
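To make the claim-premise-stance structure used throughout this paper concrete, the following minimal sketch shows one way such an argument record could be represented in code; the class and field names are our own illustration and not part of the Args corpus schema.

```python
from dataclasses import dataclass
from typing import List, Literal

@dataclass
class Premise:
    text: str                      # a single statement backing up (or attacking) the claim
    stance: Literal["pro", "con"]  # polarity of the premise towards the claim

@dataclass
class Argument:
    claim: str                # controversial statement, e.g. the debate title
    premises: List[Premise]   # evidence for or against the claim

example = Argument(
    claim="Teachers should get tenure",
    premises=[
        Premise("Teacher tenure provides stability within schools", "pro"),
        Premise("It protects teachers' academic freedom", "pro"),
    ],
)
```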
As Args takes user-generated posts into account as arguments, the returned premises for user queries can become lengthy and, when considering individual sentences, linguistically unintelligible. Pronouns are often used, referring to previously mentioned objects in the sentence. If multiple objects exist, it can be difficult to determine which pronoun refers to which object. Additionally, in long discussions, a comment may refer to a post that occurred much earlier, which may not be evident in the individual sentence. However, when a sentence is considered in isolation, the context needed for understanding might be missing. Since in this work we consider standalone sentences as short premises, a challenge is that, without additional information, the meaning of a sentence may be lost. As an example, imagine the two aforementioned premises 𝑝1 and 𝑝2 were written sequentially in one post, but only the second sentence (𝑝2) were retrieved as output, isolated from 𝑝1. Then, it is difficult to understand 𝑝2, as it uses a pronoun ("It") that refers to a subject ("Teacher tenure") in 𝑝1.

While existing works focus on topics such as motion-aware claim generation [8] or belief-based claim generation [9], we pursue a different goal in this study for generating premises. More precisely, given a query, we aim to generate short but relevant and coherent premises from existing valuable texts, employing a high-precision approach. To tackle the problem of current argument search engines with lengthy posts as premises, we seek to extract fine-grained premises from these posts using coreference resolution. Coreference is a concept that describes the relationship between two or more expressions that refer to the same thing or person, as shown in the above example. An expression can be a pronoun, a noun, or another type of word that refers to another noun in the text [10]. The goal is to extract short premises from texts and then resolve their coreferences, improving retrieval with arguments that remain understandable when their sentences are considered individually. For the remainder of the paper, we denote isolated sentences as premises; when we refer to longer premises, we call them posts. Additionally, we enhance the query (internal knowledge) by performing query expansion, which involves generating synonymous queries using ChatGPT.

2. Related Work
We now discuss various contemporary papers that have explored the field of argument generation. We specifically highlight diverse approaches and techniques, revealing the complexity and depth of existing work in this area. The purpose of this literature review is to identify strengths and weaknesses in the current research landscape, while also establishing the originality and relevance of our paper. It underscores the evolution of the field and justifies the proposed study's exploration of argument generation with coreference resolution.

Alshomary et al. [9] focus on tailoring the generation of arguments according to the beliefs and convictions of the audience, which had not been considered in previous works. The study concludes that while there are limitations in modeling users' beliefs based on their stances, the results demonstrate the potential of encoding beliefs into argumentative texts. This lays the groundwork for future exploration of audience reach in argumentative discourse.

Al-Khatib et al. [11] explore how arguments can be generated and controlled using a knowledge graph.
While previous research has already enhanced models like GPT-2 with knowledge, the use of an external knowledge graph is a new approach. Their findings demonstrate that their approach is capable of generating high-quality arguments by enriching the models with complex, interconnected knowledge.

Opitz et al. [12] consider new metrics for argument similarity that offer both high performance and interpretability. Previous approaches often lacked interpretable evidence or justifications for their evaluations, making it difficult to understand the features that determine argument similarity. The study suggests using Abstract Meaning Representation (AMR) graphs to represent arguments and demonstrates that new AMR graph metrics can provide explanations for argument similarity ratings. The AMR similarity statistics provided initial indications of what could be considered a good conclusion, even without a reference comparison. They examined two hypotheses: (1) AMR semantic representation and graph metrics help in evaluating argument similarity, and (2) automatically derived conclusions can support or enhance the evaluation of argument similarity. Evidence was found for the former hypothesis but not for the latter.

Another approach is the use of argument generation frameworks. Hua et al. [13] present a framework called CANDELA for the automatic generation of counter-arguments, supported by a retrieval system and two decoders (text planning and content realization). Unlike previous approaches, this approach is more precise as it considers language style, such as "In theory, I agree with you". CANDELA outperforms other methods like Seq2Seq, showing significantly better ROUGE, BLEU, and METEOR scores. In the evaluation conducted by human judges, who were asked to rate arguments on a scale of 1 (worst) to 5 (best) based on grammar, appropriateness, and content richness, CANDELA performed best, delivering arguments with richer content and human-like responses.

Schiller et al. [14] introduce a language model called Arg-CTRL, which can be used for argument generation and offers control over argument-related aspects such as topic, stance, and aspect. The Arg-CTRL model is trained using a Common Crawl dump with 331 million documents and a Reddit dump with 2.5 trillion documents as the data source. Generated arguments are evaluated by human judges for grammatical correctness and persuasiveness. The evaluation also demonstrates that the arguments generated by Arg-CTRL can compete qualitatively with human arguments, as measured by WA-scores. The results show that arguments generated using this approach are generally authentic and of high argumentative and grammatical quality. Refining Arg-CTRL with data from Common Crawl leads to a higher quality of generated arguments compared to using user discussions from Reddit comments.

In the mentioned studies, new arguments are generated through the use of knowledge graphs, frameworks, or specialized argument language models. In this study, however, we utilize existing arguments from the text corpus. We apply coreference resolution to these arguments, meaning that if there are coreferences within a sentence, they will be resolved. If no coreferences are present, we retain the sentence in its original form. Thus, in the case of a resolved coreference, a sentence is obtained that is structurally similar to the original, but with references replaced by the intended entities.
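The replacement step itself can be illustrated with a small, self-contained sketch: given coreference clusters as character spans (as produced by a coreference tool), each non-representative mention is substituted by the cluster's representative mention. This is only an illustration of the principle; the cluster format and function names are our own and do not mirror any particular library's API.

```python
from typing import List, Tuple

# A cluster is a list of (start, end) character spans referring to the same entity;
# we treat the first span as the representative mention (e.g. "Teacher tenure").
Cluster = List[Tuple[int, int]]

def resolve_coreferences(text: str, clusters: List[Cluster]) -> str:
    """Replace every non-representative mention by its cluster's representative mention."""
    replacements = []
    for cluster in clusters:
        rep_start, rep_end = cluster[0]
        representative = text[rep_start:rep_end]
        for start, end in cluster[1:]:
            replacements.append((start, end, representative))
    # Apply replacements from right to left so earlier offsets stay valid.
    resolved = text
    for start, end, rep in sorted(replacements, reverse=True):
        resolved = resolved[:start] + rep + resolved[end:]
    return resolved

text = "Teacher tenure provides stability within schools. It protects teachers' academic freedom."
clusters = [[(0, 14), (50, 52)]]  # "Teacher tenure" <- "It"
print(resolve_coreferences(text, clusters))
# Teacher tenure provides stability within schools. Teacher tenure protects teachers' academic freedom.
```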
Therefore, our approach does not generate new arguments; it makes existing, invaluable user-generated opinions more visible by modifying them only through coreference resolution, in order to enhance the coherence and persuasiveness of the sentences.

Another method to generate new arguments is to enrich the dataset with external information. Yu et al. [15] summarize methods and techniques in a review on how to generate text using additional knowledge. A widely used approach for text generation is the use of text generation models. Text generation can be enhanced by internal knowledge derived solely from the query. For example, the query can be expanded by extracting and incorporating its topic, enabling the system to produce more accurate outputs that align with the query. Additionally, individual keywords from a predefined vocabulary that summarize the query can help prevent the generation of generic outputs that do not precisely match the query. In this work, we facilitate argument generation by utilizing an external dataset consisting of 5,933 argumentative texts (external knowledge) that have already been labeled with topics and stances.

3. Dataset
In this section, we describe our dataset, its source, and its transformation into a usable format. As our main dataset, we use the Args corpus [16], a large open-source dataset that consists of 387,606 arguments (i.e., posts) extracted from online debate portals. (The sources of Args are debatewise.org, idebate.org, debatepedia.org, and debate.org.) Debate portals are websites specifically designed to organize and regulate debates. Users can post their opinions and viewpoints on controversial topics, as well as agree or disagree with others. Overall, Args consists of 59,637 debates with 200,099 posts with a pro stance and 187,507 with a con stance. More precisely, in the Args corpus an argument is stored as a claim (yielded from the debate title) and premises (yielded from the posts to that debate). The data is available in JSON format. The average count of arguments per claim is 5.5, while 90 % of the claims have 1 to 10 arguments. In general, the debates are rather short, with about 62 % of debates having 6-10 arguments. The average count of arguments per debate amounts to 6.5. In our work, we process the Args corpus by splitting the posts, or premises, into sentences.

In order to ensure a robust system, we pick 30 out of the 50 topic titles provided in the XML file of the CLEF lab Touché 2022 as queries. Touché is an initiative focused on argument retrieval (the task is "Argument Retrieval for Controversial Questions": https://touche.webis.de/clef22/touche22-web/argument-retrieval-for-controversial-questions.html). Given a controversial topic, the goal of this task was to rank arguments by relevance, argument quality, and stance. However, this is not part of this study, as we only investigate the effect of coreference resolution on the ranking in comparison with the original sentences; including other ranking measures could skew the findings. With regard to the relevance of the arguments, we make use of the Qrels ("Query relevance judgments"), which contain relevance assessments for a set of queries and documents and are also provided by the lab. For easier handling, we developed separate parsers for each dataset to merge them into one dataset.

4. Methods
Our main goals are to evaluate the effect of different methods for coreference resolution and query expansion. Our retrieval pipeline is illustrated in Figure 1. The first step of the process is to collect the data from the data sources, in our case the processed dataset Args (see Section 3).
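As an illustration of the data transformation described in Section 3, the following sketch loads an args.me-style JSON dump and splits each post into sentence-level premise candidates. The field names (`arguments`, `conclusion`, `premises`, `text`, `stance`) follow the general structure of the args.me corpus but should be checked against the concrete corpus version; the sentence splitter is a simple stand-in, not the one used in the paper.

```python
import json
import re
from typing import Dict, Iterator, List

def split_sentences(text: str) -> List[str]:
    # Simple stand-in splitter; a production pipeline could use a proper sentence tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def iter_premise_sentences(path: str) -> Iterator[Dict[str, str]]:
    """Yield one record per sentence of every post in an args.me-style JSON dump."""
    with open(path, encoding="utf-8") as f:
        corpus = json.load(f)
    for argument in corpus["arguments"]:
        claim = argument["conclusion"]        # debate title used as claim
        for premise in argument["premises"]:  # posts attached to the debate
            for sentence in split_sentences(premise["text"]):
                yield {"claim": claim, "premise": sentence, "stance": premise["stance"]}

# Example usage (the file name is hypothetical):
# for record in iter_premise_sentences("args-me.json"):
#     print(record["stance"], record["premise"])
```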
Next, we use query expansion to enhance the queries. This is expected to increase the recall of the (relevant) results, meaning that a higher proportion of relevant premises will be found, while affecting the ranking as little as possible, since we pursue a high-precision approach. For query expansion, we do not employ conventional approaches such as document collection analysis or pseudo-relevance feedback, which utilizes user ratings on the returned results. Instead, we utilize a zero-shot approach and ask ChatGPT for five queries similar to the original query. (For reproducing the experiments with ChatGPT, the expanded questions can be downloaded from https://basilika.uni-trier.de/nextcloud/s/nzoJfd9t9HRDTHN.) After identifying the relevant premises, we proceed with the text modification by applying coreference resolution to get small but coherent premises. We apply the algorithm to each retrieved text and resolve all coreferences. As a result, we obtain all texts from our database with resolved references.

Figure 1: Pipeline for our argument search engine. (The pipeline expands the Touché topics into queries, transforms the args.me arguments from the debate portals, resolves coreferences in the arguments, matches and ranks the resolved arguments against the queries, and outputs the premises; the Touché Qrels are used for evaluation.)

We then input the texts resolved through coreference resolution, along with the original and expanded queries, into our ranking system. The system provides us with relevant premises for each query, regardless of whether it is original or expanded, by sorting the premises based on scores between query and premise that indicate their relevance; i.e., in the end, the user gets a fixed number of premises, sorted in descending order of estimated relevance. Within this section, we compare different implementations of the pipeline's components and derive our final configuration from these findings.

4.1. Coreference Resolution
We conducted preliminary tests with five different coreference libraries: neuralcoref, allennlp, coreferee, stanfordnlp, and fastcoref (https://github.com/huggingface/neuralcoref, https://github.com/allenai/allennlp, https://github.com/msg-systems/coreferee, https://stanfordnlp.github.io/CoreNLP/coref.html, https://pypi.org/project/fastcoref/). To obtain the best possible result and identify as many coreferences as possible, we selected 10 argumentative texts with an average of 280 words and applied the mentioned coreference resolution algorithms to them. Then, we compared the texts and analyzed the number and quality of the identified coreferences. Each library provides a list of coreferences for a text, which consists of pairs of an object and a pronoun referring to the same real-world entity. We used these coreferences as a reference point to evaluate the tools in terms of their results. The goal was to identify the best tool based on the quantity and quality of the identified references.

Prior to running each tool on the texts, we manually defined a set of ground-truth coreferences known to exist in the texts. We counted every coreference, even if it would not fit into the text when resolved later; pairs such as "we" and "us" are thus counted as coreferences because both refer to the same real entity. If a tool identifies a coreference that we have defined as true, it is considered a True Positive. Conversely, if the tool identifies a faulty coreference, it is considered a False Positive. False Negatives are those coreferences that the tool fails to detect although they actually exist in the text. True Negatives are those that neither exist nor are identified by the tool. For each tool, we calculated the precision (the proportion of correctly identified coreferences out of all coreferences detected by the tool), the recall (the proportion of correctly identified coreferences out of all ground-truth coreferences that exist in the text), and the F1-score (the harmonic mean of precision and recall).
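In symbols, with TP, FP, and FN denoting the counts of true positives, false positives, and false negatives as defined above:

\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.
\]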
Table 1 shows the performances.

Table 1: Evaluation results of the coreference resolution tools

Tool          Micro Avg.   Macro Avg.   Micro Avg.   Macro Avg.   Micro Avg.   Macro Avg.
              Precision    Precision    Recall       Recall       F1           F1
Neuralcoref   0.6308       0.7507       0.3475       0.4320       0.4469       0.5351
AllenNLP      0.5760       0.5499       0.4308       0.4582       0.4854       0.4790
Coreferee     0.6825       0.6561       0.3739       0.3517       0.4735       0.4497
Fastcoref     0.6562       0.6706       0.5887       0.6226       0.6162       0.6252
StanfordNLP   0.4222       0.4345       0.3725       0.3324       0.3894       0.3612

Precision, recall, and F1 scores serve as reliable benchmarks to determine which tool provides satisfactory results. However, during the compilation of all relevant coreferences, we adopted a detailed approach; that is, we included every coreference (all pairs of objects and pronouns that refer to the same entity) without considering whether the reference would be meaningful later. Pairs such as "that" and "it" are technically coreferences but add no value to the resolved text. This ultimately means that higher precision or recall alone does not indicate whether a tool delivers the best resolved text. Consequently, we meticulously examined all texts resolved by the tools, checking for sense, meaning, and sentence coherence.

When evaluating the tools, we always kept our goal in mind. Since we follow a high-precision approach, we aim to obtain isolated premises in the final result. Therefore, high precision is of great relevance to us, meaning that the coreferences should be correct and of high quality. Although Fastcoref has the best overall precision and recall scores, we hardly noticed any differences in actual results compared to neuralcoref. In the end, we decided to proceed with neuralcoref as the coreference library in our pipeline for several reasons. Neuralcoref uses a neural network model trained on extensive text data, which results in high accuracy in coreference resolution tasks. It has a clear and simple API that allows users to utilize the tool quickly and efficiently: a text can be passed to the model, and a list of coreferences is returned. Neuralcoref provided us with solid overall results that convinced us in terms of both quality and quantity of coreferences. The tool was also not overly greedy compared to other libraries like StanfordNLP, which yielded too many unnecessary coreferences. Ultimately, the choice of library for the system is up to the individual; it depends on whether one prefers a high-precision or a high-recall approach.

4.2. Query Expansion
For the query expansion, we utilize the ChatGPT API provided by OpenAI. As the prompt sent to the API endpoint, we select:

"This is a query: {query}. Return 5 similar queries based on this query in a JSON List with the key 'similar_queries'."

ChatGPT followed these instructions and processed our prompt by returning 5 similar queries based on our original query in a JSON list.
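A minimal sketch of how this expansion step could be issued programmatically is shown below. It assumes the official openai Python package (v1 interface) and the gpt-3.5-turbo chat model; neither the package version nor the model is specified in the text, so both are assumptions, and the response is assumed to be valid JSON.

```python
import json
from openai import OpenAI  # assumes the official openai Python package (v1 interface)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def expand_query(query: str) -> list[str]:
    """Ask the chat model for five similar queries, following the prompt shown above."""
    prompt = (
        f"This is a query: {query}. Return 5 similar queries based on this query "
        "in a JSON List with the key 'similar_queries'."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # model choice is an assumption, not stated in the paper
        messages=[{"role": "user", "content": prompt}],
    )
    # In practice the reply may need more robust parsing than a plain json.loads.
    return json.loads(response.choices[0].message.content)["similar_queries"]

# expand_query("Is vaping with e-cigarettes safe?")
# -> e.g. ["What are the health risks of vaping?", ...]
```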
The generated queries either replace keywords with synonyms or rephrase the query in the same sense. For example, with this prompt and the query "Is vaping with e-cigarettes safe?", we obtain the following 5 expanded queries: "What are the health risks of vaping?", "Is vaping less harmful than smoking?", "What chemicals are in e-cigarette vapor?", "Can vaping lead to addiction?", "What is the long-term impact of vaping on health?".

4.3. Ranking of Premises
Now we have all argumentative texts with resolved coreferences and all queries together, allowing us to start matching queries with sentences. We now analyze different methods to find relevant premises for a query. We utilize five methods for this task: Jaccard, BM25, BERT, TF-IDF, and ChatGPT. For each query (in total, we use 30), the system finds, as usual in argument retrieval, the most similar claims to the query. We can then consider their premises and determine the relevant texts. Each method takes as input the query (or a list of queries for query expansion), all relevant texts aggregated into one text (which we split into sentences), and the original texts to determine whether a coreference has been resolved. We considered the following methods:

Jaccard. For the Jaccard coefficient, we calculate the Jaccard score between the query and each premise (sentence) in the texts. The score is computed based on the 4-grams of the query and premise.

BM25. For BM25, we input the relevant texts and the query (as a list of tokens). (We use the rank-bm25 Python library: https://pypi.org/project/rank-bm25/.)

BERT. For BERT, we make use of the "sentence-transformers" Python library. More precisely, we use the "all-MiniLM-L6-v2" model, which maps texts to 384-dimensional embeddings. In our implementation, we embed the query and each premise in the text corpus and then compute the cosine similarity, which serves as the score of each premise for the query (a code sketch of this scoring is given below, after the method comparison).

TF-IDF. For TF-IDF, we employ the TfidfVectorizer module of the open-source "scikit-learn" Python library (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). It transforms the raw text into a matrix of TF-IDF features; cosine similarity on these features then yields the scores for all query-premise pairs.

ChatGPT. The ChatGPT approach differs from the other methods in that we do not compute a score. Instead, we ask the chatbot to provide the most relevant sentences from the corpus regarding the query. To do this, we send the following prompt to the ChatGPT API endpoint:

"This is a text: {text}. Return arguments that you can infer from this text matching to this query: {query}. You should only give me arguments that can be inferred from the text."

The result does not correspond exactly to the sentences from the corpus but rather to arguments that ChatGPT can infer based on its prior knowledge. Therefore, ChatGPT is unsuitable as a method for solving this task, despite its good performance. This is mainly due to the non-deterministic behavior of ChatGPT: it tries to answer in natural language and seems to pursue the paradigm that a related answer is better than no answer. Thus, it does not always reply properly to questions.
Hence, we received arguments in the form of summaries, inferences based on prior knowledge, or even completely new arguments that do not occur in the original texts at all. Furthermore, we encountered an issue with this method, as the API only allows a limited prompt length of 4,096 tokens. To circumvent this, we divided the text into sections of 300 words each, ensuring that the token limit was never reached. (This workaround successfully bypassed the token limit but increases the runtime of the program, since each request to the API takes about 10 seconds.) For each section, we sent the aforementioned prompt to the endpoint and saved the output. To obtain a comprehensive result, we then sent the following prompt for the combined results of all sections:

"Give the five best arguments from these arguments that fit the most to this query: {query}."

For each method and query, we take the best five arguments, i.e., for each method the five with the highest scores. Each returned premise in the output (except for ChatGPT) is accompanied by a flag indicating whether a coreference has been resolved (True) or not (False). For query expansion, we also output which query the premise relates to. For the annotation process, we utilized a scale ranging from 0 to 2. A score of 0 implies that the premise provides no answer to the query and is thematically unrelated, 1 implies that the premise does not provide a precise answer but is thematically consistent with the query, and 2 implies that the premise delivers an appropriate and thematically accurate response.

Overall, we fed 30 queries into the system, applying query expansion to each. As a result, a total of 60 queries (30 standard queries + 30 expanded queries) were input into the system. Note that the five expanded queries per query are viewed as one block, i.e., their premises are put into one list without duplicates; we then consider the block's 𝑛 most relevant premises, where relevance is based on the score between query and premise. Since we tested five methods, each returning five premises, we had a total of 1,500 premises that needed to be annotated. This was accomplished manually. To evaluate the qualitative performance of each method, we calculated the Normalized Discounted Cumulative Gain (nDCG), the precision, and the mean reciprocal rank (MRR) for the normal queries and the expanded queries, as shown in Table 2. (We performed statistical significance tests only for nDCG and MRR, not for Precision@{1,5}, because these tests were added only after acceptance, following a hint from a reviewer.)

Table 2: Evaluation results of the query matching methods. All values are means; P@k denotes precision@k. Values with significant differences to ChatGPT are highlighted with a *. Precision is reported in two scenarios: a strict one, where we only consider high-quality premises (score 2), and a lenient one, which counts all premises with scores 1 or 2. The lower table shows the performance after utilizing query expansion (QE).

Method    nDCG@1  nDCG@5  P@1 (strict)  P@5 (strict)  P@1 (lenient)  P@5 (lenient)  MRR
Jaccard   0.73    0.91    0.53          0.69          0.90           0.95           0.71
BM25      0.80    0.91    0.70          0.63          0.90           0.89           0.81
BERT      0.73    0.92    0.53          0.72          0.93           0.96           0.73
TF-IDF    0.73    0.91    0.60          0.69          0.87           0.93           0.78
ChatGPT   1.00*   1.00*   1.00          1.00          1.00           1.00           1.00*

With QE   nDCG@1  nDCG@5  P@1 (strict)  P@5 (strict)  P@1 (lenient)  P@5 (lenient)  MRR
Jaccard   0.68    0.88    0.59          0.63          0.83           0.91           0.71
BM25      0.67    0.89    0.43          0.61          0.87           0.89           0.66
BERT      0.78    0.92    0.60          0.72          0.97           0.97           0.77
TF-IDF    0.70    0.87    0.60          0.59          0.80           0.83           0.76
ChatGPT   1.00*   1.00*   1.00          1.00          1.00           1.00           1.00

Table 2 reveals that ChatGPT delivers the best results in both settings (normal and expanded queries). However, it was unable to provide the exact premises from the text and could not display resolved coreferences. More precisely, it did not return the five most relevant premises from the corpus; rather, it produced premises derived by ChatGPT from the textual context. Therefore, each returned premise was relevant to the query, because it was inferred from the entire context. As a result, ChatGPT exhibits consistent values of 1 for nDCG, precision, and MRR. Thus, we chose BERT as the module in our pipeline for our final system, (i) because of the aforementioned issues with ChatGPT and (ii) because the premises returned by BERT are generally more flexible and not rigidly tied to the query in terms of identical words or structure. Note that there are no significant deviations among the other methods that rely on a direct comparison between the query and the premise in the text.
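The BERT-based scoring we adopt can be sketched as follows with the sentence-transformers library and the all-MiniLM-L6-v2 model named above; the function name and the top-k value of 5 are chosen for illustration.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # maps texts to 384-dimensional embeddings

def rank_premises(query: str, premises: list[str], top_k: int = 5) -> list[tuple[float, str]]:
    """Score each premise by cosine similarity to the query and return the top_k best."""
    query_emb = model.encode(query, convert_to_tensor=True)
    premise_emb = model.encode(premises, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, premise_emb)[0]   # one similarity score per premise
    ranked = sorted(zip(scores.tolist(), premises), reverse=True)
    return ranked[:top_k]

# Example usage:
# rank_premises("Should teachers get tenure?",
#               ["Teacher tenure provides stability within schools.",
#                "Teacher tenure creates complacency."])
```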
For query expansion, the results show that it delivers more relevant results for some methods but not for others. For instance, query expansion leads to slightly lower nDCG and precision values for the Jaccard coefficient, BM25, and TF-IDF, probably because they are based on bag-of-words models. However, for BERT, query expansion achieves higher nDCG and precision values. We strongly suspect that this is due to BERT's ability to "understand" the context of a sentence, as its embeddings capture the context and not just individual terms as in the other methods. Therefore, BERT can possibly find more relevant premises because of the reformulated queries. ChatGPT's performance could not be improved anyway; however, query expansion did not make it worse either. All nDCG, precision, and MRR values remain at 1. We suppose this is because ChatGPT infers the premises through the queries.

Since query expansion not only involves additional costs due to requests to ChatGPT but also produced worse results for most methods (except for BERT, where the observed differences in relevance between a query and its expansion are not significant anyway), we did not use query expansion in the end-to-end evaluation in the next section.

5. Evaluation of Coreference Resolution
In this section, we examine the effect of coreference resolution in an end-to-end evaluation. First, we investigate the effect of coreference resolution on the regular output of our system. Second, we study how resolving the coreferences changes the understanding and structure of a premise.

5.1. Effect of Coreference Resolution on Argument Retrieval
We input the 30 queries on our dataset (see Section 3) and apply BERT (see Section 4) to find the 10 most relevant premises for each query, yielding 300 premises in total, 23 % (71 premises) of which have a resolved coreference. Although this is not a particularly high proportion of the total output, it shows that good coreference resolution is relevant for these premises.
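For reference, the measures reported in Table 2 and in Table 3 below follow common formulations; the paper itself does not spell them out, so the gain variant shown here is an assumption. With $rel_i$ the graded 0-2 judgment of the premise at rank $i$ and $\mathrm{rank}_q$ the rank of the first relevant premise for query $q$:

\[
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, \qquad
\mathrm{nDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}, \qquad
\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q},
\]

where $\mathrm{IDCG@}k$ is the $\mathrm{DCG@}k$ of an ideally reordered result list and $Q$ is the set of queries.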
With respect to the effect of coreference resolution on the rank of a single premise, we examined the equivalent original unresolved sentence for each resolved premise and checked which rank it actually holds. We found that, on average, coreference resolution improves the rank of a premise by 16.7 positions. In the worst case, the rank of the resolved premise drops by four positions, and in the best case, it improves by 377 positions. The median of all rank differences is 0. Note that whether the effect is positive or negative depends strongly on the sentence structure and the quality of the coreference resolution tool. For example, in the best case, a sentence like "It will reduce crime, as showed in the real world" is ranked higher after resolving the pronoun "It" to "more gun laws", as it then perfectly fits the query. As a negative example, the original sentence "I do not think that schools should require their students to wear a school uniform" becomes "I do not think that schools should require schools students to wear a school uniform" by resolving the coreference "their" to "schools", making the sentence sound less natural and introducing redundancy, which results in a lower rank.

Now, we examine the overall relevance of the premises (i) with the texts containing resolved coreferences and (ii) with the original texts. Thus, we run the system on both the original and the resolved corpus and extract the top 10 relevant premises from each version. We then annotate the relevance score for each result using the 0-1-2 scale, as in the earlier annotation process. This allows us to assess the overall relevance using nDCG, as shown in Table 3.

Table 3: Overall relevance of premises with resolved and unresolved texts; * = p < 0.05

nDCG scores (mean)
Corpus       @1    @2    @3    @4    @5    @6    @7     @8     @9     @10
unresolved   0.78  0.80  0.83  0.84  0.83  0.85  0.86   0.88   0.90   0.94
resolved     0.87  0.87  0.89  0.89  0.89  0.90  0.90*  0.91*  0.93*  0.96

The results show that coreference resolution increases the overall relevance of the returned premises. For each 𝑛, 1 ≤ 𝑛 ≤ 10, the variant with the resolved corpus demonstrates higher values. A two-sided paired 𝑡-test with Bonferroni correction shows significant differences for 𝑛 ∈ {7, 8, 9}.

5.2. Effect of Coreference Resolution on Isolated Premises
In the first part of the evaluation, we measured the impact of coreference resolution on the whole argument retrieval pipeline. In the second part, we now focus on the effect of coreference resolution on individual premises. We simulate the scenario in which all premises in the output contain resolved coreferences by retrieving with BERT, for all 30 queries, the 10 most relevant premises that contain a resolved coreference. In total, we obtained 300 premises to examine more closely. More precisely, we investigate two questions: (i) what is the effect of coreference resolution on the understanding of a premise, and (ii) what exactly does the success of the coreference resolution depend on? We employ a scale ranging from 0 to 2 for both investigations. A score of 0 indicates a resolved reference that was incorrectly identified; an example is the resolution in the sentence "We should be making it easier for these people to become citizens, not harder" that replaces "these people" with "they". A score of 1 indicates that the resolved reference does not provide additional assistance, but neither does it alter the original meaning.
An example is the sentence "Teacher tenure creates complacency because teachers know they are unlikely to lose their jobs", where the coreference "their" could be replaced by "teachers". A score of 2 indicates that the resolved coreference substantially enhances the premise's value, even to the point of restoring its original meaning only through the resolution. For instance, the premise "My case is against it, it should not be illegal, it should be legal" only regains its meaning through the resolution of "it" to "abortion".

Note that 24 % of the premises (score 0) contain resolved coreferences that disrupt the structure and meaning of the original sentence. Manual inspection showed that this is mostly due to an error of the coreference resolution tool or to a correct reference that does not make sense in the context of the sentence. 61 % of the premises do not improve the understanding of the sentence but also do not disrupt its structure or meaning. These are cases where a correct coreference is identified and resolved, but it does not provide any additional information for the user's comprehension, because the pronoun alone is already sufficient to convey the meaning. Finally, in 15 % of all premises, coreference resolution significantly enhances the understanding of a premise: by resolving the coreferences, the premise regains its meaning when considered individually. These are mostly premises where a pronoun is used but the object is missing; replacing the pronoun with the object brings the actual statement back to the surface. These findings do not generally imply that coreference resolution is ineffective for modifying premises; the results are highly dependent on the sentence structure of the premise. In nearly all instances with scores 0 and 1, the coreference cluster is resolved within the same sentence, meaning the object and the pronoun are both present in the original premise. In the instances with a score of 2, the original premises nearly always lack the object and contain only the pronoun.

6. Conclusion and Future Work
We presented a pipeline for an argument search engine that returns fine-grained premises from long posts in debate portals using coreference resolution. We centrally investigated the effect of coreference resolution on the overall output of the search engine and, individually, on the understanding of a premise. Additionally, we examined the effect of query expansion and which method is best suited to match a premise to a query. We consider neuralcoref the best tool for resolving coreferences and BERT the best method for finding a relevant premise for a query. The evaluation of the returned premises revealed that coreference resolution has a positive effect on both the general relevance and the understanding of the premises. More relevant premises are found for a query, and, when considered individually, they tend to be more understandable than those without resolved coreferences. Note that in our evaluation, which can be seen as a first step, about 24 % of the premises deteriorated rather than improved as a result of the resolution, while the resolution had no effect on about 61 % and contributed to an improvement in only 15 %. This is mainly because resolving is not useful everywhere. Thus, in the future, we plan to check sentences for suitability before resolving their coreferences. For example, we might try to suppress cases where the coreference is resolved within the same sentence.
Further, we could evaluate the performance when coreference resolution is only applied to premises that lack the object and contain only the pronoun.

Acknowledgments
This work has been funded by the Deutsche Forschungsgemeinschaft (DFG) within the projects ReCAP and ReCAP-II, Grant Number 375342983 - 2018-2024, as part of the Priority Program "Robust Argumentation Machines (RATIO)" (SPP-1999).

References
[1] A. Watson, Statistic on Fake News, https://www.statista.com/topics/3251/fake-news/topicOverview, 2023. Accessed: 2023-07-30.
[2] P. Lorenz-Spreen, B. Mønsted, P. Hövel, S. Lehmann, Accelerating dynamics of collective attention, Nature Communications 10 (2019). doi:10.1038/s41467-019-09311-w.
[3] A. Lauscher, H. Wachsmuth, I. Gurevych, G. Glavas, Scientia potentia est - on the role of knowledge in computational argumentation, CoRR abs/2107.00281 (2021). URL: https://arxiv.org/abs/2107.00281. arXiv:2107.00281.
[4] F. Macagno, D. Walton, C. Reed, Argumentation Schemes, 2018, pp. 517–574.
[5] E. Durmus, Towards understanding persuasion in computational argumentation, 2021.
[6] H. Wachsmuth, M. Potthast, K. Al-Khatib, Y. Ajjour, J. Puschmann, J. Qu, J. Dorsch, V. Morari, J. Bevendorff, B. Stein, Building an argument search engine for the web, in: Proceedings of the 4th Workshop on Argument Mining, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 49–59. URL: https://aclanthology.org/W17-5106. doi:10.18653/v1/W17-5106.
[7] H. Wachsmuth, Args - Argument Search, https://www.args.me/index.html, 2023. Accessed: 2023-08-28.
[8] D. Suhartono, A. Gema, S. Winton, T. David, M. Fanany, A. Arymurthy, Sequence-to-sequence learning for motion-aware claim generation, International Journal of Computing 19 (2020) 620–628. doi:10.47839/ijc.19.4.1997.
[9] M. Alshomary, W. Chen, T. Gurcke, H. Wachsmuth, Belief-based generation of argumentative claims, in: P. Merlo, J. Tiedemann, R. Tsarfaty (Eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19-23, 2021, Association for Computational Linguistics, 2021, pp. 224–233. URL: https://doi.org/10.18653/v1/2021.eacl-main.17. doi:10.18653/v1/2021.eacl-main.17.
[10] R. Sukthanker, S. Poria, E. Cambria, R. Thirunavukarasu, Anaphora and coreference resolution: A review, CoRR (2018). URL: http://arxiv.org/abs/1805.11824. arXiv:1805.11824.
[11] K. Al Khatib, L. Trautner, H. Wachsmuth, Y. Hou, B. Stein, Employing argumentation knowledge graphs for neural argument generation, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 4744–4754. URL: https://aclanthology.org/2021.acl-long.366. doi:10.18653/v1/2021.acl-long.366.
[12] J. Opitz, P. Heinisch, P. Wiesenbach, P. Cimiano, A. Frank, Explainable unsupervised argument similarity rating with Abstract Meaning Representation and conclusion generation, in: Proceedings of the 8th Workshop on Argument Mining, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 24–35. URL: https://aclanthology.org/2021.argmining-1.3. doi:10.18653/v1/2021.argmining-1.3.
[13] X. Hua, Z. Hu, L. Wang, Argument generation with retrieval, planning, and realization, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 2661–2672. URL: https://aclanthology.org/P19-1255. doi:10.18653/v1/P19-1255.
[14] B. Schiller, J. Daxenberger, I. Gurevych, Aspect-controlled neural argument generation, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 380–396. URL: https://aclanthology.org/2021.naacl-main.34. doi:10.18653/v1/2021.naacl-main.34.
[15] W. Yu, C. Zhu, Z. Li, Z. Hu, Q. Wang, H. Ji, M. Jiang, A survey of knowledge-enhanced text generation, CoRR abs/2010.04389 (2020). URL: https://arxiv.org/abs/2010.04389. arXiv:2010.04389.
[16] Y. Ajjour, H. Wachsmuth, J. Kiesel, M. Potthast, M. Hagen, B. Stein, Data Acquisition for Argument Search: The args.me Corpus, 2019, pp. 48–59. doi:10.1007/978-3-030-30179-8_4.