<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Arguments as Social Good: Good Arguments in Times of Crisis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Johannes Daxenberger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iryna Gurevych</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ubiquitous Knowledge Processing (UKP) Lab, Technische Universität Darmstadt</institution>
          ,
          <addr-line>Germany</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>We report on a case study about extracting natural language arguments from news media to support decision-making in crises like the Covid-19 pandemic. In particular, we seek to detect the latest pro- and con-arguments and their trend for crisis-relevant topics with the help of a combination of retrieval and machine learning. We present a prototype system that is able to uncover decision-critical information about a broad range of topics. Manual analysis shows that the fully automatic system is able to retrieve arguments in real time and with high quality.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The Covid-19 crisis presents decision-makers in politics,
society and business with the challenge of having to make very
quick decisions in a completely new situation under
conditions that can change daily. Many of these decisions had or
have a significant impact on our daily lives, like enforced
lockdown of businesses and schools, mandatory face
coverings, or travel restrictions. For many of these questions,
little or no evidence from previous incidents is available.
Consequently, any support for (more) thorough and
transparent decision-making is of great use. The aim of this case
study is to enable such support by extracting arguments from
the broadest possible spectrum of unstructured but
up-to-date web sources (in particular, news sources). As a user
group, we primarily address decision-makers from politics
and business, but also the general public. The result is made
available through a publicly accessible web demonstrator.1</p>
      <p>Our prototype is realized in the form of an argumentative
search engine (Wachsmuth et al. 2017), which displays pros
and cons (i.e. justified options for action) on a
controversial topic or policy making in the context of the Covid-19
pandemic. Trends can be identified with a visualization that
reveals the absolute counts of pro- and con-arguments over the last
months. To ensure a balanced picture and to avoid
potential (regional or political) bias, we include sources from
all over the world. This submission describes the setup of
the system and some preliminary analysis of results.</p>
    </sec>
    <sec id="sec-2">
      <title>System Description</title>
      <p>Our system consists of three independent components: a) the
retrieval component, which—given a query term—searches,
downloads, parses and segments articles from the web; b)
the connection to the ArgumenText API which classifies
the query term and output from a) into pro-, con- or
non-arguments; and c) the frontend which displays pro- and
con-arguments and their trend. The components are
connected through REST interfaces and can be deployed
independently.</p>
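As a minimal sketch (not the actual implementation), the glue between the components described above can be expressed with the two argument-producing components injected as plain callables; the labels "pro"/"con"/"non" follow the classification scheme in the text, while all function names here are illustrative assumptions:

```python
from typing import Callable, Iterable


def run_pipeline(
    query: str,
    retrieve: Callable[[str], Iterable[str]],  # component a): query -> sentences
    classify: Callable[[str, str], str],       # component b): (topic, sentence) -> "pro" | "con" | "non"
) -> dict:
    """Wire the retrieval and classification components for one query.

    In the real system each component sits behind its own REST interface
    and can be deployed independently; injecting them as callables keeps
    the glue logic testable.
    """
    pros, cons = [], []
    for sentence in retrieve(query):
        label = classify(query, sentence)
        if label == "pro":
            pros.append(sentence)
        elif label == "con":
            cons.append(sentence)
    # non-arguments are discarded; the frontend (component c) renders the rest
    return {"query": query, "pro": pros, "con": cons}
```

Because the components only communicate through this narrow interface, each can be swapped or scaled without touching the others.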
      <sec id="sec-2-1">
        <title>Retrieval Component</title>
        <p>Rather than implementing our own web crawler, we make
use of the GDELT project (https://www.gdeltproject.org), which aggregates news media
from all over the world in real-time and in 65 languages.
GDELT offers a public full-text search API giving access
to their collection of news articles and blogs (https://blog.gdeltproject.org/gdelt-doc-2-0-api-debuts/). Given a query
term, the GDELT 2.0 DOC API searches in a rolling window
of the last three months of their total coverage and returns a
list of at most 250 URLs and metadata (e.g. timestamps) of
matching web articles.</p>
        <p>
          As we aim to extract arguments from the full text of the
articles, we created a pipeline for scraping and parsing HTML
content to plain text. Boilerplate removal to clean unwanted
text elements is carried out using the Apache Tika toolkit
(https://tika.apache.org). The processing backbone of this pipeline uses DKPro Core
          <xref ref-type="bibr" rid="ref6">(Eckart de Castilho and Gurevych 2014)</xref>
          for metadata
conversion and sentence segmentation. For the sake of
interoperability, the retrieval component acts as a proxy, masking
the details of the underlying pipeline. It can be queried like
any Elasticsearch client and returns responses similar to an
Elasticsearch cluster. At the time of submission, endpoints
supporting English and German queries (and responses) are
available. To minimize the answer delay, articles are scraped
and parsed in parallel. As a result, more than a hundred pages
can typically be processed in less than five seconds. A more
exhaustive description of this part of the system is available
in Scheunemann et al. (2020).
        </p>
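The query construction and the parallel scrape-and-parse step described above can be sketched as follows. The DOC API parameter names (query, mode, format, maxrecords, timespan) follow GDELT's public documentation; fetch_and_parse is a caller-supplied stand-in for the Tika/DKPro Core pipeline, not part of any real library:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode

GDELT_DOC_API = "https://api.gdeltproject.org/api/v2/doc/doc"


def build_gdelt_query(term: str, max_records: int = 250, timespan: str = "3m") -> str:
    """Build a GDELT 2.0 DOC API full-text search URL.

    The API searches a rolling window (here: the last three months) and
    returns at most 250 matching article URLs with metadata.
    """
    params = {
        "query": term,
        "mode": "ArtList",   # return a list of matching articles
        "format": "json",
        "maxrecords": max_records,
        "timespan": timespan,
    }
    return f"{GDELT_DOC_API}?{urlencode(params)}"


def scrape_parallel(urls, fetch_and_parse, max_workers: int = 16):
    """Scrape and parse articles in parallel to minimize the answer delay.

    `fetch_and_parse` maps a URL to plain text (e.g. HTTP fetch plus Tika
    boilerplate removal); failures should return None and are dropped.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return [doc for doc in pool.map(fetch_and_parse, urls) if doc is not None]
```

The thread pool is what allows more than a hundred pages to be processed within a few seconds, since each article fetch is dominated by network latency.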
      </sec>
      <sec id="sec-2-2">
        <title>Argument Classification</title>
        <p>
          Once relevant documents have been identified by the
retrieval component, we rely on the ArgumenText API
          <xref ref-type="bibr" rid="ref1">(Daxenberger et al. 2020)</xref>
          to further process all sentences
from these documents (https://api.argumentsearch.com/en/doc). The ArgumenText system takes an
information-seeking perspective on Argument Mining (Stab
et al. 2018b) and classifies any given sentence as a pro-,
con- or non-argument with regard to a topic (i.e., the query, in
our case). It does so using a transformer-based architecture,
where the topic and sentence are jointly embedded using
contextualized BERT-large embeddings
          <xref ref-type="bibr" rid="ref4">(Devlin et al. 2019;
Reimers et al. 2019)</xref>
          . The data used to fine-tune the
embeddings spans about 40 different topics from innovation
and technology (Stab et al. 2018a) and is extracted from a
large web crawl. The resulting model generalizes much
better than a model trained on fewer topics. As shown by Stab
et al. (2018a), a cross-topic evaluation yields 0.74 macro
F1-score compared to 0.66 macro F1-score when trained on
only eight topics. The ArgumenText system has been shown
to cover 89% of arguments from human experts among the
top-ranked results (Stab et al. 2018a). For the sake of this
case study, we did not adapt the training data and model
architecture. Rather, we seek to analyze the generalization
capabilities of the existing model to cope with topics related
to the Covid-19 pandemic.
        </p>
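The topic-conditioned classification can be illustrated with a BERT-style sentence-pair encoding, in which topic and sentence are jointly embedded. This is only a sketch of the input/output contract: score_fn stands in for the fine-tuned transformer, and the encoding string mirrors BERT's [CLS]/[SEP] convention rather than the exact ArgumenText internals:

```python
LABELS = ("pro", "con", "non")


def encode_pair(topic: str, sentence: str) -> str:
    """BERT-style sentence-pair input: topic and candidate sentence are
    jointly encoded, so the predicted stance is always relative to the
    query topic rather than an intrinsic property of the sentence."""
    return f"[CLS] {topic} [SEP] {sentence} [SEP]"


def classify_batch(topic, sentences, score_fn):
    """Classify each sentence as pro-, con- or non-argument w.r.t. topic.

    `score_fn` maps an encoded pair to three scores, one per label; the
    highest-scoring label is returned for each sentence.
    """
    results = []
    for s in sentences:
        scores = score_fn(encode_pair(topic, s))
        results.append(max(zip(LABELS, scores), key=lambda p: p[1])[0])
    return results
```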
      </sec>
      <sec id="sec-2-3">
        <title>Visualization</title>
        <p>The final application can be accessed through a search
interface which allows the user to specify any English or German query
and explore resulting pro- and con-arguments in multiple
ways. The appearance and handling of the frontend is based
on the ArgumenText search engine (https://www.argumentsearch.com), but additionally shows
a graph highlighting the occurrence of arguments along a
timeline of the last three months (see Figure 1). Absolute
counts of pro- (positive) and con-arguments (negative) are
shown as a bar chart including a trend line. The trend is
calculated by aggregating counts in the first and second half of
the full time range. The interface also allows the user to aggregate or
filter arguments by source document.</p>
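The trend computation described above (aggregating counts in the first and second half of the time range) can be sketched as a pure function; the sign convention (positive = rising) and the handling of an odd number of days are assumptions of this sketch:

```python
def trend(daily_counts: list) -> int:
    """Aggregate argument counts in the first and second half of the
    time range and return their difference (positive = rising trend).
    With an odd number of days, the middle day falls into the second half.
    """
    mid = len(daily_counts) // 2
    return sum(daily_counts[mid:]) - sum(daily_counts[:mid])


def bar_values(pro_counts: list, con_counts: list) -> tuple:
    """Bar-chart values for the timeline: pro-argument counts plotted as
    positive bars, con-argument counts as negative bars."""
    return list(pro_counts), [-c for c in con_counts]
```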
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>To evaluate the system for the purpose of supporting
decision-making in crises, we defined ten topics around
social issues and policy making in the Covid-19 pandemic, as
listed in Table 1. We analyzed both the argument coverage
and search time as well as the relevance of the top-ranked
arguments for different initial sizes of the document
collection.</p>
      <sec id="sec-3-1">
        <title>Argument Coverage and Search Time</title>
        <p>As the retrieval component searches the GDELT API in
real time, both the response duration and the response itself vary
over time. To account for this, we repeated each query ten
times with a delay of about a minute. Except for the
relevancy scores in Table 1, all reported results are averaged
across these ten runs. In addition to document retrieval,
classification of sentences causes a delay in the overall response
time. Both increase with the number of documents/sentences
to be processed. Aiming to minimize search response time
while covering a broad content range in a period of up
to three months, we tested different initial input document
sizes. The results are shown in Figure 2.</p>
        <p>We report mean and standard deviation across the ten
topics for input document sizes between 10 and 50. The
number of sentences to be classified as well as the number of
detected pro- and con-arguments increase more steeply with up
to 20 input documents (Figure 2, left and middle); however,
this effect is hardly noticeable in the overall response time
which increases almost linearly with the number of input
documents (Figure 2, right).</p>
        <p>Response times vary between 3 and 13 seconds. Among
the ten queries considered, “social distancing” and “covid
economy” are outliers with considerably more sentences to
be searched, while “covid in schools” has considerably fewer.
In terms of the detected arguments “coronavirus protests”
returned only 3 pro-arguments on average, while “face masks”
and “herd immunity” yielded 76 and 71 (average across
all topics is 39). Among con-arguments, “face masks” only
gave 14 results on average, whereas “herd immunity” gives
92 (average across all topics is 49). We decided to set the
default input document size to 35, giving a reasonable trade-off
between answer delay and argument coverage.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Argument Relevance</title>
        <p>For each query term (topic), we also wanted to know
whether the returned sentences were i) relevant to the topic
(Potthast et al. 2019) and ii) valid pro- or con-arguments
with regard to the topic. In i), as a prerequisite for
relevancy, the relation between the result sentence and the topic
needed to be comprehensible without any further context.
ii) was only assessed among relevant result sentences. To be
counted as a valid argument, the sentence had to express
evidence or reasoning towards the topic (Stab et al. 2018b) and
the stance had to be classified correctly. For the latter, the
topic is considered as an implicit claim formed as “query
is/are not a problem” (pro) or “query is/are a problem”
(con).</p>
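The implicit claim against which stance labels were judged can be rendered by a small helper; the function name and the plural flag (for choosing "is" vs. "are" based on the query's grammatical number) are illustrative, not part of the evaluation tooling:

```python
def implicit_claim(query: str, stance: str, plural: bool = False) -> str:
    """Render the implicit claim used when judging stance correctness:
    "<query> is/are not a problem" for pro, "<query> is/are a problem" for con.
    """
    verb = "are" if plural else "is"
    negation = " not" if stance == "pro" else ""
    return f"{query} {verb}{negation} a problem"
```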
        <p>A graduate student with a background in language
technology assessed the first 10 pro- and the first 10
con-arguments of the first run for each query according to these
prerequisites. The results are given in Table 1 as percentage
over all sentences (relevancy) and percentage over relevant
sentences (stance). Around 80% of the results are relevant to
the input query (with a slight advantage for con-arguments).
An exception is “coronavirus protests” for which not a single
relevant pro-argument was identified. Similarly, for stance,
con-arguments are recognized slightly better (78% as
compared to 75% for pro-arguments), but variance among the
queries is higher. In most cases, low stance scores only
affected either pro- or con-arguments, with the exception of
“quarantine”, where many sentences were rather descriptive
than argumentative.</p>
        <p>To assess the reliability of these judgements, half of the
data points were also assessed by the first author of this
paper. Fleiss’ Kappa scores (Fleiss 1971) have been calculated
over both pro- and con-arguments, but separately for
relevancy and stance. The inter-rater agreement for relevancy is
κ = 0.79 and for stance κ = 0.71. Both values are in the
range of substantial agreement (Fleiss 1971), demonstrating
the reliability of the evaluation.</p>
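The agreement scores above can be reproduced with a direct implementation of Fleiss' kappa; the rating counts in the usage example are illustrative, not the study's actual data:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for N items rated by n raters into k categories.

    `counts[i][j]` is the number of raters who assigned item i to
    category j; every row must sum to the same number of raters n.
    """
    N = len(counts)
    n = sum(counts[0])
    k = len(counts[0])
    # mean per-item observed agreement
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts
    ) / N
    # chance agreement from the marginal category proportions
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(pj * pj for pj in p)
    return (P_bar - P_e) / (1 - P_e)
```

For two raters who agree on every item, kappa is exactly 1; disagreement on some items pulls it toward (or below) 0.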
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Next Steps</title>
      <p>Our application of AI technology showcases how a
combination of real-time document retrieval and fully automatic
argument classification can support decision-making in
crisis situations. We believe that the availability of a balanced
and broad range of evidence from all over the world
substantially contributes to situational awareness on critical matters,
both for policy makers as well as the general public. The
system is publicly available for further testing. Results of a
preliminary evaluation show that instant retrieval and
classification of around 100 arguments is feasible within less than
10 seconds and that the quality of the resulting arguments is
high.</p>
      <p>Next, we plan to include further document sources. In
particular, we want to add scientific literature (e.g. pre-print
servers) such that evidence from recent research will also be
included among the pro- and con-arguments. Furthermore,
we want to integrate the ArgumenText Clustering API7, to
automatically quantify predominant argumentative aspects
among similar arguments (e.g. “reusability” for the query
“face masks”). This will help to identify important subtopics
in the discourse around the query of interest.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This research was supported through a grant by the Profile
Area “Internet and Digitization” of the Technical University
of Darmstadt within their “COVID-19” funding.</p>
      <p>Scheunemann, C.; Naumann, J.; Eichler, M.; Stowe, K.; and
Gurevych, I. 2020. Data Collection and Annotation Pipeline
for Social Good Projects. In Proceedings of the AAAI Fall
2020 AI for Social Good Symposium.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Daxenberger</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schiller</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Stahlhut</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kaiser</surname>
          </string-name>
          , E.; and
          <string-name>
            <surname>Gurevych</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>ArgumenText: Argument Classification and Clustering in a Generalized Search Scenario</article-title>
          .
          <source>Datenbank-Spektrum</source>
          <volume>20</volume>
          :
          <fpage>115</fpage>
          -
          <lpage>121</lpage>
          . URL http://tubiblio.ulb.tu-darmstadt.de/121189/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Chang, M.-W.;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          . In
          <source>NAACL'19</source>
          ,
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . doi:10.18653/v1/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Eckart de Castilho</surname>
          </string-name>
          , R.; and
          <string-name>
            <surname>Gurevych</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>A Broadcoverage Collection of Portable NLP Components for Building Shareable Analysis Pipelines</article-title>
          .
          <source>In Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          . doi:10.3115/v1/W14-5201. URL https://www.aclweb.org/anthology/W14-5201.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Fleiss</surname>
            ,
            <given-names>J. L.</given-names>
          </string-name>
          <year>1971</year>
          .
          <article-title>Measuring nominal scale agreement among many raters</article-title>
          .
          <source>Psychological Bulletin</source>
          <volume>76</volume>
          (5):
          <fpage>1117</fpage>
          -
          <lpage>1120</lpage>
          . doi:10.1037/h0031619.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>