=Paper= {{Paper |id=Vol-2884/paper_113 |storemode=property |title=Arguments as Social Good: Good Arguments in Times of Crisis |pdfUrl=https://ceur-ws.org/Vol-2884/paper_113.pdf |volume=Vol-2884 |authors=Johannes Daxenberger,Iryna Gurevych |dblpUrl=https://dblp.org/rec/conf/aaaifs/DaxenbergerG20 }} ==Arguments as Social Good: Good Arguments in Times of Crisis== https://ceur-ws.org/Vol-2884/paper_113.pdf
               Arguments as Social Good: Good Arguments in Times of Crisis
                                       Johannes Daxenberger and Iryna Gurevych
                                            Ubiquitous Knowledge Processing (UKP) Lab
                                             Technische Universität Darmstadt, Germany
                                                  https://www.ukp.tu-darmstadt.de




Abstract

We report on a case study about extracting natural language arguments from news media to support decision-making in crises like the Covid-19 pandemic. In particular, we seek to detect the latest pro- and con-arguments and their trend for crisis-relevant topics with the help of a combination of retrieval and machine learning. We present a prototype system that is able to uncover decision-critical information about a broad range of topics. Manual analysis shows that the fully automatic system is able to retrieve arguments in real time and with high quality.

The Covid-19 crisis presents decision-makers in politics, society and business with the challenge of having to make very quick decisions in a completely new situation under conditions that can change daily. Many of these decisions had or have a significant impact on our daily lives, like enforced lockdown of businesses and schools, mandatory face coverings, or travel restrictions. For many of these questions, little or no evidence from previous incidents is available. Consequently, any support for (more) thorough and transparent decision-making is of great use. The aim of this case study is to enable such support by extracting arguments from the broadest possible spectrum of unstructured but up-to-date web sources (in particular, news sources). As a user group, we primarily address decision-makers from politics and business, but also the general public. The result is made available through a publicly accessible web demonstrator.1

Our prototype is realized in the form of an argumentative search engine (Wachsmuth et al. 2017), which displays pros and cons (i.e. justified options for action) on a controversial topic or policy making in the context of the Covid-19 pandemic. Trends can be identified with a visualization that shows the absolute numbers of pro- and con-arguments over the last months. To account for a balanced picture and to avoid potential (regional or political) bias, we include sources from all over the world. This submission describes the setup of the system and some preliminary analysis of results.

                              PRO               CON
Query                     Rel.   Stance     Rel.   Stance
covid economy             0.80   0.86       0.60   0.83
face masks                1.00   1.00       1.00   0.80
corona tourism            0.70   1.00       0.70   0.57
social distancing         0.90   1.00       0.90   0.44
corona party              0.70   0.57       0.80   0.86
covid in schools          0.90   0.89       0.70   0.71
covid vaccination         0.80   0.63       0.90   1.00
quarantine                1.00   0.50       0.90   0.56
coronavirus protests      0.00   0.00       0.70   1.00
herd immunity             0.90   1.00       0.90   1.00
Avg.                      0.77   0.75       0.81   0.78

Table 1: Queries and results for manual evaluation. Rel.(evance) gives the share of assessed sentences that are relevant to the search query; Stance gives the share of these relevant sentences that are correctly classified as pro- or con-arguments. All numbers are fractions between 0 and 1.

System Description

Our system consists of three independent components: a) the retrieval component, which, given a query term, searches, downloads, parses and segments articles from the web; b) the connection to the ArgumenText API, which classifies the query term and output from a) into pro-, con- or non-arguments; and c) the frontend, which displays pro- and con-arguments and their trend. The components are connected through REST interfaces and can be deployed independently.

Retrieval Component
Rather than implementing our own web crawler, we make use of the GDELT project, which aggregates news media from all over the world in real time and in 65 languages.2 GDELT offers a public full-text search API giving access to their collection of news articles and blogs.3 Given a query term, the GDELT 2.0 DOC API searches in a rolling window of the last three months of their total coverage and returns a

AAAI Fall 2020 Symposium on AI for Social Good. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 https://asg.ukp.informatik.tu-darmstadt.de
2 https://www.gdeltproject.org
3 https://blog.gdeltproject.org/gdelt-doc-2-0-api-debuts/
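As a rough illustration of the retrieval step, the sketch below builds a request URL for the GDELT DOC API linked in footnote 3. The endpoint and the parameter names (`query`, `mode`, `maxrecords`, `timespan`, `format`) reflect our reading of the public GDELT documentation and should be verified against it; the paper does not state which parameters the described system actually sends, and the function name `build_gdelt_request` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed public endpoint of the GDELT 2.0 DOC API (check the GDELT docs).
GDELT_DOC_ENDPOINT = "https://api.gdeltproject.org/api/v2/doc/doc"


def build_gdelt_request(query, max_records=250, timespan="3months"):
    """Return a request URL asking for a JSON list of matching articles.

    The limit of 250 records and the three-month window mirror the
    description in the paper; the parameter spellings are assumptions.
    """
    if not 1 <= max_records <= 250:
        raise ValueError("max_records must be between 1 and 250")
    params = {
        "query": query,
        "mode": "artlist",       # list of articles rather than a timeline
        "maxrecords": max_records,
        "timespan": timespan,    # rolling window ending now
        "format": "json",
    }
    return f"{GDELT_DOC_ENDPOINT}?{urlencode(params)}"


url = build_gdelt_request("face masks")
```

Building the URL separately from sending it keeps the sketch free of network access; an HTTP client of choice can then fetch `url` and read the article URLs and timestamps from the response.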
Figure 1: The current user interface of the search engine including the argument trend as bar chart and the first pro- and
con-arguments discovered for the query “face masks”. Screenshot taken on September 16th, 2020.
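The trend shown in Figure 1 is described in the Visualization section as an aggregation of counts over the first and second half of the time range. A minimal sketch of that idea follows; the function name, the input format and the decision to report the difference between the two halves are assumptions, since the paper does not specify how the two aggregates are turned into the displayed trend line.

```python
from collections import Counter
from datetime import date


def half_range_trend(arguments, start, end):
    """Compare argument counts in the first and second half of [start, end].

    `arguments` is an iterable of (publication_date, stance) pairs with
    stance "pro" or "con"; this input format is illustrative only.
    """
    midpoint = start + (end - start) / 2  # date arithmetic keeps whole days
    counts = {"first": Counter(), "second": Counter()}
    for day, stance in arguments:
        if start <= day <= end:
            counts["first" if day < midpoint else "second"][stance] += 1
    return {
        stance: {
            "first": counts["first"][stance],
            "second": counts["second"][stance],
            "change": counts["second"][stance] - counts["first"][stance],
        }
        for stance in ("pro", "con")
    }


# Hypothetical arguments, not taken from the system's output.
sample = [
    (date(2020, 6, 20), "pro"),
    (date(2020, 7, 5), "con"),
    (date(2020, 8, 10), "pro"),
    (date(2020, 9, 1), "pro"),
    (date(2020, 9, 10), "con"),
    (date(2020, 5, 1), "pro"),  # outside the range, ignored
]
trend = half_range_trend(sample, date(2020, 6, 16), date(2020, 9, 16))
```

A positive `change` means more arguments of that stance were published in the more recent half of the window.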


list of at most 250 URLs and metadata (e.g. timestamps) of matching web articles.

As we aim to extract arguments from the full text of the articles, we created a pipeline for scraping and parsing HTML content to plain text. Boilerplate removal to clean unwanted text elements is carried out using the Apache Tika toolkit.4 The processing backbone of this pipeline uses DKPro Core (Eckart de Castilho and Gurevych 2014) for metadata conversion and sentence segmentation. For the sake of interoperability, the retrieval component acts as a proxy, masking the details of the underlying pipeline. It can be queried with any Elasticsearch client and returns responses similar to those of an Elasticsearch cluster. At the time of submission, endpoints supporting English and German queries (and responses) are available. To minimize the answer delay, articles are scraped and parsed in parallel. As a result, more than a hundred pages can typically be processed in less than five seconds. A more exhaustive description of this part of the system is available in Scheunemann et al. (2020).

Argument Classification
Once relevant documents have been identified by the retrieval component, we rely on the ArgumenText API (Daxenberger et al. 2020) to further process all sentences from these documents.5 The ArgumenText system takes an information-seeking perspective on Argument Mining (Stab et al. 2018b) and classifies any given sentence as a pro-, con- or non-argument with regard to a topic (i.e., the query, in our case). It does so using a transformer-based architecture, where the topic and sentence are jointly embedded using contextualized BERT-large embeddings (Devlin et al. 2019; Reimers et al. 2019). The data used to fine-tune the embeddings spans about 40 different topics from innovation and technology (Stab et al. 2018a) and is extracted from a large web crawl. The resulting model generalizes much better than a model trained on fewer topics. As shown by Stab et al. (2018a), a cross-topic evaluation yields 0.74 macro F1-score compared to 0.66 macro F1-score when trained on only eight topics. The ArgumenText system has been shown to cover 89% of arguments from human experts among the top-ranked results (Stab et al. 2018a). For the sake of this case study, we did not adapt the training data or the model architecture. Rather, we seek to analyze the generalization capabilities of the existing model to cope with topics related to the Covid-19 pandemic.

Visualization
The final application can be accessed through a search interface which allows users to enter any English or German query and explore the resulting pro- and con-arguments in multiple ways. The appearance and handling of the frontend is based on the ArgumenText search engine,6 but additionally shows a graph highlighting the occurrence of arguments along a

4 https://tika.apache.org
5 https://api.argumentsearch.com/en/doc
6 https://www.argumentsearch.com
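The Retrieval Component section states that articles are scraped and parsed in parallel to reduce the answer delay. One common way to do this in Python is a thread pool, sketched below. This is not the authors' pipeline (which builds on Apache Tika and DKPro Core); `scrape_in_parallel`, `fetch_and_parse` and the worker count are hypothetical names and values chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_in_parallel(urls, fetch_and_parse, max_workers=16):
    """Apply `fetch_and_parse` to every URL concurrently.

    Results come back in the order of `urls`. A page that fails to download
    or parse yields None, so one broken article does not abort the query.
    """
    def safe(url):
        try:
            return fetch_and_parse(url)
        except Exception:
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe, urls))


# Stand-in for a real downloader and HTML-to-text parser (assumption).
def fake_fetch_and_parse(url):
    if "broken" in url:
        raise IOError("download failed")
    return f"plain text of {url}"


texts = scrape_in_parallel(
    ["https://example.org/a", "https://example.org/broken", "https://example.org/b"],
    fake_fetch_and_parse,
)
```

Threads suit this step because downloading pages is dominated by waiting on the network; a CPU-heavy parser might instead call for a process pool.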
Figure 2: Total search time, number of (pro- and con-) arguments and total number of sentences searched for different input
document sizes; mean and standard deviation across 10 query topics (cf. Table 1).
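Figure 2 aggregates per-topic measurements into a mean and a standard deviation. The per-topic timings and counts are not listed in the paper, so the sketch below applies the same kind of aggregation to the Table 1 scores instead, where the means can be checked against the Avg. row. Using the sample standard deviation is an assumption; the paper does not say which variant was used.

```python
from statistics import mean, stdev

# Scores copied from Table 1: (PRO Rel., PRO Stance, CON Rel., CON Stance).
TABLE_1 = {
    "covid economy": (0.80, 0.86, 0.60, 0.83),
    "face masks": (1.00, 1.00, 1.00, 0.80),
    "corona tourism": (0.70, 1.00, 0.70, 0.57),
    "social distancing": (0.90, 1.00, 0.90, 0.44),
    "corona party": (0.70, 0.57, 0.80, 0.86),
    "covid in schools": (0.90, 0.89, 0.70, 0.71),
    "covid vaccination": (0.80, 0.63, 0.90, 1.00),
    "quarantine": (1.00, 0.50, 0.90, 0.56),
    "coronavirus protests": (0.00, 0.00, 0.70, 1.00),
    "herd immunity": (0.90, 1.00, 0.90, 1.00),
}
COLUMNS = ("pro_rel", "pro_stance", "con_rel", "con_stance")


def summarize(table):
    """Return {column: (mean, sample standard deviation)} across all topics."""
    summary = {}
    for index, column in enumerate(COLUMNS):
        values = [row[index] for row in table.values()]
        summary[column] = (mean(values), stdev(values))
    return summary


summary = summarize(TABLE_1)
```

The four means agree with the Avg. row of Table 1 up to rounding, and the standard deviations give a rough sense of how much the scores vary between queries.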


timeline of the last three months (see Figure 1). Absolute counts of pro- (positive) and con-arguments (negative) are shown as a bar chart including a trend line. The trend is calculated by aggregating counts in the first and second half of the full time range. The interface also allows users to aggregate or filter arguments by source document.

Evaluation

To evaluate the system for the purpose of supporting decision-making in crises, we defined ten topics around social issues and policy making in the Covid-19 pandemic, as listed in Table 1. We analyzed both the argument coverage and search time as well as the relevance of the top-ranked arguments for different initial sizes of the document collection.

Argument Coverage and Search Time
As the retrieval component searches the GDELT API in real time, both response duration and the response itself vary over time. To account for this, we repeated each query ten times with a delay of about a minute. Except for the relevancy scores in Table 1, all reported results are averaged across these ten runs. In addition to document retrieval, classification of sentences causes a delay in the overall response time. Both increase with the number of documents/sentences to be processed. Aiming to minimize search response time while covering a broad content range in a period of up to three months, we tested different initial input document sizes. The results are shown in Figure 2.

We report mean and standard deviation across the ten topics for input document sizes between 10 and 50. The number of sentences to be classified as well as the number of detected pro- and con-arguments increase more steeply up to 20 input documents (Figure 2, left and middle). However, this effect is hardly noticeable in the overall response time, which increases almost linearly with the number of input documents (Figure 2, right).

Response times vary between 3 and 13 seconds. Among the ten queries considered, “social distancing” and “covid economy” are outliers with considerably more sentences to be searched, while “covid in schools” has considerably fewer. In terms of the detected arguments, “coronavirus protests” returned only 3 pro-arguments on average, while “face masks” and “herd immunity” yielded 76 and 71, respectively (the average across all topics is 39). Among con-arguments, “face masks” only gave 14 results on average, whereas “herd immunity” gave 92 (the average across all topics is 49). We decided to set the default input document size to 35, giving a reasonable trade-off between answer delay and argument coverage.

Argument Relevance
For each query term (topic), we also wanted to know whether the returned sentences were i) relevant to the topic (Potthast et al. 2019) and ii) valid pro- or con-arguments with regard to the topic. In i), as a prerequisite for relevancy, the relation between the result sentence and the topic needed to be comprehensible without any further context. ii) was only assessed among relevant result sentences. To be counted as a valid argument, the sentence had to express evidence or reasoning towards the topic (Stab et al. 2018b) and the stance had to be classified correctly. For the latter, the topic is considered as an implicit claim formed as “query is/are not a problem” (pro) or “query is/are a problem” (con).

A graduate student with a background in language technology assessed the first 10 pro- and the first 10 con-arguments of the first run for each query according to these prerequisites. The results are given in Table 1 as a proportion of all assessed sentences (relevancy) and as a proportion of relevant sentences (stance). Around 80% of the results are relevant to the input query (with a slight advantage for con-arguments). An exception is “coronavirus protests”, for which not a single relevant pro-argument was identified. Similarly, for stance, con-arguments are recognized slightly better (78% as compared to 75% for pro-arguments), but variance among the queries is higher. In most cases, low stance scores only affected either pro- or con-arguments, with the exception of “quarantine”, where many sentences were descriptive rather than argumentative.

To assess the reliability of these judgements, half of the data points were also assessed by the first author of this paper. Fleiss’ Kappa scores (Fleiss 1971) were calculated over both pro- and con-arguments, but separately for relevancy and stance. The inter-rater agreement for relevancy is κ = 0.79 and for stance κ = 0.71. Both values are in the range of substantial agreement (Fleiss 1971), demonstrating
the reliability of the evaluation.

Conclusion and Next Steps

Our application of AI technology showcases how a combination of real-time document retrieval and fully automatic argument classification can support decision-making in crisis situations. We believe that the availability of a balanced and broad range of evidence from all over the world substantially contributes to situational awareness on critical matters, both for policy makers and for the general public. The system is publicly available for further testing. Results of a preliminary evaluation show that instant retrieval and classification of around 100 arguments is feasible in less than 10 seconds and that the quality of the resulting arguments is high.

Next, we plan to include further document sources. In particular, we want to add scientific literature (e.g. pre-print servers) such that evidence from recent research will also be included among the pro- and con-arguments. Furthermore, we want to integrate the ArgumenText Clustering API7 to automatically quantify predominant argumentative aspects among similar arguments (e.g. “reusability” for the query “face masks”). This will help to identify important subtopics in the discourse around the query of interest.

Acknowledgments

This research was supported through a grant by the Profile Area “Internet and Digitization” of the Technical University of Darmstadt within their “COVID-19” funding.

References

Daxenberger, J.; Schiller, B.; Stahlhut, C.; Kaiser, E.; and Gurevych, I. 2020. ArgumenText: Argument Classification and Clustering in a Generalized Search Scenario. Datenbank-Spektrum 20: 115–121. URL http://tubiblio.ulb.tu-darmstadt.de/121189/.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL’19, 4171–4186. doi:10.18653/v1/N19-1423.
Eckart de Castilho, R.; and Gurevych, I. 2014. A Broad-coverage Collection of Portable NLP Components for Building Shareable Analysis Pipelines. In Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT, 1–11. doi:10.3115/v1/W14-5201. URL https://www.aclweb.org/anthology/W14-5201.
Fleiss, J. L. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin 76(5): 378–382. ISSN 00332909. doi:10.1037/h0031619. URL http://content.apa.org/journals/bul/76/5/378.
Potthast, M.; Gienapp, L.; Euchner, F.; Heilenkötter, N.; Weidmann, N.; Wachsmuth, H.; Stein, B.; and Hagen, M. 2019. Argument Search: Assessing Argument Relevance. In SIGIR’19, 1117–1120. doi:10.1145/3331184.3331327.
Reimers, N.; Schiller, B.; Beck, T.; Daxenberger, J.; Stab, C.; and Gurevych, I. 2019. Classification and Clustering of Arguments with Contextualized Word Embeddings. In ACL’19, 567–578. doi:10.18653/v1/P19-1054.
Scheunemann, C.; Naumann, J.; Eichler, M.; Stowe, K.; and Gurevych, I. 2020. Data Collection and Annotation Pipeline for Social Good Projects. In Proceedings of the AAAI Fall 2020 AI for Social Good Symposium.
Stab, C.; Daxenberger, J.; Stahlhut, C.; Miller, T.; Schiller, B.; Tauchmann, C.; Eger, S.; and Gurevych, I. 2018a. ArgumenText: Searching for Arguments in Heterogeneous Sources. In NAACL’18: System Demonstrations, 21–25. URL http://tubiblio.ulb.tu-darmstadt.de/105466/.
Stab, C.; Miller, T.; Schiller, B.; Rai, P.; and Gurevych, I. 2018b. Cross-topic Argument Mining from Heterogeneous Sources. In EMNLP’18, 3664–3674. URL https://www.aclweb.org/anthology/D18-1402.
Wachsmuth, H.; Potthast, M.; Al-Khatib, K.; Ajjour, Y.; Puschmann, J.; Qu, J.; Dorsch, J.; Morari, V.; Bevendorff, J.; and Stein, B. 2017. Building an Argument Search Engine for the Web. In Proceedings of the 4th Workshop on Argument Mining, 49–59. doi:10.18653/v1/W17-5106. URL https://www.aclweb.org/anthology/W17-5106.

7 https://api.argumentsearch.com/en/doc#cluster-api