BAM: Benchmarking Argument Mining on Scientific
Documents
Florian Ruosch¹, Cristina Sarasua¹ and Abraham Bernstein¹

¹ University of Zurich, Department of Informatics, Binzmühlestrasse 14, 8050 Zürich, Switzerland


Abstract

In this paper, we present BAM, a unified Benchmark for Argument Mining (AM). We propose a method to homogenize both the evaluation process and the data to provide a common view in order to ultimately produce comparable results. Built as a four-stage, end-to-end pipeline, the benchmark allows for the direct inclusion of additional argument miners to be evaluated. First, our system pre-processes a ground truth set used both for training and testing. Then, the benchmark calculates a total of four measures to assess different aspects of the mining process. To showcase an initial implementation of our approach, we apply our procedure and evaluate a set of systems on a corpus of scientific publications. With the obtained comparable results we can homogeneously assess the current state of AM in this domain.


ruosch@ifi.uzh.ch (F. Ruosch); sarasua@ifi.uzh.ch (C. Sarasua); bernstein@ifi.uzh.ch (A. Bernstein)
ORCID: 0000-0002-0257-3318 (F. Ruosch); 0000-0002-2076-9584 (C. Sarasua); 0000-0002-0128-4602 (A. Bernstein)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

In the last 200 years, the number of published papers per year has consistently been increasing by around 5% [1]. With this rapidly growing landscape, it becomes harder to manually navigate the seemingly unending flood of new scientific information.

One of the emerging fields addressing the machine-assisted processing of scholarly documents is Argument Mining (AM), aimed at identifying and extracting argument components (and possibly relations) from natural language texts [2]. This information is not only useful for summarization but also for detecting connections between different entities such as individual papers or outlets [3]. This kind of network has been described as the Argument Web by Bex et al. [4]: a vision where all argument data is URI-addressable and linked. If we want to work toward the automatic implementation of such a knowledge graph containing arguments from scientific publications, we first need to be able to compare the performance of existing solutions. However, there is currently no widely established, standardized AM benchmarking approach.

Lippi and Torroni [5] point out several problem areas which stand in the way of a homogeneous evaluation: the granularity of the in- and output of AM systems, the variety of genres and domains they focus on, and the representation of arguments in the evaluation data, i.e. the argument model. Additionally, as previously noted by Duthie et al. [6], a wide spectrum of different measures is in use, and these are not accurately described or appropriately applied in all cases. To address the issues above, we propose BAM, a unified approach to Benchmarking Argument Mining.

Following the AM pipeline described by Lippi and Torroni [5], we create a multi-level evaluation framework to enable the benchmarking of every task of their AM process: sentence classification, boundary detection, component classification, and relation prediction. We aim to provide a benchmarking framework that facilitates comparable results both within each stage and throughout the whole pipeline.

In this work, we present two main contributions. First and foremost, we show the concept of a unified benchmark approach for Argument Mining: BAM. To the best of our knowledge, nothing of the sort exists yet. Furthermore, we showcase a preliminary implementation of BAM by applying our benchmark to a pre-existing argument-annotated corpus of scientific publications [7]. This not only shows the feasibility of our approach but also allows us to present an initial comparison of the performance results of a range of AM systems in the domain of scholarly papers.

The remainder of this paper is structured as follows: Section 2 presents related work and Section 3 introduces our newly proposed methodology. In the ensuing Section 4, we showcase our benchmark and describe the results, before we draw conclusions in Section 5.

2. Related Work

We first explore the definition of AM and then point to an overview of efforts in the domain of scientific publications. In the second part, we address existing measures used to evaluate the performance of AM. Finally, we describe available benchmarks.
2.1. Argument Mining

Despite the different interpretations of what AM entails [8], there is the well-established information extraction approach, as popularized by several experts in the field [9, 2, 5]. Stab et al. [10] explain AM as a multistage pipeline that extracts the arguments from text, usually by first separating non-argumentative from argumentative units, then classifying the argument components and, finally, identifying their structure with relations. We adopt this definition because it fits best with our ultimate goal of creating the Argument Web of Science [4], for which we need to extract information about argumentative units and their relations. Other AM papers [11] treat the mining process as a search task to retrieve arguments from a pre-computed set according to their relevance for a query or keyword.

For a detailed overview of the literature of the last 20 years in the field of AM for scientific publications, we point the inclined reader to the survey of Al Khatib et al. [12]. They not only present an overview of the efforts made but also indicate current applications and identified challenges.

2.2. Measures

There is a wide range of Information Retrieval (IR) measures used to evaluate AM systems. Typically, IR evaluations presume the existence of a gold standard or ground truth that a proposed solution is evaluated against [13]. The F1 score (also F-score or F-measure) [14] can be used to assess the accuracy of the predictions made by a system by calculating the harmonic mean of the precision and the recall, both of which have also been applied on their own for performance evaluation. For multi-class tasks (such as AM, where we aim to identify various components or relations) different versions of F1 exist, based on how the score is averaged over the classes. The macro-F1 variant weights all classes equally for the combination into a single F1, while micro-F1 considers the number of occurrences of each label. Not only are we unable to directly compare results reported for different variants of the F-score, some literature also chooses not to include the specifics of which weighting method was employed.

Duthie et al. [6] raise the issue that traditional measures from the field of IR may over-penalize when simply applying them to each of the pipeline stages successively. For example, wrongly or not at all identified components directly influence and reduce the calculated performance of the relation prediction task. To address this and other shortcomings, they introduce the Combined Argument Similarity Score (CASS) [6]. It splits the evaluation of AM into three individual scores which are then aggregated into a single number. First, the segmentation step evaluates the similarity of different partitionings of the same text, i.e. the boundaries of the identified components. For the relation scores, these segments are aligned between annotations. Considering the Levenshtein distance [15] and also the location in the text, the components are mapped. Then, the number of correctly predicted connections (also with respect to their types) is calculated for propositional (attack, support) and dialogical (considering the speaker’s intent) relations.

Even though CASS is very flexible (i.e. scheme agnostic), it still has some drawbacks. Firstly, it assumes the existence of dialogical annotations, which is not common in current automated AM approaches. Also, there is no public implementation such that it could be put into practice. Finally, it wholly omits the component classification part of the AM pipeline by only focusing on the segmentation (i.e. boundaries of the components) of the text. By introducing our own evaluation method, we aim to remedy the points mentioned above.

2.3. Benchmarks

We found two previous works that designate themselves specifically as benchmarks.

NoDE [16] consists of a total of three data sets covering different domains: online discussions, a stage play, and the revision history of Wikipedia articles. The source for the first part were different online debate platforms which allow members to discuss controversial topics such as violent video games or abortion. Secondly, arguments were extracted from the play “Twelve Angry Men”, where a jury discusses the culpability of a young man in a murder case. The third data set resulted from comparing two different Wikipedia dumps based on the edits of the five most revised pages. All three sets were annotated by a team of two (𝜅 = 0.70 – 0.74) and, in total, they contain 792 pairs, each connecting two arguments with information about entailment. Partly, they are also annotated with support- or attack-relationships. It is of note that it is not possible to use this benchmark to evaluate the whole AM pipeline since it does not contain any information about the boundaries of arguments in continuous text.

Aharoni et al. [17] present a data set based on Wikipedia pages for a range of controversial topics. In the labeling process, they first extracted claims from the articles, followed by supporting evidences. Given that each claim is identified context-dependently, they are inherently assigned to a topic. Every evidence is then connected to a claim and given a type (study, expert, or anecdotal). The labeling was conducted by 20 in-house annotators with a Cohen’s 𝜅 of 0.39 for the claims and 0.40 for the evidences. The corpus covers a total of 33 topics with 1392 claims and 1291 evidences. Notably, all evidences are supporting and no attack relation is annotated. It also does not contain explicit information about the location of the components in the text (and thus the boundaries).
Figure 1: System vision of the different parts of the benchmark framework and their interactions.



These two works share one major drawback: instead of providing a framework including one or more evaluation measures and a state-of-the-art benchmarking methodology, they solely present a new data set that can be used as a ground truth. Thus, no uniform method to assess the performance of AM systems is established since the choice of the measure has not been fixed. We address this issue in our work.

3. BAM: A Unified Approach to Benchmarking Argument Mining

We first describe the architecture of the end-to-end benchmarking pipeline. Then, we specify the measures employed to assess performance for the different stages. Finally, we describe our argument representation unification effort.

3.1. Overview

We designed BAM, the benchmark for Argument Mining, with the goal of not only providing an easily accessible system but also of considering all aspects of AM and obtaining performance results in a unified and homogeneous way. Figure 1 outlines the end-to-end pipeline and illustrates how BAM is built on four pillars, from left to right: (1) pre-processing, (2) training, (3) execution, and (4) evaluation. The implementation was done in Python and is publicly available.¹ We provide several examples of how to integrate AM systems via the implemented Python stubs.

¹ https://gitlab.ifi.uzh.ch/DDIS-Public/bam
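To make the integration contract more concrete, the following sketch outlines what such a stub could look like: an interface mirroring the system-facing pillars, where an argument miner only implements the hooks for the stages it supports. The class and method names are illustrative and do not necessarily correspond to the stubs shipped in the repository.

    from abc import ABC, abstractmethod
    from typing import Iterable, List


    class AMSystemStub(ABC):
        """Illustrative integration stub: one subclass per argument miner.

        The benchmark drives the four pillars shown in Figure 1; a system
        only fills in the hooks for the stages it supports (training is
        optional, e.g. for externally pre-trained systems).
        """

        # AM sub-tasks addressed by the system (used by the benchmark configuration).
        covered_tasks = frozenset({"sentence classification", "boundary detection", "component classification"})

        @abstractmethod
        def preprocess(self, ground_truth_docs: Iterable[dict]) -> List[dict]:
            """(1) Convert the common ground truth corpus into the system's input format."""

        def train(self, train_docs: List[dict], dev_docs: List[dict]) -> None:
            """(2) Optional: invoke the system's training API on the prepared split."""
            # Externally pre-trained systems simply skip this step.

        @abstractmethod
        def execute(self, test_docs: List[dict]) -> List[dict]:
            """(3) Annotate the test documents and return token-level predictions."""

In this sketch, the evaluation pillar is carried out by the framework on the returned predictions rather than by the stub itself.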
With pre- and post-processing being taken care of by our framework, the system can then address the training, where applicable, and execution steps, which are both integrated into the end-to-end pipeline. We explain each of these functionalities separately below.

(1) Pre-processing This step creates a data set suitable to be processed by a given system from a common ground truth corpus, according to specified configurations and the alignment of argument representations. It is tailored to the requirements of the system to be benchmarked such that it can be used as input at any stage, be it for training or evaluation. This ensures that every system tested in the benchmark will use the same data as basis, thus allowing for comparable results. The split of the data into development, training, and test set is specified not per system but rather per corpus, ensuring comparability between systems.

(2) Training Given the prevalence of neural network approaches for AM, we included an optional training step. Here, the training API of the system to be integrated can be invoked using the specifically created data set.

(3) Execution The resulting trained model is then employed to annotate the test data set using the system’s execution API. We enabled the functionality to either reuse the intermediate results as input for the subsequent steps or to test aspects independently and inject ground truth annotations into the pipeline (e.g., relation prediction with the components as annotated in the ground truth).

(4) Evaluation This stage aligns the computed results and the ground truth annotations to ensure the data conforms to the requirements set by the evaluation functions, which expect sequences of labeled tokens. This is achieved by applying NLTK’s [18] tokenizer, where necessary. Since the system’s output may already be tokenized using an unknown technique, we have to expect differences in the labeled tokens. To address them, we match the two sequences with spaCy’s [19] implementation of the token aligner² and, thus, all of the evaluation happens uniformly on token level. Subsequently, several aspects are evaluated. Based on the AM pipeline described by Lippi and Torroni [5], our benchmarking framework assesses performance for four different tasks: argumentative sentence classification (S), boundary identification (B), argumentative component detection (C), as well as argumentative relation prediction (R). A sentence is classified as argumentative if it contains any argument component [5]. Next, the similarity of the boundaries for the (non-)argumentative segments is compared. Before the final stage, the detection and classification of the components themselves is assessed. Lastly, the predicted relations are compared to the ones annotated in the ground truth, i.e. which components are connected and how. It is important to note that we do not require every system to perform all the tasks; rather, the implementation specifies in the configuration which are covered and which are not. The details for each evaluated aspect are presented below.

² https://github.com/explosion/spacy-alignments

By relying on a modular structure, we give enough room for customizations to account for any peculiarities that systems might exhibit.

Furthermore, each system needs to specify a mapping (represented by the graph icon at the bottom of Figure 1) to create a uniform view of the argument representation and to make the results comparable. It is employed for pre-processing, to create a specific data set, and for the evaluation, to map all systems to the same argumentation scheme. By specifying the mapping with Semantic Web technologies (OWL³), we not only ensure that it is machine-readable and interoperable, but we also facilitate its extension and reuse.

³ https://www.w3.org/OWL

3.2. Evaluation Measures

Every task is treated as a (multinomial) classification. However, we use several evaluation methods because they differ slightly in the granularity and format of the data as well as their goal. We explain the measures and their reasoning for every step of the pipeline.

In the first task, the aim is to classify sentences as (non-)argumentative. If a sentence contains at least one argument component, it is defined as argumentative [5]. After extracting these annotations from the mined results as well as from the ground truth, we compare two lists of the same length with binary values using micro-F1, to ensure that a possible label imbalance does not affect the result and to weigh both classes equally. In our benchmarking framework, we apply sklearn’s implementation of the micro-F1 measure⁴ to obtain a score between 0 and 1, where bigger signifies better.

⁴ https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score

For the comparison of the component boundaries, we follow the proposition of Duthie et al. [6] and use the implementation⁵ of the segmentation evaluation [20]. The edit distance-based boundary similarity function assesses how well the results of segmentation tasks agree on a scale from 0 to 1. It compares pairs of boundaries, calculates the edit distance, and normalizes based on the segmentation length. As input, we can simply pass two sequences of (multiclass) labels assigned to the tokens and the library will identify the boundaries automatically.

⁵ https://github.com/cfournie/segmentation.evaluation

Given that the previous measure does not take the categories of the segments into account (i.e., the component types), we have to address the classification in a separate step. Based on the similarity of this task to Named Entity Recognition (NER) [12], we can employ the nervaluate package⁶, originally designed for the evaluation of NER. By treating the argumentative components as named entities, we apply the same functions and obtain the F1 through this well-established library.

⁶ https://pypi.org/project/nervaluate
The final evaluation step assesses the correctness of the predicted relations between the identified components. As pointed out by Duthie et al. [6], it is important to consider the possible double penalization since the previously detected argumentative units play a critical role. Not having identified certain components also takes away the opportunity to relate them and, thus, is not only penalized in the previous step but also has an impact on the relation prediction score. Consequently, we give the possibility to either use the argumentative units as identified by the system (i.e., the intermediate results) or to fall back on the ground truth as the input for this step. When using the computed intermediate results, we match the components to the ground truth to ensure fairness, so that the boundaries do not need to coincide exactly. Instead, we assign each identified unit to one in the ground truth if they overlap in at least one token. For components covering multiple ones in the ground truth, we select the one with the largest intersection. This not only allows for differing boundaries, it also ensures that localization information of the units is factored in. By constructing triples out of the two components and the relation (subject, predicate, object), we obtain lists of predicted and gold data. This turns the problem into identifying retrieved/missed and relevant/irrelevant triples. Therefore, we can again employ the F1 score. One caveat is that we also need to consider the symmetric nature of some relation types. By converting the data into triples, we risk not awarding a correct prediction if it is reversed (object and subject transposed) for a symmetric relation. To amend this issue, we always arrange them in such a way that the subject has the smaller identifier number than the object. Since no relation is reflexive, this results in unique triples.

3.3. Aligning Argumentation Schemes

To produce comparable results, a common view of how an argument is represented in data is necessary. This is achieved by aligning different argumentation schemes through mappings. Given the widespread adoption [5] of the claim/premise model [21] and its simplicity, we chose it for our benchmark and use the attacks- and supports-relations to connect components, with the important notion that we restrict neither the range (source) nor the domain (target) of either.

To align representations, we need two types of mappings: one for the components (claim and premise) and one for the relations (supports and attacks). There are two different scenarios: either one scheme is more complex than the other (i.e., it has more components and/or relations or has other levels of specificity) or they are the same but use a different naming convention (e.g., synonyms or similar but not identical terms such as attacks versus attack). There is also the special situation for the components that a model is so simple as to only segment text into non- and argumentative parts. In this case, we do not assess the system’s ability to classify argumentative components due to the lack of information and, thus, no mapping is necessary.

More complex schemes can be reduced to a simpler model with the concepts of equivalent- and/or subclass-of-relations. Every component and relation from the original representation is assigned to exactly zero or one corresponding element of the benchmark model, depending on whether their counterpart exists and according to their definition in the original model descriptions. Elements without a counterpart are mapped to no type since they can not be considered in the evaluation. It is important to note that no annotations are discarded since the ground truth data is recomputed for every run and, if the mapping changes, the alterations are incorporated automatically.

In the case of different naming, we only need to employ the equivalent-relation. The same concept may be called differently but still carry the identical semantics. Claims are labeled as conclusions, while premises have a plethora of names in the literature such as data, evidence, or reason [5]. Similarly, the attacks-relation is also known as contradicts. Based on the definitions, we can create a one-to-one mapping between model elements and, subsequently, a uniform view of the argument model.

4. Showcasing BAM

To illustrate the feasibility of our benchmarking framework, we showcase it with an example data set and a limited number of systems. This section first introduces the used corpus, before elaborating on the selection of argument miners. Subsequently, we explain the alignment of the different argumentation schemes and, finally, present a set of initial results.

4.1. Setup

For our showcase, we use the corpus presented by Lauscher et al. [7], currently the only available collection of fully argument-annotated scientific papers in English. The authors extend the Dr. Inventor data set [22] by annotating arguments for 40 publications in the field of computer graphics, containing 10’780 sentences in total. According to the guidelines [23], several types of components have been annotated: background claim (i.e., a claim about someone else’s work), own claim (i.e., proprietary contribution), and data (i.e., the evidence). Furthermore, they identify relations between the argumentative units: contradicts, supports, semantically same, and part of. The corpus is publicly available and can be downloaded from the project’s homepage.⁷

⁷ http://data.dws.informatik.uni-mannheim.de/sci-arg/compiled_corpus.zip
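To make the alignment described in Section 3.3 concrete for this corpus, the following rdflib sketch declares one possible mapping of the corpus’s component and relation types onto the claim/premise model. Namespaces and term names are illustrative, and the mapping shown here is only a plausible reading of the definitions above; the authoritative alignment used in the showcase is the one visualized in Figures 2 and 3.

    from rdflib import Graph, Namespace
    from rdflib.namespace import OWL, RDFS

    # Illustrative namespaces; the vocabulary used by BAM may differ.
    BAM = Namespace("https://example.org/bam#")      # unified claim/premise model
    SCI = Namespace("https://example.org/sciarg#")   # scheme of the annotated corpus [7]

    g = Graph()
    g.bind("bam", BAM)
    g.bind("sci", SCI)

    # Components: both claim types of the corpus act as claims,
    # while "data" plays the role of the premise (cf. Section 3.3).
    g.add((SCI.BackgroundClaim, RDFS.subClassOf, BAM.Claim))
    g.add((SCI.OwnClaim, RDFS.subClassOf, BAM.Claim))
    g.add((SCI.Data, OWL.equivalentClass, BAM.Premise))

    # Relations: "contradicts" corresponds to attacks, "supports" to supports.
    g.add((SCI.contradicts, OWL.equivalentProperty, BAM.attacks))
    g.add((SCI.supports, OWL.equivalentProperty, BAM.supports))
    # "Semantically same" and "part of" have no counterpart in the simple
    # claim/premise model and are therefore mapped to no type.

    print(g.serialize(format="turtle"))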
System       Training Time   Run Time
AURC         3d 12h 37m      3h 05m
TARGER       1d 06h 05m      1h 53m
TRABAM       2d 22h 41m      5h 37m
ArguminSci   -               3m
MARGOT       -               37m

Table 1: Overview of the systems included in the showcase along with training and running times.



According to our previously defined requirements, we select an initial list of systems to be included in the showcase. TARGER [24] identifies and tags argument units as claims or premises on token level from free text input. It implements a BiLSTM-CNN-CRF [25] and uses pre-computed word embeddings, such as GloVe [26]. Mayer et al. [27] present an AM approach for the domain of healthcare employing bi-directional transformers and combining them with neural networks (LSTM, GRU, CRF), which we label as TRABAM (for TRansformer-Based AM) in this paper. Not only do they address the task of identifying argument components (claim, evidence, and major claim) with a sequence tagging solution, but they also identify relations between these units phrased as a multichoice problem (attack, support, non). Trautmann et al. [28] also formulate AM as a sequence tagging problem and define the task of Argument Unit Recognition and Classification (AURC). They argue for a more fine-grained identification of spans than on sentence level. At the same time, the authors present a solution using the established sequence labeling model of Reimers et al. [29], which employs BiLSTMs in combination with word embeddings.

We include two more systems that, despite being pre-trained externally, have received attention in the state of the art due to their respective approaches. However, we do not strictly add them to the benchmarked results in order to ensure fair comparisons (i.e., of systems trained and executed uniformly and homogeneously within the framework). We consider these additions relevant to extend the range of initially available results and to demonstrate the inclusion of systems. While ArguminSci [30] is a suite of tools that enables the analysis of a range of rhetorical aspects, we solely employ the unit for argument component identification. Taking natural language text, it processes the vector representation of sentences with a pre-trained BiLSTM, feeds the results into a single-layer network, and, finally, applies a softmax classifier to identify and tag tokens as argumentative components. The three labels coincide with the ones used in the Arg-Sci corpus: own claim, background claim, and data. MARGOT [31] makes use of the information contained in the structure of sentences, identifies claims and evidences, and detects their boundaries. Employing a subset tree kernel [32], the similarity of constituency parse trees is assessed and sentences are classified accordingly as containing part of an argument.

As a baseline, we also evaluate the results of assigning the most frequent labels. Every token is outside of an argumentative component (O), and the relations are all non-existent (noRel).

As previously pointed out, we adopt the most general and widely adopted model defining claim and premise for the conceptual representation of arguments. We connect components using the attacks and supports relations. The elements of the models (i.e., concepts and relations) of the individual systems are aligned to this unifying model via relations that denote equivalence or subsumption, implemented using RDF. Figures 2 and 3 visualize the mapping for the schemes of the components and relations, respectively.

Figure 2: Mapping between different argumentation schemes for the components.

Figure 3: Mapping between different argumentation schemes for the relations.

Our experiments were executed on a Debian virtual machine with a single CPU with eight cores at 2.2 GHz and 209 GB of RAM.

4.2. Results

All systems required more than 30 hours to train and several hours to execute on the test data (see Table 1 for run times). TARGER takes the least amount of time for both training and execution, and its accuracy is similar to that of the other two systems, save for the classification of the sentences. It scores S = 0.653, which is several percentage points behind both AURC (S = 0.792) and TRABAM (S = 0.832) (see Table 2 for performance indicators). However, TARGER (B = 0.483) manages to beat AURC (B = 0.470) for the boundary identification. TRABAM (B = 0.506) still outperforms both of them in every aspect, while also exhibiting the additional functionality to predict the relations. TARGER (C = 0.656) is almost even with TRABAM (C = 0.662) for the component identification score. Still, TRABAM is the sole system performing relation prediction and achieves a score of R = 0.318 while relying on the components as annotated in the ground truth.
System                                        S       B       C       R
AURC [28]                                     0.792   0.470   -       -
TARGER [24]                                   0.653   0.483   0.656   -
TRABAM [27]                                   0.832   0.506   0.662   0.318
ArguminSci [30]                               0.600   0.115   0.091   -
MARGOT [31]                                   0.454   0.097   0.133   -
BASELINE (most frequent labels: O, noRel)     0.457   0.000   0.000   0.000

Table 2: Results of the benchmark showcase.



The two pre-trained systems achieve worse results. This comes partly as a surprise, given that at least ArguminSci was trained on the same data set. It clearly outperforms MARGOT on the sentence classification (S = 0.600 and S = 0.454, respectively), but has a similar score for boundary detection (B = 0.115 and B = 0.097) and is even beaten for the component identification (C = 0.091 and C = 0.133).

A possible explanation for ArguminSci’s poor performance is the fact that it does not always produce well-formed tags for all the chunks. These annotation errors are factored into the calculation of both the B and the C score. Naturally, the non-existent training time very much accelerates the whole pipeline and, in contrast to the other systems, pre-trained ones can annotate the whole test set in a matter of minutes instead of hours or even days.

When comparing the system results to the baseline, it can be observed that using the most frequent labels is only rewarded for the sentence classification score (S = 0.457), but yields zeroes across the board for the other individual measures. This is intended, since the benchmark is designed to only consider identified actual argumentative content (components or relations), which is useful for building a graph representation of content.

5. Conclusions

In this paper, we have presented BAM, a novel and unified approach to benchmarking Argument Mining. We described its modular architecture, consisting of four pillars (pre-processing, training, execution, and evaluation). To produce a first set of results and illustrate its application, we fully showcased our benchmarking framework with several state-of-the-art AM systems (TARGER, TRABAM, and AURC) and, partially (without training), with two other systems (MARGOT and ArguminSci).

The main insight is that it was possible to create a unified benchmark to produce comparable results. Different systems could be integrated with some additional code and, subsequently, could execute our pipeline. Our experiments showed that longer execution time does not necessarily imply better performance. Also, more specialized systems do not guarantee higher scores in the tasks they cover compared to other approaches with more capabilities. From our results, we see not only the differences among the AM tools but also between the evaluated aspects, with more complex tasks [12] resulting in lower scores. Furthermore, a gap between the best performing system and human annotators is also still evident in the domain of scholarly documents.

The biggest obstacles in both the implementation and the execution of the benchmark were the variety of applied approaches and differences in methodologies. Furthermore, the format of the in- and output varied, which necessitated a lot of custom code for every system. Ultimately, it was possible to develop an end-to-end benchmark for a handful of argument miners, which produces directly comparable results to gauge the state of the art in the field. Although these results are based on the assumption that a ground truth data set labeled with high inter-rater agreement exists ex ante, the curation of annotated data remains a challenge in AM [33]. Here, the advent of deep learning techniques and their demand for data as well as the opportunity to incorporate the crowd in the annotation process [34] should provide relief in the long term.

As future work, we plan to evaluate our proposed approach and to provide a larger list of results obtained by our benchmark to analyse the state of AM in the domain of scholarly documents. We hope our work will serve as a step toward quantifying the quality of the Argument Web [4] of Science that the current state of the art could potentially achieve.

Acknowledgments

This research has been partly funded by the Swiss National Science Foundation (SNSF) under contract number 200020_184994 (CrowdAlytics Research Project). The authors would also like to thank the anonymous reviewers for their constructive feedback.

References

[1] M. Ware, M. Mabe, The STM report: An overview of scientific and scholarly journal publishing (2015).
[2] K. Budzynska, S. Villata, Argument Mining, IEEE Intelligent Informatics Bulletin 17 (2015) 1–6.
[3] N. Green, Argumentation mining in scientific discourse, CEUR Workshop Proceedings 2048 (2017) 7–13.
[4] F. Bex, J. Lawrence, M. Snaith, C. Reed, Implementing the argument web, Communications of the ACM 56 (2013) 66–73. doi:10.1145/2500891.
[5] M. Lippi, P. Torroni, Argumentation mining: State of the art and emerging trends, ACM Transactions on Internet Technology 16 (2016) 1–25. doi:10.1145/2850417.
[6] R. Duthie, J. Lawrence, K. Budzynska, C. Reed, The CASS Technique for Evaluating the Performance of Argument Mining, Proceedings of the Third Workshop on Argument Mining (ArgMining2016) (2016) 40–49. doi:10.18653/v1/w16-2805.
[7] A. Lauscher, G. Glavaš, S. P. Ponzetto, An Argument-Annotated Corpus of Scientific Publications, Proceedings of the 5th Workshop on Argument Mining (2018) 40–46. doi:10.18653/v1/w18-5206.
[8] S. Wells, Argument Mining: Was Ist Das?, Proceedings of the 14th International Workshop on Computational Models of Natural Argument (CMNA14), Krakow, Poland (2014).
[9] P. Saint Dizier, The lexicon of argumentation for argument mining: methodological considerations, Anglophonia. French Journal of English Linguistics (2020).
[10] C. Stab, C. Kirschner, J. Eckle-Kohler, I. Gurevych, Argumentation mining in persuasive essays and scientific articles from the discourse structure perspective, CEUR Workshop Proceedings 1341 (2014).
[11] C. Stab, J. Daxenberger, C. Stahlhut, T. Miller, B. Schiller, C. Tauchmann, S. Eger, I. Gurevych, ArgumenText: Searching for Arguments in Heterogeneous Sources, Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics: demonstrations (2018) 21–25. doi:10.18653/v1/n18-5005.
[12] K. Al Khatib, T. Ghosal, Y. Hou, A. de Waard, D. Freitag, Argument Mining for Scholarly Document Processing: Taking Stock and Looking Ahead, in: Proceedings of the Second Workshop on Scholarly Document Processing, Association for Computational Linguistics, 2021, pp. 56–65. URL: https://2021.argmining.org/.
[13] H. Schütze, C. D. Manning, P. Raghavan, Introduction to information retrieval, volume 39, Cambridge University Press, Cambridge, 2008.
[14] C. van Rijsbergen, Information retrieval, 2nd ed., Butterworths, 1979.
[15] V. I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals, in: Soviet Physics-Doklady, volume 10, 1966, pp. 707–710.
[16] E. Cabrio, S. Villata, NoDE: A Benchmark of Natural Language Arguments, Frontiers in Artificial Intelligence and Applications 266 (2014) 449–450. doi:10.3233/978-1-61499-436-7-449.
[17] E. Aharoni, A. Polnarov, T. Lavee, D. Hershcovich, R. Levy, R. Rinott, D. Gutfreund, N. Slonim, A Benchmark Dataset for Automatic Detection of Claims and Evidence in the Context of Controversial Topics, Proceedings of the first workshop on argumentation mining (2014) 64–68. doi:10.3115/v1/w14-2109.
[18] E. Loper, S. Bird, NLTK: The Natural Language Toolkit, arXiv preprint cs/0205028 (2002).
[19] M. Honnibal, I. Montani, S. Van Landeghem, A. Boyd, spaCy: Industrial-strength Natural Language Processing in Python, 2020. URL: https://doi.org/10.5281/zenodo.1212303. doi:10.5281/zenodo.1212303.
[20] C. Fournier, Evaluating text segmentation using boundary edit distance, in: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2013, pp. 1702–1712.
[21] D. Walton, Argumentation Theory: A Very Short Introduction, Springer US, Boston, MA, 2009, pp. 1–22. URL: https://doi.org/10.1007/978-0-387-98197-0_1. doi:10.1007/978-0-387-98197-0_1.
[22] B. Fisas, F. Ronzano, H. Saggion, A multi-layered annotated corpus of scientific papers, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16) (2016) 3081–3088.
[23] A. Lauscher, G. Glavas, S. P. Ponzetto, K. Eckert, Annotating arguments in scientific publications, 2018.
[24] A. Chernodub, O. Oliynyk, P. Heidenreich, A. Bondarenko, M. Hagen, C. Biemann, A. Panchenko, TARGER: Neural Argument Mining at Your Fingertips, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations (2019) 195–200. doi:10.18653/v1/p19-3031.
[25] X. Ma, E. Hovy, End-to-end sequence labeling via bi-directional lstm-cnns-crf, arXiv preprint arXiv:1603.01354 (2016).
[26] J. Pennington, R. Socher, C. D. Manning, Glove: Global vectors for word representation, in: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543.
[27] T. Mayer, E. Cabrio, S. Villata, Transformer-based argument mining for healthcare applications, in: Frontiers in Artificial Intelligence and Applications, volume 325, 2020, pp. 2108–2115. doi:10.3233/FAIA200334.
[28] D. Trautmann, J. Daxenberger, C. Stab, H. Schütze, I. Gurevych, Fine-Grained Argument Unit Recognition and Classification, AAAI (2020) 9048–9056. URL: https://doi.org/10.1609/aaai.v34i05.6438.
[29] N. Reimers, B. Schiller, T. Beck, J. Daxenberger, C. Stab, I. Gurevych, Classification and Clustering of Arguments with Contextualized Word Embeddings, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Florence, Italy, 2019, pp. 567–578. URL: https://arxiv.org/abs/1906.09821.
[30] A. Lauscher, G. Glavaš, K. Eckert, ArguminSci: A Tool for Analyzing Argumentation and Rhetorical Aspects in Scientific Writing, Proceedings of the 5th Workshop on Argument Mining (2018) 22–28. doi:10.18653/v1/w18-5203.
[31] M. Lippi, P. Torroni, MARGOT: A web server for argumentation mining, Expert Systems with Applications 65 (2016) 292–303. URL: http://dx.doi.org/10.1016/j.eswa.2016.08.050. doi:10.1016/j.eswa.2016.08.050.
[32] M. Collins, N. Duffy, New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 263–270.
[33] P. Accuosto, M. Neves, H. Saggion, Argumentation mining in scientific literature: From computational linguistics to biomedicine, CEUR Workshop Proceedings (2021) 20–36.
[34] Q. V. H. Nguyen, C. T. Duong, T. T. Nguyen, M. Weidlich, K. Aberer, H. Yin, X. Zhou, Argument discovery via crowdsourcing, VLDB Journal 26 (2017) 511–535. doi:10.1007/s00778-017-0462-9.