=Paper=
{{Paper
|id=Vol-3164/paper5
|storemode=property
|title=BAM: Benchmarking Argument Mining on Scientific Documents
|pdfUrl=https://ceur-ws.org/Vol-3164/paper5.pdf
|volume=Vol-3164
|authors=Florian Ruosch,Cristina Sarasua,Abraham Bernstein
|dblpUrl=https://dblp.org/rec/conf/aaai/RuoschSB22
}}
==BAM: Benchmarking Argument Mining on Scientific Documents==
BAM: Benchmarking Argument Mining on Scientific Documents

Florian Ruosch, Cristina Sarasua and Abraham Bernstein
University of Zurich, Department of Informatics, Binzmühlestrasse 14, 8050 Zürich, Switzerland
ruosch@ifi.uzh.ch (F. Ruosch); sarasua@ifi.uzh.ch (C. Sarasua); bernstein@ifi.uzh.ch (A. Bernstein)
ORCID: 0000-0002-0257-3318 (F. Ruosch); 0000-0002-2076-9584 (C. Sarasua); 0000-0002-0128-4602 (A. Bernstein)

Abstract. In this paper, we present BAM, a unified Benchmark for Argument Mining (AM). We propose a method to homogenize both the evaluation process and the data to provide a common view and ultimately produce comparable results. Built as a four-stage, end-to-end pipeline, the benchmark allows for the direct inclusion of additional argument miners to be evaluated. First, our system pre-processes a ground truth set used both for training and testing. Then, the benchmark calculates a total of four measures to assess different aspects of the mining process. To showcase an initial implementation of our approach, we apply our procedure and evaluate a set of systems on a corpus of scientific publications. With the obtained comparable results we can homogeneously assess the current state of AM in this domain.

1. Introduction

In the last 200 years, the number of published papers per year has consistently been increasing by around 5% [1]. With this rapidly growing landscape, it becomes harder to manually navigate the seemingly unending flood of new scientific information.

One of the emerging fields addressing the machine-assisted processing of scholarly documents is Argument Mining (AM), aimed at identifying and extracting argument components (and possibly relations) from natural language texts [2]. This information is not only useful for summarization but also for detecting connections between different entities such as individual papers or outlets [3]. This kind of network has been described as the Argument Web by Bex et al. [4], a vision where all argument data is URI-addressable and linked. If we want to work toward the automatic construction of such a knowledge graph containing arguments from scientific publications, we first need to be able to compare the performance of existing solutions. However, there is currently no widely established, standardized AM benchmarking approach.

Lippi and Torroni [5] point out several problem areas which stand in the way of a homogeneous evaluation: the granularity of the in- and output of AM systems, the variety of genres and domains they focus on, and the representation of arguments in the evaluation data, i.e. the argument model. Additionally, as previously noted by Duthie et al. [6], a wide spectrum of different measures is in use, and these are not accurately described or appropriately applied in all cases. To address the issues above, we propose BAM, a unified approach to Benchmarking Argument Mining.

Following the AM pipeline described by Lippi and Torroni [5], we create a multi-level evaluation framework to enable the benchmarking of every task of their AM process: sentence classification, boundary detection, component classification, and relation prediction. We aim to provide a benchmarking framework that facilitates comparable results both within each stage and throughout the whole pipeline.

In this work, we present two main contributions. First and foremost, we show the concept of a unified benchmark approach for Argument Mining: BAM. To the best of our knowledge, nothing of the sort exists yet. Furthermore, we showcase a preliminary implementation of BAM by applying our benchmark to a pre-existing argument-annotated corpus of scientific publications [7]. This allows us not only to show the feasibility of our approach but also to present an initial comparison of the performance results of a range of AM systems in the domain of scholarly papers.

The remainder of this paper is structured as follows: Section 2 presents related work and Section 3 introduces our newly proposed methodology.
In the ensuing Section 4, we showcase our benchmark and describe the results, before we draw conclusions in Section 5.

2. Related Work

We first explore the definition of AM and then point to an overview of efforts in the domain of scientific publications. In the second part, we address existing measures used to evaluate the performance of AM. Finally, we describe available benchmarks.

2.1. Argument Mining

Despite the different interpretations of what AM entails [8], there is the well-established information extraction approach, as popularized by several experts in the field [9, 2, 5]. Stab et al. [10] explain AM as a multistage pipeline that extracts the arguments from text, usually by first separating non-argumentative from argumentative units, then classifying the argument components and, finally, identifying their structure with relations. We adopt this definition because it fits best with our ultimate goal of creating the Argument Web of Science [4], for which we need to extract information about argumentative units and their relations. Other AM papers [11] treat the mining process as a search task to retrieve arguments from a pre-computed set according to their relevance for a query or keyword.

For a detailed overview of the literature of the last 20 years in the field of AM for scientific publications, we point the inclined reader to the survey of Al Khatib et al. [12]. They not only present an overview of the efforts made but also indicate current applications and identified challenges.

2.2. Measures

There is a wide range of Information Retrieval (IR) measures used to evaluate AM systems. Typically, IR evaluations presume the existence of a gold standard or ground truth that a proposed solution is evaluated against [13]. The F1 (also F-score or F-measure) [14] can be used to assess the accuracy of the predictions made by a system by calculating the harmonic mean of the precision and the recall, both of which have also been applied on their own for performance evaluation. For multi-class tasks (such as AM, where we aim to identify various components or relations) different versions of F1 exist, based on how the score is averaged over the classes. The macro-F1 variant weights all classes equally for the combination into a single F1, while micro-F1 considers the number of occurrences of each label. Not only are we unable to directly compare results reported for different variants of the F-score, some literature also chooses not to include the specifics of which weighting method was employed.

Duthie et al. [6] raise the issue that traditional measures from the field of IR may over-penalize when simply applied to each of the pipeline stages successively. For example, wrongly or not at all identified components directly influence and reduce the calculated performance of the relation prediction task. To address this and other shortcomings, they introduce the Combined Argument Similarity Score (CASS) [6]. It splits the evaluation of AM into three individual scores which are then aggregated into a single number. First, the segmentation step evaluates the similarity of different partitionings of the same text, i.e. the boundaries of the identified components. For the relation scores, these segments are aligned between annotations. Considering the Levenshtein distance [15] and also the location in the text, the components are mapped. Then, the number of correctly predicted connections (also with respect to their types) is calculated for propositional (attack, support) and dialogical (considering the speaker's intent) relations.

Even though CASS is very flexible (i.e. scheme agnostic), it still has some drawbacks. Firstly, it assumes the existence of dialogical annotations, which is not common in current automated AM approaches. Also, there is no public implementation such that it could be put into practice. Finally, it wholly omits the component classification part of the AM pipeline by only focusing on the segmentation (i.e. boundaries of the components) of the text. By introducing our own evaluation method, we aim to remedy the points mentioned above.
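To make the earlier point about F-score variants concrete, the following small scikit-learn sketch (with invented labels, not data from any of the cited works) shows how micro- and macro-averaged F1 diverge under label imbalance:

```python
# Illustration of the F1-averaging issue discussed above: on imbalanced labels,
# micro- and macro-averaged F1 can differ substantially, so scores reported
# without naming the averaging method are hard to compare.
from sklearn.metrics import f1_score

gold = ["claim"] * 8 + ["premise"] * 2
pred = ["claim"] * 10                # the rare class is never predicted

print(f1_score(gold, pred, average="micro", zero_division=0))  # 0.8
print(f1_score(gold, pred, average="macro", zero_division=0))  # ~0.44
```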
2.3. Benchmarks

We found two previous works that designate themselves specifically as benchmarks.

NoDE [16] consists of a total of three data sets covering different domains: online discussions, a stage play, and the revision history of Wikipedia articles. The source for the first part were different online debate platforms which allow members to discuss controversial topics such as violent video games or abortion. Secondly, arguments were extracted from the play "Twelve Angry Men", where a jury discusses the culpability of a young man in a murder case. The third data set resulted from comparing two different Wikipedia dumps based on the edits of the five most revised pages. All three sets were annotated by a team of two (κ = 0.70–0.74) and, in total, they contain 792 pairs, each connecting two arguments with information about entailment. Partly, they are also annotated with support- or attack-relationships. It is of note that it is not possible to use this benchmark to evaluate the whole AM pipeline since it does not contain any information about the boundaries of arguments in continuous text.

Aharoni et al. [17] present a data set based on Wikipedia pages for a range of controversial topics. In the labeling process, they first extracted claims from the articles, followed by supporting evidences. Given that each claim is identified context-dependently, they are inherently assigned to a topic. Every evidence is then connected to a claim and given a type (study, expert, or anecdotal). The labeling was conducted by 20 in-house annotators with a Cohen's κ of 0.39 for the claims and 0.40 for the evidences. The corpus covers a total of 33 topics with 1392 claims and 1291 evidences. Notably, all evidences are supporting and no attack relation is annotated. It also does not contain explicit information about the location of the components in the text (and thus the boundaries).

Figure 1: System vision of the different parts of the benchmark framework and their interactions.

These two works share one major drawback: instead of providing a framework including one or more evaluation measures and a state-of-the-art benchmarking methodology, they solely present a new data set that can be used as a ground truth. Thus, no uniform method to assess the performance of AM systems is established, since the choice of the measure has not been fixed. We address this issue in our work.

3. BAM: A Unified Approach to Benchmarking Argument Mining

We first describe the architecture of the end-to-end benchmarking pipeline. Then, we specify the measures employed to assess performance for the different stages. Finally, we describe our argument representation unification effort.

3.1. Overview

We designed BAM, the benchmark for Argument Mining, with the goal of not only providing an easy-to-access system but also considering all aspects of AM and obtaining performance results in a unified and homogeneous way. Figure 1 outlines the end-to-end pipeline and illustrates how BAM is built on four pillars, from left to right: (1) pre-processing, (2) training, (3) execution, and (4) evaluation. The implementation was done in Python and is publicly available (https://gitlab.ifi.uzh.ch/DDIS-Public/bam). We provide several examples of how to integrate AM systems via the implemented Python stubs. With pre- and post-processing being taken care of by our framework, the system can then address the training step, where applicable, and the execution step, which are both integrated into the end-to-end pipeline. We explain each of these functionalities separately below.

(1) Pre-processing. This step creates a data set suitable to be processed by a given system from a common ground truth corpus, according to specified configurations and the alignment of argument representations. It is tailored to the requirements of the system to be benchmarked such that it can be used as input at any stage, be it for training or evaluation. This ensures that every system tested in the benchmark will use the same data as its basis, thus allowing for comparable results. The split of the data into development, training, and test set is specified not per system but rather per corpus, ensuring comparability between systems.

(2) Training. Given the prevalence of neural network approaches for AM, we included an optional training step. Here, the training API of the system to be integrated can be invoked using the specifically created data set.

(3) Execution. The resulting trained model is then employed to annotate the test data set using the system's execution API. We enabled the functionality to either reuse the intermediate results as input for the subsequent steps or to test aspects independently and inject ground truth annotations into the pipeline (e.g., relation prediction with the components as annotated in the ground truth).

(4) Evaluation. This stage aligns the computed results and the ground truth annotations to ensure the data conforms to the requirements set by the evaluation functions, which expect sequences of labeled tokens. This is achieved by applying NLTK's [18] tokenizer, where necessary. Since the system's output may already be tokenized using an unknown technique, we have to expect differences in the labeled tokens. To address them, we match the two sequences with spaCy's [19] implementation of the token aligner (https://github.com/explosion/spacy-alignments) and, thus, all of the evaluation happens uniformly on token-level.
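A small sketch of this alignment step, assuming the get_alignments helper exposed by the spacy-alignments package and NLTK's word tokenizer; the tokens below are invented, not benchmark data:

```python
# Token-alignment sketch: tokenize the ground truth with NLTK and project a
# system's own tokenization onto it so labels can be compared token by token.
# Assumes spacy-alignments exposes get_alignments as documented on its project
# page and that NLTK's 'punkt' tokenizer data is installed.
from nltk.tokenize import word_tokenize
from spacy_alignments import get_alignments

gold_tokens = word_tokenize("We claim that our method outperforms the baseline.")
system_tokens = ["We", "claim", "that", "our", "method", "out", "performs", "the", "baseline", "."]

gold2sys, sys2gold = get_alignments(gold_tokens, system_tokens)
for i, token in enumerate(gold_tokens):
    # gold2sys[i] lists the system tokens covering gold token i, e.g.
    # "outperforms" -> ["out", "performs"], so system labels can be projected
    # onto the gold tokenization before computing the four scores.
    print(token, [system_tokens[j] for j in gold2sys[i]])
```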
Subsequently, several aspects are evaluated. Based on the AM pipeline described by Lippi and Torroni [5], our benchmarking framework assesses performance for four different tasks: argumentative sentence classification (S), boundary identification (B), argumentative component detection (C), as well as argumentative relation prediction (R). A sentence is classified as argumentative if it contains any argument component [5]. Next, the similarity of the boundaries for the (non-)argumentative segments is compared. Before the final stage, the detection and classification of the components themselves is assessed. Lastly, the predicted relations are compared to the ones annotated in the ground truth, i.e. which components are connected and how. It is important to note that we do not require every system to perform all the tasks, but rather the implementation specifies in the configuration which are covered and which are not. The details for each evaluated aspect are presented below.

By relying on a modular structure, we give enough room for customizations to account for any peculiarities that systems might exhibit. Furthermore, each system needs to specify a mapping (represented by the graph icon at the bottom of Figure 1) to create a uniform view of the argument representation and to make the results comparable. It is employed for the pre-processing, to create a specific data set, and for the evaluation, to map all systems to the same argumentation scheme. By specifying the mapping with Semantic Web technologies (OWL, https://www.w3.org/OWL), we not only ensure that it is machine-readable and interoperable, but we also facilitate its extension and reuse.
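The public Python stubs themselves are not reproduced here, but the following hypothetical adapter sketches the kind of per-system interface this modular structure implies; the class and method names are illustrative assumptions, not the actual BAM API:

```python
# Hypothetical integration stub: NOT the actual BAM API, just a sketch of the
# per-system adapter suggested by the modular structure described above.
from abc import ABC, abstractmethod
from typing import List, Optional

Tokens = List[str]   # one document as a list of tokens
Labels = List[str]   # one BIO-style label per token, e.g. "B-claim", "O"


class ArgumentMinerAdapter(ABC):
    """Minimal contract a system could implement to join the four-stage pipeline."""

    @abstractmethod
    def preprocess(self, corpus: List[str]) -> List[Tokens]:
        """Turn ground-truth documents into the system's expected input format."""

    def train(self, data: List[Tokens], gold: List[Labels]) -> None:
        """Optional: call the system's training API (pre-trained systems skip this)."""

    @abstractmethod
    def execute(self, data: List[Tokens]) -> List[Labels]:
        """Annotate the test documents and return token-level component labels."""

    def predict_relations(self, data: List[Tokens], components: List[Labels]) -> Optional[list]:
        """Optional: return (source, relation, target) predictions, or None."""
        return None
```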
3.2. Evaluation Measures

Every task is treated as a (multinomial) classification. However, we use several evaluation methods because they differ slightly in the granularity and format of the data as well as their goal. We explain the measures and their reasoning for every step of the pipeline.

In the first task, the aim is to classify sentences as (non-)argumentative. If a sentence contains at least one argument component, it is defined as argumentative [5]. After extracting these annotations from the mined results as well as from the ground truth, we compare two lists of the same length with binary values using micro-F1, to ensure that a possible label imbalance does not affect the result and to weigh both classes equally. In our benchmarking framework, we apply sklearn's implementation of the micro-F1 measure (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score) to obtain a score between 0 and 1, where bigger signifies better.
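As a concrete illustration of this first score, a minimal sketch with invented sentence labels rather than actual benchmark output:

```python
# Sentence classification score (S): binary per-sentence labels compared with
# scikit-learn's micro-averaged F1; every sentence is counted once.
from sklearn.metrics import f1_score

gold      = [1, 0, 1, 1, 0, 0]   # 1 = sentence contains an argument component
predicted = [1, 0, 0, 1, 1, 0]
print(f"S = {f1_score(gold, predicted, average='micro'):.3f}")
```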
For the comparison of the component boundaries, we follow the proposition of Duthie et al. [6] and use the implementation of the segmentation evaluation [20] (https://github.com/cfournie/segmentation.evaluation). The edit distance-based boundary similarity function assesses how well the results of segmentation tasks agree on a scale from 0 to 1. It compares pairs of boundaries, calculates the edit distance, and normalizes based on the segmentation length. As input, we can simply pass two sequences of (multiclass) labels assigned to the tokens and the library will identify the boundaries automatically.
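A minimal sketch of the boundary score (B), assuming segeval's documented boundary_similarity function and converting token labels to segment masses by hand; BAM's own wrapper may do this differently, and the labels are invented:

```python
# Boundary score (B) sketch: collapse token labels into segment masses (a new
# segment starts whenever the label changes) and compare the two segmentations
# with segeval's boundary similarity.
import segeval

def labels_to_masses(labels):
    """Turn a non-empty token-label sequence into segment lengths (masses)."""
    masses, run = [], 1
    for prev, cur in zip(labels, labels[1:]):
        if cur == prev:
            run += 1
        else:
            masses.append(run)
            run = 1
    masses.append(run)
    return tuple(masses)

gold = ["O", "O", "claim", "claim", "claim", "O", "premise", "premise"]
pred = ["O", "O", "claim", "claim", "O", "O", "premise", "premise"]
b = segeval.boundary_similarity(labels_to_masses(gold), labels_to_masses(pred))
print(f"B = {float(b):.3f}")  # 1.0 would mean identical boundaries
```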
Given that the previous measure does not take the categories of the segments into account (i.e., the component types), we have to address the classification in a separate step. Based on the similarity of this task to Named Entity Recognition (NER) [12], we can employ the nervaluate package (https://pypi.org/project/nervaluate), originally designed for the evaluation of NER. By treating the argumentative components as named entities, we apply the same functions and obtain the F1 through this well-established library.
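A sketch of this component score (C) using nervaluate's list loader as shown in its README; the exact structure of the result dictionary may vary between package versions, and the BIO sequences are invented:

```python
# Component score (C) sketch: argument components as BIO-tagged "entities",
# scored with nervaluate; precision and recall of the strict matching scheme
# are combined into an F1 here.
from nervaluate import Evaluator

gold = [["B-claim", "I-claim", "O", "B-data", "I-data", "O"]]
pred = [["B-claim", "I-claim", "O", "B-data", "O", "O"]]

evaluator = Evaluator(gold, pred, tags=["claim", "data"], loader="list")
results, results_by_tag = evaluator.evaluate()

p, r = results["strict"]["precision"], results["strict"]["recall"]
c = 2 * p * r / (p + r) if p + r else 0.0
print(f"C = {c:.3f}")
```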
The final evaluation step assesses the correctness of the predicted relations between the identified components. As pointed out by Duthie et al. [6], it is important to consider the possible double penalization, since the previously detected argumentative units play a critical role: not having identified certain components also takes away the opportunity to relate them and is thus not only penalized in the previous step but also has an impact on the relation prediction score. Consequently, we give the possibility to either use the argumentative units as identified by the system (i.e., the intermediate results) or to fall back on the ground truth as the input for this step. When using the computed intermediate results, we match the components to the ground truth to ensure fairness, so that the boundaries do not need to coincide exactly. Instead, we assign each identified unit to one in the ground truth if they overlap in at least one token. For components covering multiple ones in the ground truth, we select the one with the largest intersection. This not only allows for differing boundaries, it also ensures that localization information of the units is factored in.

By constructing triples out of the two components and the relation (subject, predicate, object), we obtain lists of predicted and gold data. This turns the problem into identifying retrieved/missed and relevant/irrelevant triples. Therefore, we can again employ the F1-score. One caveat is that we also need to consider the symmetric nature of some relation types. By converting the data into triples, we risk not awarding a correct prediction if it is reversed (object and subject transposed) for a symmetric relation. To amend this issue, we always arrange them in such a way that the subject has the smaller identifier number than the object. Since no relation is reflexive, this results in unique triples.
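The canonicalisation described here is easy to sketch in plain Python; the component ids and relation names below are illustrative:

```python
# Relation score (R) sketch: build (subject, relation, object) triples, order
# symmetric relations so the smaller component id comes first, and compute F1
# over the resulting sets.
SYMMETRIC = {"semantically_same"}

def canonical(subj, rel, obj):
    if rel in SYMMETRIC and subj > obj:
        subj, obj = obj, subj
    return (subj, rel, obj)

gold = {canonical(1, "supports", 2), canonical(4, "semantically_same", 3)}
pred = {canonical(1, "supports", 2), canonical(3, "semantically_same", 4)}

tp = len(gold & pred)
precision = tp / len(pred) if pred else 0.0
recall = tp / len(gold) if gold else 0.0
r = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"R = {r:.3f}")  # the reversed symmetric triple still counts as correct
```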
3.3. Aligning Argumentation Schemes

To produce comparable results, a common view of how an argument is represented in data is necessary. This is achieved by aligning different argumentation schemes through mappings. Given the widespread adoption [5] of the claim/premise model [21] and its simplicity, we chose it for our benchmark and use the attacks- and supports-relations to connect components, with the important note that we restrict neither the range (source) nor the domain (target) of either relation.

To align representations, we need two types of mappings: one for the components (claim and premise) and one for the relations (supports and attacks). There are two different scenarios: either one scheme is more complex than the other (i.e., it has more components and/or relations or has other levels of specificity), or they are the same but use a different naming convention (e.g., synonyms or similar but not identical terms such as attacks versus attack). There is also the special case that a component model is so simple that it only segments text into non-argumentative and argumentative parts. In this case, we do not assess the system's ability to classify argumentative components due to the lack of information and, thus, no mapping is necessary.

More complex schemes can be reduced to a simpler model with the concepts of equivalent- and/or subclass-of-relations. Every component and relation from the original representation is assigned to exactly zero or one corresponding element of the benchmark model, depending on whether a counterpart exists and according to their definition in the original model descriptions. Elements without a counterpart are mapped to no type since they cannot be considered in the evaluation. It is important to note that no annotations are discarded, since the ground truth data is recomputed for every run and, if the mapping changes, the alterations are incorporated automatically.

In the case of differing naming, we only need to employ the equivalent-relation. The same concept may be called differently but still carry identical semantics. Claims are labeled as conclusions, while premises have a plethora of names in the literature such as data, evidence, or reason [5]. Similarly, the attacks-relation is also known as contradicts. Based on the definitions, we can create a one-to-one mapping between model elements and, subsequently, a uniform view of the argument model.
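Section 4 notes that these mappings are implemented with RDF; the following rdflib sketch shows how such equivalence and subsumption statements could be expressed, with hypothetical namespaces and names rather than the mapping actually shipped with BAM:

```python
# Sketch of a scheme mapping in RDF: owl:equivalentClass / equivalentProperty
# for renamings and rdfs:subClassOf for reductions to a simpler model.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDFS

BAM = Namespace("http://example.org/bam#")      # hypothetical benchmark vocabulary
SYS = Namespace("http://example.org/system#")   # hypothetical system vocabulary

g = Graph()
g.add((SYS.BackgroundClaim, RDFS.subClassOf, BAM.Claim))      # more specific component
g.add((SYS.OwnClaim, RDFS.subClassOf, BAM.Claim))
g.add((SYS.Data, OWL.equivalentClass, BAM.Premise))           # different name, same meaning
g.add((SYS.contradicts, OWL.equivalentProperty, BAM.attacks)) # relation renaming

print(g.serialize(format="turtle"))
```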
4. Showcasing BAM

To illustrate the feasibility of our benchmarking framework, we showcase it with an example data set and a limited number of systems. This section first introduces the used corpus, before elaborating on the selection of argument miners. Subsequently, we explain the alignment of the different argumentation schemes and, finally, present a set of initial results.

4.1. Setup

For our showcase, we use the corpus presented by Lauscher et al. [7], currently the only available collection of fully argument-annotated scientific papers in English. The authors extend the Dr. Inventor data set [22] by annotating arguments for 40 publications in the field of computer graphics, containing 10,780 sentences in total. According to the guidelines [23], several types of components have been annotated: background claim (i.e., a claim about someone else's work), own claim (i.e., a proprietary contribution), and data (i.e., the evidence). Furthermore, they identify relations between the argumentative units: contradicts, supports, semantically same, and part of. The corpus is publicly available and can be downloaded from the project's homepage (http://data.dws.informatik.uni-mannheim.de/sci-arg/compiled_corpus.zip).

According to our previously defined requirements, we select an initial list of systems to be included in the showcase. TARGER [24] identifies and tags argument units as claims or premises on token-level from free-text input. It implements a BiLSTM-CNN-CRF [25] and uses pre-computed word embeddings, such as GloVe [26]. Mayer et al. [27] present an AM approach for the domain of healthcare employing bi-directional transformers and combining them with neural networks (LSTM, GRU, CRF), which we label as TRABAM (for TRansformer-Based AM) in this paper. Not only do they address the task of identifying argument components (claim, evidence, and major claim) with a sequence tagging solution, but they also identify relations between these units, phrased as a multi-choice problem (attack, support, none). Trautmann et al. [28] also formulate AM as a sequence tagging problem and define the task of Argument Unit Recognition and Classification (AURC). They argue for a more fine-grained identification of spans than on sentence-level. At the same time, the authors present a solution using the established sequence labeling model of Reimers et al. [29], which employs BiLSTMs in combination with word embeddings.

We include two more systems that, despite being pre-trained externally, have received attention in the state of the art due to their respective approaches. However, we do not strictly add them to the benchmarked results in order to ensure fair comparisons (i.e., of systems trained and executed uniformly and homogeneously within the framework). We consider these additions relevant to extend the range of initially available results and to demonstrate the inclusion of systems. While ArguminSci [30] is a suite of tools that enable the analysis of a range of rhetorical aspects, we solely employ the unit for argument component identification. Taking natural language text, it processes the vector representation of sentences with a pre-trained BiLSTM, feeds the results into a single-layer network, and, finally, applies a softmax classifier to identify and tag tokens as argumentative components. The three labels coincide with the ones used in the ArgSci corpus: own claim, background claim, and data. MARGOT [31] makes use of the information contained in the structure of sentences, identifies claims and evidences, and detects their boundaries. Employing a subset tree kernel [32], the similarity of constituency parse trees is assessed and sentences are classified accordingly as containing part of an argument.

As a baseline, we also evaluate the results of assigning the most frequent labels: every token is outside of an argumentative component (O), and the relations are all non-existent (noRel).

As previously pointed out, we adopt the most general and widely-adopted model defining claim and premise for the conceptual representation of arguments. We connect components using the attacks and supports relations. The elements of the models (i.e., concepts and relations) of the individual systems are aligned to this unifying model via relations that denote equivalence or subsumption, implemented using RDF. Figures 2 and 3 visualize the mapping for the schemes of the components and relations, respectively.

Figure 2: Mapping between different argumentation schemes for the components.
Figure 3: Mapping between different argumentation schemes for the relations.

Our experiments were executed on a Debian virtual machine with a single CPU with eight cores at 2.2 GHz and 209 GB of RAM.

Table 1: Overview of the systems included in the showcase along with training and running times.

System     | Training Time | Run Time
AURC       | 3d 12h 37m    | 3h 05m
TARGER     | 1d 06h 05m    | 1h 53m
TRABAM     | 2d 22h 41m    | 5h 37m
ArguminSci | -             | 3m
MARGOT     | -             | 37m

4.2. Results

All trained systems required more than 30 hours to train and several hours to execute on the test data (see Table 1 for run times). TARGER takes the least amount of time for both training and execution, and its accuracy is similar to that of the other two systems, save for the classification of the sentences. It scores S = 0.653, which is several percentage points behind both AURC (S = 0.792) and TRABAM (S = 0.832) (see Table 2 for the performance indicators). However, TARGER (B = 0.483) manages to beat AURC (B = 0.470) for the boundary identification. TRABAM still outperforms both of them (B = 0.506) in every aspect, while also exhibiting the additional functionality to predict the relations. TARGER (C = 0.656) is almost even with TRABAM (C = 0.662) for the component identification score. Still, TRABAM is the sole system performing relation prediction (R = 0.318) and does so while relying on the components as annotated in the ground truth.

The two pre-trained systems achieve worse results. This comes partly as a surprise, given that at least ArguminSci was trained on the same data set. It clearly outperforms MARGOT on the sentence classification (S = 0.600 and S = 0.454, respectively), but has a similar score for boundary detection (B = 0.115 and B = 0.097) and is even beaten for the component identification (C = 0.091 and C = 0.133).

A possible explanation for ArguminSci's poor performance is the fact that it does not always produce well-formed tags for all the chunks. These annotation errors are factored into the calculation of both the B and the C score. Naturally, the non-existent training time very much accelerates the whole pipeline, and in contrast to the other systems, pre-trained ones can annotate the whole test set in a matter of minutes instead of hours or even days.

When comparing the system results to the baseline, it can be observed that using the most frequent labels is only rewarded for the sentence classification score (S = 0.457), but yields zeroes across the board for the other individual measures. This is intended, since the benchmark is designed to only consider identified actual argumentative content (components or relations), which is useful for building a graph representation of content.

Table 2: Results of the benchmark showcase.

System                                    | S     | B     | C     | R
AURC [28]                                 | 0.792 | 0.470 | -     | -
TARGER [24]                               | 0.653 | 0.483 | 0.656 | -
TRABAM [27]                               | 0.832 | 0.506 | 0.662 | 0.318
ArguminSci [30]                           | 0.600 | 0.115 | 0.091 | -
MARGOT [31]                               | 0.454 | 0.097 | 0.133 | -
BASELINE: most frequent labels (O, noRel) | 0.457 | 0.000 | 0.000 | 0.000

5. Conclusions

In this paper, we have presented BAM, a novel and unified approach to benchmarking Argument Mining. We described its modular architecture, consisting of four pillars (pre-processing, training, execution, and evaluation). To produce a first set of results and illustrate its application, we fully showcased our benchmarking framework with several state-of-the-art AM systems (TARGER, TRABAM, and AURC) and, partially (without training), two other systems (MARGOT and ArguminSci).

The main insight is that it was possible to create a unified benchmark to produce comparable results. Different systems could be integrated with some additional code and, subsequently, could execute our pipeline. Our experiments showed that longer execution time does not necessarily imply better performance. Also, more specialized systems do not guarantee higher scores in the tasks they cover compared to other approaches with more capabilities. From our results, we see not only the differences among the AM tools but also between the evaluated aspects, with more complex tasks [12] resulting in lower scores. Furthermore, a gap between the best performing system and human annotators is also still evident in the domain of scholarly documents.

The biggest obstacles in both the implementation and the execution of the benchmark were the variety of applied approaches and the differences in methodologies. Furthermore, the format of the in- and output varied, which necessitated a lot of custom code for every system. Ultimately, it was possible to develop an end-to-end benchmark for a handful of argument miners, which produces directly comparable results to gauge the state of the art in the field. Although these results are based on the assumption that a ground truth data set labeled with high inter-rater agreement exists ex ante, the curation of annotated data remains a challenge in AM [33]. Here, the advent of deep learning techniques and their demand for data, as well as the opportunity to incorporate the crowd in the annotation process [34], should provide relief in the long term.

As future work, we plan to further evaluate our proposed approach and to provide a larger list of results obtained by our benchmark to analyse the state of AM in the domain of scholarly documents. We hope our work will serve as a step toward quantifying the quality of the Argument Web [4] of Science that the current state of the art could potentially achieve.

Acknowledgments

This research has been partly funded by the Swiss National Science Foundation (SNSF) under contract number 200020_184994 (CrowdAlytics Research Project). The authors would also like to thank the anonymous reviewers for their constructive feedback.

References

[1] M. Ware, M. Mabe, The STM report: An overview of scientific and scholarly journal publishing (2015).
[2] K. Budzynska, S. Villata, Argument Mining, IEEE Intelligent Informatics Bulletin 17 (2015) 1–6.
[3] N. Green, Argumentation mining in scientific discourse, CEUR Workshop Proceedings 2048 (2017) 7–13.
[4] F. Bex, J. Lawrence, M. Snaith, C. Reed, Implementing the argument web, Communications of the ACM 56 (2013) 66–73. doi:10.1145/2500891.
[5] M. Lippi, P. Torroni, Argumentation mining: State of the art and emerging trends, ACM Transactions on Internet Technology 16 (2016) 1–25. doi:10.1145/2850417.
[6] R. Duthie, J. Lawrence, K. Budzynska, C. Reed, The CASS technique for evaluating the performance of argument mining, Proceedings of the Third Workshop on Argument Mining (ArgMining2016) (2016) 40–49. doi:10.18653/v1/w16-2805.
[7] A. Lauscher, G. Glavaš, S. P. Ponzetto, An argument-annotated corpus of scientific publications, Proceedings of the 5th Workshop on Argument Mining (2018) 40–46. doi:10.18653/v1/w18-5206.
[8] S. Wells, Argument mining: Was ist das?, Proceedings of the 14th International Workshop on Computational Models of Natural Argument (CMNA14), Krakow, Poland (2014).
[9] P. Saint-Dizier, The lexicon of argumentation for argument mining: methodological considerations, Anglophonia. French Journal of English Linguistics (2020).
[10] C. Stab, C. Kirschner, J. Eckle-Kohler, I. Gurevych, Argumentation mining in persuasive essays and scientific articles from the discourse structure perspective, CEUR Workshop Proceedings 1341 (2014).
[11] C. Stab, J. Daxenberger, C. Stahlhut, T. Miller, B. Schiller, C. Tauchmann, S. Eger, I. Gurevych, ArgumenText: Searching for arguments in heterogeneous sources, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations (2018) 21–25. doi:10.18653/v1/n18-5005.
[12] K. Al Khatib, T. Ghosal, Y. Hou, A. de Waard, D. Freitag, Argument mining for scholarly document processing: Taking stock and looking ahead, in: Proceedings of the Second Workshop on Scholarly Document Processing, Association for Computational Linguistics, 2021, pp. 56–65. URL: https://2021.argmining.org/.
[13] H. Schütze, C. D. Manning, P. Raghavan, Introduction to Information Retrieval, volume 39, Cambridge University Press, Cambridge, 2008.
[14] C. van Rijsbergen, Information Retrieval, 2nd ed., Butterworths, 1979.
[15] V. I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals, in: Soviet Physics-Doklady, volume 10, 1966, pp. 707–710.
[16] E. Cabrio, S. Villata, NoDE: A benchmark of natural language arguments, Frontiers in Artificial Intelligence and Applications 266 (2014) 449–450. doi:10.3233/978-1-61499-436-7-449.
[17] E. Aharoni, A. Polnarov, T. Lavee, D. Hershcovich, R. Levy, R. Rinott, D. Gutfreund, N. Slonim, A benchmark dataset for automatic detection of claims and evidence in the context of controversial topics, Proceedings of the First Workshop on Argumentation Mining (2014) 64–68. doi:10.3115/v1/w14-2109.
[18] E. Loper, S. Bird, NLTK: The Natural Language Toolkit, arXiv preprint cs/0205028 (2002).
[19] M. Honnibal, I. Montani, S. Van Landeghem, A. Boyd, spaCy: Industrial-strength natural language processing in Python, 2020. URL: https://doi.org/10.5281/zenodo.1212303. doi:10.5281/zenodo.1212303.
[20] C. Fournier, Evaluating text segmentation using boundary edit distance, in: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2013, pp. 1702–1712.
[21] D. Walton, Argumentation theory: A very short introduction, Springer US, Boston, MA, 2009, pp. 1–22. URL: https://doi.org/10.1007/978-0-387-98197-0_1. doi:10.1007/978-0-387-98197-0_1.
[22] B. Fisas, F. Ronzano, H. Saggion, A multi-layered annotated corpus of scientific papers, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16) (2016) 3081–3088.
[23] A. Lauscher, G. Glavaš, S. P. Ponzetto, K. Eckert, Annotating arguments in scientific publications, 2018.
[24] A. Chernodub, O. Oliynyk, P. Heidenreich, A. Bondarenko, M. Hagen, C. Biemann, A. Panchenko, TARGER: Neural argument mining at your fingertips, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations (2019) 195–200. doi:10.18653/v1/p19-3031.
[25] X. Ma, E. Hovy, End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF, arXiv preprint arXiv:1603.01354 (2016).
[26] J. Pennington, R. Socher, C. D. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[27] T. Mayer, E. Cabrio, S. Villata, Transformer-based argument mining for healthcare applications, in: Frontiers in Artificial Intelligence and Applications, volume 325, 2020, pp. 2108–2115. doi:10.3233/FAIA200334.
[28] D. Trautmann, J. Daxenberger, C. Stab, H. Schütze, I. Gurevych, Fine-grained argument unit recognition and classification, AAAI (2020) 9048–9056. URL: https://doi.org/10.1609/aaai.v34i05.6438.
[29] N. Reimers, B. Schiller, T. Beck, J. Daxenberger, C. Stab, I. Gurevych, Classification and clustering of arguments with contextualized word embeddings, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Florence, Italy, 2019, pp. 567–578. URL: https://arxiv.org/abs/1906.09821.
[30] A. Lauscher, G. Glavaš, K. Eckert, ArguminSci: A tool for analyzing argumentation and rhetorical aspects in scientific writing, Proceedings of the 5th Workshop on Argument Mining (2018) 22–28. doi:10.18653/v1/w18-5203.
[31] M. Lippi, P. Torroni, MARGOT: A web server for argumentation mining, Expert Systems with Applications 65 (2016) 292–303. URL: http://dx.doi.org/10.1016/j.eswa.2016.08.050. doi:10.1016/j.eswa.2016.08.050.
[32] M. Collins, N. Duffy, New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 263–270.
[33] P. Accuosto, M. Neves, H. Saggion, Argumentation mining in scientific literature: From computational linguistics to biomedicine, CEUR Workshop Proceedings (2021) 20–36.
[34] Q. V. H. Nguyen, C. T. Duong, T. T. Nguyen, M. Weidlich, K. Aberer, H. Yin, X. Zhou, Argument discovery via crowdsourcing, VLDB Journal 26 (2017) 511–535. doi:10.1007/s00778-017-0462-9.