Argumentation mining in scientific literature: From computational linguistics to biomedicine

Pablo Accuosto (a), Mariana Neves (b) and Horacio Saggion (a)
(a) LaSTUS/TALN Group, Universitat Pompeu Fabra, Spain, {name.surname}@upf.edu
(b) German Federal Institute for Risk Assessment (BfR), Germany, mariana.lara-neves@bfr.bund.de

BIR 2021: 11th International Workshop on Bibliometric-enhanced Information Retrieval at ECIR 2021, April 1, 2021.

Abstract
In this work we propose to tackle the limitations posed by the lack of annotated data for argument mining in scientific texts by annotating argumentative units and relations in research abstracts in two scientific domains. We evaluate our annotations by computing inter-annotator agreements, which range from moderate to substantial according to the difficulty level of the tasks and domains. We use our newly annotated corpus to fine-tune BERT-based models for argument mining in single and multi-task settings, finally exploring the adaptation of models trained in one scientific discipline (computational linguistics) to predict the argumentative structure of abstracts in a different one (biomedicine).

Keywords
argument mining, scientific corpora, domain adaptation, transformer models

1. Introduction

The accelerated pace at which scientific knowledge is produced makes its discovery and assessment a challenging task. Natural language processing (NLP) technologies in general, and text-mining tools in particular, have become increasingly essential to identify and characterize the most relevant information produced in a given scientific discipline. In order to assess a research article it is necessary to consider its logic, rhetoric and dialectic quality dimensions [1]. It is therefore not enough to identify the claims made by its authors; the evidence that they provide to support them must also be identified. NLP tools that help to identify the main argumentative elements of a given text, and how they are connected to each other, can support the assessment of a given article. The automatic identification of arguments, their components and relations in texts is known as argument mining or argumentation mining [2]. The tasks involved in the automatic extraction of arguments from texts (claim/premise identification, prediction of argumentative structure) are not substantially different from other text mining tasks for which neural-based supervised learning methods produce state-of-the-art results (e.g., text segmentation, sequence labelling and entity linking) [3]. These approaches, however, rely on large volumes of annotated data, which are difficult to obtain for complex tasks such as argument mining. Scarcity of annotated corpora, therefore, limits the possibilities of using supervised machine learning algorithms for the identification of argumentative units and relations in texts [4]. This obstacle is greater when dealing with scientific discourse: the inherent complexity of scientific texts makes it very difficult to carry out large annotation efforts with lay annotators.
1.1. Contributions

In previous works [5, 6] we proposed an annotation scheme for argumentative units and relations that considered the specificities of scientific discourse, and conducted a pilot annotation experiment with 60 abstracts from papers in the computational linguistics domain. These pilot annotations were done by a single annotator, as they were intended: i) to analyze the possibility of leveraging information contained in discourse-level annotations in order to improve the performance of argument mining models trained with a small number of abstracts and, ii) to explore the potential value of the trained models to predict the acceptance/rejection of the manuscripts in computational linguistics conferences, which was considered as a proxy for argumentative quality aspects of the abstracts. In those pilot experiments we trained BiLSTM models with CRF classifiers on top and used contextualized word embeddings obtained by means of pre-trained ELMo encoders [7]. In this work we:

1. Refine our previous annotation scheme to better account for the argumentative structure of scientific abstracts and to simplify the annotation process.
2. Make available SciARG, a corpus obtained by applying our new scheme to the annotation of 510 scientific abstracts in two domains: computational linguistics (CL) and biomedicine. Three annotators participated in the annotation process in the CL domain, while two annotators were involved in the annotation of biomedical abstracts.
3. Assess the consistency of the SciARG annotations by analyzing inter-annotator agreement.
4. Use the SciARG corpus to fine-tune and evaluate BERT-based argument mining models, both in single and multi-task settings.
5. Analyze the potential of adapting models trained with CL abstracts (the original discipline for which the annotation scheme was developed) to the biomedical domain.

The SciARG corpus and the code used in the experiments described in this work are made publicly available as a contribution to the research community (SciARG is available at https://github.com/LaSTUS-TALN-UPF/SciARG).

The rest of the paper is organized as follows: in Section 2 we describe previous work aimed at identifying arguments in scientific texts. In Section 3 we describe the data used to generate the corpus, our proposed annotation scheme and the annotation process. In Section 4 we describe the experiments conducted with the generated corpus and, in Section 5, we analyze the results obtained. In Section 6 we present our conclusions and suggest potential follow-ups.

2. Related Work

The inherent complexity and ambiguity of scientific language makes the identification of arguments in scientific texts a particularly challenging task [8, 9, 10]. The Argumentative Zoning (AZ) model [11, 12] and the CoreSC scheme [13, 14] provide relevant antecedents in this area. AZ includes categories used to annotate knowledge claims made by the authors of the papers and to establish connections with previous works. CoreSC, in turn, provides a readable representation of the research process described by the paper. Differences and similarities between the two schemes are studied in [15]. AZ was originally applied to the annotation of computational linguistics texts and CoreSC to physical chemistry and biochemistry articles. Dernoncourt and Lee [16] released in 2017 the PubMed 200k RCT dataset as a resource to train sentence classifiers for unstructured abstracts.
The dataset was constructed by retrieving 195,654 structured abstracts of randomized controlled trials from the 2016 MEDLINE/PubMed Baseline Database and labelling each sentence with the name of the section it belongs to (the MEDLINE database of life sciences and biomedical information, www.nlm.nih.gov/bsd/medline.html, is maintained by the U.S. National Library of Medicine and available through the PubMed search engine, pubmed.ncbi.nlm.nih.gov). It is relevant to note that the aforementioned corpora and datasets are aimed at classifying the rhetorical role of sentences, but not at the discourse relations between them. In this work we intend to establish a bridge between these two annotation levels.

Lawrence and Reed [2] and Lippi and Torroni [3] provide thorough analyses of argument mining initiatives in various types of texts and domains, including legal documents [17], online discussions [18], Wikipedia articles [19], newspapers [20], student essays [21] and television debates [22], while Habernal and Gurevych [23] and Schulz et al. [24] explore argument mining in collections of texts from multiple, diverse sources. Few annotation efforts have focused on the analysis of arguments in scientific articles when compared to the number of works aimed at identifying argumentative components and relations in other textual genres. The annotation of 24 German scientific articles in the educational domain by Kirschner et al. [9] is one of the first works intended for the analysis of the whole argumentative structure of scientific texts, considering not only argumentative components but also how they are linked to each other. Lauscher et al. [25, 26] carried out experiments in which they enriched, with an argumentation layer, 40 papers in the area of computer graphics included in the DrInventor Scientific Corpus [27]. As mentioned in Section 1, we have previously conducted experiments with 60 computational linguistics abstracts aimed at analyzing the potential benefits obtained by enriching argument mining models with discourse-level knowledge [6].

3. SciARG Corpus

In this section we describe the source data used as a basis for the SciARG corpus, as well as the annotation scheme that we propose. We describe the annotation process and assess the quality of the produced annotations by considering inter-annotator agreement measures.

3.1. Data

The SciARG corpus covers two knowledge areas: computational linguistics and biomedicine. We refer to these sub-corpora as CL and BIO, respectively.

• CL corpus. Includes 225 computational linguistics abstracts from the ACL Anthology [28], in particular from the Proceedings of the 2014 Conference on Empirical Methods in NLP (EMNLP). These abstracts are a subset of the 798 abstracts annotated with discourse relations in the Discourse Dependency TreeBank for Scientific Abstracts (SciDTB) [29], which allows us to continue exploring the interaction between argumentative, rhetoric and discourse annotation levels in scientific abstracts, as originally proposed by Peldszus and Stede [30, 31] for other textual genres.
• BIO corpus. Includes 285 biomedical abstracts of articles from MEDLINE/PubMed. These abstracts are a sample of those used by Neves et al. [32] for the evaluation of argumentation in the biomedical domain. The sample was selected in a stratified way in order to include all annotation types considered in that work.

3.2. Annotation scheme

In this work we focus on the analysis of the way in which authors logically structure information in abstracts to persuade potential readers of the relevance and validity of their proposals.
Our annotation scheme is aimed at capturing the underlying argumentative structure departing from its linguistic realization. It is therefore relevant to consider previous works that characterize the different constituent elements of scientific abstracts. Several works have been dedicated to the study, from a genre analysis perspective, of the rhetorical structure of scientific articles and their parts [33, 34]. Based on these works, a broad categorization of the most frequent rhetorical moves in scientific abstracts can be considered: i) contextualization of the research topic; ii) limitations in existing solutions; iii) main purpose of the current work; iv) description of the methodology; v) summary of the main results; vi) conclusions (minor variations to this general structure depend on the knowledge area). Based on this general structure of scientific abstracts we propose a fine-grained scheme that considers the sentence as the annotation unit and contains 11 types of units (Table 1) and six types of directed relations (Table 2); we omit the attack relation, as no attacks were identified in any of the abstracts analyzed. Each of the unit types can, in turn, be mapped to a coarse-grained category. The use of fine or coarse-grained types depends on the specific usage of the corpus; in the context of this work we use the fine-grained types.

Table 1: Fine and coarse-grained types of units

Type of unit              Description                                                 Coarse
proposal                  high-level description of the proposed approach/solution    proposal
proposal-implementation   processes/tools/methods that are part of the proposal       proposal
observation               data obtained from experiments                              outcomes
result                    direct interpretation of observed data                      outcomes
result-means              results and the means by which they were obtained           outcomes
conclusion                high-level interpretation/generalization of results         outcomes
means                     secondary methods/processes not part of the proposal        methods
motivation-problem        known problem/limitation addressed by the proposal          motivation
motivation-hypothesis     new ideas/paths for known problems/limitations              motivation
motivation-background     known information to support the proposed approach          motivation
information-additional    additional information (definitions/examples)               other

Table 2: Relations

Relation        Description of the child node function
support         provides new supporting information/evidence for the parent
elaboration     provides additional information relevant to assess/contextualize the parent
by-means        describes methods through which supporting evidence is obtained
info-required   provides information essential to understand/contextualize the parent
sequence        describes a step that comes after the step described by the parent in a process
info-optional   provides non-essential information

An annotated abstract can be seen as a directed graph with the sentences as its nodes and the relations between them as the edges. In order to gain uniformity in the annotations, reduce the level of ambiguity and simplify the annotation process, we only consider trees as valid annotation graphs. In addition to types and relations, annotators were asked to identify the unit that describes the most significant contribution of the work as the main unit.
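To make these constraints concrete, the following minimal Python sketch shows one possible in-memory representation of an annotated abstract and a check of the tree requirement. The Unit structure and its field names are illustrative assumptions made for this sketch and do not correspond to the file format released with SciARG.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Unit:
    position: int                   # 1-based position of the sentence in the abstract
    unit_type: str                  # one of the 11 fine-grained types in Table 1
    parent: Optional[int] = None    # position of the parent sentence; None for the root
    relation: Optional[str] = None  # one of the 6 relations in Table 2; None for the root
    is_main: bool = False           # marks the main unit

def is_valid_tree(units: list[Unit]) -> bool:
    """A valid annotation has exactly one root, every parent is an existing
    sentence, and walking up from any node reaches the root without cycles."""
    by_pos = {u.position: u for u in units}
    if sum(u.parent is None for u in units) != 1:
        return False
    for u in units:
        seen, cur = set(), u
        while cur.parent is not None:
            if cur.parent not in by_pos or cur.position in seen:
                return False
            seen.add(cur.position)
            cur = by_pos[cur.parent]
    return True
```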
Fig. 1 shows an example of the tree resulting from annotating the abstract from [35]. In previous work we considered argumentative units at the sub-sentence level and explored the relation between the discourse and argumentative levels [6]. Having observed that discourse-level annotations can be leveraged to identify argumentative relations within sentences, we decided, in this work, to focus on the sentence level and leave the prediction of intra-sentence relations as a second step in an argument mining pipeline. This facilitates the annotation process and also contributes to bridging the gap between annotations aimed at identifying the rhetorical role of sentences (such as AZ and CoreSC) and those aimed at finding discourse relations between, and within, them (such as SciDTB).

Most sentences in computational linguistics abstracts contain one type of argumentative unit. A relatively frequent exception are sentences in which mentions of methods are included in the results. For this specific case we introduce in our scheme the unit type result-means (examples for all types of units are included in the supplemental material). For other cases in which more than one type of unit can be identified, our scheme allows annotators to register this information by assigning a second type to the sentence. In the annotation process annotators were asked to weigh the relevance of the different types of information contained in the sentence to make a decision with respect to the main and secondary types.

It is frequent to find in abstracts that authors build up supporting evidence or justifications for implicit or explicit claims across more than one sentence. Consider the example in Fig. 1. Nodes (4) (motivation-problem) and (7) (motivation-background) provide partial information that, when considered together, contributes to justifying the proposed work described in node (2). From a discourse analysis perspective, this would be represented by a multi-nuclear relation, which could be annotated by introducing a different type of node in the argumentative tree. This, on the one hand, introduces some practical difficulties in the automatic processing of the annotations, as will become evident when we describe the experiments in Section 4, and, on the other hand, does not capture the hierarchical relation between nodes (4) and (7). We opt, instead, to introduce the relation info-required to account for these cases. In this example, we indicate that there is an info-required relation that goes from node (7) to node (4) and a support relation that goes from node (4) to node (2). When looking for supporting evidence for the sentence in node (2), therefore, we would consider not only its direct children but also the chains of sentences below them linked by info-required relations.
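A hedged sketch of this evidence-gathering procedure is given below: the supporting evidence for a unit combines its direct support children with the chains of units attached to those children through info-required relations. The encoding of the tree as a child-to-(parent, relation) mapping and the function name are our own illustrative choices, not code from the released repository.

```python
def supporting_evidence(edges: dict[int, tuple[int, str]], target: int) -> list[int]:
    """edges maps a sentence position to (parent position, relation type).
    Returns the positions providing supporting evidence for `target`: its
    direct children attached with a support relation, expanded with the
    chains of units linked to those children through info-required."""
    children: dict[int, list[tuple[int, str]]] = {}
    for child, (parent, relation) in edges.items():
        children.setdefault(parent, []).append((child, relation))

    def expand(pos: int) -> list[int]:
        out = []
        for child, relation in children.get(pos, []):
            if relation == "info-required":
                out.append(child)
                out.extend(expand(child))
        return out

    evidence = []
    for child, relation in children.get(target, []):
        if relation == "support":
            evidence.append(child)
            evidence.extend(expand(child))
    return evidence

# Encoding of the relevant part of Fig. 1: node 4 supports node 2, and
# node 7 is linked to node 4 by info-required.
edges = {4: (2, "support"), 7: (4, "info-required")}
print(supporting_evidence(edges, 2))   # -> [4, 7]
```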
Depending on the specific dimensions of argumentation quality to analyze, different subsets of units and relations can therefore be considered. Most argument mining works focus on logic aspects of argumentation and, in particular, on the arguments' cogency [1]. This argumentative dimension is conveyed by relations of type support (or attack). It has been noted, nevertheless, that the different argumentative dimensions correlate with the perceived overall argumentative quality of the text [1, 36]. If we considered only support relations, our annotations would not capture the link between a proposal and its implementation details, a link that improves the text's clarity, helps persuade the reader of the validity of the proposal and, therefore, contributes to the perceived overall argumentative strength of the text.

Figure 1: Example of an argumentative tree. The main unit corresponds to the node with the grey background.

3.3. Annotation process

The annotation guidelines used in this work are made available online at https://github.com/LaSTUS-TALN-UPF/SciARG/blob/main/Annotation_Guidelines_Arguments_SciDTB.pdf. The first step of the annotation process is to have the texts of the abstracts split into sentences. In the case of the CL corpus, the source files are already segmented into sentences and elementary discourse units in the SciDTB corpus. For the biomedical abstracts the sentence segmentation is done by means of the syntok tool (github.com/fnl/syntok); a brief usage sketch is given at the end of this subsection. The annotation was done by means of a version of GraPAT (Graph-based Potsdam Annotation Tool) [37] modified according to the specific needs of the task. We first developed and adjusted the annotation scheme with the CL corpus and then used it to annotate the BIO corpus. One of the goals of this work is to assess the applicability of the proposed scheme to different domains. In Section 3.5 we analyze the main differences between both corpora and the resulting annotations. The annotation of the CL corpus was done by three expert annotators, a1, a2, a3 (two NLP researchers and one computational linguist), in three rounds. The first two rounds were aimed at training the annotators, clarifying doubts and making the necessary adjustments to the annotation scheme and tool. As a result of the whole process 225 CL abstracts were annotated, with 30 abstracts annotated by all three annotators in order to compute inter-annotator agreement. For the BIO corpus only two of the CL annotators (a1, a2) could participate in the annotation process. In this case no training phase was needed and there were no substantial modifications to the annotation tool or scheme. As a result of this process 285 abstracts were annotated, of which 50 were annotated by both annotators. For our experiments we split both sub-corpora into training and test sets, as described in Section 5.1.
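The fragment below sketches the syntok-based sentence splitting mentioned above. It follows syntok's documented segmenter interface as we understand it; the exact API and token attributes should be verified against the installed version, and the wrapper function is our own.

```python
import syntok.segmenter as segmenter

def split_sentences(abstract: str) -> list[str]:
    """Split an abstract into sentences; tokens carry their original spacing."""
    sentences = []
    for paragraph in segmenter.process(abstract):
        for sentence in paragraph:
            sentences.append("".join(t.spacing + t.value for t in sentence).strip())
    return sentences

print(split_sentences("We annotate abstracts. We then train BERT-based models."))
```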
3.4. Agreement

In this section we assess the reliability of our annotations by considering inter-annotator agreement. Table 3 shows the agreements obtained for the CL and BIO sub-corpora. In the case of CL, where we have three annotators, we report the average of the pairwise agreements with their corresponding standard deviations. In order to compute the agreements we consider exact matches between pairs of labels assigned by two annotators. For the parent attachment task the label is the absolute position of the parent sentence in the document. In addition to each sub-task-specific agreement, we report the agreement observed when considering simultaneous exact matches for all the tasks. It is relevant to note that annotations within one document cannot be considered completely independent from each other, which presents a limitation when interpreting the significance of Cohen's κ coefficient, as a decision made at one node of the argumentative structure affects decisions made at other nodes; this problem was already observed by Marcu et al. [38] when evaluating inter-annotator agreement of discourse annotations. In Table 3, therefore, in addition to Cohen's κ, we directly report the accuracy obtained for the different tasks, without any presupposition with respect to the independence of the annotations. When considering a pair of annotated documents as labeled trees, the accuracy reflects the number of changes (relative to the number of nodes) that would be necessary in one tree to obtain the other. It can, therefore, be interpreted as an edit-distance measure that estimates the degree of agreement between two annotators with respect to the argumentative roles of the nodes when considering the document as a whole.

In all cases annotation agreements fall between moderate and substantial levels: substantial agreements are obtained in general in the CL corpus (and almost perfect agreement when coarse-grained types are considered), while agreements in the identification of unit types and relations are lower in the BIO corpus. In addition to the fact that the annotation scheme was designed and adjusted specifically for the CL domain, lower agreements are expected in BIO, as its abstracts have a higher level of complexity than the CL ones in terms of their structure, the number of units that they contain and their lengths, as shown in Section 3.5. It is also relevant to note that the annotators have a high level of familiarity with CL texts, while they are not experts in the BIO domain. When analyzing discrepancies in the annotation of the BIO corpus we observe that units of types observation and result give rise to systematic disagreements between annotators a1 and a2. In fact, annotator a2 annotated as observation 64% of the units that annotator a1 annotated as result, which makes us believe that a clear distinction between these two types is difficult to establish without specific domain knowledge. When coarse-grained types are considered, in fact, these two types of units are not distinguished and the level of agreement reaches 0.93 Cohen's κ. This also affects the attachment of these units to their parents, as they are also treated differently in terms of the argumentative role that they play.

Table 3: Agreement in the CL and BIO sub-corpora

                            Cohen's κ                          Accuracy
Task                        CL (avg. pairwise)   BIO           CL (avg. pairwise)   BIO
Fine-grained unit type      0.77 ± 0.004         0.66          0.81 ± 0.004         0.72
Coarse-grained unit type    0.94 ± 0.016         0.93          0.96 ± 0.010         0.96
Parent position             0.72 ± 0.075         0.49          0.77 ± 0.062         0.54
Relation type               0.79 ± 0.026         0.43          0.84 ± 0.019         0.58
Main unit                   0.92 ± 0.042         0.94          0.97 ± 0.013         0.99
All combined                0.59 ± 0.055         0.39          0.61 ± 0.053         0.40
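As an illustration of how the per-task figures in Table 3 can be computed, the sketch below derives Cohen's κ and raw accuracy from two annotators' labels for the same sentences (fine-grained unit types, relation types, parent positions or main-unit flags) with scikit-learn; the label values in the example are hypothetical.

```python
from sklearn.metrics import cohen_kappa_score, accuracy_score

def pairwise_agreement(labels_a, labels_b):
    """labels_a and labels_b are the labels assigned by two annotators to the
    same sequence of sentences for one of the sub-tasks."""
    assert len(labels_a) == len(labels_b)
    return cohen_kappa_score(labels_a, labels_b), accuracy_score(labels_a, labels_b)

# Hypothetical fine-grained unit types assigned by two annotators to one abstract:
a1 = ["motivation-background", "motivation-problem", "proposal", "result", "conclusion"]
a2 = ["motivation-background", "motivation-problem", "proposal", "observation", "conclusion"]
kappa, acc = pairwise_agreement(a1, a2)   # acc = 0.8; the annotators differ on one unit
```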
3.5. Corpus statistics and analysis

Substantial differences can be observed between the CL and BIO sub-corpora. Abstracts in BIO are, in general, longer and argumentatively more complex than those in CL (Table 4). It is frequent in BIO to find abstracts that describe a series of experiments, each with its own results. In some cases, results from one experiment are used to motivate and/or justify new ones. This level of detail is not present in CL abstracts. In general, the description of research outcomes and their interpretation is much more complex in BIO abstracts, which leads to a significant difference in the number of units of types observation, result and conclusion when compared to CL abstracts: while in CL 3% of the units are of type observation, 19% of type result or result-means and 4% of type conclusion, in BIO there are, respectively, 18%, 26% and 11% units of these types (more details are provided in the supplemental material).

Table 4: Statistics of the CL and BIO sub-corpora

Statistic                  CL            BIO
Number of abstracts        225           285
Total number of units      1199          2787
Avg. #units/abstract       5.3 ± 1.7     9.8 ± 3.1
Max. #units/abstract       13            25
Min. #units/abstract       2             2
Avg. #tokens/unit          24.4 ± 9.9    30.1 ± 14.2
Max. #tokens/unit          101           155
Min. #tokens/unit          5             5
Forward relations          32%           34%
Backward relations         68%           66%

The distinction between the plain report of observed data, the interpretation of results and the extraction of conclusions from them is more ambiguous in BIO than in CL and, therefore, differentiating these types of units is more difficult. The distances between units and their parents are also greater in BIO: nearly 19% of the time a unit is five or more units away from its parent, while in CL this occurs in only 2% of the cases. Conversely, 69% of CL units are only one or two units away from their parent, compared to 58% in BIO. In both domains backward relations are more frequent than forward relations: the parent occurs before the child 68% and 66% of the time in CL and BIO, respectively.

4. Experiments

In this section we describe the experiments carried out in order to, given a scientific abstract, predict the nodes and relations needed to represent its argumentative structure.

4.1. Tasks

• Unit type: Given the text of a sentence, predict its type. The class to predict in this case is one of the 11 fine-grained types described in Table 1.
• Relation direction: Given two sentences, predict whether a forward or backward relation exists between them (i.e., whether the first unit is a child of the second one in the argumentative tree or vice versa). We model this as a three-class classification task where, given two sentences, the possible classes to predict are forw, back or none, indicating, respectively, that there is a directed relation from the first to the second sentence, from the second to the first sentence, or that the two sentences are not related (see the sketch at the end of this subsection).
• Relation type: Given a sentence, predict the label of the relation with its parent (its argumentative and/or discourse function), or none for the root node. The class to predict in this case is one of the six relations described in Table 2.
• Main unit: Given the text of a sentence, predict whether it is the main unit, i.e., the unit where the main proposed approach/solution is described.

There are clear links between the four tasks. For instance, the main unit is, in most cases, the root of the argumentative tree. Associations can also be established between a unit's type and its function: in the most frequent case, units of type result are used to support units of type proposal or conclusion. It is therefore natural to explore the possibility of training the tasks jointly, in a multi-task setting, which we compare to the results obtained when training the tasks independently, in single-task settings.
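The sketch referenced in the relation-direction item above shows one way of deriving labelled sentence-pair instances from an annotated tree, including the duplication of related pairs in both directions used at training time (described in Section 4.2). Function and variable names are illustrative; this reflects our reading of the procedure, not the released code.

```python
from typing import Optional

def direction_instances(sentences: list[str],
                        parents: list[Optional[int]],
                        training: bool = False) -> list[tuple[str, str, str]]:
    """Build (first sentence, second sentence, label) instances for the
    relation-direction task. parents[i] is the index of the parent of
    sentence i in the argumentative tree, or None for the root."""
    instances = []
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if parents[i] == j:        # first sentence is a child of the second
                label = "forw"
            elif parents[j] == i:      # second sentence is a child of the first
                label = "back"
            else:
                label = "none"
            instances.append((sentences[i], sentences[j], label))
            # At training time, related pairs are also added in the reversed
            # order with the opposite label (Section 4.2).
            if training and label != "none":
                flipped = "back" if label == "forw" else "forw"
                instances.append((sentences[j], sentences[i], flipped))
    return instances
```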
4.2. Experimental setup

Transformer-based encoders [39] such as BERT [40] currently provide state-of-the-art performance for semantic text classification tasks. For our experiments we make use of the BERT implementations available as part of HuggingFace's Transformers library [41]. We use the cased version of SciBERT [42] as the base model, as it was trained on texts from the same domains as those covered by our corpus. We apply the standard method of taking the representation of the [CLS] token and feeding it into linear classifiers. A softmax function is then applied to the classifier's output in order to obtain the distribution of probabilities over the predicted labels. In the multi-task setting the BERT layers are shared among all the tasks and only the task-specific heads are trained independently. We follow the common practice of modeling the identification of relations between pairs of sentences as a classification task, using as input the sequence obtained by concatenating the tokens occurring in each sentence, separated by the [SEP] special token. In order to predict the relations present in a given abstract we consider all the pairs formed by a sentence and the sentences that occur after it in the text. As a result of the prediction we should obtain the label forw if the first sentence is a child of the second one in the argumentative tree, back if the relation is established in the opposite direction, and none if the two sentences are not in a direct relation. The most frequent case, given any two sentences, is that they are not related. In order to train the model with more positive examples, when a relation exists between two sentences we sample it twice in the training set, once for each direction, with the corresponding forw/back labels (i.e., if sentence s2 is a child of sentence s1 in the argumentative tree, we include the instances (s2, s1, forw) and (s1, s2, back) in the training set). For evaluation we consider each pair only once, in the order in which the sentences appear in the text.

When fine-tuning our models in each domain, we consider the median number of tokens in the input sequences and set the maximum sequence length to twice that value. We use cross-entropy as the loss function. We use the Adam optimizer with a learning rate of 2e-5 and a warm-up period of 10% of the learning steps. We set a dropout probability of 0.1 for multi-task settings and 0.2 for single-task ones. The batch size is 16 instances, with gradient accumulation over 2 batches. These hyperparameters were set based on five-fold cross-validation in the training set. While the general recommendation is to fine-tune BERT for 2 to 4 epochs [40], we observed that more epochs were required for our tasks, considering the large number of classes and the relatively small number of training instances (while there are 3 classes and 13,874 training instances for the BIO/relation type task, we only have 1,049 training instances and 11 classes for CL/unit type). In Section 5 we report the results obtained for each task after 5, 10 and 15 epochs, so it is possible to observe how each combination of task and domain impacts the training time required, both in single and multi-task settings, which would be more difficult to observe if we considered either the best cross-validation epoch or a fixed number of epochs, as is frequently done when using BERT. We also observed that the models' overall performance improved when including, as additional tokens, information about the sentences' positions in the abstract as well as their relative distance and order. We add special tokens to the standard BERT tokenizer to represent this information (e.g.: "[CLS] [AFTER] [DISTANCE-1] [POS-1] This paper presents ... [SEP] We observe ...").
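The following condensed sketch puts the pieces of this setup together with HuggingFace Transformers: the cased SciBERT encoder, a linear classifier over the [CLS] representation, sentence pairs joined by [SEP], the additional position/distance tokens, and the stated optimizer and learning-rate schedule. Anything not stated above (the exact special-token inventory, the number of training steps, and the freeze_encoder flag used to mimic the frozen-weights adaptation runs of Section 5) is an assumption, and the training loop with batch size 16 and gradient accumulation over 2 batches is omitted for brevity.

```python
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModel, get_linear_schedule_with_warmup

MODEL_NAME = "allenai/scibert_scivocab_cased"   # cased SciBERT
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Special tokens for sentence position, order and distance; only the example
# "[AFTER] [DISTANCE-1] [POS-1]" is given above, so this inventory is illustrative.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[AFTER]", "[BEFORE]", "[DISTANCE-1]", "[POS-1]"]})

class SentenceClassifier(nn.Module):
    """Single-task head: a linear classifier over the [CLS] representation."""
    def __init__(self, num_labels, dropout=0.2, freeze_encoder=False):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(MODEL_NAME)
        self.encoder.resize_token_embeddings(len(tokenizer))
        if freeze_encoder:  # mimics the frozen-weights adaptation runs
            for p in self.encoder.parameters():
                p.requires_grad = False
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # [CLS] token representation
        return self.classifier(self.dropout(cls))   # cross-entropy over these logits
                                                    # corresponds to the softmax above

model = SentenceClassifier(num_labels=11)       # e.g., the 11 fine-grained unit types
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
num_steps = 1000                                # placeholder for the real number of steps
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=int(0.1 * num_steps), num_training_steps=num_steps)

# A sentence pair for the relation-direction task, encoded as "s1 [SEP] s2",
# with the position/distance/order tokens prepended to the first sentence.
enc = tokenizer("[AFTER] [DISTANCE-1] [POS-1] This paper presents a new parser .",
                "We observe clear gains .",
                truncation=True, padding="max_length", max_length=128,
                return_tensors="pt")
logits = model(enc["input_ids"], enc["attention_mask"])
```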
As mentioned in Sections 1.1 and 3, our annotation scheme was specifically developed to account for argumentative types and relations in computational linguistics abstracts. One of the goals of this work is to explore i) the applicability of this scheme to other scientific disciplines and ii) whether models trained in the CL domain can be easily adapted to predict the argumentative structure of abstracts in other scientific areas, in particular the BIO domain. We use the newly annotated set of biomedical abstracts to address both research questions. We are also interested in exploring to what extent models trained with annotations in CL contain task-specific information that can be exploited to predict argumentative types and relations in scientific abstracts with a more complex structure and in another discipline. We therefore analyze the results obtained by keeping the weights of a model fine-tuned with the CL abstracts fixed and only training a linear classifier on top of it with the BIO abstracts.

5. Results

We present in this section the performance of the models trained in each domain (BIO, CL), as well as the adaptation of CL models to BIO by training a small number of additional parameters.

5.1. Evaluation

In the CL domain 30 abstracts were annotated in common by the three annotators (a1, a2, a3). This set is used for evaluation, while the remaining 195 abstracts are used to train the models. We generate a set of consensus annotations by assigning, to each instance, the majority label among the annotations by a1, a2 and a3. In the few cases in which there is total discrepancy among the three annotators, we keep the label assigned by the annotator with the highest average pairwise agreement with the other two annotators. For BIO it is not possible to do this, since we only have two annotators (a1, a2) who annotated 50 abstracts in common, while the remaining 235 abstracts were annotated only by a1. Therefore, in these experiments we only use the annotations produced by a1. As test set, in this case, we consider the subset of 35 abstracts annotated by a1 with the highest levels of agreement with the annotations made by a2 (35 is chosen in order to keep the training-test proportions similar in both domains), keeping the remaining 250 abstracts annotated by a1 as the training set.

In Table 5 we report the results obtained by the set of experiments described in Section 4 when evaluating against the consensus annotations. For the types of units and relations we use weighted-averaged F1-scores, as we want to consider the contribution of each label to the results in proportion to its frequency. For the evaluation of the direction of the relations, instead, we use macro-averaged scores, which are more sensitive to the minority classes. If we were to use micro-averaged or weighted-averaged scores in these cases we would obtain misleadingly high F1 values, given the large proportion of none labels that are correctly classified. This is also the case for the prediction of the main unit.
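A small sketch of this scoring choice, assuming per-instance gold and predicted labels and scikit-learn's f1_score; the task names are placeholders for however the evaluation script identifies each task.

```python
from sklearn.metrics import f1_score

def score_task(gold, pred, task):
    """Weighted F1 for unit/relation types (labels weighted by frequency);
    macro F1 for relation direction and main-unit detection, where the
    dominant none / non-main labels would otherwise inflate the score."""
    average = "weighted" if task in {"unit_type", "relation_type"} else "macro"
    return f1_score(gold, pred, average=average)
```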
As expected, considering the greater argumentative complexity of the BIO abstracts, which is also reflected in the lower levels of inter-annotator agreement, the performance of the models trained and evaluated with the BIO annotations is lower than that obtained with the CL annotations (Table 5). In the BIO domain the models trained jointly in a multi-task setting tend to perform better than those in which the tasks are trained independently. In the case of CL the difference between the two settings is less evident: while there is a clear advantage of the multi-task setting in the prediction of the types of units, better results are obtained for the prediction of the parent relations in a single-task setting. We also observe that the BERT models fine-tuned with the CL annotations (CL-BERT) perform competitively, without any in-domain fine-tuning, when compared to the models in which BERT is fine-tuned with the BIO annotations. It is relevant to note that the CL-BERT model with frozen weights performs significantly better in the prediction of the BIO annotations than the frozen SciBERT encoder that we consider as a baseline. This confirms that the model fine-tuned with CL annotations is able to capture information about the argumentative structure of scientific abstracts that is independent of the specific discipline on which it was trained.

Table 5: Evaluation of BERT-based models on the CL and BIO consensus test sets (F1-scores)

CL-BERT Multi-task - CL Test Ann.                      CL-BERT Single-task - CL Test Ann.
Ep.   R. Dir.   R. Type   U. Type   Main               R. Dir.   R. Type   U. Type   Main
5     0.8221    0.7629    0.7959    0.9076             0.8478    0.7894    0.7659    0.9263
10    0.8380    0.7945    0.7897    0.9146             0.8456    0.7874    0.7991    0.9376
15    0.8262    0.8216    0.8259    0.9243             0.8263    0.8159    0.7801    0.9281

BIO-BERT Multi-task - BIO Test Ann.                    BIO-BERT Single-task - BIO Test Ann.
Ep.   R. Dir.   R. Type   U. Type   Main               R. Dir.   R. Type   U. Type   Main
5     0.7046    0.7425    0.6375    0.9120             0.6953    0.7170    0.6797    0.8543
10    0.6973    0.7416    0.6717    0.9164             0.6612    0.7221    0.6593    0.8676
15    0.6929    0.7394    0.6951    0.9249             0.6993    0.7208    0.6738    0.8676

CL-BERT Frozen weights - BIO Test Ann.                 SciBERT Frozen weights - BIO Test Ann.
Ep.   R. Dir.   R. Type   U. Type   Main               R. Dir.   R. Type   U. Type   Main
5     0.6928    0.7239    0.6207    0.8769             0.5162    0.5722    0.4934    0.8005
10    0.7075    0.7269    0.6604    0.8738             0.5408    0.6331    0.5364    0.8398
15    0.7080    0.7220    0.6588    0.8738             0.5575    0.6305    0.5411    0.8441

6. Conclusions

In this work we propose a new sentence-level annotation scheme for the identification of argumentative units and relations in scientific abstracts, which we apply to the annotation of 510 documents in two highly specialized domains: computational linguistics and biomedicine. The resulting corpus, as well as the code used to train and evaluate the models, is made publicly available. The results obtained in our experiments encourage us to think that, in spite of the fact that the annotation scheme was originally developed and refined for the CL domain, it can be successfully applied to other scientific disciplines. This work also opens up new research paths, including further exploration of domain adaptation techniques for argument mining models in challenging domains such as scientific articles.
Acknowledgments

This work was (partly) supported by the Spanish Government under the María de Maeztu Units of Excellence Programme (MDM-2015-0502) and by the Research and Innovation Agency of Uruguay (ANII). We also acknowledge support from the project Context-aware Multilingual Text Simplification (ConMuTeS) PID2019-109066GB-I00/AEI/10.13039/501100011033 awarded by Ministerio de Ciencia, Innovación y Universidades (MCIU) and by Agencia Estatal de Investigación (AEI) of Spain.

References

[1] H. Wachsmuth, N. Naderi, Y. Hou, Y. Bilu, V. Prabhakaran, T. A. Thijm, G. Hirst, B. Stein, Computational argumentation quality assessment in natural language, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 176–187.
[2] J. Lawrence, C. Reed, Argument mining: A survey, Computational Linguistics (2019) 1–54.
[3] M. Lippi, P. Torroni, Argumentation mining: State of the art and emerging trends, ACM Trans. Internet Technol. 16 (2016) 10:1–10:25.
[4] C. Stab, I. Gurevych, Parsing argumentation structures in persuasive essays, Computational Linguistics 43 (2017) 619–659.
[5] P. Accuosto, H. Saggion, Mining arguments in scientific abstracts with discourse-level embeddings, Data & Knowledge Engineering (2020) 101840.
[6] P. Accuosto, H. Saggion, Transferring knowledge from discourse to arguments: A case study with scientific abstracts, in: Proceedings of the 6th Workshop on Argument Mining (ArgMining 2019), Association for Computational Linguistics, Florence, Italy, 2019, pp. 41–51. doi:10.18653/v1/W19-4505.
[7] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 2227–2237. URL: https://www.aclweb.org/anthology/N18-1202. doi:10.18653/v1/N18-1202.
[8] C. Stab, C. Kirschner, J. Eckle-Kohler, I. Gurevych, Argumentation mining in persuasive essays and scientific articles from the discourse structure perspective, in: Proceedings of the Workshop on Frontiers and Connections between Argumentation Theory and Natural Language Processing, Forlì-Cesena, Italy, July 21-25, 2014, 2014, pp. 21–25.
[9] C. Kirschner, J. Eckle-Kohler, I. Gurevych, Linking the thoughts: Analysis of argumentation structures in scientific publications, in: Proceedings of the 2nd Workshop on Argumentation Mining, 2015, pp. 1–11.
[10] N. Green, Identifying argumentation schemes in genetics research articles, in: Proceedings of the 2nd Workshop on Argumentation Mining, 2015, pp. 12–21.
[11] S. Teufel, et al., Argumentative zoning: Information extraction from scientific text, Ph.D. thesis, University of Edinburgh, 1999.
[12] S. Teufel, A. Siddharthan, C. Batchelor, Towards discipline-independent argumentative zoning: Evidence from chemistry and computational linguistics, in: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP 2009) (Volume 3), Association for Computational Linguistics, 2009, pp. 1493–1502.
[13] M. Liakata, S. Saha, S. Dobnik, C. Batchelor, D. Rebholz-Schuhmann, Automatic recognition of conceptualization zones in scientific articles and two life science applications, Bioinformatics 28 (2012) 991–1000.
[14] M. Liakata, L. N. Soldatova, et al., Semantic annotation of papers: Interface & enrichment tool (SAPIENT), in: Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing, Association for Computational Linguistics, 2009, pp. 193–200.
[15] M. Liakata, S. Teufel, A. Siddharthan, C. Batchelor, Corpora for the conceptualisation and zoning of scientific papers, in: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), European Language Resources Association (ELRA), Valletta, Malta, 2010.
[16] F. Dernoncourt, J. Y. Lee, PubMed 200k RCT: A dataset for sequential sentence classification in medical abstracts, arXiv preprint arXiv:1710.06071 (2017).
[17] R. Mochales-Palau, M.-F. Moens, Argumentation mining: The detection, classification and structure of arguments in text, in: Proceedings of the 12th International Conference on Artificial Intelligence and Law (ICAIL 2009), ACM, 2009, pp. 98–107.
[18] T. Goudas, C. Louizos, G. Petasis, V. Karkaletsis, Argument extraction from news, blogs, and social media, in: Hellenic Conference on Artificial Intelligence, Springer, 2014, pp. 287–299.
[19] E. Aharoni, L. Dankin, D. Gutfreund, T. Lavee, R. Levy, R. Rinott, N. Slonim, Context-dependent evidence detection, 2018. US Patent App. 14/720,847.
[20] E. Florou, S. Konstantopoulos, A. Koukourikos, P. Karampiperis, Argument extraction for supporting public policy formulation, in: Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, Association for Computational Linguistics, Sofia, Bulgaria, 2013, pp. 49–54.
[21] C. Stab, I. Gurevych, Annotating argument components and relations in persuasive essays, in: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, Dublin City University and Association for Computational Linguistics, Dublin, Ireland, 2014, pp. 1501–1510.
[22] J. Visser, B. Konat, R. Duthie, M. Koszowy, K. Budzynska, C. Reed, Argumentation in the 2016 US presidential elections: Annotated corpora of television debates and social media reaction, Language Resources and Evaluation (2019) 1–32.
[23] I. Habernal, I. Gurevych, Argumentation mining in user-generated web discourse, Computational Linguistics 43 (2017) 125–179. doi:10.1162/COLI_a_00276.
[24] C. Schulz, S. Eger, J. Daxenberger, T. Kahse, I. Gurevych, Multi-task learning for argumentation mining in low-resource settings, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 35–41. doi:10.18653/v1/N18-2006.
[25] A. Lauscher, G. Glavaš, S. P. Ponzetto, An argument-annotated corpus of scientific publications, in: Proceedings of the 5th Workshop on Argument Mining (ArgMining 2018), 2018, pp. 40–46.
[26] A. Lauscher, G. Glavaš, K. Eckert, ArguminSci: A tool for analyzing argumentation and rhetorical aspects in scientific writing, in: Proceedings of the 5th Workshop on Argument Mining (ArgMining 2018), 2018, pp. 22–28.
[27] B. Fisas, F. Ronzano, H. Saggion, A multi-layered annotated corpus of scientific papers, in: Proceedings of the 2016 International Conference on Language Resources and Evaluation, 2016.
[28] D. R. Radev, P. Muthukrishnan, V. Qazvinian, A. Abu-Jbara, The ACL Anthology network corpus, Language Resources and Evaluation 47 (2013) 919–944.
[29] A. Yang, S. Li, SciDTB: Discourse dependency TreeBank for scientific abstracts, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018) (Volume 2: Short Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 444–449.
[30] A. Peldszus, M. Stede, Rhetorical structure and argumentation structure in monologue text, in: Proceedings of the Third Workshop on Argument Mining (ArgMining 2016), 2016, pp. 103–112.
[31] A. Peldszus, M. Stede, An annotated corpus of argumentative microtexts, in: Proceedings of the First Conference on Argumentation, Lisbon, Portugal, 2015.
[32] M. Neves, D. Butzke, B. Grune, Evaluation of scientific elements for text similarity in biomedical publications, in: Proceedings of the 6th Workshop on Argument Mining, Association for Computational Linguistics, Florence, Italy, 2019, pp. 124–135. URL: https://www.aclweb.org/anthology/W19-4515. doi:10.18653/v1/W19-4515.
[33] J. Swales, Genre analysis: English in academic and research settings, Cambridge University Press, 1990.
[34] M. B. Dos Santos, The textual organization of research paper abstracts in applied linguistics, Text - Interdisciplinary Journal for the Study of Discourse 16 (1996) 481–500.
[35] Z. He, H. Wu, H. Wang, T. Liu, Transformation from discontinuous to continuous word alignment improves translation quality, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 147–152.
[36] L. Ng, A. Lauscher, J. Tetreault, C. Napoles, Creating a domain-diverse corpus for theory-based argument quality assessment, arXiv preprint arXiv:2011.01589 (2020).
[37] J. Sonntag, M. Stede, GraPAT: A tool for graph annotations, in: Proceedings of the 2014 International Conference on Language Resources and Evaluation, 2014, pp. 4147–4151.
[38] D. Marcu, E. Amorrortu, M. Romera, Experiments in constructing a corpus of discourse trees, in: Towards Standards and Tools for Discourse Tagging, 1999. URL: https://www.aclweb.org/anthology/W99-0307.
[39] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[40] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
[41] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., HuggingFace's Transformers: State-of-the-art natural language processing, arXiv preprint arXiv:1910.03771 (2019).
[42] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained language model for scientific text, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 3606–3611.
A. Supplemental material

A.1. Types of units

Table 6: Examples for each type of unit

proposal: We present a novel approach to improve word alignment for statistical machine translation ( SMT ) .
proposal-implementation: We observe , identify , and detect naturally occurring signals of interestingness in click transitions on the Web between source and target documents , which we collect from commercial Web browser logs .
observation: Our method produces a gain of +1.68 BLEU on NIST OpenMT04 for the phrase-based system , and a gain of +1.28 BLEU on NIST OpenMT06 for the hierarchical phrase-based system .
result: Experimental results show statistically significant improvements of BLEU score in both cases over the baseline systems .
means: We conducted experiments on two standard benchmarks : Chinese PropBank and English PropBank .
result-means: Results on the Switchboard disfluency tagged corpus show utterance-final accuracy on a par with state-of-the-art incremental repair detection methods , but with better incremental accuracy , faster time-to-detection and less computational overhead .
conclusion: This transfer learning approach brings a clear performance gain over features based on the traditional bag-of-visual-word approach .
motivation-problem: However , fundamental problems on effectively incorporating the word embedding features within the framework of linear models remain .
motivation-hypothesis: Combining the two tasks can potentially improve the efficiency of the overall pipeline system and reduce error propagation .
motivation-background: Recent work has shown success in using continuous word embeddings learned from unlabeled data as features to improve supervised NLP systems , which is regarded as a simple semi-supervised learning mechanism .
information-additional: The structure of argumentation consists of several components ( i.e. claims and premises ) that are connected with argumentative relations .

A.2. Distribution of types and relations in the CL and BIO sub-corpora

Table 7: Distribution of unit types in CL and BIO

Type                       CL     BIO
proposal                   285    289
proposal-implementation    264    274
observation                39     505
result                     157    703
conclusion                 54     301
means                      27     58
result-means               69     31
motivation-problem         103    97
motivation-background      157    487
motivation-hypothesis      20     16
information-additional     24     26

Table 8: Distribution of relations in CL and BIO

Relation        CL     BIO
support         417    1581
elaboration     358    535
by-means        28     57
info-required   120    303
sequence        29     2
info-optional   22     24