Transfer Learning with Sentence Embeddings for Argumentative Evidence Classification

Davide Liga 1,2 [0000-0003-1124-0299] and Monica Palmirani 1 [0000-0002-8557-8084]

1 Alma Mater Studiorum - University of Bologna, Bologna, Italy
{monica.palmirani,davide.liga2}@unibo.it
2 University of Luxembourg, Luxembourg

Abstract. This work describes a simple Transfer Learning methodology aimed at discriminating evidences related to Argumentation Schemes using three different pre-trained neural architectures. Although Transfer Learning techniques are increasingly gaining momentum, the number of Transfer Learning works in the field of Argumentation Mining is relatively small and, to the best of our knowledge, no attempt has been made in the specific direction of discriminating evidences related to Argumentation Schemes. The research question of this paper is whether Transfer Learning can discriminate Argumentation Schemes' components, a crucial yet rarely explored task in Argumentation Mining. Results show that, even with small amounts of data, classifiers trained on sentence embeddings extracted from pre-trained transformers can achieve encouraging scores, outperforming previous results on evidence classification.

Keywords: Argumentation Mining · Transfer Learning · Argumentation Schemes

1 Introduction

In the last few years, the use of Transfer Learning methodologies has generated remarkable momentum in many Natural Language Processing (NLP) tasks. In particular, the Transformer known as "Bidirectional Encoder Representations from Transformers" (BERT) has shown extremely good results, establishing several new records on standard benchmarks [3]. In 2018, BERT obtained new state-of-the-art results on eleven NLP-related tasks. Within a couple of years, dozens of variants were developed, establishing further records not just in English but also in other languages (e.g., the Italian versions GilBERTo (https://github.com/idb-ita/GilBERTo) and umBERTo (https://github.com/musixmatchresearch/umberto), or the French CamemBERT [11]).

Despite the popularity recently achieved by Transfer Learning techniques, these methodologies have been applied relatively few times in Argumentation Mining [12, 14]. To the best of our knowledge, this is the first work that explicitly assesses Transfer Learning performances with the aim of discriminating argumentative components related to Argumentation Schemes [17]. On the one hand, the approach proves capable of discriminating argumentative stances of support and opposition related to some well-known argumentative patterns (Argumentation Schemes), such as the Argument from Expert Opinion and the Argument from Negative Consequences, showing better results compared to previous studies. On the other hand, the approach shows that it is possible to cluster Argumentation Schemes according to the criteria of the pragmatic dimension, a crucial aspect described in the most recent literature about Argumentation Scheme classification [10, 6].
In summary, the approach shows an ability to classify argumentative evidences not only at fine-grained levels (e.g., different instances of the Argument from Expert Opinion) but also at the level of large clusters (like the Argumentation Schemes coming from an external source, a class which, according to some classification approaches, can be used as a first dichotomic criterion of discrimination among schemes [10, 6]).

Section 2 will describe the Transfer Learning methodology and the two main settings for the experiments. Section 3 will describe the datasets used for the experiments in the two scenarios. Sections 4 and 5 will show the experimental results on the two scenarios. Section 6 will describe related works. In Section 7, some final considerations will conclude the paper.

2 Methodology

Transfer Learning methods are generally divided into two approaches. The first approach is called fine-tuning and consists of using a pre-trained neural architecture (i.e., a Transformer architecture pre-trained on large amounts of data) as a starting point to perform further training steps on a downstream task (training, thus, the neural architecture on downstream data). The second approach, instead, uses a pre-trained neural architecture only to extract the outputs that the architecture generates for a given input at a specific stage. For example, a sentence can be used as input, and the output generated by the neural architecture can be extracted and used as a sentence embedding representing that sentence in other downstream tasks (notably, the extraction of the generated output to be used as an embedding can be performed at different stages of the neural architecture, not necessarily at the final layer).

In this paper, the second approach will be employed: a well-known pre-trained architecture will be selected, sentences will be used as inputs for this neural architecture, and the outputs coming from the neural architecture will be employed as sentence embeddings to represent our data in a series of downstream classification tasks.

For the pre-trained embeddings we will employ three pre-trained models. The first one is the well-known neural transformer called BERT [3] (specifically, we will use the uncased base version). The second and third models are two recent models derived from BERT, namely distilBERT [16] and RoBERTa [9] (uncased). While BERT base consists of 12 layers, 768 hidden dimensions, 12 self-attention heads and nearly 110M parameters, RoBERTa base consists of 12 layers, 768 hidden dimensions, 12 self-attention heads and 125M parameters. Finally, distilBERT consists of 6 layers, 768 hidden dimensions, 12 self-attention heads and 66M parameters.

To extract the embeddings from the neural models, each input sentence must first be tokenized according to the requirements of the given model. Typically, with BERT, the special tokens [CLS] and [SEP] are inserted at the beginning and at the end of the input, respectively (we are interested in the former, the token holding the classification output we want to extract for each input sentence). Moreover, each input sentence is set to a maximum length: all sentences longer than that limit are truncated, while all sentences shorter than that limit are padded with the special [PAD] token. This process makes sure that all inputs have the same length before entering the neural architecture.
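To make this extraction step concrete, the sketch below shows one way to obtain such [CLS] sentence embeddings. It is a minimal illustration under stated assumptions: the paper does not name its implementation, so the HuggingFace transformers library, the model identifiers and the example sentences are our own choices.

```python
# A minimal sketch of the embedding-extraction step, assuming the
# HuggingFace "transformers" library (the paper does not name its tooling).
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "bert-base-uncased"  # or "distilbert-base-uncased", "roberta-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

# Invented example sentences, purely for illustration.
sentences = ["According to the expert, the treatment is safe.",
             "A recent study reported the opposite findings."]

# Tokenize: the special tokens ([CLS]/[SEP]) are added automatically,
# longer inputs are truncated and shorter ones padded to a common length.
batch = tokenizer(sentences, padding="max_length", truncation=True,
                  max_length=64, return_tensors="pt")

# Forward pass with the calculation of gradients deactivated
# (pure feature extraction, no fine-tuning).
with torch.no_grad():
    output = model(**batch)

# Hidden state of the first token ([CLS] in BERT/DistilBERT, <s> in RoBERTa)
# used as the sentence embedding: shape (batch_size, 768).
embeddings = output.last_hidden_state[:, 0, :].numpy()
```

The 768-dimensional vectors obtained in this way are the features used by the downstream classifiers described next.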
After the tokenization, inputs are passed through the neural architecture of the transformer while the calculation of gradients is deactivated. After having transformed each input sentence of the datasets into tokens and having used these tokens as inputs for the neural architecture, the resulting extracted embeddings have been used, in turn, as inputs for two classification procedures: a Support Vector Machine (SVM) classifier and a Logistic Regression (LR) classifier. Notice that for the experiment on D3 our SVM employed a Linear Support Vector Classifier (Linear SVC), while in all other experiments we employed a standard Support Vector Classifier (SVC). The classification method is One-vs-All, which means that the classification has been performed for each class, considering one class against all the other classes, a typical approach in multiclass and multilabel scenarios. Finally, all classifiers have been evaluated on the corresponding test set.

The experiments have been divided into different scenarios:

1. Baseline scenario: the classification was performed in the same setting as two previous works, taken as baselines for comparison.
2. Extended scenario: the classification was performed on new settings, using extended versions of two datasets from the baseline scenario.

3 Data

The experiments of this work have been carried out on the datasets listed in Table 1, which also reports the number of instances for each dataset. These datasets have been selected because their annotations describe classes of argumentative evidence directly related to specific Argumentation Schemes. Importantly, during the experiments, all datasets have been split into train and test sets, following a standard 80/20 ratio.

Regarding the baseline scenario, D1 and D2 are portions of Al Khatib et al. 2016 and Aharoni et al. 2014 respectively, two important datasets designed by IBM. Only two classes from the original datasets have been selected, reproducing the scenario in [7] in order to have baseline scenarios for our classifiers. D3 is a small dataset (only 638 sentences) from Liga and Palmirani 2019. It is a dataset with different levels of granularity, depending on how many classes are considered. In this case we selected granularity three, which contains three labels.

Table 1. Description of all datasets used in this paper.

Dataset   Reference                                                       Classes                            Instances
Baseline datasets:
D1        Al Khatib et al. 2016 (only two classes extracted as in [7])    Study, Testimony                   653
D2        Aharoni et al. 2014 (only two classes selected as in [7])       Study, Expert                      569
D3        Liga and Palmirani 2019                                         Slippery Slope, Testimony, Other   638
Extended datasets:
D1+       Al Khatib et al. 2016 (three classes, following [7])            Study, Testimony, Anecdotal        2253
D2+       Aharoni et al. 2014                                             Study, Expert, Anecdotal           1291
D2++      Rinott et al. 2015                                              Study, Expert, Anecdotal           4692

Regarding the extended scenario, the dataset D1+ is an extension of D1: instead of extracting just two classes, it considers three classes. The inputs of the dataset from Al Khatib et al. 2016 [2] are structured in a very fragmented way, so we needed to rebuild the sentences following the approach suggested in [7]. Similarly, D2+ is an extension of D2 (instead of being a selection of just two classes, it considers three classes).
Finally, D2++ is an extended version of the same dataset which, having many more instances, can be a useful benchmark for this kind of classification.

Importantly, the datasets employed in this work are among the few available datasets containing instances of argumentative evidences which can be related to Argumentation Schemes. Namely, the dataset in Al Khatib et al. 2016 [2] contains instances of argumentative evidences labelled as Study, Testimony and Anecdotal: these evidences support argumentative claims which refer to source-based opinions, which means that they belong to different types of source-based arguments. One of the most famous examples of a source-based Argumentation Scheme is the well-known Argument from Expert Opinion; another famous scheme is the Argument from Witness Testimony (more details about this kind of schemes can be found in [6]).

The datasets in Aharoni et al. 2014 [1] and Rinott et al. 2015 [15] present similar source-based Argumentation Schemes (however, this time the labels are Study, Expert and Anecdotal). In this case, the cluster of argumentative evidences labelled with the class Expert is likely to be compatible with the evidences of an Argumentation Scheme from Expert Opinion. The dataset in Liga and Palmirani 2019 [8] instead offers only one class of evidences related to source-based arguments (Testimony), while another class relates to a cluster of evidences which can be linked to the Argument from Negative Consequences and the Slippery Slope Argument.

These three datasets can thus be used to assess whether classifiers are able to discriminate between different clusters of argumentative evidences. Since these argumentative evidences are strictly related to specific clusters of Argumentation Schemes, the ability of classifiers to discriminate different clusters of argumentative evidences is, in our opinion, a crucial step towards Argumentation Scheme discrimination.

4 Results for the Baseline Scenario

The classifications in this section show that the proposed approach is able to outperform recent results in the Argumentation Mining literature. To this purpose, recent results on D1, D2 and D3 are reported [7, 8] and used as baselines for our classifiers.

Table 2. Results on the baseline classifiers (D1, D2, D3) considering mean F1 scores (macro) and two kinds of classifier. SVM = Support Vector Machine; LR = Logistic Regression; BS = Baseline. Columns whose mean F1 value has an asterisk refer to a Linear Support Vector Classifier. In bold are all the mean F1 scores which exceed the mean F1 of the baseline. The three grey columns represent the best classifiers for the baseline scenario.

                        BERT Base     DistilBERT    RoBERTa
Classes           BS    SVM    LR     SVM    LR     SVM    LR
D1 (Al Khatib et al. 2016)
Study             .94   .92    .97    .97    .91    .89    .91
Testimony         .93   .91    .97    .96    .89    .86    .92
mean F1           .94   .92    .97    .96    .90*   .88    .92
D2 (Aharoni et al. 2014)
Study             .78   .71    .72    .74    .75    .79    .69
Expert            .75   .68    .67    .72    .73    .77    .78
mean F1           .76   .69    .69*   .73    .74*   .78    .73
D3 (Liga and Palmirani 2019)
Slippery Slope    .75   .71    .79    .76    .82    .60    .70
Testimony         .90   .92    .93    .94    .93    .73    .71
Other             .85   .86    .87    .87    .87    .85    .91
mean F1           .83*  .82    .86*   .86    .87*   .73    .77

In this paper, all F1 scores per class are calculated as the mean macro F1 scores, taken from each One-vs-All classification. All these scores are finally averaged and reported as mean F1 (for each classifier, i.e., SVM and LR).
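As a concrete illustration of this classification and scoring procedure, the sketch below assumes scikit-learn; the random arrays are stand-ins for the extracted embeddings and labels, and the default hyperparameters are our own assumption, since the paper does not report them.

```python
# A minimal sketch of the One-vs-All classification and macro-F1 scoring
# described above, assuming scikit-learn (the paper does not name its tooling).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC  # a LinearSVC was used instead for D3
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Stand-ins for the extracted 768-dimensional sentence embeddings and the
# evidence labels (here shaped like D1+; in practice X would come from the
# extraction step shown in Section 2).
rng = np.random.default_rng(0)
X = rng.normal(size=(2253, 768))
y = rng.choice(["Study", "Testimony", "Anecdotal"], size=2253)

# Standard 80/20 train/test split, as applied to every dataset.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

classifiers = {
    "SVM": OneVsRestClassifier(SVC()),
    "LR": OneVsRestClassifier(LogisticRegression(max_iter=1000)),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Macro F1 averages the per-class F1 scores of the One-vs-All classifiers.
    print(name, "mean F1:", round(f1_score(y_test, y_pred, average="macro"), 2))
```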
As can be seen from Table 2, the results outperform previous results for the same scenario, showing the ability of Transfer Learning techniques to achieve high performance. As indicated by the bold numbers in Table 2, for D1, D2 and D3 there are always at least four classifiers out of six which outperform the baseline.

5 Results for the Extended Scenario

The next series of experiments has been performed on extended versions of two datasets from the baseline scenario (D1 and D2), to assess how performances change in a multiclass scenario.

Table 3. Results on D1+, D2+ and D2++ considering mean F1 scores (macro) and two kinds of classifiers. SVM = Support Vector Machine; LR = Logistic Regression. Columns whose mean F1 value has an asterisk refer to a Linear Support Vector Classifier. In bold are the top mean F1 scores. The three grey columns represent the best classifiers for the extended scenario.

                  BERT Base     DistilBERT    RoBERTa
Classes           SVM    LR     SVM    LR     SVM    LR
D1+ (Al Khatib et al. 2016)
Study             .83    .85    .83    .87    .86    .77
Testimony         .77    .81    .81    .82    .78    .70
Anecdotal         .81    .81    .82    .84    .83    .77
mean F1           .80    .82    .82*   .84    .82*   .75
D2+ (Aharoni et al. 2014)
Study             .89    .90    .91    .91    .90    .85
Expert            .91    .91    .92    .93    .90    .84
Anecdotal         .92    .93    .92    .93    .92    .92
mean F1           .91*   .91    .92*   .92    .91*   .87
D2++ (Rinott et al. 2015)
Study             .93    .94    .94    .94    .92    .90
Expert            .92    .92    .93    .93    .91    .88
Anecdotal         .91    .93    .90    .92    .87    .85
mean F1           .92*   .93    .92*   .93    .90*   .88

Table 3 shows a clear trend, with Logistic Regression on DistilBERT being the best solution both for the dataset extending D1 (i.e., D1+) and for the datasets extending D2 (i.e., D2+ and D2++).

Regarding the classifications on D1+, one can see that the best performances are achieved by the Logistic Regression (LR) classifier trained on sentence embeddings extracted using DistilBERT. For a better understanding of these results, the confusion matrices of the best classifier in this scenario (i.e., Logistic Regression on DistilBERT) are reported in Figure 1, alongside the confusion matrix of the best classifier of the baseline scenario (i.e., the Support Vector Machine on DistilBERT embeddings from Table 2).

[Fig. 1. Confusion matrices for D1 (in green) and D1+: D1, study vs testimony (SVM on DistilBERT); D1+, study vs others (LR on DistilBERT); D1+, testimony vs others (LR on DistilBERT). The number of instances and the relative percentages are reported.]

Regarding the classifications on D2+ and D2++, one can see that the best performances are achieved by the Logistic Regression classifiers trained on sentence embeddings extracted using DistilBERT and BERT Base. Also in this case, for a better understanding of the results, the confusion matrices of the best classifiers in this scenario (i.e., Logistic Regression on DistilBERT embeddings and on BERT Base embeddings) are reported in Figure 2, alongside the confusion matrix of the best classifier of the baseline scenario (i.e., Logistic Regression on RoBERTa embeddings from Table 2).

Notice that while the confusion matrices for D1 and D2 (in green) show a binary classification, the other confusion matrices in blue (relative to D1+, D2+ and D2++) show a one-vs-all classification. These blue matrices show that classifiers are able to recognize classes also in a multiclass scenario.
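For readers who want to reproduce such matrices, the following minimal sketch (assuming scikit-learn, with toy labels in place of real predictions) shows how a one-vs-all confusion matrix for a single class can be derived from multiclass predictions.

```python
# A minimal sketch of a one-vs-all confusion matrix for one target class,
# assuming scikit-learn; the labels below are toy stand-ins.
from sklearn.metrics import confusion_matrix

y_test = ["Study", "Testimony", "Anecdotal", "Study", "Testimony"]
y_pred = ["Study", "Testimony", "Study", "Study", "Anecdotal"]

target = "Study"
y_true_bin = [int(label == target) for label in y_test]
y_pred_bin = [int(label == target) for label in y_pred]

# 2x2 matrix: rows = true ("others", target), columns = predicted.
print(confusion_matrix(y_true_bin, y_pred_bin))
```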
While Figure 1 shows an imbalance (probably due to the predominance of the class Anecdotal), the results in Figure 2 seem more balanced: the diagonal always shows a 30/60 ratio, indicating the quality of the predictions.

[Fig. 2. Confusion matrices for D2 (in green), D2+ and D2++: D2, study vs expert (LR on RoBERTa); D2+, study vs others and expert vs others (LR on DistilBERT); D2++, study vs others and expert vs others (LR on DistilBERT); D2++, study vs others and expert vs others (LR on BERT Base). The number of instances and the relative percentages are reported.]

6 Related Works

Unfortunately, datasets specifically designed to allow a direct link between classes and specific Argumentation Schemes are very few. A promising and growing resource, in this sense, are the corpora in AIFdb [5], thanks also to the contribution of tools like OVA+ [13], which recently added a very important component for Argumentation Scheme annotation called the Argument Scheme Key [6].

Moreover, although there have been various works on text classification in Argumentation Mining, only a few studies have focused on classification tasks aiming at facilitating the discrimination of Argumentation Schemes.

Rinott et al. 2015 [15] achieved important results on evidence detection employing the dataset D2++. However, their approach is mostly context-dependent, while the present work does not consider the context. In Liga 2019 [7], the classification has been performed using Tree Kernel classifiers on D1 and D2, which contain argumentative evidences of support among which it is possible to find evidences directly related to the Argument from Expert Opinion. That work is, however, limited to a binary classification. A similar approach, in a multiclass scenario, is described in Liga and Palmirani 2019 [8], where Tree Kernels are employed on D3, a small dataset which considers argumentative evidences of opposition among which one can find, for example, the Slippery Slope Argument.

Considering these two works as baselines, the approach presented in this paper seems capable of outperforming the previous achievements.

7 Conclusion

The datasets analyzed in this work are composed of argumentative evidences which are directly related to different clusters of arguments. For example, many instances found in the datasets of this paper are directly related to the cluster of source-based arguments. Other instances of argumentative evidences are instead specifically related to the Argumentation Scheme from Expert Opinion, while others are related to the cluster which includes the Argument from Negative Consequences and the Slippery Slope Argument (which do not belong to the cluster of source-based arguments).

We believe that the ability to discriminate different clusters of argumentative evidences is a crucial step in the classification of Argumentation Schemes. For example, the discrimination of clusters of Argumentation Schemes can be performed in a pipeline of binary classifications, starting from source-based versus non-source-based arguments and continuing towards more specific binary classifications (similarly to the path of dichotomous choices followed by ASK, the annotation system recently elaborated in [6], which offers a valuable system of classification of Argumentation Schemes).
In general, the results presented in this paper seem encouraging, showing that pre-trained embeddings can outperform previous results in the field of Argumentation Mining related to the classification of argumentative evidences. An interesting aspect is that the proposed classifiers show encouraging results not only in the discrimination among different source-based argumentative evidences, but also in classifications involving source-based versus non-source-based argumentative evidences (i.e., with dataset D3).

However, further analysis is needed to verify whether and how Transfer Learning techniques can discriminate argumentative evidences in such a way that they can facilitate Argumentation Scheme discrimination. In this regard, the present paper is just a preliminary exploration of a promising approach. In future works, other Transfer Learning techniques should be assessed as well. For example, it could be useful to compare the performances of the two main Transfer Learning techniques: sentence embeddings and fine-tuning. Also, other pre-trained models should be employed and compared (e.g., XLNet [18], ALBERT [4]). A long-term goal is being able to connect natural language argumentative evidences to their specific Argumentation Schemes, which can be a further step in the development of artificial Natural Argumentation Understanding.

References

1. Aharoni, E., Polnarov, A., Lavee, T., Hershcovich, D., Levy, R., Rinott, R., Gutfreund, D., Slonim, N.: A benchmark dataset for automatic detection of claims and evidence in the context of controversial topics. In: Proceedings of the First Workshop on Argumentation Mining. pp. 64–68 (2014)
2. Al Khatib, K., Wachsmuth, H., Kiesel, J., Hagen, M., Stein, B.: A news editorial corpus for mining argumentation strategies. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. pp. 3433–3443 (2016)
3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
4. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019)
5. Lawrence, J., Reed, C.: Argument mining: A survey. Computational Linguistics 45(4), 765–818 (2020)
6. Lawrence, J., Visser, J., Reed, C.: An online annotation assistant for argument schemes. In: Proceedings of the 13th Linguistic Annotation Workshop. pp. 100–107. Association for Computational Linguistics (2019)
7. Liga, D.: Argumentative evidences classification and argument scheme detection using tree kernels. In: Proceedings of the 6th Workshop on Argument Mining. pp. 92–97 (2019)
8. Liga, D., Palmirani, M.: Detecting "slippery slope" and other argumentative stances of opposition using tree kernels in monologic discourse. In: International Joint Conference on Rules and Reasoning. pp. 180–189. Springer (2019)
9. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
10. Macagno, F., Walton, D., Reed, C.: Argumentation schemes: History, classifications, and computational applications. pp. 2493–2556 (2017)
11. Martin, L., Muller, B., Suárez, P.J.O., Dupont, Y., Romary, L., de la Clergerie, É.V., Seddah, D., Sagot, B.: CamemBERT: a tasty French language model. arXiv preprint arXiv:1911.03894 (2019)
12. Niven, T., Kao, H.Y.: Probing neural network comprehension of natural language arguments. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. pp. 4658–4664 (2019)
13. Janier, M., Lawrence, J., Reed, C.: OVA+: An argument analysis interface. In: Computational Models of Argument: Proceedings of COMMA. vol. 266, p. 463 (2014)
14. Reimers, N., Schiller, B., Beck, T., Daxenberger, J., Stab, C., Gurevych, I.: Classification and clustering of arguments with contextualized word embeddings. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. pp. 567–578 (2019)
15. Rinott, R., Dankin, L., Alzate, C., Khapra, M.M., Aharoni, E., Slonim, N.: Show me your evidence - an automatic method for context dependent evidence detection. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pp. 440–450 (2015)
16. Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019)
17. Walton, D., Reed, C., Macagno, F.: Argumentation schemes. Cambridge University Press (2008)
18. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., Le, Q.V.: XLNet: Generalized autoregressive pretraining for language understanding. In: Advances in Neural Information Processing Systems. pp. 5754–5764 (2019)