<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Transfer Learning with Sentence Embeddings for Argumentative Evidence Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Davide Liga</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Monica Palmirani</string-name>
        </contrib>
        <aff>
          <institution>Alma Mater Studiorum - University of Bologna</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Luxembourg</institution>
          ,
          <country country="LU">Luxembourg</country>
        </aff>
      </contrib-group>
      <fpage>11</fpage>
      <lpage>20</lpage>
      <abstract>
        <p>This work describes a simple Transfer Learning methodology aimed at discriminating evidence related to Argumentation Schemes using three different pre-trained neural architectures. Although Transfer Learning techniques are increasingly gaining momentum, the number of Transfer Learning works in the field of Argumentation Mining is relatively small and, to the best of our knowledge, no attempt has been made in the specific direction of discriminating evidence related to Argumentation Schemes. The research question of this paper is whether Transfer Learning can discriminate Argumentation Schemes' components, a crucial yet rarely explored task in Argumentation Mining. Results show that, even with small amounts of data, classifiers trained on sentence embeddings extracted from pre-trained transformers can achieve encouraging scores, outperforming previous results on evidence classification.</p>
      </abstract>
      <kwd-group>
        <kwd>Argumentation Mining</kwd>
        <kwd>Transfer Learning</kwd>
        <kwd>Argumentation Schemes</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In the last few years, the use of Transfer Learning methodologies has generated remarkable momentum in the state of the art of many Natural Language Processing tasks. In particular, the Transformer known as "Bidirectional Encoder Representations from Transformers" (BERT) has shown extremely good results, establishing several new records in terms of metric scores [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In 2018, BERT obtained new state-of-the-art results on eleven NLP-related tasks. Within a couple of years, dozens of variants were developed, establishing further records not just in English but also in other languages (e.g., the Italian versions GilBERTo and umBERTo, and the French CamemBERT [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]).
      </p>
      <p>
        Despite the popularity recently achieved by Transfer Learning techniques, these methodologies have been applied relatively few times in Argumentation Mining [
        <xref ref-type="bibr" rid="ref12 ref14">12, 14</xref>
        ]. To the best of our knowledge, this is the first work that explicitly assesses Transfer Learning performance with the aim of discriminating argumentative components related to Argumentation Schemes [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. On the one hand, the approach proves capable of discriminating argumentative stances of support and opposition related to some well-known argumentative patterns (Argumentation Schemes), such as the Argument from Expert Opinion and the Argument from Negative Consequences, showing better results compared to previous studies. On the other hand, the approach shows that it is possible to cluster Argumentation Schemes according to the criteria of the pragmatic dimension, a crucial aspect described in the most recent literature on Argumentation Scheme classification [
        <xref ref-type="bibr" rid="ref10 ref6">10, 6</xref>
        ]. In summary, the approach shows an ability to classify argumentative evidence not only at fine-grained levels (e.g., different instances of the Argument from Expert Opinion) but also at the level of large clusters (like the Argumentation Schemes coming from an external source, a class which, according to some classification approaches, can be used as a first dichotomic criterion of discrimination among schemes [
        <xref ref-type="bibr" rid="ref10 ref6">10, 6</xref>
        ]).
      </p>
      <p>Section 2 will describe the Transfer Learning methodology and the two main settings for the experiments. Section 3 will describe the datasets used for the experiments in the two scenarios. Sections 4 and 5 will show the experimental results on the two scenarios. Section 6 will describe related works. In Section 7, some final considerations will conclude the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>Transfer Learning methods are generally divided into two approaches. The first approach is called fine-tuning, and it consists of using a pre-trained neural architecture (i.e., a Transformer architecture trained on large amounts of data) as a starting point to perform further training steps on a downstream task (thus training the neural architecture on downstream data). The second approach, instead, is to use a pre-trained neural architecture just to extract the outputs that it generates for a given input at a specific stage of the network. For example, a sentence can be used as input and the output generated by the neural architecture can be extracted and used as a sentence embedding, which can represent the sentence in other downstream tasks (notably, the extraction of the generated output to be used as an embedding can be performed at different stages of the neural architecture, not necessarily at the final layer). In this paper, the second approach will be employed: a famous pre-trained architecture will be selected, sentences will be used as inputs for this neural architecture, and the outputs coming from the neural architecture will be employed as sentence embeddings to represent our data in a series of downstream classification tasks.</p>
      <p>
        For the pre-trained embeddings we will employ three pre-trained models: the first one is the famous neural Transformer called BERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] (specifically, we will use the uncased base version). The second and third models are two recent models derived from BERT, namely DistilBERT [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and RoBERTa [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] (uncased). While BERT base consists of 12 layers, 768 hidden dimensions, 12 self-attention heads and nearly 110M parameters, RoBERTa base consists of 12 layers, 768 hidden dimensions, 12 self-attention heads and 125M parameters. Finally, DistilBERT consists of 6 layers, 768 hidden dimensions, 12 self-attention heads and 66M parameters.
      </p>
      <p>To extract the embeddings from the neural models, each input sentence must first be tokenized according to the requirements of the given model. Typically, with BERT, the special tokens [CLS] and [SEP] are inserted at the beginning and at the end of the input (we are interested in the first one, which is the token holding the classification output we want to extract for the input sentence). Moreover, the length of each input sentence is set to a maximum length: all sentences longer than that limit are shortened, while all sentences shorter than that limit are padded with the special [PAD] token. This process makes sure that all inputs have the same length before entering the neural architecture. After the tokenization, inputs are passed through the neural architecture of a BERT transformer, while deactivating the calculation of gradients.</p>
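      <p>As a rough illustration of this preprocessing step, the padding and truncation logic can be sketched in plain Python. Note that this is only a sketch: the whitespace splitting and the maximum length below are illustrative placeholders, since a real pipeline would use the model's own subword tokenizer.</p>
      <preformat>
```python
# Illustrative sketch of BERT-style input preparation: wrap each sentence
# with [CLS]/[SEP], then truncate or pad to a fixed maximum length.
# Whitespace splitting and MAX_LEN are illustrative; a real pipeline uses
# the model's subword tokenizer.

MAX_LEN = 8  # illustrative maximum length


def prepare_input(sentence, max_len=MAX_LEN):
    tokens = ["[CLS]"] + sentence.lower().split() + ["[SEP]"]
    if len(tokens) > max_len:
        # shorten, keeping the final [SEP] marker
        tokens = tokens[:max_len - 1] + ["[SEP]"]
    else:
        # pad shorter sentences up to the fixed length
        tokens += ["[PAD]"] * (max_len - len(tokens))
    return tokens


print(prepare_input("experts agree on this"))
print(prepare_input("a very long sentence that exceeds the maximum length"))
```
      </preformat>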
      <p>After having transformed each input sentence of the datasets into tokens and having used these tokens as inputs for the BERT neural architecture, the resulting extracted embeddings have been used, in turn, as inputs for two classification procedures: a Support Vector Machine (SVM) classifier and a Logistic Regression classifier (LRC). Notice that for the experiment on D3 our SVM employed a Linear Support Vector Classifier (Linear SVC), while in all other experiments we employed a standard Support Vector Classifier (SVC).</p>
      <p>The classification method is One-vs-All, which means that the classification has been performed for each class, considering one class against all the other classes, a typical approach in multiclass and multilabel scenarios. Finally, all classifiers have been evaluated on the corresponding test set.</p>
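      <p>Under this setup, the downstream classification stage can be sketched with scikit-learn. Here random vectors stand in for the 768-dimensional sentence embeddings extracted from the transformers, and the One-vs-All scheme is made explicit with OneVsRestClassifier; the data and class counts are illustrative assumptions, not the paper's actual datasets.</p>
      <preformat>
```python
# Sketch of the downstream classification stage (scikit-learn), using
# random vectors as stand-ins for the 768-dimensional sentence embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 768))    # stand-in sentence embeddings
y = rng.integers(0, 3, size=120)   # three illustrative classes

# One-vs-All: one binary classifier per class, as in the setup above.
svm = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)
lr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(svm.predict(X[:5]), lr.predict(X[:5]))
```
      </preformat>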
      <p>The experiments have been divided into different scenarios:
1. Baseline scenario: in this scenario, the classification was performed on the same setting as two previous works, taken as baselines for comparison.
2. Extended scenario: in this scenario, the classification was performed on new settings, using an extended version of two datasets from the baseline scenario.</p>
    </sec>
    <sec id="sec-3">
      <title>Data</title>
      <p>The experiments of this work have been applied to the datasets listed in Table 1, which also reports the number of instances for each dataset. These datasets have been selected because their annotations describe classes of argumentative evidence directly related to specific Argumentation Schemes. Importantly, during the experiments, all datasets have been split into train and test sets, following a standard 80/20 ratio.</p>
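      <p>The 80/20 split just mentioned can be sketched with scikit-learn's train_test_split; the sentences and labels below are illustrative placeholders rather than instances from the actual datasets.</p>
      <preformat>
```python
# Sketch of the standard 80/20 train/test split described above.
from sklearn.model_selection import train_test_split

sentences = [f"sentence {i}" for i in range(100)]  # placeholder data
labels = [i % 2 for i in range(100)]               # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(
    sentences, labels, test_size=0.2, random_state=42, stratify=labels
)
print(len(X_train), len(X_test))
```
      </preformat>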
      <p>
        Regarding the baseline scenario, D1 and D2 are portions of Al Khatib et al. 2016 and Aharoni et al. 2014 respectively, two important datasets designed by IBM. Only two classes from the original datasets have been selected, reproducing the scenario in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] in order to have baseline scenarios for our classifiers. D3 is a small dataset (only 638 sentences) from Liga and Palmirani 2019. It is a dataset with different levels of granularity, depending on how many classes are considered. In this case we selected granularity three, which contains three labels.
      </p>
      <p>
        Regarding the extended scenario, the dataset D1+ is an extension of D1: instead of extracting just two classes, it considers three classes. The inputs of the dataset from Al Khatib et al. 2016 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] are actually structured in a very fragmented way, so we needed to rebuild the sentences following the approach suggested in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Similarly, D2+ is an extension of D2 (instead of being a selection of just two classes, it considers three classes). Finally, D2++ is an extended version of the same dataset which, having many more instances, can be a useful benchmark for this kind of classification.
      </p>
      <p>
        Importantly, the datasets employed in this work are among the few available datasets containing instances of argumentative evidence that can be related to Argumentation Schemes. Namely, the dataset in Al Khatib et al. 2016 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] contains instances of argumentative evidence labelled as Study, Testimony and Anecdotal: these pieces of evidence support argumentative claims which refer to source-based opinions, meaning that they belong to different types of source-based arguments. One of the most famous examples of a source-based Argumentation Scheme is the well-known Argument from Expert Opinion; another famous scheme is the Argument from Witness Testimony (more details about this kind of scheme can be found in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]).
      </p>
      <p>
        The datasets in Aharoni et al. 2014 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and Rinott et al. 2015 [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] present similar source-based Argumentation Schemes (however, this time the labels are Study, Expert and Anecdotal). In this case, the cluster of argumentative evidence labelled with the class Expert is likely to be compatible with the evidence of an Argument from Expert Opinion scheme.
      </p>
      <p>
        The dataset in Liga and Palmirani 2019 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] instead offers only one class of evidence related to source-based arguments (Testimony), while another class relates to a cluster of evidence which can be linked to the Argument from Negative Consequences and the Slippery Slope Argument.
      </p>
      <p>These three datasets can thus be used to assess whether classifiers are able to discriminate between different clusters of argumentative evidence. Since these pieces of argumentative evidence are strictly related to specific clusters of Argumentation Schemes, the ability of classifiers to discriminate different clusters of argumentative evidence is, in our opinion, a crucial step towards Argumentation Scheme discrimination.</p>
    </sec>
    <sec id="sec-4">
      <title>Results for the Baseline Scenario</title>
      <p>
        The classifications in this section show that the proposed approach is able to outperform recent results in the Argumentation Mining literature. With this purpose, recent results on D1, D2 and D3 are reported [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ] and used as baselines for our classifiers.
      </p>
      <p>(Table 2: F1 scores of the SVM and LR classifiers on BERT Base, DistilBERT and RoBERTa embeddings.)</p>
      <p>In this paper, all F1 scores per class are calculated as the mean macro F1 scores, taken from each One-vs-All classification. All these scores are finally averaged and reported as the mean F1 (for each classifier, i.e. SVM and LR).</p>
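      <p>The scoring just described can be sketched as follows, binarizing each One-vs-All task and averaging the macro F1 scores with scikit-learn; the label arrays are illustrative, not scores from the actual experiments.</p>
      <preformat>
```python
# Sketch of the reported metric: macro F1 for each One-vs-All task,
# then the mean over tasks. The labels here are illustrative.
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2, 0])
y_pred = np.array([0, 1, 1, 1, 2, 0, 0, 1, 2, 0])

per_class_f1 = []
for cls in np.unique(y_true):
    # binarize: current class vs all the other classes
    t = (y_true == cls).astype(int)
    p = (y_pred == cls).astype(int)
    per_class_f1.append(f1_score(t, p, average="macro"))

mean_f1 = float(np.mean(per_class_f1))
print(round(mean_f1, 3))
```
      </preformat>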
      <p>As can be seen from Table 2, results outperform previous results for the same scenario, showing the ability of Transfer Learning techniques to achieve high performance. As indicated by the bold numbers in Table 4, for D1, D2 and D3 there are always at least four classifiers out of six which outperform the baseline.</p>
    </sec>
    <sec id="sec-5">
      <title>Results for the Extended Scenario</title>
      <p>The next series of experiments has been performed on extended versions of two datasets from the baseline scenario (D1 and D2), to assess how performance changes in a multiclass scenario.</p>
      <p>Regarding the classifications on D1+, one can see that the best performances are achieved by the Logistic Regression classifier (LR) trained on sentence embeddings extracted using DistilBERT. To have a better understanding of these results, the confusion matrix of the best classifier in this scenario (i.e., Logistic Regression on DistilBERT) is reported alongside the confusion matrix of the best classifier of the baseline scenario (i.e., Support Vector Machine on DistilBERT embeddings from Table 2) in Figure 1.</p>
      <p>(Figure 1 panels: D1, study vs testimony, SVM on DistilBERT; D1+, study vs others and testimony vs others, LR on DistilBERT.)</p>
      <p>Regarding the classifications on D2+ and D2++, one can see that the best performances are achieved by the Logistic Regression classifier (LR) trained on sentence embeddings extracted using DistilBERT and BERT Base. Also in this case, to have a better understanding of the results, the confusion matrices of the best classifiers in this scenario (i.e., Logistic Regression from DistilBERT embeddings and from BERT Base) are reported alongside the confusion matrix of the best classifier of the baseline scenario (i.e., Logistic Regression from RoBERTa embeddings from Table 2) in Figure 2.</p>
      <p>Notice that while the confusion matrices for D1 and D2 (in green) show a binary classification, the other confusion matrices in blue (relative to D1+, D2+ and D2++) show a one-vs-all classification. These blue matrices show that the classifiers are able to recognize classes also in a multiclass scenario. While Figure 1 shows an imbalance (probably due to the predominance of the class anecdotal), the results in Figure 2 seem more balanced: the diagonal always follows a 30/60 ratio, indicating the goodness of the predictions.</p>
    </sec>
    <sec id="sec-6">
      <title>Related works</title>
      <p>
        Unfortunately, datasets specifically designed to allow a direct link between classes and specific Argumentation Schemes are very few. A promising and growing resource, in this sense, is the set of corpora in AIFdb [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], thanks also to the contribution of tools like OVA+ [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], which recently added a very important component for Argumentation Scheme annotation called the Argument Scheme Key [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>(Figure 2 panels: D2, study vs expert, LR on RoBERTa; D2+ and D2++, study vs others and expert vs others, LR on DistilBERT and on BERT Base.)</p>
      <p>Moreover, although there have been different works on text classification in Argumentation Mining, only a few studies have focused on classification tasks aimed at facilitating the discrimination of Argumentation Schemes.</p>
      <p>
        Rinott et al. 2015 [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] achieved important results on evidence detection employing the dataset D2++. However, their approach is mostly context-dependent, while the present work does not consider the context. In Liga 2019 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], the classification has been performed using Tree Kernel classifiers on D1 and D2, which contain argumentative evidence of support, among which it is possible to find evidence directly related to the Argument from Expert Opinion. That work is, however, limited to a binary classification. A similar approach, in a multiclass scenario, is described in Liga and Palmirani 2019 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], where Tree Kernels are employed on D3, a small dataset which considers argumentative evidence of opposition, among which one can find, for example, the Slippery Slope Argument. Considering these two works as baselines, the approach presented in this paper seems capable of outperforming the previous achievements.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>The datasets analyzed in this work are composed of argumentative evidence directly related to different clusters of arguments. For example, many instances found in the datasets of this paper are directly related to the cluster of source-based arguments. Other instances of argumentative evidence are instead specifically related to the Argumentation Scheme from Expert Opinion, while others are related to the cluster which includes the Argument from Negative Consequences and the Slippery Slope Argument (which do not belong to the cluster of source-based arguments).</p>
      <p>
        We believe that the ability to discriminate different clusters of argumentative evidence is a crucial step in the classification of Argumentation Schemes. For example, the discrimination of clusters of Argumentation Schemes can be performed in a pipeline of binary classifications, starting from source-based versus non-source-based arguments and continuing towards more specific binary classifications (similarly to the path of dichotomous choices followed by ASK, the annotation system recently elaborated in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which offers a valuable system for the classification of Argumentation Schemes).
      </p>
      <p>In general, the results presented in this paper seem encouraging, showing that pre-trained embeddings can outperform previous results in the field of Argumentation Mining related to the classification of argumentative evidence. An interesting aspect is that the proposed classifiers show encouraging results not only in the discrimination among different kinds of source-based argumentative evidence, but also in classifications involving source-based versus non-source-based argumentative evidence (i.e. with dataset D3).</p>
      <p>
        However, further analysis is needed to verify if and how Transfer Learning techniques can discriminate argumentative evidence in such a way that they can facilitate Argumentation Scheme discrimination. In this regard, the present paper is just a preliminary exploration of a promising approach. In future works, other Transfer Learning techniques should be assessed too. For example, it could be useful to compare the performances of the two main Transfer Learning techniques: sentence embeddings and fine-tuning. Also, other pre-trained models should be employed and compared (e.g., XLNet [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and ALBERT [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]).
      </p>
      <p>A long-term goal is being able to connect natural language argumentative evidence to its specific Argumentation Schemes, which can be a further step in the development of an artificial Natural Argumentation Understanding.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Aharoni</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polnarov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lavee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hershcovich</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Rinott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Gutfreund</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Slonim</surname>
          </string-name>
          , N.:
          <article-title>A benchmark dataset for automatic detection of claims and evidence in the context of controversial topics</article-title>
          .
          <source>In: Proceedings of the First Workshop on Argumentation Mining</source>
          . pp.
          <volume>64</volume>
          –
          <issue>68</issue>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Al</given-names>
            <surname>Khatib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Wachsmuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Kiesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <surname>B.</surname>
          </string-name>
          :
          <article-title>A news editorial corpus for mining argumentation strategies</article-title>
          .
          <source>In: Proceedings of COLING</source>
          <year>2016</year>
          ,
          <source>the 26th International Conference on Computational Linguistics: Technical Papers</source>
          . pp.
          <volume>3433</volume>
          –
          <issue>3443</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          . arXiv preprint arXiv:
          <year>1810</year>
          .
          <volume>04805</volume>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Lan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goodman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gimpel</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharma</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soricut</surname>
          </string-name>
          , R.: Albert:
          <article-title>A lite bert for self-supervised learning of language representations</article-title>
          . arXiv preprint arXiv:
          <year>1909</year>
          .
          <volume>11942</volume>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lawrence</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reed</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Argument mining: A survey</article-title>
          .
          <source>Computational Linguistics</source>
          <volume>45</volume>
          (
          <issue>4</issue>
          ),
          <volume>765</volume>
          –
          <fpage>818</fpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Lawrence</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Visser</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reed</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>An online annotation assistant for argument schemes</article-title>
          .
          <source>In: Proceedings of the 13th Linguistic Annotation Workshop</source>
          . pp.
          <volume>100</volume>
          –
          <fpage>107</fpage>
          .
          <article-title>Association for Computational Linguistics (</article-title>
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Liga</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Argumentative evidences classification and argument scheme detection using tree kernels</article-title>
          .
          <source>In: Proceedings of the 6th Workshop on Argument Mining</source>
          . pp.
          <volume>92</volume>
          –
          <issue>97</issue>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Liga</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palmirani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Detecting "slippery slope" and other argumentative stances of opposition using tree kernels in monologic discourse</article-title>
          .
          <source>In: International Joint Conference on Rules and Reasoning</source>
          . pp.
          <volume>180</volume>
          –
          <fpage>189</fpage>
          . Springer (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ott</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoyanov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          . arXiv preprint arXiv:1907.11692 (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Macagno</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Walton</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reed</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Argumentation schemes. History, classifications, and computational applications</article-title>
          (December 23, 2017). pp.
          <fpage>2493</fpage>
          –
          <lpage>2556</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muller</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suarez</surname>
            ,
            <given-names>P.J.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dupont</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Romary</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de la Clergerie</surname>
            ,
            <given-names>E.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seddah</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sagot</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>CamemBERT: a tasty French language model</article-title>
          . arXiv preprint arXiv:1911.03894 (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Niven</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kao</surname>
            ,
            <given-names>H.Y.</given-names>
          </string-name>
          :
          <article-title>Probing neural network comprehension of natural language arguments</article-title>
          .
          <source>In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <fpage>4658</fpage>
          –
          <lpage>4664</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Janier</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lawrence</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reed</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>OVA+: An argument analysis interface</article-title>
          .
          <source>In: Computational Models of Argument: Proceedings of COMMA</source>
          . vol.
          <volume>266</volume>
          , p.
          <fpage>463</fpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Reimers</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schiller</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beck</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daxenberger</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stab</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurevych</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Classification and clustering of arguments with contextualized word embeddings</article-title>
          .
          <source>In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <fpage>567</fpage>
          –
          <lpage>578</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Rinott</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dankin</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alzate</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khapra</surname>
            ,
            <given-names>M.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aharoni</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Slonim</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Show me your evidence - an automatic method for context dependent evidence detection</article-title>
          .
          <source>In: Proceedings of the 2015 conference on empirical methods in natural language processing</source>
          . pp.
          <fpage>440</fpage>
          –
          <lpage>450</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Sanh</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Debut</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaumond</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wolf</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</article-title>
          . arXiv preprint arXiv:1910.01108 (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Walton</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reed</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Macagno</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Argumentation schemes</article-title>
          . Cambridge University Press (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carbonell</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
            ,
            <given-names>R.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.V.</given-names>
          </string-name>
          :
          <article-title>XLNet: Generalized autoregressive pretraining for language understanding</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>5754</fpage>
          –
          <lpage>5764</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>