MT-GAN-BERT: Multi-Task and Generative Adversarial Learning for sustainable Language Processing

Claudia Breazzano, Danilo Croce, and Roberto Basili
Department of Enterprise Engineering, University of Roma, Tor Vergata
{croce,basili}@info.uniroma2.it

Abstract. In this paper, we present MT-GAN-BERT, i.e., a BERT-based architecture for faceted classification tasks. It aims to reduce the requirements of Transformers both in terms of the amount of annotated data and the computational cost required at classification time. First, MT-GAN-BERT enables semi-supervised learning in BERT-based architectures based on Generative Adversarial Learning. Second, it implements a Multi-task Learning approach to solve multiple tasks simultaneously. A single BERT-based model is used to encode the input examples, while multiple linear layers implement the classification steps, with a significant reduction of the computational costs. Experimental evaluations on six classification tasks involved in detecting abusive language in Italian suggest that MT-GAN-BERT represents a sustainable solution that generally improves on the raw adoption of multiple BERT-based models, with lighter requirements in terms of annotated data and computational costs.

Keywords: Sustainable NLP · BERT · Semi-Supervised Learning · Generative Adversarial Learning · Multi-task Learning

1 Introduction

In recent years, Deep Learning methods have become very popular in Natural Language Processing (NLP): they reach high performances by relying on very simple input representations (for example, in [10, 7, 11]). In particular, Transformer-based architectures, e.g., BERT [4], provide representations of their inputs as the result of a pre-training stage: they are trained over large-scale corpora and then effectively fine-tuned on a targeted task, achieving state-of-the-art results in different and heterogeneous NLP tasks. However, several critical aspects tend to limit the impact of such Transformer-based architectures on sustainable real-world applications. First of all, they have generally been shown to achieve state-of-the-art results when trained on very large-scale datasets, but significant performance drops have been observed when annotated material of limited size is adopted [3]. Unfortunately, obtaining annotated data is a time-consuming and costly process. In addition, Transformer-based solutions are characterized by complex architectures with a large number of parameters and therefore have an onerous computational cost [24]. Several works proposed solutions devoted to the reduction of such computational complexity [23, 27, 25]. However, whenever the problem at hand requires decomposing the decision process into a (possibly large) set of decision steps, the overall computational cost is likely to grow rapidly. In fact, the cost of the entire workflow generally increases as the sum of the (millions of) parameters of the individual architectures. Let us consider the adoption of Language Technologies against Offensive Language on the Web and Social Networks. Offensive language (also called "abusive language") refers to any insult or vulgarity that demeans a target [26, 17]. The NLP community has worked on methods to mitigate this phenomenon by developing technologies to automatically detect abuse in texts.
However, some of these methods largely focused on a limited definition of abuse, i.e., detecting hateful comments against only certain communities, such as comments referring to ethnic minorities, while marginalizing other types of communities, e.g., hateful comments towards women [20]. In fact, the notion of abuse has proved elusive and difficult to formalize. Different norms across communities can influence what is considered abusive. In the context of natural language, abuse is a term that encompasses many different types of fine-grained negative expressions. For example, in [18] it is used to collectively refer to hate speech, derogatory language and insults, while others [16] use it to discuss racism and sexism. The definitions of the different types of abuse tend to be overlapping and ambiguous. For this reason, abusive behavior is considered a problem with many "faces", as it involves cases of hate speech, offensive language, sexism and racism, aggression, cyberbullying, harassment and trolling. Each form of abusive behavior has its own characteristics and manifests itself differently [6]. As a result, several datasets exist [26], but they are focused on specific aspects of abusive language. We speculate here that a solution consisting of several classifiers (each specialized on a dataset) is not fully sustainable, especially when large amounts of data must be classified and the cost of running multiple architectures becomes prohibitive. Furthermore, we hypothesize that training each classifier separately on a different dataset might lead to sub-optimal quality compared to a classifier trained on a dataset where each instance is labeled with respect to every phenomenon of interest. Unfortunately, accessing datasets where individual instances are labeled for all the different aspects of abusive language is not always possible.

In this paper, we propose a methodology to handle multi-faceted problems, in this case language abuse recognition, while keeping the final solution sustainable in terms of both: i) the amount of annotated data required to train the final model and ii) the computational cost required at classification time. In order to address issue i), we propose the adoption of semi-supervised methods, as in [28, 2, 30, 12], to improve the generalization capability when few annotated examples are available while the acquisition of unlabeled sources is possible. In particular, we will adopt GAN-BERT [3], a recently proposed method that enables semi-supervised learning in BERT-based architectures based on Generative Adversarial Learning [8]. Moreover, we will mitigate issue ii) by adopting the Multi-task learning approach proposed in [13], a specific formulation of BERT-based architectures that solves multiple tasks simultaneously. Instead of using a different BERT architecture for each task (each composed of hundreds of millions of parameters), a single BERT model is used to encode the input examples, while multiple classifiers (each composed of a negligible number of parameters) implement the classification steps. This significantly reduces the overall cost and, in addition, allows the final architecture to be trained on disjoint datasets. Finally, we introduce the combination of both of the above approaches in MT-GAN-BERT, a new architecture that extends BERT-based models with semi-supervised learning while using a single encoder when applied to multiple tasks^1.
Experimental evaluations on six classification tasks involved in detecting abusive language in Italian suggest: the beneficial impact of GAN-BERT when trained on a reduced labeled dataset (e.g., 200 labeled vs. thousands of unlabeled examples); the high accuracy of a unified Multi-task model that achieves results comparable to those of multiple models trained on disjoint datasets; and the reduced requirements posed on the size of annotated data and the computational costs implied by MT-GAN-BERT, which thus represents a sustainable solution with respect to the raw adoption of multiple BERT-based models. In the rest of this paper, Section 2 discusses the adopted architectures and presents MT-GAN-BERT. Section 3 reports the experimental evaluation, while Section 4 derives the conclusions.

2 Multi-task and Generative Adversarial Learning in MT-GAN-BERT

Multi-task learning in Transformer-based architectures. Multi-task learning (MTL) is a paradigm in which multiple (related) tasks are learned jointly, so that the knowledge learned in one task can support the other tasks [31]. Hard parameter sharing is the most commonly used approach to MTL with neural networks: it is generally applied by sharing hidden layers between all tasks, while maintaining different task-specific output layers. Sharing hard parameters greatly reduces the risk of over-fitting. Recently, there is a growing interest in applying MTL to representation learning using deep neural networks (DNNs) for two reasons. First, supervised learning of DNNs requires large amounts of task-specific labeled data, which is not always available; MTL provides an effective way of leveraging supervised data from many related tasks. Second, multi-task learning profits from a regularization effect by alleviating over-fitting to a specific task, thus making the learned representations universal across tasks. [13] proposed the Multi-Task Deep Neural Network (MT-DNN), which incorporates a single pre-trained BERT model [4] applied at the same time to several NLU tasks involving single-sentence classification, pairwise text classification, text similarity scoring, and relevance ranking.

[Fig. 1. MT-DNN architecture: a shared BERT encoder feeds T task-specific output layers D_1, ..., D_T, each classifying over k_t classes and trained on its own dataset E_t.]

The architecture of the MT-DNN model is shown in Figure 1; we adhere to the approach proposed in [13] in a scenario involving only classification tasks. A BERT-based encoder represents the shared layers across all T tasks, while the output layers D_1, ..., D_T implement the specific classification tasks. For each input example (either a sentence or a pair of sentences packed together) composed of n word-pieces, BERT captures the contextual information for each word via self-attention, generating a sequence of contextual embeddings: these are n + 2 vector representations in R^d, i.e., (h_CLS, h_w1, ..., h_wn, h_SEP). As suggested in [4], h_CLS corresponds to the d-dimensional representation of the entire input sequence, while h_w1, ..., h_wn represent the d-dimensional embeddings of the individual word-pieces. As we are interested in sentence-based classification tasks, only h_CLS is retained^2 and it is given as input to the layer D_t to classify the input sentence w.r.t. the task t = 1, ..., T.

^1 MT-GAN-BERT is publicly available at: https://github.com/crux82/mt-ganbert
^2 The remaining h_wk embeddings can be used for other tasks, such as sequence labeling tasks, not considered in this work.
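For classification-only tasks, this shared-encoder design reduces, in code, to a single Transformer with one lightweight linear head per task. The following is a minimal PyTorch sketch under that reading; the class name, the use of the HuggingFace AutoModel API and the omission of any training logic are our own assumptions and not the authors' released implementation.

```python
import torch.nn as nn
from transformers import AutoModel

class MultiTaskClassifier(nn.Module):
    """A single shared BERT encoder with one small linear output layer D_t per task."""
    def __init__(self, model_name, num_classes_per_task):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)  # shared layers across all T tasks
        hidden = self.encoder.config.hidden_size              # d, e.g. 768
        # One D_t per task: a negligible number of parameters w.r.t. the encoder.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, k_t) for k_t in num_classes_per_task]
        )

    def forward(self, input_ids, attention_mask, task_id):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        h_cls = out.last_hidden_state[:, 0]   # h_CLS: representation of the whole input sequence
        return self.heads[task_id](h_cls)     # logits over the k_t classes of task task_id
```

For the six Italian tasks considered in Section 3, such a model could be instantiated, e.g., as MultiTaskClassifier("Musixmatch/umberto-wikipedia-uncased-v1", [2, 2, 3, 2, 2, 3]), i.e., one head per dataset of Table 1.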
The training procedure of MT-DNN is reported in Algorithm 1. Input examples generally belong to datasets E_1, ..., E_T that are specific for each task and do not share the same labels. As a consequence, MT-DNN requires that each dataset is split into mini-batches B_t^j, each containing valid examples for the same task t. In each epoch, a random mini-batch B_t^j is selected, all of its examples are encoded using the same BERT, and the generated h_CLS^{B_t} are classified by D_t. This allows estimating a loss L_t that is task-specific but is used to update the entire model via back-propagation. In this way, the output layer D_t is fine-tuned with respect to the t-th task but, most importantly, the BERT encodings are at the same time optimized across all tasks.

Algorithm 1 Training of an MT-DNN model Θ
1: Load the BERT parameters acquired during the pre-training stage as in [4]
2: Initialize D_1, ..., D_T randomly
3: for t in 1, ..., T do   // Prepare the data for the T tasks
4:   Divide the data of the t-th task into mini-batches so that E_t = ∪_j B_t^j
5: end for
6: for epoch in 1, ..., epoch_max do
7:   Merge the datasets: E = E_1 ∪ ... ∪ E_T
8:   Shuffle E
9:   for B_t in E do   // B_t is a mini-batch of task t
10:    1. Use the shared BERT to encode h_CLS^{B_t}
11:    2. Classify h_CLS^{B_t} using D_t against the k_t classes
12:    3. Compute the loss L_t as the cross-entropy w.r.t. the k_t classes
13:    4. Compute the gradient ∇(Θ) using L_t
14:    5. Update the entire model: Θ = Θ − ν∇(Θ)
15:  end for
16: end for

In addition to the benefits associated with regularization and the reduction in over-fitting discussed in [13], MT-DNN shows a significant reduction of the computational costs at classification time. In fact, each example is encoded only once by BERT (which is composed of hundreds of millions of parameters [4]) and then classified by each classifier D_t, which is significantly smaller and composed of about one thousand parameters for each of the k_t classes. Moreover, whenever the tasks are related to each other, as in Sentiment Classification or Hate Speech Detection, the multi-task training procedure is also expected to improve the final classification accuracy.
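As an illustration only, Algorithm 1 can be sketched in PyTorch on top of a shared-encoder model like the one above. The AdamW optimizer here stands in for the generic gradient update Θ = Θ − ν∇(Θ), and task_loaders is a hypothetical list of per-task DataLoaders yielding (input_ids, attention_mask, labels) mini-batches; neither is prescribed by the paper.

```python
import random
import torch
import torch.nn.functional as F

def train_mtdnn(model, task_loaders, epochs=10, lr=5e-5, device="cuda"):
    """Sketch of Algorithm 1: every mini-batch contains examples of a single task t."""
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        # E = E_1 ∪ ... ∪ E_T: collect the per-task mini-batches and shuffle them.
        batches = [(t, batch) for t, loader in enumerate(task_loaders) for batch in loader]
        random.shuffle(batches)
        for t, (input_ids, attention_mask, labels) in batches:
            logits = model(input_ids.to(device), attention_mask.to(device), task_id=t)
            loss = F.cross_entropy(logits, labels.to(device))  # task-specific loss L_t
            optimizer.zero_grad()
            loss.backward()   # updates D_t and, crucially, the shared BERT encoder
            optimizer.step()
```

The default hyper-parameters mirror the setup later reported in Section 3 for MT-DNN (10 epochs, learning rate 5·10^-5).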
GAN-BERT and Semi-Supervised Learning. Recent Transformer-based architectures, e.g., BERT, provide impressive results in many Natural Language Processing tasks. However, most of the adopted benchmarks are made of (sometimes hundreds of) thousands of examples, and in many real scenarios obtaining high-quality annotated data is expensive and time consuming; in contrast, unlabeled examples characterizing the target task can, in general, be easily collected. GAN-BERT [3] enables semi-supervised learning in BERT-based architectures by implementing a Semi-Supervised Generative Adversarial Learning technique. In general, SS-GANs [21] enable semi-supervised learning in a GAN framework. A discriminator is trained over a (k+1)-class objective: "true" examples are classified into one of the target (1, ..., k) classes, while generated samples are classified into the k+1 class. More formally, let D and G denote the discriminator and the generator, and let p_d and p_G denote the real data distribution and the generated examples, respectively. In order to train a semi-supervised k-class classifier, the objective of D is extended as follows. Let p_m(ŷ = y | x, y = k+1) be the probability provided by the model m that a generic example x is associated with the fake class, and p_m(ŷ = y | x, y ∈ (1, ..., k)) the probability that x is considered real, thus belonging to one of the target classes. The loss function of D is L_D = L_{D_sup} + L_{D_unsup}, where:

\[ \mathcal{L}_{D_{sup}} = -\mathbb{E}_{x,y \sim p_d} \log \left[ p_m(\hat{y}=y \mid x,\, y \in (1,\dots,k)) \right] \]
\[ \mathcal{L}_{D_{unsup}} = -\mathbb{E}_{x \sim p_d} \log \left[ 1 - p_m(\hat{y}=y \mid x,\, y=k+1) \right] - \mathbb{E}_{x \sim G} \log \left[ p_m(\hat{y}=y \mid x,\, y=k+1) \right] \]

L_{D_sup} measures the error in assigning the wrong class to a real example among the original k categories. L_{D_unsup} measures the error in incorrectly recognizing a real (unlabeled) example as fake and in not recognizing a fake example. At the same time, G is expected to generate examples that are similar to the ones sampled from the real distribution p_d. As suggested in [21], G should generate data approximating the statistics of the real data as much as possible; in other words, the average example generated in a batch by G should be similar to the real prototypical one. Formally, let f(x) denote the activation of an intermediate layer of D. The feature matching loss of G is then defined as:

\[ \mathcal{L}_{G_{feature\ matching}} = \left\| \mathbb{E}_{x \sim p_d} f(x) - \mathbb{E}_{x \sim G} f(x) \right\|_2^2 \]

that is, the generator should produce examples whose intermediate representations, given in input to D, are very similar to the real ones. The G loss also considers the error induced by fake examples correctly identified by D, i.e.,

\[ \mathcal{L}_{G_{unsup}} = -\mathbb{E}_{x \sim G} \log \left[ 1 - p_m(\hat{y}=y \mid x,\, y=k+1) \right] \]

The G loss is L_G = L_{G_feature matching} + L_{G_unsup}.

GAN-BERT [3] is based on the already pre-trained BERT model and adapts the fine-tuning by adding two components: i) task-specific layers, as in the usual BERT fine-tuning; ii) SS-GAN layers to enable semi-supervised learning. Without loss of generality, let us assume we are facing a sentence classification task over k categories. As in the previous MT-DNN architecture, given an input text, we select the h_CLS representation as the sentence embedding for the target task. As shown in Figure 2, the SS-GAN architecture introduces two components on top of BERT: i) a discriminator D for classifying examples, and ii) a generator G acting adversarially.

[Fig. 2. GAN-BERT architecture: BERT encodes labeled (E) and unlabeled (U) real data into h_CLS; the generator G maps noise to fake representations; the discriminator D assigns inputs either to one of the k target classes or to the additional real/fake category.]

In particular, G is a Multi-Layer Perceptron (MLP) that takes in input a 100-dimensional noise vector drawn from N(µ, σ^2) and produces in output a vector h_fake of the same dimension as h_CLS. The discriminator is another MLP that receives in input a vector h*: this can be either h_fake, produced by the generator, or h_CLS, for unlabeled or labeled examples from the real distribution. The last layer of D is a softmax-activated layer over the k+1 categories. During the forward step, when real instances are sampled (i.e., h* = h_CLS), D should classify them into one of the k categories; when h* = h_fake, it should classify each example into the k+1 category. The training process of GAN-BERT tries to optimize two competing losses, i.e., L_D and L_G. During back-propagation, the unlabeled examples contribute only to L_{D_unsup}, i.e., they are considered in the loss computation only if they are erroneously classified into the k+1 category; in all other cases, their contribution to the loss is masked out. The labeled examples thus contribute to the supervised loss L_{D_sup}. Finally, the examples generated by G contribute to both L_D and L_G, i.e., D is penalized when it does not spot the examples generated by G and vice-versa. When updating D, the BERT weights are changed in order to fine-tune its inner representations, so accounting for both labeled and unlabeled data. After training, G is discarded while the rest of the original BERT model is retained for inference; this means that there is no additional cost at inference time with respect to a standard BERT model.
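To make the interplay of these losses concrete, the following is a minimal PyTorch sketch of how they can be computed from the discriminator's (k+1)-way logits. It follows the equations above rather than the released GAN-BERT code; the conventions that unlabeled examples carry the label -1 and that index k is the fake class are our own assumptions.

```python
import torch
import torch.nn.functional as F

def gan_bert_losses(logits_real, logits_fake, labels, feat_real, feat_fake, k, eps=1e-8):
    """logits_*: (batch, k+1) discriminator outputs; labels: class in [0, k-1], or -1 if unlabeled."""
    p_fake_on_real = F.softmax(logits_real, dim=-1)[:, k]   # p_m(y = k+1 | real x)
    p_fake_on_fake = F.softmax(logits_fake, dim=-1)[:, k]   # p_m(y = k+1 | generated x)

    # L_D_sup: cross-entropy on the labeled real examples only (unlabeled ones are masked out).
    labeled = labels >= 0
    loss_d_sup = (F.cross_entropy(logits_real[labeled], labels[labeled])
                  if labeled.any() else logits_real.new_zeros(()))

    # L_D_unsup: real examples should not be judged fake, generated ones should.
    loss_d_unsup = (-torch.log(1.0 - p_fake_on_real + eps).mean()
                    - torch.log(p_fake_on_fake + eps).mean())

    # L_G: feature matching on an intermediate layer of D, plus fooling D on the fakes.
    loss_g_fm = torch.norm(feat_real.mean(dim=0) - feat_fake.mean(dim=0), p=2) ** 2
    loss_g_unsup = -torch.log(1.0 - p_fake_on_fake + eps).mean()

    return loss_d_sup + loss_d_unsup, loss_g_fm + loss_g_unsup
```

In practice, L_D and L_G are minimized by two optimizers acting on disjoint parameter sets: the discriminator plus BERT for L_D, and the generator for L_G.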
MT-GAN-BERT: Combining Multi-task and Adversarial Learning. In order to take advantage of both Multi-task learning and Adversarial learning, and to reduce the computational cost while using few labeled data, this paper proposes the MT-GAN-BERT architecture. MT-GAN-BERT combines GAN-BERT and MT-DNN by relying on a shared Transformer, i.e., BERT, and applying as many Generators and Discriminators as the number of targeted tasks. As shown in Figure 3, BERT represents the shared layers across all tasks, as suggested by MT-DNN, and takes labeled and unlabeled data as input, as proposed in GAN-BERT. In this case, no single overall Discriminator and Generator are foreseen; instead, for each task t to be (simultaneously) solved, BERT is extended with: i) a Discriminator D_t for classifying examples, and ii) a Generator G_t acting adversarially.

[Fig. 3. MT-GAN-BERT architecture: a shared BERT encoder receives labeled (E_1, ..., E_T) and unlabeled (U_1, ..., U_T) data; for each task t, a Generator G_t maps noise to fake representations and a Discriminator D_t classifies over the k_t classes plus the real/fake category.]

During the forward step, a mini-batch B_t belonging to a task t is randomly selected. Each sentence of the selected batch is given as input to BERT, which outputs the vector h_CLS (for unlabeled or labeled examples from the real distribution). This vector is given as input to the discriminator D_t of the t-th task. Each discriminator D_t is an MLP and, in addition to the vectors produced by BERT over the training sentences, it also separately receives in input the vector h_fake^t produced by the generator G_t of the t-th task. Each Generator is also a Multi-Layer Perceptron (MLP) that behaves in the same way as described above: it takes as input a 100-dimensional noise vector drawn from N(µ, σ^2) and outputs a vector h_fake^t ∈ R^d. The last layer of D_t is a softmax-activated layer and, when real instances are sampled (i.e., h* = h_CLS), D_t should classify them into one of the k_t categories specific to the t-th task; when h* = h_fake^t, it should classify each example into the "fake" k_t+1 category. The losses of D_t and G_t are computed as in GAN-BERT, and back-propagation applies to the MLPs as well as to the underlying pre-trained BERT model, which is also modified. Changing the BERT weights during the training batch B_t of a particular task t specializes BERT on that task; by cycling and alternating forward and back-propagation steps over the other tasks, BERT is asked to generalize across all tasks and learn from all of them. Moreover, the individual generators G_t, which produce task-specific fake examples, further improve the learning of the corresponding discriminators D_t, even when few labeled data are used. MT-GAN-BERT correspondingly improves the sustainability of the overall learning approach.
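A single per-task update can be sketched as follows. This is an illustration under stated assumptions only: it reuses the hypothetical gan_bert_losses helper from the previous sketch, assumes generator/discriminator modules that return both the k_t+1 logits and the intermediate features used for feature matching, and assumes labels set to -1 for unlabeled examples; it is not the released implementation.

```python
import torch

def mt_gan_bert_step(encoder, generators, discriminators, batch, t,
                     opt_dis, opt_gen, noise_dim=100, device="cuda"):
    """One adversarial update on a mini-batch of task t with a shared BERT encoder."""
    input_ids, attention_mask, labels = (x.to(device) for x in batch)  # labels: -1 if unlabeled
    h_cls = encoder(input_ids=input_ids,
                    attention_mask=attention_mask).last_hidden_state[:, 0]

    # Task-specific generator G_t: noise -> fake sentence representation h_fake^t.
    noise = torch.randn(h_cls.size(0), noise_dim, device=device)
    h_fake = generators[t](noise)

    # Task-specific discriminator D_t scores real and fake vectors over k_t + 1 classes.
    logits_real, feat_real = discriminators[t](h_cls)
    logits_fake, feat_fake = discriminators[t](h_fake)

    k_t = logits_real.size(-1) - 1
    loss_d, loss_g = gan_bert_losses(logits_real, logits_fake, labels,
                                     feat_real, feat_fake, k_t)

    # opt_gen covers only G_t; opt_dis covers D_t and the shared encoder, so the
    # loss of task t also fine-tunes the BERT weights used by all the other tasks.
    opt_gen.zero_grad(); opt_dis.zero_grad()
    loss_g.backward(retain_graph=True)
    loss_d.backward()
    opt_gen.step(); opt_dis.step()
```

Cycling such steps over mini-batches drawn from the different tasks is what alternates the specialization and generalization of the shared encoder described above.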
3 Experimental evaluation

In this section, we assess the impact of the MT-DNN model, the GAN-BERT model and MT-GAN-BERT over different sentence classification tasks characterized by different training conditions, i.e., number of examples and number of categories. In particular, the objectives of this experimentation are three-fold. First, we aim at demonstrating that the existing MT-DNN model allows sharing a single Transformer (BERT) trained on multiple classification tasks at the same time, while preserving or improving performances with respect to a model trained specifically on one task at a time (a standard BERT model). Second, we show that GAN-BERT, trained over few annotated data, supports more accurate classification than the standard BERT model. In particular, the results of the BERT and GAN-BERT models are compared on task-specific training data of different sizes: 100, 200 and 500 annotated examples. Finally, we show that MT-GAN-BERT further improves performance compared to applying as many BERT models as there are tasks involved.

Abusive language detection is a complicated task due to the multi-faceted nature of its target [26]. In fact, detecting abusive language involves knowledge about specific and heterogeneous bad linguistic behaviors manifested in Social Web data. For this reason, experiments across different tasks, each involving a specific form of abuse, are useful to assess the impact of the MT-GAN-BERT paradigm on the overall phenomenon. We report measures of our approach over the following tasks. First, we considered Hate Speech Recognition over two datasets, HaSpeeDe and DANKMEMES. HaSpeeDe, proposed in the 2018 EVALITA Competition, is a corpus that includes Twitter posts in Italian ([19], [22]) that do or do not express hate. The hateful tweets are mainly addressed to minorities and social groups that are potential targets of hate speech in Italy, such as immigrants, Muslims and Roma. DANKMEMES (multimoDal Artefacts recogNition Knowledge for MEMES) [15] is the first EVALITA task for the recognition of MEMEs and the identification of hateful events or hate speech in them. A MEME is a multi-modal artifact, manipulated by users, which combines textual and visual elements to convey a message. The DANKMEMES task foresees three subtasks, which involve both images and sentences; in this work, however, we focus only on the dataset used within the Hate Speech Detection subtask, considering only the textual parts. Again, individual instances are annotated to discriminate sentences that are hateful from those that are not. Second, we considered Misogyny Identification, experimenting over the dataset of the Automatic Misogyny Identification (AMI) task of the 2018 EVALITA competition [5]. AMI consists of two subtasks; in this work, the tweets from the two subtasks are used to create two different datasets, AMI subtask A and AMI subtask B. In the first subtask, tweets are classified as misogynous or not; in the second, misogynous tweets are further classified into specific categories: "stereotype", tweets that express a widely diffused but fixed and simplified image of a woman; "sexual harassment", tweets that contain sexual advances, but also the intent to physically assert power over women through threats of violence; and "discredit", tweets that speak badly of women with no other specific focus. Finally, we considered Sentiment Analysis over the dataset of the SENTIment POLarity Classification (SENTIPOLC) task of the 2016 EVALITA competition [1], whose goal is sentiment analysis (SA).
Although SENTIPOLC is divided into three subtasks, this work only focuses on the first two: the subjectivity classification task and the polarity recognition task, respectively. In this way, we obtained two independent datasets, SENTIPOLC subtask 1 and SENTIPOLC subtask 2. In the first, binary, subtask, tweets are classified as subjective or objective; in subtask 2, tweets are classified as positive, negative or neutral. The adopted datasets are mostly made up of Twitter posts. Opinions and subjective positions are thus mainly expressed in an immediate and direct style: a post consists of a few words, exploiting at most 280 characters. In the case of the DANKMEMES dataset, the sentences have the typical structure of MEMEs and express concepts in a very direct way, which are sometimes fully understandable only by observing the accompanying image in addition to reading the text. For each task, performances are reported through two metrics: Accuracy and Macro F-measure, i.e., the F1 score (the harmonic mean of Precision and Recall) averaged over the classes. As a comparison, we report the performances of the basic BERT model independently fine-tuned on the available training material of each task.

Table 1. List of the datasets considered in the evaluation. For each dataset, the available classes and the corresponding number of examples per class are reported.

Task | Classes | #examples per class
HaSpeeDe | hate, not hate | 972, 2028
AMI A | misogynous, not misogynous | 1828, 2172
AMI B | stereotype, sexual harassment, discredit | 668, 431, 634
DANKMEMES | hate, not hate | 395, 405
SENTIPOLC 1 | subj, obj | 5098, 2312
SENTIPOLC 2 | positive, negative, neutral | 1611, 2543, 2816

Experimental Setup. The MT-DNN and GAN-BERT implementations are based on the code made available^3 in support of [13] and [3], respectively. MT-GAN-BERT combines the above models and is entirely written in PyTorch, based on the HuggingFace framework [29]. All the models are based on BERT and, in particular, on UmBERTo^4, a BERT model for the Italian language based on RoBERTa [14] and trained on large Italian corpora. While GAN-BERT is trained individually on each task, MT-DNN and MT-GAN-BERT are trained on all tasks simultaneously. In the MT-DNN model, the last layers, i.e., those specific to the individual tasks, are single-level linear classifiers. Training is carried out for 10 epochs, with a batch size of 16 and a learning rate of 5·10^-5; the adopted loss function is the Cross-Entropy Loss. For the GAN-BERT and MT-GAN-BERT models, the Generator components are implemented as MLPs, with one hidden layer activated by a GELU [9] function and dropout set to 0.1 after the hidden layer. Generator inputs consist of noise vectors drawn from a normal distribution N(0, 1): they pass through the MLP and finally result in 768-dimensional vectors that are used as fake examples. The Discriminator components are also MLPs, with a final softmax layer for the prediction. In the training phase of GAN-BERT and MT-GAN-BERT, the chosen batch size is 64 and the loss function is again the Cross-Entropy Loss. The GAN-BERT model is compared with the basic BERT model: 25 epochs are used to carry out the training and the adopted learning rate is 10^-5. In MT-GAN-BERT, the adopted loss functions are the losses of the discriminator D_t and of the generator G_t of each task t.

^3 The original code repositories are available at https://github.com/namisan/mt-dnn and https://github.com/crux82/ganbert
^4 https://huggingface.co/Musixmatch/umberto-wikipedia-uncased-v1
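Under this setup, the per-task generator and discriminator components can be sketched as the following PyTorch modules. The stated choices (100-dimensional N(0,1) noise, one GELU-activated hidden layer, dropout 0.1, 768-dimensional outputs, k_t+1 output units) come from the description above, while the hidden-layer width of the discriminator and the habit of returning its intermediate features for the feature-matching loss are our own assumptions.

```python
import torch.nn as nn

class TaskGenerator(nn.Module):
    """Maps 100-dim N(0,1) noise to a 768-dim fake sentence representation."""
    def __init__(self, noise_dim=100, hidden_dim=768, out_dim=768, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, hidden_dim),
            nn.GELU(),                       # hidden layer activated by GELU
            nn.Dropout(dropout),             # dropout 0.1 after the hidden layer
            nn.Linear(hidden_dim, out_dim),  # fake example in the h_CLS space
        )

    def forward(self, noise):
        return self.net(noise)

class TaskDiscriminator(nn.Module):
    """Classifies a 768-dim vector (real h_CLS or fake) over k_t real classes + 1 fake class."""
    def __init__(self, in_dim=768, hidden_dim=768, k_t=2, dropout=0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.GELU(), nn.Dropout(dropout))
        self.out = nn.Linear(hidden_dim, k_t + 1)  # softmax is applied inside the loss

    def forward(self, h):
        features = self.body(h)   # intermediate activations f(x) used for feature matching
        return self.out(features), features
```

A noise batch is then drawn with torch.randn(batch_size, 100), matching the N(0, 1) sampling described above.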
To overcome the scarcity of data in some datasets, the models that apply multi-task learning adopt a balancing technique: the examples of the smaller training datasets are replicated until the number of samples of the largest training dataset is reached. During training, the best model is selected as the one that maximizes, on the validation set, the average of the per-task Accuracy (or Macro F-Measure). The best model is then applied to the test set to compute the reported Accuracy and Macro F-Measure. In order to obtain stable results and counter the performance variability across runs caused by the small size of some datasets, multiple executions (in particular 3) were carried out for each model: the average of the resulting measures is reported.

BERT-based model vs MT-DNN. This section compares the results obtained with the MT-DNN model against those obtained with the BERT-based model. The MT-DNN model is trained over all the tasks simultaneously, while the BERT-based model is trained individually on each task. In particular, Table 2 shows the results of the BERT-based model in terms of Macro F-measure (second column) and Accuracy (third column), while the results of the Multi-task model (MT-DNN) are reported in the fourth and fifth columns. Finally, the last two columns report the absolute differences between MT-DNN and the original BERT. Considering the computational costs of applying both solutions to unseen data, MT-DNN allows reducing the number of parameters by about 80%: MT-DNN uses a single encoder (made of 125 million parameters, as it is based on RoBERTa), while the baseline adopts 6 encoders, one per task^5. The results show that a monolithic architecture trained on multiple tasks maintains the same performance as models trained individually on the same tasks. In particular, it can be noted how the AMI B task benefits from the data of the other tasks. In contrast, the polarity dataset (SENTIPOLC 2) loses some performance, probably because it is one of the two largest datasets among the tasks and benefits less from the other tasks.

^5 The number of parameters of D is negligible compared to the encoder.

Table 2. Results of the BERT-based model vs MT-DNN in Macro F-measure and Accuracy.

Task | BERT MF1 | BERT ACC | MT-DNN MF1 | MT-DNN ACC | Diff. MF1 | Diff. ACC
HaSpeeDe | 77.79% | 80.13% | 77.73% | 80.67% | -0.06 | +0.53
AMI A | 83.70% | 83.87% | 84.70% | 84.93% | +1.01 | +1.07
AMI B | 80.41% | 80.64% | 84.97% | 85.13% | +4.56 | +4.48
DANKMEMES | 72.96% | 73.17% | 74.77% | 74.83% | +1.81 | +1.67
SENTIPOLC 1 | 73.35% | 73.03% | 72.04% | 72.55% | -1.30 | -0.48
SENTIPOLC 2 | 63.85% | 67.84% | 59.91% | 64.34% | -3.94 | -3.50

BERT-based model vs GAN-BERT. This section compares the results obtained with the BERT-based model against those obtained with the GAN-BERT model. Three sets of results are shown in Table 3, as the training procedure was applied to labeled datasets of increasing size, i.e., 100, 200 and 500 labeled examples, respectively. The results show that GAN-BERT obtains better performances than the BERT-based model with 100 and 200 labeled examples, while with 500 examples the performances on some tasks are stable compared to those of the BERT-based models.
This shows that GAN-BERT improves generalization when little labeled data is available. The differences are larger when more unlabeled data is available and thus contributes to training: the more unlabeled data, the more GAN-BERT benefits from adversarial learning.

Table 3. Results of the BERT-based model vs GAN-BERT, with 100, 200 and 500 labeled examples.

With 100 labeled examples:
Task | BERT_100 MF1 | BERT_100 ACC | GAN-BERT_100 MF1 | GAN-BERT_100 ACC | Diff. MF1 | Diff. ACC
HaSpeeDe | 56.20% | 67.03% | 61.77% | 68.17% | +5.57 | +1.13
AMI A | 66.03% | 65.20% | 69.94% | 71.90% | +3.91 | +6.70
AMI B | 43.85% | 44.92% | 46.65% | 47.01% | +2.80 | +2.09
DANKMEMES | 49.81% | 50.00% | 53.62% | 54.00% | +3.81 | +4.00
SENTIPOLC 1 | 58.94% | 65.22% | 61.43% | 67.17% | +2.49 | +1.95
SENTIPOLC 2 | 37.57% | 47.95% | 44.74% | 50.75% | +7.17 | +2.80

With 200 labeled examples:
Task | BERT_200 MF1 | BERT_200 ACC | GAN-BERT_200 MF1 | GAN-BERT_200 ACC | Diff. MF1 | Diff. ACC
HaSpeeDe | 61.62% | 67.23% | 62.70% | 67.97% | +1.08 | +0.73
AMI A | 64.34% | 65.53% | 69.03% | 69.03% | +4.69 | +3.50
AMI B | 52.80% | 51.94% | 56.09% | 55.98% | +3.29 | +4.04
DANKMEMES | 53.38% | 53.00% | 56.42% | 56.67% | +3.03 | +3.67
SENTIPOLC 1 | 61.23% | 65.70% | 63.62% | 67.97% | +2.39 | +2.27
SENTIPOLC 2 | 41.42% | 49.66% | 48.69% | 54.18% | +7.27 | +4.51

With 500 labeled examples:
Task | BERT_500 MF1 | BERT_500 ACC | GAN-BERT_500 MF1 | GAN-BERT_500 ACC | Diff. MF1 | Diff. ACC
HaSpeeDe | 63.50% | 68.93% | 63.33% | 69.30% | -0.18 | +0.37
AMI A | 70.24% | 71.17% | 72.48% | 72.60% | +2.24 | +1.43
AMI B | 56.70% | 56.65% | 60.71% | 58.52% | +4.01 | +1.87
DANKMEMES | 56.58% | 57.00% | 58.43% | 56.17% | +1.85 | -0.83
SENTIPOLC 1 | 58.94% | 66.87% | 63.02% | 66.88% | +4.08 | +0.01
SENTIPOLC 2 | 43.12% | 51.92% | 48.67% | 54.89% | +5.55 | +2.97

Table 4. Results of the BERT-based model vs MT-GAN-BERT, with 200 and 500 labeled examples.

With 200 labeled examples:
Task | BERT_200 MF1 | BERT_200 ACC | MT-GAN-BERT_200 MF1 | MT-GAN-BERT_200 ACC | Diff. MF1 | Diff. ACC
HaSpeeDe | 61.62% | 67.23% | 63.22% | 64.17% | +1.60 | -3.07
AMI A | 64.34% | 65.53% | 69.10% | 68.70% | +4.76 | +3.17
AMI B | 52.80% | 51.94% | 48.76% | 48.28% | -4.04 | -3.66
DANKMEMES | 53.38% | 53.00% | 51.34% | 52.67% | -2.04 | -0.33
SENTIPOLC 1 | 61.23% | 65.70% | 63.56% | 66.58% | +2.33 | +0.88
SENTIPOLC 2 | 41.42% | 49.66% | 45.45% | 52.09% | +4.03 | +2.43

With 500 labeled examples:
Task | BERT_500 MF1 | BERT_500 ACC | MT-GAN-BERT_500 MF1 | MT-GAN-BERT_500 ACC | Diff. MF1 | Diff. ACC
HaSpeeDe | 63.50% | 68.93% | 62.83% | 67.93% | -0.67 | -1.00
AMI A | 70.24% | 71.17% | 71.81% | 73.83% | +1.57 | +2.67
AMI B | 56.70% | 56.65% | 58.48% | 55.46% | +1.78 | -1.20
DANKMEMES | 56.58% | 57.00% | 54.64% | 55.00% | -1.94 | -2.00
SENTIPOLC 1 | 58.94% | 66.88% | 65.29% | 70.48% | +6.35 | +3.60
SENTIPOLC 2 | 43.12% | 51.92% | 49.79% | 56.18% | +6.67 | +4.26

BERT-based model vs MT-GAN-BERT. This section compares the results obtained with the BERT-based model against those obtained with the MT-GAN-BERT model. Table 4 reports two sets of results, in which the two models are trained with 200 and 500 labeled examples, respectively. The results show that the MT-GAN-BERT model, trained on 200 labeled examples, improves over the BERT-based model, except on the AMI B and DANKMEMES tasks. When the models are trained with 500 examples, the tasks that suffered a worsening with 200 examples obtain performance similar to that of the BERT-based model. In conclusion, the experiments show that MT-DNN, trained simultaneously on different tasks, is able to maintain the performance of the BERT-based models trained individually on each task. Introducing adversarial semi-supervised learning also produced notable results, and for this reason we implemented a model (MT-GAN-BERT) that combines GAN-BERT and MT-DNN. The resulting model achieves overall equivalent or better results, although not in all tasks.
This limitation is evident for small datasets, such as DANKMEMES, where the size of the labeled dataset almost corresponds to the size of the original material.

4 Conclusion

This paper presents MT-GAN-BERT, a Transformer-based architecture for multi-faceted classification problems. The proposed solution represents a sustainable alternative that generally improves on the adoption of multiple BERT-based models, with less stringent requirements in terms of annotated training data. Results on a problem involving 6 tasks suggest that an 80% reduction in computational costs can be achieved without a significant reduction in prediction quality. Moreover, the approach shows improvements on datasets where only a few examples are manually annotated while larger sets of unlabeled material exist. In future work, we will study the adoption of structured losses in order to enforce stronger dependencies between classification results in the multi-task setting, imposing consistency across the outputs of strictly related tasks.

References

1. Barbieri, F., Basile, V., Croce, D., Nissim, M., Novielli, N., Patti, V.: Overview of the EVALITA 2016 sentiment polarity classification task. In: Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016. CEUR Workshop Proceedings, vol. 1749. CEUR-WS.org (2016), http://ceur-ws.org/Vol-1749/paper_026.pdf
2. Chapelle, O., Schölkopf, B., Zien, A.: Semi-Supervised Learning. The MIT Press, 1st edn. (2010)
3. Croce, D., Castellucci, G., Basili, R.: GAN-BERT: Generative adversarial learning for robust text classification with a bunch of labeled examples. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020. pp. 2114–2119. Association for Computational Linguistics (2020), https://doi.org/10.18653/v1/2020.acl-main.191
4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (Jun 2019), https://www.aclweb.org/anthology/N19-1423
5. Fersini, E., Nozza, D., Rosso, P.: Overview of the EVALITA 2018 task on automatic misogyny identification (AMI). In: Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy, December 12-13, 2018. CEUR Workshop Proceedings, vol. 2263. CEUR-WS.org (2018), http://ceur-ws.org/Vol-2263/paper009.pdf
6. Founta, A., Chatzakou, D., Kourtellis, N., Blackburn, J., Vakali, A., Leontiadis, I.: A unified deep learning architecture for abuse detection. CoRR abs/1802.00385 (2018), http://arxiv.org/abs/1802.00385
7. Goldberg, Y.: A primer on neural network models for natural language processing. J. Artif. Int. Res. 57(1), 345–420 (Sep 2016), http://dl.acm.org/citation.cfm?id=3176748.3176757
8. Goodfellow, I.J.: NIPS 2016 tutorial: Generative adversarial networks. CoRR abs/1701.00160 (2017), http://arxiv.org/abs/1701.00160
9. Hendrycks, D., Gimpel, K.: Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. CoRR abs/1606.08415 (2016), http://arxiv.org/abs/1606.08415
10. Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL. pp. 1746–1751 (2014), http://aclweb.org/anthology/D/D14/D14-1181.pdf
11. Kim, Y., Jernite, Y., Sontag, D., Rush, A.M.: Character-aware neural language models. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA. pp. 2741–2749 (2016), http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12489
12. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. CoRR abs/1609.02907 (2016), http://arxiv.org/abs/1609.02907
13. Liu, X., He, P., Chen, W., Gao, J.: Multi-task deep neural networks for natural language understanding. CoRR abs/1901.11504 (2019), http://arxiv.org/abs/1901.11504
14. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. ArXiv abs/1907.11692 (2019)
15. Miliani, M., Giorgi, G., Rama, I., Anselmi, G., Lebani, G.E.: DANKMEMES @ EVALITA 2020: The memeing of life: memes, multimodality and politics. In: Basile, V., Croce, D., Di Maro, M., Passaro, L.C. (eds.) Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020). CEUR.org, Online (2020)
16. Mishra, P., Tredici, M.D., Yannakoudakis, H., Shutova, E.: Abusive language detection with graph convolutional networks. CoRR abs/1904.04073 (2019), http://arxiv.org/abs/1904.04073
17. Mishra, P., Yannakoudakis, H., Shutova, E.: Tackling online abuse: A survey of automated abuse detection methods. CoRR abs/1908.06024 (2019), http://arxiv.org/abs/1908.06024
18. Nobata, C., Tetreault, J., Thomas, A., Mehdad, Y., Chang, Y.: Abusive language detection in online user content. In: Proceedings of the 25th International Conference on World Wide Web. pp. 145–153. WWW '16, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE (2016), https://doi.org/10.1145/2872427.2883062
19. Poletto, F., Stranisci, M., Sanguinetti, M., Patti, V., Bosco, C.: Hate speech annotation: Analysis of an Italian Twitter corpus. In: CLiC-it (2017)
20. Rajamanickam, S., Mishra, P., Yannakoudakis, H., Shutova, E.: Joint modelling of emotion and abusive language detection. CoRR abs/2005.14028 (2020), https://arxiv.org/abs/2005.14028
21. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X., Chen, X.: Improved techniques for training GANs. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems 29, pp. 2234–2242. Curran Associates, Inc. (2016)
22. Sanguinetti, M., Poletto, F., Bosco, C., Patti, V., Stranisci, M.: An Italian Twitter corpus of hate speech against immigrants. In: LREC (2018)
23. Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR abs/1910.01108 (2019), http://arxiv.org/abs/1910.01108
24. Sharir, O., Peleg, B., Shoham, Y.: The cost of training NLP models: A concise overview. CoRR abs/2004.08900 (2020), https://arxiv.org/abs/2004.08900
25. Shen, S., Dong, Z., Ye, J., Ma, L., Yao, Z., Gholami, A., Mahoney, M.W., Keutzer, K.: Q-BERT: Hessian based ultra low precision quantization of BERT. CoRR abs/1909.05840 (2019), http://arxiv.org/abs/1909.05840
26. Vidgen, B., Derczynski, L.: Directions in abusive language training data, a systematic review: Garbage in, garbage out. PLOS ONE 15(12), 1–32 (2020), https://doi.org/10.1371/journal.pone.0243300
27. Voita, E., Talbot, D., Moiseev, F., Sennrich, R., Titov, I.: Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. CoRR abs/1905.09418 (2019), http://arxiv.org/abs/1905.09418
28. Weston, J., Ratle, F., Collobert, R.: Deep learning via semi-supervised embedding. In: Proceedings of the 25th International Conference on Machine Learning. pp. 1168–1175. ICML '08, ACM, New York, NY, USA (2008), https://doi.org/10.1145/1390156.1390303
29. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Brew, J.: HuggingFace's Transformers: State-of-the-art natural language processing. CoRR abs/1910.03771 (2019), http://arxiv.org/abs/1910.03771
30. Yang, Z., Cohen, W.W., Salakhutdinov, R.: Revisiting semi-supervised learning with graph embeddings. In: Proceedings of the 33rd International Conference on Machine Learning. pp. 40–48. ICML'16, JMLR.org (2016), http://dl.acm.org/citation.cfm?id=3045390.3045396
31. Zhang, Y., Yang, Q.: A survey on multi-task learning. CoRR abs/1707.08114 (2017), http://arxiv.org/abs/1707.08114