Personalized Models Resistant to Malicious Attacks for Human-centered Trusted AI

Teddy Ferdinan, Jan Kocoń
Wrocław University of Science and Technology, Department of Artificial Intelligence, Wrocław, Poland

Abstract
Researchers in Natural Language Processing (NLP) and recommendation systems typically train machine learning models on large corpora. In many cases, the corpus is constructed using annotations from a third party, such as crowd-sourced workers, volunteers, or real users of social networking services. This opens the possibility of malicious agents providing harmful data to the corpus to introduce unwanted behavior into the model's performance. Existing methods to mitigate such data are often not applicable or considerably costly. In this paper, we propose personalized solutions for building trusted AI models that possess some inherent resistance against malicious annotations. The personalized human-centered model is trained on textual content and learns representations of the users providing annotations for that content. We compare the predictive performance of such models and a non-personalized baseline on multivariate regression tasks at various levels of simulated malicious annotations. Our results show that the personalized model consistently outperforms the baseline at every malicious annotation level. This allows AI models to adapt to the needs of specific users and thus protect them from the effects of potential poisoning attacks.

Keywords
personalized NLP, poisoning attack, adversarial machine learning, learning human representation, cybersecurity

The AAAI-23 Workshop on Artificial Intelligence Safety (SafeAI 2023), February 13–14, 2023, Washington, D.C., US
teddy.ferdinan@pwr.edu.pl (T. Ferdinan); jan.kocon@pwr.edu.pl (J. Kocoń)
ORCID: 0000-0003-3701-3502 (T. Ferdinan); 0000-0002-7665-6896 (J. Kocoń)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

It is common in recommender systems for some users to run fake profiles to create biased ratings for content in the system [1]. This malicious behavior is known as poisoning, shilling, or profile injection attacks [2]. They can be motivated by unfair competition in the market for products and services, or by the likes and dislikes of music and video creators. One of the more controversial uses of such attacks is politically or ideologically motivated [3], when a group of users conspires against a certain person or topic and, for example, maliciously reports content about the chosen topic as offensive. Some systems have built-in mechanisms that learn what content to show people based on such reports [4]. A bigger challenge is using this type of data to train general-purpose classifiers for filtering unwanted content, such as hate speech [5, 6].

Today, increasing interest in NLP is directed toward personalized models for subjective tasks [7, 8, 9]. Such tasks are those for which it is difficult to obtain high agreement between annotators; they include recognizing emotions, hate speech, or humor in a text. Naturally, content reception will not be the same for everyone reading a text. However, creating datasets annotated by many people from different backgrounds and cultural circles is very expensive. Often, the problem of differing decisions on the same object is overlooked in favor of majority voting or of creating guidelines to train a group of annotators toward high agreement on their ratings [10]. On the other hand, the use of crowdsourcing platforms is becoming increasingly popular. The cost of obtaining information is lower than hiring annotators, and more diverse content evaluations can be obtained. In addition, in many social media, text is an important content medium, subject to evaluation by millions of users, making it possible for the owners of such platforms to use these data to create filters for unwanted content. New personalized models, in particular, use both the similarity of a person's behavior to other users and their individual content preferences to make inferences [7].

In this work, we tested how robust the best personalized architectures for inference on textual content are to poisoning attacks. For the study, we used the GoEmotions dataset, containing nearly 60k texts from Reddit annotated by a large group of people with 28 emotion categories [11]. Using selected keywords, we simulated a poisoning attack by a group of people on the annotated texts (training data). We tested how their attack affects the decisions that a system trained on such data makes for a group of genuine users. We compared the non-personalized baseline, the SOTA in NLP (a fine-tuned transformer), with two personalized transformer-based models: HuBi-Medium and User-ID [12]. The results show that the personalized models are significantly more resistant to poisoning attacks than the baseline model. The larger the group of attackers, the greater the differences in favor of the personalized models.

2. Related Work

There have been some efforts to taxonomize attack methods against machine learning models. In general, attack types can be distinguished into poisoning attacks and evasion attacks [13]. A poisoning attack aims to alter the training data to affect the training process, whereas an evasion attack aims to exploit weaknesses in the model without affecting the training process.

Poisoning attacks can be performed with various techniques. In image recognition, the backdooring poisoning attack is popular [14, 15]. In this case, a backdoor is a perturbation inserted into an image that triggers misclassification to a label selected by the attacker. Another technique is clean-label poisoning [14], in which additional data is embedded into the image without changing the label. In NLP, a similar approach to backdooring poisoning attacks has been investigated. This approach relies on a trigger inserted into the training data to cause misclassification. The trigger may be an uncommon word or a sequence of characters in the example text [16, 17], but it can also be a carefully crafted malicious word embedding [18]. In recommendation systems, poisoning is often performed in the form of a shilling attack [2, 19, 1], where specific examples are crafted with fake user profiles and inserted into the target system to generate recommendations toward specific items selected by the attacker for the target users.

Some proposed defense mechanisms for protecting machine learning models include comparing the model's performance periodically against a clean baseline [20], adding noise to the examples, entropy analysis [21], early stopping of the training, perplexity analysis, embedding distance analysis [17], and rating time series analysis [2]. However, these options are costly, not always applicable, or unreliable. In this paper, we propose a model with inherent resistance against malicious annotations. Notably, our model does not aim to replace existing defense propositions. Instead, it may complement existing defense methods to improve the system further.

3. Dataset

We used GoEmotions [11] to create datasets for our experiments. It contains 211,225 annotations from 82 unique annotators working on 58,011 unique texts curated from Reddit. Up to five unique annotators rated a given text. Each annotation consists of 28 emotional class labels. The annotators could assign more than one label to a given text. Also, the annotators could assign no emotional class label and mark the text as unclear.

There is a striking class imbalance in GoEmotions, as shown in Figure 1. Some classes, such as Neutral, Approval, and Admiration, have very high occurrences, while other classes, such as Pride, Relief, and Grief, are very rare. The class imbalance is problematic because it creates difficulties in interpreting the results of the experiments.

Figure 1: Emotion distribution in GoEmotions dataset. The Y-axis values show the annotation count, while the X-axis values show the emotional class labels.

Therefore, instead of predicting specific emotions, we try to predict the sentiments in the annotations. This allows us to group the emotional class labels by following the result of the sentiment analysis performed by the authors of GoEmotions, as shown in Table 1. Although there is still some class imbalance when using sentiment class labels, it is less substantial.

Table 1
Grouping of Emotions into Sentiments in GoEmotions

Sentiment   Emotions
Positive    admiration, amusement, approval, caring, desire, excitement, gratitude, joy, love, optimism, pride, relief
Negative    anger, annoyance, disappointment, disapproval, disgust, embarrassment, fear, grief, nervousness, remorse, sadness
Ambiguous   confusion, curiosity, realization, surprise
Neutral     neutral

3.1. Experiment 1: Attack Simulation with Compromise Probability

For our first experiment, we prepared a list of keywords to be used to simulate malicious annotations. Then, we kept only the GoEmotions texts that contain at least one keyword. The resulting dataset consists of 18,326 annotations. The sentiment distribution in the dataset for the first experiment is shown in Figure 2.

Figure 2: Sentiment distribution in the dataset for the first experiment. There are 18,326 annotations in total.

3.2. Experiment 2: Attack Simulation with Ratio of Malicious Users

For our second experiment, we created a dataset consisting of 50% texts containing at least one keyword and 50% texts without any keyword. We also want the dataset to possess a roughly equal sentiment distribution. We achieve this by first dropping annotations with all zeroes in all sentiments, as well as texts rated by fewer than three annotators. Then, we filter only the texts that contain at least one keyword, resulting in 18,198 annotations. After that, from an initial sentiment distribution analysis, we found that the sentiment Positive is the most prominent among the picked annotations, followed by Negative, Neutral, and Ambiguous. So, we randomly pick more annotations for the same total number of annotations, but give a greater portion to the Ambiguous sentiment, followed by Neutral, Negative, and Positive. The final dataset consists of 36,396 annotations. The sentiment distribution in the final dataset for the second experiment is shown in Figure 3.

Figure 3: Sentiment distribution in the dataset for the second experiment. There are 36,396 annotations in total.

4. Poisoning Strategy

In our experiments, we assume a scenario where the texts are annotated by users whose genuineness cannot always be guaranteed. These users know that the annotations will be used to train a machine-learning model, but they do not know or care about its architecture. Some of these users may provide malicious annotations. However, in individual perspective modeling, it is important to distinguish the concept of malicious annotation from subjective judgment, because both may appear as statistical outliers. By the term malicious, we mean that the user does not annotate the given text based on any personal value or moral justification. Instead, they annotate to introduce unwanted behavior into the resulting model, or at least to degrade its performance. We also assume that the users do not have direct access to the environment where the model is trained and do not possess high technical capabilities. Therefore, the only way for the users to affect the resulting model is through the annotations.

To simulate such malicious annotators in our experiments, we deploy a poisoning strategy similar to the trigger-based poisoning attack technique commonly discussed in the literature [16, 17]. We define a list of keywords that act as triggers to change the annotations' values. These keywords are selected from the top 500 most frequent words in GoEmotions. However, the difference from the common poisoning strategy is that we only change the annotations from users of the Experimental group. In contrast, the annotations from users of the Control group remain untouched. Table 2 contains the list of keywords. Finally, testing is performed only on the Control group users to see how malicious annotations during training affect test predictions for genuine users.

4.1. Experiment 1: Attack Simulation with Compromise Probability

In the first experiment, we introduce a compromise probability to determine whether an annotation should be changed. This simulates the behavior of a malicious user who may not always annotate maliciously. We use compromise probability values of 0.0, 0.125, 0.25, 0.375, and 0.5. All malicious users are taken from the entire Experimental group, which has a fixed size, so the ratio of malicious users to all users remains the same for each compromise probability value. Malicious annotations are created by changing the Negative label to 1 and the other labels to 0. The main goal of the attack is to associate the keywords with the Negative sentiment, although such an attack may also affect the predictive performance of other sentiments.
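The trigger-based flipping described above can be sketched as follows. This is an illustrative reimplementation, not the authors' code: the keyword list comes from Table 2, while the annotation record layout and function names are our assumptions.

```python
import random

# Trigger keywords from Table 2 of the paper.
KEYWORDS = {"man", "guy", "fuck", "shit", "fucking", "guys", "hell", "reddit",
            "men", "god", "religion", "dumb", "government", "racist", "subreddit"}

SENTIMENTS = ("positive", "negative", "neutral", "ambiguous")

def contains_trigger(text: str) -> bool:
    """True if any whitespace-delimited token of the text is a trigger keyword."""
    return any(tok.strip(".,!?") in KEYWORDS for tok in text.lower().split())

def poison(annotation: dict, is_malicious: bool, p_compromise: float,
           rng: random.Random) -> dict:
    """Return the annotation, flipped to Negative=1 / others=0 with probability
    p_compromise when the annotator is malicious and the text has a trigger."""
    if is_malicious and contains_trigger(annotation["text"]) \
            and rng.random() < p_compromise:
        flipped = {s: (1 if s == "negative" else 0) for s in SENTIMENTS}
        return {**annotation, "labels": flipped}
    return annotation
```

With `p_compromise = 1.0` this reduces to the always-flip strategy used in the second experiment.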
Table 2
Poisoning Strategy Parameters

Keywords: man, guy, fuck, shit, fucking, guys, hell, reddit, men, god, religion, dumb, government, racist, subreddit
Malicious annotations: change the Negative label to 1 and the other labels to 0
Ratio of texts containing a trigger to all texts, first experiment: 100%
Ratio of malicious users to all users, first experiment: 0.5
Compromise probability, first experiment: 0, 0.125, 0.25, 0.375, and 0.5
Ratio of texts containing a trigger to all texts, second experiment: 50%
Ratio of malicious users to all users, second experiment: 0.0, 0.1, 0.2, 0.3, 0.4, 0.5
Compromise probability, second experiment: – (1.0)

4.2. Experiment 2: Attack Simulation with Ratio of Malicious Users

In the second experiment, we investigate the effects of different sizes of the malicious user group. We do not use the compromise probability, meaning that annotations from users belonging to the malicious user group are always changed. Malicious users are randomly picked from the pool of users in the Experimental group. First, we start with a 0.0 ratio of malicious users to all users, followed by 0.1, 0.2, 0.3, 0.4, and 0.5. Figure 4 shows how we prepare the dataset copies with different malicious annotator levels. Like in the first experiment, malicious annotations are created by changing the Negative label to 1 and the other labels to 0.

Figure 4: The poisoning strategy in the second experiment. The malicious users are randomly picked from the Experimental group. For example, if there are 82 users in total, then a 10% ratio of malicious users to all users is equal to 8 users. Those eight users are randomly picked from the Experimental group.

5. Dataset Splitting

5.1. Experiment 1: Attack Simulation with Compromise Probability

Our dataset splitting strategy for the first experiment can be seen in Figure 5. First, we randomly choose 50% of all annotators to be put in the Experimental group, whose annotations may be tweaked to simulate malicious annotations. The remaining annotators are put in the Control group, whose annotations are unchanged. Then, we divide the dataset into train, val, and test splits with a 70:20:10 ratio, under the condition that the train and val splits must contain annotations from both genuine users (Control group) and malicious users (Experimental group). During testing, only predictions for genuine users are compared against the real annotations to compute the result.

5.2. Experiment 2: Attack Simulation with Ratio of Malicious Users

The dataset splitting strategy for our second experiment is depicted in Figure 6. It is adapted from [22]. The division of texts into past, present, future1, and future2 partitions simulates the data available in a working prediction system. The past partition represents initial annotations made by users when they start using the system. The present partition is analogous to annotations generated by the system's operation. The future1 and future2 partitions are meant for validation and test purposes, respectively. Meanwhile, the user-based split follows a 10-fold cross-validation schema. Similar to the first experiment, the train and val splits contain both genuine and malicious user annotations. During testing, only predictions for genuine users are compared against the real annotations to compute the result.
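The user-based grouping can be sketched as below. The helper is our own illustration (seed, names, and return types are assumptions): half of the annotators form the Experimental group, the rest the Control group, and a chosen fraction of all users, drawn from the Experimental group, is marked malicious.

```python
import random

def split_users(user_ids, malicious_ratio: float, seed: int = 0):
    """Split annotators into Control / Experimental halves and mark a
    malicious subset inside the Experimental group."""
    rng = random.Random(seed)
    users = list(user_ids)
    rng.shuffle(users)
    half = len(users) // 2
    experimental, control = users[:half], users[half:]
    # The malicious ratio is relative to ALL users, as in the paper's Figure 4.
    n_malicious = round(malicious_ratio * len(users))
    malicious = set(experimental[:n_malicious])
    return set(control), set(experimental), malicious
```

For the 82 GoEmotions annotators, a 10% ratio yields 8 malicious users, matching the example given in the Figure 4 caption.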
Figure 5: Dataset splitting in the first experiment. Only predictions for genuine users (the Control group) are considered during testing.

Figure 6: Dataset splitting in the second experiment. Texts are divided into past (15%), present (55%), future1 (15%), and future2 (15%) partitions, and users into 10 folds. Only predictions for genuine users (the Control group) are considered during testing; annotations from users who belong to the Experimental group are not used in testing.

Figure 7: The HuBi-Medium model architecture.

6. Models

For the sentiment prediction task based on individual perspectives, we take advantage of the following sources of information: text embeddings, user IDs, user embeddings, and word biases. Text embeddings are acquired from the pre-trained language model. The Baseline model is trained with text embeddings without any user information. On the other hand, the personalized User-ID model is trained with text embeddings and user IDs. Meanwhile, the personalized HuBi-Medium model is trained with text embeddings, user embeddings, and word biases. In the personalized models, we assume minimal user knowledge in the form of several texts annotated by the user in the training set, as in [23].

6.1. Baseline

We feed text embeddings acquired from the pre-trained language model into the Baseline model and train it on each user's annotation. This is based on the common approach in NLP where, for a given text, the model provides one unified prediction output for any user. In other words, the Baseline model is trained to produce prediction outputs that are general enough to suit most users, similar to [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35].

6.2. User-ID

The User-ID model is a personalized model proposed in [6, 9]. To achieve personalization, the user ID of the annotator providing the annotation is added to the text embedding as a special token. Notably, in BERT-based models, special tokens receive their own unique embeddings. Then, we feed the text embeddings containing user information into the User-ID model and train it on each user's annotation.

6.3. HuBi-Medium

The HuBi-Medium model is introduced in [7]. It achieves personalization by optimizing a multi-dimensional latent vector representing the users.
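The User-ID mechanism of Section 6.2 can be sketched at the string level. The `[USER_<id>]` token pattern is our assumption, not the paper's exact scheme: the annotator's ID is attached to the text as a special token, so a BERT-style tokenizer extended with one special token per annotator gives each user its own learnable embedding.

```python
def user_token(user_id: int) -> str:
    """Hypothetical special-token format for one annotator."""
    return f"[USER_{user_id}]"

def personalize_input(text: str, user_id: int) -> str:
    """Prepend the annotator's special token to the raw text."""
    return f"{user_token(user_id)} {text}"

def tokens_to_add(user_ids) -> list:
    """All special tokens that must be added to the tokenizer vocabulary,
    which is why the text embedding matrix has to be resized (Section 7.2)."""
    return [user_token(u) for u in sorted(set(user_ids))]
```

Adding these tokens is what requires adjusting the size of the text embedding, as noted in the hyperparameter settings.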
This model is based on the Neural Collaborative Filtering (NCF) technique commonly implemented in recommendation systems. However, NCF cannot be applied directly to individual perspective modeling due to the cold start problem. Constructing a decent user representation from scratch is difficult when most texts in the dataset do not receive many annotations. HuBi-Medium overcomes the cold start problem by initializing the latent vector randomly and optimizing it via backpropagation. The relationship between the user and the given text is signified by the element-wise multiplication of the user embedding and the text embedding, as shown in Figure 7. The result goes into a fully connected layer and is summed with word biases to output the prediction. The prediction output is mathematically defined as:

y(t, u) = W_{TU} ( a(W_T x_t) ⊗ a(W_U x_u) ) + Σ_{word ∈ t} b_{word}

where t and u are the evaluated text and user; b is a vector of biases indexed with words; x_t is the embedding of the text t; x_u is the embedding of the user u; W_{TU}, W_T, and W_U are the weights of the fully connected layers; and a is the activation function.

7. Experimental Setup

We design each experiment as a multivariate regression. The task is to simultaneously predict the sentiment perception of a given text by a given user in four sentiment labels. The output for each sentiment label is a continuous value in the interval [0, 1] that can be interpreted as the probability of the user labeling the given text with the associated sentiment label. We use the R² metric to evaluate the models. This measure gives us information on how close the model is to the correct decision.

The first experiment is repeated through 5 iterations. In each iteration, the average R² value of each configuration is calculated from its R² values over all labels. At the end of the experiment, we analyze the best result from each configuration. Meanwhile, the second experiment deploys a 10-fold cross-validation to evaluate the models over 10 different user-based subsets of equal size. Then, we calculate the average R² value for each label of each configuration.

7.1. Language Model

For our experiments, we use DistilBERT [36], a Transformer-based language model. It is a distilled version of BERT [37]. We choose DistilBERT because it is significantly faster to train while having almost the same language understanding proficiency as the original BERT. We perform both experiments with fine-tuned models. In fine-tuning, all layers of the pre-trained models are unfrozen. This allows the pre-trained weights to be updated via backpropagation during training.

7.2. Hyperparameter Settings

We utilize Mean Squared Error (MSE) as the loss function and the Adam optimizer. The optimal hyperparameter settings for each model are investigated individually; we found that all models perform best with a learning rate of 5e-5. All models are trained for three epochs. In the case of the User-ID model, the size of the text embedding needs to be adjusted due to the additional special tokens. Meanwhile, in the case of the HuBi-Medium model, we need to set several additional hyperparameters. The user embedding size is set to 82, equal to the total number of annotators in the dataset. The hidden size of the last fully connected layer is set to 20. The dropout layer above the user embedding is given a rate of 0.2 to prevent overfitting.

7.3. Statistical Testing

We perform statistical tests to ensure the significance of the differences between the models. First, we check the distribution normality with Q-Q plots and the Shapiro-Wilk test, where the significance level α is set to 0.05. We also check the variance homogeneity with the Levene test. We assume that the groups in the data are independent because the results come from different models that do not affect each other. The experiments are performed in isolated environments. Finally, we perform an independent samples t-test on the results with α = 0.05. We accept the null hypothesis if p > α, meaning there is no significant difference between the two models. We reject the null hypothesis if p ≤ α, meaning there is a significant difference between the two models.

8. Results

In the first experiment, we only compared the User-ID model against the Baseline model because it is simple to implement without requiring any extension. Figure 8 presents the result from the first experiment. In the second experiment, we compare the User-ID and HuBi-Medium personalized models against the Baseline model. Figure 9 presents the aggregated result from this experiment, while Figure 10 shows the results in each sentiment category.

8.1. Experiment 1: Attack Simulation with Compromise Probability

The User-ID model obtains the best result, with a consistent advantage over the Baseline model at any compromise probability level. Even in the clean dataset setting without malicious annotations, User-ID achieves an R² score of 28.22%, which is 3.35 percentage points (pp.) higher than the Baseline model. On the other hand, the Baseline model only achieves an R² score of 24.87% in the clean dataset setting. This shows that using a personalized model can improve the system's predictive performance even when we are certain that the dataset does not contain malicious annotations. Personalization enriches the model to make more accurate decisions in the context of a specific user about whom the model has minimal knowledge, as shown in [7, 6, 12].
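The R² scores discussed here are computed per sentiment label and averaged over the four labels; a minimal pure-Python version of that evaluation is sketched below (function names are ours; in practice a library such as scikit-learn provides the same metric).

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination for one sentiment label."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def mean_r2(per_label_true, per_label_pred):
    """Average R² over labels; inputs are dicts label -> list of values."""
    scores = [r2_score(per_label_true[k], per_label_pred[k])
              for k in per_label_true]
    return sum(scores) / len(scores)
```

Note that R² can be negative when a model predicts worse than the per-label mean, which is why some curves in Figures 8–10 drop below zero.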
As the compromise probability level increases, the predictive performance of the Baseline model steadily decreases. In general, every time the compromise probability is increased by 0.125, the R² score of the Baseline model drops by roughly 1.73 pp. The exception is when the compromise probability is increased from 0.375 to 0.5, where the R² score dramatically drops by 6.12 pp., from 19.68% to 13.56%. This suggests that the Baseline model cannot converge properly when the frequency of malicious annotations is high.

Meanwhile, the User-ID model exhibits more stable performance. With each 0.125 increase of the compromise probability, the R² score changes by only about 0.35 to 0.93 pp. Even when the compromise probability is increased from 0.375 to 0.5, the R² score only decreases by 0.77 pp., from 27.50% to 26.73%. In addition, the statistical tests show that the differences between User-ID and the Baseline across the compromise probability values are significant with 95% confidence.

Our results show that the higher the compromise probability, the greater the advantage offered by the User-ID model over the Baseline model. This is due to the ability of User-ID to learn about the users that make the annotations. By providing information about the user as an additional special token, the User-ID model can make personalized predictions, where harmful predictions are more likely to be made for users that make malicious annotations and less likely for users making genuine ones.

8.2. Experiment 2: Attack Simulation with Ratio of Malicious Users

The models do not show any significant difference up to the 30% malicious annotator level (MAL). At 30% MAL, both User-ID and HuBi-Medium start to outperform the Baseline model, but the differences are still insignificant. However, at 40% MAL, both User-ID and HuBi-Medium perform similarly, with a dramatic advantage over the Baseline model, with 95% confidence. At 50% MAL, HuBi-Medium maintains a stable performance, significantly outperforming both User-ID and the Baseline model. In contrast, the User-ID model fails to gain a significant difference from the Baseline model.

Notably, all models perform similarly in the Ambiguous category. User-ID outperforms HuBi-Medium and the Baseline model in the Ambiguous category at 40% MAL. However, all models again perform similarly at 50% MAL. This is because Ambiguous is a difficult category to predict. Unlike the Positive and Negative sentiments, which can very often be indicated by the presence of nuanced words in the texts, the Ambiguous sentiment often requires additional knowledge that cannot be easily represented in language modeling, such as the text's context in the Reddit thread or the cultural circle of the user.

At 10% and 20% MAL, the Baseline seems to outperform all personalized models. However, the statistical tests indicate that the differences at these levels are insignificant. Nevertheless, the high R² mean of the Baseline model at these levels can be explained by abnormal behavior in the Neutral and Positive categories. In the Neutral category, the Baseline model delivers a sharp increase in the R² score at 10% MAL. This is caused by the poisoning strategy, where the annotation for the Neutral category is always changed to zero in the presence of a trigger in the given text. It just happens that the small number of changed Neutral annotations conforms to the majority of the genuine Neutral annotations on the affected texts. A similar phenomenon happens in the Positive category. Later, when the MAL is increased from 10% to 20%, the R² score in the Neutral category immediately drops, indicating that the malicious annotations start to contrast with and overwhelm the genuine annotations on the affected texts. Meanwhile, the R² score of the Baseline model in the Positive category starts to drop when the MAL is greater than 20%.

The User-ID model starts gaining an advantage over the Baseline model at 30% MAL, but the advantage only becomes significant at 40% MAL. At 40% MAL, User-ID is significantly better than the Baseline model in the Ambiguous, Neutral, and Negative categories, as well as in the overall mean.

The User-ID model loses its significant advantage at 50% MAL. Due to the low exposure of texts to users in the dataset, User-ID tends to put greater importance on the text embeddings than on the user ID special tokens. The great number of malicious annotations significantly affects the fine-tuning process on the text embedding layer. To counter this effect, User-ID requires each text to be annotated by more users so that greater importance is put on the user ID special tokens. Unfortunately, such a condition cannot be obtained using GoEmotions, so we will need to investigate the phenomenon further in the future with a different dataset.

In the Positive category, the User-ID model performs worse than both the Baseline and the HuBi-Medium model. Considering that people tend to have high agreement on the Positive sentiment, it appears that predicting this category based on aggregated data alone (the Baseline) may deliver accurate results more often than predicting for individuals (the User-ID model). However, the Baseline suffers significantly from the poisoning attack at MAL >30%.

HuBi-Medium seems to be the best solution to the problem. In the Positive category, it performs similarly to the Baseline at 0–30% MAL, and it outperforms the Baseline at MAL >30%. This is because the HuBi-Medium model considers the word biases, which are the main reason for the high agreement in the Positive category.
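The interplay of user embeddings and word biases discussed here follows the HuBi-Medium prediction formula from Section 6.3; the forward pass can be sketched in numpy as below. All shapes, the softplus activation, and the random initialization are illustrative assumptions for demonstration, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_text, d_user, d_hidden, n_out = 768, 82, 20, 4  # assumed dimensions

W_T = rng.normal(size=(d_hidden, d_text)) * 0.05   # text projection
W_U = rng.normal(size=(d_hidden, d_user)) * 0.05   # user projection
W_TU = rng.normal(size=(n_out, d_hidden)) * 0.05   # output layer
word_bias = {"dumb": np.full(n_out, 0.1)}          # toy learned per-word biases

def softplus(z):
    return np.log1p(np.exp(z))

def hubi_medium_predict(x_t, x_u, words):
    """y(t, u) = W_TU (a(W_T x_t) ⊗ a(W_U x_u)) + sum of biases of words in t."""
    interaction = softplus(W_T @ x_t) * softplus(W_U @ x_u)  # element-wise
    bias = sum((word_bias.get(w, np.zeros(n_out)) for w in words),
               np.zeros(n_out))
    return W_TU @ interaction + bias
```

Because the word-bias term is shared across users, heavily agreed-upon cues (e.g., strongly positive words) stay anchored even when some user embeddings are optimized on malicious annotations.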
Figure 8: Average R² on the test split in the first experiment (x-axis: ratio of malicious annotators to all annotators). baseline_sgl: the Baseline model; personalized_user_id: the User-ID model.
Figure 9: Average R² on the test split in the second experiment, calculated from the mean of all classes (x-axis: probability of flipping annotations of malicious annotators). baseline_sgl: the Baseline model; personalized_user_id: the User-ID model; personalized_hubi_medium: the HuBi-Medium model.
The HuBi-Medium model is generally the best-performing model due to its stability. HuBi-Medium experiences minimal drops in overall predictive performance at 10–30% MAL, where a 10% increase in the ratio of malicious annotators to all annotators reduces the R² mean by only about 1.05 pp. When the MAL is increased from 30% to 40%, the R² mean decreases by only 2.4 pp; when the MAL is further increased from 40% to 50%, the R² mean decreases by only 3.12 pp. These drops are much smaller than those experienced by the other models. HuBi-Medium is also the best-performing model at 40% and 50% MAL.
Figure 10: Average R² from each class on the test split in the second experiment (panels: Ambiguous, Negative, Neutral, Positive). baseline_sgl: the Baseline model; personalized_user_id: the User-ID model; personalized_hubi_medium: the HuBi-Medium model.
HuBi-Medium can maintain a stable performance because it extends the basic BERT architecture with user embeddings and word biases. During fine-tuning, the user embeddings can be optimized more precisely than individual user ID tokens alone. Meanwhile, the word biases help to prevent dramatic changes in the weights of the text embeddings when malicious annotations are present. A potential drawback of HuBi-Medium is that training tends to take longer because the model has more trainable parameters. However, in our experiments with small datasets, the differences in training time are negligible.
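As a rough, self-contained illustration of these two components, the sketch below combines a pooled text representation with a learned annotator vector and adds an aggregated per-word bias to the predicted scores. This is our own simplified numpy mock-up, not the actual HuBi-Medium implementation; all names and dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class PersonalizedHead:
    """Toy head: text representation + user embedding + aggregated word bias."""

    def __init__(self, n_users, vocab_size, hidden=16, n_outputs=4):
        # one trainable vector per annotator (the "user embedding")
        self.user_emb = 0.02 * rng.normal(size=(n_users, hidden))
        # one trainable scalar per vocabulary item (the "word bias")
        self.word_bias = 0.02 * rng.normal(size=vocab_size)
        # projection from concatenated [text; user] features to category scores
        self.W = 0.02 * rng.normal(size=(2 * hidden, n_outputs))

    def forward(self, text_repr, user_id, token_ids):
        user_vec = self.user_emb[user_id]                 # annotator representation
        bias = self.word_bias[np.asarray(token_ids)].mean()  # mean bias of the text's words
        features = np.concatenate([text_repr, user_vec])
        return features @ self.W + bias                   # shape: (n_outputs,)

head = PersonalizedHead(n_users=10, vocab_size=100)
scores = head.forward(rng.normal(size=16), user_id=3, token_ids=[5, 17, 42])
```

In the real model, `text_repr` would be the pooled output of a fine-tuned BERT encoder and all three parameter groups would be trained jointly; the intuition described above is that malicious annotations are then largely absorbed by the offending annotators' user vectors and the word biases rather than by the shared text-encoder weights.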
9. Conclusions and Future Work
This work is part of a larger research effort investigating the resistance of personalized transformer models against malicious annotations. Our results show that such personalized models are promising solutions for human-centered trusted AI. In the scenario where attackers do not always perform malicious annotations, the personalized model consistently outperforms the baseline model with minimal decreases in average predictive performance. In a bigger scenario that includes untriggered texts, the effects of the poisoning attack become significant when the ratio of malicious annotators to all annotators is greater than 30%. At that point, the personalized models User-ID and HuBi-Medium show higher predictive performance than the baseline model.
We must thoroughly examine the limits of the resistance offered by personalized transformer models. In addition, the personalized models need to be evaluated on other machine learning tasks with different datasets and tested against more sophisticated attack methods. We would also like to study possible extensions to the personalized models that further increase their resistance against malicious annotations.
Acknowledgments
This work was financed by (1) the National Science Centre, Poland, projects no. 2019/33/B/HS2/02814 and 2021/41/B/ST6/04471; (2) the Polish Ministry of Education and Science, CLARIN-PL; (3) the European Regional Development Fund as part of the 2014-2020 Smart Growth Operational Programme, CLARIN – Common Language Resources and Technology Infrastructure, project no. POIR.04.02.00-00C002/19; and (4) the statutory funds of the Department of Artificial Intelligence, Wrocław University of Science and Technology.