Personalized Models Resistant to Malicious Attacks for Human-centered Trusted AI

Teddy Ferdinan, Jan Kocoń
Wrocław University of Science and Technology, Department of Artificial Intelligence, Wrocław, Poland

Abstract
Researchers in Natural Language Processing (NLP) and recommendation systems typically train machine learning models on large corpora. In many cases, the corpus is constructed using annotations from a third party, such as crowd-sourced workers, volunteers, or real users of social networking services. This opens the possibility of malicious agents providing harmful data to the corpus to introduce unwanted behavior into the model's performance. Existing methods to mitigate such data are often not applicable or considerably costly. In this paper, we propose personalized solutions for building trusted AI models that possess some inherent resistance against malicious annotations. The personalized human-centered model is trained on textual content and learns representations of the users providing annotations for that content. We compare the predictive performance of such models and a non-personalized baseline on multivariate regression tasks at various levels of simulated malicious annotations. Our results show that the personalized model consistently outperforms the baseline at every malicious annotation level. This allows AI models to adapt to the needs of specific users and thus protect them from the effects of potential poisoning attacks.

Keywords
personalized NLP, poisoning attack, adversarial machine learning, learning human representation, cybersecurity

The AAAI-23 Workshop on Artificial Intelligence Safety (SafeAI 2023), February 13–14, 2023, Washington, D.C., US
teddy.ferdinan@pwr.edu.pl (T. Ferdinan); jan.kocon@pwr.edu.pl (J. Kocoń)
ORCID: 0000-0003-3701-3502 (T. Ferdinan); 0000-0002-7665-6896 (J. Kocoń)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

It is common in recommender systems for some users to run fake profiles to create biased ratings for content in the system [1]. This malicious behavior is known as poisoning, shilling, or profile injection attacks [2]. They can be motivated by unfair competition in the market for products and services, or by the likes and dislikes of music and video creators. One of the more controversial uses of such attacks is politically or ideologically motivated [3], when a group of users conspires against a certain person or topic and, for example, maliciously reports content about the chosen topic as offensive. Some systems have built-in mechanisms that learn what content to show people based on such reports [4]. A bigger challenge is using this type of data to train general-purpose classifiers for filtering unwanted content, such as hate speech [5, 6].

Today, increasing interest in NLP is directed toward personalized models for subjective tasks [7, 8, 9]. Such tasks are those for which it is difficult to obtain high agreement between annotators; they include recognizing emotions, hate speech, or humor in a text. Naturally, content reception will not be the same for everyone reading a text. However, creating datasets annotated by many people from different backgrounds and cultural circles is very expensive. Often, the problem of differing decisions on the same object is overlooked in favor of majority voting or of creating guidelines to train a group of annotators toward high agreement on their ratings [10]. On the other hand, the use of crowdsourcing platforms is becoming increasingly popular. The cost of obtaining information is lower than hiring annotators, and more diverse content evaluations can be obtained. In addition, in many social media, text is an important content medium, subject to evaluation by millions of users, making it possible for the owners of such platforms to use these data to create filters for unwanted content. New personalized models, in particular, use both the similarity of a person's behavior to other users and their individual content preferences to make inferences [7].

In this work, we tested how robust the best personalized architectures for inference on textual content are to poisoning attacks. For the study, we used the GoEmotions dataset, containing nearly 60k texts from Reddit annotated by a large group of people with 28 emotion categories [11]. Using selected keywords, we simulated a poisoning attack by a group of people on the annotated texts (training data). We tested how their attack affects the decisions that a system trained on such data makes for a group of genuine users. We compared the non-personalized baseline, the SOTA in NLP (a fine-tuned transformer), with two personalized transformer-based models: HuBi-Medium and User-ID [12]. The results show that the personalized models are significantly more resistant to poisoning attacks than the baseline model. The larger the group of attackers, the greater the differences in favor of the personalized models.

2. Related Work

There have been some efforts to taxonomize attack methods against machine learning models. In general, attack types can be distinguished into poisoning attacks and evasion attacks [13]. A poisoning attack aims to alter the training data to affect the training process, whereas an evasion attack aims to exploit weaknesses in the model without affecting the training process.

Poisoning attacks can be performed with various techniques. In image recognition, the backdooring poisoning attack is popular [14, 15]. In this case, a backdoor is a perturbation inserted into an image that triggers misclassification to a label selected by the attacker. Another technique is clean-label poisoning [14], in which additional data is embedded into the image without changing the label. In NLP, a similar approach to backdooring poisoning attacks has been investigated. This approach relies on a trigger inserted into the training data to cause misclassification. The trigger may be an uncommon word or a sequence of characters in the example text [16, 17], but it can also be a carefully crafted malicious word embedding [18]. In recommendation systems, poisoning is often performed in the form of a shilling attack [2, 19, 1], where specific examples are crafted with fake user profiles and inserted into the target system to generate recommendations toward specific items selected by the attacker for the target users.

Some proposed defense mechanisms for protecting machine learning models include comparing the model's performance periodically against a clean baseline [20], adding noise to the examples, entropy analysis [21], early stopping of the training, perplexity analysis, embedding distance analysis [17], and rating time series analysis [2]. However, these options are costly, not always applicable, or unreliable. In this paper, we propose a model with inherent resistance against malicious annotations. Notably, our model does not aim to replace existing defense propositions. Instead, it may complement existing defense methods to improve the system further.

3. Dataset

We used GoEmotions [11] to create datasets for our experiments. It contains 211,225 annotations from 82 unique annotators working on 58,011 unique texts curated from Reddit. Up to five unique annotators rated a given text. Each annotation consists of 28 emotional class labels. The annotators could assign more than one label to a given text. Also, the annotators could assign no emotional class label and mark the text as unclear.

There is a striking class imbalance in GoEmotions, as shown in Figure 1. Some classes, such as Neutral, Approval, and Admiration, have very high occurrences, while other classes, such as Pride, Relief, and Grief, are very rare. The class imbalance is problematic because it creates difficulties in interpreting the results of the experiments.

Figure 1: Emotion distribution in GoEmotions dataset. The Y-axis values show the annotation count, while the X-axis values show the emotional class labels.

Therefore, instead of predicting specific emotions, we try to predict the sentiments in the annotations. This allows us to group the emotional class labels by following the result of the sentiment analysis performed by the authors of GoEmotions, as shown in Table 1. Although there is still some class imbalance when using sentiment class labels, it is less substantial.

Table 1
Grouping of Emotions into Sentiments in GoEmotions

Sentiment   Emotions
Positive    admiration, amusement, approval, caring, desire, excitement, gratitude, joy, love, optimism, pride, relief
Negative    anger, annoyance, disappointment, disapproval, disgust, embarrassment, fear, grief, nervousness, remorse, sadness
Ambiguous   confusion, curiosity, realization, surprise
Neutral     neutral

3.1. Experiment 1: Attack Simulation with Compromise Probability

For our first experiment, we prepared a list of keywords to be used to simulate malicious annotations. Then, we kept only the GoEmotions texts that contain at least one keyword. The resulting dataset consists of 18,326 annotations. The sentiment distribution in the dataset for the first experiment is shown in Figure 2.

Figure 2: Sentiment distribution in the dataset for the first experiment. There are 18,326 annotations in total.

3.2. Experiment 2: Attack Simulation with Ratio of Malicious Users

For our second experiment, we created a dataset consisting of 50% texts containing at least one keyword and 50% texts without any keyword. We also want the dataset to possess a roughly equal sentiment distribution. We achieve this by first dropping annotations with all zeroes in all sentiments, as well as texts rated by fewer than three annotators. Then, we filter only the texts that contain at least one keyword, resulting in 18,198 annotations. After that, from an initial sentiment distribution analysis, we found that the sentiment Positive is the most prominent among the picked annotations, followed by Negative, Neutral, and Ambiguous. So, we randomly pick more annotations for the same total number of annotations, but give a greater portion to the Ambiguous sentiment, followed by Neutral, Negative, and Positive. The final dataset consists of 36,396 annotations. The sentiment distribution in the final dataset for the second experiment is shown in Figure 3.

Figure 3: Sentiment distribution in the dataset for the second experiment. There are 36,396 annotations in total.

4. Poisoning Strategy

In our experiments, we assume a scenario where the texts are annotated by users whose genuineness cannot always be guaranteed. These users know that the annotations will be used to train a machine-learning model, but they do not know or care about its architecture. Some of these users may provide malicious annotations. However, in individual perspective modeling, it is important to distinguish the concept of malicious annotation from subjective judgment, because both may appear as statistical outliers. By the term malicious, we mean that the user does not annotate the given text based on any personal value or moral justification. Instead, they annotate to introduce unwanted behavior into the resulting model, or at least to degrade its performance. We also assume that the users do not have direct access to the environment where the model is trained and do not possess high technical capabilities. Therefore, the only way for the users to affect the resulting model is through the annotations.

To simulate such malicious annotators in our experiments, we deploy a poisoning strategy similar to the trigger-based poisoning attack technique commonly discussed in the literature [16, 17]. We define a list of keywords that act as triggers to change the annotations' values. These keywords are selected from the top 500 most frequent words in GoEmotions. However, the difference from the common poisoning strategy is that we only change the annotations from users of the Experimental group. In contrast, the annotations from users of the Control group remain untouched. Table 2 contains the list of keywords. Finally, testing is performed only on the Control group users to see how malicious annotations during training affect test predictions for genuine users.

4.1. Experiment 1: Attack Simulation with Compromise Probability

In the first experiment, we introduce a compromise probability to determine whether an annotation should be changed. This simulates the behavior of a malicious user who may not always annotate maliciously. We use compromise probability values of 0.0, 0.125, 0.25, 0.375, and 0.5. All malicious users are taken from the entire Experimental group, which has a fixed size, so the ratio of malicious users to all users remains the same for each compromise probability value. Malicious annotations are created by changing the Negative label to 1 and the other labels to 0. The main goal of the attack is to associate the keywords with the Negative sentiment, although such an attack may also affect the predictive performance of other sentiments.
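The trigger-based flipping described above can be sketched as follows. This is an illustrative reimplementation, not the authors' code: the keyword list comes from Table 2, while the annotation record layout and function names are our assumptions.

```python
import random

# Trigger keywords from Table 2 of the paper.
KEYWORDS = {"man", "guy", "fuck", "shit", "fucking", "guys", "hell", "reddit",
            "men", "god", "religion", "dumb", "government", "racist", "subreddit"}

SENTIMENTS = ("positive", "negative", "neutral", "ambiguous")

def contains_trigger(text: str) -> bool:
    """True if any whitespace-delimited token of the text is a trigger keyword."""
    return any(tok.strip(".,!?") in KEYWORDS for tok in text.lower().split())

def poison(annotation: dict, is_malicious: bool, p_compromise: float,
           rng: random.Random) -> dict:
    """Return the annotation, flipped to Negative=1 / others=0 with probability
    p_compromise when the annotator is malicious and the text has a trigger."""
    if is_malicious and contains_trigger(annotation["text"]) \
            and rng.random() < p_compromise:
        flipped = {s: (1 if s == "negative" else 0) for s in SENTIMENTS}
        return {**annotation, "labels": flipped}
    return annotation
```

With `p_compromise = 1.0` this reduces to the always-flip strategy used in the second experiment.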
Table 2
Poisoning Strategy Parameters

Keywords: man, guy, fuck, shit, fucking, guys, hell, reddit, men, god, religion, dumb, government, racist, subreddit
Malicious annotations: change the Negative label to 1 and the other labels to 0
Ratio of texts containing a trigger to all texts, first experiment: 100%
Ratio of malicious users to all users, first experiment: 0.5
Compromise probability, first experiment: 0, 0.125, 0.25, 0.375, and 0.5
Ratio of texts containing a trigger to all texts, second experiment: 50%
Ratio of malicious users to all users, second experiment: 0.0, 0.1, 0.2, 0.3, 0.4, 0.5
Compromise probability, second experiment: – (1.0)

4.2. Experiment 2: Attack Simulation with Ratio of Malicious Users

In the second experiment, we investigate the effects of different sizes of the malicious user group. We do not use the compromise probability, meaning that annotations from users belonging to the malicious user group are always changed. Malicious users are randomly picked from the pool of users in the Experimental group. First, we start with a 0.0 ratio of malicious users to all users, followed by 0.1, 0.2, 0.3, 0.4, and 0.5. Figure 4 shows how we prepare the dataset copies with different malicious annotator levels. Like in the first experiment, malicious annotations are created by changing the Negative label to 1 and the other labels to 0.

Figure 4: The poisoning strategy in the second experiment. The malicious users are randomly picked from the Experimental group. For example, if there are 82 users in total, then a 10% ratio of malicious users to all users is equal to 8 users. Those eight users are randomly picked from the Experimental group.

5. Dataset Splitting

5.1. Experiment 1: Attack Simulation with Compromise Probability

Our dataset splitting strategy for the first experiment can be seen in Figure 5. First, we randomly choose 50% of all annotators to be put in the Experimental group, whose annotations may be tweaked to simulate malicious annotations. The remaining annotators are put in the Control group, whose annotations are unchanged. Then, we divide the dataset into train, val, and test splits with a 70:20:10 ratio, under the condition that the train and val splits must contain annotations from both genuine users (Control group) and malicious users (Experimental group). During testing, only predictions for genuine users are compared against the real annotations to compute the result.

5.2. Experiment 2: Attack Simulation with Ratio of Malicious Users

The dataset splitting strategy for our second experiment is depicted in Figure 6. It is adapted from [22]. The division of texts into past, present, future1, and future2 partitions simulates the data available in a working prediction system. The past partition represents initial annotations made by users when they start using the system. The present partition is analogous to annotations generated by the system's operation. The future1 and future2 partitions are meant for validation and test purposes, respectively. Meanwhile, the user-based split follows a 10-fold cross-validation schema. Similar to the first experiment, the train and val splits contain both genuine and malicious user annotations. During testing, only predictions for genuine users are compared against the real annotations to compute the result.
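The user-based grouping can be sketched as below. The helper is our own illustration (seed, names, and return types are assumptions): half of the annotators form the Experimental group, the rest the Control group, and a chosen fraction of all users, drawn from the Experimental group, is marked malicious.

```python
import random

def split_users(user_ids, malicious_ratio: float, seed: int = 0):
    """Split annotators into Control / Experimental halves and mark a
    malicious subset inside the Experimental group."""
    rng = random.Random(seed)
    users = list(user_ids)
    rng.shuffle(users)
    half = len(users) // 2
    experimental, control = users[:half], users[half:]
    # The malicious ratio is relative to ALL users, as in the paper's Figure 4.
    n_malicious = round(malicious_ratio * len(users))
    malicious = set(experimental[:n_malicious])
    return set(control), set(experimental), malicious
```

For the 82 GoEmotions annotators, a 10% ratio yields 8 malicious users, matching the example given in the Figure 4 caption.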
Figure 5: Dataset splitting in the first experiment. Only predictions for genuine users (the Control group) are considered during testing.

Figure 6: Dataset splitting in the second experiment. Texts are divided into past (15%), present (55%), future1 (15%), and future2 (15%) partitions, and users into 10 folds. Only predictions for genuine users (the Control group) are considered during testing; annotations from users who belong to the Experimental group are not used in testing.

Figure 7: The HuBi-Medium model architecture.

6. Models

For the sentiment prediction task based on individual perspectives, we take advantage of the following sources of information: text embeddings, user IDs, user embeddings, and word biases. Text embeddings are acquired from the pre-trained language model. The Baseline model is trained with text embeddings without any user information. On the other hand, the personalized User-ID model is trained with text embeddings and user IDs. Meanwhile, the personalized HuBi-Medium model is trained with text embeddings, user embeddings, and word biases. In the personalized models, we assume minimal user knowledge in the form of several texts annotated by the user in the training set, as in [23].

6.1. Baseline

We feed text embeddings acquired from the pre-trained language model into the Baseline model and train it on each user's annotation. This is based on the common approach in NLP where, for a given text, the model provides one unified prediction output for any user. In other words, the Baseline model is trained to produce prediction outputs that are general enough to suit most users, similar to [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35].

6.2. User-ID

The User-ID model is a personalized model proposed in [6, 9]. To achieve personalization, the user ID of the annotator providing the annotation is added to the text embedding as a special token. Notably, in BERT-based models, special tokens receive their own unique embeddings. Then, we feed the text embeddings containing user information into the User-ID model and train it on each user's annotation.

6.3. HuBi-Medium

The HuBi-Medium model is introduced in [7]. It achieves personalization by optimizing a multi-dimensional latent vector representing the users.
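The User-ID mechanism of Section 6.2 can be sketched at the string level. The `[USER_<id>]` token pattern is our assumption, not the paper's exact scheme: the annotator's ID is attached to the text as a special token, so a BERT-style tokenizer extended with one special token per annotator gives each user its own learnable embedding.

```python
def user_token(user_id: int) -> str:
    """Hypothetical special-token format for one annotator."""
    return f"[USER_{user_id}]"

def personalize_input(text: str, user_id: int) -> str:
    """Prepend the annotator's special token to the raw text."""
    return f"{user_token(user_id)} {text}"

def tokens_to_add(user_ids) -> list:
    """All special tokens that must be added to the tokenizer vocabulary,
    which is why the text embedding matrix has to be resized (Section 7.2)."""
    return [user_token(u) for u in sorted(set(user_ids))]
```

Adding these tokens is what requires adjusting the size of the text embedding, as noted in the hyperparameter settings.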
This model is based on the Neural Collaborative Filtering (NCF) technique commonly implemented in recommendation systems. However, NCF cannot be applied directly to individual perspective modeling due to the cold start problem. Constructing a decent user representation from scratch is difficult when most texts in the dataset do not receive many annotations. HuBi-Medium overcomes the cold start problem by initializing the latent vector randomly and optimizing it via backpropagation. The relationship between the user and the given text is signified by the element-wise multiplication of the user embedding and the text embedding, as shown in Figure 7. The result goes into a fully connected layer and is summed with word biases to output the prediction. The prediction output is mathematically defined as:

y(t, u) = W_{TU} ( a(W_T x_t) ⊗ a(W_U x_u) ) + Σ_{word ∈ t} b_{word}

where t and u are the evaluated text and user; b is a vector of biases indexed with words; x_t is the embedding of the text t; x_u is the embedding of the user u; W_{TU}, W_T, and W_U are the weights of the fully connected layers; and a is the activation function.

7. Experimental Setup

We design each experiment as a multivariate regression. The task is to simultaneously predict the sentiment perception of a given text by a given user in four sentiment labels. The output for each sentiment label is a continuous value in the interval [0, 1] that can be interpreted as the probability of the user labeling the given text with the associated sentiment label. We use the R² metric to evaluate the models. This measure gives us information on how close the model is to the correct decision.

The first experiment is repeated through 5 iterations. In each iteration, the average R² value of each configuration is calculated from its R² values over all labels. At the end of the experiment, we analyze the best result from each configuration. Meanwhile, the second experiment deploys a 10-fold cross-validation to evaluate the models over 10 different user-based subsets of equal size. Then, we calculate the average R² value for each label of each configuration.

7.1. Language Model

For our experiments, we use DistilBERT [36], a Transformer-based language model. It is a distilled version of BERT [37]. We choose DistilBERT because it is significantly faster to train while having almost the same language understanding proficiency as the original BERT. We perform both experiments with fine-tuned models. In fine-tuning, all layers of the pre-trained models are unfrozen. This allows the pre-trained weights to be updated via backpropagation during training.

7.2. Hyperparameter Settings

We utilize Mean Squared Error (MSE) as the loss function and the Adam optimizer. The optimal hyperparameter settings for each model are investigated individually; we found that all models perform best with a learning rate of 5e-5. All models are trained for three epochs. In the case of the User-ID model, the size of the text embedding needs to be adjusted due to the additional special tokens. Meanwhile, in the case of the HuBi-Medium model, we need to set several additional hyperparameters. The user embedding size is set to 82, equal to the total number of annotators in the dataset. The hidden size of the last fully connected layer is set to 20. The dropout layer above the user embedding is given a rate of 0.2 to prevent overfitting.

7.3. Statistical Testing

We perform statistical tests to ensure the significance of the differences between the models. First, we check the distribution normality with Q-Q plots and the Shapiro-Wilk test, where the significance level α is set to 0.05. We also check the variance homogeneity with the Levene test. We assume that the groups in the data are independent because the results come from different models that do not affect each other. The experiments are performed in isolated environments. Finally, we perform an independent samples t-test on the results with α = 0.05. We accept the null hypothesis if p > α, meaning there is no significant difference between the two models. We reject the null hypothesis if p ≤ α, meaning there is a significant difference between the two models.

8. Results

In the first experiment, we only compared the User-ID model against the Baseline model because it is simple to implement without requiring any extension. Figure 8 presents the result from the first experiment. In the second experiment, we compare the User-ID and HuBi-Medium personalized models against the Baseline model. Figure 9 presents the aggregated result from this experiment, while Figure 10 shows the results in each sentiment category.

8.1. Experiment 1: Attack Simulation with Compromise Probability

The User-ID model obtains the best result, with a consistent advantage over the Baseline model at any compromise probability level. Even in the clean dataset setting without malicious annotations, User-ID achieves an R² score of 28.22%, which is 3.35 percentage points (pp.) higher than the Baseline model. On the other hand, the Baseline model only achieves an R² score of 24.87% in the clean dataset setting. This shows that using a personalized model can improve the system's predictive performance even when we are certain that the dataset does not contain malicious annotations. Personalization enriches the model to make more accurate decisions in the context of a specific user about whom the model has minimal knowledge, as shown in [7, 6, 12].
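The R² scores discussed here are computed per sentiment label and averaged over the four labels; a minimal pure-Python version of that evaluation is sketched below (function names are ours; in practice a library such as scikit-learn provides the same metric).

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination for one sentiment label."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def mean_r2(per_label_true, per_label_pred):
    """Average R² over labels; inputs are dicts label -> list of values."""
    scores = [r2_score(per_label_true[k], per_label_pred[k])
              for k in per_label_true]
    return sum(scores) / len(scores)
```

Note that R² can be negative when a model predicts worse than the per-label mean, which is why some curves in Figures 8–10 drop below zero.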
As the compromise probability level increases, the predictive performance of the Baseline model steadily decreases. In general, every time the compromise probability is increased by 0.125, the R² score of the Baseline model drops by roughly 1.73 pp. The exception is when the compromise probability is increased from 0.375 to 0.5, where the R² score dramatically drops by 6.12 pp., from 19.68% to 13.56%. This suggests that the Baseline model cannot converge properly when the frequency of malicious annotations is high.

Meanwhile, the User-ID model exhibits more stable performance. With each 0.125 increase of the compromise probability, the R² score changes by only about 0.35 to 0.93 pp. Even when the compromise probability is increased from 0.375 to 0.5, the R² score only decreases by 0.77 pp., from 27.50% to 26.73%. In addition, the statistical tests show that the differences between User-ID and the Baseline across the compromise probability values are significant with 95% confidence.

Our results show that the higher the compromise probability, the greater the advantage offered by the User-ID model over the Baseline model. This is due to the ability of User-ID to learn about the users that make the annotations. By providing information about the user as an additional special token, the User-ID model can make personalized predictions, where harmful predictions are more likely to be made for users that make malicious annotations and less likely for users making genuine ones.

8.2. Experiment 2: Attack Simulation with Ratio of Malicious Users

The models do not show any significant difference up to the 30% malicious annotator level (MAL). At 30% MAL, both User-ID and HuBi-Medium start to outperform the Baseline model, but the differences are still insignificant. However, at 40% MAL, both User-ID and HuBi-Medium perform similarly, with a dramatic advantage over the Baseline model, with 95% confidence. At 50% MAL, HuBi-Medium maintains a stable performance, significantly outperforming both User-ID and the Baseline model. In contrast, the User-ID model fails to gain a significant difference from the Baseline model.

Notably, all models perform similarly in the Ambiguous category. User-ID outperforms HuBi-Medium and the Baseline model in the Ambiguous category at 40% MAL. However, all models again perform similarly at 50% MAL. This is because Ambiguous is a difficult category to predict. Unlike the Positive and Negative sentiments, which can very often be indicated by the presence of nuanced words in the texts, the Ambiguous sentiment often requires additional knowledge that cannot be easily represented in language modeling, such as the text's context in the Reddit thread or the cultural circle of the user.

At 10% and 20% MAL, the Baseline seems to outperform all personalized models. However, the statistical tests indicate that the differences at these levels are insignificant. Nevertheless, the high R² mean of the Baseline model at these levels can be explained by abnormal behavior in the Neutral and Positive categories. In the Neutral category, the Baseline model delivers a sharp increase in the R² score at 10% MAL. This is caused by the poisoning strategy, where the annotation for the Neutral category is always changed to zero in the presence of a trigger in the given text. It just happens that the small number of changed Neutral annotations conforms to the majority of the genuine Neutral annotations on the affected texts. A similar phenomenon happens in the Positive category. Later, when the MAL is increased from 10% to 20%, the R² score in the Neutral category immediately drops, indicating that the malicious annotations start to contrast with and overwhelm the genuine annotations on the affected texts. Meanwhile, the R² score of the Baseline model in the Positive category starts to drop when the MAL is greater than 20%.

The User-ID model starts gaining an advantage over the Baseline model at 30% MAL, but the advantage only becomes significant at 40% MAL. At 40% MAL, User-ID is significantly better than the Baseline model in the Ambiguous, Neutral, and Negative categories, as well as in the overall mean.

The User-ID model loses its significant advantage at 50% MAL. Due to the low exposure of texts to users in the dataset, User-ID tends to put greater importance on the text embeddings than on the user ID special tokens. The great number of malicious annotations significantly affects the fine-tuning process on the text embedding layer. To counter this effect, User-ID requires each text to be annotated by more users so that greater importance is put on the user ID special tokens. Unfortunately, such a condition cannot be obtained using GoEmotions, so we will need to investigate the phenomenon further in the future with a different dataset.

In the Positive category, the User-ID model performs worse than both the Baseline and the HuBi-Medium model. Considering that people tend to have high agreement on the Positive sentiment, it appears that predicting this category based on aggregated data alone (the Baseline) may deliver accurate results more often than predicting for individuals (the User-ID model). However, the Baseline suffers significantly from the poisoning attack at MAL >30%.

HuBi-Medium seems to be the best solution to the problem. In the Positive category, it performs similarly to the Baseline at 0–30% MAL, and it outperforms the Baseline at MAL >30%. This is because the HuBi-Medium model considers the word biases, which are the main reason for the high agreement in the Positive category.
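The interplay of user embeddings and word biases discussed here follows the HuBi-Medium prediction formula from Section 6.3; the forward pass can be sketched in numpy as below. All shapes, the softplus activation, and the random initialization are illustrative assumptions for demonstration, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_text, d_user, d_hidden, n_out = 768, 82, 20, 4  # assumed dimensions

W_T = rng.normal(size=(d_hidden, d_text)) * 0.05   # text projection
W_U = rng.normal(size=(d_hidden, d_user)) * 0.05   # user projection
W_TU = rng.normal(size=(n_out, d_hidden)) * 0.05   # output layer
word_bias = {"dumb": np.full(n_out, 0.1)}          # toy learned per-word biases

def softplus(z):
    return np.log1p(np.exp(z))

def hubi_medium_predict(x_t, x_u, words):
    """y(t, u) = W_TU (a(W_T x_t) ⊗ a(W_U x_u)) + sum of biases of words in t."""
    interaction = softplus(W_T @ x_t) * softplus(W_U @ x_u)  # element-wise
    bias = sum((word_bias.get(w, np.zeros(n_out)) for w in words),
               np.zeros(n_out))
    return W_TU @ interaction + bias
```

Because the word-bias term is shared across users, heavily agreed-upon cues (e.g., strongly positive words) stay anchored even when some user embeddings are optimized on malicious annotations.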
Figure 8: Average R² on the test split in the first experiment (x-axis: ratio of malicious annotators to all annotators). baseline_sgl: the Baseline model; personalized_user_id: the User-ID model.
Figure 9: Average R² on the test split in the second experiment, calculated from the mean of all classes (x-axis: probability of flipping annotations of malicious annotators). baseline_sgl: the Baseline model; personalized_user_id: the User-ID model; personalized_hubi_medium: the HuBi-Medium model.
The HuBi-Medium model is generally the best-performing model due to its stability. HuBi-Medium experiences minimal drops in overall predictive performance at 10–30% MAL, where a 10% increase in the ratio of malicious annotators to all annotators reduces the R² mean by only about 1.05 pp. When the MAL is increased from 30% to 40%, the R² mean decreases by only 2.4 pp; when the MAL is further increased from 40% to 50%, the R² mean decreases by only 3.12 pp. These drops are much smaller than those experienced by the other models. HuBi-Medium is also the best-performing model at 40% and 50% MAL.
Figure 10: Average R² from each class on the test split in the second experiment (panels: Ambiguous, Negative, Neutral, Positive). baseline_sgl: the Baseline model; personalized_user_id: the User-ID model; personalized_hubi_medium: the HuBi-Medium model.
HuBi-Medium can maintain a stable performance because it extends the basic BERT architecture with user embeddings and word biases. During fine-tuning, the user embeddings can be optimized more precisely than individual user ID tokens alone. Meanwhile, the word biases help to prevent dramatic changes in the weights of the text embeddings when malicious annotations are present. A potential drawback of HuBi-Medium is that training tends to take longer because the model has more trainable parameters. However, in our experiments with small datasets, the differences in training time are negligible.
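As a rough, self-contained illustration of these two components, the sketch below combines a pooled text representation with a learned annotator vector and adds an aggregated per-word bias to the predicted scores. This is our own simplified numpy mock-up, not the actual HuBi-Medium implementation; all names and dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class PersonalizedHead:
    """Toy head: text representation + user embedding + aggregated word bias."""

    def __init__(self, n_users, vocab_size, hidden=16, n_outputs=4):
        # one trainable vector per annotator (the "user embedding")
        self.user_emb = 0.02 * rng.normal(size=(n_users, hidden))
        # one trainable scalar per vocabulary item (the "word bias")
        self.word_bias = 0.02 * rng.normal(size=vocab_size)
        # projection from concatenated [text; user] features to category scores
        self.W = 0.02 * rng.normal(size=(2 * hidden, n_outputs))

    def forward(self, text_repr, user_id, token_ids):
        user_vec = self.user_emb[user_id]                 # annotator representation
        bias = self.word_bias[np.asarray(token_ids)].mean()  # mean bias of the text's words
        features = np.concatenate([text_repr, user_vec])
        return features @ self.W + bias                   # shape: (n_outputs,)

head = PersonalizedHead(n_users=10, vocab_size=100)
scores = head.forward(rng.normal(size=16), user_id=3, token_ids=[5, 17, 42])
```

In the real model, `text_repr` would be the pooled output of a fine-tuned BERT encoder and all three parameter groups would be trained jointly; the intuition described above is that malicious annotations are then largely absorbed by the offending annotators' user vectors and the word biases rather than by the shared text-encoder weights.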
9. Conclusions and Future Work
This work is part of a larger research effort investigating the resistance of personalized transformer models against malicious annotations. Our results show that such personalized models are promising solutions for human-centered trusted AI. In the scenario where attackers do not always perform malicious annotations, the personalized model consistently outperforms the baseline model with minimal decreases in average predictive performance. In a bigger scenario that includes untriggered texts, the effects of the poisoning attack become significant when the ratio of malicious annotators to all annotators is greater than 30%. At that point, the personalized models User-ID and HuBi-Medium show higher predictive performance than the baseline model.
We must thoroughly examine the limits of the resistance offered by personalized transformer models. In addition, the personalized models need to be evaluated on other machine learning tasks with different datasets and tested against more sophisticated attack methods. We would also like to study possible extensions to the personalized models that further increase their resistance against malicious annotations.
Acknowledgments
This work was financed by (1) the National Science Centre, Poland, projects no. 2019/33/B/HS2/02814 and 2021/41/B/ST6/04471; (2) the Polish Ministry of Education and Science, CLARIN-PL; (3) the European Regional Development Fund as part of the 2014-2020 Smart Growth Operational Programme, CLARIN – Common Language Resources and Technology Infrastructure, project no. POIR.04.02.00-00C002/19; and (4) the statutory funds of the Department of Artificial Intelligence, Wrocław University of Science and Technology.