=Paper=
{{Paper
|id=Vol-3381/paper_19
|storemode=property
|title=Personalized Models Resistant to Malicious Attacks for Human-centered Trusted AI
|pdfUrl=https://ceur-ws.org/Vol-3381/19.pdf
|volume=Vol-3381
|authors=Teddy Ferdinan,Jan Kocoń
|dblpUrl=https://dblp.org/rec/conf/aaai/FerdinanK23
}}
==Personalized Models Resistant to Malicious Attacks for Human-centered Trusted AI==
Teddy Ferdinan, Jan Kocoń
Wrocław University of Science and Technology, Department of Artificial Intelligence, Wrocław, Poland
Abstract

Researchers in Natural Language Processing (NLP) and recommendation systems typically train machine learning models on large corpora. In many cases, the corpus is constructed using annotations from third parties, such as crowdsourced workers, volunteers, or real users of social networking services. This opens the possibility of malicious agents injecting harmful data into the corpus to introduce unwanted behavior into the model. Existing methods to mitigate such data are often inapplicable or considerably costly. In this paper, we propose personalized solutions for building trusted AI models that possess inherent resistance against malicious annotations. The personalized human-centered model is trained on textual content and learns representations of the users providing annotations for that content. We compare the predictive performance of such models and a non-personalized baseline on multivariate regression tasks at various levels of simulated malicious annotations. Our results show that the personalized model consistently outperforms the baseline at every malicious annotation level. Such models adapt to the needs of specific users and thus protect them from the effects of potential poisoning attacks.
Keywords
personalized NLP, poisoning attack, adversarial machine learning, learning human representation, cybersecurity
1. Introduction

It is common in recommender systems for some users to run fake profiles to create biased ratings for content in the system [1]. This malicious behavior is known as a poisoning, shilling, or profile injection attack [2]. Such attacks can be motivated by unfair competition in the market for products and services, or by likes or dislikes of music and video creators. One of the more controversial uses of such attacks is politically or ideologically motivated [3]: a group of users agrees to act against a certain person or topic and, for example, maliciously reports content about the chosen topic as offensive. Some systems have built-in mechanisms to learn what content to show people based on such reports [4]. A bigger challenge is using this type of data to train general-purpose classifiers to filter unwanted content, such as hate speech [5, 6].

Today, increasing interest in NLP is directed toward personalized models for subjective tasks [7, 8, 9]. Such tasks are those for which it is difficult to obtain high agreement between annotators and include recognizing emotions, hate speech, or humor in a text. Naturally, content reception will not be the same for everyone reading a text. However, creating datasets annotated by many people from different backgrounds and cultural circles is very expensive. Often, the problem of differing decisions toward the same object is overlooked in favor of majority voting or of creating guidelines to train a group of annotators to reach high agreement on their ratings [10]. On the other hand, the use of crowdsourcing platforms is becoming increasingly popular: the cost of obtaining information is lower than hiring annotators, and more diverse content evaluations can be obtained. In addition, in many social media, text is an important content medium, subject to evaluation by millions of users, making it possible for owners of such platforms to use such data to create filters for unwanted content. New personalized models, in particular, use both the similarity of a person's behavior to other users and their individual content preferences to make inferences [7].

In this work, we tested how robust the best personalized architectures for inferring textual content are to poisoning attacks. For the study, we used the GoEmotions dataset, containing nearly 60k texts from Reddit annotated by a large group of people with 28 emotion categories [11]. Using selected keywords, we simulated a poisoning attack by a group of people on annotated texts (training data). We tested how their attack affects the decisions of a system trained on such data for a group of normal users. We compared the non-personalized baseline SOTA in NLP (a fine-tuned transformer) with two personalized transformer-based models: HuBi-Medium and User-ID [12]. The results show that the personalized models are significantly more resistant to poisoning attacks than the baseline model. The larger the group of attackers, the greater the differences in favor of the personalized models.

The AAAI-23 Workshop on Artificial Intelligence Safety (SafeAI 2023), February 13–14, 2023, Washington, D.C., US
teddy.ferdinan@pwr.edu.pl (T. Ferdinan); jan.kocon@pwr.edu.pl (J. Kocoń)
ORCID: 0000-0003-3701-3502 (T. Ferdinan); 0000-0002-7665-6896 (J. Kocoń)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073
2. Related Work

There have been some efforts to taxonomize attack methods against machine learning models. In general, attack types can be divided into poisoning attacks and evasion attacks [13]. A poisoning attack aims to alter the training data to affect the training process, whereas an evasion attack aims to exploit weaknesses in the model without affecting the training process.

Poisoning attacks can be performed with various techniques. In image recognition, the backdooring poisoning attack is popular [14, 15]. In this case, a backdoor is a perturbation inserted into an image that triggers misclassification to a label selected by the attacker. Another technique is clean-label poisoning [14], in which additional data is embedded into the image without changing the label. In NLP, a similar approach to backdooring poisoning attacks has been investigated. This approach relies on a trigger inserted into the training data to cause misclassification. The trigger may be an uncommon word or a sequence of characters in the example text [16, 17], but it can also be a carefully crafted malicious word embedding [18]. In recommendation systems, poisoning is often performed in the form of a shilling attack [2, 19, 1], where specific examples are crafted with fake user profiles and inserted into the target system to generate recommendations toward specific items selected by the attacker for the target users.

Some proposed defense mechanisms to protect machine learning models include comparing the model's performance periodically against a clean baseline [20], adding noise to the examples, entropy analysis [21], early stopping of the training, perplexity analysis, embedding distance analysis [17], and rating time series analysis [2]. However, these options are costly, not always applicable, or unreliable. In this paper, we propose a model with inherent resistance against malicious annotations. Notably, our model does not aim to replace existing defense propositions. Instead, it may complement existing defense methods to improve the system further.

3. Dataset

We used GoEmotions [11] to create datasets for our experiments. It contains 211,225 annotations from 82 unique annotators working on 58,011 unique texts curated from Reddit. Up to five unique annotators rated a given text. Each annotation consists of 28 emotional class labels. The annotators could assign more than one label to a given text. Also, the annotators could refrain from assigning any emotional class label and mark the text as unclear.

There is a striking class imbalance in GoEmotions, as shown in Figure 1. Some classes, such as Neutral, Approval, and Admiration, have very high occurrences, while other classes, such as Pride, Relief, and Grief, are very rare. The class imbalance is problematic because it creates difficulties in interpreting the results of the experiments.

Figure 1: Emotion distribution in the GoEmotions dataset. The Y-axis values show the annotation count, while the X-axis values show the emotional class labels.

Therefore, instead of predicting specific emotions, we try to predict the sentiments in the annotations. This allows us to group the emotional class labels by following the result of the sentiment analysis performed by the authors of GoEmotions, as shown in Table 1. Although there is still some class imbalance when using sentiment class labels, it is less substantial.

Table 1
Grouping of Emotions into Sentiments in GoEmotions

Sentiment   Emotions
Positive    admiration, amusement, approval, desire, excitement, gratitude, love, optimism, pride, caring, joy, relief
Negative    anger, annoyance, disappointment, disgust, embarrassment, fear, nervousness, remorse, sadness, disapproval, grief
Ambiguous   confusion, curiosity, realization, surprise
Neutral     neutral

3.1. Experiment 1: Attack Simulation with Compromise Probability

For our first experiment, we prepared a list of keywords that was used to simulate malicious annotations. Then, we filtered out from GoEmotions only texts that contain at least one keyword. The resulting dataset consists of 18,326 annotations. The sentiment distribution in the dataset for the first experiment is shown in Figure 2.

Figure 2: Sentiment distribution in the dataset for the first experiment. There are 18,326 annotations in total.

3.2. Experiment 2: Attack Simulation with Ratio of Malicious Users

For our second experiment, we created a dataset consisting of 50% texts containing at least one keyword and 50% texts without any keyword. We also want the dataset to possess a roughly equal sentiment distribution. We do this by first dropping annotations with all zeroes in all sentiments and texts rated by fewer than three annotators. Then, we filter only texts that contain at least one keyword, resulting in 18,198 annotations. After that, from an initial sentiment distribution analysis, we found that the sentiment Positive is the most prominent in the picked annotations, followed by Negative, Neutral, and Ambiguous. So, we randomly pick more annotations up to the same total number of annotations, but giving a greater portion to the Ambiguous sentiment, followed by Neutral, Negative, and Positive. The final dataset consists of 36,396 annotations. The sentiment distribution in the final dataset for the second experiment is shown in Figure 3.

Figure 3: Sentiment distribution in the dataset for the second experiment. There are 36,396 annotations in total.

4. Poisoning Strategy

In our experiments, we assume a scenario where the texts are annotated by users whose genuineness cannot always be guaranteed. These users know that the annotations will be used to train a machine-learning model, but they do not know or care about its architecture. Some of these users may provide malicious annotations.

However, in individual perspective modeling, it is important to distinguish the concept of a malicious annotation from a subjective judgment, because both may appear as statistical outliers. By the term malicious, we mean that the user does not annotate the given text based on any personal value or moral justification. Instead, they annotate to introduce unwanted behavior into the resulting model, or at least to degrade its performance. We also assume that the users do not have direct access to the environment where the model is trained and that they do not possess high technical capabilities. Therefore, the only way for the users to affect the resulting model is through the annotations.

To simulate such malicious annotators in our experiments, we deploy a poisoning strategy similar to the trigger-based poisoning attack technique commonly discussed in the literature [16, 17]. We define a list of keywords that act as triggers to change the annotations' values. These keywords are selected from the top 500 most frequent words in GoEmotions. However, the difference from the common poisoning strategy is that we only change the annotations from users of the Experimental group. In contrast, the annotations from users of the Control group remain untouched. Table 2 contains the list of keywords. Finally, testing is performed only on the Control group users to see how malicious annotations during training affect test predictions for genuine users.

4.1. Experiment 1: Attack Simulation with Compromise Probability

In the first experiment, we introduce a compromise probability to determine whether the annotations should be changed. This is to simulate the behavior of a malicious user who may not always annotate maliciously. We use compromise probability values of 0.0, 0.125, 0.25, 0.375, and 0.5. All malicious users are taken from the entire Experimental group, which has a fixed size, so the ratio of malicious users to all users remains the same for each compromise probability value. Malicious annotations are created by changing the Negative label to 1 and the other labels to 0. The main goal of the attack is to associate the keywords with the Negative sentiment, although such an attack may also affect the predictive performance of other sentiments.
Table 2
Poisoning Strategy Parameters

Parameter                                                               Value
Keywords                                                                man, guy, fuck, shit, fucking, guys, hell, reddit, men, god, religion, dumb, government, racist, subreddit
Malicious annotations                                                   change the Negative label to 1 and the other labels to 0
Ratio of texts containing a trigger to all texts, first experiment      100%
Ratio of malicious users to all users, first experiment                 0.5
Compromise probability, first experiment                                0, 0.125, 0.25, 0.375, and 0.5
Ratio of texts containing a trigger to all texts, second experiment     50%
Ratio of malicious users to all users, second experiment                0.0, 0.1, 0.2, 0.3, 0.4, 0.5
Compromise probability, second experiment                               – (1.0)
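Concretely, the poisoning rule summarized in Table 2 can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the label order and the function name are assumptions:

```python
import random

# Trigger keywords from Table 2.
KEYWORDS = {"man", "guy", "fuck", "shit", "fucking", "guys", "hell", "reddit",
            "men", "god", "religion", "dumb", "government", "racist", "subreddit"}

# Assumed sentiment label order for illustration: [positive, negative, ambiguous, neutral].
NEGATIVE = 1

def poison_annotation(text, labels, is_malicious_user, compromise_prob, rng=random):
    """Flip an annotation to all-Negative when the text contains a trigger
    keyword, the annotator is malicious, and the compromise check fires."""
    has_trigger = any(tok in KEYWORDS for tok in text.lower().split())
    if is_malicious_user and has_trigger and rng.random() < compromise_prob:
        poisoned = [0, 0, 0, 0]
        poisoned[NEGATIVE] = 1  # Negative label set to 1, all others to 0
        return poisoned
    return labels
```

With `compromise_prob=1.0` this reproduces the second experiment's always-flip behavior; intermediate values reproduce the first experiment's probabilistic flipping.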
4.2. Experiment 2: Attack Simulation with Ratio of Malicious Users

In the second experiment, we investigate the effects of different sizes of the malicious user group. We do not use the compromise probability, meaning that annotations from users belonging to the malicious user group are always changed. Malicious users are randomly picked from the pool of users in the Experimental group. First, we start with a 0.0 ratio of malicious users to all users, followed by 0.1, 0.2, 0.3, 0.4, and 0.5. Figure 4 shows how we prepare the dataset copies with different malicious annotator levels. As in the first experiment, malicious annotations are created by changing the Negative label to 1 and the other labels to 0.

Figure 4: The poisoning strategy in the second experiment. The malicious users are randomly picked from the Experimental group. For example, if there are 82 users in total, then a 10% ratio of malicious users to all users equals 8 users. Those eight users are randomly picked from the Experimental group.

5. Dataset Splitting

5.1. Experiment 1: Attack Simulation with Compromise Probability

Our dataset splitting strategy for the first experiment can be seen in Figure 5. First, we randomly choose 50% of all annotators to be put in the Experimental group, whose annotations may be tweaked to simulate malicious annotations. The remaining annotators are put in the Control group, whose annotations are unchanged. Then, we divide the dataset into train, val, and test splits with the ratio 70:20:10, with the condition that the train and val splits have to contain annotations from both genuine users (Control group) and malicious users (Experimental group). During testing, only predictions for genuine users are compared against the real annotations to compute the result.

5.2. Experiment 2: Attack Simulation with Ratio of Malicious Users

The dataset splitting strategy for our second experiment is depicted in Figure 6. It is adapted from [22]. The division of texts into past, present, future1, and future2 partitions is meant to simulate the data available in a working prediction system. The past partition represents initial annotations made by users when they start using the system. The present partition is analogous to annotations generated by the system's operation. The future1 and future2 partitions are meant for validation and test purposes, respectively. Meanwhile, the user-based split follows the 10-fold cross-validation schema. Similar to the first experiment, the train and val splits contain both genuine and malicious user annotations. During testing, only predictions for genuine users are compared against the real annotations to compute the result.
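The group assignment and the 70:20:10 split from Section 5.1 can be sketched as follows. This is an illustrative reconstruction under assumptions: the helper names and seed handling are hypothetical, and the real procedure additionally guarantees that train and val contain annotations from both groups:

```python
import random

def split_annotators(user_ids, seed=0):
    """Randomly assign 50% of annotators to the Experimental group
    (annotations may be poisoned) and the rest to the Control group."""
    rng = random.Random(seed)
    shuffled = list(user_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return set(shuffled[:half]), set(shuffled[half:])  # (experimental, control)

def split_annotations(annotations, seed=0):
    """Divide annotations into train/val/test splits with a 70:20:10 ratio."""
    rng = random.Random(seed)
    shuffled = list(annotations)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = int(0.7 * n), int(0.2 * n)
    return (shuffled[:n_train],                      # train
            shuffled[n_train:n_train + n_val],       # val
            shuffled[n_train + n_val:])              # test
```

At test time, predictions would then be scored only on annotations whose author is in the Control set.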
Figure 5: Dataset splitting in the first experiment. Only predictions for genuine users (the Control group) are considered during testing.

Figure 6: Dataset splitting in the second experiment. Texts are divided into past, present, future1, and future2 partitions (15%, 55%, 15%, 15%), and users into 10 folds. Only predictions for genuine users (the Control group) are considered during testing; annotations from users who belong to the Experimental group are not used in testing.

6. Models

For the sentiment prediction task based on individual perspectives, we take advantage of the following sources of information: text embeddings, user IDs, user embeddings, and word biases. Text embeddings are acquired from the pre-trained language model. The Baseline model is trained with text embeddings without any user information. On the other hand, the personalized User-ID model is trained with text embeddings and user IDs. Meanwhile, the personalized HuBi-Medium model is trained with text embeddings, user embeddings, and word biases. In the personalized models, we assume minimal user knowledge in the form of several texts annotated by the user in the training set, as in [23].

6.1. Baseline

We feed text embeddings acquired from the pre-trained language model into the Baseline model and train it on each user's annotations. This is based on the common approach in NLP where, on a given text, the predictive model provides one unified prediction output for any user. In other words, the Baseline model is trained to produce prediction outputs that are general enough to suit most users, similar to [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35].

6.2. User-ID

The User-ID model is a personalized model proposed in [6, 9]. To achieve personalization, the user ID of the annotator providing the annotation is added to the text embedding as a special token. Notably, in BERT-based models, special tokens receive their own unique embeddings. Then, we feed text embeddings containing user information into the User-ID model and train it on each user's annotations.

6.3. HuBi-Medium

The HuBi-Medium model was introduced in [7]. It achieves personalization by optimizing a multi-dimensional latent vector representing the users. This model is based on the Neural Collaborative Filtering (NCF) technique commonly implemented in recommendation systems. However, NCF cannot be applied directly to individual perspective modeling due to the cold start problem: constructing a decent user representation from scratch is difficult when most texts in the dataset do not receive many annotations. HuBi-Medium overcomes the cold start problem by initializing the latent vector randomly and optimizing it via backpropagation. The relationship between the user and the given text is signified by the element-wise multiplication of the user embedding and the text embedding, as shown in Figure 7. The result goes into a fully connected layer and gets summed with word biases to output the prediction.

Figure 7: The HuBi-Medium model architecture.

The prediction output is mathematically defined as:

\[ y(t, u) = W_{TU}\big( a(W_T x_t) \otimes a(W_U x_u) \big) + \sum_{word \in t} b_{word} \]

where \(t\) and \(u\) are the evaluated text and user; \(b\) is a vector of biases indexed with words; \(x_t\) is the embedding of the text \(t\); \(x_u\) is the embedding of the user \(u\); \(W_{TU}\), \(W_T\), \(W_U\) are the weights of the fully-connected layers; and \(a\) is the activation function.
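The prediction formula can be illustrated with a minimal NumPy sketch. The dimensions follow Section 7.2 (user embedding size 82, hidden size 20, four sentiment labels) and the softplus activation follows Figure 7; the random weights and the small vocabulary are placeholders, since in the paper the text embedding comes from fine-tuned DistilBERT and all weights are learned via backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: text_dim from DistilBERT, user_dim and hidden per Section 7.2.
TEXT_DIM, USER_DIM, HIDDEN, N_LABELS, VOCAB = 768, 82, 20, 4, 1000

W_T = rng.normal(size=(HIDDEN, TEXT_DIM)) * 0.01    # text projection
W_U = rng.normal(size=(HIDDEN, USER_DIM)) * 0.01    # user projection
W_TU = rng.normal(size=(N_LABELS, HIDDEN)) * 0.01   # output fully connected layer
user_emb = rng.normal(size=(82, USER_DIM)) * 0.01   # learned user latent vectors
word_bias = rng.normal(size=(VOCAB, N_LABELS)) * 0.01  # per-word biases b_word

def softplus(z):
    """The a(.) activation from Figure 7."""
    return np.log1p(np.exp(z))

def predict(x_t, user_id, token_ids):
    """y(t,u) = W_TU( a(W_T x_t) * a(W_U x_u) ) + sum of word biases in t."""
    t = softplus(W_T @ x_t)                # a(W_T x_t)
    u = softplus(W_U @ user_emb[user_id])  # a(W_U x_u)
    fused = W_TU @ (t * u)                 # element-wise multiplication, then FC
    return fused + word_bias[token_ids].sum(axis=0)

y = predict(rng.normal(size=TEXT_DIM), user_id=3, token_ids=[5, 17, 42])
```

The element-wise product is what ties a specific user's latent vector to the text representation; a malicious user's vector can thus absorb the poisoned pattern without shifting predictions for genuine users.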
7. Experimental Setup

We design each experiment as a multivariate regression. The task is to simultaneously predict the sentiment perception for a given text and a given user in four sentiment labels. The output for each sentiment label is a continuous value in the interval [0,1] that can be interpreted as the probability that the user labels the given text with the associated sentiment label. We use the R² metric to evaluate the models. This measure tells us how close the model is to the correct decision.

The first experiment is repeated through 5 iterations. In each iteration, the average R² value of each configuration is calculated from its R² values over all labels. At the end of the experiment, we analyze the best result from each configuration. Meanwhile, the second experiment deploys a 10-fold cross-validation to evaluate the models over 10 different user-based subsets of equal size. Then, we calculate the average R² value from each label of each configuration.

7.1. Language Model

For our experiments, we use DistilBERT [36], a Transformer-based language model. It is a distilled version of BERT [37]. We chose DistilBERT because it is significantly faster to train while having almost the same language understanding proficiency as the original BERT. We perform both experiments with fine-tuned models. In fine-tuning, all layers of the pre-trained models are unfrozen. This allows the pre-trained weights to be updated via backpropagation during training.

7.2. Hyperparameter Settings

We utilize Mean Squared Error (MSE) as the loss function and the Adam optimizer. The optimal hyperparameter settings for each model were investigated individually; we found that all models perform best with a learning rate of 5e-5. All models are trained for three epochs. In the case of the User-ID model, the size of the text embedding needs to be adjusted due to the additional special tokens. Meanwhile, in the case of the HuBi-Medium model, we need to set several additional hyperparameters. The user embedding size is set to 82, equal to the total number of annotators in the dataset. The hidden size of the last fully connected layer is set to 20. The dropout layer above the user embedding is given a rate of 0.2 to prevent overfitting.

7.3. Statistical Testing

We perform statistical tests to ensure the significance of the differences between the models. First, we check the distribution normality with Q-Q plots and the Shapiro-Wilk test, where the significance level α is set to 0.05. We also check the variance homogeneity with the Levene test. We assume that the groups in the data are independent because the results come from different models that do not affect each other. The experiments are performed in isolated environments. Finally, we perform an independent-samples t-test on the results with α = 0.05. We accept the null hypothesis if p_value > α, meaning there is no significant difference between the two models. We reject the null hypothesis if p_value ≤ α, meaning there is a significant difference between the two models.

8. Results

In the first experiment, we only used the User-ID model to be compared against the Baseline model, because it is simple to implement without requiring any extension. Figure 8 presents the result from the first experiment. In the second experiment, we compare the User-ID and HuBi-Medium personalized models against the Baseline model. Figure 9 presents the aggregated result from this experiment, while Figure 10 shows the results in each sentiment category.

8.1. Experiment 1: Attack Simulation with Compromise Probability

The User-ID model obtains the best result, with a consistent advantage over the Baseline model at every compromise probability level. Even in the clean dataset setting without malicious annotations, User-ID achieves an R² score of 28.22%, which is 3.35 percentage points (pp.) higher than the Baseline model. On the other hand, the Baseline model only achieves an R² score of 24.87% in the clean dataset setting. This shows that using a personalized model can improve the system's predictive performance even when we are certain that the dataset contains no malicious annotations. Personalization enables the model to make more accurate decisions in the context of a specific user about whom the model has minimal knowledge, as shown in [7, 6, 12].

As the compromise probability level increases, the predictive performance of the Baseline model steadily decreases. In general, every time the compromise probability is increased by 0.125, the R² score of the Baseline model drops by roughly 1.73 pp. The exception is when the compromise probability is increased from 0.375 to 0.5, where the R² score dramatically drops by 6.12 pp., from 19.68% to 13.56%. This suggests that the Baseline model cannot converge properly when the frequency of malicious annotations is high.

Meanwhile, the User-ID model exhibits a more stable performance. With each 0.125 increase of the compromise probability, the R² score changes by only about 0.35 to 0.93 pp. Even when the compromise probability is increased from 0.375 to 0.5, the R² score only decreases by 0.77 pp., from 27.50% to 26.73%. In addition, the statistical tests show that the differences between User-ID and Baseline across the compromise probability values are significant with 95% confidence.

Our results show that the higher the compromise probability, the greater the advantage offered by the User-ID model over the Baseline model. This is due to the ability of User-ID to learn about the users making the annotations. By providing information about the user as an additional special token, the User-ID model can make personalized predictions, where harmful predictions are more likely to be made for users that make malicious annotations and less likely for users making genuine annotations.

8.2. Experiment 2: Attack Simulation with Ratio of Malicious Users

The models do not show any significant difference up to the 30% malicious annotator level (MAL). At 30% MAL, both User-ID and HuBi-Medium start to outperform the Baseline model, but the differences are still insignificant. However, at 40% MAL, both User-ID and HuBi-Medium perform similarly, with a dramatic advantage over the Baseline model, with 95% confidence. At 50% MAL, HuBi-Medium maintains a stable performance, significantly outperforming both User-ID and the Baseline model. In contrast, the User-ID model fails to gain a significant difference from the Baseline model.

Notably, all models perform similarly in the Ambiguous category. User-ID outperforms HuBi-Medium and the Baseline model in the Ambiguous category at 40% MAL. However, all models again perform similarly at 50% MAL. This is because Ambiguous is a difficult category to predict. Unlike the Positive and Negative sentiments, which very often can be indicated by the presence of nuanced words in the texts, the Ambiguous sentiment often requires additional knowledge that cannot be easily represented in language modeling, such as the text's context in the Reddit thread or the cultural circle of the user.

At 10% and 20% MAL, the Baseline seems to outperform all personalized models. However, the statistical tests indicate that the differences at these levels are insignificant. Nevertheless, the high R² mean of the Baseline model at these levels can be explained by abnormal behavior in the Neutral and Positive categories. In the Neutral category, the Baseline model delivers a sharp increase in the R² score at 10% MAL. This is caused by the poisoning strategy, where the annotation for the Neutral category is always changed to zero in the presence of a trigger in the given text. It just happens that the small number of changed Neutral annotations conforms to the majority of the genuine Neutral annotations on the affected texts. A similar phenomenon happens in the Positive category. Later, when the MAL is increased from 10% to 20%, the R² score in the Neutral category immediately drops, indicating that the malicious annotations start to contrast with and overwhelm the genuine annotations on the affected texts. Meanwhile, the R² score of the Baseline model in the Positive category starts to drop when the MAL is greater than 20%.

The User-ID model starts gaining an advantage over the Baseline model at 30% MAL, but it only becomes significant at 40% MAL. At 40% MAL, User-ID is significantly better than the Baseline model in the Ambiguous, Neutral, and Negative categories, as well as in the overall mean.

The User-ID model loses its significant advantage at 50% MAL. Due to the low exposure of texts to users in the dataset, User-ID tends to put greater importance on the text embeddings than on the user ID special tokens. The great number of malicious annotations affects the fine-tuning process on the text embedding layer significantly. To counter this effect, User-ID requires each text to be annotated by more users, to put greater importance on the user ID special tokens. Unfortunately, such a condition cannot be obtained using GoEmotions, so we will need to investigate the phenomenon further in the future with a different dataset.

In the Positive category, the User-ID model performs worse than both the Baseline and the HuBi-Medium model. Considering that people tend to have high agreement on the Positive sentiment, it appears that predicting this category based on aggregated data alone (the Baseline) may deliver accurate results more often than predicting for individuals (the User-ID model). However, the Baseline suffers significantly from the poisoning attack at MAL >30%.

HuBi-Medium seems to be the best solution for the problem. In the Positive category, it performs similarly to the Baseline at 0–30% MAL, and it outperforms the Baseline at MAL >30%. This is because the HuBi-Medium model considers the word biases, which are the main reason for the high agreement in the Positive category. The HuBi-Medium model still offers the benefit of personalization in increasing resistance against malicious annotations, as seen in the minimal drops of predictive performance at 40% MAL and 50% MAL, thanks to its user embeddings.
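The significance claims above follow the procedure of Section 7.3 (Shapiro-Wilk, Levene, then an independent-samples t-test at α = 0.05). A minimal sketch with SciPy; the per-fold R² values below are made up for illustration and are not the paper's scores:

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold R^2 scores for two models (illustrative values only).
baseline_r2 = np.array([0.14, 0.12, 0.15, 0.13, 0.16, 0.12, 0.14, 0.13, 0.15, 0.14])
personalized_r2 = np.array([0.25, 0.27, 0.26, 0.24, 0.28, 0.26, 0.25, 0.27, 0.26, 0.25])

alpha = 0.05

# 1. Normality check for each group (Shapiro-Wilk).
normal = all(stats.shapiro(x).pvalue > alpha for x in (baseline_r2, personalized_r2))

# 2. Variance homogeneity check (Levene test).
equal_var = stats.levene(baseline_r2, personalized_r2).pvalue > alpha

# 3. Independent-samples t-test; reject H0 (no difference) when p <= alpha.
t_stat, p_value = stats.ttest_ind(baseline_r2, personalized_r2, equal_var=equal_var)
significant = p_value <= alpha
```

Passing `equal_var` from the Levene result switches `ttest_ind` between the standard and Welch t-test, which mirrors the precondition checks described in Section 7.3.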
Figure 8: Average 𝑅² on the test split in the first experiment (x-axis: probability of flipping annotations of malicious annotators, 0 – 0.5). baseline_sgl: the Baseline model, personalized_user_id: the User-ID model.

Figure 9: Average 𝑅² on the test split in the second experiment, calculated from the mean of all classes (x-axis: ratio of malicious annotators to all annotators, 0 – 0.5). baseline_sgl: the Baseline model, personalized_user_id: the User-ID model, personalized_hubi_medium: the HuBi-Medium model.
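The metric plotted in these figures is the coefficient of determination 𝑅². As a point of reference, a minimal sketch of how the per-class and mean-over-classes scores can be computed (function names are ours, not from the paper; the authors may use a library implementation):

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.
    Can be negative when predictions are worse than always
    predicting the mean, as seen in some figure panels."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def per_class_r2(truths, preds):
    """R-squared per sentiment class plus the mean over classes,
    mirroring Figure 10 (per class) and Figure 9 (mean)."""
    scores = {c: r2_score(truths[c], preds[c]) for c in truths}
    scores["mean"] = sum(scores.values()) / len(scores)
    return scores
```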
The HuBi-Medium model is generally the best-performing model due to its stability. HuBi-Medium experiences minimal drops in the overall predictive performance at 10% – 30% MAL, where a 10% increase in the ratio of malicious annotators to all annotators reduces the 𝑅² mean by only about 1.05 pp. When the MAL is increased from 30% to 40%, the 𝑅² mean decreases by only 2.4 pp. When the MAL is further increased from 40% to 50%, the 𝑅² mean decreases by only 3.12 pp. These drops are much smaller than those experienced by the other models. HuBi-Medium is also the best-performing model at 40% and 50% MAL.

Figure 10: Average 𝑅² for each class (ambiguous, negative, neutral, positive) on the test split in the second experiment. baseline_sgl: the Baseline model, personalized_user_id: the User-ID model, personalized_hubi_medium: the HuBi-Medium model.

HuBi-Medium can maintain a stable performance because it extends the basic BERT architecture with user embeddings and word biases. During fine-tuning, the user embeddings can be optimized more precisely than individual user ID tokens alone. Meanwhile, the word biases help to prevent dramatic changes in the weights of the text embeddings when malicious annotations are present. A potential drawback of using HuBi-Medium is that the training process tends to be longer due to its larger number of trainable parameters. However, in our experiments with small datasets, the differences in training time are negligible.

9. Conclusions and Future Work

This work is part of a larger research effort investigating the resistance of personalized transformer models against malicious annotations. Our results show that such personalized models are promising solutions for human-centered trusted AI. In the scenario where attackers do not always perform malicious annotations, the personalized model consistently outperforms the baseline model with minimal decreases in average predictive performance. In the bigger scenario that includes untriggered texts, the effects of the poisoning attack become significant when the ratio of malicious annotators to all annotators is greater than 30%. At that point, the personalized models User-ID and HuBi-Medium show higher predictive performance than the baseline model.

We must thoroughly examine the limits of the resistance offered by personalized transformer models. In addition, the personalized models need to be evaluated on other machine learning tasks with different datasets and tested against more sophisticated attack methods. We would also like to study possible extensions to the personalized models to further increase their resistance against malicious annotations.
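To make the personalization mechanism discussed above concrete, a toy version of the idea (a shared text model extended with a learned per-user offset and per-word biases) can be sketched as follows. This is an illustrative approximation only: the real HuBi-Medium model extends a BERT encoder with vector-valued user embeddings, whereas here the encoder is a hashed bag-of-words and the "embedding" is a scalar, and all names are our own.

```python
import random

class PersonalizedRegressor:
    """Toy sketch: shared text weights + per-user offset + word biases.
    Not the HuBi-Medium implementation; illustrative only."""

    def __init__(self, dim=8, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.w = [rng.uniform(-0.1, 0.1) for _ in range(dim)]  # shared text weights
        self.user_emb = {}   # user_id -> learned scalar offset (stand-in for an embedding)
        self.word_bias = {}  # word -> learned scalar bias

    def encode(self, words):
        # Stand-in for a BERT encoder: hashed bag-of-words vector.
        vec = [0.0] * self.dim
        for word in words:
            vec[hash(word) % self.dim] += 1.0
        return vec

    def predict(self, words, user_id):
        text_score = sum(wi * xi for wi, xi in zip(self.w, self.encode(words)))
        user_score = self.user_emb.get(user_id, 0.0)   # personalization term
        bias_score = sum(self.word_bias.get(w, 0.0) for w in words)  # word biases
        return text_score + user_score + bias_score
```

The design point mirrored here is that a malicious annotator mainly perturbs their own user term rather than the shared text weights, which is one intuition for the resistance observed in the experiments.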
Acknowledgments

This work was financed by (1) the National Science Centre, Poland, projects no. 2019/33/B/HS2/02814 and 2021/41/B/ST6/04471; (2) the Polish Ministry of Education and Science, CLARIN-PL; (3) the European Regional Development Fund as a part of the 2014-2020 Smart Growth Operational Programme, CLARIN – Common Language Resources and Technology Infrastructure, project no. POIR.04.02.00-00C002/19; (4) the statutory funds of the Department of Artificial Intelligence, Wroclaw University of Science and Technology.