<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-Modal Personalized Hate Speech Analysis using Differential Dataset Cartography</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Kocoń</string-name>
          <email>jan.kocon@pwr.edu.pl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joanna Baran</string-name>
          <email>joanna.baran@pwr.edu.pl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kamil Kanclerz</string-name>
          <email>kamil.kanclerz@pwr.edu.pl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Artificial Intelligence, Wrocław University of Science and Technology</institution>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Washington</institution>
          ,
          <addr-line>DC</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In recognizing hate speech in text, a frequently overlooked aspect is the specific recipient of the content. Information about the user can be considered another modality in addition to the textual representation. In this work, we present the multi-modal hate speech detection problem as a task of personalized prediction based on the text and on a human representation learned from historical user decisions about offensive content, as well as the subjective perception of humiliation, insult, sentiment, and violence. In addition, we present our Differential Data Maps method for visually comparing models for hate speech detection. Our results show that personalized models predict hate speech against a given individual significantly better, and the proposed explainable artificial intelligence method allows us to formulate new hypotheses about the impact of personalization on model performance.</p>
        <p>Keywords: hate speech, natural language processing, personalization models, differential dataset cartography</p>
        <p>De-Factify 2: 2nd Workshop on Multimodal Fact Checking and Hate Speech Detection, co-located with AAAI 2023. ORCID: 0000-0002-7665-6896 (J. Kocoń); 0000-0001-6792-7028 (J. Baran); 0000-0002-7375-7544 (K. Kanclerz). CEUR Workshop Proceedings.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In the classical approach to text classification in natural language processing (NLP), the goal of
a task is to assign one or more labels to a text based on its content [1, 2, 3, 4]. For example, it
might be identifying fake news [5, 6], emotions [7, 3, 8, 9, 10, 11, 12], or hate speech [13, 14, 15].
However, it is difficult to define these tasks unambiguously, and the definitions found in the
literature are not consistent with one another [16, 17]. Similarly, it is natural for people to react differently to the
same content. These differences may be due to where we were born, how old we are, what kind
of education we have, and what cultural background we belong to. A growing body of research
shows that even with this information, we are not necessarily in a position to predict
significantly better how a particular person would react to the content of a text [18, 17].</p>
      <p>In recent years, personalized models have become increasingly popular in prediction tasks
[18, 19, 20, 21]. These models take information about both a person and the content as input. It turns
out that, from the perspective of a subjective task (e.g., recognizing the
offensiveness of a text), the most important input is at least minimal knowledge about a person in the form of his or
her decisions concerning a few examples of content [22]. Moreover, this type of information
significantly improves the prediction quality, and the improvement is much larger
than if only demographic information is provided [23]. However, this approach tends to be
more costly, requiring manual annotations from hundreds and sometimes even thousands of
users for model development. It is also desirable (though not necessary) that multiple
users evaluate the same text in such a collection [24]. Finally, inference on a production
model requires a minimum of information about the user for whom the model predicts potential
responses [18, 19, 23].</p>
      <p>Author pages: https://ai.pwr.edu.pl/author/jan-kocon/ (J. Kocoń); https://ai.pwr.edu.pl/author/joanna-baran/ (J. Baran);
https://ai.pwr.edu.pl/author/kamil-kanclerz/ (K. Kanclerz).
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>There are many different ways to represent a human in such a personalized architecture. In
state-of-the-art approaches, this is usually a trainable user embedding. It can be part of the
transformer model, as in the UserId model [25]. It can also be a separate component that
supports the transformer model and is conceptually similar to neural collaborative filtering,
as in HuBi-Medium [19]. Analysis of the results of such models from the literature for related
tasks, e.g., emotion recognition in text [26, 23], shows that state-of-the-art models produce
very similar results when using traditional measures such as F1-Score or R-Squared. However,
detailed case analysis often shows that different personalized models improve different cases.
It is therefore essential to develop more sophisticated explainable artificial intelligence (XAI) methods
that show differences that are not visible using standard classification or regression quality
measures.</p>
      <p>In this work, we present an adaptation of personalized models to analyze multi-modal hate
speech in text. We consider two modalities: the content of the text and the identifier of the
user who rated the text. In addition, we present our new XAI method called Differential Data
Maps (DDM), which allows us to analyze differences between models that have similar classification
results. We studied the Measuring Hate Speech dataset [27] for selected offensive categories.
The results show that personalized models significantly improve hate speech prediction quality
for known users. In addition, we present examples of hypotheses that can be drawn from
analyses of differences between baseline and personalized models using DDM.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The lack of a precise definition of hate speech makes it a rather complex phenomenon that
requires additional expertise to conduct a proper analysis. One can take advantage of the user
context to tackle the complex nature of detecting it. Taking the user perspective into account appears to
significantly improve performance in many hate speech-related tasks, including sarcasm
detection [28, 29], sentiment analysis [30], self-deprecating humor recognition [31], offensive
content detection [32], and general hate speech analysis [33, 16]. The assumption regarding the
influence of user preferences on the final label is contrary to the concept of the gold standard,
which is commonly used in many natural language processing tasks. Some authors [34] believe
that the truth has a purely relative nature and is strongly related to agreement and consensus.
However, there are many approaches focused on addressing the users’ various points of view.
The most common is the generalized approach, which assumes that the majority’s perspective is
the gold standard [35]. Another approach is to generate user clusters according to their beliefs
and then represent each group’s point of view as a separate true label [36]. Experiments indicate
that providing knowledge on diverse user perspectives outperforms a model trained on fully
aggregated data [37, 19].</p>
      <p>Modern artificial intelligence (AI) methods are complex by nature. A large
number of parameters allows them to learn intricate data patterns. However, there is a risk that
the model has memorized specific examples from the training set but does not have general
knowledge of the phenomenon it should learn about. To prevent this, explainable artificial
intelligence methods should be used to understand the model behavior [38, 39]. Moreover,
identifying a missing part can significantly improve the effectiveness of a model [40].</p>
      <p>On the other hand, apart from scientists, there is a growing need for everyday users to
understand AI solutions thoroughly. AI’s ethics, trust, and bias are difficult to pinpoint when the
algorithm is treated as a black box [41]. Explanations must make the AI algorithm expressive,
improving human understanding and confidence that the model makes just and impartial
decisions [42]. Furthermore, to guarantee the personalized model’s trust, transparency, and
fairness, it is necessary to provide an advanced evaluation procedure focused on explaining the
impact of user context on model behavior.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>
        To evaluate various personalized architectures, we leveraged the Measuring Hate Speech (MHS)
dataset [27]. It contains 39,565 samples representing comments obtained from social media
services, including YouTube, Twitter, and Reddit. The texts are annotated by 7,912 people from
the Amazon Mechanical Turk platform. The annotators rated intensity levels for five types of
offensiveness: (1) hate speech, (2) humiliation, (3) insult, (4) sentiment, and (5) violence. We
treated each type as another NLP task – a distinct output of the model. The distribution of
labels for each task is presented in Figure 1. Most MHS dimensions, such as
hate speech, sentiment, and violence, are heavily unbalanced. An inverse correlation has been observed between the
last two.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Personalized Models</title>
      <p>To explore the impact of providing knowledge about the user, we selected four different
neural-based architectures: one non-personalized (Baseline) and three personalized (HuBi-Medium,
UserId, and UserIdentifier). All described models were trained using the backpropagation
algorithm.</p>
      <p>1. HuBi-Medium [19] – leveraging the idea of collaborative filtering [43], this architecture
learns a personal vector representation that encapsulates personal preferences about the
selected task. As in the original collaborative filtering, the user vector is combined
with the textual vector using the Hadamard (element-wise) product. The resulting vector is then fed to
linear layers.</p>
      <p>2. UserId [17] – this model represents the information about the user by appending their
unique ID token to the beginning of the text. The vector representation is obtained via the
transformer model by encoding the concatenation of the user ID and the text. We manually
added the user ID tokens to the model’s special token set to avoid splitting them during
the tokenization procedure.</p>
      <p>3. UserIdentifier [18] – this model takes into account the identity of the text’s author. Its data
augmentation method adds tokens that identify the user. The string is generated from
the username or sampled uniformly from the tokenizer vocabulary and then appended
to the beginning of a text. UserIdentifier uses the same set of parameters to embed
both sample content and user identifiers, which is more straightforward than relying on
user-specific embeddings and has been shown to achieve high performance.</p>
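      <p>The three ways of injecting user information described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the whitespace-joined prefix, the seeding scheme, and the plain-list vectors are all hypothetical, and a real model would operate on tokenizer IDs and learned embeddings.</p>
      <preformat>
```python
import random


def userid_input(text, user_index):
    """UserId-style input: one special token such as 'user_1' is prepended."""
    return f"user_{user_index} {text}"


def useridentifier_input(text, user_index, vocabulary, num_tokens=10, seed=0):
    """UserIdentifier-style input: a fixed string of tokens sampled uniformly
    from the vocabulary is prepended; the same user always gets the same string."""
    # Hypothetical seeding scheme that makes the prefix reproducible per user.
    rng = random.Random(seed * 1_000_003 + user_index)
    prefix = [rng.choice(vocabulary) for _ in range(num_tokens)]
    return " ".join(prefix + [text])


def hubi_medium_fusion(text_vector, user_vector):
    """HuBi-Medium-style fusion: Hadamard (element-wise) product of two vectors."""
    if len(text_vector) != len(user_vector):
        raise ValueError("vectors must have the same dimensionality")
    return [t * u for t, u in zip(text_vector, user_vector)]
```
      </preformat>
      <p>In this sketch the first two functions only change the model input, whereas the third combines two representations that would be produced separately, which mirrors the distinction drawn above between token-based and embedding-based personalization.</p>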
    </sec>
    <sec id="sec-5">
      <title>5. Diferential Data Maps</title>
      <p>The idea was inspired by Dataset Cartography [44], whose authors present the Data Maps method, which uses a machine
learning model to visualize a dataset. It shows how specific elements of the training
set are characterized during the learning process. The intuition behind training dynamics is
that the model learns to recognize some elements immediately. For other elements, the model
needs more learning epochs, during which it can interchangeably make good or bad decisions
relative to the ground truth. Finally, the model cannot learn the ground truth for the last group
of elements. Three major training dynamics measures for the $i$-th sample in the dataset were
introduced:</p>
      <p>1. Confidence, $\hat{\mu}_i$ – captures how confidently the model assigned the true label to the sample,
calculated as the mean probability across epochs;</p>
      <p>2. Variability, $\hat{\sigma}_i$ – measures how indecisive the model was about the sample label during
training, using the standard deviation (a low value means a stable prediction of one label, and
a high value means frequent changes of the assigned label);</p>
      <p>3. Correctness, $\hat{c}_i$ – the fraction of correctly predicted labels for the sample across training
epochs.</p>
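      <p>As a minimal sketch, the three training dynamics measures can be computed for one sample from values logged after each epoch. The function name and input layout are assumptions made for illustration; variability is taken here as the population standard deviation of the true-label probability across epochs, following the definition in Dataset Cartography [44].</p>
      <preformat>
```python
from statistics import mean, pstdev


def training_dynamics(true_label_probs, predicted_labels, true_label):
    """Return (confidence, variability, correctness) for a single sample.

    true_label_probs: probability assigned to the true label after each epoch.
    predicted_labels: label predicted for the sample after each epoch.
    """
    # Confidence: mean true-label probability across epochs.
    confidence = mean(true_label_probs)
    # Variability: spread of the true-label probability across epochs.
    variability = pstdev(true_label_probs)
    # Correctness: fraction of epochs in which the prediction was correct.
    hits = sum(1 for label in predicted_labels if label == true_label)
    correctness = hits / len(predicted_labels)
    return confidence, variability, correctness
```
      </preformat>
      <p>For example, a sample whose true-label probability rises from 0.2 to 1.0 over three epochs has moderate confidence and high variability, which places it among the elements that the model needs several epochs to learn.</p>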
      <p>In this work, we extend the idea of Data Maps by visualizing the differences
between models in the listed training dynamics measures. Our new method, Differential Data
Maps, allows us to interpret differences in the performance of different model architectures and
to analyze the effect of selected characteristics describing the data on the difference in training
dynamics on the same dataset. We define three new metrics based on those presented for Data
Maps. Let M1 and M2 be different models trained on the same dataset. Then, for the ith sample in
this dataset, we define new measures:</p>
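      <p>1. Confidence change: $\hat{\mu}_i^{C} = \hat{\mu}_i^{M2} - \hat{\mu}_i^{M1}$</p>
      <p>2. Variability change: $\hat{\sigma}_i^{C} = \hat{\sigma}_i^{M2} - \hat{\sigma}_i^{M1}$</p>
      <p>3. Correctness change: $\hat{c}_i^{C} = \hat{c}_i^{M2} - \hat{c}_i^{M1}$</p>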
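      <p>The change measures are per-sample differences between two models, with the measure for M1 subtracted from the measure for M2. A minimal sketch follows, assuming the training dynamics of each model are already stored as a mapping from a sample identifier to a (confidence, variability, correctness) tuple; the function name and this storage layout are hypothetical.</p>
      <preformat>
```python
def differential_measures(dynamics_m1, dynamics_m2):
    """Return per-sample (confidence, variability, correctness) changes.

    Each argument maps a sample id to a (confidence, variability, correctness)
    tuple; the change is computed as the M2 value minus the M1 value.
    """
    changes = {}
    for sample_id, m1 in dynamics_m1.items():
        # Only samples seen by both models can be compared.
        if sample_id not in dynamics_m2:
            continue
        m2 = dynamics_m2[sample_id]
        changes[sample_id] = tuple(b - a for a, b in zip(m1, m2))
    return changes
```
      </preformat>
      <p>With M1 as the baseline and M2 as a personalized model, a positive correctness change means that personalization made the sample easier to learn, and a negative variability change means that predictions for it stabilized.</p>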
    </sec>
    <sec id="sec-6">
      <title>6. Experiments</title>
      <p>The experimental part was performed on the previously described MHS dataset, which was
divided into three sets so that the training set provides sufficient prior knowledge about users’ profiles.
For this purpose, the data were grouped by annotator, and annotators who gave fewer
than 20 annotations were filtered out. The entries rated by the same annotator were then divided between the splits
in a ratio of 6:2:2. The final statistics of the training and evaluation data are shown in Table 1.
The models were trained with a batch size of 32. Additionally, we used a linear warm-up schedule for 1000 training steps.
The maximum sequence length was set to 512. The best model was selected according to the
validation F-score across 30 epochs. For UserIdentifier, we took ten tokens drawn from the
tokenizer vocabulary as an identifier. This enabled better differentiation between users than
relying on usernames or strings of numbers. The UserId model, on the other hand, leveraged
the embedding of a special ID token, which is a concatenation of the word user, an underscore
character (_), and a unique index number for a specific user, e.g., user_1.</p>
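      <p>The per-annotator filtering and splitting described above can be sketched as follows. This is an assumption-laden illustration rather than the authors' code: the function name, the (annotator, text) pair layout, the shuffling, and the rounding of split sizes are all hypothetical choices.</p>
      <preformat>
```python
import random
from collections import defaultdict


def split_by_annotator(annotations, min_annotations=20, ratios=(6, 2, 2), seed=0):
    """Split (annotator_id, text_id) pairs into train/dev/test per annotator.

    Annotators with fewer than min_annotations entries are dropped, and every
    remaining annotator contributes entries to each split in the given ratio.
    """
    by_annotator = defaultdict(list)
    for annotator_id, text_id in annotations:
        by_annotator[annotator_id].append((annotator_id, text_id))

    rng = random.Random(seed)
    total = sum(ratios)
    splits = {"train": [], "dev": [], "test": []}
    for annotator_id in sorted(by_annotator):
        rows = by_annotator[annotator_id]
        if min_annotations > len(rows):
            continue  # not enough history to learn this annotator's profile
        rng.shuffle(rows)
        train_end = len(rows) * ratios[0] // total
        dev_end = train_end + len(rows) * ratios[1] // total
        splits["train"].extend(rows[:train_end])
        splits["dev"].extend(rows[train_end:dev_end])
        splits["test"].extend(rows[dev_end:])
    return splits
```
      </preformat>
      <p>Splitting within each annotator, rather than across annotators, is what guarantees that every user in the evaluation sets is already known to the model from the training set.</p>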
      <p>All experiments were repeated ten times with the same data order but different initial weights
(the model seed was changed between runs). To plot DDM, training dynamics were also logged after each
epoch. Two classification metrics were reported to compare model performance. In the last step,
we assessed the statistical significance of the differences in results between each pair of methods.
After checking the assumptions of Student’s t-test for independent samples, we applied it where they were met. The
Mann-Whitney U test was applied otherwise.</p>
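      <p>To make the fallback test concrete, the sketch below computes only the Mann-Whitney U statistic for two sets of per-run scores by direct pair counting. It is a hedged illustration: the function name is hypothetical, and obtaining a p-value from U (from tables or a normal approximation) is left to a statistics library.</p>
      <preformat>
```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) for two independent samples.

    U_a counts the pairs in which the value from sample_a is larger than the
    value from sample_b; tied pairs contribute one half.
    """
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    # The two statistics always sum to the number of pairs.
    u_b = len(sample_a) * len(sample_b) - u_a
    return u_a, u_b
```
      </preformat>
      <p>When every score of one method exceeds every score of the other, one statistic equals the number of pairs and the other is zero, which is the strongest evidence of a difference that the test can register.</p>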
    </sec>
    <sec id="sec-7">
      <title>7. Results</title>
      <p>Table 2 shows the results of the Baseline model and the three personalized models. The
models were evaluated on the MHS dataset for five dimensions of offensiveness. We measured
performance using the macro F1-score and accuracy. For the macro F1-score, a significant gain
over the baseline is observed for the personalized models in all dimensions of offensiveness.
The best among them is the
HuBi-Medium model, for which the gain is 7.03 pp for hate speech, 4.2 pp for insult, 9.54 pp for
violence, 4.08 pp for humiliation, and 3.2 pp for sentiment. For accuracy,
statistically significant gains were observed for three of the five dimensions, i.e., hate
speech, violence, and humiliation. Further analysis showed no significant differences between the
personalized models for this measure.</p>
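      <p>The two reported metrics can behave differently on unbalanced labels, which may explain why gains are visible in the macro F1-score but not always in accuracy. The following sketch, with hypothetical function names and toy labels, computes both from lists of true and predicted labels.</p>
      <preformat>
```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
```
      </preformat>
      <p>Because the macro average weights every class equally, mistakes on a rare class lower the macro F1-score far more than they lower accuracy.</p>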
      <p>It is much more interesting to analyze the differences between the baseline and personalized
models using Differential Dataset Cartography (DDC). Figure 2 shows the original Dataset Cartography (DC) graph, generated using the method described
in [44]. For virtually all dimensions, the graphs for the baseline model look very similar, and
larger differences are observed between the graphs of the various personalized models. The
original DC graph, however, is difficult to interpret, as all that can be said is how the distribution
of confidence and variability values changes in a general way for all cases. There are some
dimensions (e.g., sentiment) for which the DC charts are very similar, regardless of the model
used. However, significant differences can be seen by analyzing the DDC graph for data
samples; see Figure 3. For sentiment, it can be seen that the UserIdentifier model improves
correctness for the vast majority of samples. At the same time, for UserId,
there appears to be a smaller but still sizable group of samples for which the correctness
decreases. Paradoxically, for the best model, HuBi-Medium, we observe the largest groups of
cases characterized by the greatest decrease in correctness. We hypothesize that the person
component of the multimodal model causes the model to make mistakes more often during
the learning process, although the final result is still better than that of the baseline model. The
UserIdentifier model makes the fewest of these mistakes in the learning process, but this
does not translate into better quality as measured by the F1-score. HuBi-Medium is also the
only model for which variability decreases relative to the baseline in the vast majority of samples,
indicating that the personalization component for this model significantly affects the rate at
which the model converges to a local optimum, after which there is little variation in subsequent
epochs.</p>
      <p>Furthermore, interesting results can be observed in the DDC variant in which the training
data are aggregated by user; see Figure 4. Each point on the graph represents the cases annotated
by a particular person, and its position indicates the shift of the personalized model
results relative to the baseline. In addition, we added information indicating the entropy of the
user’s ratings in the set. The UserIdentifier model resulted, for all dimensions of offensiveness,
in increased correctness and variability for most users. The lower the entropy of user ratings,
the higher the correctness. For the UserId model, in dimensions in which a large group of users
has variability lower than the baseline (insult, sentiment), we observe no significant differences
between the baseline and personalized models. Finally, for HuBi-Medium, the lack of significant
differences is strongly correlated with the largest increase in confidence. In addition, it can be
seen that for all dimensions, we observe a significant decrease in variability, with confidence
decreasing most strongly for users with high entropy and mostly increasing for users with low
entropy.</p>
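      <p>A user-level view of this kind can be sketched by grouping per-sample changes by annotator and attaching the entropy of that annotator's ratings. The names below and the use of the arithmetic mean as the aggregate are assumptions for illustration; the text does not specify how the per-user points were aggregated.</p>
      <preformat>
```python
from collections import Counter, defaultdict
from math import log2


def rating_entropy(labels):
    """Shannon entropy (in bits) of the distribution of one user's labels."""
    counts = Counter(labels)
    total = len(labels)
    return sum((c / total) * log2(total / c) for c in counts.values())


def aggregate_by_user(rows):
    """Average per-sample changes for every user.

    rows: iterable of (user_id, label, confidence_change,
    variability_change, correctness_change) tuples.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[0]].append(row[1:])
    summary = {}
    for user_id, items in grouped.items():
        n = len(items)
        summary[user_id] = {
            "entropy": rating_entropy([item[0] for item in items]),
            "confidence_change": sum(item[1] for item in items) / n,
            "variability_change": sum(item[2] for item in items) / n,
            "correctness_change": sum(item[3] for item in items) / n,
        }
    return summary
```
      </preformat>
      <p>A user who always assigns the same label has zero entropy, so plotting entropy against the aggregated changes makes it possible to check whether consistent annotators are the ones who benefit most from personalization.</p>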
    </sec>
    <sec id="sec-8">
      <title>8. Conclusions and Future Work</title>
      <p>In this article, we presented a novel evaluation method called Differential Dataset Cartography.
It allows for pairwise visual comparison of model performance. Our experiments
yielded interesting findings with the new method. However, further analysis could help
identify more insights that could not be obtained from the raw metric values.</p>
      <p>The experiments show that including the user context results in significantly improved
performance compared to the baseline model. The evaluation metrics show that the HuBi-Medium
model outperformed other architectures in most tasks. However, using DDM provided an
additional perspective for analyzing the model behavior. Our method emphasized the differences
in user representations and in the nature of training of each of the personalized architectures. It also
provided additional information on how each architecture gathers knowledge about users.</p>
      <p>Moreover, by aggregating the DDM by users, we explored how much knowledge the model
can extract from a specific user. This shows another aspect of human perception: we can
discover how difficult it is to learn the preferences of a single person from the model’s point
of view. This can be used as additional feedback during the annotation process to
estimate how much data we need about a particular user to learn their perspective effectively.
Furthermore, a precise estimate of user learning difficulty can be helpful during the architecture
design process, which should take into account the general difficulty level of the task.</p>
      <p>In future work, we will conduct more experiments using other datasets to obtain more
knowledge about the behavior of personalized models. In addition, we want to analyze in detail
the samples that form the clusters that can be observed in the DDM charts. This will help to
understand the impact of specific samples on the model efficiency. The code for all methods
and experiments is publicly available in the CLARIN-PL repository<sup>1</sup>.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>
        This work was financed by (1) the European Regional Development Fund as a part of the
2014-2020 Smart Growth Operational Programme, project no. POIR.01.01.01-00-0288/22 (JK);
POIR.04.02.00-00C002/19 (JB, KK), (2) the statutory funds of the Department of Artificial
Intelligence, Wrocław University of Science and Technology.
      </p>
      <p><sup>1</sup> https://github.com/CLARIN-PL/personalized-nlp/releases/tag/2023-aaai-defactify</p>
      <p>
[2] J. Kocoń, P. Miłkowski, M. Zaśko-Zielińska, Multi-level sentiment analysis of polemo
2.0: Extended corpus of multi-domain consumer reviews, in: Proceedings of the 23rd
Conference on Computational Natural Language Learning (CoNLL), 2019, pp. 980–991.
[3] J. Kocoń, A. Janz, P. Miłkowski, M. Riegel, M. Wierzba, A. Marchewka, A. Czoska, D.
Grimling, B. Konat, K. Juszczyk, et al., Recognition of emotions, valence and arousal in large-scale
multi-domain text reviews, in: 9th Language &amp; Technology Conference: Human Language
Technologies as a Challenge for Computer Science and Linguistics, 2019.
[4] K. Kanclerz, P. Miłkowski, J. Kocoń, Cross-lingual deep neural transfer learning in
sentiment analysis, Procedia Computer Science 176 (2020) 128–137.
[5] R. Oshikawa, J. Qian, W. Y. Wang, A survey on natural language processing for fake news
detection, in: Proceedings of the 12th LREC Conference, 2020, pp. 6086–6093.
[6] N. Micallef, M. Sandoval-Castañeda, A. Cohen, M. Ahamad, S. Kumar, N. Memon,
Cross-platform multimodal misinformation: Taxonomy, characteristics and detection for textual
posts and videos, in: Proceedings of the International AAAI Conference on Web and Social
Media, volume 16, 2022, pp. 651–662.
[7] J. Kocoń, A. Janz, Propagation of emotions, arousal and polarity in wordnet using
heterogeneous structured synset embeddings, in: Proceedings of the 10th Global Wordnet
Conference, 2019, pp. 336–341.
[8] J. Kocoń, M. Maziarz, Mapping wordnet onto human brain connectome in emotion
processing and semantic similarity recognition, Information Processing &amp; Management
58 (2021) 102530.
[9] J. Kocoń, J. Radom, E. Kaczmarz-Wawryk, K. Wabnic, A. Zajączkowska, M. Zaśko-Zielińska,
Aspectemo: multi-domain corpus of consumer reviews for aspect-based sentiment analysis,
in: 2021 International Conference on Data Mining Workshops, IEEE, 2021.
[10] J. Kocoń, J. Baran, M. Gruza, A. Janz, M. Kajstura, P. Kazienko, W. Korczyński, P. Miłkowski,
M. Piasecki, J. Szołomicka, Neuro-symbolic models for sentiment analysis, in: International
Conference on Computational Science, Springer, 2022, pp. 667–681.
[11] P. Miłkowski, M. Gruza, P. Kazienko, J. Szołomicka, S. Woźniak, J. Kocoń, Multiemo:
language-agnostic sentiment analysis, in: International Conference on Computational
Science, Springer, 2022, pp. 72–79.
[12] M. Wierzba, M. Riegel, J. Kocoń, P. Miłkowski, A. Janz, K. Klessa, K. Juszczyk, B. Konat,
D. Grimling, M. Piasecki, et al., Emotion norms for 6000 polish word meanings with a
direct mapping to the polish wordnet, Behavior Research Methods 54 (2022) 2146–2161.
[13] A. Schmidt, M. Wiegand, A survey on hate speech detection using natural language
processing, in: Proceedings of the fifth international workshop on natural language
processing for social media, 2017, pp. 1–10.
[14] A. Sheth, V. L. Shalin, U. Kursuncu, Defining and detecting toxicity on social media:
context and knowledge are key, Neurocomputing 490 (2022) 312–318.
[15] S. Ghosh, A. Ekbal, P. Bhattacharyya, T. Saha, A. Kumar, S. Srivastava, Sehc: A benchmark
setup to identify online hate speech in english, IEEE Transactions on Computational Social
Systems (2022).
[16] P. Fortuna, S. Nunes, A survey on automatic detection of hate speech in text, ACM
Computing Surveys (CSUR) 51 (2018) 1–30.
[17] J. Kocoń, A. Figas, M. Gruza, D. Puchalska, T. Kajdanowicz, P. Kazienko, Offensive,
aggressive, and hate speech analysis: From data-centric to human-centered approach,
Information Processing &amp; Management 58 (2021) 102643.
[18] F. Mireshghallah, V. Shrivastava, M. Shokouhi, T. Berg-Kirkpatrick, R. Sim, D. Dimitriadis,
Useridentifier: implicit user representations for simple and effective personalized sentiment
analysis, arXiv preprint arXiv:2110.00135 (2021).
[19] J. Kocoń, M. Gruza, J. Bielaniewicz, D. Grimling, K. Kanclerz, P. Miłkowski, P. Kazienko,
Learning personal human biases and representations for subjective tasks in natural
language processing, in: 2021 IEEE International Conference on Data Mining (ICDM), IEEE,
2021, pp. 1168–1173.
[20] P. Miłkowski, M. Gruza, K. Kanclerz, P. Kazienko, D. Grimling, J. Kocoń, Personal bias in
prediction of emotions elicited by textual opinions, in: Proceedings of the 59th Annual
Meeting of the Association for Computational Linguistics and the 11th International Joint
Conference on Natural Language Processing: Student Research Workshop, 2021.
[21] A. Ngo, A. Candri, T. Ferdinan, J. Kocoń, W. Korczynski, Studemo: A non-aggregated
review dataset for personalized emotion recognition, in: Proceedings of the 1st Workshop
on Perspectivist Approaches to NLP@ LREC2022, 2022, pp. 46–55.
[22] K. Kanclerz, A. Figas, M. Gruza, T. Kajdanowicz, J. Kocoń, D. Puchalska, P. Kazienko,
Controversy and conformity: from generalized to personalized aggressiveness detection, in:
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics
and the 11th International Joint Conference on Natural Language Processing (Volume 1:
Long Papers), 2021, pp. 5915–5926.
[23] P. Miłkowski, S. Saganowski, M. Gruza, P. Kazienko, M. Piasecki, J. Kocoń, Multitask
personalized recognition of emotions evoked by textual content, in: 2022 IEEE International
Conference on Pervasive Computing and Communications Workshops and other Affiliated
Events (PerCom Workshops), IEEE, 2022, pp. 347–352.
[24] Y. Sang, J. Stanton, The origin and value of disagreement among data labelers: A case
study of individual differences in hate speech annotation, in: International Conference on
Information, Springer, 2022, pp. 425–444.
[25] K. Kanclerz, M. Gruza, K. Karanowski, J. Bielaniewicz, P. Miłkowski, J. Kocoń, P. Kazienko,
What if ground truth is subjective? personalized deep neural hate speech detection, in:
Proceedings of the 1st Workshop on Perspectivist Approaches to NLP@ LREC2022, 2022.
[26] A. Chatterjee, U. Gupta, M. K. Chinnakotla, R. Srikanth, M. Galley, P. Agrawal,
Understanding emotions in text using deep learning and big data, Computers in Human Behavior 93
(2019) 309–317.
[27] C. J. Kennedy, G. Bacon, A. Sahn, C. von Vacano, Constructing interval variables via
faceted Rasch measurement and multitask deep learning: a hate speech application, arXiv
e-prints (2020) arXiv:2009.10277. arXiv:2009.10277.
[28] A. Rajadesingan, R. Zafarani, H. Liu, Sarcasm detection on twitter: A behavioral modeling
approach, in: Proceedings of the eighth ACM international conference on web search and
data mining, 2015, pp. 97–106.
[29] S. Amir, B. C. Wallace, H. Lyu, P. C. M. J. Silva, Modelling context with user embeddings
for sarcasm detection in social media, arXiv preprint arXiv:1607.00976 (2016).
[30] L. Gong, B. Haines, H. Wang, Clustered model adaption for personalized sentiment analysis,
in: Proceedings of the 26th International Conference on World Wide Web, 2017.
[31] A. Kamal, M. Abulaish, Self-deprecating humor detection: A machine learning approach,
in: International Conference of the Pacific Association for Computational Linguistics,
Springer, 2019, pp. 483–494.
[32] A. Mondal, R. Sharma, Team_KGP at SemEval-2021 task 7: A deep neural system to
detect humor and offense with their ratings in the text data, in: Proceedings of the
15th International Workshop on Semantic Evaluation (SemEval-2021), Association for
Computational Linguistics, 2021, pp. 1169–1174.
[33] N. Chetty, S. Alathur, Hate speech review in the context of online social networks,
Aggression and violent behavior 40 (2018) 108–118.
[34] L. Aroyo, C. Welty, Truth is a lie: Crowd truth and the seven myths of human annotation,
AI Magazine 36 (2015) 15–24.
[35] T. Liu, A. Venkatachalam, P. Sanjay Bongale, C. Homan, Learning to predict
population-level label distributions, in: Companion Proceedings of The 2019 World Wide Web
Conference, 2019, pp. 1111–1120.
[36] S. Akhtar, V. Basile, V. Patti, Modeling annotator perspective and polarized opinions
to improve hate speech detection, in: Proceedings of the AAAI Conference on Human
Computation and Crowdsourcing, volume 8, 2020, pp. 151–154.
[37] T. C. Weerasooriya, T. Liu, C. M. Homan, Neighborhood-based pooling for population-level
label distribution learning, arXiv preprint arXiv:2003.07406 (2020).
[38] S. Tonekaboni, S. Joshi, M. D. McCradden, A. Goldenberg, What clinicians want:
contextualizing explainable machine learning for clinical end use, in: Machine learning for
healthcare conference, PMLR, 2019, pp. 359–380.
[39] F. Doshi-Velez, B. Kim, Towards a rigorous science of interpretable machine learning, stat
1050 (2017) 2.
[40] A. Lui, G. W. Lamb, Artificial intelligence and augmented intelligence collaboration:
regaining trust and confidence in the financial sector, Information &amp; Communications
Technology Law 27 (2018) 267–283.
[41] A. Adadi, M. Berrada, Peeking inside the black-box: A survey on explainable artificial
intelligence (xai), IEEE Access 6 (2018) 52138–52160.
[42] A. Das, P. Rad, Opportunities and challenges in explainable artificial intelligence (xai): A
survey, arXiv preprint arXiv:2006.11371 (2020).
[43] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, T.-S. Chua, Neural collaborative filtering, in:
Proceedings of the 26th international conference on world wide web, 2017, pp. 173–182.
[44] S. Swayamdipta, R. Schwartz, N. Lourie, Y. Wang, H. Hajishirzi, N. A. Smith, Y. Choi, Dataset
cartography: Mapping and diagnosing datasets with training dynamics, in: Proceedings
of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP),
Association for Computational Linguistics, Online, 2020, pp. 9275–9293.
[45] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V.
Stoyanov, Roberta: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692
(2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Recurrent convolutional neural networks for text classification</article-title>
          ,
          <source>in: Twenty-ninth AAAI conference on artificial intelligence</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>