=Paper=
{{Paper
|id=Vol-3370/paper8
|storemode=property
|title=Cross-lingual Transfer Learning for Detecting Negative Campaign in Israeli Municipal Elections: a Case Study
|pdfUrl=https://ceur-ws.org/Vol-3370/paper8.pdf
|volume=Vol-3370
|authors=Natalia Vanetik,Marina Litvak,Lin Miao
|dblpUrl=https://dblp.org/rec/conf/ecir/VanetikLM23
}}
==Cross-lingual Transfer Learning for Detecting Negative Campaign in Israeli Municipal Elections: a Case Study==
Natalia Vanetik¹,∗, Marina Litvak¹ and Lin Miao²
¹ Department of Software Engineering, Shamoon College of Engineering (SCE), Beer-Sheva, Israel
² Department of Computer Science, Beijing Information Science and Technology University, Beijing, China
Abstract
Political competitions are complex settings where candidates use campaigns to promote their chances of being elected. As we can observe in recent elections, some candidates choose to focus on a negative campaign that emphasizes the negative aspects of a competitor and is aimed at offending opponents or their supporters. A major challenge in this area is the lack of annotated datasets for training efficient classifiers. Therefore, transfer learning from other relevant domains and other languages can be very useful for this task. Considering the recent success of meta-learning in domain adaptation, we apply it to our task by utilizing available datasets from different domains and languages. This work explores the negative campaign detection task from multiple perspectives: the efficiency of different text representations and classification models, and the effect of transfer learning from offensive language detection in different languages on negative campaign detection in Hebrew. We demonstrate that the lack of training data for negative campaign detection in a low-resourced language such as Hebrew can be compensated for, to some extent, by available datasets for offensive language detection in the same and other languages. We report an empirical case study of political campaigns in Israeli municipal elections.¹
Keywords
negative campaign, text classification, Hebrew, BERT, meta-learning
1. Introduction
Political competitions aim at promoting the candidates’ chances to be elected. The main decision
in such competitions regards the nature of the campaign – that is, whether a candidate should
apply a positive campaign that highlights the candidate’s achievements, leadership skills, and
future programs, or focus on a negative campaign that emphasizes the negative sides of the
competitors [1, 2, 3, 4].
In recent years, we have witnessed the intensive use of negative campaigns by political candidates, which target the weaknesses and failures of their opponents while promising to do the opposite [2, 3, 4].
¹ Our dataset is freely available for researchers at https://github.com/NataliaVanetik1/TONIC.
In: R. Campos, A. Jorge, A. Jatowt, S. Bhatia, M. Litvak (eds.): Proceedings of the Text2Story’23 Workshop, Dublin
(Republic of Ireland), 2-April-2023
∗ Corresponding author.
† These authors contributed equally.
natalyav@sce.ac.il (N. Vanetik); litvak.marina@gmail.com (M. Litvak); linmiao@bistu.edu.cn (L. Miao)
ORCID: 0000-0002-4939-1415 (N. Vanetik); 0000-0003-3044-3681 (M. Litvak); 0000-0002-9421-8566 (L. Miao)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org
The application of language technologies in the political sciences has recently been in high demand [5]. However, despite some works dedicated to the analysis of elections-related materials [6, 7, 8], we were unable to find any work on automated negative campaign analysis and detection.
Our work reports the results of extensive experiments, aimed at answering multiple research
questions: (1) Which supervised model and representation are more effective at automatically
detecting negative campaigns in Hebrew? (2) Can we effectively detect negative campaigns with
a model trained to identify offensive language? (3) Can meta-learning with different domains
and languages boost negative campaign detection in Hebrew?
We adopt and extend the representation models applied in [9, 10, 11], where the gain of
semantic vectors and sentiment knowledge for offensive language and negative campaign
detection was empirically shown. In order to increase classification accuracy in a mono-domain
setting, we use knowledge about cities, country districts (regions), and politicians. We use
this information in a meta-learning setting as well. In [10], we also showed the efficiency of transfer learning for cross-lingual training of offensive language classifiers on Semitic languages, and we adopt and explore this idea in the present study. In contrast to [11], the lack of Hebrew datasets is addressed here by using cross-domain and cross-lingual transfer learning.
Our contribution is multi-fold: (1) we experimented with different representations and classifiers for efficient encoding and classification of Hebrew texts for negative campaign detection; (2) we explored the efficiency of meta-learning in mono-domain experiments; (3) we explored the efficiency of transfer learning from offensive language detection in different languages to negative campaign detection; and (4) we explored the gain of meta-learning over conventional fine-tuning of language models in cross-domain transfer learning.
2. TONIC dataset
The data was collected from the Facebook accounts of local politicians from several big Israeli
cities running for the mayor’s office. In total, the data covers 12 cities and 27 mayoral candidates who ran in the elections that took place in 2018. Data statistics appear in Table 1. The data
is freely available for download from GitHub at https://github.com/NataliaVanetik1/TONIC.
Collected posts were annotated as either negative or not by two independent annotators; in case
of a disagreement between them, the third annotator decided on a final label. The annotators
were instructed to label a post as a “negative campaign” only if it contained negative (but not
necessarily offensive) content about an opponent of the post’s owner or the opponent’s supporters. The kappa agreement between the annotators was 0.862. The majority-rule baseline, i.e., the portion of the larger class in our data, is 0.78 (the distribution between the two classes is 78%–22%, with the majority class being benign texts and the minority class containing negative campaign texts).
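The agreement and baseline figures above can be sketched in a few lines. The helpers below are ours (not tools used by the authors), and the label lists in the test are toy examples, but `majority_rule` applied to a 78/22 split reproduces the 0.78 baseline.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance, (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n                   # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)   # chance agreement
    return (p_o - p_e) / (1 - p_e)

def majority_rule(labels):
    """Accuracy of always predicting the most frequent class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)
```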
3. Proposed method for Negative Campaign classification
Our approach follows a standard supervised learning flow: text representation, model training, and application of the trained model to a test set for evaluation.
Table 1
Collected data by city.

| region | city | candidates | posts | pos | neg | avg words in post | avg characters in post |
|---|---|---|---|---|---|---|---|
| center | Herzliya | 2 | 218 | 91 | 127 | 108.482 | 645.468 |
| center | Jerusalem | 3 | 412 | 32 | 380 | 72.471 | 428.964 |
| center | Rishon LeZion | 1 | 183 | 23 | 160 | 103.448 | 619.989 |
| center | Tel Aviv | 1 | 36 | 8 | 28 | 95.611 | 545.806 |
| center | Petah Tikva | 4 | 364 | 68 | 296 | 80.184 | 466.626 |
| center | Hod Hasharon | 2 | 266 | 45 | 221 | 85.128 | 498.432 |
| south | Ashdod | 4 | 363 | 139 | 224 | 92.377 | 528.044 |
| south | Ashkelon | 3 | 363 | 61 | 302 | 82.157 | 482.876 |
| south | Dimona | 1 | 50 | 7 | 43 | 92.280 | 542.240 |
| south | Beer Sheva | 1 | 14 | 9 | 5 | 192.500 | 1075.643 |
| north | Netanya | 4 | 316 | 81 | 235 | 72.215 | 427.886 |
| north | Haifa | 1 | 47 | 4 | 43 | 75.234 | 440.319 |
| Total | | 27 | 2632 | 568 | 2064 | 85.384 | 500.771 |
The following techniques were employed for the post representation:
• Term frequency-inverse document frequency (tf-idf), where every post is treated
as a separate document and the whole dataset as a corpus.
• N-grams of 𝑛 consecutive words seen in the text, with 𝑛 = 1, 2, 3.
• BERT sentence embeddings using one of two pre-trained BERT models: a multilingual model [12] or a Hebrew model [13]. We use BERT embeddings to represent the post text, region, and city.
• Sentiment weights generated by the HeBERT model [14], producing a probability
distribution for positive, negative, and neutral sentiments, for every post.
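The first two representations above can be sketched as follows, assuming whitespace tokenization. The helper names are ours; a real pipeline would typically use a library implementation such as scikit-learn's `TfidfVectorizer`.

```python
import math
from collections import Counter

def tfidf_vectors(posts):
    """tf-idf where every post is a document and the whole dataset is the corpus:
    weight(t, d) = tf(t, d) * log(N / df(t))."""
    n = len(posts)
    tokenized = [p.split() for p in posts]
    df = Counter(t for toks in tokenized for t in set(toks))  # document frequency
    vocab = sorted(df)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)  # term frequency within this post
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

def word_ngrams(tokens, n):
    """Consecutive word n-grams; the paper uses n = 1, 2, 3."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

A term such as "a" below, which appears in every post, gets idf = log(1) = 0 and is effectively discarded.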
For classification, we experimented with three different types of classifiers:
• Traditional classifiers, including Random Forest (RF) [15], Logistic Regression
(LR) [16], and Extreme Gradient Boosting (XGB) [17].
• Fine-tuned BERT, including a multilingual model called bert-base-multilingual-cased
(denoted as mBERT) [18] and AlephBERT [13], a large pre-trained language model for
Modern Hebrew. Both models were fine-tuned on the train portion of our data.
• Meta-learning, where we create a meta-model for detecting unfavorable campaigns when training data for this particular task and language is missing (or insufficient). Model-Agnostic Meta-Learning (MAML) [19] is a general optimization framework that uses gradient descent to produce a strong initial model that can quickly adapt to new target tasks; therefore, we used MAML for meta-learning in this study. As the foundation for our meta-learning, we use a pre-trained BERT language model as the base model. The goal of meta-learning is to train a model on a variety of learning tasks such that it can solve new learning tasks using only a small number of training samples. We use three different criteria to split our data into training tasks: (1) a politician’s account, where one training task aims at the identification of posts with negative campaigns published by the same politician; (2) a city, where a training task focuses on the data generated by politicians from the same particular city; and (3) a region of the country, where we train our model on the annotated posts generated by politicians from the same region of the country.
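To illustrate the MAML inner/outer loop described above (not the authors' BERT-based setup), here is a first-order MAML sketch on synthetic one-parameter regression tasks; the task construction, learning rates, and step counts are all illustrative assumptions.

```python
import random

random.seed(0)  # deterministic toy run

def loss_grad(w, batch):
    # gradient of mean squared error 0.5 * (w*x - y)^2 with respect to w
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def maml(tasks, w0=0.0, meta_lr=0.1, inner_lr=0.1, steps=500):
    """First-order MAML: adapt on a task's support set with one gradient step,
    then update the meta-parameter using the gradient on the query set."""
    w = w0
    for _ in range(steps):
        task = random.choice(tasks)
        support, query = task[:5], task[5:]
        w_adapted = w - inner_lr * loss_grad(w, support)  # inner-loop adaptation
        w -= meta_lr * loss_grad(w_adapted, query)        # outer (meta) update
    return w

def make_task(slope):
    # a "task" is a small synthetic dataset from the model y = slope * x
    return [(x, slope * x) for x in (random.uniform(-1, 1) for _ in range(10))]

# three tasks with nearby slopes, mimicking per-politician / per-city / per-region splits
tasks = [make_task(2.0 + 0.1 * d) for d in (-1, 0, 1)]
w_meta = maml(tasks)
```

The meta-parameter converges near the slope shared by the tasks, so a single inner-loop step already fits a new, related task well.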
A full pipeline of our approach is depicted in Figure 1.
Figure 1: Political posts classification pipeline: posts → tokenization → {tf-idf vectors, n-gram vectors, BERT sentence vectors, sentiment analysis} → prediction model.
4. Experiments
Our experiments aim to evaluate (1) different models and representations of Hebrew data in the negative campaign domain; (2) transfer learning from the hate speech domain, in Hebrew and other languages; and (3) the meta-learning approach in mono-domain and cross-domain learning.
Data and Software Setup
For the monolingual experiments on the TONIC dataset, RF, LR, and XGB are trained on 80% of
the dataset and evaluated on the remaining 20%. For the cross-domain monolingual experiments,
the models are trained on 100% of the other domain data and tested on 20% of the TONIC dataset.
For the cross-domain cross-lingual experiments, we train our models on 100% of the data in
another language and test on the same 20% of the TONIC dataset. In all cases, the test portion of the TONIC dataset is identical, which allows us to conduct proper statistical significance analysis. Fine-tuned BERT was trained on 75% of the data, with a validation set containing 5% of the data, and it was tested on the remaining 20%. Fine-tuning was run for 10 epochs with batch
size 16. For the cross-domain experiments, we used the Hebrew offensive language dataset [20]
called OLaH. Traditional models were implemented in sklearn [21] and neural models were
implemented in Keras [22] with the TensorFlow backend [23]. Experiments were performed on Google Colab [24] with Pro settings.
Mono-domain Evaluation Results
Here we report the results (precision, recall, F1-measure, and accuracy) of the evaluation and comparison of various models and text representations for detecting negative campaigns in political posts written in Hebrew. In particular, we explore whether or not BERT sentence
embeddings perform better than traditional text representations such as tf-idf and n-grams. We
also compare two pre-trained BERT models to determine whether a model specifically trained
in Hebrew is preferable.
Table 2 (left) summarizes the results for the conventional models and representations without
sentence embeddings. All models were trained and tested on the TONIC training and test sets,
respectively. The text representations use either tf-idf or n-grams (ngX denotes n-grams for
𝑋 = 1, 2, 3), or their combinations (tfidf-ngX denotes a concatenation of tf-idf vectors with
Table 2
Evaluation of traditional models and representations on TONIC: mono-domain (left) and cross-domain monolingual (right).

| model | mono P | mono R | mono F1 | mono acc | cross P | cross R | cross F1 | cross acc |
|---|---|---|---|---|---|---|---|---|
| RF_tfidf+SA | 0.8908 | 0.6467 | 0.6813 | 0.8444 | 0.6457 | 0.5222 | 0.4931 | 0.7837 |
| LR_tfidf+SA | 0.8341 | 0.7243 | 0.7586 | 0.8615 | 0.8926 | 0.5044 | 0.4485 | 0.7856 |
| XGB_tfidf+SA | 0.8656 | 0.7662 | 0.8010 | 0.8824 | 0.8933 | 0.5088 | 0.4575 | 0.7875 |
| RF_ng1+SA | 0.8460 | 0.6626 | 0.6979 | 0.8444 | 0.5171 | 0.5015 | 0.4531 | 0.7761 |
| LR_ng1+SA | 0.8390 | 0.7601 | 0.7892 | 0.8729 | 0.5181 | 0.5068 | 0.4870 | 0.7495 |
| XGB_ng1+SA | 0.8220 | 0.7445 | 0.7726 | 0.8634 | 0.6956 | 0.5215 | 0.4882 | 0.7875 |
| RF_ng2+SA | 0.8633 | 0.6399 | 0.6715 | 0.8387 | 0.4819 | 0.4885 | 0.4785 | 0.7059 |
| LR_ng2+SA | 0.7633 | 0.7276 | 0.7424 | 0.8368 | 0.5091 | 0.5098 | 0.5090 | 0.6546 |
| XGB_ng2+SA | 0.7972 | 0.7417 | 0.7633 | 0.8539 | 0.4990 | 0.4998 | 0.4576 | 0.7685 |
| RF_ng3+SA | 0.8978 | 0.6260 | 0.6541 | 0.8368 | 0.4989 | 0.4994 | 0.4880 | 0.7230 |
| LR_ng3+SA | 0.7633 | 0.7276 | 0.7424 | 0.8368 | 0.5091 | 0.5098 | 0.5090 | 0.6546 |
| XGB_ng3+SA | 0.7972 | 0.7417 | 0.7633 | 0.8539 | 0.5098 | 0.5018 | 0.4639 | 0.7666 |
| RF_tfidf+ng1+SA | 0.8460 | 0.6299 | 0.6580 | 0.8330 | 0.6090 | 0.5166 | 0.4842 | 0.7799 |
| LR_tfidf+ng1+SA | 0.8390 | 0.7601 | 0.7892 | 0.8729 | 0.5330 | 0.5124 | 0.4948 | 0.7533 |
| XGB_tfidf+ng1+SA | 0.8567 | 0.7681 | 0.8002 | 0.8805 | 0.8933 | 0.5088 | 0.4575 | 0.7875 |
| RF_tfidf+ng2+SA | 0.8935 | 0.6128 | 0.6357 | 0.8311 | 0.4607 | 0.4856 | 0.4568 | 0.7362 |
| LR_tfidf+ng2+SA | 0.7581 | 0.7156 | 0.7325 | 0.8330 | 0.5147 | 0.5161 | 0.5147 | 0.6546 |
| XGB_tfidf+ng2+SA | 0.8317 | 0.7545 | 0.7829 | 0.8691 | 0.3916 | 0.4988 | 0.4388 | 0.7818 |
| RF_tfidf+ng3+SA | 0.9097 | 0.6009 | 0.6183 | 0.8273 | 0.5366 | 0.5109 | 0.4873 | 0.7609 |
| LR_tfidf+ng3+SA | 0.7581 | 0.7156 | 0.7325 | 0.8330 | 0.5147 | 0.5161 | 0.5147 | 0.6546 |
| XGB_tfidf+ng3+SA | 0.8485 | 0.7701 | 0.7994 | 0.8786 | 0.3916 | 0.4988 | 0.4388 | 0.7818 |
Table 3
Evaluation of mono-domain training on TONIC with BERT sentence embeddings: mBERT (left) and AlephBERT (right).

| model | mBERT P | mBERT R | mBERT F1 | mBERT acc | AlephBERT P | AlephBERT R | AlephBERT F1 | AlephBERT acc |
|---|---|---|---|---|---|---|---|---|
| RF_bert | 0.8607 | 0.7052 | 0.7452 | 0.8615 | 0.8283 | 0.7231 | 0.7564 | 0.8596 |
| LR_bert | 0.8072 | 0.7699 | 0.7859 | 0.8634 | 0.8145 | 0.7938 | 0.8034 | 0.8710 |
| XGB_bert | 0.8059 | 0.7731 | 0.7874 | 0.8634 | 0.8160 | 0.7799 | 0.7956 | 0.8691 |
| RF_bert+loc | 0.8796 | 0.6957 | 0.7377 | 0.8615 | 0.8725 | 0.7152 | 0.7568 | 0.8672 |
| LR_bert+loc | 0.8251 | 0.7716 | 0.7933 | 0.8710 | 0.7990 | 0.7814 | 0.7896 | 0.8615 |
| XGB_bert+loc | 0.8523 | 0.7864 | 0.8125 | 0.8843 | 0.8518 | 0.8016 | 0.8227 | 0.8880 |
| RF_bert+region | 0.8461 | 0.6909 | 0.7287 | 0.8539 | 0.8504 | 0.7235 | 0.7611 | 0.8653 |
| LR_bert+region | 0.8097 | 0.7743 | 0.7896 | 0.8653 | 0.8205 | 0.7994 | 0.8092 | 0.8748 |
| XGB_bert+region | 0.7782 | 0.7324 | 0.7508 | 0.8444 | 0.8160 | 0.7799 | 0.7956 | 0.8691 |
| RF_bert+region+loc | 0.8718 | 0.6782 | 0.7178 | 0.8539 | 0.8705 | 0.7108 | 0.7522 | 0.8653 |
| LR_bert+region+loc | 0.8228 | 0.7672 | 0.7895 | 0.8691 | 0.7974 | 0.7878 | 0.7924 | 0.8615 |
| XGB_bert+region+loc | 0.8702 | 0.7869 | 0.8182 | 0.8899 | 0.8562 | 0.8028 | 0.8250 | 0.8899 |
| RF_tfidf+bert | 0.8792 | 0.5777 | 0.5827 | 0.8159 | 0.8611 | 0.6562 | 0.6915 | 0.8444 |
| LR_tfidf+bert | 0.8340 | 0.7740 | 0.7979 | 0.8748 | 0.8194 | 0.7919 | 0.8043 | 0.8729 |
| XGB_tfidf+bert | 0.8418 | 0.7569 | 0.7875 | 0.8729 | 0.8432 | 0.7765 | 0.8025 | 0.8786 |
| RF_tfidf+bert+ng1 | 0.9057 | 0.5789 | 0.5843 | 0.8178 | 0.8891 | 0.6423 | 0.6756 | 0.8425 |
| LR_tfidf+bert+ng1 | 0.8316 | 0.7621 | 0.7886 | 0.8710 | 0.8400 | 0.8130 | 0.8253 | 0.8861 |
| XGB_tfidf+bert+ng1 | 0.8221 | 0.7521 | 0.7784 | 0.8653 | 0.8432 | 0.7765 | 0.8025 | 0.8786 |
| RF_tfidf+bert+ng2 | 0.9130 | 0.6184 | 0.6438 | 0.8349 | 0.8816 | 0.6543 | 0.6903 | 0.8463 |
| LR_tfidf+bert+ng2 | 0.7619 | 0.7169 | 0.7346 | 0.8349 | 0.7881 | 0.7532 | 0.7681 | 0.8520 |
| XGB_tfidf+bert+ng2 | 0.8320 | 0.7470 | 0.7771 | 0.8672 | 0.8408 | 0.7872 | 0.8092 | 0.8805 |
| RF_tfidf+bert+ng3 | 0.9130 | 0.6184 | 0.6438 | 0.8349 | 0.8677 | 0.6694 | 0.7074 | 0.8501 |
| LR_tfidf+bert+ng3 | 0.7619 | 0.7169 | 0.7346 | 0.8349 | 0.7881 | 0.7532 | 0.7681 | 0.8520 |
| XGB_tfidf+bert+ng3 | 0.8174 | 0.7509 | 0.7761 | 0.8634 | 0.8385 | 0.7752 | 0.8002 | 0.8767 |
n-grams of size 𝑋 = 1, 2, 3). All the systems are significantly better than the majority rule. Also,
the XGB classifier with tf-idf, unigrams, and sentiment labels outperforms the other classifiers.
The confusion matrix of the top-performing model (XGB_bert+region+loc) contains TP = 75, TN = 391, FP = 22, and FN = 39, with precision = 0.77 and recall = 0.66. These results show that the model does a good job of identifying and filtering out negative samples (non-negative campaigns), but it misses positive samples (negative campaigns). As a result, TN is the largest component of the accuracy, while FN accounts for the largest share of the errors. In a sample of 10 misclassified cases that we manually examined, more than half of the errors (6) were the result of incorrect labeling by our annotators: four samples were labeled as negative campaigns when we actually found them to be neutral, and two samples were incorrectly labeled as neutral.
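As a sanity check, the positive-class precision and recall quoted above follow directly from the reported confusion matrix:

```python
# Confusion matrix of the top-performing model, as reported above.
TP, TN, FP, FN = 75, 391, 22, 39

precision = TP / (TP + FP)                   # 75 / 97  ≈ 0.773
recall    = TP / (TP + FN)                   # 75 / 114 ≈ 0.658
accuracy  = (TP + TN) / (TP + TN + FP + FN)  # 466 / 527 ≈ 0.884
```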
Table 3 shows the scores for the same models over sentence embeddings produced by two different BERT models: multilingual BERT [25] and the Hebrew-language AlephBERT [13]. We can see that enriching the sentence embeddings with the names of cities and regions boosts classification performance. XGB outperforms the other classifiers, as in the previous experiment. We cannot recommend one particular BERT model, because both models seem to provide sentence embeddings of similar quality. However, when we compare these BERT models fine-tuned on the classification task on TONIC (see Table 4), AlephBERT, which is trained solely on Hebrew, significantly outperforms multilingual BERT, whose accuracy falls below the majority rule. Nonetheless, both models are outperformed by the best traditional models, probably because less information is encoded in their text representation: while both BERT classifiers use only self-produced embeddings, the traditional models also utilize sentiment labels and embeddings representing the cities and regions of the candidates.
Table 4 contains the results of meta-learning where tasks are specified by three different
criteria.
Table 4
Meta-learning and fine-tuned BERT evaluation on the TONIC dataset.

| model | FT P | FT R | FT F1 | FT acc | task split by | meta P | meta R | meta F1 | meta acc |
|---|---|---|---|---|---|---|---|---|---|
| mBERT | 0.6079 | 0.5816 | 0.5817 | 0.6641 | politician | 0.6390 | 0.7994 | 0.7103 | 0.7994 |
| | | | | | location | 0.6126 | 0.7827 | 0.6873 | 0.7827 |
| | | | | | region | 0.5659 | 0.7523 | 0.6459 | 0.7523 |
| AlephBERT | 0.8589 | 0.7964 | 0.8190 | 0.8634 | politician | 0.6055 | 0.7781 | 0.6810 | 0.7781 |
| | | | | | location | 0.6173 | 0.7857 | 0.6914 | 0.7857 |
| | | | | | region | 0.6126 | 0.7827 | 0.6873 | 0.7827 |
We can see that meta-learning with multilingual BERT achieves the best accuracy score among the meta-models; however, for all the options of task division, the meta-learning scores are very close to the majority rule, which is evidence that there is not much information that can be efficiently learned and transferred between tasks. We can also see that for fine-tuned BERT, AlephBERT has a clear advantage over the multilingual BERT model in all parameters.
According to the scores in Tables 2 and 3 (we omitted the meta-learning models because of their low performance), the top-performing model is XGB applied to BERT embeddings enriched with region and location embeddings. In general, the XGB classifier outperforms the other classifiers in most cases.
Cross-domain Mono-lingual Evaluation Results
Cross-domain mono-lingual experiments (all models were trained and tested on Hebrew data), reported in Table 2 (right), show that using an offensive language dataset as a training set decreases classification accuracy for all the models, indicating that the task of detecting negative campaigns differs from the task of offensive language detection. Only a few models trained on offensive language data achieved accuracy slightly higher than or equal to the majority rule. Additionally, we can see that the F1 scores are very low, meaning that these models simply ‘guess’ the majority class.
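Accuracy near 0.78 combined with very low macro-F1 is exactly what an always-majority predictor produces on the 78%/22% TONIC distribution. A quick sketch (class counts per 100 posts come from Section 2; the `f1` helper is ours):

```python
def f1(tp, fp, fn):
    """Per-class F1 from counts; returns 0 when the class is never predicted or absent."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

benign, negative = 78, 22                     # class counts per 100 posts
f1_benign = f1(tp=benign, fp=negative, fn=0)  # majority class: always predicted
f1_negative = f1(tp=0, fp=0, fn=negative)     # minority class: never predicted
macro_f1 = (f1_benign + f1_negative) / 2      # ≈ 0.44
accuracy = benign / (benign + negative)       # 0.78, the majority rule
```

This reproduces the pattern in Table 2 (right): accuracy stuck at roughly 0.78 with macro-F1 near 0.44.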
Table 5 shows the results of the traditional models with BERT embeddings as the text representation for transfer learning from offensive language detection in Hebrew. From Table 2 (right) and Table 5, we can conclude that (1) the XGB classifier mostly performs better than the other classifiers and (2) its performance is slightly higher with BERT embeddings than with tf-idf vectors and n-grams.
Table 5
Cross-domain mono-lingual evaluation of traditional models with BERT sentence embeddings: mBERT (left) and AlephBERT (right).

| model | mBERT P | mBERT R | mBERT F1 | mBERT acc | AlephBERT P | AlephBERT R | AlephBERT F1 | AlephBERT acc |
|---|---|---|---|---|---|---|---|---|
| RF_tfidf+bert | 0.6424 | 0.5032 | 0.4479 | 0.7837 | 0.7946 | 0.5163 | 0.4743 | 0.7894 |
| LR_tfidf+bert | 0.7265 | 0.5076 | 0.4568 | 0.7856 | 0.7265 | 0.5076 | 0.4568 | 0.7856 |
| XGB_tfidf+bert | 0.6726 | 0.5171 | 0.4800 | 0.7856 | 0.6726 | 0.5171 | 0.4800 | 0.7856 |
| RF_tfidf+bert+ng1 | 0.3910 | 0.4952 | 0.4370 | 0.7761 | 0.6429 | 0.5064 | 0.4561 | 0.7837 |
| LR_tfidf+bert+ng1 | 0.5463 | 0.5160 | 0.4980 | 0.7590 | 0.5463 | 0.5160 | 0.4980 | 0.7590 |
| XGB_tfidf+bert+ng1 | 0.7465 | 0.5271 | 0.4973 | 0.7913 | 0.7609 | 0.5315 | 0.5053 | 0.7932 |
| RF_tfidf+bert+ng2 | 0.4990 | 0.4998 | 0.4576 | 0.7685 | 0.4680 | 0.4955 | 0.4494 | 0.7666 |
| LR_tfidf+bert+ng2 | 0.5073 | 0.5081 | 0.5069 | 0.6471 | 0.5073 | 0.5081 | 0.5069 | 0.6471 |
| XGB_tfidf+bert+ng2 | 0.7304 | 0.5302 | 0.5042 | 0.7913 | 0.7304 | 0.5302 | 0.5042 | 0.7913 |
| RF_tfidf+bert+ng3 | 0.6937 | 0.5107 | 0.4648 | 0.7856 | 0.4671 | 0.4909 | 0.4564 | 0.7495 |
| LR_tfidf+bert+ng3 | 0.5073 | 0.5081 | 0.5069 | 0.6471 | 0.5073 | 0.5081 | 0.5069 | 0.6471 |
| XGB_tfidf+bert+ng3 | 0.7304 | 0.5302 | 0.5042 | 0.7913 | 0.7304 | 0.5302 | 0.5042 | 0.7913 |
Table 6 shows the results of meta-learning trained on hate speech data and tested on the TONIC dataset. Two BERT models are initialized with the weights generated by meta-learning. The table also contains the scores of fine-tuned BERT without meta-learning.
We can see that (1) the best traditional models perform better than both the fine-tuned language models and the meta-models trained on the hate speech domain; the only exception is the recall and F1 scores of meta-learning, which is evidence of its better ability to recognize the positive samples (negative political campaigns), although it fails at filtering out neutral posts (also confirmed by its lower precision); (2) AlephBERT performs better with meta-learning than multilingual BERT; and (3) meta-learning outperforms the fine-tuned language models in terms of both precision and recall.
Table 6
Meta-learning cross-domain mono-lingual evaluation.

| BERT model | FT P | FT R | FT F1 | FT acc | meta P | meta R | meta F1 | meta acc |
|---|---|---|---|---|---|---|---|---|
| mBERT | 0.5000 | 0.3918 | 0.4394 | 0.7837 | 0.6620 | 0.6793 | 0.6701 | 0.6793 |
| AlephBERT | 0.5142 | 0.5818 | 0.4823 | 0.7761 | 0.6126 | 0.7827 | 0.6873 | 0.7827 |
Cross-domain Cross-lingual Evaluation Results
Table 7 shows the evaluation of traditional models for the cross-domain cross-lingual scenario.
In this setting, we train our models on hate speech datasets in other languages: English and Arabic. The only text representation we can use here is multilingual BERT sentence embeddings generated by the pre-trained model bert-base-multilingual-cased [18].
Table 7
Cross-domain cross-lingual evaluation of traditional models.

| model | OLID (En) P | OLID (En) R | OLID (En) F1 | OLID (En) acc | OLaA (Ar) P | OLaA (Ar) R | OLaA (Ar) F1 | OLaA (Ar) acc |
|---|---|---|---|---|---|---|---|---|
| RF | 0.6096 | 0.5085 | 0.1965 | 0.2296 | 0.4804 | 0.4812 | 0.4807 | 0.6546 |
| LR | 0.5082 | 0.5005 | 0.1872 | 0.2220 | 0.4224 | 0.4672 | 0.4365 | 0.7173 |
| XGB | 0.5535 | 0.5053 | 0.1978 | 0.2296 | 0.5109 | 0.5072 | 0.5019 | 0.7154 |
Table 8 shows the results of meta-learning trained on hate speech data in other languages
(Arabic and English) and tested on the TONIC dataset. The English-language dataset is the Offensive Language Identification Dataset (OLID) [26], a collection of 14,100 tweets (we used the 13,240 annotated tweets from its training set). The Arabic dataset is OLaA, which we collected and introduced previously in [9]; it is a collection of 9,000 Twitter comments annotated for hate speech. We used a multilingual BERT model [18] for these
experiments. For comparison, we also show the scores of this BERT model fine-tuned on Arabic
and English hate-speech data and tested on TONIC.
Both experiments show that meta-learning adapts pre-trained models to new domains much better than traditional fine-tuning, and that it can be efficiently applied for transfer learning from other domains and even other languages. In particular, we can observe the following: (1) fine-tuned language models and meta-learning perform better than the best traditional models when trained on foreign languages; (2) meta-learning outperforms fine-tuned language models.
Table 8
Cross-domain cross-lingual evaluation of meta-learning.

| data | lang | FT P | FT R | FT F1 | FT acc | meta P | meta R | meta F1 | meta acc |
|---|---|---|---|---|---|---|---|---|---|
| OLaA | Ar | 0.4785 | 0.4250 | 0.4392 | 0.7400 | 0.6245 | 0.7903 | 0.6977 | 0.7903 |
| OLID | En | 0.6102 | 0.7812 | 0.6852 | 0.7812 | 0.6342 | 0.7964 | 0.7061 | 0.7964 |
5. Future Work and Conclusions
Based on the results of extensive experiments aimed at answering various research questions (see Section 1), we can conclude that (1) the best combination of text representation and classification model for negative campaign detection in Hebrew texts is XGB with sentence embeddings enriched with region and location information; (2) transfer learning with models trained to detect offensive content is inefficient for the detection of negative campaigns, meaning that there is no strong relation between offensive language and negative campaigns; (3) transfer learning from different languages can be applied to Hebrew in the negative campaign detection task, and training on a large set in a foreign language can be even more efficient than training on Hebrew; and (4) meta-learning outperforms traditionally fine-tuned language models in cross-domain and cross-lingual scenarios, but not in a mono-lingual setting. We also observe that in a monolingual setting that employs either a fine-tuned BERT or BERT sentence embeddings, the AlephBERT model trained on Hebrew is preferable to a multilingual BERT model. In the future, we plan to apply our analysis to elections for the Israeli government, to explore the common characteristics and differences between political campaigns in different countries, and to study possible relations between a candidate’s gender, perceived strength, initial support, etc., and their engagement in negative campaigning.
References
[1] D. Bernhardt, M. Ghosh, Positive and negative campaigning in primary and general
elections, Games and Economic Behavior 119 (2020) 98–104.
[2] G. M. Invernizzi, Electoral competition and factional sabotage, Available at SSRN 3329622
(2019).
[3] P. S. Martin, Inside the black box of negative campaign effects: Three reasons why negative
campaigns mobilize, Political psychology 25 (2004) 545–562.
[4] S. Skaperdas, B. Grofman, Modeling negative campaigning, American Political Science
Review 89 (1995) 49–61.
[5] H. Afli, M. Alam, H. Bouamor, C. B. Casagran, C. Boland, S. Ghannay (Eds.), Proceedings of
The LREC 2022 workshop on Natural Language Processing for Political Sciences, European
Language Resources Association, Marseille, France, 2022. URL: https://aclanthology.org/
2022.politicalnlp-1.
[6] M. Baran, M. Wójcik, P. Kolebski, M. Bernaczyk, K. Rajda, L. Augustyniak, T. Kajdanowicz,
Electoral agitation dataset: The use case of the polish election, in: Proceedings of The LREC
2022 workshop on Natural Language Processing for Political Sciences, European Language
Resources Association, Marseille, France, 2022, pp. 32–36. URL: https://aclanthology.org/
2022.politicalnlp-1.5.
[7] H. Abdine, Y. Guo, V. Rennard, M. Vazirgiannis, Political communities on twitter: Case
study of the 2022 french presidential election, in: Proceedings of The LREC 2022 work-
shop on Natural Language Processing for Political Sciences, European Language Re-
sources Association, Marseille, France, 2022, pp. 62–71. URL: https://aclanthology.org/
2022.politicalnlp-1.9.
[8] E. Sanders, A. van den Bosch, Correlating political party names in tweets, newspapers
and election results, in: Proceedings of The LREC 2022 workshop on Natural Language
Processing for Political Sciences, European Language Resources Association, Marseille,
France, 2022, pp. 8–15. URL: https://aclanthology.org/2022.politicalnlp-1.2.
[9] M. Litvak, N. Vanetik, Y. Nimer, A. Skout, Offensive language detection in semitic languages,
in: 1st Multimodal and Multilingual Hate Speech Detection Workshop at KONVENS 2021, 2021, pp. 7–13.
[10] M. Litvak, N. Vanetik, C. Liebeskind, O. Hmdia, R. A. Madeghem, Offensive language
detection in hebrew: can other languages help?, in: Proceedings of the Language Resources
and Evaluation Conference, European Language Resources Association, Marseille, France,
2022, pp. 3715–3723. URL: https://aclanthology.org/2022.lrec-1.396.
[11] M. Litvak, N. Vanetik, S. Talker, O. Machlouf, Detection of negative campaign in israeli
municipal elections, in: Proceedings of the Third Workshop on Threat, Aggression and
Cyberbullying (TRAC 2022), 2022, pp. 68–74.
[12] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller,
faster, cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019).
[13] A. Seker, E. Bandel, D. Bareket, I. Brusilovsky, R. S. Greenfeld, R. Tsarfaty, Alephbert: A
hebrew large pre-trained language model to start-off your hebrew nlp application with,
arXiv preprint arXiv:2104.04052 (2021).
[14] A. Chriqui, I. Yahav, Hebert & hebemo: a hebrew bert model and a tool for polarity analysis
and emotion recognition, arXiv preprint arXiv:2102.01909 (2021).
[15] M. Pal, Random forest classifier for remote sensing classification, International journal of
remote sensing 26 (2005) 217–222.
[16] R. E. Wright, Logistic regression, in: L. G. Grimm, P. R. Yarnold (Eds.), Reading and under-
standing multivariate statistics, American Psychological Association, 1995, pp. 217–244.
[17] T. Chen, T. He, M. Benesty, V. Khotilovich, Y. Tang, H. Cho, K. Chen, et al., Xgboost:
extreme gradient boosting, R package version 0.4-2 1 (2015) 1–4.
[18] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional
transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.
org/abs/1810.04805. arXiv:1810.04805.
[19] C. Finn, P. Abbeel, S. Levine, Model-agnostic meta-learning for fast adaptation of deep
networks, in: International conference on machine learning, PMLR, 2017, pp. 1126–1135.
[20] M. Litvak, N. Vanetik, Y. Nimer, A. Skout, Offensive language detection in semitic languages, in: Multimodal Hate Speech Workshop 2021, 2021, pp. 7–12.
[21] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher,
M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine
Learning Research 12 (2011) 2825–2830.
[22] F. Chollet, et al., Keras, https://github.com/fchollet/keras, 2015.
[23] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis,
J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Joze-
fowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray,
C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Van-
houcke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu,
X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
URL: https://www.tensorflow.org/, software available from tensorflow.org.
[24] E. Bisong, Building machine learning and deep learning models on Google cloud platform:
A comprehensive guide for beginners, Apress, 2019.
[25] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[26] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar, Predicting the
Type and Target of Offensive Posts in Social Media, in: Proceedings of NAACL, 2019, p.
1415–1420.