=Paper=
{{Paper
|id=Vol-3878/83_main_long
|storemode=property
|title=MONICA: Monitoring Coverage and Attitudes of Italian Measures in Response to COVID-19
|pdfUrl=https://ceur-ws.org/Vol-3878/83_main_long.pdf
|volume=Vol-3878
|authors=Fabio Pernisi,Giuseppe Attanasio,Debora Nozza
|dblpUrl=https://dblp.org/rec/conf/clic-it/PernisiAN24
}}
==MONICA: Monitoring Coverage and Attitudes of Italian Measures in Response to COVID-19==
Fabio Pernisi¹, Giuseppe Attanasio² and Debora Nozza¹
¹ Department of Computing Sciences, Bocconi University, Milan, Italy
² Instituto de Telecomunicações, Lisbon, Portugal
Abstract
Modern social media have long been observed as a mirror for public discourse and opinions. Especially in the face of
exceptional events, computational language tools are valuable for understanding public sentiment and reacting quickly.
During the coronavirus pandemic, the Italian government issued a series of financial measures, each unique in target,
requirements, and benefits. Despite the widespread dissemination of these measures, it is currently unclear how they were
perceived and whether they ultimately achieved their goal. In this paper, we document the collection and release of MoniCA,
a new social media dataset for MONItoring Coverage and Attitudes to such measures. The data include approximately ten
thousand posts discussing a variety of measures over ten months. We collected annotations for sentiment, emotion, irony, and
topics for each post. We conducted an extensive analysis using computational models to learn these aspects from text. We
release a compliant version of the dataset to foster future research on computational approaches for understanding public
opinion about government measures. We release data and code at https://github.com/MilaNLProc/MONICA.
Keywords
Sentiment Analysis, Social Media, Computational Social Science, Italian
1. Introduction

Understanding public opinion on governmental decisions has always been crucial for assessing policies' effectiveness, especially when facing exceptional events that require prompt decisions. Computational linguists and social scientists have long observed modern social media platforms, as they are a perfect stage for spreading opinions swiftly and transparently. Natural Language Processing (NLP) techniques have been widely used for analyzing public discussion [e.g., 1, 2, 3].

The COVID-19 pandemic, arguably the most prominent of such exceptional events, prompted the Italian government—and other European governments—to release multiple financial measures to cushion the impact on the population. These so-called "bonuses," issued pro bono, i.e., with no interest payments from recipients, aimed at increasing liquidity and reducing tax burdens. However, despite the measures reaching varied recipients, their reception and effectiveness remain largely unexplored.

To address this gap, we collect and release MoniCA, a new social media dataset for MONItoring Coverage and Attitudes of Italian measures in response to COVID-19. MoniCA comprises approximately 10,000 posts spanning ten months, collected on X.com. These posts pertain to the Italian public's discussions of the diverse financial measures introduced during the pandemic. Building on an extensive body of literature that examines public sentiment during the pandemic [e.g., 4, 5, 6, 7, 8], this work offers new insights into the limited research specifically addressing Italy.¹

This paper details the dataset's collection and release. It introduces the annotations we compiled for each post, including sentiment, emotion, irony, and discussion topics. We then conducted an analysis using traditional models and transformer-based language models to predict these aspects from textual data, demonstrating the dataset's potential usability. Moreover, using state-of-the-art interpretability tools, we explained the models' decision processes. We found that the resulting explanations are faithful to the models and plausible with respect to human judgments.

MoniCA will allow a retrospective examination of the efficacy – and inefficacy – of governmental measures implemented in Italy during the COVID-19 pandemic, as perceived by the population. By doing so, we seek to provide insights that can inform policymakers about the strengths and weaknesses of such financial measures, ensuring better preparedness and response strategies for any future crises.

¹ See De Rosis et al. [9] for one of the early (and few) works on modelling sentiment from Twitter during the COVID-19 outbreak.

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
fabio.pernisi@studbocconi.it (F. Pernisi); giuseppe.attanasio@lx.it.pt (G. Attanasio); debora.nozza@unibocconi.it (D. Nozza)
https://gattanasio.cc/ (G. Attanasio); https://deboranozza.com/ (D. Nozza)
ORCID: 0000-0001-6945-3698 (G. Attanasio); 0000-0002-7998-2267 (D. Nozza)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

Contributions. We release MoniCA, a GDPR-compliant dataset of social media posts to monitor
the coverage and people's attitudes towards the Italian government's financial aid to combat the COVID-19 crisis. We collect annotations of several aspects to allow for a finer-grained analysis. We used state-of-the-art NLP and interpretability tools and reported key insights on public sentiment.

2. MoniCA

To build a comprehensive resource, reflecting multiple facets of the phenomenon and usable by future policymakers, we prioritized 1) topic and time coverage in our collection process (§2.1), and 2) relevance refinement and data annotation to enrich the initial pool with additional metadata (§2.2).

2.1. Data Collection

We collected approximately 200,000 posts from X in late 2022. We then filtered each post to obtain data that was in Italian (per the platform-retrieved metadata), not a repost, dated between March 1, 2021, and December 31, 2021, and selected via hard keyword matching.

We chose search keywords and phrases that match the informal name of any of the measures – e.g., "bonus bicicletta" (eng: bike bonus) or "bonus babysitting" – and downloaded all matching posts. The keywords used to identify relevant discussions were selected based on insights from an author who is a native of Italy and was residing there during the pandemic period (2019-2022). Additional keyword refinement was supported by details from the National Social Security Institute (INPS) about COVID-19 measures.²

Below is the complete list of financial measures on which we focused (see Appendix for the corresponding keywords):

• Bonus mobilità (Mobility bonus): a contribution of 750 euros that could be used to purchase electric scooters, electric or traditional bicycles, or public transport subscriptions.
• Bonus 600 euro: a 600 euro income support allowance provided under Italy's "Cura Italia" decree to self-employed professionals with an active VAT number as of February 23, 2020.
• Bonus vacanza (Holiday bonus): part of the "Decreto Rilancio", it offers up to 500 euros for the payment of tourism services and packages provided by national tourist accommodations, travel agencies, tour operators, farm stays, and bed & breakfasts.
• Reddito di emergenza (Emergency income): a temporary income support measure established by the "Decreto Rilancio" for households facing financial difficulties.
• Bonus terme (Spa bonus): an incentive (of up to 200 euros) aimed at supporting citizens' purchases of spa services at accredited facilities.
• Bonus babysitter: a measure providing parents of children under 14 in remote learning or quarantine with a bonus (up to 1,200 or 2,000 euros) for purchasing babysitting or child care services. It is available to certain workers, including those in the public security and healthcare sectors involved in the COVID-19 response.
• Bonus asilo nido (Daycare/nursery bonus): an income support subsidy aimed at families with children under three years old attending public or authorized private nurseries, or suffering from severe chronic illnesses. The bonus amount varies based on the family's ISEE income level, with maximum yearly benefits ranging from 1,500 to 3,000 euros.
• Bonus figli (Child bonus): a universal financial aid for families with dependent children up to 21 years old, or indefinitely for disabled children. The amount varies based on family income (ISEE), the number and age of children, and any disabilities.
• Bonus partite IVA (VAT bonus): a one-time 200 euro aid for self-employed and professional workers who earned less than 35,000 euros in 2021, have an active VAT number, and made at least one contributory payment by May 18, 2022.
• Bonus sportivi (Sport bonus): a one-time 200 euro incentive for sports collaborators.
• "Bonus Covid": a 1,600 euro payment for certain categories of workers heavily impacted by the COVID-19 crisis. This bonus is available to occasional self-employed workers who do not have a VAT number and are not enrolled in other mandatory pension schemes.

To improve the quality of the initial pool, we removed duplicates (n=6,543). Moreover, after manually inspecting the pool, we discarded posts related to the keywords "decreti" (eng: decrees) and "credito d'imposta" (eng: tax credit), as they mainly pulled unrelated or too generic posts. The resulting collection counts approximately 100,000 posts relative to 12 different queries.

² https://www.inps.it/it/it/inps-comunica/notizie/dettaglio-news-page.news.2020.10.misure-covid-19-i-dati-al-10-ottobre-2020.html

2.2. Data Annotation

To balance annotation quantity and quality, we decided to collect extensive annotations for 10% of the initial pool.
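The collection filters described in Section 2.1 (Italian language, no reposts, the March–December 2021 window, and hard keyword matching) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the post field names are hypothetical, and only a small subset of the Appendix A keywords is shown.

```python
import re
from datetime import date

# Illustrative subset of the full keyword list (see Appendix A).
KEYWORDS = ["bonus bicicletta", "bonus babysitting", "bonus vacanze"]
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)

def keep(post: dict) -> bool:
    """Apply the Section 2.1 filters: Italian language, not a repost,
    posted between March 1 and December 31, 2021, and matching at
    least one measure keyword. Field names are hypothetical."""
    return (
        post["lang"] == "it"
        and not post["is_repost"]
        and date(2021, 3, 1) <= post["created_at"] <= date(2021, 12, 31)
        and PATTERN.search(post["text"]) is not None
    )
```

A post passes only if it satisfies all four conditions, mirroring the conjunction of filters applied to the initial 200,000-post pool.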
A critical issue with our initial pool was the presence of news posts, most frequently by media agencies and newspaper accounts. These posts are irrelevant to our goal of monitoring public perception of bonuses. Following previous work [7], we conducted a first round of annotation for relevance. We held round-table meetings to settle on a shared definition of relevance; then, we assigned 200 posts to each annotator and asked them to judge whether each was relevant. We considered a tweet irrelevant if it mentions a bonus but focuses on another topic.³ Next, we trained a supervised classifier to detect relevance and used it to select 10,400 additional posts from 7,238 unique users.⁴

The annotation was conducted in three iterations. In the first two, we tasked annotators with annotating a shared set of 100 posts to compute agreement and tune the annotation guidelines. Then, we assigned each annotator 3,333 posts, non-overlapping among them. In the next step, we aggregated the labels: for subjectivity, sentiment, and irony we selected the annotations through majority voting, while for emotions and topics we kept all labels identified by any of the annotators. During this process, we identified some missing values in the annotations, which we addressed by removing the affected posts. The final set comprises 9,763 posts with one annotation each.

See Appendix B for full details on the annotation process, including pay rates, annotation platform and guidelines, inter-annotator agreement, intra-annotator consistency over time, and classifier performance.

³ E.g., "@user Ma allora sei grillina?! Il bonus vacanze l'ha dato lo Stato no De Luca." Eng: "@user are you grillina then? De Luca provided bonus vacanze, not the state."—grillina is an idiomatic expression indicating someone who votes for the Movimento Cinque Stelle political party.
⁴ We selected posts with a relevance score above 0.95, stratifying on the publication month, user ID, and matching search query to preserve variety in the data.

Annotation Fields. To conduct the annotation, we provided annotators with i) the post's main text, ii) the publication date, iii) at most two antecedent posts in the conversation tree, and iv) any multimedia content, if present. When available, the preceding posts and media form the conversational context and can help disambiguate the post's meaning.

Each post was annotated for (1) subjectivity, (2) sentiment, (3) topic, (4) emotion, and (5) irony. Subjectivity was assessed as binary (subjective or not subjective); sentiment classification included negative, neutral, and positive categories; the topics were carefully pre-determined together with the annotators, taking into account the aspects we aimed to extract from the data (see Table 4 for the list of topics); emotions included anger, sadness, joy, disgust, and fear; irony was assessed as binary (ironic or not ironic). Annotators could select more than one emotion and topic per post. Moreover, we asked annotators to highlight the (6) span(s) of text that motivated their sentiment annotation. (1)-(5) will serve to map public opinion on the studied measures, and (6) will allow us to verify whether NLP models detect sentiment like a human would (§5).

General Statistics. Tables 1, 2, and 3 report the distribution of subjectivity, sentiment, emotion, and irony over the possible options.

Table 1: Subjectivity in MoniCA.
  Subjective      96.8%
  Not subjective   3.2%

Table 2: Sentiment in MoniCA.
  Negative  81%
  Neutral   14%
  Positive   5%

Table 3: Emotion and irony in MoniCA.
  Anger    66.7%
  Sadness  16.8%
  Joy       5.8%
  Disgust   3.2%
  Fear      2.2%
  Irony    13.1%

Similar to related work [6, 7, 8], both sentiment and emotion are heavily skewed toward negative attitudes. The vast majority of posts (96.8%) are subjective; among them, 78% are negative, whereas 62% show anger. Irony notably appears in 5.4% of the posts. Table 4 shows the discussion topics and their proportions. Half of the posts are directed at politicians, with an even higher spike in negative sentiment (93.4%).

These findings, taken together, convey a critical message: the majority of social media comments about financial aid in Italy in 2021 are from unhappy people. Such users posted on X with a negative sentiment, showing anger, sadness, disgust, or fear eight times out of ten. Some of our fine-grained annotations disclose potential reasons: 8.5% of posts mention struggling to obtain a bonus, 1.4% not having the requisites, and 1.3% not benefiting from or not getting the bonus.
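The label aggregation used for MoniCA (majority voting for the single-label fields, and the union of all annotators' labels for emotions and topics) can be sketched as follows. The field names are illustrative, not the released dataset's actual schema.

```python
from collections import Counter

def aggregate(annotations):
    """Aggregate the three annotators' labels for one post: majority
    vote for single-label fields (subjectivity, sentiment, irony) and
    the union of labels for multi-label fields (emotions, topics).
    Field names are illustrative, not the dataset's actual schema."""
    result = {}
    for field in ("subjectivity", "sentiment", "irony"):
        votes = Counter(a[field] for a in annotations)
        result[field] = votes.most_common(1)[0][0]  # most frequent label wins
    for field in ("emotions", "topics"):
        # keep every label any annotator selected
        result[field] = sorted(set().union(*(a[field] for a in annotations)))
    return result
```

With three annotators, a majority always exists for binary fields; for three-way sentiment, ties would need an explicit tie-breaking rule (not specified here).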
Table 4: Topics in MoniCA.
  Requesting a bonus                               10.7%
  Asking for information                            9.7%
  Obtained a bonus                                  2.5%
  Not obtained a bonus                              1.3%
  Struggling to obtain a bonus                      8.5%
  Struggling to benefit from a bonus                1.2%
  Is interested in a bonus                         13.5%
  Does not have the requisites to access a bonus    1.4%
  Addressing the political class                   49.3%

3. Experiments

We are particularly interested in verifying whether state-of-the-art NLP tools can help us automatically model and detect the users' opinions. If models succeed at this task, they will serve as a digital barometer for monitoring issues and pitfalls of state-enacted financial aid.

We designed five text classification tasks to train models for automatic (1) Subjectivity, (2) Sentiment, (3) Emotion, (4) Irony, and (5) Topic detection. (1) and (4) are binary classification tasks; (2), (3), and (5) are three-, six-, and nine-way multi-class classification tasks, respectively.

We used Logistic Regression (LR), fine-tuned a pretrained Italian BERT model named UmBERTo [10], and tested an existing BERT model for emotion and sentiment detection in Italian named FEEL-IT [11].⁵

LR was trained on preprocessed texts: we converted all posts to lowercase, removed special characters and stopwords, replaced URLs and user handles with special tags, and performed stemming.

Given the significant class imbalance in our annotated data, we report both macro and weighted F1 scores. Macro F1 averages the performance across all classes, highlighting the model's effectiveness on minority classes. Weighted F1 adjusts for class distribution, reflecting overall performance in line with class prevalence. This dual reporting provides a balanced view of the model's performance.

⁵ FEEL-IT does not predict the neutral class in the sentiment classification task.

Table 5: Macro and Weighted F1 of Logistic Regression (LR), fine-tuned UmBERTo (UB), and FEEL-IT (F-I) predictions on Subjectivity, Sentiment, Emotion, Topic, and Irony. Best models in bold.
                 Macro F1              Weighted F1
                 LR     UB     F-I     LR     UB     F-I
  Subjectivity   49.2   59.9   -       95.3   96.0   -
  Sentiment      42.8   61.1   32.6    78.0   82.7   72.5
  Emotion        16.2   18.0   26.6    57.9   57.0   62.9
  Topic          20.5   30.5   -       46.9   57.9   -
  Irony          49.7   46.4   -       81.3   80.4   -

4. Results

Table 5 reports classification performance for every model-task pair in our setup. Our experiments revealed disparate performance across tasks.

We observed higher scores on the subjectivity detection task, probably due to the easier binary setup and the high class imbalance. Emotion detection proved most challenging due to the subtle distinctions between classes. Interestingly, UmBERTo classified instances as either anger or joy, while LR defaulted to anger for all cases. FEEL-IT stood out by successfully identifying sadness and fear, highlighting the need for more data to capture the full spectrum of emotional nuances. None of the classifiers ever detected disgust.

Topic detection was another difficult task. In addition to the higher number of unique topics, text content among topics might overlap (e.g., users who complain about struggling to get a bonus might use similar language to those who cannot see benefits from it).

UmBERTo demonstrated strong performance, excelling in three out of five tasks (avg. Macro F1: 43.18, Weighted F1: 74.8). Interestingly, simpler methods like logistic regression also performed reliably (avg. Macro F1: 35.68, Weighted F1: 71.88). These results are promising, showing that both straightforward models and advanced large-scale models—pretrained in the target language, Italian—can effectively serve as tools for the automatic detection of subjectivity, sentiment, emotion, irony, and public attitudes. However, the natural imbalance in the data plays a significant role in these experiments, suggesting that further work is needed to address this issue more effectively.

5. Explainability Experiments

Interpretability research in NLP has developed methods and tools to help explain the rationale behind a model prediction. These tools are beneficial for assessing and debugging models, e.g., by checking whether a model "is right for the right reason" or for finding the cause of an error [12].

We conducted an additional interpretability analysis on UmBERTo, the best-performing model across our detection tasks (see §4). This study aims to verify whether the model's decision process aligns with the rationales highlighted by humans. Transparency on model internals and human alignment promotes accountability and trust.⁶

⁶ EU guidelines: https://bit.ly/eu-ai-guide.

Setup. Following [13, 14], we use four common post-hoc token-level attribution methods [15], i.e., LIME [16], SHAP [17], Integrated Gradient [18], and Gradient [19], across different configurations.
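As a toy illustration of what a post-hoc token-attribution method computes (not the actual LIME or SHAP algorithms), a leave-one-out occlusion sketch scores each token by how much the model's output drops when that token is removed. The "model" below is a stand-in word-counting scorer, purely an assumption for demonstration; the real experiments use fine-tuned UmBERTo.

```python
def occlusion_attributions(tokens, score):
    """Leave-one-out occlusion: the importance of each token is the drop
    in the model's score when that token is removed. `score` is any
    callable mapping a token list to a float; here it stands in for a
    real classifier's class probability (illustrative only)."""
    base = score(tokens)
    return [base - score(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]

# Toy "negative sentiment" scorer: fraction of negative-flavoured words.
# The word list is an assumption for this sketch, not from the paper.
NEGATIVE = {"truffa", "vergogna", "niente"}

def toy_score(toks):
    return sum(t in NEGATIVE for t in toks) / max(len(toks), 1)
```

Tokens whose removal lowers the score most receive the highest attribution, which is the intuition that the faithfulness metrics below (e.g., correlation with leave-one-out) probe on the real model.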
Given a model and a model prediction (e.g., Sentiment: "Negative"), each method assigns an importance score to each input token for that prediction. Table 6 reports an explanation example in the first row and the human rationale annotated in the second row.

Table 6: Explanation of Sentiment: Negative. Gold label: Neutral. Predicted label by UmBERTo: Negative. Token attributions that are darker red (blue) show higher (lower) contribution to the prediction. Eng: "... and holiday bonus for everyone it is!!!".
          ...     e      bonus   vacanze  per     tutti   !      !      !
  LIME    0.10    0.08   0.06    -0.26    -0.10   -0.15   0.07   0.10   0.08
  Human   0       0      1       1        1       1       0      0      0

We use faithfulness and plausibility [20] to evaluate the explanations. Faithfulness evaluates how accurately an explanation reflects the inner workings of the model. Plausibility, on the other hand, assesses how well the explanations align with human reasoning. We use the human rationales provided by the three annotators during the annotation phase, and the UmBERTo model trained on the sentiment classification task, explaining the most likely class label for each test instance. We use three faithfulness metrics (Comprehensiveness, Sufficiency, and Correlation with leave-one-out) and three plausibility metrics (Token IOU, Token F1, AUPRC), as described in DeYoung et al. [21, ERASER], and leverage ferret [14] for explanation generation and evaluation.

Table 7: XAI methods for explaining the sentiment analysis task (best values in bold, ↑: higher is better, ↓: lower is better).
                          aopc     aopc    taucorr  auprc   token  token
                          compr↑   suff↓   loo↑     plau↑   f1↑    iou↑
  Partition SHAP          0.43     0.01    0.19     0.65    0.20   0.12
  LIME                    0.51     0.00    0.28     0.63    0.19   0.11
  Gradient                0.22     0.10    0.01     0.61    0.19   0.11
  Gradient (x Input)      0.00     0.33    -0.12    0.60    0.17   0.10
  Integ. Gradient         0.02     0.34    -0.03    0.60    0.17   0.10
  Integ. Grad. (x Input)  0.29     0.06    0.10     0.62    0.18   0.11

Table 7 shows that LIME is, on average, the best method for explaining predictions, indicating that LIME provides explanations that are both comprehensive and sufficient.

6. Conclusion

We documented the collection and release of MoniCA, the first large-scale dataset for monitoring the coverage and attitudes of financial aid enacted by the Italian government during the COVID-19 pandemic. It counts around 10,000 posts annotated for subjectivity, sentiment, emotion, irony, and topic. We conducted a first analysis and discovered that (1) most posts have a negative tone and (2) NLP and machine learning models can help detect it. Finally, we conducted a preliminary explainability study to understand how models predict sentiment from text. We found that explanation quality varies across methods and recommend LIME as a sensible starting choice.

Our dataset and study fill a critical research gap by examining Italian public sentiment towards COVID-19 measures. Future research will build on this groundwork to develop more effective opinion monitoring and mining tools and ultimately inform prompt and targeted policy decisions. Additionally, to better understand the severity of negative attitudes, future research may concentrate on examining hate speech in relation to public policies during the pandemic in Italy [22, 23].

Acknowledgments

This project has in part received funding from Fondazione Cariplo (grant No. 2020-4288, MONICA) and from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 101116095, PERSONAE). Debora Nozza and Fabio Pernisi are members of the MilaNLP group and the Data and Marketing Insights Unit of the Bocconi Institute for Data Science and Analysis. Giuseppe Attanasio conducted part of the work as a member of the MilaNLP group. Additionally, he was partially supported by the Portuguese Recovery and Resilience Plan through project C645008882-00000055 (Center for Responsible AI) and by Fundação para a Ciência e Tecnologia through contract UIDB/50008/2020.
Limitations

Our collection might not represent the opinions of the entire population. All posts included in our dataset were taken from X, whose user base may be skewed towards specific demographics.

Additionally, a potential limitation might arise from the dependency of our data on keyword matching. This form of sampling might prevent some topics from being included in the dataset. However, we carried out keyword selection very carefully, including words and phrases that captured discussions around pro-bono government aid (see Section 2.2).

Another limitation is that our data covers a specific but quite broad temporal window, from March 1 to December 31, 2021. This window corresponds to a phase of the pandemic, and changes in public opinion following this period are not captured.

References

[1] W. Medhat, A. Hassan, H. Korashy, Sentiment analysis algorithms and applications: A survey, Ain Shams Engineering Journal 5 (2014) 1093–1113.
[2] A. Giachanou, F. Crestani, Like it or not: A survey of Twitter sentiment analysis methods, ACM Computing Surveys (CSUR) 49 (2016) 1–41.
[3] C. Qian, N. Mathur, N. H. Zakaria, R. Arora, V. Gupta, M. Ali, Understanding public opinions on social media for financial sentiment analysis using AI-based techniques, Information Processing & Management 59 (2022) 103098.
[4] M. Müller, M. Salathé, P. E. Kummervold, COVID-Twitter-BERT: A natural language processing model to analyse COVID-19 content on Twitter, Frontiers in Artificial Intelligence 6 (2023) 1023281.
[5] E. Chen, K. Lerman, E. Ferrara, Tracking social media discourse about the COVID-19 pandemic: Development of a public coronavirus Twitter data set, JMIR Public Health Surveill 6 (2020) e19273. URL: http://publichealth.jmir.org/2020/2/e19273/. doi:10.2196/19273.
[6] S. Kaur, P. Kaul, P. M. Zadeh, Monitoring the dynamics of emotions during COVID-19 using Twitter data, Procedia Computer Science 177 (2020) 423–430.
[7] K. Scott, P. Delobelle, B. Berendt, Measuring shifts in attitudes towards COVID-19 measures in Belgium, Computational Linguistics in the Netherlands Journal 11 (2021) 161–171. URL: https://www.clinjournal.org/clinj/article/view/133.
[8] T. Wang, K. Lu, K. P. Chow, Q. Zhu, COVID-19 sensing: negative sentiment analysis on social media in China via BERT model, IEEE Access 8 (2020) 138162–138169.
[9] S. De Rosis, M. Lopreite, M. Puliga, M. Vainieri, The early weeks of the Italian COVID-19 outbreak: sentiment insights from a Twitter analysis, Health Policy 125 (2021) 987–994.
[10] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32.
[11] F. Bianchi, D. Nozza, D. Hovy, FEEL-IT: Emotion and sentiment classification for the Italian language, in: Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, Association for Computational Linguistics, Online, 2021, pp. 76–83.
[12] M. Danilevsky, K. Qian, R. Aharonov, Y. Katsis, B. Kawas, P. Sen, A survey of the state of explainable AI for natural language processing, in: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Association for Computational Linguistics, Suzhou, China, 2020, pp. 447–459.
[13] G. Attanasio, D. Nozza, E. Pastor, D. Hovy, Benchmarking post-hoc interpretability approaches for transformer-based misogyny detection, in: Proceedings of NLP Power! The First Workshop on Efficient Benchmarking in NLP, Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 100–112.
[14] G. Attanasio, E. Pastor, C. Di Bonaventura, D. Nozza, ferret: a framework for benchmarking explainers on transformers, in: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 256–266. URL: https://aclanthology.org/2023.eacl-demo.29. doi:10.18653/v1/2023.eacl-demo.29.
[15] A. Madsen, S. Reddy, S. Chandar, Post-hoc interpretability for neural NLP: A survey, ACM Computing Surveys 55 (2022) 1–42.
[16] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?" Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
[17] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, Curran Associates Inc., Red Hook, NY, USA, 2017, pp. 4768–4777.
[18] M. Sundararajan, A. Taly, Q. Yan, Axiomatic attribution for deep networks, in: Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, JMLR.org, 2017, pp. 3319–3328.
[19] K. Simonyan, A. Vedaldi, A. Zisserman, Deep inside convolutional networks: Visualising image classification models and saliency maps, CoRR abs/1312.6034 (2013).
[20] A. Jacovi, Y. Goldberg, Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness?, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 4198–4205.
[21] J. DeYoung, S. Jain, N. F. Rajani, E. Lehman, C. Xiong, R. Socher, B. C. Wallace, ERASER: A benchmark to evaluate rationalized NLP models, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 4443–4458. URL: https://aclanthology.org/2020.acl-main.408. doi:10.18653/v1/2020.acl-main.408.
[22] D. Nozza, F. Bianchi, G. Attanasio, HATE-ITA: Hate speech detection in Italian social media text, in: Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH), Association for Computational Linguistics, Seattle, Washington (Hybrid), 2022, pp. 252–260.
[23] F. M. Plaza-del-Arco, D. Nozza, D. Hovy, Respectful or toxic? Using zero-shot learning with language models to detect hate speech, in: The 7th Workshop on Online Abuse and Harms (WOAH), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 60–68.
[24] G. Abercrombie, D. Hovy, V. Prabhakaran, Temporal and second language influence on intra-annotator agreement and stability in hate speech labelling, in: Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII), Association for Computational Linguistics, Toronto, Canada, 2023.

A. Data Collection

Data for the MoniCA dataset was gathered using X's proprietary historical API, via an academic subscription. Below is the complete list of keywords used for data collection, in the form of tweepy (https://www.tweepy.org/) queries:

• Bonus mobilità (Mobility bonus): "bonus mobilita" OR "bonus bici" OR "bonus monopattino" OR #bonusmobilita OR #bonusbici OR #bonusmonopattino
• Bonus 600 euro: "bonus 600 euro" OR "bonus 600euro" OR "bonus 600" OR #bonus600euro OR #bonus600
• Bonus vacanza (Holiday bonus): "bonus vacanza" OR "bonus vacanze" OR #bonusvacanza OR #bonusvacanze
• Reddito di emergenza (Emergency income): "reddito d'emergenza" OR "reddito di emergenza" OR #redditodemergenza OR #redditodiemergenza OR #REM
• Bonus terme (Spa bonus): "bonus terme" OR #bonusterme
• Bonus babysitter: "bonus babysitter" OR "bonus baby-sitter" OR "bonus babysitting" OR "bonus baby-sitting" OR #bonusbabysitter OR #bonusbabysitting
• Bonus asilo nido (Daycare/nursery bonus): "bonus asilo nido" OR #bonusasilonido
• Bonus figli (Child bonus): "bonus figli" OR #bonusfigli
• Bonus partite IVA (VAT bonus): "bonus partite iva" OR #bonuspartiteiva
• Bonus sportivi (Sport bonus): "bonus lavoratori sportivi" OR "bonus sportivi" OR (bonus lavoratori sportivi) OR (bonus collaboratori sportivi) OR "bonus collaboratori sportivi" OR #bonussportivi
• "Bonus Covid": "bonus covid" OR #bonuscovid

B. Data Annotation

Profile and pay rate. For annotating the MoniCA dataset, three student research assistants with backgrounds in Machine Learning and Natural Language Processing were hired full-time. They were each compensated for 32 hours of work at a rate of about 18 euros per hour. We provided each annotator with an initial set of annotation guidelines, and we organized initial meetings to familiarize them with the task and refine the guidelines.

Platform. We used Label Studio (https://labelstud.io/) with a custom labeling schema. We report the annotation schema and guidelines in the repository associated with the project. A screenshot of an annotated example is shown in Figure 1 for reference.

Figure 1: Screenshot of an annotated example in Label Studio.

Agreement and consistency. The three annotators shared a pool of 100 posts. On these, we computed a Krippendorff's alpha of 0.57 on subjectivity (i.e., whether the post is subjective), 0.60 on the post sentiment, and 0.51 on whether the contextual information was used. The agreement on sentiment increases to 0.61 when considering only posts that were considered subjective by everyone. Moreover, we provided each annotator with a copy of 100 samples, randomly shuffled later in the pool of posts, to validate their consistency over time [24]. Annotators were highly consistent: on average, they annotated subjectivity consistently 95% of the time and sentiment 87% of the time.
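The agreement scores above are Krippendorff's alpha values. For nominal labels with no missing values, alpha can be computed from the coincidence matrix as sketched below; this is a minimal stdlib illustration of the standard formula (α = 1 − Dₒ/Dₑ), and a vetted library implementation is preferable in practice.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data, no missing values.
    `units`: list of lists; each inner list holds the labels assigned
    to one item by the annotators."""
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # single-rating items carry no pairing information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)
    n_c = Counter()  # marginal totals per category
    for (a, _), w in coincidences.items():
        n_c[a] += w
    n = sum(n_c.values())
    # observed disagreement: mass off the diagonal of the coincidence matrix
    d_o = sum(w for (a, b), w in coincidences.items() if a != b)
    # expected disagreement under chance pairing
    d_e = sum(n_c[a] * n_c[b] for a, b in permutations(n_c, 2)) / (n - 1)
    return 1.0 if d_e == 0 else 1 - d_o / d_e
```

Perfect agreement yields α = 1; values around 0.5-0.6, as reported above, indicate moderate agreement, which is common for subjective tasks such as sentiment annotation.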