=Paper=
{{Paper
|id=Vol-3878/83_main_long
|storemode=property
|title=MONICA: Monitoring Coverage and Attitudes of Italian Measures in Response to COVID-19
|pdfUrl=https://ceur-ws.org/Vol-3878/83_main_long.pdf
|volume=Vol-3878
|authors=Fabio Pernisi,Giuseppe Attanasio,Debora Nozza
|dblpUrl=https://dblp.org/rec/conf/clic-it/PernisiAN24
}}
==MONICA: Monitoring Coverage and Attitudes of Italian Measures in Response to COVID-19==
<pdf width="1500px">https://ceur-ws.org/Vol-3878/83_main_long.pdf</pdf>
<pre>
                                MONICA: Monitoring Coverage and Attitudes of Italian
                                Measures in Response to COVID-19
                                Fabio Pernisi¹, Giuseppe Attanasio² and Debora Nozza¹
                                ¹ Department of Computing Sciences, Bocconi University, Milan, Italy
                                ² Instituto de Telecomunicações, Lisbon, Portugal


                                                  Abstract
                                                  Modern social media have long been observed as a mirror for public discourse and opinions. Especially in the face of
                                                  exceptional events, computational language tools are valuable for understanding public sentiment and reacting quickly.
                                                  During the coronavirus pandemic, the Italian government issued a series of financial measures, each unique in target,
                                                  requirements, and benefits. Despite the widespread dissemination of these measures, it is currently unclear how they were
                                                  perceived and whether they ultimately achieved their goal. In this paper, we document the collection and release of MoniCA,
                                                  a new social media dataset for MONItoring Coverage and Attitudes to such measures. Data include approximately ten
                                                  thousand posts discussing a variety of measures in ten months. We collected annotations for sentiment, emotion, irony, and
                                                  topics for each post. We conducted an extensive analysis using computational models to learn these aspects from text. We
                                                  release a compliant version of the dataset to foster future research on computational approaches for understanding public
                                                  opinion about government measures. We release data and code at https://github.com/MilaNLProc/MONICA.

                                                  Keywords
                                                  Sentiment Analysis, Social Media, Computational Social Science, Italian


1. Introduction

Understanding public opinion on governmental decisions has always been crucial for assessing
policies’ effectiveness, especially when facing exceptional events requiring prompt decisions.
Computational linguists and social scientists have long observed modern social media platforms, as
they are a perfect stage for spreading opinions swiftly and transparently. Natural Language
Processing (NLP) techniques have been widely used for analyzing public discussion [e.g., 1, 2, 3].
   The COVID-19 pandemic, arguably the most prominent of such exceptional events, prompted the
Italian government (and other European governments) to release multiple financial measures to
cushion the impact on the population. These so-called “bonuses,” issued pro bono, i.e., with no
interest payments from recipients, aimed at increasing liquidity and reducing tax burdens. However,
although the measures reached varied recipients, how they were received and how effective they
were still needs to be explored.
   To address this gap, we collect and release MoniCA, a new social media dataset for MONItoring
Coverage and Attitudes of Italian measures to COVID-19. MoniCA comprises approximately 10,000
posts spanning ten months collected on X.com. These posts pertain to the Italian public’s
discussions on diverse financial measures introduced during the pandemic. Building on an extensive
body of literature that examines public sentiment during the pandemic [e.g., 4, 5, 6, 7, 8], this
work offers new insights into the limited research specifically addressing Italy.¹
   This paper details the dataset’s collection and release. It introduces the annotations we
compiled for each post, including sentiment, emotion, irony, and discussion topics. Then, we
conducted an analysis using traditional models and transformer-based language models to predict
these aspects from textual data, demonstrating the dataset’s potential usability. Moreover, using
state-of-the-art interpretability tools, we explained the models’ decision processes. We found that
explanations are faithful and plausible to human judgments.
   MoniCA will allow a retrospective examination of the efficacy (and inefficacy) of governmental
measures implemented in Italy during the COVID-19 pandemic, as perceived by the population. By
doing so, we seek to provide insights that can inform policymakers about the strengths and
weaknesses of such financial measures, ensuring better preparedness and response strategies for
any future crises.

¹ See De Rosis et al. [9] for one of the early (and few) works on modelling sentiment from Twitter
  during the COVID-19 outbreak.

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
Email: fabio.pernisi@studbocconi.it (F. Pernisi); giuseppe.attanasio@lx.it.pt (G. Attanasio);
       debora.nozza@unibocconi.it (D. Nozza)
Web: https://gattanasio.cc/ (G. Attanasio); https://deboranozza.com/ (D. Nozza)
ORCID: 0000-0001-6945-3698 (G. Attanasio); 0000-0002-7998-2267 (D. Nozza)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

   Contributions. We release MoniCA, a GDPR-compliant dataset of social media posts to monitor
the coverage and people’s attitude towards the Italian government’s financial aid to combat the
COVID-19 crisis. We collect annotations of several aspects to allow for a finer-grained analysis.
We used state-of-the-art NLP and interpretability tools and reported key insights on public
sentiment.


2. MoniCA

To build a comprehensive resource, reflecting multiple facets of the phenomenon and usable for
future policymakers, we prioritized 1) topic and time coverage in our collection process (§2.1),
and 2) relevance refinement and data annotation to enrich the initial pool with additional
metadata (§2.2).


2.1. Data Collection

We collected approximately 200,000 posts from X in late 2022. We then filtered each post to obtain
data that was in Italian (per the platform-retrieved metadata), not a repost, dated between
March 1, 2021, and December 31, 2021, and selected via hard keyword matching.
   We chose search keywords and phrases that match the informal name of any of the measures, e.g.,
“bonus bicicletta” (eng: bike bonus) or “bonus babysitting”, and downloaded all matching posts.
The keywords we used to identify relevant discussions in the posts were selected based on insights
from an author who is native to Italy and was residing there during the pandemic period
(2019-2022). Additional keyword refinement was supported by details from the National Social
Security Institute (INPS) about COVID-19 measures.²
   Below is the complete list of financial measures on which we focused (see Appendix for
corresponding keywords):

        • Bonus mobilità (Mobility bonus): a contribution of 750 euros that could be used to
          purchase electric scooters, electric or traditional bicycles, or public transport
          subscriptions.
        • Bonus 600 euro: a 600 euro income support allowance provided under Italy’s "Cura Italia"
          decree to self-employed professionals with an active VAT number as of February 23, 2020.
        • Bonus vacanza (Holiday bonus): part of "Decreto Rilancio", it offers up to 500 euros to
          be used for payment of tourism services and packages provided by national tourist
          accommodations, travel agencies, tour operators, farm stays, and bed & breakfasts.
        • Reddito di emergenza (Emergency income): a temporary income support measure established
          by the "Decreto Rilancio" for households facing financial difficulties.
        • Bonus terme (Spa bonus): an incentive (of up to 200 euros) aimed at supporting citizens’
          purchases of spa services at accredited facilities.
        • Bonus babysitter: a measure providing parents of children under 14 in remote learning or
          quarantine with a bonus (up to 1,200 or 2,000 euros) for purchasing babysitting or child
          care services. It is available to certain workers including those in public security and
          healthcare sectors involved in the Covid-19 response.
        • Bonus asilo nido (Daycare/nursery bonus): an income support subsidy aimed at families
          with children under three years old attending public or authorized private nurseries or
          those suffering from severe chronic illnesses. The bonus amount varies based on the
          family’s ISEE income level, with maximum yearly benefits ranging from 1,500 to 3,000
          euros.
        • Bonus figli (Child Bonus): a universal financial aid for families with dependent
          children up to 21 years old, or indefinitely for disabled children. The amount varies
          based on family income (ISEE), the number and age of children, and any disabilities.
        • Bonus partite IVA (VAT Bonus): a one-time 200 euro aid for self-employed and
          professional workers who earned less than 35,000 euros in 2021, have an active VAT
          number, and made at least one contributory payment by May 18, 2022.
        • Bonus sportivi (Sport bonus): a one-time 200 euro incentive to sports collaborators.
        • "Bonus Covid": a 1,600 euro payment for certain categories of workers heavily impacted
          by the COVID-19 crisis. This bonus is available to occasional self-employed workers who
          do not have a VAT number and are not enrolled in other mandatory pension schemes.

   To improve the initial pool quality, we removed duplicates (n=6543). Moreover, after manually
inspecting the pool, we discarded posts related to the keywords “decreti” (eng: decrees) and
“credito d’imposta” (eng: tax credit) as they mainly pulled unrelated or too generic posts. The
resulting collection counts approximately 100,000 posts relative to 12 different queries.
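
The filtering steps of Section 2.1 (language, repost status, date window, hard keyword matching,
then duplicate removal) can be sketched as follows. This is only an illustration under stated
assumptions: the record fields ("text", "lang", "is_repost", "date") are hypothetical names rather
than the platform's schema, only the two keywords quoted in the text are listed, and duplicates are
assumed to mean posts with identical normalized text, which the paper does not specify.

```python
from datetime import date

# Only the two keywords quoted in the text; the full list is in the paper's Appendix.
KEYWORDS = ("bonus bicicletta", "bonus babysitting")
START, END = date(2021, 3, 1), date(2021, 12, 31)


def passes_filters(post):
    """Apply the four filters described in Section 2.1 to one post record."""
    text = post["text"].lower()
    return (
        post["lang"] == "it"                      # Italian, per platform metadata
        and not post["is_repost"]                 # reposts are discarded
        and START <= post["date"] <= END          # March 1 to December 31, 2021
        and any(keyword in text for keyword in KEYWORDS)  # hard keyword match
    )


def filter_and_dedupe(posts):
    """Keep posts that pass the filters, dropping repeated texts.

    Assumption: two posts are duplicates when their lowercased,
    whitespace-normalized texts are identical.
    """
    seen, kept = set(), []
    for post in posts:
        key = " ".join(post["text"].lower().split())
        if passes_filters(post) and key not in seen:
            seen.add(key)
            kept.append(post)
    return kept
```

Substring matching on lowercased text is the simplest reading of "hard keyword matching"; a
tokenizer-aware match would behave differently on hashtags and punctuation.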
² https://www.inps.it/it/it/inps-comunica/notizie/dettaglio-news-page.news.2020.10.misure-covid-19-i-dati-al-10-ottobre-2020.html


2.2. Data Annotation

To balance annotation quantity and quality, we decided to collect extensive annotations for 10% of
the initial pool.
                Subjective       Not Subjective
                96.8%            3.2%

Table 1
Subjectivity in MoniCA.

                Negative         Neutral          Positive
                81%              14%              5%

Table 2
Sentiment in MoniCA.

                Emotion                                          Irony
                Anger    Sadness    Joy     Disgust    Fear
                66.7%    16.8%      5.8%    3.2%       2.2%      13.1%

Table 3
Emotion and irony in MoniCA.

    A critical issue with our initial pool was the presence of news posts, most frequently by
media agencies and newspaper accounts. These posts are irrelevant to our goal of monitoring public
perception of bonuses. Following previous work [7], we conducted a first round of annotation for
relevance. We held round-table meetings to settle on a shared definition of relevance; then, we
assigned 200 posts to each annotator and asked them to mark whether each was relevant. We
considered a tweet irrelevant if it mentions a bonus but focuses on another topic.³ Next, we
trained a supervised classifier to detect relevance and used it to select 10,400 additional posts
from 7238 unique users.⁴
    The annotation was conducted in three iterations. In the first two, we tasked annotators to
annotate a shared set of 100 posts to compute agreement and tune annotation guidelines. Then, we
assigned each annotator 3,333 posts, non-overlapping among them. In the next step we aggregated
the labels. For subjectivity, sentiment, and irony we selected the annotations through majority
voting, while for emotions and topics we used all the emotions and topics identified by any of the
annotators. During this process, we identified some missing values in annotations, which we
removed. The final set comprises 9,763 posts with one annotation each.
    See Appendix B for full details on the annotation process, including pay rates, annotation
platform and guidelines, inter-annotator agreement, intra-annotator consistency over time, and
classifier performance.

Annotation Fields. To conduct the annotation, we provided annotators with i) the post’s main text,
ii) publication date, iii) at most two antecedent posts in the conversation tree, and iv) any
multimedia content if present. When available, the preceding posts and media are the
conversational context and can help disambiguate the post’s meaning.
    Each post was annotated for (1) subjectivity, (2) sentiment, (3) topic, (4) emotion, and
(5) irony. Subjectivity was assessed as binary (subjective or not subjective); sentiment
classification included negative, neutral, and positive categories; the topics were carefully
pre-determined together with annotators, taking into account the aspects we aimed to extract from
the data (see Table 4 for the list of topics); emotions included anger, sadness, joy, disgust, and
fear categories; irony was assessed as binary (ironic or not ironic). Annotators were given the
possibility to select more than one emotion and topic per post. Moreover, we asked annotators to
highlight the (6) span(s) of text that motivated their sentiment annotation. (1), (2), (3), (4)
and (5) will serve to map the public opinion on the studied measures, and (6) will allow us to
verify whether NLP models detect sentiment like a human would (§5).

General Statistics. Tables 1, 2 and 3 report the distribution of subjectivity, sentiment, emotion,
and irony labels over the possible options.
    Similar to related work [6, 7, 8], both sentiment and emotion are heavily skewed toward
negative attitudes. The vast majority of posts (96.8%) are subjective; among them, 78% of the
posts are negative, whereas 62% show anger. Irony notably appears in 5.4% of the posts. Table 4
shows the discussion topics and their proportion. Half of the posts are directed toward
politicians, with an even higher spike in negative sentiment (93.4%).
    These findings, taken together, convey a critical message: The majority of social media
comments about financial aid in Italy in 2021 are from unhappy people. Such users posted on X with
a negative sentiment, showing anger, sadness, disgust, or fear eight times out of ten. Some of our
fine-grained annotations disclose potential reasons: 8.5% of posts mention struggling to obtain a
bonus, 1.4% not having the requisites, and 1.3% not benefiting from or obtaining the bonus.

³ E.g., “@user Ma allora sei grillina ?! Il bonus vacanze l’ha dato lo Stato no De Luca.” En:
  “@user are you grillina then? The state provided the bonus vacanze, not De Luca.” Here, grillina
  is an idiomatic expression indicating someone who votes for the Movimento Cinque Stelle
  political party.
⁴ We selected posts with a relevance score above 0.95, stratifying on the publication month, user
  ID, and matching search query to preserve variety in the data.


3. Experiments

We are particularly interested in verifying whether state-of-the-art NLP tools can help us
automatically model
and detect the users’ opinions. If models succeed at this task, they will serve as a digital
barometer for monitoring issues and pitfalls of state-enacted financial aids.
   We designed five text classification tasks to train a model for automatic (1) Subjectivity,
(2) Sentiment, (3) Emotion, (4) Irony, and (5) Topic detection. (1) and (4) are binary
classification tasks; (2), (3), and (5) are three-, six-, and nine-way multi-class classification
tasks.
   We used Logistic Regression (LR), fine-tuned a pretrained Italian BERT model named UmBERTo
[10], and tested an existing BERT model for emotion and sentiment detection in Italian named
FEEL-IT [11].⁵
   LR has been trained on preprocessed texts: We converted all posts to lowercase and removed
special characters and stopwords, replaced URLs and user handles with special tags, and performed
stemming.
   Given the significant class imbalance in our annotated data, we report both macro and weighted
F1 scores. Macro F1 averages the performance across all classes, highlighting the model’s
effectiveness on minority classes. Weighted F1 adjusts for class distribution, reflecting overall
performance in line with class prevalence. This dual reporting provides a balanced view of the
model’s performance.

⁵ FEEL-IT does not predict the neutral class in the sentiment classification task.

Topics                                               Proportion
Requesting a bonus                                   10.7%
Asking for information                               9.7%
Obtained a bonus                                     2.5%
Not obtained a bonus                                 1.3%
Struggling to obtain a bonus                         8.5%
Struggling to benefit from a bonus                   1.2%
Is interested in a bonus                             13.5%
Does not have the requisites to access a bonus       1.4%
Addressing the political class                       49.3%

Table 4
Topics in MoniCA.


4. Results

Table 5 reports classification performance for every model-task pair in our setup. Our
experiments revealed disparate performance across tasks.
   We observed higher scores on the subjectivity detection task, probably due to the easier binary
setup and the strong class imbalance. Emotion detection proved most challenging due to the subtle
distinctions between classes. Interestingly, UmBERTo classified instances as either anger or joy,
while LR defaulted to anger for all cases. FEEL-IT stood out by successfully identifying sadness
and fear, highlighting the need for more data to capture the full spectrum of emotional nuances.
None of the classifiers ever detected disgust.
   Topic detection was another difficult task. In addition to a higher number of unique topics,
text content among topics might overlap (e.g., users who complain about struggling to get a bonus
might use similar language to those who cannot see benefits from it).
   UmBERTo demonstrated strong performance, excelling in three out of five tasks (avg. Macro F1:
43.18, Weighted F1: 74.8). Interestingly, simpler methods like logistic regression also performed
reliably (avg. Macro F1: 35.68, Weighted F1: 71.88). These results are promising, showing that
both straightforward models and advanced large-scale models (pretrained in the target language,
Italian) can effectively serve as tools for automatic detection of subjectivity, sentiment,
emotion, irony, and public attitudes. However, the natural imbalance in the data plays a
significant role in these experiments, suggesting that further work is needed to address this
issue more effectively.

                       Macro F1                  Weighted F1
                  LR      UB      F-I        LR      UB      F-I
 Subjectivity     49.2    59.9    -          95.3    96.0    -
 Sentiment        42.8    61.1    32.6       78.0    82.7    72.5
 Emotion          16.2    18.0    26.6       57.9    57.0    62.9
 Topic            20.5    30.5    -          46.9    57.9    -
 Irony            49.7    46.4    -          81.3    80.4    -

Table 5
Macro and Weighted F1 of Logistic Regression (LR), fine-tuned UmBERTo (UB) and FEEL-IT (F-I)
predictions on Subjectivity, Sentiment, Emotions, Topic, and Irony. Best models in bold.


5. Explainability Experiments

Interpretability research in NLP has developed methods and tools to help explain the rationale
behind a model prediction. These tools are beneficial to assess and debug models, e.g., by
checking whether a model “is right for the right reason” or the cause of the error [12].
   We conducted an additional interpretability analysis on UmBERTo, the best-performing model
across our detection tasks (see §4). This study aims to verify whether the model’s decision
process aligns with the rationales highlighted by humans. Transparency on model internals and
human alignment promotes accountability and trust.⁶

⁶ EU guidelines: https://bit.ly/eu-ai-guide.

Setup. Following [13, 14], we use four common post-hoc token-level attribution methods [15], i.e.,
LIME [16], SHAP [17], Integrated Gradient [18], and Gradient [19] across different configurations.
Given a model and a model prediction (e.g., Sentiment: “Negative”), each
                                  ...     e     bonus        vacanze          per     tutti    !       !        !
                      LIME       0.10   0.08       0.06          -0.26       -0.10    -0.15   0.07    0.10     0.08
                      Human       0       0         1              1           1        1      0       0         0

Table 6
Explanation of Sentiment: Negative. Gold label: Neutral. Predicted label by UmBERTo: Negative. Token attributions that are
darker red (blue) show higher (lower) contribution to the prediction. Eng: “... and holiday bonus for everyone it is!!!”.
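
To make the comparison in Table 6 concrete, the sketch below scores the LIME attributions against
the human rationale with token-level IOU and F1. It is a minimal illustration, not the evaluation
code used in the paper: the discretization rule (keep the k highest positive attributions, with k
equal to the number of human-selected tokens) is an assumption, and the function names are
hypothetical.

```python
def discretize_top_k(attributions, k):
    """Indices of the k highest positive attributions (assumed discretization rule)."""
    positive = [i for i, score in enumerate(attributions) if score > 0]
    ranked = sorted(positive, key=lambda i: attributions[i], reverse=True)
    return set(ranked[:k])


def token_iou_f1(attributions, rationale):
    """Token-level IOU and F1 between a discretized explanation and a binary rationale."""
    human = {i for i, flag in enumerate(rationale) if flag == 1}
    predicted = discretize_top_k(attributions, len(human))
    overlap = len(human & predicted)
    union = len(human | predicted)
    iou = overlap / union if union else 0.0
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(human) if human else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return iou, f1


# Values copied from Table 6.
tokens = ["...", "e", "bonus", "vacanze", "per", "tutti", "!", "!", "!"]
lime = [0.10, 0.08, 0.06, -0.26, -0.10, -0.15, 0.07, 0.10, 0.08]
human = [0, 0, 1, 1, 1, 1, 0, 0, 0]

iou, f1 = token_iou_f1(lime, human)
print(iou, f1)
```

Under this rule the four top-ranked LIME tokens and the four human-selected tokens do not overlap
for this post, so both scores are 0.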


                                                    aopc     aopc     taucorr   auprc    token    token
                                                    compr↑   suff↓    loo↑      plau↑    f1↑      iou↑
                          Partition SHAP            0.43     0.01      0.19     0.65     0.20     0.12
                          LIME                      0.51     0.00      0.28     0.63     0.19     0.11
                          Gradient                  0.22     0.10      0.01     0.61     0.19     0.11
                          Gradient (x Input)        0.00     0.33     -0.12     0.60     0.17     0.10
                          Integ. Gradient           0.02     0.34     -0.03     0.60     0.17     0.10
                          Integ. Grad. (x Input)    0.29     0.06      0.10     0.62     0.18     0.11

Table 7
XAI methods for explaining the sentiment analysis task (best values in bold, ↑: higher is better, ↓: lower is better).
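
The first two columns of Table 7 (aopc compr, aopc suff) follow the ERASER-style definitions of
comprehensiveness and sufficiency: how much the predicted-class probability changes when the
top-ranked tokens are removed, or when only they are kept. The sketch below illustrates the idea
with a toy scoring function. It is not the ferret implementation: the cut-off fractions and the
`toy_model` function are hypothetical stand-ins for the library's thresholds and the fine-tuned
classifier.

```python
def aopc_scores(tokens, attributions, predict_proba,
                fractions=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Average comprehensiveness and sufficiency over several top-k cut-offs.

    predict_proba maps a token list to the probability of the explained class.
    The fractions are illustrative, not the thresholds used in the paper.
    """
    ranked = sorted(range(len(tokens)), key=lambda i: attributions[i], reverse=True)
    full = predict_proba(tokens)
    compr, suff = [], []
    for fraction in fractions:
        k = max(1, round(fraction * len(tokens)))
        top = set(ranked[:k])
        without_top = [t for i, t in enumerate(tokens) if i not in top]
        only_top = [t for i, t in enumerate(tokens) if i in top]
        compr.append(full - predict_proba(without_top))  # higher is better
        suff.append(full - predict_proba(only_top))      # lower is better
    return sum(compr) / len(compr), sum(suff) / len(suff)


def toy_model(tokens):
    """Hypothetical classifier: 'Negative' is likely only if a cue word is present."""
    return 0.9 if "vergogna" in tokens else 0.1


compr, suff = aopc_scores(
    ["che", "vergogna", "questo", "bonus"],
    [0.0, 0.8, 0.1, 0.05],
    toy_model,
)
```

An explanation that ranks the decisive token first yields high comprehensiveness and
near-zero sufficiency, which is the pattern LIME shows in Table 7.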


method assigns an importance score to each input token for that prediction. Table 6 reports an
explanation example in the first row and the human rationale annotated in the second row.
   We use faithfulness and plausibility [20] to evaluate explanations. Faithfulness evaluates how
accurately the explanation reflects the inner workings of the model. Plausibility, on the other
hand, assesses how well the explanations align with human reasoning. We use the human rationales
provided by the three annotators during the annotation phase, and the UmBERTo model trained on the
sentiment classification task, explaining the most likely class label for each test instance. We
use three faithfulness metrics (Comprehensiveness, Sufficiency, and Correlation with leave-one-out)
and three plausibility metrics (Token IOU, Token F1, AUPRC) as described in DeYoung et al.
[21, ERASER] and leverage ferret [14] for explanation generation and evaluation.
   Table 7 shows that LIME is, on average, the best method to explain predictions, indicating that
LIME provides explanations that are both comprehensive and sufficient.

6. Conclusion

We documented the collection and release of MoniCA, the first large-scale dataset for monitoring
the coverage of and attitudes towards the financial aid enacted by the Italian government during
the COVID-19 pandemic. It comprises around 10,000 posts annotated for subjectivity, sentiment,
emotion, irony, and topic. We conducted a first analysis and discovered that (1) most posts have a
negative tone and (2) NLP and machine learning models can help detect it. Finally, we conducted a
preliminary explainability study to understand how models predict sentiment from text. We found
that explanation quality varies across methods and recommended LIME as a sensible starting choice.
   Our dataset and study fill a critical research gap by examining Italian public sentiment towards
COVID-19 measures. Future research will build on this groundwork to develop more effective opinion
monitoring and mining tools and ultimately inform prompt and targeted policy decisions.
Additionally, to better understand the severity of negative attitudes, future research may
concentrate on examining hate speech in relation to public policies during the pandemic in Italy
[22, 23].

Acknowledgments

This project has in part received funding from Fondazione Cariplo (grant No. 2020-4288, MONICA) and
from the European Research Council (ERC) under the European Union’s Horizon 2020 research and
innovation programme (grant agreement No. 101116095, PERSONAE). Debora Nozza and Fabio Pernisi are
members of the MilaNLP group and the Data and Marketing Insights Unit of the Bocconi Institute for
Data Science and Analysis. Giuseppe Attanasio conducted part of the work as a member of the MilaNLP
group. Additionally, he was partially supported by the Portuguese Recovery and Resilience Plan
through project C645008882-00000055 (Center for Responsible AI) and by Fundação para a Ciência e
Tecnologia through contract UIDB/50008/2020.

Limitations

Our collection might not represent the opinions of the entire population. All posts included in our
dataset were taken from X, whose user base might be skewed towards specific demographics.
   Additionally, a potential limitation might arise from the dependency of our data on keyword
matching. This form of sampling might prevent some topics from being included in the dataset.
However, we carried out keyword selection very carefully, including words and phrases that captured
discussions around pro-bono government aid (see Section 2.2).
   Another limitation is that our data covers a specific but quite broad temporal window from March 1
to December 31, 2021. This window corresponds to one phase of the pandemic, and changes in public
opinion following this period are not captured.

References

 [1] W. Medhat, A. Hassan, H. Korashy, Sentiment analysis algorithms and applications: A survey, Ain
     Shams engineering journal 5 (2014) 1093–1113.
 [2] A. Giachanou, F. Crestani, Like it or not: A survey of twitter sentiment analysis methods, ACM
     Computing Surveys (CSUR) 49 (2016) 1–41.
 [3] C. Qian, N. Mathur, N. H. Zakaria, R. Arora, V. Gupta, M. Ali, Understanding public opinions on
     social media for financial sentiment analysis using ai-based techniques, Information
     Processing & Management 59 (2022) 103098.
 [4] M. Müller, M. Salathé, P. E. Kummervold, Covid-twitter-bert: A natural language processing
     model to analyse covid-19 content on twitter, Frontiers in Artificial Intelligence 6 (2023)
     1023281.
 [5] E. Chen, K. Lerman, E. Ferrara, Tracking social media discourse about the covid-19 pandemic:
     Development of a public coronavirus twitter data set, JMIR Public Health Surveill 6 (2020)
     e19273. URL: http://publichealth.jmir.org/2020/2/e19273/. doi:10.2196/19273.
 [6] S. Kaur, P. Kaul, P. M. Zadeh, Monitoring the dynamics of emotions during covid-19 using
     twitter data, Procedia Computer Science 177 (2020) 423–430.
 [7] K. Scott, P. Delobelle, B. Berendt, Measuring shifts in attitudes towards covid-19 measures in
     belgium, Computational Linguistics in the Netherlands Journal 11 (2021) 161–171.
     URL: https://www.clinjournal.org/clinj/article/view/133.
 [8] T. Wang, K. Lu, K. P. Chow, Q. Zhu, Covid-19 sensing: negative sentiment analysis on social
     media in china via bert model, IEEE Access 8 (2020) 138162–138169.
 [9] S. De Rosis, M. Lopreite, M. Puliga, M. Vainieri, The early weeks of the italian covid-19
     outbreak: sentiment insights from a twitter analysis, Health Policy 125 (2021) 987–994.
[10] L. Breiman, Random forests, Machine learning 45 (2001) 5–32.
[11] F. Bianchi, D. Nozza, D. Hovy, FEEL-IT: Emotion and sentiment classification for the Italian
     language, in: Proceedings of the Eleventh Workshop on Computational Approaches to
     Subjectivity, Sentiment and Social Media Analysis, Association for Computational Linguistics,
     Online, 2021, pp. 76–83.
[12] M. Danilevsky, K. Qian, R. Aharonov, Y. Katsis, B. Kawas, P. Sen, A survey of the state of
     explainable AI for natural language processing, in: Proceedings of the 1st Conference of the
     Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th
     International Joint Conference on Natural Language Processing, Association for Computational
     Linguistics, Suzhou, China, 2020, pp. 447–459.
[13] G. Attanasio, D. Nozza, E. Pastor, D. Hovy, Benchmarking post-hoc interpretability approaches
     for transformer-based misogyny detection, in: Proceedings of NLP Power! The First Workshop on
     Efficient Benchmarking in NLP, Association for Computational Linguistics, Dublin, Ireland,
     2022, pp. 100–112.
[14] G. Attanasio, E. Pastor, C. Di Bonaventura, D. Nozza, ferret: a framework for benchmarking
     explainers on transformers, in: Proceedings of the 17th Conference of the European Chapter of
     the Association for Computational Linguistics: System Demonstrations, Association for
     Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 256–266.
     URL: https://aclanthology.org/2023.eacl-demo.29. doi:10.18653/v1/2023.eacl-demo.29.
[15] A. Madsen, S. Reddy, S. Chandar, Post-hoc interpretability for neural nlp: A survey, ACM
     Computing Surveys 55 (2022) 1–42.
[16] M. T. Ribeiro, S. Singh, C. Guestrin, "why should i trust you?" explaining the predictions of
     any classifier, in: Proceedings of the 22nd ACM SIGKDD international conference on knowledge
     discovery and data mining, 2016, pp. 1135–1144.
[17] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, in:
     Proceedings of the 31st International Conference on Neural Information Processing Systems,
     NIPS’17, Curran Associates Inc., Red Hook, NY, USA, 2017, p. 4768–4777.
[18] M. Sundararajan, A. Taly, Q. Yan, Axiomatic attribution for deep networks, in: Proceedings of
     the 34th International Conference on Machine Learning -
     Volume 70, ICML’17, JMLR.org, 2017, p. 3319–3328.
[19] K. Simonyan, A. Vedaldi, A. Zisserman, Deep inside convolutional networks: Visualising image
     classification models and saliency maps, CoRR abs/1312.6034 (2013).
[20] A. Jacovi, Y. Goldberg, Towards faithfully interpretable NLP systems: How should we define and
     evaluate faithfulness?, in: Proceedings of the 58th Annual Meeting of the Association for
     Computational Linguistics, Association for Computational Linguistics, Online, 2020,
     pp. 4198–4205.
[21] J. DeYoung, S. Jain, N. F. Rajani, E. Lehman, C. Xiong, R. Socher, B. C. Wallace, ERASER: A
     benchmark to evaluate rationalized NLP models, in: Proceedings of the 58th Annual Meeting of
     the Association for Computational Linguistics, Association for Computational Linguistics,
     Online, 2020, pp. 4443–4458. URL: https://aclanthology.org/2020.acl-main.408.
     doi:10.18653/v1/2020.acl-main.408.
[22] D. Nozza, F. Bianchi, G. Attanasio, HATE-ITA: Hate speech detection in Italian social media
     text, in: Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH), Association for
     Computational Linguistics, Seattle, Washington (Hybrid), 2022, pp. 252–260.
[23] F. M. Plaza-del-Arco, D. Nozza, D. Hovy, Respectful or toxic? using zero-shot learning with
     language models to detect hate speech, in: The 7th Workshop on Online Abuse and Harms (WOAH),
     Association for Computational Linguistics, Toronto, Canada, 2023, pp. 60–68.
[24] G. Abercrombie, D. Hovy, V. Prabhakaran, Temporal and second language influence on
     intra-annotator agreement and stability in hate speech labelling, in: Proceedings of the 17th
     Linguistic Annotation Workshop (LAW-XVII), Association for Computational Linguistics, Toronto,
     Canada, 2023.

A. Data Collection

Data for the MoniCA dataset was gathered using X’s proprietary historical API, via an academic
subscription.
   Below is the complete list of keywords used for data collection, in the form of a tweepy
(https://www.tweepy.org/) query:
     • Bonus mobilità (Mobility bonus): "bonus mobilita" OR "bonus bici" OR "bonus monopattino" OR
       #bonusmobilita OR #bonusbici OR #bonusmonopattino
     • Bonus 600 euro: "bonus 600 euro" OR "bonus 600euro" OR "bonus 600" OR #bonus600euro OR
       #bonus600
     • Bonus vacanza (Holiday bonus): "bonus vacanza" OR "bonus vacanze" OR "bonus vacanze" OR
       #bonusvacanza OR #bonusvacanze
     • Reddito di emergenza (Emergency income): "reddito d’emergenza" OR "reddito di emergenza" OR
       #redditodemergenza OR #redditodiemergenza OR #REM
     • Bonus terme (Spa bonus): "bonus terme" OR #bonusterme
     • Bonus babysitter: "bonus babysitter" OR "bonus baby-sitter" OR "bonus babysitting" OR
       "bonus baby-sitting" OR #bonusbabysitter OR #bonusbabysitting
     • Bonus asilo nido (Daycare/nursery bonus): "bonus asilo nido" OR #bonusasilonido
     • Bonus figli (Child Bonus): "bonus figli" OR #bonusfigli
     • Bonus partite IVA (VAT Bonus): "bonus partite iva" OR #bonuspartiteiva
     • Bonus sportivi (Sport bonus): "bonus lavoratori sportivi" OR "bonus sportivi" OR
       (bonus lavoratori sportivi) OR (bonus collaboratori sportivi) OR
       "bonus collaboratori sportivi" OR #bonussportivi
     • "Bonus Covid": "bonus covid" OR #bonuscovid

B. Data Annotation

Profile and pay rate. For annotating the MoniCA dataset, three student research assistants with
backgrounds in Machine Learning and Natural Language Processing were hired full-time. They were each
compensated for 32 hours of work at a rate of about 18 euros per hour. We provided each annotator
with an initial set of annotation guidelines, and we organized initial meetings to familiarize them
with the task and refine the guidelines.

Platform. We used Label Studio (https://labelstud.io/) with a custom labeling schema. We report the
annotation schema and guidelines in the repository associated with the project. A screenshot of an
annotated example is shown in Figure 1 for reference.

Figure 1: Screenshot of an annotated example in Label Studio.

Agreement and consistency. The three annotators shared a pool of 100 posts. On these, Krippendorff’s
alpha was 0.57 on subjectivity (i.e., whether the post is subjective or not), 0.60 on the post
sentiment, and 0.51 on whether the contextual information was used. The agreement on sentiment
increases to 0.61 when considering only posts that were considered subjective by everyone.
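
The agreement scores above are Krippendorff's alpha values. The text does not say which
implementation was used, so the following is only a minimal sketch of the nominal variant in
Python. The function name krippendorff_alpha_nominal and the toy labels are hypothetical and are
not taken from the MoniCA release.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha.

    `units` holds one list per post with one label per annotator;
    use None where an annotator gave no label.
    """
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m >= 2:
            # Every ordered pair of labels from two different annotators
            # counts 1 / (m - 1) towards the coincidence matrix.
            for a, b in permutations(values, 2):
                coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and overall number of pairable labels.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected_pairs = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    )
    if expected_pairs == 0:
        # Undefined: no pairable labels, or only one category was ever used.
        return float("nan")
    return 1.0 - observed * (n - 1) / expected_pairs


# Toy example (made-up labels): three annotators, one missing label.
toy_posts = [
    ["negative", "negative", "negative"],
    ["negative", "neutral", "negative"],
    ["positive", "positive", None],
]
print(round(krippendorff_alpha_nominal(toy_posts), 2))
```

Restricting `units` to the posts that every annotator marked as subjective would correspond to the
second sentiment figure reported above.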
   Moreover, to validate consistency over time [24], we gave each annotator a second copy of 100
samples, randomly shuffled into a later part of their pool of posts. Annotators were highly
consistent. On average, they annotated subjectivity consistently 95% of the time and sentiment 87%
of the time.
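
As an illustration of the consistency check, a rate of this kind can be computed as the share of
duplicated posts that received the same label on both passes, averaged over annotators. This is a
sketch under that assumption, not the authors' script; the function names, the dictionary layout,
and the toy labels are all hypothetical.

```python
from statistics import mean


def consistency_rate(first_pass, second_pass):
    """Share of re-annotated posts that received the same label both times.

    Both arguments map a post id to the label given on that pass.
    """
    shared_ids = first_pass.keys() & second_pass.keys()
    if not shared_ids:
        raise ValueError("the two passes share no post ids")
    matches = sum(first_pass[i] == second_pass[i] for i in shared_ids)
    return matches / len(shared_ids)


def average_consistency(passes_by_annotator):
    """Average the per-annotator rates; input maps annotator -> (first, second)."""
    return mean(consistency_rate(a, b) for a, b in passes_by_annotator.values())


# Toy example (made-up labels) for one annotator and four duplicated posts.
first = {"p1": "negative", "p2": "positive", "p3": "negative", "p4": "neutral"}
second = {"p1": "negative", "p2": "negative", "p3": "negative", "p4": "neutral"}
print(consistency_rate(first, second))
```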

</pre>