ConteCorpus: An Analysis of People's Response to Institutional Communications During the Pandemic

Viviana Ventura, Elisabetta Jezek
Department of Humanities, University of Pavia, Pavia, Italy
viviana.ventura01@universitadipavia.it, jezek@unipv.it

Abstract

The study of institutional communication related to the pandemic, and of the population's response to it, is of great relevance today. During 2020, the Italian spokesperson for communication regarding the pandemic was the former Prime Minister Giuseppe Conte. We retrieved 4,860,395 comments from his official Facebook page and built ConteCorpus, a new Italian resource annotated in CoNLL-U format. A first aim of the research was to evaluate the performance of the model used to annotate the corpus. Models trained on social media texts are usually not very generalizable. Nevertheless, the results of the evaluation were good, especially on the parsing metrics, and showed that a parser trained on Twitter data can be successfully applied to Facebook data. A second aim of the research was to provide an overall view of the content of such a large corpus; for this purpose, topic modeling was conducted by training an LDA model. The model generated 5 topics that cover different aspects of the pandemic emergency, from economic to political issues. Through the topic modeling we investigated which topics are prevalent on particular days.

1 Introduction

During 2020, Prime Minister Giuseppe Conte played a major role in institutional communication, particularly in communication regarding the policies undertaken to manage the health emergency. We assumed that content that is interesting from the point of view of the population's response to institutional communications regarding the pandemic would be found on his social media profiles. We therefore created ConteCorpus,[1] retrieving more than 4 million comments from his Facebook page[2] published from January 2020 until December 2020, and we annotated it in CoNLL-U format.[3]

A first aim of the research was to evaluate the performance of the model used to annotate the dataset. Models trained on social media texts are usually poorly generalizable, even to text retrieved from the same social media platform; we therefore wanted to test the performance on Facebook texts of a model trained on Twitter texts. To evaluate the model, we created a gold standard by extracting 1,000 sentences from ConteCorpus and manually revising them.

A second aim of the research was to provide an overall view of this large corpus. For this purpose we performed topic modeling: we trained an LDA model on a 10% sample of ConteCorpus. The LDA model generated 5 topics related to different aspects of the pandemic emergency. The model was then used to see which topics were the most relevant before and after the announcement of the first and the second period of restrictions adopted to fight the pandemic in Italy.

The paper is structured as follows: we first review the literature relevant to our research (section 2), then we describe the data collection and the creation of the corpus (section 3). In section 4, we describe the evaluation of the model we used to annotate the corpus in CoNLL-U format, and in section 5 we report the results of the topic modeling experiment. In section 6 we provide some concluding observations.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

[1] https://github.com/Viviana-dev/Conte_Corpus
[2] https://www.facebook.com/GiuseppeConte64/
[3] https://universaldependencies.org/format.html
Month       Posts   Comments
January        48    115,971
February       59    154,266
March          48    681,221
April          45    775,972
May            26    361,179
June           44    335,772
July           61    449,913
August         24    190,777
September      43    260,237
October        75    666,126
November       33    441,822
December       28    427,139
Total         534  4,860,395

Table 1. Number of posts and comments retrieved for each month.

2 State of the Art

Since the beginning of the health emergency, there has been a proliferation of computational analyses that exploit data extracted from social media. These data are considered relevant because they allow us to generalize about human social and linguistic behavior, especially regarding the pandemic. Among the tasks conducted on data drawn from social media in this period, sentiment analysis, emotion profiling and topic modeling are the most common (Gagliardi et al., 2020; Tamburini, 2020; Vitale et al., 2020; Stella et al., 2020a; Stella et al., 2020b; Stella et al., 2021; De Santis et al., 2020; Sciandra, 2020; Trevisan et al., 2021; Gozzi et al., 2020; Kruspe et al., 2020; Hussain et al., 2021; Chakraborty et al., 2020; Nemes and Kiss, 2020; Jelodar et al., 2021; Lamsal, 2020; Duong et al., 2021; Gupta et al., 2021; Sullivan et al., 2021; Su et al., 2020; Garcia and Berton, 2021; Ahmed et al., 2020).

In particular, topic modeling aims at finding hidden semantic structures within texts and modeling them into concepts. The unsupervised clustering technique LDA (Latent Dirichlet Allocation), developed by Blei et al. (2003), has been used extensively in analyses conducted on social media data during the pandemic (Dashtian and Murthy, 2021; Feng and Zhou, 2020; Ordun et al., 2020; Wang et al., 2020; Kabir and Madria, 2020; Amara et al., 2020; Abd-Alrazaq et al., 2020; Naseem et al., 2021; Low et al., 2020; Andreadis et al., 2021). LDA is a statistical model that represents each document in a corpus as a probabilistic distribution over latent topics, and each topic as a probabilistic distribution over words. A topic has a probability of generating each of the words observed in the corpus; the terms in the set of documents are thus used to discover hidden topics in a large corpus.

As is well known, the language of the web deviates from the standard language in ways that challenge the use of NLP tools. Several classifications have been proposed to label the nature of web and social media language. In general, the labels aim to define a variety of language that is diaphasically low and at an indefinite point on the diamesic axis, e.g., "netspeak" (Crystal, 2001). Web and social media language is characterized by little planning in text structure and a greater propensity for parataxis, absence of revision and punctuation, abrupt interruption of periods, and an imitation of the continuous flow of speech (Fiorentino, 2013). Although some persistent traits of web and social media language can be described, it does not constitute a single variety of language from a sociolinguistic perspective (Fiorentino, 2013). This poses a double challenge for the use of NLP tools: first, the tools are calibrated on standard language resources; secondly, even models better suited to web and social media language would not be generalizable to every language variety found on the web (Sanguinetti et al., 2018).

3 ConteCorpus Construction

3.1 Data Collection

We downloaded 4,860,395 comments and 534 posts published during 2020 on Giuseppe Conte's official Facebook profile. For each 2020 post ID of the page, we made a call to the Facebook Graph API[4] to retrieve the text, object id, and creation time of its comments; the calls were made month by month in the same fashion. As Table 1 shows, a larger amount of data was retrieved in the months of March, April, and October; in the same periods, the most restrictive measures to fight the pandemic were taken by the Italian government.

3.2 Processing with the Neural Pipeline Stanza

After the data collection, we processed the data with the neural pipeline Stanza[5] to enrich the texts with annotations. Stanza is an open-source Python NLP toolkit which "features a language-agnostic fully neural pipeline for text analysis, including tokenization, multiword token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity" (Qi et al., 2020). The toolkit supports more than 77 human languages and uses the Universal Dependencies formalism.[6] Knowing the difficulties of annotating non-standard texts such as those derived from social media, we chose this pipeline because the evaluation of its models found that Stanza's neural language-agnostic architecture "adapts well to text of different genres [...] achieving state-of-the-art or competitive performance at each step of the pipeline" (Qi et al., 2020).

Moreover, the models that can be downloaded from Stanza have each been trained on a single language and on a dataset of a specific text genre. We chose to download the model trained on PoSTWITA-UD,[7] an Italian Twitter treebank in Universal Dependencies (Sanguinetti et al., 2018). Although the language of social media is very peculiar and changes from one social medium to another and from group to group (Fiorentino, 2013), we expected the model downloadable from Stanza, trained on this dataset, to be generalizable to our data, being in-domain. Moreover, Sanguinetti et al. (2018) added customized tags to the UD scheme to deal with some phenomena peculiar to social media: "discourse:emo" for emojis and emoticons, and "parataxis:hashtag" for hashtags. They tagged the links found in some sentences as "dep" (unspecified relation) and used the "upos" (universal part-of-speech) tag "SYM" (symbol) for hashtags and emojis. Additionally, they manually inserted the lemma of non-standard word forms not recognized by the lemmatizer (Sanguinetti et al., 2018).

We processed the data divided into 12 packages, each corresponding to one month of data. We used every processor of the pipeline except the Named Entity Recognition module (TokenizeProcessor, POSProcessor, LemmaProcessor, DepparseProcessor). We configured the model not to split the sentences,[8] forcing the TokenizeProcessor to consider each comment as a single sentence. Furthermore, we added two metadata fields to each sentence: one refers to the id of the post from which the comment was retrieved, and the other is the creation time of the comment. The aim is to make it easier to retrieve comments from the corpus by creation time or post id if one needs to analyze a particular period of time or a particular post.

4 End-to-End Evaluation

4.1 Construction of the Gold Standard

We built a gold standard with a dual purpose: to evaluate the performance of the model on this new collection of social media texts, and to create a standard that can be used for future training and testing. We randomly selected 83 sentences from each file of the automatically annotated corpus (one file contains one month of comments) and manually revised the 1,000 sentences collected. The manual revision followed the principle that what is understandable by a human is correct.

4.2 Evaluation with the CoNLL 2018 UD Shared Task Official Evaluation Script

To perform the evaluation, we used the CoNLL 2018 UD shared task official evaluation script.[9] Table 2 shows the scores of the evaluation metrics resulting from the performance of the Stanza model on the test set of ConteCorpus. Table 3 compares the scores of the evaluation metrics resulting from the performance of the Stanza model on the test set of PoSTWITA-UD and on that of ConteCorpus. The first two columns are the scores on the metrics that evaluate segmentation. The row called UPOS shows the resulting scores on the universal part-of-speech tagging metric, XPOS on the language-specific part-of-speech tagging metric, and UFeats on the morphological features tagging metric. The last 5 rows show scores on five different parsing metrics. The evaluation of the parser starts by aligning system nodes and gold nodes; their respective parent nodes are also considered: if the system parent is not aligned with the gold parent, or if the relation label differs, the word is not counted as correctly attached.

Table 2. Performance of Stanza's UD pre-trained model tested on the test set of ConteCorpus.

Table 3. Performance of Stanza's UD pre-trained model tested on the official test set of PoSTWITA-UD and on the test set of ConteCorpus. The scores shown are calculated using the F-measure.

What we found most challenging during the manual revision of the 1,000 automatically annotated sentences was correcting errors in tokenization: many words that the tokenizer should have split were joined together. This type of tokenization error is often found when punctuation is used with a non-standard function. For example, we found the token "oneste…volevo" ("honest…I wanted to"), an adjective, a punctuation mark and a verb, conflated into a single token. In the manual revision, tokens like this were split into three different tokens and the missing tags were added. The presence of such conflated words may have caused a worse score on the metric that evaluates segmentation, and consequently on the other scores. Although errors in segmentation seem frequent in the corpus, this did not cause an excessive lowering of the scores on the various metrics reported in Tables 2 and 3. Another frequent error regards the lemma assigned to abbreviations that are not present in PoSTWITA-UD. Canonical abbreviations are tagged correctly, for example "cmq" for "comunque" ("however"); the abbreviations tagged incorrectly are those that appear only a few times, such as "ql", which stands for "quelli" ("those").

An unexpectedly good result was achieved on the parsing metrics. This could be due to the "preference of UD scheme in assigning headedness to content words" (Sanguinetti et al., 2018): the tendency of social media language to eliminate function words therefore does not affect the performance of the parser. Another explanation can be found in the very similar frequency distributions of parts of speech and syntactic relations in the training set and the gold standard, as shown in Figures 1 and 2.

Figure 1. Frequency distribution of syntactic relation tags in the training set and the gold standard.

Figure 2. Frequency distribution of part-of-speech tags in the training set and the gold standard.

Overall, the model trained on PoSTWITA-UD turned out to perform well on the test set of ConteCorpus because the PoSTWITA-UD tagset has been adapted with attention to some recurrent features of social media language. Our evaluation showed that a model trained on texts retrieved from one social medium can adapt well to texts from another if one pays attention to the neural architecture of the model and the annotation format being used.

5 Topic Modeling

To provide an overall view of the content of this large corpus, we performed topic modeling, training and testing an LDA model on ConteCorpus.

5.1 Methodology

To perform topic modeling, we sampled 10% of the sentences in our dataset and trained an LDA model, treating each sentence as a document. We pre-processed the lemmas by removing stopwords, using the Italian stopword list from the NLTK (Natural Language Toolkit) library[10] and manually adding missing stopwords. We filtered out tokens that appear in fewer than 15 documents and tokens with fewer than three letters; additionally, we kept only the 100,000 most frequent words. We transformed the documents into vectors, creating a bag-of-words representation of each document. Then we applied term frequency-inverse document frequency (TF-IDF) weighting to the whole corpus to assign higher weights to the most important words. The Gensim LDA model[11] was applied first to the bags of words and then to the TF-IDF corpus to extract latent topics; better performance was achieved with the LDA model applied to the bags of words.

We determined the optimal number of topics using the Coherence Value metric.[12] The underlying idea is that a good model will generate topics with a high topic Coherence Value score. We ran different LDA experiments varying the number of topics and selected the model with the highest mean topic Coherence Value score. Our final model generated 5 topics and has a mean topic Coherence Value score of 0.5. Table 4 illustrates the ten most representative terms associated with each detected topic.

Topic 1 (Economics): pagare (to pay), soldo (money), italia (italy), euro, chiudere (to close), mese (month), debito (debt), azienda (company), prestito (loan), fondo (capital)
Topic 2 (Prime Minister): presidente (prime minister), grazie (thank you), Conte, lavoro (work), bravo, italia (italy), italiano (italian), signore (sir), giuseppe, caro (dear)
Topic 3 (Politics): italiano (italian), europa (europe), italia (italy), paese (country), banca (bank), popolo (people), governo (government), chiedere (to ask), germania (germany), storia (story)
Topic 4 (Pandemic): uscire (to go out), miliardo (billion), firmare (to sign), virus, decreto (decree), Salvini, maria, pandemia (pandemic), chiedere (to ask), italy
Topic 5 (Home): sperare (to hope), casa (home), aspettare (to wait), perdere (to lose), impresa (business), tedesco (german), subito (immediately), tempo (time), lavorare (to work), stipendio (salary)

Table 4. Topics generated by the LDA model, with the ten most frequent terms of each (English glosses in parentheses).

Figure 3. Intertopic distance map and top-30 most relevant terms for Topic 1. For a better view visit: https://sites.google.com/view/ldavisualizationcontecorpus/home-page.

5.2 Results

As expected, all the topics extracted from the corpus are related to concerns about the emergency. The focus is on the economic aspect of the emergency. The ten most frequent words in the Economics topic (Table 4 and Figure 3) are economic terms: "loan", "company", "to pay", "money", etc. In all the other topics, at least one of the ten most frequent words comes from the economic sphere. Among the ten most frequent words of each topic, there are only two words regarding the pandemic itself, both found in the Pandemic topic: "virus" and "pandemic". It is no coincidence that the most frequent word in this topic is "to go out". The need to face the emergency through the intervention of the institutions is evident. This is shown especially by the Prime Minister and Politics topics (Table 4). The most frequent words of the Prime Minister topic are related to the Prime Minister; words like "bravo", "thank you" and "dear" perhaps show a positive judgement towards him. In the Politics topic one finds words of the institutional sphere such as "country", "government", "people", and "bank". The Home topic is related to the private sphere, with words like "to hope", "home", "to wait", and "to lose", although there is no shortage of words from the economic sphere.

In Figure 3, the distance between the centres of the circles indicates the similarity between the topics. One can see that only the Economics topic and the Prime Minister topic overlap, which indicates that these two topics are more similar to each other than to the other topics. Moreover, the size of the area of each circle represents the importance of the topic relative to the corpus: the Economics topic is the most important topic in the corpus.

Finally, we tested our model on unseen documents: the comments published between 15 February and 30 March 2020, before and after the announcement of the first period of restrictions to combat the pandemic, and between 1 October and 14 November 2020, before and after the announcement of the second period of restrictions. Figures 4, 5 and 6 show the trends in topics over time; each line represents a topic, and the x-axis shows the time progression. On 23 February, the first restrictive policies were announced for some Italian cities: Figure 5 shows a peak in the Pandemic topic on that day. A peak in the Economics topic occurred on 18 March: in those days, discussions were taking place on whether to ask the European Union for financial aid to overcome the pandemic. Figure 4 shows how the prevalence of the five topics changed on 8-12 March 2020: there is a peak on 9 March in the Prime Minister topic, the day on which Conte announced the first period of national restrictions to combat the pandemic. Overall, the prevalent topics in those days are Economics and Pandemic. On 13 October, after a summer without major restrictions and with a new exponential increase in the contagion curve, the Italian Parliament passed a decree limiting the possibility of aggregation: on that day we see a new peak in the Pandemic topic (Figure 6). In the days that followed, the prevailing topic is Economics: on 28 October, the "ristoro" decree was approved to financially support commercial activities. The prevailing topics are therefore usually related to current events.

Figure 4. Prevalence of topics during the days 8-12 March 2020.

Figure 5. Prevalence of topics during the days 15 February-30 March 2020.

Figure 6. Prevalence of topics during the days 1 October-15 November 2020.

6 Concluding Observations

As mentioned above, models trained on data from social media are hardly generalizable. This stems from the fact that, from a sociolinguistic perspective, the language of social media does not constitute a single variety. We therefore expected the results on the various evaluation metrics to be worse than the results of the evaluation conducted on the PoSTWITA-UD test set. Surprisingly, on some metrics the results on the ConteCorpus test set were better than the results on the PoSTWITA-UD test set. To offer an overall view of the content of ConteCorpus we performed topic modeling. The topics generated by the LDA model cover various aspects of the pandemic emergency, with a preponderance of political and economic issues. Unexpectedly, the topics identified do not show concern regarding the risk of contagion and the possibility of catching the disease.

[4] https://developers.facebook.com/docs/graph-api?locale=it_IT
[5] https://stanfordnlp.github.io/stanza/
[6] Universal Dependencies (UD) is a "framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages" (https://universaldependencies.org/).
[7] https://universaldependencies.org/treebanks/it_postwita/index.html
[8] Sentence segmentation and tokenization are jointly performed by the TokenizeProcessor (Qi et al., 2020).
[9] https://universaldependencies.org/conll18/evaluation.html
[10] https://www.nltk.org/
[11] https://radimrehurek.com/gensim/models/ldamodel.html
[12] The Coherence Value metric was developed by Röder et al. (2015). It evaluates a single topic by measuring the degree of semantic similarity between the high-scoring words in the topic.
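The collection step of Section 3.1 (one Graph API request per post, following the API's cursor-based pagination, repeated month by month) can be sketched as follows. This is a minimal illustration, not the original script: the function name `fetch_comments`, the API version in the URL, and the injectable `get_json` fetcher are our own choices.

```python
import json
import urllib.parse
import urllib.request

GRAPH = "https://graph.facebook.com/v9.0"  # API version is an assumption


def _get_json(url):
    """Default fetcher: GET a URL and decode the JSON response."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def fetch_comments(post_id, token, get_json=_get_json):
    """Yield (id, message, created_time) for every comment on a post,
    following the Graph API's cursor-based pagination."""
    params = urllib.parse.urlencode({
        "fields": "id,message,created_time",
        "limit": 100,
        "access_token": token,
    })
    url = f"{GRAPH}/{post_id}/comments?{params}"
    while url:
        page = get_json(url)
        for c in page.get("data", []):
            yield c["id"], c.get("message", ""), c["created_time"]
        # `paging.next` is a full URL embedding the cursor; absent on the last page
        url = page.get("paging", {}).get("next")
```

Injecting the fetcher keeps the pagination logic testable offline; in production the default `_get_json` issues the real HTTP requests with a valid page access token.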
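The processing step of Section 3.2 (a Stanza pipeline without NER, one comment per sentence, plus two metadata comments per sentence) might look like the sketch below. The helper names `annotate` and `conllu_sentence`, and the exact metadata comment syntax (`# post_id = …`, `# created_time = …`), are our assumptions; the `postwita` package name and the `tokenize_no_ssplit` option come from the Stanza documentation.

```python
def conllu_sentence(tokens, post_id, created_time):
    """Render one comment (a list of token dicts in the shape produced by
    Stanza's Document.to_dict()) as a CoNLL-U sentence, prefixed with the
    two metadata comments linking it to its post and creation time."""
    lines = [f"# post_id = {post_id}", f"# created_time = {created_time}"]
    for t in tokens:
        # The ten CoNLL-U columns; "_" for any field the annotator left empty.
        lines.append("\t".join(str(t.get(k, "_")) for k in (
            "id", "text", "lemma", "upos", "xpos", "feats",
            "head", "deprel", "deps", "misc")))
    return "\n".join(lines) + "\n"


def annotate(comments):
    """comments: iterable of (post_id, created_time, text) triples.
    Yields one CoNLL-U sentence block per comment."""
    import stanza  # pip install stanza; stanza.download("it", package="postwita")
    # NER is skipped; tokenize_no_ssplit=True keeps each comment as one sentence.
    nlp = stanza.Pipeline(lang="it", package="postwita",
                          processors="tokenize,pos,lemma,depparse",
                          tokenize_no_ssplit=True)
    for post_id, created, text in comments:
        doc = nlp(text)
        for sent in doc.to_dict():
            yield conllu_sentence(sent, post_id, created)
```

Keeping the serializer separate from the pipeline makes it easy to regenerate the corpus files month by month, as the paper describes.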
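The LDA methodology of Section 5.1 (stopword and short-token filtering, dictionary pruning at 15 documents and 100,000 types, bag-of-words vectors, and c_v coherence for model selection) can be sketched with gensim as below. `clean` and `train_lda` are our names, and the number of training `passes` is an assumption the paper does not state.

```python
def clean(lemmas, stopwords):
    """Drop stopwords and tokens shorter than three letters (Section 5.1).
    The stopword set would combine NLTK's Italian list with manual additions."""
    return [w for w in lemmas if w not in stopwords and len(w) >= 3]


def train_lda(docs, num_topics=5):
    """docs: one list of cleaned lemmas per comment (each comment = a document).
    Returns the fitted model and its mean c_v topic coherence."""
    from gensim.corpora import Dictionary  # pip install gensim
    from gensim.models import CoherenceModel, LdaModel

    dictionary = Dictionary(docs)
    # As in the paper: drop tokens occurring in fewer than 15 documents and
    # keep only the 100,000 most frequent words (no upper-frequency cut).
    dictionary.filter_extremes(no_below=15, no_above=1.0, keep_n=100_000)
    bow = [dictionary.doc2bow(doc) for doc in docs]
    lda = LdaModel(bow, id2word=dictionary, num_topics=num_topics, passes=10)
    coherence = CoherenceModel(model=lda, texts=docs,
                               dictionary=dictionary, coherence="c_v")
    return lda, coherence.get_coherence()
```

Model selection as described in the paper then amounts to calling `train_lda` for several values of `num_topics` and keeping the model with the highest coherence score.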
References

Abd-Alrazaq, A., Alhuwail, D., Househ, M., Hamdi, M., & Shah, Z. (2020). Top concerns of tweeters during the COVID-19 pandemic: infoveillance study. Journal of Medical Internet Research, 22(4), e19016.

Ahmed, M. E., Rabin, M. R. I., & Chowdhury, F. N. (2020). COVID-19: Social media sentiment analysis on reopening. arXiv preprint arXiv:2006.00804.

Amara, A., Taieb, M. A. H., & Aouicha, M. B. (2021). Multilingual topic modeling for tracking COVID-19 trends based on Facebook data analysis. Applied Intelligence, 51(5), 3052-3073.

Andreadis, S., Antzoulatos, G., Mavropoulos, T., Giannakeris, P., Tzionis, G., Pantelidis, N., ... & Kompatsiaris, I. (2021). A social media analytics platform visualising the spread of COVID-19 in Italy via exploitation of automatically geotagged tweets. Online Social Networks and Media, 23, 100134.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.

Chakraborty, K., Bhatia, S., Bhattacharyya, S., Platos, J., Bag, R., & Hassanien, A. E. (2020). Sentiment Analysis of COVID-19 tweets by Deep Learning Classifiers: A study to show how popularity is affecting accuracy in social media. Applied Soft Computing, 97, 106754.

Crystal, D. (2001). Language and the Internet. Cambridge University Press.

Dashtian, H. and Murthy, D. (2021). CML-COVID: A large-scale COVID-19 Twitter dataset with latent topics, sentiment and location information. arXiv preprint arXiv:2101.12202.

De Santis, E., Martino, A., & Rizzi, A. (2020). An Infoveillance System for Detecting and Tracking Relevant Topics from Italian Tweets During the COVID-19 Event. IEEE Access, 8, 132527-132538.

Dozat, T. and Manning, C. D. (2017). Deep biaffine attention for neural dependency parsing. Proceedings of the 2017 International Conference on Learning Representations (ICLR).

Duong, V., Luo, J., Pham, P., Yang, T., & Wang, Y. (2020). The ivory tower lost: How college students respond differently than the general public to the COVID-19 pandemic. IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) (pp. 126-130).

Feng, Y. and Zhou, W. (2020). Is working from home the new norm? An observational study based on a large geo-tagged COVID-19 Twitter dataset. arXiv preprint arXiv:2006.08581.

Fiorentino, G. (2013). "Wild language" goes Web: new writers and old problems in the elaboration of the written code. In E. Miola (Ed.), Languages Go Web. Standard and non-standard languages on the Internet (pp. 67-90). Alessandria: Edizioni dell'Orso.

Gagliardi, G., Gregori, L., & Suozzi, A. (2021). L'impatto emotivo della comunicazione istituzionale durante la pandemia di Covid-19: uno studio di Twitter Sentiment Analysis. Proceedings of the Seventh Italian Conference on Computational Linguistics, CLiC-it 2020, Bologna, Italy. Volume 2769 of CEUR Workshop Proceedings.

Garcia, K. and Berton, L. (2021). Topic detection and sentiment analysis in Twitter content related to COVID-19 from Brazil and the USA. Applied Soft Computing, 101, 107057.

Gozzi, N., Tizzani, M., Starnini, M., Ciulla, F., Paolotti, D., Panisson, A., & Perra, N. (2020). Collective Response to Media Coverage of the COVID-19 Pandemic on Reddit and Wikipedia: Mixed-Methods Analysis. Journal of Medical Internet Research, 22(10), e21597.

Gupta, V., Jain, N., Katariya, P., Kumar, A., Mohan, S., Ahmadian, A., & Ferrara, M. (2021). An emotion care model using multimodal textual analysis on COVID-19. Chaos, Solitons & Fractals, 144, 110708.

Hussain, A., Tahir, A., Hussain, Z., Sheikh, Z., Gogate, M., Dashtipour, K., et al. (2021). Artificial Intelligence-Enabled Analysis of Public Attitudes on Facebook and Twitter Toward COVID-19 Vaccines in the United Kingdom and the United States: Observational Study. Journal of Medical Internet Research, 23(4), e26627.

Jelodar, H., Wang, Y., Orji, R., & Huang, S. (2020). Deep sentiment classification and topic discovery on novel coronavirus or COVID-19 online discussions: NLP using LSTM recurrent neural network approach. IEEE Journal of Biomedical and Health Informatics, 24(10), 2733-2742.

Kruspe, A., Häberle, M., Kuhn, I., & Zhu, X. X. (2020). Cross-language sentiment analysis of European Twitter messages during the COVID-19 pandemic. arXiv preprint arXiv:2008.12172.

Lamsal, R. (2020). Design and analysis of a large-scale COVID-19 tweets dataset. Applied Intelligence, 1-15.

Lomborg, S. and Bechmann, A. (2014). Using APIs for data collection on social media. The Information Society, 30(4), 256-265.

Low, D. M., Rumker, L., Talkar, T., Torous, J., Cecchi, G., & Ghosh, S. S. (2020). Natural Language Processing Reveals Vulnerable Mental Health Support Groups and Heightened Health Anxiety on Reddit During COVID-19: Observational Study. Journal of Medical Internet Research, 22(10), e22635.

Naseem, U., Razzak, I., Khushi, M., Eklund, P. W., & Kim, J. (2021). COVIDSenti: A large-scale benchmark Twitter data set for COVID-19 sentiment analysis. IEEE Transactions on Computational Social Systems.

Nemes, L. and Kiss, A. (2021). Social media sentiment analysis based on COVID-19. Journal of Information and Telecommunication, 5(1), 1-15.

Ordun, C., Purushotham, S., & Raff, E. (2020). Exploratory analysis of COVID-19 tweets using topic modeling, UMAP, and digraphs. arXiv preprint arXiv:2005.03082.

Qi, P., Zhang, Y., Zhang, Y., Bolton, J., & Manning, C. D. (2020). Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. Association for Computational Linguistics (ACL) System Demonstrations.

Röder, M., Both, A., & Hinneburg, A. (2015). Exploring the space of topic coherence measures. Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (pp. 399-408).

Sanguinetti, M., Bosco, C., Lavelli, A., Mazzei, A., Antonelli, O., & Tamburini, F. (2018). PoSTWITA-UD: an Italian Twitter Treebank in Universal Dependencies. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Sciandra, A. (2020). COVID-19 Outbreak through Tweeters' Words: Monitoring Italian Social Media Communication about COVID-19 with Text Mining and Word Embeddings. 2020 IEEE Symposium on Computers and Communications (ISCC) (pp. 1-6). IEEE.

Stella, M., Restocchi, V., & De Deyne, S. (2020). #lockdown: Network-enhanced emotional profiling in the time of COVID-19. Big Data and Cognitive Computing, 4(2), 14.

Stella, M. (2020). Cognitive network science reconstructs how experts, news outlets and social media perceived the COVID-19 pandemic. Systems, 8(4), 38.

Stella, M., Vitevitch, M. S., & Botta, F. (2021). Cognitive networks identify the content of English and Italian popular posts about COVID-19 vaccines: Anticipation, logistics, conspiracy and loss of trust. arXiv preprint arXiv:2103.15909.

Su, Y., Xue, J., Liu, X., Wu, P., Chen, J., Chen, C., et al. (2020). Examining the impact of COVID-19 lockdown in Wuhan and Lombardy: A psycholinguistic analysis on Weibo and Twitter. International Journal of Environmental Research and Public Health, 17(12), 4552.

Sullivan, K. J., Burden, M., Keniston, A., Banda, J. M., & Hunter, L. E. (2020). Characterization of Anonymous Physician Perspectives on COVID-19 Using Social Media Data. Pacific Symposium on Biocomputing.

Tamburini, F. (2020). EmoItaly. http://corpora.ficlit.unibo.it/EmoItaly/.

Trevisan, M., Vassio, L., & Giordano, D. (2021). Debate on online social networks at the time of COVID-19: An Italian case study. Online Social Networks and Media, 23, 100136.

Vitale, P., Pelosi, S., & Falco, M. (2020). #andràtuttobene: Images, Texts, Emojis and Geodata in a Sentiment Analysis Pipeline. Proceedings of the Seventh Italian Conference on Computational Linguistics, CLiC-it 2020, Bologna, Italy. Volume 2769 of CEUR Workshop Proceedings. http://ceur-ws.org/Vol-2769/paper_62.pdf.

Wang, J., Zhou, Y., Zhang, W., Evans, R., & Zhu, C. (2020). Concerns Expressed by Chinese Social Media Users During the COVID-19 Pandemic: Content Analysis of Sina Weibo Microblogging Data. Journal of Medical Internet Research, 22(11), e22152.