ConteCorpus: An Analysis of People's Response to Institutional Communications During the Pandemic

Viviana Ventura, Elisabetta Jezek
Department of Humanities, University of Pavia, Pavia, Italy
viviana.ventura01@universitadipavia.it, jezek@unipv.it

Abstract

The study of institutional communication related to the pandemic, and of the population's response to it, is of great relevance today. During 2020, the Italian spokesperson for communication regarding the pandemic was the former Prime Minister Giuseppe Conte. We retrieved 4,860,395 comments from his official Facebook page and built ConteCorpus, a new Italian resource annotated in CoNLL-U format. A first aim of the research was to evaluate the performance of the model used to annotate the corpus. Models trained on social media texts are usually not very generalizable. Nevertheless, the results of the evaluation were good, especially on the parsing metrics, and showed that a parser trained on Twitter data can be successfully applied to Facebook data. A second aim of the research was to provide an overall view of the content of such a large corpus; for this purpose, topic modeling was conducted by training an LDA model. The model generated 5 topics that cover different aspects of the pandemic emergency, from economic to political issues. Through the topic modeling we investigated which topics are prevalent on particular days.

1 Introduction

During 2020, Prime Minister Giuseppe Conte played a major role in institutional communication, particularly in communication regarding the policies undertaken to manage the health emergency. We assumed that content that is interesting from the point of view of the population's response to institutional communications regarding the pandemic would be found on his social media profiles. We therefore created ConteCorpus,[1] retrieving more than 4 million comments from his Facebook page[2] published from January 2020 until December 2020, and we annotated it in CoNLL-U format.[3]

A first aim of the research was to evaluate the performance of the model used to annotate the dataset. Models trained on social media texts are usually poorly generalizable, even to text retrieved from the same social media platform; we therefore wanted to test the performance on Facebook texts of a model trained on Twitter texts. To evaluate the model, we created a gold standard by extracting 1,000 sentences from ConteCorpus and manually revising them.

A second aim of the research was to provide an overall view of this large corpus. For this purpose we performed topic modeling: we trained an LDA model on a 10% sample of ConteCorpus. The LDA model generated 5 topics related to different aspects of the pandemic emergency. The model was then used to see which topics were the most relevant before and after the announcement of the first and the second period of restrictions adopted to fight the pandemic in Italy.

The paper is structured as follows: we first review the literature relevant to our research (section 2), then we describe the data collection and the creation of the corpus (section 3). In section 4, we describe the evaluation of the model we used to annotate the corpus in CoNLL-U format, and in section 5 we report the results of the topic modeling experiment. In section 6 we provide some concluding observations.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

[1] https://github.com/Viviana-dev/Conte_Corpus
[2] https://www.facebook.com/GiuseppeConte64/
[3] https://universaldependencies.org/format.html
Month       Posts   Comments
January        48    115,971
February       59    154,266
March          48    681,221
April          45    775,972
May            26    361,179
June           44    335,772
July           61    449,913
August         24    190,777
September      43    260,237
October        75    666,126
November       33    441,822
December       28    427,139
Total         534  4,860,395

Table 1. Number of posts and comments retrieved for each month.

2 State of the Art

Since the beginning of the health emergency, there has been a proliferation of computational analyses that exploit data extracted from social media. These data are considered relevant because they allow us to generalize about human social and linguistic behavior, especially regarding the pandemic. Among the tasks conducted on data drawn from social media in this period, sentiment analysis, emotion profiling and topic modeling are the most common (Gagliardi et al., 2020; Tamburini, 2020; Vitale et al., 2020; Stella et al., 2020a; Stella et al., 2020b; Stella et al., 2021; De Santis et al., 2020; Sciandra, 2020; Trevisan et al., 2021; Gozzi et al., 2020; Kruspe et al., 2020; Hussain et al., 2021; Chakraborty et al., 2020; Nemes and Kiss, 2020; Jelodar et al., 2021; Lamsal, 2020; Duong et al., 2021; Gupta et al., 2021; Sullivan et al., 2021; Su et al., 2020; Garcia and Berton, 2021; Ahmed et al., 2020).

In particular, topic modeling aims at finding hidden semantic structures within texts and modeling them into concepts. The unsupervised clustering technique LDA (Latent Dirichlet Allocation), developed by Blei et al. (2003), has been used extensively in analyses conducted on social media data during the pandemic (Dashtian and Murthy, 2021; Feng and Zhou, 2020; Ordun et al., 2020; Wang et al., 2020; Kabir and Madria, 2020; Amara et al., 2020; Abd-Alrazaq et al., 2020; Naseem et al., 2021; Low et al., 2020; Andreadis et al., 2021). LDA is a statistical model that represents each document in a corpus as a probabilistic distribution over latent topics, and each topic as a probabilistic distribution over words. A topic has a probability of generating each of the words observed in the corpus; the terms in the set of documents are thus used to discover hidden topics in a large corpus.

As is well known, the language of the web deviates from the standard language in ways that challenge the use of NLP tools. Several classifications have been proposed to label the nature of web and social media language. In general, the labels aim to define a variety of language that is diaphasically low and at an indefinite point on the diamesic axis, e.g., "netspeak" (Crystal, 2001). Web and social media language is characterized by little planning in text structure and a greater propensity for parataxis, absence of revision and punctuation, abrupt interruption of periods, and an imitation of the continuous flow of speech (Fiorentino, 2013). Although some persistent traits of web and social media language can be described, it does not constitute a single variety of language from a sociolinguistic perspective (Fiorentino, 2013). This poses a double challenge for the use of NLP tools: first, the tools are calibrated on standard language resources; secondly, even models better suited to web and social media language would not be generalizable to every language variety found on the web (Sanguinetti et al., 2018).

3 ConteCorpus Construction

3.1 Data Collection

We downloaded 4,860,395 comments and 534 posts published during 2020 on Giuseppe Conte's official Facebook profile. For each 2020 post ID of the page, we made a call to the Facebook Graph API[4] to retrieve the text, object id, and creation time of its comments; the calls were made month by month in the same fashion. As Table 1 shows, a larger amount of data was retrieved in the months of March, April, and October; in the same periods, the most restrictive measures to fight the pandemic were taken by the Italian government.

3.2 Processing with the Neural Pipeline Stanza

After the data collection, we processed the data with the neural pipeline Stanza[5] to enrich the texts with annotations. Stanza is an open-source Python NLP toolkit which "features a language-agnostic fully neural pipeline for text analysis, including tokenization, multiword token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity" (Qi et al., 2020). The toolkit supports more than 77 human languages and uses the Universal Dependencies formalism.[6] Knowing the difficulties of annotating non-standard texts such as those derived from social media, we chose this pipeline because the evaluation of its models found that Stanza's neural language-agnostic architecture "adapts well to text of different genres [...] achieving state-of-the-art or competitive performance at each step of the pipeline" (Qi et al., 2020).

Moreover, the models that can be downloaded from Stanza have each been trained on a single language and on a dataset of a specific text genre. We chose to download the model trained on PoSTWITA-UD,[7] an Italian Twitter treebank in Universal Dependencies (Sanguinetti et al., 2018). Although the language of social media is very peculiar and changes from one social medium to another and from group to group (Fiorentino, 2013), we expected the model downloadable from Stanza, trained on this dataset, to be generalizable to our data, being in-domain. Moreover, Sanguinetti et al. (2018) added customized tags to the UD scheme to deal with some phenomena peculiar to social media: "discourse:emo" for emojis and emoticons, and "parataxis:hashtag" for hashtags. They tagged the links found in some sentences as "dep" (unspecified relation) and used the "upos" (universal part-of-speech) tag "SYM" (symbol) for hashtags and emojis. Additionally, they manually inserted the lemma of non-standard word forms not recognized by the lemmatizer (Sanguinetti et al., 2018).

We processed the data divided into 12 packages, each corresponding to one month of data. We used every processor of the pipeline except the Named Entity Recognition module (TokenizeProcessor, POSProcessor, LemmaProcessor, DepparseProcessor). We configured the model not to split the sentences,[8] forcing the TokenizeProcessor to consider each comment as a single sentence. Furthermore, we added two metadata fields to each sentence: one refers to the id of the post from which the comment was retrieved, and the other is the creation time of the comment. The aim is to make it easier to retrieve comments from the corpus by creation time or post id if one needs to analyze a particular period of time or a particular post.

4 End-to-End Evaluation

4.1 Construction of the Gold Standard

We built a gold standard with a dual purpose: to evaluate the performance of the model on this new collection of social media texts, and to create a standard that can be used for future training and testing. We randomly selected 83 sentences from each file of the automatically annotated corpus (one file contains one month of comments) and manually revised the 1,000 sentences collected. The manual revision followed the principle that what is understandable by a human is correct.

4.2 Evaluation with the CoNLL 2018 UD Shared Task Official Evaluation Script

To perform the evaluation, we used the CoNLL 2018 UD shared task official evaluation script.[9] Table 2 shows the scores of the evaluation metrics resulting from the performance of the Stanza model on the test set of ConteCorpus. Table 3 compares the scores of the evaluation metrics resulting from the performance of the Stanza model on the test set of PoSTWITA-UD and on that of ConteCorpus. The first two columns are the scores on the metrics that evaluate segmentation. The row called UPOS shows the resulting scores on the universal part-of-speech tagging metric, XPOS on the language-specific part-of-speech tagging metric, and UFeats on the morphological features tagging metric. The last 5 rows show scores on five different parsing metrics. The evaluation of the parser starts by aligning system nodes and gold nodes; their respective parent nodes are also considered: if the system parent is not aligned with the gold parent, or if the relation label differs, the word is not counted as correctly attached.

Table 2. Performance of Stanza's UD pre-trained model tested on the test set of ConteCorpus.

Table 3. Performance of Stanza's UD pre-trained model tested on the official test set of PoSTWITA-UD and on the test set of ConteCorpus. The scores shown are calculated using the F-measure.

What we found most challenging during the manual revision of the 1,000 automatically annotated sentences was correcting errors in tokenization: many words that the tokenizer should have split were joined together. This type of tokenization error is often found when punctuation is used with a non-standard function. For example, we found the token "oneste…volevo" ("honest…I wanted to"), an adjective, a punctuation mark and a verb, conflated into a single token. In the manual revision, tokens like this were split into three different tokens and the missing tags were added. The presence of such conflated words may have caused a worse score on the metric that evaluates segmentation, and consequently on the other scores. Although errors in segmentation seem frequent in the corpus, this did not cause an excessive lowering of the scores on the various metrics reported in Tables 2 and 3. Another frequent error regards the lemma assigned to abbreviations that are not present in PoSTWITA-UD. Canonical abbreviations are tagged correctly, for example "cmq" for "comunque" ("however"); the abbreviations tagged incorrectly are those that appear only a few times, such as "ql", which stands for "quelli" ("those").

An unexpectedly good result was achieved on the parsing metrics. This could be due to the "preference of UD scheme in assigning headedness to content words" (Sanguinetti et al., 2018): the tendency of social media language to eliminate function words therefore does not affect the performance of the parser. Another explanation can be found in the very similar frequency distributions of parts of speech and syntactic relations in the training set and the gold standard, as shown in Figures 1 and 2.

Figure 1. Frequency distribution of syntactic relation tags in the training set and the gold standard.

Figure 2. Frequency distribution of part-of-speech tags in the training set and the gold standard.

Overall, the model trained on PoSTWITA-UD turned out to perform well on the test set of ConteCorpus because the PoSTWITA-UD tagset has been adapted with attention to some recurrent features of social media language. Our evaluation showed that a model trained on texts retrieved from one social medium can adapt well to texts from another if one pays attention to the neural architecture of the model and the annotation format being used.

5 Topic Modeling

To provide an overall view of the content of this large corpus, we performed topic modeling, training and testing an LDA model on ConteCorpus.

5.1 Methodology

To perform topic modeling, we sampled 10% of the sentences in our dataset and trained an LDA model, treating each sentence as a document. We pre-processed the lemmas by removing stopwords, using the Italian stopword list from the NLTK (Natural Language Toolkit) library[10] and manually adding missing stopwords. We filtered out tokens that appear in fewer than 15 documents and tokens with fewer than three letters; additionally, we kept only the 100,000 most frequent words. We transformed the documents into vectors, creating a bag-of-words representation of each document. Then we applied term frequency-inverse document frequency (TF-IDF) weighting to the whole corpus to assign higher weights to the most important words. The Gensim LDA model[11] was applied first to the bags of words and then to the TF-IDF corpus to extract latent topics; better performance was achieved with the LDA model applied to the bags of words.

We determined the optimal number of topics using the Coherence Value metric.[12] The underlying idea is that a good model will generate topics with a high topic Coherence Value score. We ran different LDA experiments varying the number of topics and selected the model with the highest mean topic Coherence Value score. Our final model generated 5 topics and has a mean topic Coherence Value score of 0.5. Table 4 illustrates the ten most representative terms associated with each detected topic.

Topic 1 (Economics): pagare (to pay), soldo (money), italia (italy), euro, chiudere (to close), mese (month), debito (debt), azienda (company), prestito (loan), fondo (capital)
Topic 2 (Prime Minister): presidente (prime minister), grazie (thank you), Conte, lavoro (work), bravo, italia (italy), italiano (italian), signore (sir), giuseppe, caro (dear)
Topic 3 (Politics): italiano (italian), europa (europe), italia (italy), paese (country), banca (bank), popolo (people), governo (government), chiedere (to ask), germania (germany), storia (story)
Topic 4 (Pandemic): uscire (to go out), miliardo (billion), firmare (to sign), virus, decreto (decree), Salvini, maria, pandemia (pandemic), chiedere (to ask), italy
Topic 5 (Home): sperare (to hope), casa (home), aspettare (to wait), perdere (to lose), impresa (business), tedesco (german), subito (immediately), tempo (time), lavorare (to work), stipendio (salary)

Table 4. Topics generated by the LDA model, with the ten most frequent terms of each (English glosses in parentheses).

Figure 3. Intertopic distance map and top-30 most relevant terms for Topic 1. For a better view visit: https://sites.google.com/view/ldavisualizationcontecorpus/home-page.

5.2 Results

As expected, all the topics extracted from the corpus are related to concerns about the emergency. The focus is on the economic aspect of the emergency. The ten most frequent words in the Economics topic (Table 4 and Figure 3) are economic terms: "loan", "company", "to pay", "money", etc. In all the other topics, at least one of the ten most frequent words comes from the economic sphere. Among the ten most frequent words of each topic, there are only two words regarding the pandemic itself, both found in the Pandemic topic: "virus" and "pandemic". It is no coincidence that the most frequent word in this topic is "to go out". The need to face the emergency through the intervention of the institutions is evident. This is shown especially by the Prime Minister and Politics topics (Table 4). The most frequent words of the Prime Minister topic are related to the Prime Minister; words like "bravo", "thank you" and "dear" perhaps show a positive judgement towards him. In the Politics topic one finds words of the institutional sphere such as "country", "government", "people", and "bank". The Home topic is related to the private sphere, with words like "to hope", "home", "to wait", and "to lose", although there is no shortage of words from the economic sphere.

In Figure 3, the distance between the centres of the circles indicates the similarity between the topics. One can see that only the Economics topic and the Prime Minister topic overlap, which indicates that these two topics are more similar to each other than to the other topics. Moreover, the size of the area of each circle represents the importance of the topic relative to the corpus: the Economics topic is the most important topic in the corpus.

Finally, we tested our model on unseen documents: the comments published between 15 February and 30 March 2020, before and after the announcement of the first period of restrictions to combat the pandemic, and between 1 October and 14 November 2020, before and after the announcement of the second period of restrictions. Figures 4, 5 and 6 show the trends in topics over time; each line represents a topic, and the x-axis shows the time progression. On 23 February, the first restrictive policies were announced for some Italian cities: Figure 5 shows a peak in the Pandemic topic on that day. A peak in the Economics topic occurred on 18 March: in those days, discussions were taking place on whether to ask the European Union for financial aid to overcome the pandemic. Figure 4 shows how the prevalence of the five topics changed on 8-12 March 2020: there is a peak on 9 March in the Prime Minister topic, the day on which Conte announced the first period of national restrictions to combat the pandemic. Overall, the prevalent topics in those days are Economics and Pandemic. On 13 October, after a summer without major restrictions and with a new exponential increase in the contagion curve, the Italian Parliament passed a decree limiting the possibility of aggregation: on that day we see a new peak in the Pandemic topic (Figure 6). In the days that followed, the prevailing topic is Economics: on 28 October, the "ristoro" decree was approved to financially support commercial activities. The prevailing topics are therefore usually related to current events.

Figure 4. Prevalence of topics during the days 8-12 March 2020.

Figure 5. Prevalence of topics during the days 15 February-30 March 2020.

Figure 6. Prevalence of topics during the days 1 October-15 November 2020.

6 Concluding Observations

As mentioned above, models trained on data from social media are hardly generalizable. This stems from the fact that, from a sociolinguistic perspective, the language of social media does not constitute a single variety. We therefore expected the results on the various evaluation metrics to be worse than the results of the evaluation conducted on the PoSTWITA-UD test set. Surprisingly, on some metrics the results on the ConteCorpus test set were better than the results on the PoSTWITA-UD test set. To offer an overall view of the content of ConteCorpus we performed topic modeling. The topics generated by the LDA model cover various aspects of the pandemic emergency, with a preponderance of political and economic issues. Unexpectedly, the topics identified do not show concern regarding the risk of contagion and the possibility of catching the disease.

[4] https://developers.facebook.com/docs/graph-api?locale=it_IT
[5] https://stanfordnlp.github.io/stanza/
[6] Universal Dependencies (UD) is a "framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages" (https://universaldependencies.org/).
[7] https://universaldependencies.org/treebanks/it_postwita/index.html
[8] Sentence segmentation and tokenization are jointly performed by the TokenizeProcessor (Qi et al., 2020).
[9] https://universaldependencies.org/conll18/evaluation.html
[10] https://www.nltk.org/
[11] https://radimrehurek.com/gensim/models/ldamodel.html
[12] The Coherence Value metric was developed by Röder et al. (2015). It evaluates a single topic by measuring the degree of semantic similarity between the high-scoring words in the topic.
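The collection step of Section 3.1 (one Graph API request per post, following the API's cursor-based pagination, repeated month by month) can be sketched as follows. This is a minimal illustration, not the original script: the function name `fetch_comments`, the API version in the URL, and the injectable `get_json` fetcher are our own choices.

```python
import json
import urllib.parse
import urllib.request

GRAPH = "https://graph.facebook.com/v9.0"  # API version is an assumption


def _get_json(url):
    """Default fetcher: GET a URL and decode the JSON response."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def fetch_comments(post_id, token, get_json=_get_json):
    """Yield (id, message, created_time) for every comment on a post,
    following the Graph API's cursor-based pagination."""
    params = urllib.parse.urlencode({
        "fields": "id,message,created_time",
        "limit": 100,
        "access_token": token,
    })
    url = f"{GRAPH}/{post_id}/comments?{params}"
    while url:
        page = get_json(url)
        for c in page.get("data", []):
            yield c["id"], c.get("message", ""), c["created_time"]
        # `paging.next` is a full URL embedding the cursor; absent on the last page
        url = page.get("paging", {}).get("next")
```

Injecting the fetcher keeps the pagination logic testable offline; in production the default `_get_json` issues the real HTTP requests with a valid page access token.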
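The processing step of Section 3.2 (a Stanza pipeline without NER, one comment per sentence, plus two metadata comments per sentence) might look like the sketch below. The helper names `annotate` and `conllu_sentence`, and the exact metadata comment syntax (`# post_id = …`, `# created_time = …`), are our assumptions; the `postwita` package name and the `tokenize_no_ssplit` option come from the Stanza documentation.

```python
def conllu_sentence(tokens, post_id, created_time):
    """Render one comment (a list of token dicts in the shape produced by
    Stanza's Document.to_dict()) as a CoNLL-U sentence, prefixed with the
    two metadata comments linking it to its post and creation time."""
    lines = [f"# post_id = {post_id}", f"# created_time = {created_time}"]
    for t in tokens:
        # The ten CoNLL-U columns; "_" for any field the annotator left empty.
        lines.append("\t".join(str(t.get(k, "_")) for k in (
            "id", "text", "lemma", "upos", "xpos", "feats",
            "head", "deprel", "deps", "misc")))
    return "\n".join(lines) + "\n"


def annotate(comments):
    """comments: iterable of (post_id, created_time, text) triples.
    Yields one CoNLL-U sentence block per comment."""
    import stanza  # pip install stanza; stanza.download("it", package="postwita")
    # NER is skipped; tokenize_no_ssplit=True keeps each comment as one sentence.
    nlp = stanza.Pipeline(lang="it", package="postwita",
                          processors="tokenize,pos,lemma,depparse",
                          tokenize_no_ssplit=True)
    for post_id, created, text in comments:
        doc = nlp(text)
        for sent in doc.to_dict():
            yield conllu_sentence(sent, post_id, created)
```

Keeping the serializer separate from the pipeline makes it easy to regenerate the corpus files month by month, as the paper describes.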
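The LDA methodology of Section 5.1 (stopword and short-token filtering, dictionary pruning at 15 documents and 100,000 types, bag-of-words vectors, and c_v coherence for model selection) can be sketched with gensim as below. `clean` and `train_lda` are our names, and the number of training `passes` is an assumption the paper does not state.

```python
def clean(lemmas, stopwords):
    """Drop stopwords and tokens shorter than three letters (Section 5.1).
    The stopword set would combine NLTK's Italian list with manual additions."""
    return [w for w in lemmas if w not in stopwords and len(w) >= 3]


def train_lda(docs, num_topics=5):
    """docs: one list of cleaned lemmas per comment (each comment = a document).
    Returns the fitted model and its mean c_v topic coherence."""
    from gensim.corpora import Dictionary  # pip install gensim
    from gensim.models import CoherenceModel, LdaModel

    dictionary = Dictionary(docs)
    # As in the paper: drop tokens occurring in fewer than 15 documents and
    # keep only the 100,000 most frequent words (no upper-frequency cut).
    dictionary.filter_extremes(no_below=15, no_above=1.0, keep_n=100_000)
    bow = [dictionary.doc2bow(doc) for doc in docs]
    lda = LdaModel(bow, id2word=dictionary, num_topics=num_topics, passes=10)
    coherence = CoherenceModel(model=lda, texts=docs,
                               dictionary=dictionary, coherence="c_v")
    return lda, coherence.get_coherence()
```

Model selection as described in the paper then amounts to calling `train_lda` for several values of `num_topics` and keeping the model with the highest coherence score.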
References

Abd-Alrazaq, A., Alhuwail, D., Househ, M., Hamdi, M., & Shah, Z. (2020). Top concerns of tweeters during the COVID-19 pandemic: infoveillance study. Journal of Medical Internet Research, 22(4), e19016.

Ahmed, M. E., Rabin, M. R. I., & Chowdhury, F. N. (2020). COVID-19: Social media sentiment analysis on reopening. arXiv preprint arXiv:2006.00804.

Amara, A., Taieb, M. A. H., & Aouicha, M. B. (2021). Multilingual topic modeling for tracking COVID-19 trends based on Facebook data analysis. Applied Intelligence, 51(5), 3052-3073.

Andreadis, S., Antzoulatos, G., Mavropoulos, T., Giannakeris, P., Tzionis, G., Pantelidis, N., ... & Kompatsiaris, I. (2021). A social media analytics platform visualising the spread of COVID-19 in Italy via exploitation of automatically geotagged tweets. Online Social Networks and Media, 23, 100134.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.

Chakraborty, K., Bhatia, S., Bhattacharyya, S., Platos, J., Bag, R., & Hassanien, A. E. (2020). Sentiment Analysis of COVID-19 tweets by Deep Learning Classifiers: A study to show how popularity is affecting accuracy in social media. Applied Soft Computing, 97, 106754.

Crystal, D. (2001). Language and the Internet. Cambridge University Press.

Dashtian, H. and Murthy, D. (2021). CML-COVID: A large-scale COVID-19 Twitter dataset with latent topics, sentiment and location information. arXiv preprint arXiv:2101.12202.

De Santis, E., Martino, A., & Rizzi, A. (2020). An Infoveillance System for Detecting and Tracking Relevant Topics from Italian Tweets During the COVID-19 Event. IEEE Access, 8, 132527-132538.

Dozat, T. and Manning, C. D. (2017). Deep biaffine attention for neural dependency parsing. Proceedings of the 2017 International Conference on Learning Representations (ICLR).

Duong, V., Luo, J., Pham, P., Yang, T., & Wang, Y. (2020). The ivory tower lost: How college students respond differently than the general public to the COVID-19 pandemic. IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) (pp. 126-130).

Feng, Y. and Zhou, W. (2020). Is working from home the new norm? An observational study based on a large geo-tagged COVID-19 Twitter dataset. arXiv preprint arXiv:2006.08581.

Fiorentino, G. (2013). "Wild language" goes Web: new writers and old problems in the elaboration of the written code. In E. Miola (Ed.), Languages Go Web. Standard and non-standard languages on the Internet (pp. 67-90). Alessandria: Edizioni dell'Orso.

Gagliardi, G., Gregori, L., & Suozzi, A. (2021). L'impatto emotivo della comunicazione istituzionale durante la pandemia di Covid-19: uno studio di Twitter Sentiment Analysis. Proceedings of the Seventh Italian Conference on Computational Linguistics, CLiC-it 2020, Bologna, Italy. Volume 2769 of CEUR Workshop Proceedings.

Garcia, K. and Berton, L. (2021). Topic detection and sentiment analysis in Twitter content related to COVID-19 from Brazil and the USA. Applied Soft Computing, 101, 107057.

Gozzi, N., Tizzani, M., Starnini, M., Ciulla, F., Paolotti, D., Panisson, A., & Perra, N. (2020). Collective Response to Media Coverage of the COVID-19 Pandemic on Reddit and Wikipedia: Mixed-Methods Analysis. Journal of Medical Internet Research, 22(10), e21597.

Gupta, V., Jain, N., Katariya, P., Kumar, A., Mohan, S., Ahmadian, A., & Ferrara, M. (2021). An emotion care model using multimodal textual analysis on COVID-19. Chaos, Solitons & Fractals, 144, 110708.

Hussain, A., Tahir, A., Hussain, Z., Sheikh, Z., Gogate, M., Dashtipour, K., et al. (2021). Artificial Intelligence-Enabled Analysis of Public Attitudes on Facebook and Twitter Toward COVID-19 Vaccines in the United Kingdom and the United States: Observational Study. Journal of Medical Internet Research, 23(4), e26627.

Jelodar, H., Wang, Y., Orji, R., & Huang, S. (2020). Deep sentiment classification and topic discovery on novel coronavirus or COVID-19 online discussions: NLP using LSTM recurrent neural network approach. IEEE Journal of Biomedical and Health Informatics, 24(10), 2733-2742.

Kruspe, A., Häberle, M., Kuhn, I., & Zhu, X. X. (2020). Cross-language sentiment analysis of European Twitter messages during the COVID-19 pandemic. arXiv preprint arXiv:2008.12172.

Lamsal, R. (2020). Design and analysis of a large-scale COVID-19 tweets dataset. Applied Intelligence, 1-15.

Lomborg, S. and Bechmann, A. (2014). Using APIs for data collection on social media. The Information Society, 30(4), 256-265.

Low, D. M., Rumker, L., Talkar, T., Torous, J., Cecchi, G., & Ghosh, S. S. (2020). Natural Language Processing Reveals Vulnerable Mental Health Support Groups and Heightened Health Anxiety on Reddit During COVID-19: Observational Study. Journal of Medical Internet Research, 22(10), e22635.

Naseem, U., Razzak, I., Khushi, M., Eklund, P. W., & Kim, J. (2021). COVIDSenti: A large-scale benchmark Twitter data set for COVID-19 sentiment analysis. IEEE Transactions on Computational Social Systems.

Nemes, L. and Kiss, A. (2021). Social media sentiment analysis based on COVID-19. Journal of Information and Telecommunication, 5(1), 1-15.

Ordun, C., Purushotham, S., & Raff, E. (2020). Exploratory analysis of COVID-19 tweets using topic modeling, UMAP, and digraphs. arXiv preprint arXiv:2005.03082.

Qi, P., Zhang, Y., Zhang, Y., Bolton, J., & Manning, C. D. (2020). Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. Association for Computational Linguistics (ACL) System Demonstrations.

Röder, M., Both, A., & Hinneburg, A. (2015). Exploring the space of topic coherence measures. Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (pp. 399-408).

Sanguinetti, M., Bosco, C., Lavelli, A., Mazzei, A., Antonelli, O., & Tamburini, F. (2018). PoSTWITA-UD: an Italian Twitter Treebank in Universal Dependencies. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Sciandra, A. (2020). COVID-19 Outbreak through Tweeters' Words: Monitoring Italian Social Media Communication about COVID-19 with Text Mining and Word Embeddings. 2020 IEEE Symposium on Computers and Communications (ISCC) (pp. 1-6). IEEE.

Stella, M., Restocchi, V., & De Deyne, S. (2020). #lockdown: Network-enhanced emotional profiling in the time of COVID-19. Big Data and Cognitive Computing, 4(2), 14.

Stella, M. (2020). Cognitive network science reconstructs how experts, news outlets and social media perceived the COVID-19 pandemic. Systems, 8(4), 38.

Stella, M., Vitevitch, M. S., & Botta, F. (2021). Cognitive networks identify the content of English and Italian popular posts about COVID-19 vaccines: Anticipation, logistics, conspiracy and loss of trust. arXiv preprint arXiv:2103.15909.

Su, Y., Xue, J., Liu, X., Wu, P., Chen, J., Chen, C., et al. (2020). Examining the impact of COVID-19 lockdown in Wuhan and Lombardy: A psycholinguistic analysis on Weibo and Twitter. International Journal of Environmental Research and Public Health, 17(12), 4552.

Sullivan, K. J., Burden, M., Keniston, A., Banda, J. M., & Hunter, L. E. (2020). Characterization of Anonymous Physician Perspectives on COVID-19 Using Social Media Data. Pacific Symposium on Biocomputing.

Tamburini, F. (2020). EmoItaly. http://corpora.ficlit.unibo.it/EmoItaly/.

Trevisan, M., Vassio, L., & Giordano, D. (2021). Debate on online social networks at the time of COVID-19: An Italian case study. Online Social Networks and Media, 23, 100136.

Vitale, P., Pelosi, S., & Falco, M. (2020). #andràtuttobene: Images, Texts, Emojis and Geodata in a Sentiment Analysis Pipeline. Proceedings of the Seventh Italian Conference on Computational Linguistics, CLiC-it 2020, Bologna, Italy. Volume 2769 of CEUR Workshop Proceedings. http://ceur-ws.org/Vol-2769/paper_62.pdf.

Wang, J., Zhou, Y., Zhang, W., Evans, R., & Zhu, C. (2020). Concerns Expressed by Chinese Social Media Users During the COVID-19 Pandemic: Content Analysis of Sina Weibo Microblogging Data. Journal of Medical Internet Research, 22(11), e22152.