Long-term Social Media Data Collection at the University of Turin

Valerio Basile (University of Turin, basile@di.unito.it)
Mirko Lai (University of Turin, mirko.lai@unito.it)
Manuela Sanguinetti (University of Turin, msanguin@di.unito.it)

Abstract

We report on the collection of social media messages in the Italian language, from Twitter in particular, which has been ongoing continuously since 2012 at the University of Turin. A number of smaller datasets have been extracted from the main collection and enriched with different kinds of annotations for linguistic purposes. Moreover, a few extra datasets have been collected independently and are now in the process of being merged with the main collection. We aim to make the resource available to the community to the best of our ability, in accordance with the Terms of Service of the platforms from which the data were gathered.

(Italian abstract, translated) In this article we describe the work of collecting messages in the Italian language, from Twitter in particular, which has been carried on continuously since 2012 at the University of Turin. Several datasets have been extracted from the main collection and enriched with different types of annotation for linguistic purposes. In addition, further datasets were collected independently and are now part of the main collection. Our aim is to make this resource available to the community as completely as possible, given the terms of use imposed by the platforms from which the data were extracted.

1 Introduction

The online micro-blogging platform Twitter (https://twitter.com/) has been a popular source of natural language data since the second half of the 2010s, due to the enormous quantity of public messages exchanged by its users and the relative ease of collecting them through the official API.

Many researchers have implemented systems to collect large datasets of tweets and share them with the community. Among them, the Content-centered Computing group at the University of Turin (http://beta.di.unito.it/index.php/english/research/groups/content-centered-computing/people) maintains a large, diversified collection of datasets of tweets in the Italian language. Some of the datasets included in this report, and their annotation methodology, are described in Sanguinetti et al. (2014). Although the Twitter datasets in Italian make up the majority of our collection, over the years, and also in the recent past, several resources have been created in other languages and with data retrieved from sources other than Twitter.

In this paper, we report on the current status of the collection (Section 2) and give an overview of several annotated datasets included in it (Section 3). Finally, we describe our current and future plans to make the data and annotations available to the research community (Section 4).

2 TWITA: Long-term Collection of Italian Tweets

The current effort to collect tweets in the Italian language started in 2012 at the University of Groningen (Basile and Nissim, 2013). Taking inspiration from the large collection of Dutch tweets by Tjong Kim Sang and van den Bosch (2013), Basile and Nissim (2013) implemented a pipeline to collect and automatically annotate a large set of tweets in Italian by leveraging the Twitter API. The process queries the streaming API with a set of keywords designed to capture the Italian language while excluding other languages. At the time of its publication, the resource contained about 100 million tweets in Italian, collected in the first year (from February 2012 to February 2013). The automatic collection, however, continued, and in 2015 it was transferred from the University of Groningen to the University of Turin.
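The keyword-based capture used by the original pipeline can be illustrated offline. The following is only a minimal sketch, not the authors' implementation: the two keyword lists, the function names and the `min_hits` threshold are hypothetical, since the actual keywords are not listed in this paper. It mimics the idea of accepting a message when it contains words typical of Italian and rejecting it when words typical of other languages prevail.

```python
import re

# Hypothetical keyword lists, chosen for illustration only.
ITALIAN_KEYWORDS = {"che", "non", "sono", "perché", "anche", "questo", "della", "gli"}
OTHER_LANGUAGE_MARKERS = {"the", "and", "que", "pero", "você", "não", "les", "und"}

# Runs of letters (accented letters included), without digits or underscores.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokens(text):
    """Lowercased word tokens of a message."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def looks_italian(text, min_hits=2):
    """Accept a message if it contains at least `min_hits` Italian keywords
    and more Italian keywords than markers of other languages."""
    toks = tokens(text)
    hits = sum(t in ITALIAN_KEYWORDS for t in toks)
    misses = sum(t in OTHER_LANGUAGE_MARKERS for t in toks)
    return hits >= min_hits and hits > misses
```

In a live setting the first list would play the role of the terms sent to the streaming endpoint, while the second would act as a post-filter on the received messages; both choices are assumptions made for this example.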
Since June 2018, a new filter based on the five Italian vowels has been added to the pipeline, along with the language filter provided by the Twitter API, which was not previously available, in order to limit the number of accidentally captured tweets in other languages. In the latest version of the data collection pipeline, a Python script employing the tweepy library (http://www.tweepy.org/) gathers JSON tweets using the following filter: track=["a","e","i","o","u"] and languages=["it"]. We store the raw, complete JSON tweet structures in zipped files for backup. In parallel, we store the text and the most useful metadata (username, timestamp, geolocalization, retweet and reply status) in a relational database in order to perform efficient queries.

At the time of this writing, the collection comprises more than 500 million tweets in the Italian language, spanning 7 years (57 months) from February 2012 to July 2018. There are a few holes in the collection, sometimes spanning entire months, due to incidents involving the server infrastructure or changes in the Twitter API which required manual adjustment of the collection software. Figure 1 shows the percentage of days in each month for which the collection has data, at the time of this writing.

Figure 1: Percentage of days in each month for which tweets are available.

3 Annotated Datasets

In the past years, the TWITA collection has been made available to many research teams interested in the study of social media in the Italian language with computational methods. Several such studies focused on creating new linguistic resources starting from the raw tweets and basic metadata provided by TWITA, including a number of datasets created for shared tasks of computational linguistics. In this section, we give an overview of such resources. Moreover, some datasets were created independently from TWITA and are now managed under the same infrastructure; therefore we include them in this report.

For each dataset, we provide a summary infobox with basic information, including the type of annotation performed on the dataset and how it was achieved, i.e., by means of expert annotators or a crowdsourcing platform.

3.1 Datasets From TWITA

The datasets described in this section are subsets of the main TWITA dataset, obtained by sampling the collection according to different criteria, and annotated for several purposes.

TWitterBuonaScuola (Stranisci et al., 2016) is a corpus of Italian tweets on the topic of the national educational and training systems. The tweets were extracted using a specific hashtag (#labuonascuola, the nickname of an education reform, translating to "the good school") and a set of related keywords: "la buona scuola" (the good school), "buona scuola" (good school), "riforma scuola" (school reform), "riforma istruzione" (education reform).

Name: TWitterBuonaScuola
Size: 35,148 total tweets, 7,049 annotated tweets
Time period: February 22, 2014–December 31, 2014
Annotation: polarity, irony and topic
Annotation method: crowdsourcing
URL: http://twita.dipinfo.di.unito.it/tw-bs

TW-SWELLFER (Sulis et al., 2016) is a corpus of Italian tweets on subjective well-being, in particular regarding the topics of fertility and parenthood. The tweets were collected by searching for 11 hashtags, namely #papa (father), #mamma (mother), #babbo (dad), #incinta (pregnant), #primofiglio (first child), #secondofiglio (second child), #futuremamme (future moms), #maternita (motherhood), #paternità (fatherhood), #allattamento (nursing) and #gravidanza (pregnancy), and for 19 related keywords.

Name: TW-SWELLFER
Size: 2,760,416 total tweets, 1,508 annotated tweets
Time period: 2014
Annotation: polarity, irony and sub-topic
Annotation method: crowdsourcing
URL: http://twita.dipinfo.di.unito.it/tw-swellfer
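The storage and sampling steps described in Sections 2 and 3.1 can be sketched with a small relational store. This is a minimal sketch under stated assumptions, not the actual TWITA setup: the table layout, the column names, the ISO 8601 timestamp strings and the helper names `create_store`, `add_tweet` and `sample_by_terms` are all hypothetical, and SQLite merely stands in for whichever relational database is actually used.

```python
import sqlite3

# Hypothetical schema: the text plus the metadata fields named in Section 2.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tweets (
    id         TEXT PRIMARY KEY,
    username   TEXT,
    created_at TEXT,
    geo        TEXT,
    is_retweet INTEGER,
    is_reply   INTEGER,
    text       TEXT
)
"""


def create_store(path=":memory:"):
    """Open (or create) the store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_tweet(conn, tweet):
    """Insert one tweet given as a dict; duplicates (same id) are ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            tweet["id"],
            tweet["username"],
            tweet["created_at"],  # assumed ISO 8601, so strings sort by time
            tweet.get("geo"),
            int(tweet.get("is_retweet", False)),
            int(tweet.get("is_reply", False)),
            tweet["text"],
        ),
    )


def sample_by_terms(conn, terms, start, end):
    """Return ids of tweets containing any term, with start <= created_at < end.

    SQLite's LIKE is case-insensitive for ASCII characters by default.
    """
    if not terms:
        return []
    where = " OR ".join("text LIKE ?" for _ in terms)
    params = [f"%{t}%" for t in terms] + [start, end]
    sql = (
        f"SELECT id FROM tweets WHERE ({where}) "
        "AND created_at >= ? AND created_at < ? ORDER BY created_at"
    )
    return [row[0] for row in conn.execute(sql, params)]
```

With such a store, a subset comparable to TWitterBuonaScuola would be requested as `sample_by_terms(conn, ["#labuonascuola", "riforma scuola"], "2014-02-22", "2015-01-01")`. Substring matching is a simplification: it does not respect word boundaries, and the characters `%` and `_` inside a term would act as wildcards.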
Italian Hate Speech Corpus (Sanguinetti et al., 2018b; Poletto et al., 2017) is a corpus of hate speech on social media towards migrants and ethnic minorities, built in the context of the Hate Speech Monitoring Program of the University of Turin (http://hatespeech.di.unito.it/). The tweets were collected according to a set of keywords: invadere (invade), invasione (invasion), basta (enough), fuori (out), comunist* (communist*), african* (African), barcon* (migrants boat*).

Name: Italian Hate Speech Corpus
Size: 236,193 total tweets, 6,965 annotated tweets
Time period: October 1st, 2016–April 25th, 2017
Annotation: hate speech, aggressiveness, offensiveness, stereotype, irony, intensity
Annotation method: crowdsourcing and experts
URL: http://twita.dipinfo.di.unito.it/ihsc

TWITTIRÒ (Cignarella et al., 2017) is a dataset of tweets overlapping with other datasets included in the University of Turin collection, on which a finer-grained annotation of irony is superimposed. The TWITTIRÒ tweets are taken from TWitterBuonaScuola, SENTIPOLC (see Section 3.2), and TWSpino (see Section 3.3).

Name: TWITTIRÒ
Size: 1,600 total tweets: 400 from TWSpino, 600 from SENTIPOLC, 600 from TWitterBuonaScuola
Time period: 2012–2016
Annotation: fine-grained irony
Annotation method: experts
URL: http://twita.dipinfo.di.unito.it/twittiro

3.2 Shared Task Datasets

The large collection of Italian tweets of the University of Turin has been used on several occasions to extract datasets for organizing shared tasks for the Italian community, in particular under the umbrella of the EVALITA evaluation campaign (http://www.evalita.it/). In this section, we describe such datasets.

SENTIPOLC. The SENTIment POLarity Classification task was proposed in two editions of the EVALITA campaign, namely in 2014 (Basile et al., 2014) and 2016 (Barbieri et al., 2016). Both editions were organized into three different sub-tasks: subjectivity classification, polarity classification, and irony detection. The data for SENTIPOLC 2014 were gathered from TWITA and Senti-TUT (see Section 3.3), while for the 2016 edition the dataset was further expanded by including other data sources, such as TWitterBuonaScuola (see Section 3.1) and a subset of TWITA overlapping with the dataset used for the shared task on Named Entity Recognition and Linking in Italian Tweets (Basile et al., 2016, NEEL-it).

Name: SENTIPOLC
Size: 6,448 tweets (SENTIPOLC 2014), 9,410 tweets (SENTIPOLC 2016)
Time period: 2012 (SENTIPOLC 2014), 2014 (SENTIPOLC 2016)
Annotation: subjectivity, polarity, irony
Annotation method: experts (SENTIPOLC 2014), crowdsourcing and experts (SENTIPOLC 2016)
URL: http://twita.dipinfo.di.unito.it/sentipolc

PoSTWITA (Bosco et al., 2016b) is the shared task on Part-of-Speech tagging of Twitter posts held at EVALITA 2016. Its content was extracted from the SENTIPOLC corpus described above. The PoSTWITA dataset consists of Italian tweets tokenized and annotated at PoS level with a tagset inspired by the Universal Dependencies scheme (http://universaldependencies.org/).

Name: PoSTWITA
Size: 6,738 tweets
Time period: 2012
Annotation: part of speech
Annotation method: experts
URL: http://twita.dipinfo.di.unito.it/postwita

After the task took place, the PoSTWITA corpus was used in a new independent project on the development of a Twitter-based Italian treebank fully compliant with Universal Dependencies, thus becoming PoSTWITA-UD (Sanguinetti et al., 2018a). In particular, the first core of the resource was automatically annotated by out-of-domain parsing experiments using different parsers. The output with the best results was then revised by two annotators for the final version of the resource. PoSTWITA-UD has been available in the official UD repository (https://github.com/UniversalDependencies/UD_Italian-PoSTWITA) since the v2.1 release.

Name: PoSTWITA-UD
Size: 6,712 tweets
Time period: 2012
Annotation: dependency-based syntactic annotation
Annotation method: experts
URL: http://twita.dipinfo.di.unito.it/postwita-ud

IronITA. The irony detection task proposed for EVALITA 2018 (http://www.di.unito.it/~tutreeb/ironita-evalita18) consists in automatically classifying tweets according to the presence of irony (sub-task A) and sarcasm (sub-task B). Given the array of situations and topics where ironic or sarcastic devices can be used, the corpus has been created by resorting to multiple annotated sources, such as the already mentioned TWITTIRÒ, SENTIPOLC, and the Italian Hate Speech Corpus.

Name: IronITA
Size: 4,877 tweets
Time period: 2012–2016
Annotation: irony, sarcasm
Annotation method: crowdsourcing and experts
URL: http://twita.dipinfo.di.unito.it/ironita

HaSpeeDe. The Hate Speech Detection task (http://www.di.unito.it/~tutreeb/haspeede-evalita18) at EVALITA 2018 consists in automatically annotating messages from Twitter and Facebook. The dataset proposed for the task is the result of a joint effort of two research groups to harmonize the annotation previously applied to two different datasets: the first is a collection of Facebook comments developed by the group from CNR-Pisa and created in 2016 (Del Vigna et al., 2017), while the other is a subset of the Italian Hate Speech Corpus (described in Section 3.1). The annotation scheme has thus been simplified, and it only includes a binary value indicating whether hateful content is present or not in a given tweet or Facebook comment. The task organizers created such a harmonized scheme also in view of a cross-domain evaluation, with one dataset used for training and the other for testing the system. It is worth pointing out, however, that despite their joint use in the task, the resources are maintained separately; thus only the Twitter section of the dataset is part of TWITA.

Name: HaSpeeDe
Size: 4,000 tweets and 4,000 Facebook comments
Time period: 2016–2017 for the Twitter dataset, May 2016 for the Facebook dataset
Annotation: hate speech
Annotation method: crowdsourcing and experts for the Twitter dataset, experts for the Facebook dataset
URL: http://twita.dipinfo.di.unito.it/haspeede

3.3 Independently-collected Datasets

To complete the overview of the social media datasets, in this section we describe collections of tweets that have been compiled independently from TWITA. However, they are now hosted in the same infrastructure and therefore can be considered part of the same collection.

Senti-TUT (Bosco et al., 2013) is a dataset of Italian tweets with a focus on politics and irony. Senti-TUT includes two corpora: TWNews contains tweets retrieved by querying the Twitter search API with a series of hashtags related to Mario Monti (the Italian Prime Minister at the time); TWSpino contains tweets from Spinoza (http://www.spinoza.it), a popular satirical Italian blog on politics.

Name: Senti-TUT
Size: 3,288 tweets (TWNews), 1,159 tweets (TWSpino)
Time period: October 16th, 2011–February 3rd, 2012 (TWNews), July 2009–February 2012 (TWSpino)
Annotation: polarity, irony
Annotation method: experts
URL: http://twita.dipinfo.di.unito.it/senti-tut

Felicittà (Allisio et al., 2013) was a project on the development of a platform that aimed to estimate and interactively display the degree of happiness in Italian cities, based on the analysis of data from Twitter. For its evaluation, a gold corpus was created by Bosco et al. (2014), using the same annotation scheme provided for Senti-TUT.

Name: Felicittà
Size: 1,500 tweets
Time period: November 1st, 2013–July 7th, 2014
Annotation: polarity, irony
Annotation method: experts
URL: http://twita.dipinfo.di.unito.it/felicitta

ConRef-STANCE-ita (Lai et al., 2018) is a collection of tweets on the topic of the Referendum held in Italy on December 4, 2016, about a reform of the Italian Constitution. This is supposedly a highly controversial topic, chosen to highlight language features useful for the study of stance detection. The tweets were collected by searching for specific hashtags: #referendumcostituzionale (constitutional referendum), #iovotosi (I vote yes), #iovotono (I vote no). Subsequently, the collection was enriched by recovering the conversation chain from each retrieved tweet to its source, and by annotating triplets consisting of one tweet, one retweet, and one reply posted by the same user in a specific temporal window. The aim of the collection is to monitor the evolution of the stance of 248 users during the debate across four different temporal windows, and also to inspect their social network.

Name: ConRef-STANCE-ita
Size: 2,976 tweets (963 triplets)
Time period: November 24th, 2016–December 7th, 2016
Annotation: stance
Annotation method: crowdsourcing and experts
URL: http://twita.dipinfo.di.unito.it/conref-stance-ita
3.4 Work in Progress and Other Datasets

Finally, there are a number of additional datasets hosted in our infrastructure that are being actively developed at the time of this writing. These include a collection of geo-localized tweets on the 2016 edition of the "Giro d'Italia" cycling competition, a dataset of tweets concerning the 2016 local elections in 10 major Italian cities, and an addendum to the ConRef-STANCE-ita dataset described in Section 3.3.

Furthermore, we limited this report to the datasets of tweets in the Italian language, which make up the majority of our collection. However, we curate several datasets in other languages, often as a result of collaborations with international research teams and projects, such as, for instance, TwitterMariagePourTous (Bosco et al., 2016a), a corpus of 2,872 French tweets on the topic of same-sex marriage, extracted in the period from December 16th, 2010 to July 20th, 2013. In addition, several new corpora have been developed within the Hate Speech Monitoring program (see Section 3.1), aiming at studying the hate speech phenomenon against different targets such as women and the LGBTQ community, and resorting to data sources other than Twitter (Facebook and online newspapers in particular). Although such resources are still under construction, and it is therefore not possible to provide any corpus statistics yet, our goal is to include them in our resource infrastructure, thus taking a step forward and improving it also in terms of diversity of data sources.

4 Data Availability

The main goal of collecting and organizing datasets such as the ones described in this paper is, generally speaking, to provide the NLP research community with powerful tools to enhance the state of the art of language technologies. Therefore, our default policy is to share as much data as possible, as freely as possible. Twitter has proven to behave cooperatively towards the scientific community, relaxing over time the limits imposed on data sharing for non-commercial use (https://developer.twitter.com/en/developer-terms/agreement-and-policy.html). However, there are considerations about the privacy of the users that must be accounted for in releasing Twitter data. In particular, the EU General Data Protection Regulation from 2018 (GDPR, https://gdpr-info.eu/) strictly regulates data and user privacy. For instance, if a tweet has been deleted by a user, it should not be published in other forms (Article 17), although it can still be used for scientific purposes.

Technically, we follow these considerations by implementing an interface to download the IDs of the tweets in our collection, and tools to retrieve the original tweets (if still available). The annotated datasets can instead be shared in their entirety, given their limited size; thus we provide links to download them in tabular format. Finally, we are developing interactive interfaces to select and download samples of the collection based on the time period and on sets of keywords and hashtags.

Acknowledgments

Valerio Basile and Manuela Sanguinetti are partially supported by Progetto di Ateneo/CSP 2016 (Immigrants, Hate and Prejudice in Social Media, S1618 L2 BOSC 01).

Mirko Lai is partially supported by the Italian Ministry of Labor (Contro l'odio: tecnologie informatiche, percorsi formativi e story telling partecipativo per combattere l'intolleranza, avviso n.1/2017 per il finanziamento di iniziative e progetti di rilevanza nazionale ai sensi dell'art. 72 del d.l. 3 luglio 2017, n. 117 - anno 2017).

References

Leonardo Allisio, Valeria Mussa, Cristina Bosco, Viviana Patti, and Giancarlo Ruffo. 2013. Felicittà: Visualizing and estimating happiness in Italian cities from geotagged tweets. In Proceedings of the First International Workshop on Emotion and Sentiment in Social and Expressive Media: approaches and perspectives from AI (ESSEM 2013), pages 95–106, Turin, Italy.

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the Evalita 2016 SENTIment POLarity Classification Task. In Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Naples, Italy.

Valerio Basile and Malvina Nissim. 2013. Sentiment analysis on Italian tweets. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 100–107.

Valerio Basile, Andrea Bolioli, Malvina Nissim, Viviana Patti, and Paolo Rosso. 2014. Overview of the Evalita 2014 SENTIment POLarity Classification Task. In Proceedings of the Fourth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2014), Pisa, Italy.

Pierpaolo Basile, Annalina Caputo, Anna Lisa Gentile, and Giuseppe Rizzo. 2016. Overview of the EVALITA 2016 Named Entity rEcognition and Linking in Italian tweets (NEEL-IT) task. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & the Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Naples, Italy.

Cristina Bosco, Viviana Patti, and Andrea Bolioli. 2013. Developing corpora for sentiment analysis: The case of irony and Senti-TUT. IEEE Intelligent Systems, 28(2):55–63.

Cristina Bosco, Leonardo Allisio, Valeria Mussa, Viviana Patti, Giancarlo Ruffo, Manuela Sanguinetti, and Emilio Sulis. 2014. Detecting happiness in Italian tweets: Towards an evaluation dataset for sentiment analysis in Felicittà. In Proceedings of the 5th International Workshop on EMOTION, SOCIAL SIGNALS, SENTIMENT & LINKED OPEN DATA, pages 56–63.

Cristina Bosco, Mirko Lai, Viviana Patti, and Daniela Virone. 2016a. Tweeting and being ironic in the debate about a political reform: the French annotated corpus Twitter-MariagePourTous. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia.

Cristina Bosco, Fabio Tamburini, Andrea Bolioli, and Alessandro Mazzei. 2016b. Overview of the EVALITA 2016 Part Of Speech on TWitter for ITAlian task. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & the Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Naples, Italy.

Alessandra Teresa Cignarella, Cristina Bosco, and Viviana Patti. 2017. Twittirò: a social media corpus with a multi-layered annotation for irony. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017), Rome, Italy.

Fabio Del Vigna, Andrea Cimino, Felice Dell'Orletta, Marinella Petrocchi, and Maurizio Tesconi. 2017. Hate me, hate me not: Hate speech detection on Facebook. In Proceedings of the First Italian Conference on Cybersecurity (ITASEC17), pages 86–95, Venice, Italy.

Mirko Lai, Viviana Patti, Giancarlo Ruffo, and Paolo Rosso. 2018. Stance evolution and Twitter interactions in an Italian political debate. In NLDB, volume 10859 of Lecture Notes in Computer Science, pages 15–27. Springer.

Fabio Poletto, Marco Stranisci, Manuela Sanguinetti, Viviana Patti, and Cristina Bosco. 2017. Hate speech annotation: Analysis of an Italian Twitter corpus. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017), Rome, Italy.

Manuela Sanguinetti, Emilio Sulis, Viviana Patti, Giancarlo Ruffo, Leonardo Allisio, Valeria Mussa, and Cristina Bosco. 2014. Developing corpora and tools for sentiment analysis: the experience of the University of Turin group. In First Italian Conference on Computational Linguistics (CLiC-it 2014), pages 322–327, Pisa, Italy.

Manuela Sanguinetti, Cristina Bosco, Alberto Lavelli, Alessandro Mazzei, Oronzo Antonelli, and Fabio Tamburini. 2018a. PoSTWITA-UD: an Italian Twitter treebank in Universal Dependencies. In Proceedings of the 11th Language Resources and Evaluation Conference (LREC 2018), pages 1768–1775, Miyazaki, Japan.

Manuela Sanguinetti, Fabio Poletto, Cristina Bosco, Viviana Patti, and Marco Stranisci. 2018b. An Italian Twitter Corpus of Hate Speech against Immigrants. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Marco Stranisci, Cristina Bosco, Delia Irazú Hernández Farías, and Viviana Patti. 2016. Annotating sentiment and irony in the online Italian political debate on #labuonascuola. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, May. European Language Resources Association (ELRA).

Emilio Sulis, Cristina Bosco, Viviana Patti, Mirko Lai, Delia Irazú Hernández Farías, Letizia Mencarini, Michele Mozzachiodi, and Daniele Vignoli. 2016. Subjective well-being and social media. A semantically annotated Twitter corpus on fertility and parenthood. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & the Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Naples, Italy.

E. Tjong Kim Sang and A. van den Bosch. 2013. Dealing with big data: The case of Twitter. Computational Linguistics in the Netherlands Journal, 3(12/2013):121–134.