Long-term Social Media Data Collection at the University of Turin

        Valerio Basile                             Mirko Lai                     Manuela Sanguinetti
       University of Turin                      University of Turin               University of Turin
    basile@di.unito.it                       mirko.lai@unito.it               msanguin@di.unito.it


Abstract

We report on the collection of social media messages in the Italian language, from Twitter in particular, that has been ongoing since 2012 at the University of Turin. A number of smaller datasets have been extracted from the main collection and enriched with different kinds of annotations for linguistic purposes. Moreover, a few extra datasets have been collected independently and are now in the process of being merged with the main collection. We aim at making the resource available to the community to the best of our ability, in accordance with the Terms of Service of the platforms from which the data were gathered.

(Italian abstract, translated) In this article we describe the work of collecting messages in the Italian language, from Twitter in particular, which has been carried out continuously since 2012 at the University of Turin. Several datasets have been extracted from the main collection and enriched with different kinds of annotation for linguistic purposes. In addition, further datasets have been collected independently and are now part of the main collection. Our aim is to make this resource available to the community as completely as possible, given the terms of use imposed by the platforms from which the data were extracted.

1 Introduction

The online micro-blogging platform Twitter[1] has been a popular source for natural language data since the second half of the 2010s, due to the enormous quantity of public messages exchanged by its users, and the relative ease of collecting them through the official API.

[1] https://twitter.com/

Many researchers have implemented systems to collect large datasets of tweets and share them with the community. Among them, the Content-centered Computing group at the University of Turin[2] maintains a large, diversified collection of datasets of tweets in the Italian language[3]. Although the Twitter datasets in Italian make up the majority of our collection, over the years, including the recent past, several resources have been created in other languages and with data retrieved from sources other than Twitter.

[2] http://beta.di.unito.it/index.php/english/research/groups/content-centered-computing/people
[3] Some of the datasets included in this report and their methodology of annotation are described in Sanguinetti et al. (2014).

In this paper, we report on the current status of the collection (Section 2) and give an overview of several annotated datasets included in it (Section 3). Finally, we describe our current and future plans to make the data and annotations available to the research community (Section 4).

2 TWITA: Long-term Collection of Italian Tweets

The current effort to collect tweets in the Italian language started in 2012 at the University of Groningen (Basile and Nissim, 2013). Taking inspiration from the large collection of Dutch tweets by Tjong Kim Sang and van den Bosch (2013), Basile and Nissim (2013) implemented a pipeline to collect and automatically annotate a large set of tweets in Italian by leveraging the Twitter API. The process queries the streaming API with a set of keywords designed to capture the Italian language while excluding other languages. At the time of its publication, the resource contained about 100 million tweets in Italian collected in the first year (from February 2012 to February
2013). The automatic collection, however, continued, and in 2015 it was transferred from the University of Groningen to the University of Turin. Since June 2018, a new filter based on the five Italian vowels has been added to the pipeline, along with the language filter provided by the Twitter API, which was not previously available, in order to limit the number of accidentally captured tweets in other languages. In the latest version of the data collection pipeline, a Python script employing the tweepy library[4] gathers JSON tweets using the following filter: track=["a","e","i","o","u"] and languages=["it"]. We store the raw, complete JSON tweet structures in zipped files for backup. In addition, we store the text and the most useful metadata (username, timestamp, geolocalization, retweet and reply status) in a relational database in order to perform efficient queries.

[4] http://www.tweepy.org/

At the time of this writing, the collection comprises more than 500 million tweets in the Italian language, spanning 7 years (57 months) from February 2012 to July 2018. There are a few gaps in the collection, sometimes spanning entire months, due to incidents involving the server infrastructure or changes in the Twitter API which required manual adjustment of the collection software. Figure 1 shows the percentage of days in each month for which the collection has data, at the time of this writing.

Figure 1: Percentage of days in each month for which tweets are available.

3 Annotated Datasets

In the past years, the TWITA collection has been made available to many research teams interested in the study of social media in the Italian language with computational methods. Several such studies focused on creating new linguistic resources starting from the raw tweets and basic metadata provided by TWITA, including a number of datasets created for shared tasks in computational linguistics. In this section, we give an overview of such resources. Moreover, some datasets were created independently from TWITA and are now managed under the same infrastructure; therefore, we include them in this report.

For each dataset, we provide a summary infobox with basic information, including the type of annotation performed on the dataset and how it was achieved, i.e., by means of expert annotators or a crowdsourcing platform.

3.1 Datasets From TWITA

The datasets described in this section are subsets of the main TWITA dataset, obtained by sampling the collection according to different criteria, and annotated for several purposes.

TWitterBuonaScuola (Stranisci et al., 2016) is a corpus of Italian tweets on the topic of the national educational and training systems. The tweets were extracted using a specific hashtag (#labuonascuola, the nickname of an education reform, translating to the good school) and a set of related keywords: "la buona scuola" (the good school), "buona scuola" (good school), "riforma scuola" (school reform), "riforma istruzione" (education reform).

Name: TWitterBuonaScuola
Size: 35,148 total tweets, 7,049 annotated tweets
Time period: February 22, 2014–December 31, 2014
Annotation: polarity, irony and topic
Annotation method: crowdsourcing
URL: http://twita.dipinfo.di.unito.it/tw-bs

TW-SWELLFER (Sulis et al., 2016) is a corpus of Italian tweets on subjective well-being, in particular regarding the topics of fertility and parenthood. The tweets were collected by searching for 11 hashtags, namely #papa (father), #mamma (mother), #babbo (dad), #incinta (pregnant), #primofiglio (first child), #secondofiglio (second child), #futuremamme (future moms), #maternita (motherhood), #paternità (fatherhood), #allattamento (nursing), #gravidanza (pregnancy), and 19 related keywords.

Name: TW-SWELLFER
Size: 2,760,416 total tweets, 1,508 annotated tweets
Time period: 2014
Annotation: polarity, irony and sub-topic
Annotation method: crowdsourcing
URL: http://twita.dipinfo.di.unito.it/tw-swellfer

Italian Hate Speech Corpus (Sanguinetti et al., 2018b; Poletto et al., 2017) is a corpus of hate speech on social media towards migrants and ethnic minorities, in the context of the Hate Speech Monitoring Program of the University of Turin[5]. The tweets were collected according to a set of keywords: invadere (invade), invasione (invasion), basta (enough), fuori (out), comunist* (communist*), african* (African), barcon* (migrant boat*).

Name: Italian Hate Speech Corpus
Size: 236,193 total tweets, 6,965 annotated tweets
Time period: October 1st, 2016–April 25th, 2017
Annotation: hate speech, aggressiveness, offensiveness, stereotype, irony, intensity
Annotation method: crowdsourcing and experts
URL: http://twita.dipinfo.di.unito.it/ihsc

[5] http://hatespeech.di.unito.it/

TWITTIRÒ (Cignarella et al., 2017) is a dataset of tweets overlapping with other datasets included in the University of Turin collection, on which a finer-grained annotation of irony is superimposed. The TWITTIRÒ tweets are taken from TWitterBuonaScuola, SENTIPOLC (see Section 3.2), and TWSpino (see Section 3.3).

Name: TWITTIRÒ
Size: 1,600 total tweets: 400 tweets from TWSpino, 600 tweets from SENTIPOLC, 600 tweets from TWitterBuonaScuola
Time period: 2012–2016
Annotation: fine-grained irony
Annotation method: experts
URL: http://twita.dipinfo.di.unito.it/twittiro

3.2 Shared Task Datasets

The large collection of Italian tweets of the University of Turin has been used on several occasions to extract datasets for shared tasks organized for the Italian community, in particular under the umbrella of the EVALITA evaluation campaign[6]. In this section, we describe such datasets.

[6] http://www.evalita.it/

SENTIPOLC The SENTIment POLarity Classification task was proposed in two editions of the EVALITA campaign, namely in 2014 (Basile et al., 2014) and 2016 (Barbieri et al., 2016). Both editions were organized into three different sub-tasks: subjectivity and polarity classification, and irony detection. The data for SENTIPOLC 2014 were gathered from TWITA and Senti-TUT (see Section 3.3), while for the 2016 edition the dataset was further expanded by including other data sources, such as TWitterBuonaScuola (see Section 3.1) and a subset of TWITA overlapping with the dataset used for the shared task on Named Entity Recognition and Linking in Italian Tweets (Basile et al., 2016, NEEL-it).

Name: SENTIPOLC
Size: 6,448 (SENTIPOLC 2014), 9,410 (SENTIPOLC 2016) tweets
Time period: 2012 (SENTIPOLC 2014), 2014 (SENTIPOLC 2016)
Annotation: subjectivity, polarity, irony
Annotation method: experts (SENTIPOLC 2014), crowdsourcing and experts (SENTIPOLC 2016)
URL: http://twita.dipinfo.di.unito.it/sentipolc

PoSTWITA (Bosco et al., 2016b) is the shared task on Part-of-Speech tagging of Twitter posts held at EVALITA 2016. Its content was extracted from the SENTIPOLC corpus described above. The PoSTWITA dataset consists of Italian tweets tokenized and annotated at PoS level with a tagset inspired by the Universal Dependencies scheme[7].

Name: PoSTWITA
Size: 6,738 tweets
Time period: 2012
Annotation: part of speech
Annotation method: experts
URL: http://twita.dipinfo.di.unito.it/postwita

[7] http://universaldependencies.org/

After the task took place, the PoSTWITA corpus was used in a new independent project on the development of a Twitter-based Italian treebank fully compliant with Universal Dependencies, thus becoming PoSTWITA-UD (Sanguinetti et al., 2018a). In particular, the first core of the resource was automatically annotated by out-of-domain parsing experiments using different parsers. The output with the best results was then revised by two annotators for the final version of the resource. PoSTWITA-UD has been available in the official UD repository[8] since the v2.1 release.

Name: PoSTWITA-UD
Size: 6,712 tweets
Time period: 2012
Annotation: dependency-based syntactic annotation
Annotation method: experts
URL: http://twita.dipinfo.di.unito.it/postwita-ud

[8] https://github.com/UniversalDependencies/UD_Italian-PoSTWITA
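Treebanks in the UD repository are distributed in the CoNLL-U format, where each token is a line of ten tab-separated fields and sentences are separated by blank lines. The following is a minimal sketch of how a file of this kind could be read, assuming plain CoNLL-U input; the function name read_conllu and the two-token sample sentence are invented for illustration and are not taken from PoSTWITA-UD.

```python
from typing import Dict, Iterator, List

# The ten CoNLL-U columns, in the order defined by Universal Dependencies.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Sentence-level comments (e.g. sent_id, text) are skipped here.
            continue
        else:
            values = line.split("\t")
            if len(values) == len(FIELDS):
                sentence.append(dict(zip(FIELDS, values)))
    if sentence:
        yield sentence


# Invented example, NOT taken from PoSTWITA-UD.
SAMPLE = (
    "# text = Ciao mondo\n"
    "1\tCiao\tciao\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "2\tmondo\tmondo\tNOUN\t_\t_\t1\tvocative\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE))
upos_tags = [tok["upos"] for tok in sentences[0]]
```

Keeping every field as a string avoids special-casing multiword-token ranges and empty nodes, whose IDs are not plain integers; a fuller reader would handle those explicitly.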
IronITA The irony detection task proposed for EVALITA 2018[9] consists of automatically classifying tweets according to the presence of irony (sub-task A) and sarcasm (sub-task B). Given the array of situations and topics where ironic or sarcastic devices can be used, the corpus was created by drawing on multiple annotated sources, such as the already mentioned TWITTIRÒ, SENTIPOLC, and the Italian Hate Speech Corpus.

Name: IronITA
Size: 4,877 tweets
Time period: 2012–2016
Annotation: irony, sarcasm
Annotation method: crowdsourcing and experts
URL: http://twita.dipinfo.di.unito.it/ironita

[9] http://www.di.unito.it/~tutreeb/ironita-evalita18

HaSpeeDe The Hate Speech Detection task[10] at EVALITA 2018 consists of automatically annotating messages from Twitter and Facebook. The dataset proposed for the task is the result of a joint effort of two research groups to harmonize the annotation previously applied to two different datasets: the first is a collection of Facebook comments developed by the group from CNR-Pisa and created in 2016 (Del Vigna et al., 2017), while the other is a subset of the Italian Hate Speech Corpus (described in Section 3.1). The annotation scheme has thus been simplified, and it only includes a binary value indicating whether hateful content is present or not in a given tweet or Facebook comment. The task organizers created this harmonized scheme also in view of a cross-domain evaluation, with one dataset used for training and the other for testing the system.

It is worth pointing out, however, that despite their joint use in the task, the resources are maintained separately; thus only the Twitter section of the dataset is part of TWITA.

Name: HaSpeeDe
Size: 4,000 tweets and 4,000 Facebook comments
Time period: 2016–2017 for the Twitter dataset, May 2016 for the Facebook dataset
Annotation: hate speech
Annotation method: crowdsourcing and experts for the Twitter dataset, experts for the Facebook dataset
URL: http://twita.dipinfo.di.unito.it/haspeede

[10] http://www.di.unito.it/~tutreeb/haspeede-evalita18

3.3 Independently-collected Datasets

To complete the overview of the social media datasets, in this section we describe collections of tweets that have been compiled independently from TWITA. However, they are now hosted in the same infrastructure and therefore can be considered part of the same collection.

Senti-TUT (Bosco et al., 2013) is a dataset of Italian tweets with a focus on politics and irony. Senti-TUT includes two corpora: TWNews contains tweets retrieved by querying the Twitter search API with a series of hashtags related to Mario Monti (the Italian Prime Minister at the time); TWSpino contains tweets from Spinoza[11], a popular satirical Italian blog on politics.

Name: Senti-TUT
Size: 3,288 (TWNews), 1,159 tweets (TWSpino)
Time period: October 16th, 2011–February 3rd, 2012 (TWNews), July 2009–February 2012 (TWSpino)
Annotation: polarity, irony
Annotation method: experts
URL: http://twita.dipinfo.di.unito.it/senti-tut

[11] http://www.spinoza.it

Felicittà (Allisio et al., 2013) was a project on the development of a platform that aimed to estimate and interactively display the degree of happiness in Italian cities, based on the analysis of data from Twitter. For its evaluation, a gold corpus was created by Bosco et al. (2014), using the same annotation scheme provided for Senti-TUT.

Name: Felicittà
Size: 1,500 tweets
Time period: November 1st, 2013–July 7th, 2014
Annotation: polarity, irony
Annotation method: experts
URL: http://twita.dipinfo.di.unito.it/felicitta

ConRef-STANCE-ita (Lai et al., 2018) is a collection of tweets on the topic of the referendum held in Italy on December 4, 2016, about a reform of the Italian Constitution. This is supposedly a highly controversial topic, chosen to highlight language features useful for the study of stance detection. The tweets were collected by searching for specific hashtags: #referendumcostituzionale (constitutional referendum), #iovotosi (I vote yes), #iovotono (I vote no). Subsequently, the collection was enriched by recovering the conversation chain from each retrieved tweet to its source, annotating triplets consisting of one tweet, one retweet, and one reply posted by the same user in a specific temporal window. The aim of the collection is to monitor the evolution of the stance of 248 users during the debate in four different temporal windows and also to inspect their social network.
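The triplet construction used for ConRef-STANCE-ita can be illustrated with a minimal sketch. It assumes one triplet per user and temporal window and keeps the earliest message of each kind; these choices, the record fields (id, user, kind, created_at), the function name build_triplets, and the window boundaries in the example are all assumptions made for illustration, not details reported for the corpus.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Tuple

Window = Tuple[datetime, datetime]
KINDS = ("tweet", "retweet", "reply")


def build_triplets(messages: List[dict], windows: List[Window]) -> List[dict]:
    """Return one triplet per (user, window) that has all three message kinds."""
    # Group messages by (user, window index) and then by kind.
    grouped: Dict[Tuple[str, int], Dict[str, List[dict]]] = defaultdict(
        lambda: defaultdict(list)
    )
    for msg in messages:
        for idx, (start, end) in enumerate(windows):
            if start <= msg["created_at"] < end:
                grouped[(msg["user"], idx)][msg["kind"]].append(msg)
                break

    triplets = []
    for (user, idx), by_kind in sorted(grouped.items()):
        if all(by_kind[kind] for kind in KINDS):
            # Keep the earliest message of each kind in the window.
            triplet = {
                kind: min(by_kind[kind], key=lambda m: m["created_at"])["id"]
                for kind in KINDS
            }
            triplets.append({"user": user, "window": idx, **triplet})
    return triplets


# Hypothetical windows and messages, invented for this example.
windows = [
    (datetime(2016, 11, 24), datetime(2016, 11, 27)),
    (datetime(2016, 11, 27), datetime(2016, 11, 30)),
]
messages = [
    {"id": 1, "user": "u1", "kind": "tweet", "created_at": datetime(2016, 11, 24, 9)},
    {"id": 2, "user": "u1", "kind": "retweet", "created_at": datetime(2016, 11, 25, 9)},
    {"id": 3, "user": "u1", "kind": "reply", "created_at": datetime(2016, 11, 26, 9)},
    {"id": 4, "user": "u2", "kind": "tweet", "created_at": datetime(2016, 11, 28, 9)},
]
triplets = build_triplets(messages, windows)
```

In this example only user u1 yields a triplet, because u2 has no retweet or reply in any window.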
Name: ConRef-STANCE-ita
Size: 2,976 tweets (963 triplets)
Time period: November 24th, 2016–December 7th, 2016
Annotation: stance
Annotation method: crowdsourcing and experts
URL: http://twita.dipinfo.di.unito.it/conref-stance-ita

3.4 Work in Progress and Other Datasets

Finally, there are a number of additional datasets hosted in our infrastructure that are being actively developed at the time of this writing. These include a collection of geo-localized tweets on the 2016 edition of the "Giro d'Italia" cycling competition, a dataset of tweets concerning the 2016 local elections in 10 major Italian cities, and an addendum to the ConRef-STANCE-ita dataset described in Section 3.3.

Furthermore, we limited this report to the datasets of tweets in the Italian language, which make up the majority of our collection. However, we curate several datasets in other languages, often as a result of collaborations with international research teams and projects, such as, for instance, TwitterMariagePourTous (Bosco et al., 2016a), a corpus of 2,872 French tweets extracted in the period from 16th December 2010 to 20th July 2013 on the topic of same-sex marriage. In addition, several new corpora have been developed within the Hate Speech Monitoring program (see Section 3.1), aiming at studying the hate speech phenomenon against different targets such as women and the LGBTQ community, and drawing on data sources other than Twitter (Facebook and online newspapers in particular). Although such resources are still under construction, and it is therefore not possible to provide any corpus statistics yet, our goal is to include them in our resource infrastructure, thus making a step forward and ensuring its improvement also in terms of diversity of data sources.

4 Data Availability

The main goal of collecting and organizing datasets such as the ones described in this paper is, generally speaking, to provide the NLP research community with powerful tools to enhance the state of the art of language technologies. Therefore, our default policy is to share as much data as possible, as freely as possible. Twitter has proven to behave cooperatively towards the scientific community, relaxing the limits imposed on data sharing for non-commercial use over time[12]. However, there are considerations about the privacy of the users that must be accounted for in releasing Twitter data. In particular, the EU General Data Protection Regulation from 2018 (GDPR)[13] strictly regulates data and user privacy. For instance, if a tweet has been deleted by a user, it should not be published in other forms (Article 17), although it can still be used for scientific purposes.

[12] https://developer.twitter.com/en/developer-terms/agreement-and-policy.html
[13] https://gdpr-info.eu/

Technically, we follow these considerations by implementing an interface to download the IDs of the tweets in our collection, and tools to retrieve the original tweets (if still available). The annotated datasets can instead be shared in their entirety, given their limited size; thus we provide links to download them in tabular format. Finally, we are developing interactive interfaces to select and download samples of the collection based on the time period and sets of keywords and hashtags.

Acknowledgments

Valerio Basile and Manuela Sanguinetti are partially supported by Progetto di Ateneo/CSP 2016 (Immigrants, Hate and Prejudice in Social Media, S1618 L2 BOSC 01).

Mirko Lai is partially supported by Italian Ministry of Labor (Contro l'odio: tecnologie informatiche, percorsi formativi e story telling partecipativo per combattere l'intolleranza, avviso n.1/2017 per il finanziamento di iniziative e progetti di rilevanza nazionale ai sensi dell'art. 72 del d.l. 3 luglio 2017, n. 117 - anno 2017).

References

Leonardo Allisio, Valeria Mussa, Cristina Bosco, Viviana Patti, and Giancarlo Ruffo. 2013. Felicittà: Visualizing and estimating happiness in Italian cities from geotagged tweets. In Proceedings of the First International Workshop on Emotion and Sentiment in Social and Expressive Media: approaches and perspectives from AI (ESSEM 2013), pages 95–106, Turin, Italy.

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the Evalita 2016 SENTIment POLarity Classification Task. In Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Naples, Italy.
Valerio Basile and Malvina Nissim. 2013. Sentiment analysis on Italian tweets. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 100–107.

Valerio Basile, Andrea Bolioli, Malvina Nissim, Viviana Patti, and Paolo Rosso. 2014. Overview of the Evalita 2014 SENTIment POLarity Classification Task. In Proceedings of the Fourth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2014), Pisa, Italy.

Pierpaolo Basile, Annalina Caputo, Anna Lisa Gentile, and Giuseppe Rizzo. 2016. Overview of the EVALITA 2016 Named Entity rEcognition and Linking in Italian tweets (NEEL-IT) task. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & the Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Naples, Italy.

Cristina Bosco, Viviana Patti, and Andrea Bolioli. 2013. Developing corpora for sentiment analysis: The case of irony and Senti-TUT. IEEE Intelligent Systems, 28(2):55–63.

Cristina Bosco, Leonardo Allisio, Valeria Mussa, Viviana Patti, Giancarlo Ruffo, Manuela Sanguinetti, and Emilio Sulis. 2014. Detecting happiness in Italian tweets: Towards an evaluation dataset for sentiment analysis in Felicittà. In Proceedings of the 5th International Workshop on EMOTION, SOCIAL SIGNALS, SENTIMENT & LINKED OPEN DATA, pages 56–63.

Cristina Bosco, Mirko Lai, Viviana Patti, and Daniela Virone. 2016a. Tweeting and being ironic in the debate about a political reform: the French annotated corpus Twitter-MariagePourTous. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia.

Cristina Bosco, Fabio Tamburini, Andrea Bolioli, and Alessandro Mazzei. 2016b. Overview of the EVALITA 2016 Part Of Speech on TWitter for ITAlian task. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & the Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Naples, Italy.

Alessandra Teresa Cignarella, Cristina Bosco, and Viviana Patti. 2017. Twittirò: a social media corpus with a multi-layered annotation for irony. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017), Rome, Italy.

Fabio Del Vigna, Andrea Cimino, Felice Dell'Orletta, Marinella Petrocchi, and Maurizio Tesconi. 2017. Hate me, hate me not: Hate speech detection on Facebook. In Proceedings of the First Italian Conference on Cybersecurity (ITASEC17), pages 86–95, Venice, Italy.

Mirko Lai, Viviana Patti, Giancarlo Ruffo, and Paolo Rosso. 2018. Stance evolution and Twitter interactions in an Italian political debate. In NLDB, volume 10859 of Lecture Notes in Computer Science, pages 15–27. Springer.

Fabio Poletto, Marco Stranisci, Manuela Sanguinetti, Viviana Patti, and Cristina Bosco. 2017. Hate speech annotation: Analysis of an Italian Twitter corpus. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017), Rome, Italy.

Manuela Sanguinetti, Emilio Sulis, Viviana Patti, Giancarlo Ruffo, Leonardo Allisio, Valeria Mussa, and Cristina Bosco. 2014. Developing corpora and tools for sentiment analysis: the experience of the University of Turin group. In First Italian Conference on Computational Linguistics (CLiC-it 2014), pages 322–327, Pisa, Italy.

Manuela Sanguinetti, Cristina Bosco, Alberto Lavelli, Alessandro Mazzei, Oronzo Antonelli, and Fabio Tamburini. 2018a. PoSTWITA-UD: an Italian Twitter treebank in Universal Dependencies. In Proceedings of the 11th Language Resources and Evaluation Conference (LREC 2018), pages 1768–1775, Miyazaki, Japan.

Manuela Sanguinetti, Fabio Poletto, Cristina Bosco, Viviana Patti, and Marco Stranisci. 2018b. An Italian Twitter Corpus of Hate Speech against Immigrants. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Marco Stranisci, Cristina Bosco, Delia Irazú Hernández Farías, and Viviana Patti. 2016. Annotating sentiment and irony in the online Italian political debate on #labuonascuola. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, May. European Language Resources Association (ELRA).

Emilio Sulis, Cristina Bosco, Viviana Patti, Mirko Lai, Delia Irazú Hernández Farías, Letizia Mencarini, Michele Mozzachiodi, and Daniele Vignoli. 2016. Subjective well-being and social media. A semantically annotated Twitter corpus on fertility and parenthood. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & the Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Naples, Italy.

E. Tjong Kim Sang and A. van den Bosch. 2013. Dealing with big data: The case of Twitter. Computational Linguistics in the Netherlands Journal, 3(12/2013):121–134. Reporting year: 2013.