=Paper= {{Paper |id=Vol-2765/159 |storemode=property |title=SardiStance @ EVALITA2020: Overview of the Task on Stance Detection in Italian Tweets |pdfUrl=https://ceur-ws.org/Vol-2765/paper159.pdf |volume=Vol-2765 |authors=Alessandra Teresa Cignarella,Mirko Lai,Cristina Bosco,Viviana Patti,Paolo Rosso |dblpUrl=https://dblp.org/rec/conf/evalita/CignarellaLBPR20 }} ==SardiStance @ EVALITA2020: Overview of the Task on Stance Detection in Italian Tweets== https://ceur-ws.org/Vol-2765/paper159.pdf
                            SardiStance @ EVALITA2020:
               Overview of the Task on Stance Detection in Italian Tweets

Alessandra Teresa Cignarella1,2 , Mirko Lai1 , Cristina Bosco1 , Viviana Patti1 and Paolo Rosso2
             1. Dipartimento di Informatica, Università degli Studi di Torino, Italy
             2. PRHLT Research Center, Universitat Politècnica de València, Spain
                 {lai,cigna,bosco,patti}@di.unito.it, prosso@dsic.upv.es



                          Abstract

English. SardiStance is the first shared task for Italian on the automatic classification of stance in tweets. It is articulated in two different settings: A) Textual Stance Detection, exploiting only the information provided by the tweet, and B) Contextual Stance Detection, with the addition of information about the tweet itself, such as the number of retweets, the number of favours or the date of posting; contextual information about the author, such as follower count, location and user's biography; and additional knowledge extracted from the user's network of friends, followers, retweets, quotes and replies. The task has been one of the most participated in at EVALITA 2020 (Basile et al., 2020), with a total of 22 submitted runs for Task A, 13 for Task B, and 12 different participating teams from both academia and industry.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction/Motivation

The interest in detecting people's opinions towards particular targets, and in monitoring politically polarized debates on Twitter, has grown steadily in recent years, as attested by the proliferation of online questionnaires and polls (Küçük and Can, 2020). Indeed, through the constant monitoring of people's opinions, desires, complaints and beliefs about the political agenda or public services, policy makers could better meet the population's needs.

In the fields of Natural Language Processing and Sentiment Analysis, this translates into the creation of a specifically dedicated task, namely Stance Detection (SD), which is defined as the task of automatically determining from the text whether the author of a given textual content is in favor of, against, or neutral towards a certain target. Research on this topic, beyond mere academic interest, could have an impact on different aspects of everyday life, such as public administration, policy-making, marketing or security strategies.

Although SD is a fairly recent research topic, considerable effort has been devoted to the creation of stance-annotated datasets. In their recent survey on the topic, Küçük and Can (2020) describe a variety of stance-annotated datasets, covering different text types such as tweets, posts in online forums, news articles, or news comments, for at least eleven languages.

The first shared task on SD was held for English at SemEval 2016, i.e. Task 6 "Detecting Stance in Tweets" (Mohammad et al., 2016b), which targeted stance towards six different targets of interest: "Hillary Clinton", "Feminist Movement", "Legalization of Abortion", "Atheism", "Donald Trump", and "Climate Change is a Real Concern". A more recent evaluation of SD systems was proposed at IberEval 2017 for both Catalan and Spanish (Taulé et al., 2017), with a single target, i.e. "Independence of Catalonia". A re-run was proposed the following year at the IberEval 2018 evaluation campaign, regarding the target "Catalan first of October Referendum" and furthermore encouraging the exploration of multimodal expressions such as audio, video and images (Taulé et al., 2018).

SardiStance@EVALITA2020 is the pioneer task for SD in Italian tweets. The motivation behind the proposal of this task is multi-faceted. On the one hand, we aimed at the creation of a new annotated dataset for SD in Italian that would enrich the panorama of available resources for this language, such as CONREF-STANCE-ITA (Lai et al., 2018)
and X-STANCE (Vamvas and Sennrich, 2020). On the other hand, the organization of this task allows a deeper investigation of SD at the contextual level, by encouraging the participants and the research community to follow a research line that has proved promising in previous work, see e.g. Lai et al. (2019), Lai et al. (2020) and Del Tredici et al. (2019). In fact, with the data distributed in Task B, different types of social network communities, based on friendships, retweets, quotes, and replies, could be investigated in order to analyze the communication among users with similar and divergent viewpoints.

The efficacy of approaches based on contextual features paired with textual information has been widely attested in the literature on SD (Magdy et al., 2016; Rajadesingan and Liu, 2014) and is additionally confirmed by the results obtained in this shared task, especially by those teams who participated in both settings (see Section 5).

2 Definition of the Task

With this task proposal, we wanted to invite participants to explore features based on the textual content of the tweet, such as structural, stylistic, and affective features, but also features based on contextual information that does not emerge directly from the text, such as knowledge about the domain of the political debate or information about the user's community. For these reasons, we proposed two different settings:

• Task A - Textual Stance Detection:
The first task was a three-class classification task where the system had to predict whether a tweet is in FAVOUR, AGAINST or NONE towards the given target, exploiting only textual information, i.e. the text of the tweet.

From reading the tweet, which of the options below is most likely to be true about the tweeter's stance towards the target? (Mohammad et al., 2016a)

1. FAVOUR: We can infer from the tweet that the tweeter supports the target.

2. AGAINST: We can infer from the tweet that the tweeter is against the target.

3. NONE: We can infer from the tweet that the tweeter has a neutral stance towards the target, or there is no clue in the tweet to reveal the stance of the tweeter towards the target.

• Task B - Contextual Stance Detection:
The second task was the same as the first one: a three-class classification task where the system had to predict whether a tweet is in FAVOUR, AGAINST or NONE towards the given target. Here, participants had access to a wider range of contextual information based on the post, such as the number of retweets, the number of favours, the number of replies and the number of quotes received by the tweet, the type of posting source (e.g. iOS or Android), and the date of posting. Furthermore, we shared (and encouraged the exploitation of) contextual information related to the user, such as the number of tweets ever posted, the user's bio, the user's number of followers, and the user's number of friends. Additionally, we shared contextual information about the users' social network, namely friend, reply, retweet, and quote relations. The personal IDs of the users were anonymized, but their network structures were kept intact.

Participants could decide to take part in both tasks or only in one, although they were encouraged to participate in both.

3 Data

We chose to gather the data from the social networking site Twitter, due to the free availability of a huge amount of user-generated data and because it allowed us to explore different types of relations among the users involved in a debate.

3.1 Collection and annotation of the data

We collected around 700K tweets written in Italian about the "Movimento delle Sardine" (Sardines movement1), retrieving tweets containing the keywords "sardina" and "sardine" and the homonymous hashtags. Furthermore, we collected all the conversation threads to which each such tweet belongs, iteratively following the reply tree. We also collected the quoted tweets and the list of all the retweets of each previously recovered tweet, obtaining about 1M tweets. Finally, we collected the friend list of all the users included in the annotated dataset.

1 https://en.wikipedia.org/wiki/Sardines_movement.

The tweets were gathered between the 46th week of 2019 (November) and the 5th week of 2020 (January), corresponding to a 12-week time window. Drawing on our experience as participants in previous shared tasks on SD, and in order to reduce noise in the text, we collected data taking into account the following constraints: only one tweet per author for each week, no retweets, no replies, no quotes, no tweets containing URLs, no tweets containing pictures or videos.

Then, we included only Italian tweets posted using a limited number of "sources" (utilities used to post the tweet, such as iOS, Android, etc.), in order to avoid including pre-written tweets posted using a Tweet button.2 Furthermore, we validated that all the collected tweets presented a pairwise Jaccard similarity coefficient < 0.8. From about 25K filtered tweets, we finally randomly selected around 300 tweets for each week (only the first week of 2020 does not reach 300 tweets), thus obtaining 3,600 tweets in total.

2 https://developer.twitter.com/en/docs/twitter-for-websites/tweet-button/overview.

Figure 1: Platform for the annotation of tweets.

We created a web platform for annotation purposes (see Figure 1), in order to facilitate the labelling task for the annotators, unifying the visualization mode and shuffling the tweets in a random order.3 12 native Italian speakers with an interest in news and politics were involved in the annotation, following detailed guidelines we provided, with annotation examples in their native language. We randomly shuffled the annotators and matched them into 66 pairs, each pair annotating 55 tweets. As a result, each annotator labelled 605 tweets independently, and each tweet was annotated by two annotators, who had to choose among four different labels: AGAINST, FAVOUR, NONE/NEUTRAL and OUT OF TOPIC.

3 In this way, each annotator was guaranteed to see emojis – which we believe are essential in order to understand the correct stance – in the same way as the other annotators, independently of the device used.

Furthermore, as can also be seen in Figure 1 (Tonight we are all sardines in Bologna #bolognanonsilega), we asked the annotators to mark whether, in their opinion, the tweet was IRONIC or NOT IRONIC. In the end, we were not able to obtain satisfactory results on this front, so we did not include irony in the task.

3.2 Analysis of the annotation

At the end of the first phase of annotation, which lasted roughly a month, we obtained 2,256 tweets in agreement, with a clear decision on one of the three main classes. Another 917 tweets presented a light disagreement (i.e. FAVOUR vs. NEUTRAL or AGAINST vs. NEUTRAL), and the remaining 457 tweets were discarded because the majority of annotators considered them out of topic or because they were in strong disagreement (i.e. FAVOUR vs. OUT OF TOPIC).

We then proceeded to resolve those 917 tweets whose disagreement was deemed "light", in order to obtain a bigger dataset. We resorted once again to the annotation platform used in the first phase, revised the annotation guidelines, and asked the annotators to label the tweets again. In this phase, we ensured that the tweets in disagreement were not assigned to the same pair of annotators that had previously labelled them, and furthermore we chose to show the two conflicting annotations, along with any comment (if present), to the annotator who had to resolve the disagreement.

After the second phase, we computed the inter-annotator agreement (IAA) through Cohen's kappa coefficient (over the three main classes), resulting in κ = 0.493 (weak agreement). The same coefficient was also used to compute the IAA among annotators over the two most significant classes (AGAINST and FAVOUR, excluding the NEUTRAL class), resulting in a higher score: κ = 0.769 (moderate agreement). Notably, we observed that, in the first phase of the annotation, the IAA changes significantly depending on the observed pair of annotators (it ranges from 0.873 to 0.473). We also noticed that the average IAA of a single annotator, computed over the IAA scores between that annotator and each of the remaining 11 annotators, can change significantly (ranging from 0.704 to 0.609). In other words, some annotators tend to strongly agree with all the others, while others tend to disagree with the majority. As future work,
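The Jaccard-based near-duplicate filter can be sketched in a few lines. The following illustration assumes token-level sets over lowercased whitespace tokens, since the exact tokenization used by the organizers is not specified:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two tweets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def filter_near_duplicates(tweets, threshold=0.8):
    """Keep a tweet only if its similarity to every kept tweet is < threshold."""
    kept = []
    for t in tweets:
        if all(jaccard(t, k) < threshold for k in kept):
            kept.append(t)
    return kept
```

The pairwise check above is quadratic in the number of tweets; on collections of this size (25K candidates), blocking or hashing tricks would normally be added, but the threshold logic is the same.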
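The agreement figures discussed in Section 3.2 rely on Cohen's kappa, which is small enough to sketch from scratch for two annotators (an illustration, not the organizers' actual computation):

```python
from collections import Counter

def cohen_kappa(y1, y2):
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(y1) == len(y2)
    n = len(y1)
    # Observed agreement: fraction of items given identical labels.
    po = sum(a == b for a, b in zip(y1, y2)) / n
    # Expected chance agreement under independent labelling with each
    # annotator's marginal label distribution.
    c1, c2 = Counter(y1), Counter(y2)
    pe = sum(c1[lab] * c2[lab] for lab in c1) / (n * n)
    if pe == 1:
        return 1.0
    return (po - pe) / (1 - pe)
```

For example, two annotators who agree on 4 of 6 items, with marginals A:3/F:2/N:1 and A:4/F:1/N:1, obtain kappa = (4/6 - 15/36) / (1 - 15/36) = 3/7.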
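The pairing scheme is easy to sanity-check: 12 annotators give C(12,2) = 66 unordered pairs, each annotator belongs to 11 of them, and 11 pairs × 55 tweets yields the 605 tweets per annotator. A minimal sketch:

```python
from itertools import combinations

annotators = [f"ann{i}" for i in range(12)]
pairs = list(combinations(annotators, 2))   # all unordered annotator pairs
tweets_per_pair = 55

# Each annotator sits in 11 of the 66 pairs...
per_annotator_pairs = sum(1 for p in pairs if "ann0" in p)
# ...and therefore labels 11 * 55 = 605 tweets independently.
per_annotator_tweets = per_annotator_pairs * tweets_per_pair
```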
we aim to shed more light on this phenomenon by exploring the background of the annotators and the social relationships among them.

3.3 Composition of the dataset

After the second round of annotation, we were finally able to create the official dataset for the SardiStance shared task. It is composed of a total of 3,242 tweets, 1,770 of which belong to the class AGAINST, 785 to the class FAVOUR, and 687 to the class NONE. In Table 1 we show the distribution of these instances across the training set and the test set, and in Table 2 we report an example tweet for each class.

        TRAINING SET                     TEST SET
  AGAINST   FAVOUR   NONE      AGAINST   FAVOUR   NONE
   1,028     589     515         742      196     172
          2,132                         1,110

           Table 1: Distribution of tweets.

  AGAINST: "LE SARDINE IN PIAZZA MAGGIORE NON SONO ITALIANI SE LO
  FOSSERO NON SI METTEREBBERO CONTRO LA DESTRA CHE AMA L'ITALIA E
  VUOLE RIMANERE ITALIANA"
  (THE SARDINES IN PIAZZA MAGGIORE ARE NOT ITALIAN IF THEY WERE THEY
  WOULD NOT GO AGAINST THE RIGHT THAT LOVES ITALY AND WANTS TO
  REMAIN ITALIAN)

  FAVOUR: "Non ci credo che stasera devo andare in teatro e non posso
  essere fra le #Sardine #Bologna #bolognanonsilega"
  (I can't believe that I have to go to the theater tonight and I
  can't be among the #Sardines #Bologna #bolognanonsilega)

  NONE: "Mi sono svegliato nudo e triste perché a Bologna, tra
  salviniani e antisalviniani, non mi ha cagato nessuno."
  (I woke up naked and sad because in Bologna, between Salvinians and
  anti-Salvinians, nobody paid me attention.)

           Table 2: Examples from the dataset.

3.4 Data Release

We shared the data following the methodology recommended in (Rangel and Rosso, 2018), in order to comply with GDPR privacy rules and Twitter's policies. The identifiers of tweets and users have been anonymized and replaced by unique identifiers. Of the location and description fields of the user's biography, we exclusively released the emojis they may contain, in order to make it very hard to trace users and to preserve everybody's privacy.

Task A

The training data (TRAIN.csv) was released in the following format:

    tweet_id   user_id   text   label

where tweet_id is the Twitter ID of the message, user_id is the Twitter ID of the user who posted the message, text is the content of the message, and label is AGAINST, FAVOUR or NONE.

Task B

In order to participate in Task B, we released additional contextual information:

• the file TWEET.csv, containing contextual information regarding the tweet, in the following format:

    tweet_id   user_id   retweet_count   favorite_count   source   created_at

where tweet_id is the Twitter ID of the message, user_id is the Twitter ID of the user who posted the message, retweet_count indicates the number of times the tweet has been retweeted, favorite_count indicates the number of times the tweet has been liked, source indicates the type of posting source (e.g. iOS or Android), and created_at displays the time of creation in yyyy-mm-dd hh:mm:ss format. Minutes and seconds have been masked and set to zero for privacy reasons.

• the file USER.csv, containing contextual information regarding the user, released in the following format:

    user_id   statuses_count   friends_count   followers_count   created_at   emoji

where user_id is the Twitter ID of the user who posted the message, statuses_count indicates the number of tweets ever posted by the user, friends_count indicates the number of friends of the user, followers_count indicates the number of followers of the user, created_at displays the time of the user's registration on Twitter, and emoji shows a list of the emojis in the user's bio (if present; otherwise the field is left empty).

• the files FRIEND.csv, QUOTE.csv, REPLY.csv and RETWEET.csv, containing contextual information about the social network of the user. Each file was released in the following format:

    Source   Target   Weight
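Given the four-column release format described in Section 3.4 (tweet_id, user_id, text, label), class counts like these can be recomputed with the standard library alone. The snippet below uses an inline stand-in for TRAIN.csv and assumes comma separation, as the file extension suggests:

```python
import csv
import io
from collections import Counter

# Inline stand-in for TRAIN.csv; the real file uses the same four columns.
sample = io.StringIO(
    "tweet_id,user_id,text,label\n"
    "1,10,some tweet,AGAINST\n"
    "2,11,another tweet,FAVOUR\n"
    "3,12,a third tweet,AGAINST\n"
)
rows = list(csv.DictReader(sample))
distribution = Counter(row["label"] for row in rows)
```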
where Source and Target indicate two nodes of a social interaction between two Twitter users. More specifically, the source user performs one of the considered social relations towards the target user. Two users are tied by a friend relationship if the source user follows the target user (the friend relationship does not have a weight, because it is either present or absent), while two users are tied by a quote, retweet, or reply relationship if the source user respectively quoted, retweeted, or replied to the target user. Table 4 shows some metrics about the shared networks.

             nodes       edges
  friend     669,817     3,076,281
  retweet    110,315     575,460
  quote      2,903       7,899
  reply      14,268      29,939

           Table 4: Network metrics.

Weight indicates the number of interactions existing between two users. Note that this information is not available for the friend relation (hence, this column was not present in the FRIEND.csv file), since friendship is a relationship of the present/absent type and cannot be described through a weight. In all the files, users are identified by their anonymized User ID.

Regrettably, we did not think to anonymize the screen names contained in the text of the tweets (with the same numeric string used to anonymize users), which would have allowed matching them with the users' IDs and exploring the network based on mentions. We will certainly take this into account in our future work.

4 Evaluation Measures

Each participating team was allowed to submit a maximum of 4 runs for each sub-task: two constrained runs and two unconstrained runs. Submitting at least one constrained run was in any case compulsory. We decided to provide two separate official rankings for Task A and Task B, and two separate rankings for constrained and unconstrained runs. Systems have been evaluated using the F1-score computed over the two main classes (FAVOUR and AGAINST). Therefore, the submissions have been ranked by the F1-score averaged over the two classes, according to the following equation: F1avg = (F1favour + F1against)/2.

4.1 Baselines

We computed a baseline for Task A using a simple machine learning model: a Support Vector Classifier based on token uni-gram features. A second baseline, computed for Task B, is a system based on our previous work on Stance Detection: a Logistic Regression classifier paired with token n-gram features (unigrams, bigrams and trigrams), plus features based on a binary one-hot encoding of the communities extracted from the network of retweets and the network of friends (see the best system for Italian in Lai et al. (2020)).

5 Participants and results

A total of 12 teams, from both academia and industry, participated in at least one of the two tasks of SardiStance. In Table 3 we provide an overview of the teams in alphabetical order.

  team name       institution                          report                          task
  deepreading     UNED, Spain                          (Espinosa et al., 2020)         A, B
  GhostWriter     You Are My Guide, Italy              (Bennici, 2020)                 A, B
  IXA             UPV/EHU, Spain                       (Espinosa et al., 2020)         A, B
  MeSoVe          ISASI, Italy                         -                               A
  QMUL-SDS        QMUL-SDS-EECS, UK                    (Alkhalifa and Zubiaga, 2020)   A, B
  SSN_NLP         CSE Department/SSNCE, India          (Kayalvizhi et al., 2020)       A
  SSNCSE-NLP      SSN College of Engineering, India    (Bharathi et al., 2020)         A, B
  TextWiller      UNIPD, Italy                         (Ferraccioli et al., 2020)      A, B
  UNED            UPV/EHU and UNED, Spain              (Espinosa et al., 2020)         B
  UninaStudents   UNINA, Italy                         (Moraca et al., 2020)           A
  UNITOR          UNIROMA2, Italy                      (Giorgioni et al., 2020)        A
  Venses          UNIVE, Italy                         (Delmonte, 2020)                A

           Table 3: Participants and reports.

Teams were allowed to submit up to four runs (2 constrained and 2 unconstrained) in case they implemented different systems. Furthermore, each team had to submit at least one constrained run. Participants were invited to submit multiple runs to experiment with different models and architectures. However, they have been discouraged from
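The official metric is straightforward to reproduce. The following sketch (an illustration, not the organizers' evaluation script) computes per-class F1 and averages it over FAVOUR and AGAINST only, so that NONE does not enter the ranking score:

```python
def f1_for(label, gold, pred):
    """Precision/recall-based F1 for a single class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def f1_avg(gold, pred):
    """Task metric: mean of F1 over FAVOUR and AGAINST (NONE is ignored)."""
    return (f1_for("FAVOUR", gold, pred) + f1_for("AGAINST", gold, pred)) / 2
```

Note that NONE predictions still influence the score indirectly, since a tweet wrongly pushed into or out of NONE changes the false positives and false negatives of the two scored classes.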
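The Task A baseline can be approximated in a few lines of scikit-learn. This is a sketch with default hyperparameters and toy data, since the paper does not report the exact settings:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Token uni-gram counts feeding a linear Support Vector Classifier.
baseline = make_pipeline(CountVectorizer(ngram_range=(1, 1)), LinearSVC())

# Toy stand-ins for the annotated tweets; the real system is trained
# on the 2,132 tweets of the SardiStance training set.
texts = ["viva le sardine", "abbasso le sardine", "stasera vado a teatro"]
labels = ["FAVOUR", "AGAINST", "NONE"]
baseline.fit(texts, labels)
predictions = baseline.predict(["forza sardine in piazza"])
```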
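The Source/Target/Weight edge lists can be loaded with the standard library alone (a graph library such as networkx would work equally well). The file content below is an illustrative stand-in, with anonymized numeric IDs as in the release:

```python
import csv
import io
from collections import defaultdict

# Inline stand-in for e.g. RETWEET.csv.
sample = io.StringIO(
    "Source,Target,Weight\n"
    "1,2,3\n"
    "1,3,1\n"
    "2,3,2\n"
)
# adjacency[source][target] = number of interactions between the two users
adjacency = defaultdict(dict)
for row in csv.DictReader(sample):
    adjacency[row["Source"]][row["Target"]] = int(row["Weight"])

# Out-degree: number of distinct users each source user interacted with.
out_degree = {u: len(nbrs) for u, nbrs in adjacency.items()}
```

For FRIEND.csv the Weight column is absent, so the value stored per edge would simply be 1 (present) under the same scheme.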
submitting slight variations of the same model. Overall, we received 22 runs for Task A and 13 runs for Task B.

5.1 Task A: Textual Stance Detection

Table 5 shows the results for the textual stance detection task, which attracted 22 total submissions from 11 different teams. Since the only two systems in an unconstrained setting were submitted by the same team, we decided not to create a separate ranking for them, but rather to include them in the same ranking, marking them with a different color (gray in Table 5).

  team name       run        F1-score
                        AVG      AGAINST   FAVOUR   NONE
  UNITOR          1     .6853    .7866     .5840    .3910
  UNITOR          1     .6801    .7881     .5721    .3979
  UNITOR          2     .6793    .7939     .5647    .3672
  DeepReading     1     .6621    .7580     .5663    .4213
  UNITOR          2     .6606    .7689     .5522    .3702
  IXA             1     .6473    .7616     .5330    .3888
  GhostWriter     1     .6257    .7502     .5012    .3810
  IXA             2     .6171    .7543     .4800    .3675
  SSNCSE-NLP      2     .6067    .7723     .4412    .2113
  DeepReading     2     .6004    .6966     .5042    .3916
  GhostWriter     2     .6004    .7224     .4784    .3778
  UninaStudents   1     .5886    .7850     .3922    .2326
  baseline              .5784    .7158     .4409    .2764
  TextWiller      1     .5773    .7755     .3791    .1849
  SSNCSE-NLP      1     .5749    .7307     .4192    .3388
  QMUL-SDS        1     .5595    .7091     .4099    .2313

strong result to beat (F1avg = 0.5784).

5.2 Task B: Contextual Stance Detection

Table 6 shows the results for the contextual stance detection task, which attracted 13 total submissions from 7 different teams.

  team name      run        F1-score
                       AVG      AGAINST   FAVOUR   NONE
  IXA            3     .7445    .8562     .6329    .4214
  TextWiller     1     .7309    .8505     .6114    .2963
  DeepReading    1     .7230    .8368     .6093    .3364
  DeepReading    2     .7222    .8300     .6143    .4251
  TextWiller     2     .7147    .8298     .5995    .3680
  QMUL-SDS       1     .7088    .8267     .5908    .1811
  UNED           2     .6888    .8175     .5600    .2455
  QMUL-SDS       2     .6765    .8134     .5396    .1553
  SSNCSE-NLP     2     .6582    .7915     .5249    .3691
  SSNCSE-NLP     1     .6556    .7914     .5198    .3880
  baseline             .6284    .7672     .4895    .3009
  GhostWriter    1     .6257    .7502     .5012    .3810
  GhostWriter    2     .6004    .7224     .4784    .3778
  UNED           1     .5313    .7399     .3226    .2000

           Table 6: Results Task B.

The best score was achieved by the IXA team, which with a constrained run obtained the highest score of F1avg = 0.7445. The best F1-scores for the main classes AGAINST and FAVOUR were also achieved by the 1st-ranked team, IXA, with F1against = 0.8562 and F1favour = 0.6329, respectively. Once
 QMUL-SDS         2     .5329     .6478    .4181   .3049
 MeSoVe           1     .4989     .7336    .2642   .3118
                                                               again, the Deepreading team, ranking 3rd and
 TextWiller       2     .4715     .6713    .2718   .2884       4th, has obtained the best F1-score for the NONE
 SSN_NLP          1     .4707     .5763    .3651   .3364       class, with F1none = 0.4251.
 SSN_NLP          2     .4473     .6545    .2402   .1913
 Venses           1     .3882     .5325    .2438   .2022          Almost all participating systems show an im-
 Venses           2     .3637     .4564    .2710   .2387       provement over the baseline, which was computed
                                                               using a Logistic Regression classifier paired with
                 Table 5: Results Task A.                      token n-grams features (unigrams, bigrams and tri-
                                                               grams), features based on the network of retweets,
The best results are achieved by the UNITOR team
                                                               and features based on the network of friends (Lai
that, with an unconstrained, ranked as 1st position
                                                               et al., 2020).
with F1avg = 0.6853. The best result for the con-
strained runs is achieved once again by the UNI-
                                                               6     Discussion
TOR team with F1avg = 0.6801.
   The best results for the two main classes                   In this section we compare the participating sys-
AGAINST and FAVOR are obtained by the three                    tems according to the following main dimensions:
best systems of the ranking, which are all submis-             system architecture, features, use of additional an-
sions by the team UNITOR. On the other hand,                   notated data for training, and use of external re-
though, the Deepreading team, ranking as 4th,                  sources (e.g. sentiment lexica, NLP tools, etc.).
has obtained the best F1-score for the NONE class,             We also operate a distinction between runs sub-
with F1none = 0.4213.                                          mitted in Task A and those submitted in Task B.
   Among the 12 participating teams, at least 6                This discussion is based on the participants’ re-
show an improvement over the baseline, which                   ports and the answers the participants provided to
was computed using an SVM paired with token                    a questionnaire proposed by the organizers. Two
unigrams as unique feature, resulting an already               teams, namely TextWiller and Venses wrote a

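The ranking metric reported in the tables can be reproduced directly: F1avg is the mean of the per-class F1-scores of the two main classes AGAINST and FAVOUR only, as the table values confirm (e.g. (0.7866 + 0.5840) / 2 = 0.6853 for the top Task A run, and (0.8562 + 0.6329) / 2 = 0.7445 for the top Task B run). A minimal stdlib-only sketch, with hypothetical gold/predicted label lists:

```python
def f1_per_class(gold, pred, label):
    # Precision, recall and F1 for a single class from raw counts.
    tp = sum(1 for g, p in zip(gold, pred) if g == p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def f1_avg(gold, pred):
    # SardiStance ranking score: mean F1 of AGAINST and FAVOUR only
    # (NONE is reported in the tables but excluded from the average).
    return (f1_per_class(gold, pred, "AGAINST")
            + f1_per_class(gold, pred, "FAVOUR")) / 2
```

Note that a system can rank well on F1avg while performing poorly on NONE, which is visible in several rows of both tables.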
joint report, overlapping between this task and the HaSpeeDe 2 task (Sanguinetti et al., 2020), as they participated in both competitions. The three following teams, DeepReading, IXA, and UNED, also wrote a single joint report, as the participants belong to the same research project and wanted to compare their three different approaches.

6.1  Systems participating in Task A

System architecture. Among all submitted runs we counted a great variety of architectures, ranging from classical machine learning classifiers, to recent state-of-the-art approaches, and statistically-based models. For instance, regarding the use of classical ML, the team UninaStudents used an SVM, and the team MeSoVe used Logistic Regression in one run. Regarding the use of neural networks, the QMUL-SDS team used a bidirectional LSTM, a 2D-CNN, and a bi-LSTM with attention. SSN_NLP also exploited an LSTM neural network.

Four teams exploited different variants of the BERT model: GhostWriter used AlBERTo, trained on Italian tweets; IXA used GilBERTo and UmBERTo,4 while UNITOR adopted only the latter model. Finally, the DeepReading team made use of transformers such as BERT XXL and XLM-RoBERTa, paired with linear classifiers. TextWiller is the only team to have exploited the XGBoost algorithm, and ItVenses relied on supervised models based on statistics and semantics. The UNED team instead proposed a voting system over the outputs of different models.

Features. Besides exploring a variety of system architectures, the teams participating in Task A also used many different textual features, in most cases based on n-grams or char-grams. MeSoVe and TextWiller additionally engineered features based on emoticons. The team UNED, in one of their runs, proposed a system relying on psychological and social features, while UninaStudents proposed features based on unigrams of hashtags. Interestingly, UNITOR added special tags to the texts, which are the result of a classification with respect to some so-called "auxiliary tasks". In particular, they trained three classifiers based respectively on SENTIPOLC 2016 (Barbieri et al., 2016) for sentiment analysis classification, on HaSpeeDe 2018 (Bosco et al., 2018) for hate speech detection, and on IronITA 2018 (Cignarella et al., 2018) for irony detection; they then added three tags to each instance of the SardiStance datasets with respect to these three dimensions: sentiment, hate and irony. ItVenses proposed features collected automatically from a unique dictionary list, the frequency of occurrence of emojis and emoticons, and semantic features investigating the propositional level, factivity and speech act type.

Additional training data. The only team who participated in the unconstrained setting of SardiStance is UNITOR. They proposed two unconstrained runs in addition to two constrained ones. For the unconstrained setting, they downloaded and labeled about 3,200 tweets using distant supervision and used the additional data to train their systems. In particular, they created the following subsets:
- 1,500 AGAINST: tweets from 2019 containing the hashtag #gatticonsalvini;
- 1,000 FAVOUR: tweets from 2019 containing the hashtags #nessunotocchilesardine, #iostoconlesardine, #unmaredisardine, #vivalesardine and #forzasardine;
- 700 NONE/NEUTRAL: texts derived from news titles, retrieved by querying Google News with the keyword "sardine".

Other resources. Five teams declared to have also used other resources, such as lexica, word embeddings, or others. In particular, GhostWriter used a grammar model to rephrase the tweets. MeSoVe exploited SenticNet (Cambria et al., 2014) and the "Nuovo vocabolario di base della lingua italiana".5 QMUL-SDS took advantage of temporal embeddings and FastText, while only one team, UninaStudents, used a sentiment lexicon: AFINN (Nielsen, 2011). Lastly, Venses used a proprietary lexicon of Italian, enriched with conceptual, semantic and syntactic information; similarly, the TextWiller approach relies on a self-created vocabulary and on word embeddings trained on the PAISÀ corpus (Lyding et al., 2014).

6.2  Systems participating in Task B

Seven teams participated in Task B, submitting a total of 13 runs. Most teams extensively explored the additional features available for Task B; GhostWriter, on the contrary, proposed the same

4 https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1
5 https://dizionario.internazionale.it

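The token n-gram features on which most Task A systems and both baselines rely (token unigrams for the Task A baseline; unigrams, bigrams and trigrams for the Task B baseline) can be sketched in a few lines of stdlib Python. The whitespace tokenizer below is a deliberately naive placeholder; real systems would use a tweet-aware tokenizer handling hashtags, mentions and emojis:

```python
from collections import Counter

def token_ngrams(text, n_max=3):
    # Count all token n-grams of length 1..n_max in a tweet.
    # Naive lowercase whitespace tokenization (placeholder).
    tokens = text.lower().split()
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats
```

These counts would then feed a linear classifier (an SVM for the Task A baseline, Logistic Regression for the Task B baseline).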

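UNITOR's distant-supervision step amounts to assigning a stance label from the partisan hashtags a tweet contains. A minimal sketch, with the hashtag lists taken from the paper; the handling of tweets matching both sides (discarded as ambiguous) and the punctuation stripping are our assumptions, not details from the original description:

```python
FAVOUR_TAGS = {"#nessunotocchilesardine", "#iostoconlesardine",
               "#unmaredisardine", "#vivalesardine", "#forzasardine"}
AGAINST_TAGS = {"#gatticonsalvini"}

def distant_label(text):
    # Label a tweet by the partisan hashtags it contains; tweets
    # matching both sides are discarded as ambiguous (an assumption).
    toks = {t.lower().rstrip(".,!?") for t in text.split()}
    fav = bool(toks & FAVOUR_TAGS)
    agn = bool(toks & AGAINST_TAGS)
    if fav and not agn:
        return "FAVOUR"
    if agn and not fav:
        return "AGAINST"
    return None  # ambiguous or unlabeled
```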
two approaches presented in Task A. Notably, the three runs with a score lower than the baseline did not benefit from any features based on the users' social network.

System architecture. Most teams enriched the models they submitted in Task A by taking advantage of the contextual information available in Task B. UNED, DeepReading, and TextWiller exploited the XGBoost algorithm, selecting different features from the contextual data. The language model BERT was used in different variants by SSNCSE-NLP, DeepReading, and IXA. In particular, the last two teams proposed three voting-based ensemble methods that use two or more models exploiting the XGBoost algorithm. Furthermore, the neural network framework proposed by QMUL-SDS combines four different embedding methods into a dense layer, generating the final label using a softmax activation function.

Features. Not every team took full advantage of the contextual information. For example, SSNCSE-NLP only exploits the number of friends in run 1, and the number of quotes and friends in run 2. In its run 1, UNED also exploited some features based on the tweets, in addition to the psychological and emotional ones, using the XGBoost algorithm. The other teams exploited different approaches for learning vector representations of the nodes of the available networks. DeepReading, IXA, and UNED proposed a feature that computes the mean distance of each user to the rest of the users whose stance is known. TextWiller experimented with multi-dimensional scaling (MDS), retaining the first and second dimensions for each of the four networks instantiated. Node2vec and DeepWalk were used for learning a vector representation of the nodes of the networks in QMUL-SDS's runs 1 and 2, respectively.

The comparison between the approaches used for dealing with Task A and Task B clearly highlights the benefits of exploiting information from different and heterogeneous sources. In particular, it is interesting to observe that all the teams that participated in both tasks also produced better results in the second setting. Experimenting with different classifiers trained on the textual content of the tweets as well as on features based on contextual information (additional information on the tweets, on the users, or on their social networks) therefore seems to lead to overall better results.

In particular, among the 6 teams that participated in both tasks, only 4 fully explored the social network relations of the author of the tweet. The only two runs that beat the baseline without investigating the structure of the social graphs are those submitted by the SSNCSE-NLP team. Only one team participated in both tasks exploiting the same architecture. This allowed us to compare the F1-scores obtained in the first setting with those obtained in the second, highlighting that adding contextual features could increase performance by +0.2432 in terms of F1avg.

Additionally, we calculated the increase in performance between the score obtained by the run ranked in 1st position in Task A (UNITOR, F1avg = 0.6853) and the score of the run ranked in 1st position in Task B (IXA, F1avg = 0.7445), showing that taking advantage of contextual features could increase performance by up to 8.6% in terms of F1avg.

7  Conclusions

We presented the first shared task on Stance Detection for Italian, discussing the development of the datasets used and the participation. A broad forum for discussion of techniques and state-of-the-art approaches has been opened, which can be used to investigate future research directions.

Acknowledgments

The work of C. Bosco, M. Lai and V. Patti is partially funded by the project "Be Positive!" (under the 2019 "Google.org Impact Challenge on Safety" call). The work of C. Bosco and V. Patti is also partially funded by Progetto di Ateneo/CSP 2016 Immigrants, Hate and Prejudice in Social Media (S1618_L2_BOSC_01). The work of P. Rosso is partially funded by the Spanish MICINN under the research projects MISMIS-FAKEnHATE on Misinformation and Miscommunication in social media: FAKE news and HATE speech (PGC2018-096212-B-C31) and PROMETEO/2019/121 (DeepPattern) of the Generalitat Valenciana.

A special mention also to the people who helped us with the annotation of the dataset. In random order: Matteo, Luca, Ylenia, Simona, Elisa, Sebastiano, Francesca, Simona, Komal and Angela, thank you very much for your great help.

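The mean-distance feature proposed by DeepReading, IXA, and UNED can be sketched as a breadth-first search over one of the available graphs (e.g. the friendship network), averaging the hop distance from a user to all users whose stance is known. Skipping unreachable users, as below, is one possible design choice, not necessarily the teams' own:

```python
from collections import deque

def bfs_distances(adj, source):
    # Shortest-path (hop) distances from `source` in an undirected
    # graph given as {node: set(neighbours)}.
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def mean_distance_to_known(adj, user, known_users):
    # Average distance from `user` to the users with a known stance;
    # unreachable users are skipped (an assumption of this sketch).
    dist = bfs_distances(adj, user)
    reached = [dist[u] for u in known_users if u in dist and u != user]
    return sum(reached) / len(reached) if reached else float("inf")
```

The resulting scalar (one per network) can then be appended to the textual feature vector of the corresponding tweet's author.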
References

Rabab Alkhalifa and Arkaitz Zubiaga. 2020. QMUL-SDS @ SardiStance: Leveraging Network Interactions to Boost Performance on Stance Detection using Knowledge Graphs. In Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). CEUR-WS.org.

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the EVALITA 2016 SENTIment POLarity Classification task. In Proceedings of the 5th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2016). CEUR-WS.org.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR-WS.org.

Mauro Bennici. 2020. ghostwriter19 @ SardiStance: Generating new tweets to classify SardiStance EVALITA 2020 political tweets. In Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). CEUR-WS.org.

B. Bharathi, J. Bhuvana, and Nitin Nikamanth Appiah Balaji. 2020. SardiStance@EVALITA2020: Textual and Contextual stance detection from Tweets using machine learning approach. In Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). CEUR-WS.org.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 Hate Speech Detection Task. In Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2018). CEUR-WS.org.

Erik Cambria, Daniel Olsher, and Dheeraj Rajagopal. 2014. SenticNet 3: A Common and Common-sense Knowledge Base for Cognition-driven Sentiment Analysis. In Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI 2014).

Alessandra Teresa Cignarella, Simona Frenda, Valerio Basile, Cristina Bosco, Viviana Patti, and Paolo Rosso. 2018. Overview of the EVALITA 2018 task on Irony Detection in Italian Tweets (IronITA). In Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2018). CEUR-WS.org.

Marco Del Tredici, Diego Marcheggiani, Sabine Schulte im Walde, and Raquel Fernández. 2019. You Shall Know a User by the Company It Keeps: Dynamic Representations for Social Media Users in NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019). ACL.

Rodolfo Delmonte. 2020. Venses @ HaSpeeDe2 & SardiStance: Multilevel Deep Linguistically Based Supervised Approach to Classification. In Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). CEUR-WS.org.

Maria S. Espinosa, Rodrigo Agerri, Alvaro Rodrigo, and Roberto Centeno. 2020. DeepReading @ SardiStance: Combining Textual, Social and Emotional Features. In Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). CEUR-WS.org.

Federico Ferraccioli, Andrea Sciandra, Mattia Da Pont, Paolo Girardi, Dario Solari, and Livio Finos. 2020. TextWiller @ SardiStance, HaSpeede2: Text or Con-text? A smart use of social network data in predicting polarization. In Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). CEUR-WS.org.

Simone Giorgioni, Marcello Politi, Samir Salman, Danilo Croce, and Roberto Basili. 2020. UNITOR@Sardistance2020: Combining Transformer-based architectures and Transfer Learning for robust Stance Detection. In Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). CEUR-WS.org.

S. Kayalvizhi, D. Thenmozhi, and Chandrabose Aravindan. 2020. SSN_NLP@SardiStance: Stance Detection from Italian Tweets using RNN and Transformers. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). CEUR-WS.org.

Dilek Küçük and Fazli Can. 2020. Stance detection: A survey. ACM Computing Surveys, 53(1):1–37.

Mirko Lai, Viviana Patti, Giancarlo Ruffo, and Paolo Rosso. 2018. Stance evolution and Twitter interactions in an Italian political debate. In Proceedings of the 23rd International Conference on Natural Language & Information Systems (NLDB 2018). Springer.

Mirko Lai, Marcella Tambuscio, Viviana Patti, Giancarlo Ruffo, and Paolo Rosso. 2019. Stance polarity in political debates: A diachronic perspective of network homophily and conversations on Twitter. Data & Knowledge Engineering, 124:101738.

Mirko Lai, Alessandra Teresa Cignarella, Delia Irazú Hernández Farías, Cristina Bosco, Viviana Patti, and Paolo Rosso. 2020. Multilingual stance detection in social media political debates. Computer Speech & Language, 63:101075.

Verena Lyding, Egon Stemle, Claudia Borghetti, Marco Brunello, Sara Castagnoli, Felice Dell'Orletta, Henrik Dittmann, Alessandro Lenci, and Vito Pirrelli. 2014. The PAISÀ Corpus of Italian Web Texts. In Proceedings of the 9th Web as Corpus Workshop (WaC-9) @ the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014). ACL.

Walid Magdy, Kareem Darwish, Norah Abokhodair, Afshin Rahimi, and Timothy Baldwin. 2016. #isisisnotislam or #deportallmuslims?: Predicting unspoken views. In Proceedings of the 8th ACM Conference on Web Science (WebSci 2016). ACM.

Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. 2016a. A Dataset for Detecting Stance in Tweets. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). ELRA.

Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. 2016b. SemEval-2016 Task 6: Detecting Stance in Tweets. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). ACL.

Maurizio Moraca, Gianluca Sabella, and Simone Morra. 2020. UninaStudents @ SardiStance: Stance detection in Italian tweets - Task A. In Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). CEUR-WS.org.

Finn Årup Nielsen. 2011. AFINN. Richard Petersens Plads, Building 321.

Ashwin Rajadesingan and Huan Liu. 2014. Identifying users with opposing opinions in Twitter debates. In Proceedings of the 7th Social Computing, Behavioral-Cultural Modeling and Prediction International Conference (SBP-BRiMS 2014). Springer.

Francisco Rangel and Paolo Rosso. 2018. On the implications of the general data protection regulation on the organisation of evaluation tasks. Language and Law / Linguagem e Direito, 5(2):95–117.

Manuela Sanguinetti, Gloria Comandini, Elisa Di Nuovo, Simona Frenda, Marco Stranisci, Cristina Bosco, Tommaso Caselli, Viviana Patti, and Irene Russo. 2020. HaSpeeDe 2@EVALITA2020: Overview of the EVALITA 2020 Hate Speech Detection Task. In Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). CEUR-WS.org.

Mariona Taulé, M. Antònia Martí, Francisco M. Rangel Pardo, Paolo Rosso, Cristina Bosco, and Viviana Patti. 2017. Overview of the Task on Stance and Gender Detection in Tweets on Catalan Independence. In Proceedings of the 2nd Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017) co-located with the 33rd Conference of the Spanish Society for Natural Language Processing (SEPLN 2017). CEUR-WS.org.

Mariona Taulé, Francisco M. Rangel Pardo, M. Antònia Martí, and Paolo Rosso. 2018. Overview of the Task on Multimodal Stance Detection in Tweets on Catalan #1Oct Referendum. In Proceedings of the 3rd Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018) co-located with the 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2018). CEUR-WS.org.

Jannis Vamvas and Rico Sennrich. 2020. X-Stance: A Multilingual Multi-Target Dataset for Stance Detection. In Proceedings of the 5th Swiss Text Analytics Conference (SwissText 2020) & 16th Conference on Natural Language Processing (KONVENS 2020). CEUR-WS.org.