=Paper= {{Paper |id=Vol-2765/159 |storemode=property |title=SardiStance @ EVALITA2020: Overview of the Task on Stance Detection in Italian Tweets |pdfUrl=https://ceur-ws.org/Vol-2765/paper159.pdf |volume=Vol-2765 |authors=Alessandra Teresa Cignarella,Mirko Lai,Cristina Bosco,Viviana Patti,Paolo Rosso |dblpUrl=https://dblp.org/rec/conf/evalita/CignarellaLBPR20 }} ==SardiStance @ EVALITA2020: Overview of the Task on Stance Detection in Italian Tweets== https://ceur-ws.org/Vol-2765/paper159.pdf
                            SardiStance @ EVALITA2020:
               Overview of the Task on Stance Detection in Italian Tweets

Alessandra Teresa Cignarella1,2 , Mirko Lai1 , Cristina Bosco1 , Viviana Patti1 and Paolo Rosso2
             1. Dipartimento di Informatica, Università degli Studi di Torino, Italy
             2. PRHLT Research Center, Universitat Politècnica de València, Spain
                 {lai,cigna,bosco,patti}@di.unito.it, prosso@dsic.upv.es



                          Abstract

English. SardiStance is the first shared task for Italian on the automatic classification of stance in tweets. It is articulated in two different settings: A) Textual Stance Detection, exploiting only the information provided by the tweet, and B) Contextual Stance Detection, with the addition of information about the tweet itself, such as the number of retweets, the number of favours or the date of posting; contextual information about the author, such as follower count, location and user's biography; and additional knowledge extracted from the user's network of friends, followers, retweets, quotes and replies. The task has been one of the most participated in at EVALITA 2020 (Basile et al., 2020), with a total of 22 submitted runs for Task A, 13 for Task B, and 12 different participating teams from both academia and industry.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction/Motivation

The interest in detecting people's opinions towards particular targets, and in monitoring politically polarized debates on Twitter, has grown steadily in recent years, as attested by the proliferation of online questionnaires and polls (Küçük and Can, 2020). Indeed, through the constant monitoring of people's opinions, desires, complaints and beliefs about the political agenda or public services, policy makers could better meet the population's needs.

In the fields of Natural Language Processing and Sentiment Analysis, this translates into the creation of a specifically dedicated task, namely Stance Detection (SD), which is defined as the task of automatically determining from the text whether the author of a given textual content is in favor of, against, or neutral towards a certain target. Research on this topic, beyond mere academic interest, could have an impact on different aspects of everyday life, such as public administration, policy-making, marketing or security strategies.

Although SD is a fairly recent research topic, considerable effort has been devoted to the creation of stance-annotated datasets. In their recent survey on the topic, Küçük and Can (2020) describe a variety of stance-annotated datasets, covering different text types such as tweets, posts in online forums, news articles, or news comments, for at least eleven languages.

The first shared task on SD was held for English at SemEval 2016, i.e. Task 6 "Detecting Stance in Tweets" (Mohammad et al., 2016b), which targeted stance towards six different targets of interest: "Hillary Clinton", "Feminist Movement", "Legalization of Abortion", "Atheism", "Donald Trump", and "Climate Change is a Real Concern". A more recent evaluation of SD systems was proposed at IberEval 2017 for both Catalan and Spanish (Taulé et al., 2017), with a single target, i.e. "Independence of Catalonia". A re-run was proposed the following year at the IberEval 2018 evaluation campaign, regarding the target "Catalan first of October Referendum" and furthermore encouraging the exploration of multimodal expressions such as audio, video and images (Taulé et al., 2018).

SardiStance@EVALITA2020 is the pioneer task for SD in Italian tweets. The motivation behind the proposal of this task is multi-faceted. On the one hand, we aimed at the creation of a new annotated dataset for SD in Italian that would enrich the panorama of available resources for this language, such as CONREF-STANCE-ITA (Lai et al., 2018)
and X-STANCE (Vamvas and Sennrich, 2020). On the other hand, the organization of this task allows a deeper investigation of SD at the contextual level, by encouraging the participants and the research community to follow a research line that has proved promising in previous work, see e.g. Lai et al. (2019), Lai et al. (2020) and Del Tredici et al. (2019). In fact, with the data distributed in Task B, different types of social network communities, based on friendships, retweets, quotes, and replies, could be investigated in order to analyze the communication among users with similar and divergent viewpoints.

The efficacy of approaches based on contextual features paired with textual information has been widely attested in the literature on SD (Magdy et al., 2016; Rajadesingan and Liu, 2014) and is additionally confirmed by the results obtained in this shared task, especially by those teams who participated in both settings (see Section 5).

2 Definition of the Task

With this task proposal, we wanted to invite participants to explore features based on the textual content of the tweet, such as structural, stylistic, and affective features, but also features based on contextual information that does not emerge directly from the text, such as knowledge about the domain of the political debate or information about the user's community. For these reasons, we proposed two different settings:

• Task A - Textual Stance Detection:
The first task was a three-class classification task where the system had to predict whether a tweet is in FAVOUR, AGAINST or NONE towards the given target, exploiting only textual information, i.e. the text of the tweet.

From reading the tweet, which of the options below is most likely to be true about the tweeter's stance towards the target? (Mohammad et al., 2016a)

1. FAVOUR: We can infer from the tweet that the tweeter supports the target.

2. AGAINST: We can infer from the tweet that the tweeter is against the target.

3. NONE: We can infer from the tweet that the tweeter has a neutral stance towards the target, or there is no clue in the tweet to reveal the stance of the tweeter towards the target.

• Task B - Contextual Stance Detection:
The second task was the same as the first one: a three-class classification task where the system had to predict whether a tweet is in FAVOUR, AGAINST or NONE towards the given target. Here, participants had access to a wider range of contextual information based on the post, such as the number of retweets, the number of favours, the number of replies and the number of quotes received by the tweet, the type of posting source (e.g. iOS or Android), and the date of posting. Furthermore, we shared (and encouraged the exploitation of) contextual information related to the user, such as the number of tweets ever posted, the user's bio, the user's number of followers, and the user's number of friends. Additionally, we shared contextual information about the users' social network, namely friend, reply, retweet, and quote relations. The personal IDs of the users were anonymized, but their network structures were kept intact.

Participants could decide to take part in both tasks or only in one, although they were encouraged to participate in both.

3 Data

We chose to gather the data from the social networking site Twitter, due to the free availability of a huge amount of user-generated data and because it allowed us to explore different types of relations among the users involved in a debate.

3.1 Collection and annotation of the data

We collected around 700K tweets written in Italian about the "Movimento delle Sardine" (Sardines movement1), retrieving tweets containing the keywords "sardina" and "sardine" and the homonymous hashtags. Furthermore, we collected all the conversation threads to which each such tweet belongs, iteratively following the reply tree. We also collected the quoted tweets and the list of all the retweets of each previously recovered tweet, obtaining about 1M tweets. Finally, we collected the friend list of all the users included in the annotated dataset.

1 https://en.wikipedia.org/wiki/Sardines_movement.

The tweets were gathered between the 46th week of 2019 (November) and the 5th week of 2020 (January), corresponding to a 12-week time window. Drawing on our experience as participants in previous shared tasks on SD, and in order to reduce noise in the text, we collected data taking into account the following constraints: only one tweet per author for each week, no retweets, no replies, no quotes, no tweets containing URLs, no tweets containing pictures or videos.

Then, we included only Italian tweets posted using a limited number of "sources" (utilities used to post the tweet, such as iOS, Android, etc.), in order to avoid including pre-written tweets posted using a Tweet button.2 Furthermore, we validated that all the collected tweets presented a pairwise Jaccard similarity coefficient < 0.8. From about 25K filtered tweets, we finally randomly selected around 300 tweets for each week (only the first week of 2020 does not reach 300 tweets), thus obtaining 3,600 tweets in total.

2 https://developer.twitter.com/en/docs/twitter-for-websites/tweet-button/overview.

Figure 1: Platform for the annotation of tweets.

We created a web platform for annotation purposes (see Figure 1), in order to facilitate the labelling task for the annotators, unifying the visualization mode and shuffling the tweets in a random order.3 12 native Italian speakers with an interest in news and politics were involved in the annotation, following detailed guidelines we provided, with annotation examples in their native language. We randomly shuffled the annotators and matched them into 66 pairs, each pair annotating 55 tweets. As a result, each annotator labelled 605 tweets independently, and each tweet was annotated by two annotators, who had to choose among four different labels: AGAINST, FAVOUR, NONE/NEUTRAL and OUT OF TOPIC.

3 In this way, each annotator was guaranteed to see emojis – which we believe are essential in order to understand the correct stance – in the same way as the other annotators, independently of the device used.

Furthermore, as can also be seen in Figure 1 (Tonight we are all sardines in Bologna #bolognanonsilega), we asked the annotators to mark whether, in their opinion, the tweet was IRONIC or NOT IRONIC. In the end, we were not able to obtain satisfactory results on this front, so we did not include irony in the task.

3.2 Analysis of the annotation

At the end of the first phase of annotation, which lasted roughly a month, we obtained 2,256 tweets in agreement, with a clear decision on one of the three main classes. Another 917 tweets presented a light disagreement (i.e. FAVOUR vs. NEUTRAL or AGAINST vs. NEUTRAL), and the remaining 457 tweets were discarded because the majority of annotators considered them out of topic or because they were in strong disagreement (i.e. FAVOUR vs. OUT OF TOPIC).

We then proceeded to resolve those 917 tweets whose disagreement was deemed "light", in order to obtain a bigger dataset. We resorted once again to the annotation platform used in the first phase, revised the annotation guidelines, and asked the annotators to label the tweets again. In this phase, we ensured that the tweets in disagreement were not assigned to the same pair of annotators that had previously labelled them, and furthermore we chose to show the two conflicting annotations, along with any comment (if present), to the annotator who had to resolve the disagreement.

After the second phase, we computed the inter-annotator agreement (IAA) through Cohen's kappa coefficient (over the three main classes), resulting in κ = 0.493 (weak agreement). The same coefficient was also used to compute the IAA among annotators over the two most significant classes (AGAINST and FAVOUR, excluding the NEUTRAL class), resulting in a higher score: κ = 0.769 (moderate agreement). Notably, we observed that, in the first phase of the annotation, the IAA changes significantly depending on the observed pair of annotators (it ranges from 0.873 to 0.473). We also noticed that the average IAA of a single annotator, computed over the IAA scores between that annotator and each of the remaining 11 annotators, can change significantly (ranging from 0.704 to 0.609). In other words, some annotators tend to strongly agree with all the others, while others tend to disagree with the majority. As future work,
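The Jaccard-based near-duplicate filter can be sketched in a few lines. The following illustration assumes token-level sets over lowercased whitespace tokens, since the exact tokenization used by the organizers is not specified:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two tweets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def filter_near_duplicates(tweets, threshold=0.8):
    """Keep a tweet only if its similarity to every kept tweet is < threshold."""
    kept = []
    for t in tweets:
        if all(jaccard(t, k) < threshold for k in kept):
            kept.append(t)
    return kept
```

The pairwise check above is quadratic in the number of tweets; on collections of this size (25K candidates), blocking or hashing tricks would normally be added, but the threshold logic is the same.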
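The agreement figures discussed in Section 3.2 rely on Cohen's kappa, which is small enough to sketch from scratch for two annotators (an illustration, not the organizers' actual computation):

```python
from collections import Counter

def cohen_kappa(y1, y2):
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(y1) == len(y2)
    n = len(y1)
    # Observed agreement: fraction of items given identical labels.
    po = sum(a == b for a, b in zip(y1, y2)) / n
    # Expected chance agreement under independent labelling with each
    # annotator's marginal label distribution.
    c1, c2 = Counter(y1), Counter(y2)
    pe = sum(c1[lab] * c2[lab] for lab in c1) / (n * n)
    if pe == 1:
        return 1.0
    return (po - pe) / (1 - pe)
```

For example, two annotators who agree on 4 of 6 items, with marginals A:3/F:2/N:1 and A:4/F:1/N:1, obtain kappa = (4/6 - 15/36) / (1 - 15/36) = 3/7.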
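The pairing scheme is easy to sanity-check: 12 annotators give C(12,2) = 66 unordered pairs, each annotator belongs to 11 of them, and 11 pairs × 55 tweets yields the 605 tweets per annotator. A minimal sketch:

```python
from itertools import combinations

annotators = [f"ann{i}" for i in range(12)]
pairs = list(combinations(annotators, 2))   # all unordered annotator pairs
tweets_per_pair = 55

# Each annotator sits in 11 of the 66 pairs...
per_annotator_pairs = sum(1 for p in pairs if "ann0" in p)
# ...and therefore labels 11 * 55 = 605 tweets independently.
per_annotator_tweets = per_annotator_pairs * tweets_per_pair
```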
we aim to shed more light on this phenomenon by exploring the background of the annotators and the social relationships among them.

3.3 Composition of the dataset

After the second round of annotation, we were finally able to create the official dataset for the SardiStance shared task. It is composed of a total of 3,242 tweets, 1,770 of which belong to the class AGAINST, 785 to the class FAVOUR, and 687 to the class NONE. In Table 1 we show the distribution of these instances across the training set and the test set, and in Table 2 we report an example tweet for each class.

        TRAINING SET                     TEST SET
  AGAINST   FAVOUR   NONE      AGAINST   FAVOUR   NONE
   1,028     589     515         742      196     172
          2,132                         1,110

           Table 1: Distribution of tweets.

  AGAINST: "LE SARDINE IN PIAZZA MAGGIORE NON SONO ITALIANI SE LO
  FOSSERO NON SI METTEREBBERO CONTRO LA DESTRA CHE AMA L'ITALIA E
  VUOLE RIMANERE ITALIANA"
  (THE SARDINES IN PIAZZA MAGGIORE ARE NOT ITALIAN IF THEY WERE THEY
  WOULD NOT GO AGAINST THE RIGHT THAT LOVES ITALY AND WANTS TO
  REMAIN ITALIAN)

  FAVOUR: "Non ci credo che stasera devo andare in teatro e non posso
  essere fra le #Sardine #Bologna #bolognanonsilega"
  (I can't believe that I have to go to the theater tonight and I
  can't be among the #Sardines #Bologna #bolognanonsilega)

  NONE: "Mi sono svegliato nudo e triste perché a Bologna, tra
  salviniani e antisalviniani, non mi ha cagato nessuno."
  (I woke up naked and sad because in Bologna, between Salvinians and
  anti-Salvinians, nobody paid me attention.)

           Table 2: Examples from the dataset.

3.4 Data Release

We shared the data following the methodology recommended in (Rangel and Rosso, 2018), in order to comply with GDPR privacy rules and Twitter's policies. The identifiers of tweets and users have been anonymized and replaced by unique identifiers. Of the location and description fields of the user's biography, we exclusively released the emojis they may contain, in order to make it very hard to trace users and to preserve everybody's privacy.

Task A

The training data (TRAIN.csv) was released in the following format:

    tweet_id   user_id   text   label

where tweet_id is the Twitter ID of the message, user_id is the Twitter ID of the user who posted the message, text is the content of the message, and label is AGAINST, FAVOUR or NONE.

Task B

In order to participate in Task B, we released additional contextual information:

• the file TWEET.csv, containing contextual information regarding the tweet, in the following format:

    tweet_id   user_id   retweet_count   favorite_count   source   created_at

where tweet_id is the Twitter ID of the message, user_id is the Twitter ID of the user who posted the message, retweet_count indicates the number of times the tweet has been retweeted, favorite_count indicates the number of times the tweet has been liked, source indicates the type of posting source (e.g. iOS or Android), and created_at displays the time of creation in yyyy-mm-dd hh:mm:ss format. Minutes and seconds have been masked and set to zero for privacy reasons.

• the file USER.csv, containing contextual information regarding the user, released in the following format:

    user_id   statuses_count   friends_count   followers_count   created_at   emoji

where user_id is the Twitter ID of the user who posted the message, statuses_count indicates the number of tweets ever posted by the user, friends_count indicates the number of friends of the user, followers_count indicates the number of followers of the user, created_at displays the time of the user's registration on Twitter, and emoji shows a list of the emojis in the user's bio (if present; otherwise the field is left empty).

• the files FRIEND.csv, QUOTE.csv, REPLY.csv and RETWEET.csv, containing contextual information about the social network of the user. Each file was released in the following format:

    Source   Target   Weight
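Given the four-column release format described in Section 3.4 (tweet_id, user_id, text, label), class counts like these can be recomputed with the standard library alone. The snippet below uses an inline stand-in for TRAIN.csv and assumes comma separation, as the file extension suggests:

```python
import csv
import io
from collections import Counter

# Inline stand-in for TRAIN.csv; the real file uses the same four columns.
sample = io.StringIO(
    "tweet_id,user_id,text,label\n"
    "1,10,some tweet,AGAINST\n"
    "2,11,another tweet,FAVOUR\n"
    "3,12,a third tweet,AGAINST\n"
)
rows = list(csv.DictReader(sample))
distribution = Counter(row["label"] for row in rows)
```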
where Source and Target indicate two nodes of a social interaction between two Twitter users. More specifically, the source user performs one of the considered social relations towards the target user. Two users are tied by a friend relationship if the source user follows the target user (the friend relationship does not have a weight, because it is either present or absent), while two users are tied by a quote, retweet, or reply relationship if the source user respectively quoted, retweeted, or replied to the target user. Table 4 shows some metrics about the shared networks.

             nodes       edges
  friend     669,817     3,076,281
  retweet    110,315     575,460
  quote      2,903       7,899
  reply      14,268      29,939

           Table 4: Network metrics.

Weight indicates the number of interactions existing between two users. Note that this information is not available for the friend relation (hence, this column was not present in the FRIEND.csv file), since friendship is a relationship of the present/absent type and cannot be described through a weight. In all the files, users are identified by their anonymized User ID.

Regrettably, we did not think to anonymize the screen names contained in the text of the tweets (with the same numeric string used to anonymize users), which would have allowed matching them with the users' IDs and exploring the network based on mentions. We will certainly take this into account in our future work.

4 Evaluation Measures

Each participating team was allowed to submit a maximum of 4 runs for each sub-task: two constrained runs and two unconstrained runs. Submitting at least one constrained run was in any case compulsory. We decided to provide two separate official rankings for Task A and Task B, and two separate rankings for constrained and unconstrained runs. Systems have been evaluated using the F1-score computed over the two main classes (FAVOUR and AGAINST). Therefore, the submissions have been ranked by the F1-score averaged over the two classes, according to the following equation: F1avg = (F1favour + F1against)/2.

4.1 Baselines

We computed a baseline for Task A using a simple machine learning model: a Support Vector Classifier based on token uni-gram features. A second baseline, computed for Task B, is a system based on our previous work on Stance Detection: a Logistic Regression classifier paired with token n-gram features (unigrams, bigrams and trigrams), plus features based on a binary one-hot encoding of the communities extracted from the network of retweets and the network of friends (see the best system for Italian in Lai et al. (2020)).

5 Participants and results

A total of 12 teams, from both academia and industry, participated in at least one of the two tasks of SardiStance. In Table 3 we provide an overview of the teams in alphabetical order.

  team name       institution                          report                          task
  deepreading     UNED, Spain                          (Espinosa et al., 2020)         A, B
  GhostWriter     You Are My Guide, Italy              (Bennici, 2020)                 A, B
  IXA             UPV/EHU, Spain                       (Espinosa et al., 2020)         A, B
  MeSoVe          ISASI, Italy                         -                               A
  QMUL-SDS        QMUL-SDS-EECS, UK                    (Alkhalifa and Zubiaga, 2020)   A, B
  SSN_NLP         CSE Department/SSNCE, India          (Kayalvizhi et al., 2020)       A
  SSNCSE-NLP      SSN College of Engineering, India    (Bharathi et al., 2020)         A, B
  TextWiller      UNIPD, Italy                         (Ferraccioli et al., 2020)      A, B
  UNED            UPV/EHU and UNED, Spain              (Espinosa et al., 2020)         B
  UninaStudents   UNINA, Italy                         (Moraca et al., 2020)           A
  UNITOR          UNIROMA2, Italy                      (Giorgioni et al., 2020)        A
  Venses          UNIVE, Italy                         (Delmonte, 2020)                A

           Table 3: Participants and reports.

Teams were allowed to submit up to four runs (2 constrained and 2 unconstrained) in case they implemented different systems. Furthermore, each team had to submit at least one constrained run. Participants were invited to submit multiple runs to experiment with different models and architectures. However, they have been discouraged from
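The official metric is straightforward to reproduce. The following sketch (an illustration, not the organizers' evaluation script) computes per-class F1 and averages it over FAVOUR and AGAINST only, so that NONE does not enter the ranking score:

```python
def f1_for(label, gold, pred):
    """Precision/recall-based F1 for a single class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def f1_avg(gold, pred):
    """Task metric: mean of F1 over FAVOUR and AGAINST (NONE is ignored)."""
    return (f1_for("FAVOUR", gold, pred) + f1_for("AGAINST", gold, pred)) / 2
```

Note that NONE predictions still influence the score indirectly, since a tweet wrongly pushed into or out of NONE changes the false positives and false negatives of the two scored classes.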
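The Task A baseline can be approximated in a few lines of scikit-learn. This is a sketch with default hyperparameters and toy data, since the paper does not report the exact settings:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Token uni-gram counts feeding a linear Support Vector Classifier.
baseline = make_pipeline(CountVectorizer(ngram_range=(1, 1)), LinearSVC())

# Toy stand-ins for the annotated tweets; the real system is trained
# on the 2,132 tweets of the SardiStance training set.
texts = ["viva le sardine", "abbasso le sardine", "stasera vado a teatro"]
labels = ["FAVOUR", "AGAINST", "NONE"]
baseline.fit(texts, labels)
predictions = baseline.predict(["forza sardine in piazza"])
```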
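The Source/Target/Weight edge lists can be loaded with the standard library alone (a graph library such as networkx would work equally well). The file content below is an illustrative stand-in, with anonymized numeric IDs as in the release:

```python
import csv
import io
from collections import defaultdict

# Inline stand-in for e.g. RETWEET.csv.
sample = io.StringIO(
    "Source,Target,Weight\n"
    "1,2,3\n"
    "1,3,1\n"
    "2,3,2\n"
)
# adjacency[source][target] = number of interactions between the two users
adjacency = defaultdict(dict)
for row in csv.DictReader(sample):
    adjacency[row["Source"]][row["Target"]] = int(row["Weight"])

# Out-degree: number of distinct users each source user interacted with.
out_degree = {u: len(nbrs) for u, nbrs in adjacency.items()}
```

For FRIEND.csv the Weight column is absent, so the value stored per edge would simply be 1 (present) under the same scheme.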
submitting slight variations of the same model. Overall, we received 22 runs for Task A and 13 runs for Task B.

5.1 Task A: Textual Stance Detection

Table 5 shows the results for the textual stance detection task, which attracted 22 total submissions from 11 different teams. Since the only two systems in an unconstrained setting were submitted by the same team, we decided not to create a separate ranking for them, but rather to include them in the same ranking, marking them with a different color (gray in Table 5).

  team name       run        F1-score
                        AVG      AGAINST   FAVOUR   NONE
  UNITOR          1     .6853    .7866     .5840    .3910
  UNITOR          1     .6801    .7881     .5721    .3979
  UNITOR          2     .6793    .7939     .5647    .3672
  DeepReading     1     .6621    .7580     .5663    .4213
  UNITOR          2     .6606    .7689     .5522    .3702
  IXA             1     .6473    .7616     .5330    .3888
  GhostWriter     1     .6257    .7502     .5012    .3810
  IXA             2     .6171    .7543     .4800    .3675
  SSNCSE-NLP      2     .6067    .7723     .4412    .2113
  DeepReading     2     .6004    .6966     .5042    .3916
  GhostWriter     2     .6004    .7224     .4784    .3778
  UninaStudents   1     .5886    .7850     .3922    .2326
  baseline              .5784    .7158     .4409    .2764
  TextWiller      1     .5773    .7755     .3791    .1849
  SSNCSE-NLP      1     .5749    .7307     .4192    .3388
  QMUL-SDS        1     .5595    .7091     .4099    .2313

strong result to beat (F1avg = 0.5784).

5.2 Task B: Contextual Stance Detection

Table 6 shows the results for the contextual stance detection task, which attracted 13 total submissions from 7 different teams.

  team name      run        F1-score
                       AVG      AGAINST   FAVOUR   NONE
  IXA            3     .7445    .8562     .6329    .4214
  TextWiller     1     .7309    .8505     .6114    .2963
  DeepReading    1     .7230    .8368     .6093    .3364
  DeepReading    2     .7222    .8300     .6143    .4251
  TextWiller     2     .7147    .8298     .5995    .3680
  QMUL-SDS       1     .7088    .8267     .5908    .1811
  UNED           2     .6888    .8175     .5600    .2455
  QMUL-SDS       2     .6765    .8134     .5396    .1553
  SSNCSE-NLP     2     .6582    .7915     .5249    .3691
  SSNCSE-NLP     1     .6556    .7914     .5198    .3880
  baseline             .6284    .7672     .4895    .3009
  GhostWriter    1     .6257    .7502     .5012    .3810
  GhostWriter    2     .6004    .7224     .4784    .3778
  UNED           1     .5313    .7399     .3226    .2000

           Table 6: Results Task B.

The best score was achieved by the IXA team, which with a constrained run obtained the highest score of F1avg = 0.7445. The best F1-scores for the main classes AGAINST and FAVOUR were also achieved by the 1st-ranked team, IXA, with F1against = 0.8562 and F1favour = 0.6329, respectively. Once
 QMUL-SDS         2     .5329     .6478    .4181   .3049
 MeSoVe           1     .4989     .7336    .2642   .3118
                                                               again, the Deepreading team, ranking 3rd and
 TextWiller       2     .4715     .6713    .2718   .2884       4th, has obtained the best F1-score for the NONE
 SSN_NLP          1     .4707     .5763    .3651   .3364       class, with F1none = 0.4251.
 SSN_NLP          2     .4473     .6545    .2402   .1913
 Venses           1     .3882     .5325    .2438   .2022          Almost all participating systems show an im-
 Venses           2     .3637     .4564    .2710   .2387       provement over the baseline, which was computed
                                                               using a Logistic Regression classifier paired with
                 Table 5: Results Task A.                      token n-grams features (unigrams, bigrams and tri-
                                                               grams), features based on the network of retweets,
The best results are achieved by the UNITOR team
                                                               and features based on the network of friends (Lai
that, with an unconstrained, ranked as 1st position
                                                               et al., 2020).
with F1avg = 0.6853. The best result for the con-
strained runs is achieved once again by the UNI-
                                                               6     Discussion
TOR team with F1avg = 0.6801.
   The best results for the two main classes                   In this section we compare the participating sys-
AGAINST and FAVOR are obtained by the three                    tems according to the following main dimensions:
best systems of the ranking, which are all submis-             system architecture, features, use of additional an-
sions by the team UNITOR. On the other hand,                   notated data for training, and use of external re-
though, the Deepreading team, ranking as 4th,                  sources (e.g. sentiment lexica, NLP tools, etc.).
has obtained the best F1-score for the NONE class,             We also operate a distinction between runs sub-
with F1none = 0.4213.                                          mitted in Task A and those submitted in Task B.
   Among the 12 participating teams, at least 6                This discussion is based on the participants’ re-
show an improvement over the baseline, which                   ports and the answers the participants provided to
was computed using an SVM paired with token                    a questionnaire proposed by the organizers. Two
unigrams as unique feature, resulting an already               teams, namely TextWiller and Venses wrote a

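The ranking metric reported in the tables can be reproduced directly: F1avg is the mean of the per-class F1-scores of the two main classes AGAINST and FAVOUR only, as the table values confirm (e.g. (0.7866 + 0.5840) / 2 = 0.6853 for the top Task A run, and (0.8562 + 0.6329) / 2 = 0.7445 for the top Task B run). A minimal stdlib-only sketch, with hypothetical gold/predicted label lists:

```python
def f1_per_class(gold, pred, label):
    # Precision, recall and F1 for a single class from raw counts.
    tp = sum(1 for g, p in zip(gold, pred) if g == p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def f1_avg(gold, pred):
    # SardiStance ranking score: mean F1 of AGAINST and FAVOUR only
    # (NONE is reported in the tables but excluded from the average).
    return (f1_per_class(gold, pred, "AGAINST")
            + f1_per_class(gold, pred, "FAVOUR")) / 2
```

Note that a system can rank well on F1avg while performing poorly on NONE, which is visible in several rows of both tables.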
joint report, overlapping between this task and the HaSpeeDe 2 task (Sanguinetti et al., 2020), as they participated in both competitions. The three following teams, DeepReading, IXA, and UNED, also wrote a single joint report, as the participants belong to the same research project and wanted to compare their three different approaches.

6.1  Systems participating in Task A

System architecture. Among all submitted runs we counted a great variety of architectures, ranging from classical machine learning classifiers, to recent state-of-the-art approaches, and statistically-based models. For instance, regarding the use of classical ML, the team UninaStudents used an SVM, and the team MeSoVe used Logistic Regression in one run. Regarding the use of neural networks, the QMUL-SDS team used a bidirectional LSTM, a 2D-CNN, and a bi-LSTM with attention. SSN_NLP also exploited an LSTM neural network.

Four teams exploited different variants of the BERT model: GhostWriter used AlBERTo, trained on Italian tweets; IXA used GilBERTo and UmBERTo,4 while UNITOR adopted only the latter model. Finally, the DeepReading team made use of transformers such as BERT XXL and XLM-RoBERTa, paired with linear classifiers. TextWiller is the only team to have exploited the XGBoost algorithm, and ItVenses relied on supervised models based on statistics and semantics. The UNED team instead proposed a voting system over the outputs of different models.

Features. Besides exploring a variety of system architectures, the teams participating in Task A also used many different textual features, in most cases based on n-grams or char-grams. MeSoVe and TextWiller additionally engineered features based on emoticons. The team UNED, in one of their runs, proposed a system relying on psychological and social features, while UninaStudents proposed features based on unigrams of hashtags. Interestingly, UNITOR added special tags to the texts, which are the result of a classification with respect to some so-called "auxiliary tasks". In particular, they trained three classifiers based respectively on SENTIPOLC 2016 (Barbieri et al., 2016) for sentiment analysis classification, on HaSpeeDe 2018 (Bosco et al., 2018) for hate speech detection, and on IronITA 2018 (Cignarella et al., 2018) for irony detection; they then added three tags to each instance of the SardiStance datasets with respect to these three dimensions: sentiment, hate and irony. ItVenses proposed features collected automatically from a unique dictionary list, the frequency of occurrence of emojis and emoticons, and semantic features investigating the propositional level, factivity and speech act type.

Additional training data. The only team who participated in the unconstrained setting of SardiStance is UNITOR. They proposed two unconstrained runs in addition to two constrained ones. For the unconstrained setting, they downloaded and labeled about 3,200 tweets using distant supervision and used the additional data to train their systems. In particular, they created the following subsets:
- 1,500 AGAINST: tweets from 2019 containing the hashtag #gatticonsalvini;
- 1,000 FAVOUR: tweets from 2019 containing the hashtags #nessunotocchilesardine, #iostoconlesardine, #unmaredisardine, #vivalesardine and #forzasardine;
- 700 NONE/NEUTRAL: texts derived from news titles, retrieved by querying Google News with the keyword "sardine".

Other resources. Five teams declared to have also used other resources, such as lexica, word embeddings, or others. In particular, GhostWriter used a grammar model to rephrase the tweets. MeSoVe exploited SenticNet (Cambria et al., 2014) and the "Nuovo vocabolario di base della lingua italiana".5 QMUL-SDS took advantage of temporal embeddings and FastText, while only one team, UninaStudents, used a sentiment lexicon: AFINN (Nielsen, 2011). Lastly, Venses used a proprietary lexicon of Italian, enriched with conceptual, semantic and syntactic information; similarly, the TextWiller approach relies on a self-created vocabulary and on word embeddings trained on the PAISÀ corpus (Lyding et al., 2014).

6.2  Systems participating in Task B

Seven teams participated in Task B, submitting a total of 13 runs. Most teams extensively explored the additional features available for Task B; GhostWriter, on the contrary, proposed the same

4 https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1
5 https://dizionario.internazionale.it

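The token n-gram features on which most Task A systems and both baselines rely (token unigrams for the Task A baseline; unigrams, bigrams and trigrams for the Task B baseline) can be sketched in a few lines of stdlib Python. The whitespace tokenizer below is a deliberately naive placeholder; real systems would use a tweet-aware tokenizer handling hashtags, mentions and emojis:

```python
from collections import Counter

def token_ngrams(text, n_max=3):
    # Count all token n-grams of length 1..n_max in a tweet.
    # Naive lowercase whitespace tokenization (placeholder).
    tokens = text.lower().split()
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats
```

These counts would then feed a linear classifier (an SVM for the Task A baseline, Logistic Regression for the Task B baseline).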

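UNITOR's distant-supervision step amounts to assigning a stance label from the partisan hashtags a tweet contains. A minimal sketch, with the hashtag lists taken from the paper; the handling of tweets matching both sides (discarded as ambiguous) and the punctuation stripping are our assumptions, not details from the original description:

```python
FAVOUR_TAGS = {"#nessunotocchilesardine", "#iostoconlesardine",
               "#unmaredisardine", "#vivalesardine", "#forzasardine"}
AGAINST_TAGS = {"#gatticonsalvini"}

def distant_label(text):
    # Label a tweet by the partisan hashtags it contains; tweets
    # matching both sides are discarded as ambiguous (an assumption).
    toks = {t.lower().rstrip(".,!?") for t in text.split()}
    fav = bool(toks & FAVOUR_TAGS)
    agn = bool(toks & AGAINST_TAGS)
    if fav and not agn:
        return "FAVOUR"
    if agn and not fav:
        return "AGAINST"
    return None  # ambiguous or unlabeled
```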
two approaches presented in Task A. Notably, the three runs with a score lower than the baseline did not benefit from any features based on the users' social network.

System architecture. Most teams enriched the models they submitted in Task A by taking advantage of the contextual information available in Task B. UNED, DeepReading, and TextWiller exploited the XGBoost algorithm, selecting different features from the contextual data. The language model BERT was used in different variants by SSNCSE-NLP, DeepReading, and IXA. In particular, the last two teams proposed three voting-based ensemble methods that use two or more models exploiting the XGBoost algorithm. Furthermore, the neural network framework proposed by QMUL-SDS combines four different embedding methods into a dense layer, generating the final label using a softmax activation function.

Features. Not every team took full advantage of the contextual information. For example, SSNCSE-NLP only exploits the number of friends in run 1, and the number of quotes and friends in run 2. In its run 1, UNED also exploited some features based on the tweets, in addition to the psychological and emotional ones, using the XGBoost algorithm. The other teams exploited different approaches for learning vector representations of the nodes of the available networks. DeepReading, IXA, and UNED proposed a feature that computes the mean distance of each user to the rest of the users whose stance is known. TextWiller experimented with multi-dimensional scaling (MDS), retaining the first and second dimensions for each of the four networks instantiated. Node2vec and DeepWalk were used for learning a vector representation of the nodes of the networks in QMUL-SDS's runs 1 and 2, respectively.

The comparison between the approaches used for dealing with Task A and Task B clearly highlights the benefits of exploiting information from different and heterogeneous sources. In particular, it is interesting to observe that all the teams that participated in both tasks also produced better results in the second setting. Experimenting with different classifiers trained on the textual content of the tweets as well as on features based on contextual information (additional information on the tweets, on the users, or on their social networks) therefore seems to lead to overall better results.

In particular, among the 6 teams that participated in both tasks, only 4 fully explored the social network relations of the author of the tweet. The only two runs that beat the baseline without investigating the structure of the social graphs are those submitted by the SSNCSE-NLP team. Only one team participated in both tasks exploiting the same architecture. This allowed us to compare the F1-scores obtained in the first setting with those obtained in the second, highlighting that adding contextual features could increase performance by +0.2432 in terms of F1avg.

Additionally, we calculated the increase in performance between the score obtained by the run ranked in 1st position in Task A (UNITOR, F1avg = 0.6853) and the score of the run ranked in 1st position in Task B (IXA, F1avg = 0.7445), showing that taking advantage of contextual features could increase performance by up to 8.6% in terms of F1avg.

7  Conclusions

We presented the first shared task on Stance Detection for Italian, discussing the development of the datasets used and the participation. A broad forum for discussion of techniques and state-of-the-art approaches has been opened, which can be used to investigate future research directions.

Acknowledgments

The work of C. Bosco, M. Lai and V. Patti is partially funded by the project "Be Positive!" (under the 2019 "Google.org Impact Challenge on Safety" call). The work of C. Bosco and V. Patti is also partially funded by Progetto di Ateneo/CSP 2016 Immigrants, Hate and Prejudice in Social Media (S1618_L2_BOSC_01). The work of P. Rosso is partially funded by the Spanish MICINN under the research projects MISMIS-FAKEnHATE on Misinformation and Miscommunication in social media: FAKE news and HATE speech (PGC2018-096212-B-C31) and PROMETEO/2019/121 (DeepPattern) of the Generalitat Valenciana.

A special mention also to the people who helped us with the annotation of the dataset. In random order: Matteo, Luca, Ylenia, Simona, Elisa, Sebastiano, Francesca, Simona, Komal and Angela, thank you very much for your great help.

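The mean-distance feature proposed by DeepReading, IXA, and UNED can be sketched as a breadth-first search over one of the available graphs (e.g. the friendship network), averaging the hop distance from a user to all users whose stance is known. Skipping unreachable users, as below, is one possible design choice, not necessarily the teams' own:

```python
from collections import deque

def bfs_distances(adj, source):
    # Shortest-path (hop) distances from `source` in an undirected
    # graph given as {node: set(neighbours)}.
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def mean_distance_to_known(adj, user, known_users):
    # Average distance from `user` to the users with a known stance;
    # unreachable users are skipped (an assumption of this sketch).
    dist = bfs_distances(adj, user)
    reached = [dist[u] for u in known_users if u in dist and u != user]
    return sum(reached) / len(reached) if reached else float("inf")
```

The resulting scalar (one per network) can then be appended to the textual feature vector of the corresponding tweet's author.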
References

Rabab Alkhalifa and Arkaitz Zubiaga. 2020. QMUL-SDS @ SardiStance: Leveraging Network Interactions to Boost Performance on Stance Detection using Knowledge Graphs. In Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). CEUR-WS.org.

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the EVALITA 2016 SENTIment POLarity Classification task. In Proceedings of the 5th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2016). CEUR-WS.org.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR-WS.org.

Mauro Bennici. 2020. ghostwriter19 @ SardiStance: Generating new tweets to classify SardiStance EVALITA 2020 political tweets. In Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). CEUR-WS.org.

B. Bharathi, J. Bhuvana, and Nitin Nikamanth Appiah Balaji. 2020. SardiStance@EVALITA2020: Textual and Contextual stance detection from Tweets using machine learning approach. In Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). CEUR-WS.org.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 Hate Speech Detection Task. In Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2018). CEUR-WS.org.

Erik Cambria, Daniel Olsher, and Dheeraj Rajagopal. 2014. SenticNet 3: A Common and Common-sense Knowledge Base for Cognition-driven Sentiment Analysis. In Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI 2014).

Alessandra Teresa Cignarella, Simona Frenda, Valerio Basile, Cristina Bosco, Viviana Patti, and Paolo Rosso. 2018. Overview of the EVALITA 2018 task on Irony Detection in Italian Tweets (IronITA). In Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2018). CEUR-WS.org.

Marco Del Tredici, Diego Marcheggiani, Sabine Schulte im Walde, and Raquel Fernández. 2019. You Shall Know a User by the Company It Keeps: Dynamic Representations for Social Media Users in NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019). ACL.

Rodolfo Delmonte. 2020. Venses @ HaSpeeDe2 & SardiStance: Multilevel Deep Linguistically Based Supervised Approach to Classification. In Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). CEUR-WS.org.

Maria S. Espinosa, Rodrigo Agerri, Alvaro Rodrigo, and Roberto Centeno. 2020. DeepReading @ SardiStance: Combining Textual, Social and Emotional Features. In Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). CEUR-WS.org.

Federico Ferraccioli, Andrea Sciandra, Mattia Da Pont, Paolo Girardi, Dario Solari, and Livio Finos. 2020. TextWiller @ SardiStance, HaSpeede2: Text or Con-text? A smart use of social network data in predicting polarization. In Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). CEUR-WS.org.

Simone Giorgioni, Marcello Politi, Samir Salman, Danilo Croce, and Roberto Basili. 2020. UNITOR@Sardistance2020: Combining Transformer-based architectures and Transfer Learning for robust Stance Detection. In Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). CEUR-WS.org.

S. Kayalvizhi, D. Thenmozhi, and Chandrabose Aravindan. 2020. SSN_NLP@SardiStance: Stance Detection from Italian Tweets using RNN and Transformers. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). CEUR-WS.org.

Dilek Küçük and Fazli Can. 2020. Stance detection: A survey. ACM Computing Surveys, 53(1):1–37.

Mirko Lai, Viviana Patti, Giancarlo Ruffo, and Paolo Rosso. 2018. Stance evolution and Twitter interactions in an Italian political debate. In Proceedings of the 23rd International Conference on Natural Language & Information Systems (NLDB 2018). Springer.

Mirko Lai, Marcella Tambuscio, Viviana Patti, Giancarlo Ruffo, and Paolo Rosso. 2019. Stance polarity in political debates: A diachronic perspective of network homophily and conversations on Twitter. Data & Knowledge Engineering, 124:101738.

Mirko Lai, Alessandra Teresa Cignarella, Delia Irazú Hernández Farías, Cristina Bosco, Viviana Patti, and Paolo Rosso. 2020. Multilingual stance detection in social media political debates. Computer Speech & Language, 63:101075.

Verena Lyding, Egon Stemle, Claudia Borghetti, Marco Brunello, Sara Castagnoli, Felice Dell'Orletta, Henrik Dittmann, Alessandro Lenci, and Vito Pirrelli. 2014. The PAISÀ Corpus of Italian Web Texts. In Proceedings of the 9th Web as Corpus Workshop (WaC-9) @ the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014). ACL.

Walid Magdy, Kareem Darwish, Norah Abokhodair, Afshin Rahimi, and Timothy Baldwin. 2016. #isisisnotislam or #deportallmuslims?: Predicting unspoken views. In Proceedings of the 8th ACM Conference on Web Science (WebSci 2016). ACM.

Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. 2016a. A Dataset for Detecting Stance in Tweets. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). ELRA.

Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. 2016b. SemEval-2016 Task 6: Detecting Stance in Tweets. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). ACL.

Maurizio Moraca, Gianluca Sabella, and Simone Morra. 2020. UninaStudents @ SardiStance: Stance detection in Italian tweets - Task A. In Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). CEUR-WS.org.

Finn Årup Nielsen. 2011. AFINN. Richard Petersens Plads, Building 321.

Ashwin Rajadesingan and Huan Liu. 2014. Identifying users with opposing opinions in Twitter debates. In Proceedings of the 7th Social Computing, Behavioral-Cultural Modeling and Prediction International Conference (SBP-BRiMS 2014). Springer.

Francisco Rangel and Paolo Rosso. 2018. On the implications of the general data protection regulation on the organisation of evaluation tasks. Language and Law / Linguagem e Direito, 5(2):95–117.

Manuela Sanguinetti, Gloria Comandini, Elisa Di Nuovo, Simona Frenda, Marco Stranisci, Cristina Bosco, Tommaso Caselli, Viviana Patti, and Irene Russo. 2020. HaSpeeDe 2@EVALITA2020: Overview of the EVALITA 2020 Hate Speech Detection Task. In Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). CEUR-WS.org.

Mariona Taulé, M. Antònia Martí, Francisco M. Rangel Pardo, Paolo Rosso, Cristina Bosco, and Viviana Patti. 2017. Overview of the Task on Stance and Gender Detection in Tweets on Catalan Independence. In Proceedings of the 2nd Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017) co-located with the 33rd Conference of the Spanish Society for Natural Language Processing (SEPLN 2017). CEUR-WS.org.

Mariona Taulé, Francisco M. Rangel Pardo, M. Antònia Martí, and Paolo Rosso. 2018. Overview of the Task on Multimodal Stance Detection in Tweets on Catalan #1Oct Referendum. In Proceedings of the 3rd Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018) co-located with the 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2018). CEUR-WS.org.

Jannis Vamvas and Rico Sennrich. 2020. X-Stance: A Multilingual Multi-Target Dataset for Stance Detection. In Proceedings of the 5th Swiss Text Analytics Conference (SwissText 2020) & 16th Conference on Natural Language Processing (KONVENS 2020). CEUR-WS.org.