                                Inferring Political Leaning on X (Twitter): A Zero-Shot
                                Approach in an Italian Scenario
                                Caterina Senette1,∗ , Margherita Gambini1 , Tiziano Fagni1 , Victoria Popa1,2 and
                                Maurizio Tesconi1
                                1
                                    Institute of Informatics and Telematics (IIT) - CNR, Via Giuseppe Moruzzi, 1 56124 Pisa – Italy
                                2
Università di Pisa, Department of Computer Science, Largo Bruno Pontecorvo, 3, 56127 Pisa


                                               Abstract
                                               In recent years, there has been growing attention on predicting the political orientation of active
                                               social media users, aiding in political forecasts, modeling opinion dynamics, and understanding user
polarization. Existing methods, primarily for X (Twitter) users, rely on content-based analysis or a blend of content, network, and communication analysis. Recent research highlights that a user's political stance mainly hinges on their views on key political and social issues, prompting a shift towards detecting user stances from the content they share on social networks. This work investigates the use of Tweets2Stance (T2S), an unsupervised stance-detection framework based on zero-shot classification (ZSC) models [1], to
                                               predict users’ stances toward a set of social-political statements using content-based analysis of their X
                                               (Twitter) timelines in an Italian scenario. The ground-truth user stances are drawn from Voting Advice
                                               Applications (VAAs), tools aiding citizens in identifying their political leanings by comparing their
                                               preferences with party stances. Leveraging the agreement levels of six parties on 20 statements from
                                               VAAs, the study aims to predict Party p’s stance on each statement s using X (Twitter) Party account data.
T2S, employing zero-shot learning, is potentially applicable to various contexts beyond politics; it achieves a minimum MAE of 1.13 despite an overall maximum F1 of 0.4, a notable result given the complexity of the task.

                                               Keywords
                                               user stance detection, Zero-shot learning, unsupervised ML, political leaning, X (Twitter), VAA




                                1. Introduction
During the last few years, there has been growing attention toward social media, both for what is explicitly shared among users (content, thoughts, and behavior) and for what is hidden and latent. Among this latent information, the user's stance, i.e. the expression of a user's point of view and perception toward a given statement [2], is particularly interesting; in fact, stance detection on social media is an emerging opinion-mining paradigm that applies well to different social and political contexts, and for which many researchers are proposing solutions ranging across natural language processing, web science, and social computing [3, 4, 5, 6, 7, 8, 9, 10]. Some works [3, 11] dealt with stance detection at the user level; however,

                                ITASEC 2024: The Italian Conference on CyberSecurity, April 08–12, 2024, Salerno, Italy
                                ∗
                                    Corresponding author.
caterina.senette@iit.cnr.it (C. Senette); margherita.gambini@iit.cnr.it (M. Gambini); tiziano.fagni@iit.cnr.it (T. Fagni); victoria.popa@iit.cnr.it (V. Popa); maurizio.tesconi@iit.cnr.it (M. Tesconi)
ORCID: 0000-0002-4411-7134 (C. Senette); 0000-0003-0640-2724 (M. Gambini); 0000-0003-1921-7456 (T. Fagni); 0000-0001-8228-7807 (M. Tesconi)
                                             © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




to the best of our knowledge, a completely unsupervised technique exploiting users' textual content only has never been explored. Hence, the work herein described investigates the use of an unsupervised stance-detection framework based on zero-shot learning models, previously introduced by us [12, 1] and named Tweets2Stance (T2S), to detect the stance of an X (Twitter) account from its timeline in an Italian scenario. The idea for this framework stems from observing how Voting Advice Applications (VAAs) work. Voting Advice Applications, originally developed in the 1980s as paper-and-pencil civic education questionnaires [13], are online tools that help citizens, mainly before elections, identify their political leaning by comparing their policy preferences with the political stances of parties or candidates running for office. VAAs are widespread in many countries and play a crucial role in online election campaigns worldwide. Basically, the user marks their position on a range of policy statements. The application compares the individual's answers to the positions of each Party or candidate and generates a rank-ordered list or a graph indicating which Party or candidate is located closest to the user's policy preferences. One of the crucial elements of a VAA is the questionnaire: the selection of the statements, their balance among the political poles, and their phrasing affect both the way in which users respond and the overall users' engagement with the poll itself. For these reasons, the VAA's issued statements should cover the spectrum of the most important topics of an election campaign and adequately expose crucial differences among all the competitors in the political scenario for which the VAA is designed [14]. This careful definition of the questionnaire, i.e. one that takes into high consideration the main topics under discussion at a certain time, suggested to us the possibility of using the official positions of Italian parties on specific political statements (during a certain election period) as the ground-truth, and of inferring those stances from the timelines of the X (Twitter) Party accounts in a completely unsupervised way1; notice that only tweets written during the pre-election period are considered.

Objectives Starting from the knowledge of the agreement level of six parties on 20 different statements (VAA statements), the objective of the study is to predict the stance of a Party p toward each statement s by exploiting what the X (Twitter) Party account wrote on X (Twitter). Unlike previous works in the literature [3], our classification model is built for different topics, and we propose a fine-grained stance-detection solution working over five classes that could be generalized to various spheres, not just the political one.


2. Related Work
Stance detection is an emerging opinion-mining paradigm that applies well to several social and political scenarios. The state of the art, summarized in a highly valuable survey [3], highlights the importance of categorization, since stance detection can be classified according to the target (single, multi-related, or claim-based), according to the task type (detection or prediction), or by distinguishing between stance at the user level and at the statement level. At the statement level [17, 18],

1
    The Italian Parties' official positions on 20 political statements were kindly provided by the Observatory on Political Parties and Representation [15], based on the VAA NavigatoreElettorale for the European Elections 2019 [16].
whose objective is to predict the stance expressed in a piece of text, previous research works are mainly based on Natural Language Processing (NLP) methods and classification tasks with three classes (support/against/none). At the user level, instead, the objective is to predict the stance of a user toward a given topic, and prediction solutions generally incorporate different user attributes along with the text of their posts. Our work falls under the category of stance-detection tasks at the user level, specifically focusing on target-specific stances, a common approach in social media stance detection. This involves predicting stances on specific topics, often using separate classification models for each topic. Notable approaches [19, 4, 5, 20] utilize post text along with various user attributes, typically employing binary classification (support and against). Lynn et al. [11] compared user-level features alone against document-level features in predicting tweet stances without the tweets themselves, highlighting the importance of integrating user features into predictive systems. Other target-specific strategies in the literature were conducted at the statement level [6, 7, 8, 21]. In [6] the approach was conducted at the statement level through unsupervised methods, and classification was made along three positions (favour, against, neither). In [7] a stance-detection shared task is introduced, where teams inferred three-level tweet stances using natural language systems: for, against, or neutral towards the given target. Divided into supervised (Task A) and unsupervised (Task B) sub-tasks, it received 19 and 9 team submissions respectively. The highest F-score reached was 67.82 for Task A and 56.28 for Task B. As mentioned above, target-specific approaches can consider single or multiple targets. Usually, the concept of multi-target classification has been used to analyse the relation between two political candidates by using domain knowledge about these targets. In that case, the same model can be applied to different targets on the hypothesis that a piece of text containing the stance in favour of one target also implicitly contains the stance against the other [22, 9, 23]. Our method handles a broad multi-target classification task, where each statement represents a specific target. Unlike previous methods, it operates without the need for pre-selected texts or distinct models for each target.

2.1. Machine Learning (ML) approaches
Among ML features for stance detection, the literature distinguishes between linguistic features, which reveal stance from text-linguistic cues [24, 7], and users' vocabulary, i.e. their choice of words [10, 25]. Since textual cues may refer to textual features, sentiment, and semantics alike, we limit our attention to textual features. In this context, the most-used ML approaches are based on supervised techniques [19, 5, 23, 18, 26]. Some works attempted to enrich dataset entities by applying unconstrained supervised methods such as transfer learning, weak supervision, and distant supervision for stance detection [6, 4]. Other innovative approaches propose unsupervised learning strategies [10, 27, 28] exploiting clustering techniques and embedding representations of users' tweets [29]. The limitations across these studies include: (i) time-intensive data collection and analysis, particularly with network-based approaches; (ii) challenges in accessing or retrieving the necessary data due to stringent social media data-protection policies; (iii) most models are limited to two or three stance classes at most; (iv) reliance on supervised or semi-supervised models, which require large datasets and have limited generalizability tied to training sets [30]. For all these reasons, the recent challenge for user-level and target-specific stance detection is to move towards unsupervised systems exploiting textual content only. To this aim, a ZSL technique exploiting advanced pre-trained Natural Language Inference (NLI) models [24, 31] can be a viable solution, as our T2S framework demonstrates.


3. Task Definition
The task is to predict the stance $A_s^u$ of a Social Media User $u$ with respect to a social-political statement (or sentence) $s$, making use of the User's textual content timeline on the considered social media (e.g., the X (Twitter) timeline). The stance $A_s^u$ is a five-level categorical label: completely agree (5), agree (4), neither disagree nor agree (3), disagree (2), completely disagree (1). The integer mappings used by the Tweets2Stance framework are shown in parentheses.
   The desired ground-truth is the label $G_s^u$, i.e. the known agreement/disagreement level of User $u$ with regard to sentence $s$. Recall that the ground-truth is only used to evaluate our proposed Tweets2Stance framework and find its optimal parameters; no training step ever occurs. In this work, we assume that users are the X (Twitter) accounts of six Italian Parties, as the following section will detail.


4. Data collection and Pre-processing
The political scenario under analysis refers to the European and Municipal elections held in Italy on 26th May 2019, when Italian citizens were called to elect the Italian representatives to the European Parliament. The number of Members of the European Parliament (751 deputies in total) for each country is approximately proportional to its population; in 2019, Italy had to elect 76 deputies. At the same time, Italian voters also had to participate in the municipal elections of mayors and of municipal and district councillors (in about 3800 Italian municipalities), with a planned run-off on 9th June 2019. In that context, we focused our attention on the six major parties in Italy: three center-right parties, namely Forza Italia (FI), Fratelli d'Italia (FDI), and Lega; two left-wing parties, namely Partito Democratico (PD) and +Europa (+Eu)2; and the Movimento 5 Stelle (M5S), representing a sort of third pole at that time. The Italian parliament included other minor parties, especially on the left wing, each representing less than 5% of the Italian population. We did not consider these parties in the current study. As previously said, we started
from the assumption that, knowing the parties' answers to the VAA statements, it is possible to predict the stance of a Party $p$ with regard to each statement $s$ by exploiting what the Party wrote on X (Twitter). The definition of the 20 statements (Table 3 in Appendix A) expressing the political positions of the six referenced parties on selected themes under discussion in Italy and in Europe in 2019 was entrusted to a group of political experts [15, 16], who provided us with the ground-truth $G_s^p$ for each Party $p$ and statement $s$, on which the current work is based. At first, we collected the timelines of the official X (Twitter) account of each party using the official X (Twitter) API3. Considering the speed with which political discussion nowadays takes place, especially on social media, the observation period was chosen so as to maximize the number of tweets while avoiding noise and off-topic content. Furthermore, to intercept any
2
+Europa was founded in 2018 and is characterized by a pro-European and liberal orientation.
3
https://developer.twitter.com/en/docs
valuable information or discussion trends over time, we extended the analysis to four temporal ranges and built the associated datasets4, as described in Table 1.

Table 1
The four studied datasets with the total number of tweets before the pre-processing step. D_j contains j months of tweets.

            D3                          D4                          D5                          D7
Period      [2019-03-01, 2019-05-25]    [2019-02-01, 2019-05-25]    [2019-01-01, 2019-05-25]    [2018-11-01, 2019-05-25]
#tweets     20'266                      25'979                      34'736                      44'370

   As a preliminary step, since the text collected from tweets contains a lot of noise and irrelevant information, we pre-processed the tweets in order to remove anything without predictive significance: URLs; the "RT @user:" prefix of retweets; mentions at the beginning of a reply tweet; tweets with three words or fewer and empty tweets; hashtags and emojis (replaced with the empty string). Lastly, since we wanted to test our prediction approach on English tweets as well, we further translated the Italian tweets using the google_trans_new5 Python package.
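To make the cleaning rules concrete, the following is a minimal Python sketch of this pre-processing step; the exact patterns used in our pipeline are not reported here, so every regex below should be read as an assumption.

```python
import re

def preprocess_tweet(text: str) -> str | None:
    """Apply the cleaning rules described above; return None if the tweet is discarded."""
    text = re.sub(r"https?://\S+", "", text)             # drop URLs
    text = re.sub(r"^RT @\w+: ", "", text)               # drop the retweet prefix
    text = re.sub(r"^(@\w+\s+)+", "", text)              # drop leading reply mentions
    text = re.sub(r"#\w+", "", text)                     # drop hashtags
    text = re.sub(r"[\U0001F300-\U0001FAFF]", "", text)  # drop (most) emojis
    text = re.sub(r"\s+", " ", text).strip()
    # discard empty tweets and tweets with three words or fewer
    return text if len(text.split()) > 3 else None

tweets = ["RT @party: Vota! https://t.co/x",
          "L'Italia deve investire di più nella scuola pubblica #elezioni"]
cleaned = [t for t in (preprocess_tweet(tw) for tw in tweets) if t]
```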


5. Framework Design
This section briefly describes the proposed Tweets2Stance (T2S) framework (Fig. 1) to detect the stance $A_s^u$ of an X (Twitter) User $u$ with regard to a sentence $s$, exploiting its X (Twitter) timeline $TL^u = [tw_1, ..., tw_n]$. More details on the framework are provided in a previous work, where we introduced it extensively [1].
   A User might either not talk about a specific political argument (here expressed by sentence $s$), or debate an issue not raised by our pre-defined set of statements. For these reasons, our framework executes a preliminary Topic Filtering step, exploiting a Zero-Shot Classifier (ZSC) to keep only those tweets talking about the topic $tp_s$ of the sentence $s$. A ZSC is a language-model-based method that, given a text and a set of labels (e.g., topics), assigns a classification probability score to each label [21]. The higher the score assigned to a label, the higher the likelihood that the input text pertains to that specific label. A ZSC does not require further fine-tuning on the target dataset. After obtaining the in-topic tweets $I_{tp_s}^u$ through Topic Filtering, the Agreement Detector module employs the same ZSC to detect the user's agreement/disagreement level. In Fig. 1 we use colour codes to identify the four parameters of the T2S framework that we vary during our experiments, as explained in Section 6.
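To make the two ZSC steps concrete, here is a minimal sketch using the Hugging Face transformers zero-shot pipeline. The checkpoint facebook/bart-large-mnli is an assumption (a standard English NLI model in the BART family; the exact checkpoints used in T2S are not named here), and the topic, sentence, and tweet strings are illustrative.

```python
from transformers import pipeline

# Assumed checkpoint, not necessarily the one used in T2S.
zsc = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

topic = "income support for low-income citizens"                       # topic tp_s
sentence = "income support for the poorest is good for the economy"    # sentence s
tweet = "We will fight for a basic income that protects low-income families."

th = 0.6  # topic-filtering threshold

# Topic Filtering: multi_label=True scores each label independently in [0, 1]
topic_score = zsc(tweet, candidate_labels=[topic], multi_label=True)["scores"][0]

if topic_score >= th:
    # Agreement Detector: the same ZSC scores the tweet against sentence s,
    # yielding the per-tweet score s_i consumed by the mapping functions below
    s_i = zsc(tweet, candidate_labels=[sentence], multi_label=True)["scores"][0]
    print(f"topic score t_i = {topic_score:.2f}, sentence score s_i = {s_i:.2f}")
```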

Topic Filtering The Topic Filtering module extracts the in-topic tweets $I_{tp_s}^p$ from the X (Twitter) timeline $TL^p$ of Party $p$, using the topic $tp_s$ associated with sentence $s$ (e.g., the topic for the sentence "overall, membership in the EU has been a bad thing for the UK" can be "UK membership in EU"). The topic definitions for all considered sentences can be found in the linked repository.
4
    The four raw datasets can be found at https://github.com/marghe943/Tweets2Stance_dataset
5
    https://pypi.org/project/google-trans-new/
[Figure 1 (diagram): the Tweets2Stance pipeline. A User Timeline TL^u, extracted from a dataset D_i (D3, D4, D5, or D7), enters the Topic Filtering module, which relies on a zero-shot classifier C built on a language model LM (XRoberta1, XRoberta2, or BART) with a filtering threshold th (0.5, 0.6, 0.7, 0.8, or 0.9). The filtered tweets, together with the sentence s, feed the Agreement Predictor, which applies an algorithm Alg (Alg1, Alg2, Alg3, or Alg4) to output the agreement label.]

Figure 1: Our Tweets2Stance framework to compute the agreement/disagreement level $A_s^u$ of User $u$ with regard to sentence $s$. The inputs are the X (Twitter) timeline $TL^u$ extracted from a certain time-period dataset $D_i$, the sentence $s$, the topic $tp_s$ associated with $s$, a language model $LM$, a threshold $th$, and an algorithm $Alg$. The highlighted components are the parameters that we vary during our experiments, as explained in Section 6.


The module utilizes the ZSC $C$ to retrieve the in-topic tweets $I_{tp_s}^p$ and their corresponding topic scores $T_{tp_s}^p$.

Agreement Detector The Agreement Detector module (Fig. 1, Module 2) computes the final five-valued label $A_s^p$ through an algorithm $Alg(T_{tp_s}^p, S_s^p)$, defining

$S_s^p = \{\, C(tw_i, s) \mid tw_i \in I_{tp_s}^p \,\}$   (1)

as the $C$ scores of the tweets $I_{tp_s}^p$ with respect to sentence $s$, each one indicating the relevance and agreement of tweet $tw_i$ with sentence $s$.
   Each employed algorithm $Alg$ exploits one of the following mapping functions:
   Each employed algorithm 𝐴𝑙𝑔 exploits one of the following mapping functions:
$M1(s) = \begin{cases} 1 & \text{if } s \in [0, 0.2) \\ 2 & \text{if } s \in [0.2, 0.4) \\ 3 & \text{if } s \in [0.4, 0.6) \\ 4 & \text{if } s \in [0.6, 0.8) \\ 5 & \text{if } s \in [0.8, 1] \end{cases}$   (2)

$M2(s) = \begin{cases} 1 & \text{if } s \in [0, 0.25) \\ 2 & \text{if } s \in [0.25, 0.5) \\ 3 & \text{if } s \in [0.5, 0.75) \\ 4 & \text{if } s \in [0.75, 1] \end{cases}$   (3)
   where $M1(s)$ ranges from 1 to 5, corresponding to the five agreement/disagreement labels defined in Section 3. Similarly, $M2(s)$ ranges from 1 to 4, representing an intermediate agreement/disagreement scale. Specifically, $M2(s) \in \{1, 2\}$ has the same meaning as in Section 3, while $M2(s) = 3$ indicates agreement and $M2(s) = 4$ represents complete agreement. The rationale behind this intermediate mapping is explained in Algorithm 4 [1].
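Both mapping functions are plain interval lookups; the sketch below is a direct transcription of Eqs. (2) and (3), using explicit bound checks to avoid floating-point division surprises.

```python
def M1(s: float) -> int:
    # Eq. (2): map a score in [0, 1] to the five-level agreement label
    for label, bound in enumerate([0.2, 0.4, 0.6, 0.8], start=1):
        if s < bound:
            return label
    return 5

def M2(s: float) -> int:
    # Eq. (3): map a score in [0, 1] to the intermediate four-level scale
    for label, bound in enumerate([0.25, 0.5, 0.75], start=1):
        if s < bound:
            return label
    return 4

assert M1(0.55) == 3 and M1(0.8) == 5
assert M2(0.5) == 3 and M2(0.75) == 4
```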
   We defined four algorithms with different complexity levels; details of each are provided in Appendix B and in the already mentioned work [1].
6. Experimental Setup
6.1. Baselines
It is good practice to compare the proposed method against a set of baselines. To the best of our knowledge, no baseline method has yet been devised for our typology of stance-detection task: unlike our approach, the state-of-the-art unsupervised user-stance detection method proposed by Darwish et al. [10] cannot operate without context information from other users, and it is not suitable for a multi-class ordinal classification like ours. Therefore, the following baselines to compute $A_s^p$ for Party $p$ and sentence $s$ were used:

Random $A_s^p$ is set to a random integer picked from a discrete uniform distribution over the integers in [1, 5]. The numpy random method6 was used with the random seed set to 42.

Predict 3 $A_s^p$ is set to 3 (neither disagree, nor agree).
Sentence Bert The newest Transformer-based language models like BERT can be used as fea-
     ture extractors [32], providing contextual word and sentence embeddings. The Sentence-
     Bert architecture of the Sentence Transformers Python library7 was used with the English
     all-mpnet-base-v2 model on translated tweets, and with the multi-lingual model distiluse-
     base-multilingual-cased-v1 on the Italian tweets.
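A minimal sketch of the three baselines follows. The random and predict-3 baselines mirror the definitions above; for the Sentence-Bert baseline, the conversion from mean cosine similarity to a five-level label (here via the M1 function from the earlier sketch) is an assumption, since it is not spelled out above.

```python
import numpy as np
from sentence_transformers import SentenceTransformer, util

np.random.seed(42)  # seed used for the random baseline

def random_baseline() -> int:
    return int(np.random.randint(1, 6))  # uniform over {1, ..., 5}

def predict3_baseline() -> int:
    return 3  # always "neither disagree, nor agree"

# Sentence-Bert baseline on translated tweets; the label mapping via M1 is
# an assumption (see the M1/M2 sketch above).
model = SentenceTransformer("all-mpnet-base-v2")

def sentence_bert_baseline(tweets: list[str], sentence: str) -> int:
    tweet_emb = model.encode(tweets, convert_to_tensor=True)
    sent_emb = model.encode(sentence, convert_to_tensor=True)
    mean_sim = float(util.cos_sim(tweet_emb, sent_emb).mean().clamp(0, 1))
    return M1(mean_sim)  # M1 defined in the earlier sketch
```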

6.2. Experiments in detail
As already explained in Section 5, our T2S method has four parameters to tune: the language model $LM$ used for zero-shot classification, the dataset $D_i$ from which to extract the X (Twitter) timeline $TL^p$, the algorithm $Alg$ for the Agreement step, and the threshold value $th$ for the Topic Filtering step. Considering the values of those parameters in Fig. 1, we carried out each experiment having in mind the four research questions summarized in Table 2, ordered by specificity.
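Since no training occurs, the experiments amount to a grid search over inference settings; the grid below enumerates the parameter values shown in Fig. 1.

```python
from itertools import product

# The parameter grid explored in our experiments (values from Fig. 1).
language_models = ["XRoberta1", "XRoberta2", "BART"]
datasets = ["D3", "D4", "D5", "D7"]
algorithms = ["Alg1", "Alg2", "Alg3", "Alg4"]
thresholds = [0.5, 0.6, 0.7, 0.8, 0.9]

# Every (LM, D, Alg, th) combination is evaluated against the VAA ground-truth.
configs = list(product(language_models, datasets, algorithms, thresholds))
print(len(configs))  # 3 * 4 * 4 * 5 = 240 configurations
```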

6.3. Evaluation
In evaluating the stance-detection model, traditional metrics like MSE, MAE, R2 score, and residual plots are common. However, a bespoke metric would be needed to address the varying importance of errors across stance classes: for instance, predicting agree when the true label is completely disagree carries a different weight than predicting neither disagree, nor agree when the true label is agree. In the absence of such a metric, MAE was chosen. Lastly, since the predicted value is an integer in {1, 2, 3, 4, 5}, a classification evaluation metric was considered as well: the weighted F1 score was picked, since it summarizes both Precision and Recall [33]. The sklearn.metrics Python package was used to compute both MAE8 and the weighted F19.
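As an illustration, the two metrics can be computed as follows; the label vectors are toy values, not results from the paper.

```python
from sklearn.metrics import mean_absolute_error, f1_score

# Toy ground-truth and predicted labels in {1, ..., 5} (illustrative only).
y_true = [5, 4, 3, 1, 2, 4]
y_pred = [4, 4, 3, 2, 2, 5]

mae = mean_absolute_error(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average="weighted")
print(f"MAE = {mae:.2f}, weighted F1 = {f1:.2f}")
```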


6
  https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html
7
  https://www.sbert.net/
8
  https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html
9
  https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
Table 2
Description of all carried-out experiments

Experiment name           Research question
Best language model LM    Which is the best language model LM for zero-shot classification? Which is the best model to deal with Italian tweets? All in all, is an English model better?
Best dataset D            Having fixed the language model LM, which is the best dataset to work on, considering all proposed algorithms? Hence, which is the best time period to listen to before a political election?
Best algorithm Alg        Having fixed the language model LM and the dataset D_i, which is the best algorithm, considering all evaluated thresholds th? Are all our proposed algorithms better than the baselines (Subsection 6.1)? Are the more complex algorithms better or not?
Best threshold th         Having fixed the language model LM, the dataset D_i, and the algorithm Alg, which is the best filtering threshold th, hence the optimal setup?
Party analysis            Having fixed the optimal setup for our framework, on which Parties does T2S behave well or poorly?



7. Results and Discussion
7.1. Best Language Model LM
First, we explored which is the best language model for ZSC on Italian tweets: a model pre-trained on a mix of languages including Italian, or one fine-tuned on Italian text? Furthermore, would the results benefit from using an English language model on translated tweets instead? We answered these questions by looking at Fig. 2: each cell $(LM_i, D_j)$ indicates the minimum MAE (maximum F1) obtained with our T2S method for a certain language model $LM_i$ and dataset $D_j$ by varying the algorithm $Alg$ and the threshold $th$ according to Fig. 1.

[Figure 2 (two heatmaps). Recovered values:

MAE          D3     D4     D5     D7
XRoberta1    1.28   1.27   1.27   1.29
XRoberta2    1.32   1.31   1.28   1.28
BART         1.19   1.13   1.18   1.25

F1           D3     D4     D5     D7
XRoberta1    0.24   0.25   0.25   0.25
XRoberta2    0.29   0.29   0.27   0.26
BART         0.37   0.40   0.38   0.36 ]
Figure 2: Best MAE and F1 values of our T2S method for each pair $(LM_i, D_j)$ of language model and dataset. In the original heatmaps, darker colors indicate optimal values for both metrics.


  Among the cross-lingual models XRoberta1 and XRoberta2, the better one seemed to be XRoberta1: it had an overall better MAE, while its F1 results were close to XRoberta2's; we considered MAE the primary metric for judging performance, since it tells how close we are to the correct answer. Apparently, fine-tuning on an Italian translation of a subset of the MNLI dataset (XRoberta2) does not contribute much to text classification in our T2S framework. All in all, the best choice is translating the pre-processed tweets into English and using an English model like BART: it reached significantly better values on both metrics (lower MAE and higher F1). Presumably, using a model pre-trained and fine-tuned on a single language gives better results for our prediction task, as learning on a single language allows the model to capture more details and features of that language.

7.2. Best Dataset D
The choice of the dataset's time period ($D_i$) as one of the parameters to tune is motivated by the use of T2S for stance detection during political elections, where the proximity to the elections may affect how likely users are to discuss socio-political topics. Having fixed the language model $LM$ = BART, the dataset D4 was immediately identified as the best one, since it had the best MAE and F1 (Fig. 2). Presumably, the X (Twitter) political discussion in the four months before the Italian elections was enough to grasp the Parties' stances. We also evaluated the mean MAE and mean F1 for each cell $(LM_i, D_j)$ of Fig. 2, and the results confirmed BART and D4 as the best language model and dataset.

7.3. Best Algorithm Alg
Once the language model $LM$ = BART and the dataset D4 were chosen, we tested our algorithms $Alg$ against the baselines random, predict_3, and sentence_bert, examining the best $Alg$ across all thresholds $th$. Fig. 3 describes how each algorithm performed across different thresholds; these results include the performances of the three baselines as well. Altogether, the optimal algorithm is Alg3: F1 seemed to contradict this and favour Alg4 instead, but the gain in prediction error is far more important. This result suggests that assigning the neutral label (neither disagree, nor agree) only when there is a minimum number of tweets $m$ does not boost the performance of our T2S method. We also executed Alg4 with $m = \{2, 3\}$, finding that the results did not vary much from each other; therefore, we show Alg4 with $m = 3$ in Fig. 3.

7.4. Best Threshold th and Party Analysis
Having fixed the language model $LM$ = BART, the dataset D4, and the algorithm Alg3, the threshold $th = 0.6$ was immediately identified as the optimal one, since it had the best MAE and a good F1 (Fig. 3). Therefore, the best setup $su_{opt}$ of our T2S framework was $(LM, D_j, Alg, th)$ = (BART, D4, Alg3, 0.6). To explore the specific performance of our T2S method over the Parties, we used the optimal setup $su_{opt}$ while varying the threshold $th$. Fig. 4 shows the results. Each point indicates the MAE (F1) over the agreement levels $A_s^p$ for the 20 sentences for a certain Party $p$. Each Party behaves differently; thus it is likely that T2S highly depends on the Party's timeline in terms of how much it generally writes, how much it writes in-topic, and how much it writes using figures of speech or hashtags and emojis (which we removed). Looking at both the MAE and the F1, we observed a regular trend at thresholds $th = \{0.8, 0.9\}$ for five parties out of six.
[Figure 3 (two line plots): MAE (top) and weighted F1 (bottom) as functions of the threshold th in [0.5, 0.9], for Alg1, Alg2, Alg3, Alg4 and the baselines random, sentence_bert, and predict_3.]

Figure 3: MAE and F1 of our four proposed algorithms $Alg$ and the three baselines, varying the threshold $th$. Alg4 is shown with $m = 3$ (see Appendix B).


The outlier Party Mov5Stelle was more predictable at those thresholds. That may happen because its timeline deals with a given statement in a clearer way; for example, looking at Mov5Stelle's and forza_italia's tweets filtered for sentence S19 with $th = 0.9$, we saw that Mov5Stelle wrote clear and explicit tweets supporting the argument (it completely agrees), while from forza_italia's timeline it is not immediately clear that it disagrees: forza_italia tweeted about tax reduction, fewer fees on families, and job creation, and in that case our T2S framework marked it as completely agree, since the party never explicitly disagreed with income support for the poorest being beneficial for the Italian economy.


8. Conclusions and Future Work
In this work, we investigated the use of an unsupervised stance-detection framework, Tweets2Stance (T2S), based on zero-shot classification [1], to predict users' stances toward a set of social-political statements using content-based analysis of their X (Twitter) timelines in an Italian scenario. In particular, we dealt with the stances toward 20 political statements for the six major parties in Italy. Results showed that, although the overall maximum F1 value was 0.4, T2S could predict the stance with an overall minimum MAE of 1.13.
[Figure 4 (two line plots): per-Party MAE (top) and weighted F1 (bottom) as functions of the threshold th in [0.5, 0.9], for pdnetwork, Piu_Europa, Mov5Stelle, forza_italia, LegaSalvini, and FratellidItalia.]

Figure 4: MAE and F1 computed for each Party over the stance predictions of the 20 VAA statements. The optimal setup $su_{opt}$ is used, but the threshold $th$ varies.

This is a notable achievement, considering that MAE measures how close we are to the correct answer and that we worked with a final five-valued label. Also, as we hypothesized, T2S's performance highly depends on how the X (Twitter) account of the Party (hence the social media user) writes, e.g. the figures of speech employed, the words used, and so on. As mentioned when introducing the work, the approach is potentially generalizable to several topics. If applied to political discourse, it could represent the first step of a pipeline whose output is the user's political leaning. In the near future, we will investigate how the agreement levels output by T2S can be used to derive the political leaning of a social media user, for example by trying to emulate a VAA algorithm. Besides, we hope to apply it to detect extremist accounts on social media; however, a domain expert may be needed to define precise social statements to use. Future research could address T2S's limitations by using advanced models like GPT-4 or conversational AI such as ChatGPT for more robust stance detection.
Acknowledgments
We thank Project SERICS (PE00000014) - NRRP MUR program funded by the EU-NGEU, and Project SoBigData-PlusPlus, Grant Agreement number 871042, CUP B54I1900639000.


References
 [1] M. Gambini, C. Senette, T. Fagni, M. Tesconi, From tweets to stance: An unsupervised
     framework for user stance detection on twitter, in: International Conference on Discovery
     Science, Springer, 2023, pp. 96–110.
 [2] D. Biber, E. Finegan, Adverbial stance types in english, Discourse processes 11 (1988) 1–34.
 [3] A. ALDayel, W. Magdy, Stance detection on social media: State of the art and trends,
     Information Processing & Management 58 (2021) 102597.
 [4] M. Dias, K. Becker, Inf-ufrgs-opinion-mining at semeval-2016 task 6: Automatic generation
     of a training corpus for unsupervised identification of stance in tweets, in: Proceedings
     of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 2016, pp.
     378–383.
 [5] Y. Igarashi, H. Komatsu, S. Kobayashi, N. Okazaki, K. Inui, Tohoku at semeval-2016 task
     6: Feature-based model versus convolutional neural network for stance detection, in:
     Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016),
     2016, pp. 401–407.
 [6] I. Augenstein, T. Rocktäschel, A. Vlachos, K. Bontcheva, Stance detection with bidirectional
     conditional encoding, arXiv preprint arXiv:1606.05464 (2016).
 [7] S. Mohammad, S. Kiritchenko, P. Sobhani, X. Zhu, C. Cherry, Semeval-2016 task 6:
     Detecting stance in tweets, in: Proceedings of the 10th international workshop on semantic
     evaluation (SemEval-2016), 2016, pp. 31–41.
 [8] S. Hamidian, M. T. Diab, Rumor detection and classification for twitter data, arXiv preprint
     arXiv:1912.08926 (2019).
 [9] K. Darwish, W. Magdy, T. Zanouda, Improved stance prediction in a user similarity feature
     space, in: Proceedings of the 2017 IEEE/ACM international conference on advances in
     social networks analysis and mining 2017, 2017, pp. 145–148.
[10] K. Darwish, P. Stefanov, M. Aupetit, P. Nakov, Unsupervised user stance detection on
     twitter, in: Proceedings of the International AAAI Conference on Web and Social Media,
     volume 14, 2020, pp. 141–152.
[11] V. Lynn, S. Giorgi, N. Balasubramanian, H. A. Schwartz, Tweet classification without the
     tweet: An empirical examination of user versus document attributes, in: Proceedings of
     the Third Workshop on Natural Language Processing and Computational Social Science,
     2019, pp. 18–28.
[12] M. Gambini, T. Fagni, C. Senette, M. Tesconi, Tweets2stance: users stance detection
     exploiting zero-shot learning algorithms on tweets, arXiv preprint arXiv:2204.10710
     (2022).
[13] L. Cedroni, Voting Advice Applications in Europe: The state of the art, Scriptaweb, 2010.
[14] T. Louwerse, M. Rosema, The design effects of voting advice applications: Comparing
     methods of calculating matches, Acta politica 49 (2014) 286–312.
[15] OPPR, OPI - Observatory on Political Parties and Representation, n.d. URL: http://opi.sp.unipi.it/opi-political-parties/.
[16] Observatory on Political Parties and Representation (OPPR), NavigatoreElettorale Europee 2019, 2019. URL: http://opi.sp.unipi.it/opi-political-parties/oppr-projects/.
[17] A. Murakami, R. Raymond, Support or oppose? classifying positions in online debates
     from reply activities and opinion expressions, in: Coling 2010: Posters, 2010, pp. 869–875.
[18] M. A. Walker, P. Anand, R. Abbott, J. E. F. Tree, C. Martell, J. King, That is your evidence?:
     Classifying stance in online political debate, Decision Support Systems 53 (2012) 719–729.
[19] S. Gottipati, M. Qiu, L. Yang, F. Zhu, J. Jiang, Predicting user’s political party using
     ideological stances, in: International Conference on Social Informatics, Springer, 2013, pp.
     177–191.
[20] A. Aldayel, W. Magdy, Your stance is exposed! analysing possible factors for stance
     detection on social media, Proceedings of the ACM on Human-Computer Interaction 3
     (2019) 1–20.
[21] W. Yin, J. Hay, D. Roth, Benchmarking zero-shot text classification: Datasets, evaluation
     and entailment approach, in: Proceedings of the 2019 Conference on Empirical Methods
     in Natural Language Processing and the 9th International Joint Conference on Natural
     Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong
Kong, China, 2019, pp. 3914–3923. URL: https://aclanthology.org/D19-1404. doi:10.18653/v1/D19-1404.
[22] P. Sobhani, D. Inkpen, X. Zhu, A dataset for multi-target stance detection, in: Proceedings
     of the 15th Conference of the European Chapter of the Association for Computational
     Linguistics: Volume 2, Short Papers, 2017, pp. 551–557.
[23] M. Lai, V. Patti, G. Ruffo, P. Rosso, Stance evolution and twitter interactions in an italian
     political debate, in: International Conference on Applications of Natural Language to
     Information Systems, Springer, 2018, pp. 15–27.
[24] S. Ghosh, P. Singhania, S. Singh, K. Rudra, S. Ghosh, Stance detection in web and social me-
     dia: a comparative study, in: International Conference of the Cross-Language Evaluation
     Forum for European Languages, Springer, 2019, pp. 75–87.
[25] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, H.-W. Hon, Unified
     language model pre-training for natural language understanding and generation, Advances
     in Neural Information Processing Systems 32 (2019).
[26] B. Zhang, M. Yang, X. Li, Y. Ye, X. Xu, K. Dai, Enhancing cross-target stance detection with
     transferable semantic-emotion knowledge, in: Proceedings of the 58th Annual Meeting of
     the Association for Computational Linguistics, 2020, pp. 3188–3197.
[27] A. Joshi, P. Bhattacharyya, M. Carman, Political issue extraction model: A novel hierarchi-
     cal topic model that uses tweets by political and non-political authors, in: Proceedings of
     the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social
     Media Analysis, 2016, pp. 82–90.
[28] T. Fagni, S. Cresci, Fine-Grained Prediction of Political Leaning on Social Media with
     Unsupervised Deep Learning, Journal of Artificial Intelligence Research 73 (2022) 633–672.
[29] A. Rashed, M. Kutlu, K. Darwish, T. Elsayed, C. Bayrak, Embeddings-based clustering for
     target specific stances: The case of a polarized turkey, arXiv preprint arXiv:2005.09649
     (2020).
[30] R. Cohen, D. Ruths, Classifying political orientation on twitter: It’s not easy!, in: Pro-
     ceedings of the International AAAI Conference on Web and Social Media, volume 7,
     2013.
[31] W. Yin, J. Hay, D. Roth, Benchmarking zero-shot text classification: Datasets, evaluation
     and entailment approach, arXiv preprint arXiv:1909.00161 (2019).
[32] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-
     networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural
     Language Processing and the 9th International Joint Conference on Natural Language
     Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong,
China, 2019, pp. 3982–3992. URL: https://aclanthology.org/D19-1410. doi:10.18653/v1/D19-1410.
[33] F. Sebastiani, Machine learning in automated text categorization, ACM computing surveys
     (CSUR) 34 (2002) 1–47.



A. Statements

Table 3: Defined topic for each of the 20 sentences (Italian version).

nr. | Sentence | Topic
1 | nel complesso, essere membri dell'UE è uno svantaggio | svantaggi dell'Unione Europea
2 | l'Italia dovrebbe uscire dall'Euro | uscire dall'euro
3 | dovrebbe esistere un esercito comune europeo | esercito europeo comune
4 | le multinazionali come Google e Youtube dovrebbero pagare i diritti d'autore e le tasse secondo le regole di ciascun paese europeo | tasse per le multinazionali in relazione alle regole di ciascun Paese Europeo
5 | l'integrazione economica europea si è spinta troppo oltre: gli Stati membri dovrebbero riguadagnare maggiore autonomia | autonomia economica dei membri dell'Unione Europea
6 | l'Unione Europea dovrebbe riformare la propria politica dell'immigrazione: l'Italia dovrebbe ricevere più supporto dagli altri Stati membri | gestione dell'immigrazione nell'Unione Europea
7 | l'Italia dovrebbe intensificare le sue relazioni economiche con la Cina | relazioni economiche dell'Italia con la Cina
8 | l'uso ricreativo della cannabis dovrebbe essere legale | uso ricreativo della cannabis
9 | l'Islam è una minaccia per i valori dell'Italia | minaccia dell'Islam nei confronti dei valori italiani
10 | alle donne deve essere garantita autonomia di scelta sull'aborto | autonomia di scelta sull'aborto
11 | ogni forma di auto-difesa all'interno della proprietà privata dovrebbe essere legittima | legittima difesa nella propria abitazione con armi
12 | le attività della magistratura devono essere indipendenti dalle pressioni della politica | indipendenza della magistratura dalla politica
13 | i bambini, nati in Italia da cittadini stranieri, dovrebbero ricevere la cittadinanza italiana automaticamente | cittadinanza italiana per bambini nati in Italia da famiglie straniere
14 | la ricchezza dovrebbe essere redistribuita dai cittadini più abbienti ai cittadini più poveri | redistribuzione della ricchezza verso i più poveri
15 | le imprese dovrebbero poter licenziare i dipendenti più facilmente | possibilità delle imprese di licenziare facilmente i propri dipendenti
16 | la Sanità dovrebbe essere più aperta agli operatori privati | apertura della Sanità ad operatori privati
17 | proteggere l'ambiente è più importante della crescita economica | importanza della protezione dell'ambiente
18 | tagliare la spesa pubblica è un buon modo per risolvere la crisi economica | tagli alla spesa pubblica come soluzione per la crisi economica
19 | il sostegno al reddito alle fasce più povere della popolazione è positivo per l'economia italiana | migliorare l'economia aiutando le fasce a basso reddito
20 | l'introduzione di una aliquota unica sui redditi ("flat tax") sarebbe di beneficio all'economia italiana | conseguenze della flat tax per l'economia italiana



B. Algorithms ordered by complexity
Algorithm 1 [Alg1] The label $A_s^p$ is computed as

$A_s^p = \begin{cases} M1\left( \dfrac{\sum_{i=1}^{|I_{tp_s}^p|} s_i \cdot t_i}{\sum_{i=1}^{|I_{tp_s}^p|} s_i} \right) & \text{if } |I_{tp_s}^p| \neq 0 \\ 3 & \text{otherwise} \end{cases}$   (4)

       where $s_i \in S_s^p$ and $t_i \in T_{tp_s}^p$.
Algorithm 2 [Alg2] First, it maps each tweet $tw_i \in I_{tp_s}^p$ into the label $l_i \in \{1, 2, 3, 4, 5\}$ using its sentence score $s_i \in S_s^p$:

$l_i = M1(s_i)$   (5)

       then, $A_s^p$ is

$A_s^p = \begin{cases} \left\lfloor \dfrac{\sum_{i=1}^{|I_{tp_s}^p|} l_i}{|I_{tp_s}^p|} \right\rceil & \text{if } |I_{tp_s}^p| \neq 0 \\ 3 & \text{otherwise} \end{cases}$   (6)

       The step of assigning $l_i$ to each tweet $tw_i \in I_{tp_s}^p$ (Eq. 5) hopefully returns a fairer $A_s^p$. In fact, the tweet normalization may help in aggregating the contribution of each tweet ($l_i$) using the standard mean, i.e. applying macro aggregation. In a multi-class classification setup, macro-metric aggregation is preferable if class imbalance is suspected; indeed, the values $l_i$ are not balanced with respect to the current sentence $s$: likely, if a Party $p$ agrees with a sentence, there will be many tweets in agreement with it (many $l_i = 4$ or $l_i = 5$) and a few (errors) or no tweets in disagreement (few labels $l_i = 1$, $l_i = 2$, or $l_i = 3$), and vice-versa.
Algorithm 3 [Alg3] Like Alg2, but slightly modifying how $A_s^p$ is computed (Eq. 6). Let us further define $V_l$ as the number of voters for the integer label $l \in \{1, 2, 3, 4, 5\}$:

$V_l = |\{ l_i : l_i = l \}_{i=1}^{|I_{tp_s}^p|}|$   (7)

       where the $l_i$ are the labels computed from Eq. 5. Let $v = \max_l(V_l)$; then

$A_s^p = \begin{cases} l & \text{if } |\{l : V_l = v\}| = 1 \quad (8a) \\ \left\lfloor \dfrac{\sum_{i=1}^{|I_{tp_s}^p|} l_i}{|I_{tp_s}^p|} \right\rceil & \text{if } |\{l : V_l = v\}| > 1 \quad (8b) \\ 3 & \text{otherwise} \quad (8c) \end{cases}$

       where ⌊...⌉ is the round function. The majority voting (case 8a) may contribute more to assigning correct labels than the plain standard mean (case 8b, taken from Eq. 6 of Alg2), since it better accounts for class imbalance.
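A compact sketch of Alg3 follows, assuming the per-tweet labels l_i have already been computed via M1; note that Python's round approximates the ⌊·⌉ operator (it uses banker's rounding on exact halves).

```python
from collections import Counter

def alg3(labels: list[int]) -> int:
    """Alg3 (Eqs. 7-8): majority voting, rounded mean on ties, neutral if empty."""
    if not labels:
        return 3                                  # case (8c): no in-topic tweets
    counts = Counter(labels)
    v = max(counts.values())
    winners = [l for l, c in counts.items() if c == v]
    if len(winners) == 1:
        return winners[0]                         # case (8a): unique majority label
    return round(sum(labels) / len(labels))       # case (8b): tie, rounded mean

assert alg3([]) == 3
assert alg3([4, 4, 5, 2]) == 4   # unique majority
assert alg3([2, 2, 5, 5]) == 4   # tie: mean 3.5 rounds to 4
```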
Algorithm 4 [Alg4] The previous algorithms take the neutral label $nl = 3$ (neither disagree, nor agree) into consideration also when $|I_{tp_s}^p| \neq 0$. However, we wondered how the results would change if $nl$ were only considered when $|I_{tp_s}^p| = 0$. The neutral label may also be assigned in the presence of a low number of in-topic tweets $I_{tp_s}^p$: in this situation, the user may not have taken a position on the current sentence $s$ yet; also, choosing $A_s^p$ by looking at just one tweet may not be significant. Therefore, Alg4 stems from Alg3, having

$l_i = M2(s_i)$   (9)

       where $l_i \in \{1, 2, 3, 4\}$; we define

$a_s^p = \begin{cases} 3 & \text{if } |I_{tp_s}^p| < m \\ \text{majority voting (case 8a) or rounded standard mean (case 8b)} & \text{otherwise} \end{cases}$   (10)

       where $m$ is the minimum number of tweets for which the majority voting algorithm or the standard mean is executed. Since the labels {3, 4} output by $M2(s)$ represent the agree and completely agree final labels, they must be mapped back to the final integer labels 4 and 5 respectively (as coded in Section 3):

$A_s^p = \begin{cases} a_s^p & \text{if } a_s^p = 1 \vee a_s^p = 2 \\ a_s^p + 1 & \text{if } a_s^p = 3 \vee a_s^p = 4 \end{cases}$   (11)
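Alg4 then reduces to a thin wrapper, reusing M2 and alg3 from the previous sketches; the default m = 3 matches the value shown in Fig. 3.

```python
def alg4(scores: list[float], m: int = 3) -> int:
    """Alg4 (Eqs. 9-11): M2 labels, neutral below m tweets, then remap 3->4, 4->5."""
    labels = [M2(s) for s in scores]   # Eq. 9: intermediate labels in {1, 2, 3, 4}
    if len(labels) < m:
        return 3                       # too few in-topic tweets: stay neutral
    a = alg3(labels)                   # Eq. 10: majority voting / rounded mean
    return a if a <= 2 else a + 1      # Eq. 11: remap agree labels to {4, 5}

assert alg4([0.9, 0.8, 0.3]) == 5     # two "completely agree" votes out of three
assert alg4([0.9]) == 3               # below the minimum m = 3
```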