Inferring Political Leaning on X (Twitter): A Zero-Shot Approach in an Italian Scenario

Caterina Senette¹,*, Margherita Gambini¹, Tiziano Fagni¹, Victoria Popa¹,², Maurizio Tesconi¹

¹ Institute of Informatics and Telematics (IIT) - CNR, Via Giuseppe Moruzzi 1, 56124 Pisa, Italy
² Università di Pisa, Dipartimento di Computer Science, Largo Bruno Pontecorvo 3, 56127 Pisa

ITASEC 2024: The Italian Conference on CyberSecurity, April 08-12, 2024, Salerno, Italy
* Corresponding author.
caterina.senette@iit.cnr.it (C. Senette); margherita.gambini@iit.cnr.it (M. Gambini); tiziano.fagni@iit.cnr.it (T. Fagni); victoria.popa@iit.cnr.it (V. Popa); maurizio.tesconi@iit.cnr.it (M. Tesconi)
ORCID: 0000-0002-4411-7134 (C. Senette); 0000-0003-0640-2724 (M. Gambini); 0000-0003-1921-7456 (T. Fagni); 0000-0001-8228-7807 (M. Tesconi)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

In recent years, there has been growing attention on predicting the political orientation of active social media users, aiding political forecasts, the modeling of opinion dynamics, and the understanding of user polarization. Existing methods, primarily targeting X (Twitter) users, rely on content-based analysis or on a blend of content, network, and communication analysis. Recent research highlights that a user's political stance mainly hinges on their views on key political and social issues, prompting a shift towards detecting user stances from the content they share on social networks. This work investigates the use of Tweets2Stance (T2S), an unsupervised stance-detection framework based on zero-shot classification (ZSC) models [1], to predict users' stances toward a set of social-political statements through content-based analysis of their X (Twitter) timelines in an Italian scenario. The ground-truth user stances are drawn from Voting Advice Applications (VAAs), tools that aid citizens in identifying their political leaning by comparing their preferences with party stances. Leveraging the agreement levels of six parties on 20 statements from VAAs, the study aims to predict Party p's stance on each statement s using data from the Party's X (Twitter) account. T2S, employing zero-shot learning, proves effective across various contexts beyond politics, achieving a minimum MAE of 1.13 despite a general maximum F1 value of 0.4, which is significant progress given the task's complexity.

Keywords

user stance detection, zero-shot learning, unsupervised ML, political leaning, X (Twitter), VAA

1. Introduction

During the last few years, there has been growing attention towards social media, both for what is explicitly shared among users (content, thoughts, and behavior) and for what is hidden and latent. Among this latent information, the user's stance, i.e. the expression of a user's point of view and perception toward a given statement [2], is particularly interesting; in fact, stance detection on social media is an emerging opinion-mining paradigm that applies well to different social and political contexts, and for which many researchers are proposing solutions spanning natural language processing, web science, and social computing [3, 4, 5, 6, 7, 8, 9, 10]. Some works [3, 11] dealt with stance detection at the user level; however, to the best of our knowledge, a completely unsupervised technique exploiting the user's textual content only has never been explored.
Hence, the work described herein investigates the use of an unsupervised stance-detection framework based on zero-shot learning models, previously introduced by us [12, 1] under the name Tweets2Stance (T2S), to detect the stance of an X (Twitter) account from its timeline in an Italian scenario. The idea for this framework stems from observing how Voting Advice Applications (VAAs) work. VAAs, originally developed in the 1980s as paper-and-pencil civic education questionnaires [13], are online tools that help citizens, mainly before elections, to identify their political leaning by comparing their policy preferences with the political stances of parties or candidates running for office. VAAs are widespread in many countries and play a crucial role in online election campaigns worldwide. Basically, the user marks their position on a range of policy statements; the application compares the individual's answers to the positions of each Party or candidate and generates a rank-ordered list or a graph indicating which Party or candidate is located closest to the user's policy preferences. One of the crucial elements of a VAA is the questionnaire: the selection of the statements, their balance among the political poles, and their phrasing affect both the way in which users respond and the overall user engagement with the poll itself. For these reasons, the VAA's statements should cover the spectrum of the most important topics of an election campaign and adequately expose the crucial differences among all the competitors in the political scenario for which the VAA is designed [14]. This careful definition of the questionnaire, i.e. one that takes into high consideration the main topics under discussion at a certain time, suggested to us the possibility of using the official positions of Italian parties on specific political statements (during a certain election period) as the ground truth, and of determining those stances from the timelines of the parties' X (Twitter) accounts in a completely unsupervised way¹; notice that only tweets written during the pre-election period are considered.

¹ The Italian parties' official positions on 20 political statements were kindly provided by the Observatory on Political Parties and Representation [15], based on the VAA NavigatoreElettorale for the European Elections 2019 [16].

Objectives. Starting from the knowledge of the agreement level of six parties on 20 different statements (the VAA's statements), the objective of the study is to predict the stance of a Party $p$ toward each statement $s$ by exploiting what the Party's X (Twitter) account wrote. Unlike previous works in the literature [3], our classification model is built for different topics, and we come up with a fine-grained stance-detection solution working along five classes that could be generalized to various spheres, not just the political one.

2. Related Work

Stance detection is an emerging opinion-mining paradigm that applies well to several social and political scenarios. The state of the art, summarized in a highly valuable survey [3], highlights the importance of categorization: stance detection can be classified according to the target (single, multi-related, or claim-based), according to the task type (detection or prediction), or by distinguishing between stance at the user level and at the statement level.
At the statement level [17, 18], where the objective is to predict the stance expressed in a piece of text, previous research works are mainly based on Natural Language Processing (NLP) methods and on classification tasks with three classes (support/against/none). At the user level, instead, the objective is to predict the stance of a user toward a given topic, and prediction solutions generally incorporate different user attributes along with the text of their posts. Our work falls under the category of stance-detection tasks at the user level, specifically focusing on target-specific stances, a common approach in social media stance detection. This involves predicting stances on specific topics, often using separate classification models for each topic. Notable approaches [19, 4, 5, 20] utilize post text along with various user attributes, typically employing binary classification (support and against). Lynn et al. [11] compared user-level features against document-level features in predicting tweet stances without the tweets themselves, highlighting the importance of integrating user features into predictive systems. Other target-specific strategies in the literature were conducted at the statement level [6, 7, 8, 21]. In [6] the approach was conducted at the statement level through unsupervised methods, and classification was made along three positions (favour, against, neither). In [7] a stance-detection shared task is introduced, where teams inferred three-level tweet stances (for, against, or neutral towards the given target) using natural language systems; divided into a supervised (Task A) and an unsupervised (Task B) sub-task, it received 19 and 9 team submissions respectively, with highest F-scores of 67.82 for Task A and 56.28 for Task B. As mentioned above, target-specific approaches may consider single or multiple targets. Usually, multi-target classification has been used to analyse the relation between two political candidates by using domain knowledge about these targets: the same model can then be applied to different targets under the hypothesis that a piece of text containing a stance in favour of one target also implicitly contains a stance against the other [22, 9, 23]. Our method handles a broad multi-target classification task, where each statement represents a specific target. Unlike previous methods, it operates without the need for pre-selected texts or distinct models for each target.

2.1. Machine Learning (ML) approaches

Among ML features for stance detection, the literature distinguishes between linguistic features, revealing stance based on text-linguistic cues [24, 7], and users' vocabulary, which is based on their choice of words [10, 25]. Since textual cues could refer to textual features, sentiment, and semantics alike, we limit our attention to textual features. In this context, the most used ML approaches are based on supervised techniques [19, 5, 23, 18, 26]. Some works attempted to enrich dataset entities by applying unconstrained supervised methods such as transfer learning, weak supervision, and distant supervision for stance detection [6, 4].
Other innovative approaches propose unsupervised learning strategies [10, 27, 28] exploiting clustering techniques and embedding representations of users' tweets [29]. The limitations across these studies include: (i) time-intensive data collection and analysis, particularly with network-based approaches; (ii) challenges in accessing or retrieving the necessary data due to stringent social media data protection policies; (iii) most models being limited to two or three stance classes at most; (iv) reliance on supervised or semi-supervised models, which require large datasets and have limited generalizability tied to their training sets [30]. For all these reasons, the recent challenge for user-level and target-specific stance detection is to move towards unsupervised systems exploiting textual content only. To this aim, a ZSL technique exploiting advanced pre-trained Natural Language Inference (NLI) models [24, 31] can be a viable solution, as our T2S framework proved.

3. Task Definition

The task is to predict the stance $A_s^u$ of a social media user $u$ with respect to a social-political statement (or sentence) $s$, making use of the user's textual timeline on the considered social media (e.g., the X (Twitter) timeline). The stance $A_s^u$ is a five-level categorical label: completely agree (5), agree (4), neither disagree nor agree (3), disagree (2), completely disagree (1). The integer mappings used by the Tweets2Stance framework are shown in parentheses. The desired ground truth is the label $G_s^u$, i.e. the known agreement/disagreement level of user $u$ with regard to sentence $s$. Recall that the ground truth is only used to evaluate our proposed Tweets2Stance framework and to find its optimal parameters; no training step ever occurs. In this work, we assume that users are the X (Twitter) accounts of six Italian parties, as the following section will detail.

4. Data collection and Pre-processing

The political scenario under analysis refers to the European and Municipal elections held in Italy on 26th May 2019, when Italian citizens were called to elect the Italian representatives to the European Parliament. The number of Members of the European Parliament (751 deputies in total) allocated to each country is approximately proportional to its population; in 2019, Italy had to elect 76 deputies. Contextually, Italian voters also had to participate in the municipal election of mayors and of municipal and district councillors (in about 3800 Italian municipalities), with a planned run-off on 9th June 2019. In that context, we focused our attention on the six major parties in Italy: three center-right parties, namely Forza Italia (FI), Fratelli d'Italia (FDI), and Lega; two left-wing parties, namely Partito Democratico (PD) and +Europa (+Eu)²; and the Movimento 5 Stelle (M5S), representing a sort of third pole at that time. The Italian parliament included other minor parties, especially on the left wing, each representing less than 5% of the Italian population; we did not consider these parties in the current study. As previously said, we started from the assumption that, knowing the parties' answers to the VAA's statements, it is possible to predict the stance of a Party $p$ with regard to each statement $s$ by exploiting what the Party wrote on X (Twitter).
The definition of the 20 statements (Table 3 in Appendix A), which express the political positions of the six referenced parties towards selected themes under discussion in Italy and in Europe in 2019, was entrusted to a group of political experts [15, 16] who provided us with the ground truth $G_s^p$ for each Party $p$ and statement $s$ on which the current work is based. At first, we collected the timelines of the official X (Twitter) account of each party using the official X (Twitter) API³. Considering the speed with which political discussion nowadays takes place, especially on social media, the observation period was chosen so as to maximize the number of tweets while avoiding noise and off-topic content. Furthermore, to intercept any valuable information or discussion trends over time, we extended the analysis to four temporal ranges and built the associated datasets⁴, as described in Table 1.

² +Europa was born in 2018 and is characterized by a pro-European and liberal orientation.
³ https://developer.twitter.com/en/docs
⁴ The four raw datasets can be found at https://github.com/marghe943/Tweets2Stance_dataset

Table 1
The four studied datasets with the total number of tweets before the pre-processing step. $D_j$ contains $j$ months of tweets.

Dataset   Period                      #tweets
D3        [2019-03-01, 2019-05-25]    20,266
D4        [2019-02-01, 2019-05-25]    25,979
D5        [2019-01-01, 2019-05-25]    34,736
D7        [2018-11-01, 2019-05-25]    44,370

As a preliminary step, since the text collected from tweets contains a lot of noise and irrelevant information, we pre-processed the tweets to remove anything without predictive significance: URLs, the "RT @user:" prefix of retweets, mentions at the beginning of a reply tweet, tweets with three or fewer words, empty tweets, and hashtags and emojis (replaced with an empty string). Lastly, since we wanted to test our prediction approach on English tweets as well, we further translated the Italian tweets using the google_trans_new⁵ Python package.

⁵ https://pypi.org/project/google-trans-new/
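For concreteness, the following minimal sketch shows how such a cleaning step can be implemented. The regular expressions, the emoji range, and the word-count threshold are illustrative choices of ours, not the exact rules used to build the datasets.

```python
import re

MIN_WORDS = 4  # tweets with three or fewer words are discarded


def clean_tweet(text: str) -> str | None:
    """Sketch of the pre-processing described above; returns None for
    tweets that should be dropped entirely."""
    text = re.sub(r"^RT @\w+: ?", "", text)      # "RT @user:" retweet prefix
    text = re.sub(r"^(@\w+ ?)+", "", text)       # leading reply mentions
    text = re.sub(r"https?://\S+", "", text)     # URLs
    text = re.sub(r"#\w+", "", text)             # hashtags
    # rough emoji/pictograph ranges, replaced with the empty string
    text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text if len(text.split()) >= MIN_WORDS else None
```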
5. Framework Design

This section briefly describes the proposed Tweets2Stance (T2S) framework (Fig. 1) to detect the stance $A_s^u$ of an X (Twitter) user $u$ with regard to a sentence $s$, exploiting its X (Twitter) timeline $TL_u = [tw_1, ..., tw_n]$. More details are provided in a previous work of ours, where the framework was extensively introduced [1]. A user might either not talk about a specific political argument (here expressed by sentence $s$), or debate an issue not raised by our pre-defined set of statements. For these reasons, our framework executes a preliminary Topic Filtering step, exploiting a Zero-Shot Classifier (ZSC) to keep only those tweets talking about the topic $tp$ of the sentence $s$. A ZSC is a language-model-based method that, given a text and a set of labels (e.g., topics), assigns a classification probability score to each label [21]: the higher the score assigned to a label, the higher the likelihood that the input text pertains to that label. A ZSC does not require further fine-tuning on the target dataset. After obtaining the in-topic tweets $I_{tp_s}^u$ through Topic Filtering, the Agreement Detector module employs the same ZSC to detect the user's agreement/disagreement level. In Fig. 1, colour codes identify the four parameters of the T2S framework that we vary during our experiments, as explained in Section 6.

Figure 1: Our Tweets2Stance framework to compute the agreement/disagreement level $A_s^u$ of user $u$ with regard to sentence $s$. The inputs are the X (Twitter) timeline $TL_u$ extracted from a certain time-period dataset $D_i$, the sentence $s$, the topic $tp$ associated with $s$, a language model $LM$, a threshold $th$, and an algorithm $Alg$. The highlighted components are the parameters varied during our experiments: $LM \in$ {XRoberta1, XRoberta2, BART}, $D_i \in$ {D3, D4, D5, D7}, $th \in$ {0.5, 0.6, 0.7, 0.8, 0.9}, and $Alg \in$ {Alg1, Alg2, Alg3, Alg4}.

Topic Filtering. The Topic Filtering module extracts the in-topic tweets $I_{tp_s}^p$ from the X (Twitter) timeline $TL_p$ of Party $p$, using the topic $tp_s$ associated with sentence $s$ (e.g., the topic for the sentence "overall, membership in the EU has been a bad thing for the UK" can be "UK membership in EU"). The topic definitions for all considered sentences can be found in the linked repository. The module utilizes the ZSC $C$ to retrieve the in-topic tweets $I_{tp_s}^p$ and their corresponding topic scores $T_{tp_s}^p$.

Agreement Detector. The Agreement Detector module (Fig. 1, Module 2) computes the final five-valued label $A_s^p$ through an algorithm $Alg(T_{tp_s}^p, S_s^p)$, defining

$$S_s^p = \{C(tw_i, s) \mid tw_i \in I_{tp_s}^p\} \quad (1)$$

as the scores assigned by $C$ to the tweets $I_{tp_s}^p$ with respect to sentence $s$, each one indicating the relevance and agreement of tweet $tw_i$ with sentence $s$. Each employed algorithm $Alg$ exploits one of the following mapping functions:

$$M1(s) = \begin{cases} 1 & \text{if } s \in [0, 0.2) \\ 2 & \text{if } s \in [0.2, 0.4) \\ 3 & \text{if } s \in [0.4, 0.6) \\ 4 & \text{if } s \in [0.6, 0.8) \\ 5 & \text{if } s \in [0.8, 1] \end{cases} \quad (2) \qquad M2(s) = \begin{cases} 1 & \text{if } s \in [0, 0.25) \\ 2 & \text{if } s \in [0.25, 0.5) \\ 3 & \text{if } s \in [0.5, 0.75) \\ 4 & \text{if } s \in [0.75, 1] \end{cases} \quad (3)$$

where $M1(s)$ ranges from 1 to 5, corresponding to the five agreement/disagreement labels defined in Section 3. Similarly, $M2(s)$ ranges from 1 to 4, representing an intermediate agreement/disagreement scale: $M2(s) \in \{1, 2\}$ has the same meaning as in Section 3, while $M2(s) = 3$ indicates agreement and $M2(s) = 4$ represents complete agreement. The rationale behind this intermediate mapping is explained in Algorithm 4 [1]. We defined four algorithms of different complexity; details of each one are provided in Appendix B and in the already mentioned work [1].
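For illustration, below is a minimal end-to-end sketch of the two modules using the Hugging Face zero-shot-classification pipeline. The checkpoint facebook/bart-large-mnli stands in for the BART model of Section 7 (an assumption on our part, since the exact checkpoint is not stated here), and the agreement step follows the simple Alg2-style rounded mean of Appendix B.

```python
from transformers import pipeline

# Zero-shot classifier C; facebook/bart-large-mnli is an assumed checkpoint,
# as the paper only names the model family (BART).
zsc = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")


def score(text: str, label: str) -> float:
    """Probability that `text` pertains to `label` (entailment-based score)."""
    return zsc(text, candidate_labels=[label], multi_label=True)["scores"][0]


def topic_filtering(timeline: list[str], topic: str, th: float) -> list[str]:
    """Module 1: keep only tweets whose topic score reaches the threshold th."""
    return [tw for tw in timeline if score(tw, topic) >= th]


def M1(s: float) -> int:
    """Eq. (2): map a score in [0, 1] to the five-level agreement label."""
    return min(int(s / 0.2) + 1, 5)


def agreement_detector(in_topic: list[str], sentence: str) -> int:
    """Module 2 in its simplest (Alg2-like) form: label each in-topic tweet
    with M1 and take the rounded mean; default to neutral (3) otherwise."""
    if not in_topic:
        return 3
    labels = [M1(score(tw, sentence)) for tw in in_topic]
    return round(sum(labels) / len(labels))
```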
6. Experimental Setup

6.1. Baselines

It is good practice to compare the proposed method with a set of baselines. To the best of our knowledge, no baseline method has yet been devised for the typology of our stance-detection task: unlike our approach, the state-of-the-art unsupervised user-stance detection method proposed by Darwish et al. [10] cannot operate without context information from other users and is not suitable for a multi-class ordinal classification like ours. Therefore, the following baselines were used to compute $A_s^p$ for Party $p$ and sentence $s$:

Random. $A_s^p$ is set to a random integer picked from a discrete uniform distribution over $[1, 5]$. The numpy random method⁶ was used with the random seed set to 42.

Predict 3. $A_s^p$ is set to 3 (neither disagree, nor agree).

Sentence Bert. The newest Transformer-based language models like BERT can be used as feature extractors [32], providing contextual word and sentence embeddings. The Sentence-BERT architecture of the Sentence Transformers Python library⁷ was used, with the English all-mpnet-base-v2 model on the translated tweets and the multilingual model distiluse-base-multilingual-cased-v1 on the Italian tweets.

6.2. Experiments in detail

As explained in Section 5, our T2S method has four parameters to tune: the language model $LM$ used for zero-shot classification, the dataset $D_i$ from which the X (Twitter) timeline $TL_p$ is extracted, the algorithm $Alg$ for the Agreement step, and the threshold value $th$ for the Topic Filtering step. Considering the values of those parameters in Fig. 1, we carried out each experiment with the research questions summarized in Table 2 in mind, ordered by specificity.

Table 2
Description of all carried-out experiments.

Best language model LM: Which is the best language model LM for zero-shot classification? Which is the best model to deal with Italian tweets? All in all, is an English model better?

Best dataset D: Having fixed the language model LM, which is the best dataset to work on, considering all proposed algorithms? Hence, which is the best time period to listen to before a political election?

Best algorithm Alg: Having fixed the language model LM and the dataset Di, which is the best algorithm, considering all evaluated thresholds th? Are all our proposed algorithms better than the baselines (subsection 6.1)? Are the more complex algorithms better or not?

Best threshold th: Having fixed the language model LM, the dataset Di, and the algorithm Alg, which is the best filtering threshold th, hence the optimal setup?

Party Analysis: Having fixed the optimal setup for our framework, which are the Parties on which T2S behaves well or poorly?

6.3. Evaluation

In evaluating the stance-detection model, traditional metrics like MSE, MAE, R2 score, and residual plots are common. However, a bespoke metric would be needed to address the varying importance of errors across stance classes: for instance, misclassifying as agree instead of completely disagree carries a different weight than neither disagree, nor agree instead of agree. In the absence of such a metric, MAE was chosen. Lastly, since the predicted value is an integer in {1, 2, 3, 4, 5}, a classification metric was considered as well: the weighted F1 score was picked, since it summarizes both Precision and Recall [33]. The sklearn.metrics Python package was used to compute both MAE⁸ and the weighted F1⁹.

⁶ https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html
⁷ https://www.sbert.net/
⁸ https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html
⁹ https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
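For concreteness, both metrics can be computed as follows; the two label vectors below are made-up examples, not the paper's data.

```python
from sklearn.metrics import mean_absolute_error, f1_score

# Hypothetical ground-truth and predicted five-level labels for the
# 20 statements of one party (illustrative values only).
g = [5, 4, 3, 1, 2, 5, 3, 4, 1, 2, 5, 4, 3, 2, 1, 4, 5, 3, 2, 1]
a = [4, 4, 3, 2, 2, 5, 3, 3, 1, 3, 5, 4, 2, 2, 1, 4, 4, 3, 2, 2]

print(mean_absolute_error(g, a))           # MAE over the 20 statements
print(f1_score(g, a, average="weighted"))  # weighted F1 over the 5 classes
```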
7. Results and Discussion

7.1. Best Language Model LM

First, we explored which is the best language model for ZSC on Italian tweets: a model pre-trained on a mix of languages including Italian, or one fine-tuned on Italian text? Furthermore, would the results benefit from using an English language model on translated tweets instead? We answered these questions by looking at Fig. 2: each cell $(LM_i, D_j)$ indicates the minimum MAE (maximum F1) obtained with our T2S method for a certain language model $LM_i$ and dataset $D_j$, varying the algorithm $Alg$ and the threshold $th$ according to Fig. 1.

Figure 2: Best MAE and F1 values of our T2S method for each pair $(LM_i, D_j)$ of language model and dataset. Darker colors indicate optimal values for both metrics. The underlying values are:

MAE          D3     D4     D5     D7
XRoberta1    1.28   1.27   1.27   1.29
XRoberta2    1.32   1.31   1.28   1.28
BART         1.19   1.13   1.18   1.25

F1           D3     D4     D5     D7
XRoberta1    0.24   0.25   0.25   0.25
XRoberta2    0.29   0.29   0.27   0.26
BART         0.37   0.40   0.38   0.36

Among the cross-lingual models XRoberta1 and XRoberta2, the better one seemed to be XRoberta1: it had an overall better MAE, while its F1 results were close to XRoberta2's. We considered MAE the primary metric for judging performance, since it tells how close we are to the correct answer. Apparently, fine-tuning on an Italian translation of a subset of the MNLI dataset (XRoberta2) does not contribute much to text classification in our T2S framework. All in all, the best choice is translating the pre-processed tweets into English and using an English model like BART, which reached significantly better values on both MAE and F1. Presumably, using a model pre-trained and fine-tuned on a single language gives better results for our prediction task: learning on a single language allows the model to capture more details and features of that language.

7.2. Best Dataset D

The choice of the dataset's time period ($D_i$) as one of the parameters to tune is motivated by the use of T2S for stance detection during political elections, where the proximity to the elections may affect the likelihood of users discussing socio-political topics. Having fixed the language model $LM = BART$, the dataset $D_4$ was immediately identified as the best one, since it yielded the best MAE and F1 (Fig. 2). Presumably, the X (Twitter) political discussion four months before the Italian elections was enough to grasp the parties' stances. We also evaluated the mean MAE and mean F1 for each cell $(LM_i, D_j)$ of Fig. 2, and the results confirmed BART and $D_4$ as the best language model and dataset.

7.3. Best Algorithm Alg

Once the language model $LM = BART$ and the dataset $D_4$ were chosen, we tested our algorithms $Alg$ against the baselines random, predict_3, and sentence_bert, examining the best $Alg$ across all thresholds $th$. Fig. 3 shows how each algorithm performed across different thresholds, together with the performance of the three baselines. Altogether, the optimal algorithm is $Alg3$: F1 seemed to contradict this and lean towards $Alg4$ instead, but the gain in prediction error is far more important. This result suggests that assigning the neutral label (neither disagree, nor agree) only when there is a minimum number of tweets $m$ does not boost the performance of our T2S method. We also executed $Alg4$ with $m \in \{2, 3\}$, finding that the results did not vary much from each other; therefore, we show $Alg4_{m=3}$ in Fig. 3.

7.4. Best Threshold th and Party Analysis

Having fixed the language model $LM = BART$, the dataset $D_4$, and the algorithm $Alg3$, the threshold $th = 0.6$ was immediately identified as the optimal one, since it yielded the best MAE and a good F1 (Fig. 3). Therefore, the best setup $su_{opt}$ of our T2S framework was $(LM, D_j, Alg, th) = (BART, D_4, Alg3, 0.6)$.
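Procedurally, the tuning of Sections 7.1-7.4 amounts to an exhaustive search over the four parameters. A minimal sketch of that loop is given below, where evaluate is a hypothetical callable that runs T2S with a given configuration and returns its MAE; it is not part of the published code.

```python
from itertools import product

LANGUAGE_MODELS = ["XRoberta1", "XRoberta2", "BART"]
DATASETS = ["D3", "D4", "D5", "D7"]
ALGORITHMS = ["Alg1", "Alg2", "Alg3", "Alg4"]
THRESHOLDS = [0.5, 0.6, 0.7, 0.8, 0.9]


def find_optimal_setup(evaluate):
    """Score every (LM, D, Alg, th) combination with the provided
    evaluate(lm, d, alg, th) -> MAE callable and return the best setup."""
    return min(product(LANGUAGE_MODELS, DATASETS, ALGORITHMS, THRESHOLDS),
               key=lambda setup: evaluate(*setup))
```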
To explore the performance of our T2S method on the individual parties, we used the optimal setup $su_{opt}$ while varying the threshold $th$. Fig. 4 shows the results: each point indicates the MAE (F1) over the agreement levels $A_s^p$ of the 20 sentences for a certain Party $p$. Each Party behaves differently; thus it is likely that T2S highly depends on the Party's timeline, in terms of how much it generally writes, how much it writes in-topic, and how much it writes using figures of speech or hashtags and emojis (which we removed).

Figure 3: MAE and F1 of our four proposed algorithms (Alg1-Alg4) and the three baselines (random, sentence_bert, predict_3) as the threshold $th$ varies from 0.5 to 0.9. Alg4 is shown with $m = 3$ (see Appendix B).

Looking at both the MAE and F1, we observed a regular trend at thresholds $th \in \{0.8, 0.9\}$ for five parties out of six: the outlier Party Mov5Stelle was more predictable at those thresholds. That may happen because its timeline deals with a given statement in a clearer way. For example, looking at Mov5Stelle's and forza_italia's tweets filtered for sentence S19 with $th = 0.9$, we saw that Mov5Stelle wrote clear and explicit tweets supporting the argument (it completely agrees), while from forza_italia's timeline it is not immediately clear that it disagrees: forza_italia tweeted about tax reduction, fewer fees on families, and job creation, so our T2S framework marked it as 'completely agree', since the party never explicitly disagreed with income support for the poorest being beneficial for the Italian economy.

Figure 4: MAE and F1 computed for each Party (pdnetwork, Mov5Stelle, LegaSalvini, Piu_Europa, forza_italia, FratellidItalia) over the stance predictions of the 20 VAA statements. The optimal setup $su_{opt}$ is used, while the threshold $th$ varies.

8. Conclusions and Future Work

In this work, we investigated the use of Tweets2Stance (T2S), an unsupervised stance-detection framework based on zero-shot classification [1], to predict users' stances toward a set of social-political statements through content-based analysis of their X (Twitter) timelines in an Italian scenario. In particular, we dealt with the stances on 20 political statements of the six major parties in Italy. Results showed that, although the general maximum F1 value was 0.4, T2S could predict the stance with a general minimum MAE of 1.13, a notable achievement considering that MAE tells how close we are to the correct answer and that we worked with a final five-valued label. Also, as we hypothesized, T2S's performance highly depends on how the Party's X (Twitter) account (hence the social media user) writes, e.g. the employed figures of speech, the words used, and so on. As mentioned in the introduction, the approach is potentially generalizable to several topics. If applied to political discourse, it could represent the first step of a pipeline whose output is the user's political leaning.
In the near future, we will investigate how T2S's output agreement levels can be used to derive the political leaning of a social media user, for example by trying to emulate a VAA algorithm. Besides, we hope to apply it to detect extremist accounts on social media; however, a domain expert may be needed to define precise social statements to use. Future research could also address T2S's limitations by using advanced models like GPT-4 or conversational AI such as ChatGPT for more robust stance detection.

Acknowledgments

We thank Project SERICS (PE00000014) - NRRP MUR program funded by the EU-NGEU, and Project SoBigData-PlusPlus, Grant Agreement number 871042, CUP B54I1900639000.

References

[1] M. Gambini, C. Senette, T. Fagni, M. Tesconi, From tweets to stance: An unsupervised framework for user stance detection on twitter, in: International Conference on Discovery Science, Springer, 2023, pp. 96–110.
[2] D. Biber, E. Finegan, Adverbial stance types in English, Discourse Processes 11 (1988) 1–34.
[3] A. ALDayel, W. Magdy, Stance detection on social media: State of the art and trends, Information Processing & Management 58 (2021) 102597.
[4] M. Dias, K. Becker, Inf-ufrgs-opinion-mining at SemEval-2016 task 6: Automatic generation of a training corpus for unsupervised identification of stance in tweets, in: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 2016, pp. 378–383.
[5] Y. Igarashi, H. Komatsu, S. Kobayashi, N. Okazaki, K. Inui, Tohoku at SemEval-2016 task 6: Feature-based model versus convolutional neural network for stance detection, in: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 2016, pp. 401–407.
[6] I. Augenstein, T. Rocktäschel, A. Vlachos, K. Bontcheva, Stance detection with bidirectional conditional encoding, arXiv preprint arXiv:1606.05464 (2016).
[7] S. Mohammad, S. Kiritchenko, P. Sobhani, X. Zhu, C. Cherry, SemEval-2016 task 6: Detecting stance in tweets, in: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 2016, pp. 31–41.
[8] S. Hamidian, M. T. Diab, Rumor detection and classification for twitter data, arXiv preprint arXiv:1912.08926 (2019).
[9] K. Darwish, W. Magdy, T. Zanouda, Improved stance prediction in a user similarity feature space, in: Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2017, pp. 145–148.
[10] K. Darwish, P. Stefanov, M. Aupetit, P. Nakov, Unsupervised user stance detection on twitter, in: Proceedings of the International AAAI Conference on Web and Social Media, volume 14, 2020, pp. 141–152.
[11] V. Lynn, S. Giorgi, N. Balasubramanian, H. A. Schwartz, Tweet classification without the tweet: An empirical examination of user versus document attributes, in: Proceedings of the Third Workshop on Natural Language Processing and Computational Social Science, 2019, pp. 18–28.
[12] M. Gambini, T. Fagni, C. Senette, M. Tesconi, Tweets2Stance: users stance detection exploiting zero-shot learning algorithms on tweets, arXiv preprint arXiv:2204.10710 (2022).
[13] L. Cedroni, Voting Advice Applications in Europe: The state of the art, Scriptaweb, 2010.
[14] T. Louwerse, M. Rosema, The design effects of voting advice applications: Comparing methods of calculating matches, Acta Politica 49 (2014) 286–312.
[15] OPPR, OPI - Observatory on Political Parties and Representation. URL: http://opi.sp.unipi.it/opi-political-parties/.
[16] Observatory on Political Parties and Representation (OPPR), NavigatoreElettorale Europee 2019, 2019. URL: http://opi.sp.unipi.it/opi-political-parties/oppr-projects/.
[17] A. Murakami, R. Raymond, Support or oppose? Classifying positions in online debates from reply activities and opinion expressions, in: Coling 2010: Posters, 2010, pp. 869–875.
[18] M. A. Walker, P. Anand, R. Abbott, J. E. F. Tree, C. Martell, J. King, That is your evidence?: Classifying stance in online political debate, Decision Support Systems 53 (2012) 719–729.
[19] S. Gottipati, M. Qiu, L. Yang, F. Zhu, J. Jiang, Predicting user's political party using ideological stances, in: International Conference on Social Informatics, Springer, 2013, pp. 177–191.
[20] A. Aldayel, W. Magdy, Your stance is exposed! Analysing possible factors for stance detection on social media, Proceedings of the ACM on Human-Computer Interaction 3 (2019) 1–20.
[21] W. Yin, J. Hay, D. Roth, Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3914–3923. URL: https://aclanthology.org/D19-1404. doi:10.18653/v1/D19-1404.
[22] P. Sobhani, D. Inkpen, X. Zhu, A dataset for multi-target stance detection, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 2017, pp. 551–557.
[23] M. Lai, V. Patti, G. Ruffo, P. Rosso, Stance evolution and twitter interactions in an Italian political debate, in: International Conference on Applications of Natural Language to Information Systems, Springer, 2018, pp. 15–27.
[24] S. Ghosh, P. Singhania, S. Singh, K. Rudra, S. Ghosh, Stance detection in web and social media: a comparative study, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2019, pp. 75–87.
[25] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, H.-W. Hon, Unified language model pre-training for natural language understanding and generation, Advances in Neural Information Processing Systems 32 (2019).
[26] B. Zhang, M. Yang, X. Li, Y. Ye, X. Xu, K. Dai, Enhancing cross-target stance detection with transferable semantic-emotion knowledge, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 3188–3197.
[27] A. Joshi, P. Bhattacharyya, M. Carman, Political issue extraction model: A novel hierarchical topic model that uses tweets by political and non-political authors, in: Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, 2016, pp. 82–90.
[28] T. Fagni, S. Cresci, Fine-grained prediction of political leaning on social media with unsupervised deep learning, Journal of Artificial Intelligence Research 73 (2022) 633–672.
[29] A. Rashed, M. Kutlu, K. Darwish, T. Elsayed, C. Bayrak, Embeddings-based clustering for target specific stances: The case of a polarized Turkey, arXiv preprint arXiv:2005.09649 (2020).
[30] R. Cohen, D. Ruths, Classifying political orientation on twitter: It's not easy!, in: Proceedings of the International AAAI Conference on Web and Social Media, volume 7, 2013.
[31] W. Yin, J. Hay, D. Roth, Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach, arXiv preprint arXiv:1909.00161 (2019).
[32] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3982–3992. URL: https://aclanthology.org/D19-1410. doi:10.18653/v1/D19-1410.
[33] F. Sebastiani, Machine learning in automated text categorization, ACM Computing Surveys (CSUR) 34 (2002) 1–47.

A. Statements

Table 3: Defined topic for each of the 20 sentences (Italian version).

nr. | Sentence | Topic
1 | nel complesso, essere membri dell'UE è uno svantaggio | svantaggi dell'Unione Europea
2 | l'Italia dovrebbe uscire dall'Euro | uscire dall'euro
3 | dovrebbe esistere un esercito comune europeo | esercito europeo comune
4 | le multinazionali come Google e Youtube dovrebbero pagare i diritti d'autore e le tasse secondo le regole di ciascun paese europeo | tasse per le multinazionali in relazione alle regole di ciascun Paese Europeo
5 | l'integrazione economica europea si è spinta troppo oltre: gli Stati membri dovrebbero riguadagnare maggiore autonomia | autonomia economica dei membri dell'Unione Europea
6 | l'Unione Europea dovrebbe riformare la propria politica dell'immigrazione: l'Italia dovrebbe ricevere più supporto dagli altri Stati membri | gestione dell'immigrazione nell'Unione Europea
7 | l'Italia dovrebbe intensificare le sue relazioni economiche con la Cina | relazioni economiche dell'Italia con la Cina
8 | l'uso ricreativo della cannabis dovrebbe essere legale | uso ricreativo della cannabis
9 | l'Islam è una minaccia per i valori dell'Italia | minaccia dell'Islam nei confronti dei valori italiani
10 | alle donne deve essere garantita autonomia di scelta sull'aborto | autonomia di scelta sull'aborto
11 | ogni forma di auto-difesa all'interno della proprietà privata dovrebbe essere legittima | legittima difesa nella propria abitazione con armi
12 | le attività della magistratura devono essere indipendenti dalle pressioni della politica | indipendenza della magistratura dalla politica
13 | i bambini, nati in Italia da cittadini stranieri, dovrebbero ricevere la cittadinanza italiana automaticamente | cittadinanza italiana per bambini nati in Italia da famiglie straniere
14 | la ricchezza dovrebbe essere redistribuita dai cittadini più abbienti ai cittadini più poveri | redistribuzione della ricchezza verso i più poveri
15 | le imprese dovrebbero poter licenziare i dipendenti più facilmente | possibilità delle imprese di licenziare facilmente i propri dipendenti
16 | la Sanità dovrebbe essere più aperta agli operatori privati | apertura della Sanità ad operatori privati
17 | proteggere l'ambiente è più importante della crescita economica | importanza della protezione dell'ambiente
18 | tagliare la spesa pubblica è un buon modo per risolvere la crisi economica | tagli alla spesa pubblica come soluzione per la crisi economica
19 | il sostegno al reddito alle fasce più povere della popolazione è positivo per l'economia italiana | migliorare l'economia aiutando le fasce a basso reddito
20 | l'introduzione di una aliquota unica sui redditi ("flat tax") sarebbe di beneficio all'economia italiana | conseguenze della flat tax per l'economia italiana
B. Algorithms ordered by complexity

Algorithm 1 [Alg1]. The label $A_s^p$ is computed as

$$A_s^p = \begin{cases} M1\!\left(\dfrac{\sum_{i=1}^{|I_{tp_s}^p|} s_i \cdot t_i}{\sum_{i=1}^{|I_{tp_s}^p|} s_i}\right) & \text{if } |I_{tp_s}^p| \neq 0 \\ 3 & \text{otherwise} \end{cases} \quad (4)$$

where $s_i \in S_s^p$ and $t_i \in T_{tp_s}^p$.

Algorithm 2 [Alg2]. First, it maps each tweet $tw_i \in I_{tp_s}^p$ into the label $l_i \in \{1, 2, 3, 4, 5\}$ using its sentence score $s_i \in S_s^p$:

$$l_i = M1(s_i) \quad (5)$$

then, $A_s^p$ is

$$A_s^p = \begin{cases} \left\lfloor \dfrac{\sum_{i=1}^{|I_{tp_s}^p|} l_i}{|I_{tp_s}^p|} \right\rceil & \text{if } |I_{tp_s}^p| \neq 0 \\ 3 & \text{otherwise} \end{cases} \quad (6)$$

The step of assigning $l_i$ to each tweet $tw_i \in I_{tp_s}^p$ (Eq. 5) hopefully returns a fairer $A_s^p$. In fact, this per-tweet normalization may help in aggregating the contribution $l_i$ of each tweet via the standard mean, i.e. applying a macro aggregation. In a multi-class classification setup, macro aggregation is preferable when class imbalance is suspected; indeed, the values $l_i$ are not balanced with respect to the current sentence $s$: likely, if a Party $p$ agrees with a sentence, there will be a lot of tweets in agreement with it (many $l_i = 4$ or $l_i = 5$) and few (errors) or no tweets in disagreement (few labels $l_i = 1$, $l_i = 2$, or $l_i = 3$), and vice versa.

Algorithm 3 [Alg3]. Like Alg2, but slightly modifying how $A_s^p$ is computed (Eq. 6). Let us further define $V_l$ as the number of voters for the integer label $l \in \{1, 2, 3, 4, 5\}$:

$$V_l = \left|\{l_i : l_i = l\}_{i=1}^{|I_{tp_s}^p|}\right| \quad (7)$$

where the $l_i$ are the labels computed from Eq. 5. Let $v = \max_l(V_l)$; then

$$A_s^p = \begin{cases} l & \text{if } |\{l : V_l = v\}| = 1 \quad \text{(8a)} \\ \left\lfloor \dfrac{\sum_{i=1}^{|I_{tp_s}^p|} l_i}{|I_{tp_s}^p|} \right\rceil & \text{if } |\{l : V_l = v\}| > 1 \quad \text{(8b)} \\ 3 & \text{otherwise} \quad \text{(8c)} \end{cases}$$

where $\lfloor ... \rceil$ is the rounding function. The majority voting (case 8a) may contribute more to assigning correct labels than the plain standard mean (case 8b, taken from Eq. 6 of Alg2), since it better accounts for class imbalance.

Algorithm 4 [Alg4]. The previous algorithms consider the neutral label $nl = 3$ (neither disagree, nor agree) also when $|I_{tp_s}^p| \neq 0$. However, we wondered how the results would change if $nl$ were only considered when $|I_{tp_s}^p| = 0$. The neutral label may also be assigned in the presence of a low number of in-topic tweets $I_{tp_s}^p$: in this particular situation, the user may not have taken a position about the current sentence $s$ yet; also, choosing $A_s^p$ by looking at just one tweet may not be significant. Therefore, Alg4 stems from Alg3, having

$$l_i = M2(s_i) \quad (9)$$

where $l_i \in \{1, 2, 3, 4\}$; we define

$$a_s^p = \begin{cases} 3 & \text{if } |I_{tp_s}^p| < m \\ \text{majority voting (case 8a)} \\ \text{rounded standard mean (case 8b)} \end{cases} \quad (10)$$

where $m$ is the minimum number of tweets for which the majority voting or the standard mean is executed. Since the labels $\{3, 4\}$ output by $M2(s)$ represent the agree and completely agree final labels, they must be mapped back to the final integer labels 4 and 5 respectively (cf. the label coding in Section 3):

$$A_s^p = \begin{cases} a_s^p & \text{if } a_s^p = 1 \vee a_s^p = 2 \\ a_s^p + 1 & \text{if } a_s^p = 3 \vee a_s^p = 4 \end{cases} \quad (11)$$
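As a complement to the formulas above, the following minimal Python sketch implements Alg4 (and, implicitly, Alg3's majority voting with a rounded-mean tie-break). The input is the list of sentence scores $S_s^p$ of the in-topic tweets; Python's round, which uses banker's rounding, stands in as an approximation of the $\lfloor ... \rceil$ function.

```python
from collections import Counter


def M2(s: float) -> int:
    """Eq. (3): intermediate four-level mapping of a score in [0, 1]."""
    return min(int(s / 0.25) + 1, 4)


def alg4(sentence_scores: list[float], m: int = 3) -> int:
    """Sketch of Alg4: neutral (3) below m in-topic tweets; otherwise
    majority voting (case 8a) with a rounded-mean tie-break (case 8b),
    then re-mapping the four-level labels {3, 4} to {4, 5} per Eq. (11)."""
    if len(sentence_scores) < m:
        return 3
    labels = [M2(s) for s in sentence_scores]
    top = Counter(labels).most_common()
    if len(top) == 1 or top[0][1] > top[1][1]:  # unique majority (case 8a)
        a = top[0][0]
    else:                                       # tie: rounded mean (case 8b)
        a = round(sum(labels) / len(labels))
    return a if a in (1, 2) else a + 1          # Eq. (11)
```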