Trojan Horses at Touché: Logistic Regression for Classification of Political Debates

Notebook for the Touché Lab at CLEF 2024

Deepak Chandar S, Diya Seshan, Avaneesh Koushik and P Mirunalini
Department of Computer Science and Engineering, Sri Sivasubramaniya Nadar College of Engineering, Tamil Nadu, India

Abstract
This study focuses on multilingual parliamentary speech analysis, specifically the identification and classification of the ideology of the speaker’s party and its governing status. The approach used here provides valuable insights into the political dynamics of parliamentary debates, enhancing the understanding of legislative discourse. A Logistic Regression model (combined with a Count Vectorizer) is employed, trained on a dataset comprising diverse multilingual parliamentary speeches. The model achieves an F1-score of 0.59 for ideology classification and 0.69 for determining governing status. The effectiveness of the model is demonstrated in the context of evaluating parliamentary speeches from multiple countries.

Keywords
Multilingual Speech Analysis, Binary Classification, Logistic Regression, Count Vectorizer

1. Introduction
Understanding the political environment is crucial in parliamentary discussions in order to appreciate the intricacies of legislative language. The ideology of the speaker’s party and whether that party is in power are two crucial factors that greatly affect the substance and tone of speeches. Using the speech content, which may be in many languages, Sub-Task 1 aims to determine the speaker’s party’s ideological stance (left-wing or right-wing political orientation). Sub-Task 2 involves determining the party’s position in the present political structure: whether it is in power as the ruling party or in opposition [1]. Determining these components accurately improves one’s comprehension of the speaker’s viewpoint and the larger political forces at work.
In today’s computing environment, the capability to perform data-intensive natural language processing tasks has expanded significantly. In the context of identifying key aspects of parliamentary speakers, this paper explores the use of a Logistic Regression model (combined with a Count Vectorizer) for binary classification of speeches.

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
deepakchandar2210436@ssn.edu.in (D. C. S); diya2210208@ssn.edu.in (D. Seshan); avaneesh2210179@ssn.edu.in (A. Koushik); miruna@ssn.edu.in (P. Mirunalini)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

2. Background
Analysing political ideologies has traditionally been a challenging task due to the lack of detailed datasets representing individual views. A commonly employed approach, which has shown remarkable prowess in capturing nuanced linguistic patterns, utilizes large language models (LLMs) like BERT and GPT-4, as outlined in a study that analyzes parliamentary representatives’ ideological positions [2]. Previous studies have also explored the efficacy of integrating natural language processing (NLP) methods into political science research [3]. Some such studies made use of advanced NLP methods to perform sentiment analysis of parliamentary debate transcripts from European parliaments, assessing whether the age, gender, and political orientation of speakers could be detected from their speeches [4]. Furthermore, a study on Indian parliamentary debates introduces structured datasets and demonstrates promising results in stance classification and pragmatic analysis [5]. These methodologies have provided valuable insights into the political dynamics of legislative discourse across various linguistic contexts.
Inspired by the success of these approaches in various previous studies, the work done here aims to extend and refine existing methodologies for multilingual parliamentary speech analysis. Building upon the foundation laid by these prior investigations, the accuracy and robustness of classification models can be enhanced by incorporating additional linguistic features and optimizing model parameters. By leveraging logistic regression as a reliable framework for binary classification, the aim is to deepen the understanding of the intricate interplay between political ideology, governing status, and parliamentary discourse, thereby contributing to the broader discourse on computational approaches to political analysis.

3. System Overview
3.1. Dataset Overview
The dataset comprises a collection of speeches along with metadata that includes the speaker’s gender and a classification label. The dataset’s attributes are as follows:
• id: A unique identifier for each speech record. This attribute helps in referencing and tracking specific speeches within the dataset.
• text: The original speech text in 28 different European languages. This attribute is crucial for analyses that require the original language.
• text_en: The translated speech text in English. This attribute is useful for analyses where English is the preferred language for processing or interpretation. It has been used as the primary source for analysis and was fed into the machine learning model for further processing.
• sex: The gender of the speaker, indicated by ’M’ for male and ’F’ for female. This attribute allows for gender-based analysis and comparisons.
• label: A classification label for the speech, with possible values ’1’ and ’0’. In the context of Sub-Task 1, this label indicates the speaker’s party’s ideological stance: left-wing (0) or right-wing (1).
In the context of Sub-Task 2, this label indicates the party’s position in the present political structure: ruling party (0) or opposition (1).

On analyzing the dataset for the two sub-tasks, for Sub-Task 1 (orientation), an average of 10422 speeches per dataset was present, with an average of 5784 instances labelled left-wing and 4638 labelled right-wing. For Sub-Task 2 (power), an average of 8370 speeches was present, with 4445 instances labelled ruling party and 3925 labelled opposition. This illustrates that the dataset for both tasks was well-distributed and balanced, which is crucial for the model to effectively learn the characteristics of the data.

3.2. Data Preprocessing
Preprocessing involves extracting the relevant fields (text_en and label) from the dataset and converting the text data into a suitable format for the model by employing vectorization using CountVectorizer. CountVectorizer is a class in scikit-learn [6] that transforms a collection of text documents into a numerical matrix of word or token counts. The class has a number of parameters that can also assist in text preprocessing tasks, such as stop-word removal, word count thresholds (i.e. maximums and minimums), vocabulary limits, n-gram creation and more. The parameters used for the CountVectorizer tool are as follows:
• lowercase: Convert all characters to lowercase before tokenizing. Set to True.
• ngram_range: Range of n-values for the different n-grams to be extracted. Set to (1, 1): only unigrams.
• analyzer: Level at which the input text is tokenized. Set to ’word’.

3.3. Proposed Model
The first model used for experimentation was the Bidirectional Encoder Representations from Transformers (BERT) uncased classifier [7]. The embeddings obtained from the CountVectorizer tool were used to train the BERT model. The model was trained separately for each language, and while some languages yielded training F1-scores of around 0.60, others yielded significantly lower F1-scores.
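The count features fed to both models come from the CountVectorizer step of Section 3.2. A minimal sketch of that step with the parameters listed above is shown below; the sample sentences are purely illustrative and not taken from the dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative speech snippets; the real input is the dataset's text_en field.
speeches = [
    "The honourable member raises an important point",
    "The government must answer for this policy",
]

# Parameters as described in Section 3.2: lowercase word unigrams.
vectorizer = CountVectorizer(lowercase=True, ngram_range=(1, 1), analyzer="word")
counts = vectorizer.fit_transform(speeches)  # sparse document-term count matrix

print(counts.shape)                        # (number of documents, vocabulary size)
print(sorted(vectorizer.vocabulary_)[:3])  # first few vocabulary tokens
```

The resulting sparse matrix has one row per speech and one column per vocabulary token, which is the representation passed on to the classifiers.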
The dataset contains a wide range of text lengths; this could have been one reason why the model exhibited low F1-scores. Another reason for the unpredictable results of the BERT model could be the batch size, i.e. the number of samples processed together in each training step. The batch size used turned out to be sub-optimal for the dataset and the hardware, and due to low computational efficiency, experimentation with different batch sizes was not possible. Hence, a different approach was chosen.

The second model used for experimentation was logistic regression. Logistic regression is widely applied across various domains, often demonstrating superior accuracy compared to classifiers such as random forest and K-nearest neighbor in numerous empirical studies [8]. A common application of logistic regression is in sentiment analysis tasks, where it effectively categorizes text data into sentiment classes to classify emotions or opinions [9]. The initial approach involved combining the training datasets of all the languages and using this aggregated dataset to train the logistic regression model. However, this method resulted in a notably low average training F1-score. Subsequently, the model was trained separately for each language, and upon analyzing the outcomes, it was found that this method provided a higher average training F1-score. Hence, the latter method was used for subsequent analysis and evaluation.

3.4. Methodology
Logistic regression models the probability of a discrete outcome given an input variable. The most common form of logistic regression models a binary outcome: something that can take one of two values, such as true/false or yes/no. Logistic regression is a useful analysis method for classification problems, where the goal is to determine which category a new sample is most likely to belong to.
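The per-language training procedure described above can be sketched as follows. The hyperparameter values mirror those reported in Section 3.4; the toy corpora and their labels are purely illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Illustrative per-language corpora; in the paper one model is trained per parliament.
datasets = {
    "italian": (
        ["growth and lower taxes", "public ownership and welfare",
         "free markets create jobs", "expand social programmes"],
        [1, 0, 1, 0],  # Sub-Task 1 labels: 0 = left-wing, 1 = right-wing
    ),
}

models = {}
for language, (texts, labels) in datasets.items():
    # One vectorizer and one classifier per language, as in the per-language setup.
    vectorizer = CountVectorizer(lowercase=True, ngram_range=(1, 1), analyzer="word")
    X = vectorizer.fit_transform(texts)
    clf = LogisticRegression(class_weight=None, penalty="l2", C=0.3,
                             max_iter=300, tol=1e-4, fit_intercept=True,
                             random_state=42)
    clf.fit(X, labels)
    models[language] = (vectorizer, clf)

# Classify an unseen (illustrative) speech fragment with the matching language model.
vec, clf = models["italian"]
print(clf.predict(vec.transform(["lower taxes"])))
```

Keeping a separate vectorizer and classifier per language is what allowed the vocabulary and decision boundary to adapt to each parliament, which the paper found to outperform a single model trained on the aggregated data.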
The logistic function is represented by the following formula:

Logit(π) = 1 / (1 + e^(−π))

The embeddings obtained from the CountVectorizer tool were used to train the logistic regression model for both sub-tasks. The parameters used for the logistic regression model are as follows:
• class_weight: Weights associated with classes. Set to None.
• max_iter: Maximum number of iterations taken for the solvers to converge. Set to 300.
• penalty: Penalized logistic regression imposes a penalty on the logistic model for having too many variables, shrinking the coefficients of the less contributive variables toward zero; this is also known as regularization. Set to ’l2’.
• random_state: Controls the randomness of the estimator. Set to 42.
• tol: Tolerance for the stopping criteria. Set to 1e-4.
• fit_intercept: Allows the model to make predictions more accurately by shifting the decision boundary. Set to True.
• C: Inverse of regularization strength, used to penalize large coefficients. Set to 0.3.

4. Results
The model was trained on speeches in around 25 different languages, which were translated into English for the purpose of evaluation. In the training phase, the model achieved an average F1-score of 0.99 for Sub-Task 1 (orientation) and 0.98 for Sub-Task 2 (power). In the testing phase, the model achieved its highest F1-score of 0.83 for the Power task on the Greek language, and 0.72 for the Orientation task on the Italian language. Additionally, the model’s F1-score surpassed that of the baseline for several languages. On average, the metrics measured were 3 to 5 percent higher than those of the baseline model [10]. The results obtained were analyzed based on the average performance metrics of precision, recall, and F1-score, as illustrated in Table 1.
Table 1: Average Results for each Sub-Task

Task          Precision  Recall  F1-score
Orientation   0.62       0.60    0.59
Power         0.67       0.70    0.69

One possible reason for the improvement in results over the baseline model could be the incorporation of the hyperparameter C into the model. The value of C was set to 0.3, determined through random search, in which the model was trained and evaluated across various C values; a value of 0.3 yielded the best performance. Proper regularization therefore helped improve the model’s generalizability to new, unseen data by reducing overfitting.

The model was also analysed per parliamentary language, and the top four highest-performing languages for the two sub-tasks are listed in Tables 2 and 3.

Table 2: Top F1-scores for the Power Task

Parliament               F1-score  Baseline F1-score
Greece                   0.83      0.79
Austria                  0.72      0.67
Italy                    0.71      0.65
Bosnia and Herzegovina   0.57      0.41

Table 3: Top F1-scores for the Orientation Task

Parliament    F1-score  Baseline F1-score
Spain         0.72      0.72
Italy         0.66      0.65
Denmark       0.60      0.56
Netherlands   0.59      0.58

It was found that the model performed better on the power sub-task, benefiting from the presence of well-defined discriminating features in the parliamentary speeches of the dataset. In the case of classification based on ideology, however, the model struggled due to the lack of clear discriminating features for it to capture.

5. Conclusion
In conclusion, this research demonstrates the efficacy of logistic regression as a reliable technique for binary classification in the nuanced domain of multilingual parliamentary speech analysis. Through meticulous analysis of the dataset and model training, the proposed model demonstrated the utility of the approach in interpreting key attributes of political discourse, namely party ideology and governing status. The findings revealed compelling F1-scores, averaging around 0.59 and 0.69 respectively for the two tasks of identifying party ideology and governing status.
This highlights the reliability of logistic regression in capturing the inherent complexities of parliamentary debates, even in diverse linguistic contexts. Looking ahead, further refinement and extension of the approach holds promise for enhancing its predictive capabilities and applicability across a broader spectrum of parliamentary contexts. This includes exploring the integration of large language models (LLMs) such as BERT or GPT. By leveraging LLMs, it is possible to delve deeper into the complexities of parliamentary discourse, uncovering subtle semantic nuances and contextual cues that traditional methods may overlook. Additionally, the plan is to investigate novel techniques for fine-tuning LLMs on parliamentary speech data, as well as exploring ensemble methods that combine the strengths of multiple models. Through these endeavors, the aim is to develop a more comprehensive understanding of the intricate dynamics of legislative language and its implications for governance and policy making. Due to the lack of computational resources and time, the model was trained with the same features for both sub-tasks. This work can be improved and extended by using different features for each sub-task.

References
[1] J. Kiesel, Ç. Çöltekin, M. Heinrich, M. Fröbe, M. Alshomary, B. D. Longueville, T. Erjavec, N. Handke, M. Kopp, N. Ljubešić, K. Meden, N. Mirzakhmedova, V. Morkevičius, T. Reitis-Munstermann, M. Scharfbillig, N. Stefanovitch, H. Wachsmuth, M. Potthast, B. Stein, Overview of Touché 2024: Argumentation Systems, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2024.
[2] K. Kato, A. Purnomo, C. Cochrane, R. Saqur, L(u)pin: LLM-based Political Ideology Nowcasting, 2024. URL: https://arxiv.org/abs/2405.07320. arXiv:2405.07320.
[3] G. Glavaš, F. Nanni, S. P. Ponzetto, Computational Analysis of Political Texts: Bridging Research Efforts Across Communities, in: P. Nakov, A. Palmer (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, Association for Computational Linguistics, Florence, Italy, 2019, pp. 18–23. URL: https://aclanthology.org/P19-4004. doi:10.18653/v1/P19-4004.
[4] K. Miok, E. Hidalgo-Tenorio, P. Osenova, M.-A. Benitez-Castro, M. Robnik-Sikonja, Multi-aspect Multilingual and Cross-lingual Parliamentary Speech Analysis, 2023. URL: https://arxiv.org/abs/2207.01054. arXiv:2207.01054.
[5] S. V. K. Rohit, N. Singh, Analysis of Speeches in Indian Parliamentary Debates, CoRR abs/1808.06834 (2018). URL: http://arxiv.org/abs/1808.06834. arXiv:1808.06834.
[6] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, É. Duchesnay, Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830. URL: http://jmlr.org/papers/v12/pedregosa11a.html.
[7] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
[8] K. Shah, H. Patel, D. Sanghvi, M. Shah, A Comparative Analysis of Logistic Regression, Random Forest and KNN Models for the Text Classification, Augmented Human Research 5 (2020) 1–16. URL: https://doi.org/10.1007/s41133-020-00032-0. doi:10.1007/s41133-020-00032-0.
[9] A. Kumar, A. Mangotra, A. Ailawadi, R. Jain, M. Arora, Sentiment analysis on multilingual data: Hinglish, in: A. Swaroop, Z. Polkowski, S. D. Correia, B. Virdee (Eds.), Proceedings of Data Analytics and Management, Springer Nature Singapore, 2024, pp. 607–620.
[10] Ç. Çöltekin, M. Kopp, K. Meden, V. Morkevicius, N. Ljubešić, T. Erjavec, Multilingual Power and Ideology Identification in the Parliament: a Reference Dataset and Simple Baselines, 2024. arXiv:2405.07363.