JU_NLP_DID at Touché: An Attempt to Identify
                         Aspects of Power from Parliamentary Debates
                         Notebook for the Touché Lab at CLEF 2024

                         Adnan Khurshid, Dipankar Das, Rajdeep Khaskel and Suchanda Datta
                         1Department of Computer Science and Engineering, Jadavpur University, Kolkata, 700032, India


                     Abstract
                     Parliamentary debates shape critical aspects of citizens’ lives and often influence global policies. Analyzing these debates
                     computationally poses unique challenges due to the indirect and complex nature of political discourse. This paper
                     addresses two key variables in parliamentary speeches: the political ideology of the speaker and their affiliation with
                     either the governing party or the opposition. We approach these subtasks as binary classification problems, employing a
                     combination of Term Frequency-Inverse Document Frequency (TF-IDF) vectorization and Support Vector Machines
                     (SVM) for our analysis. Our methodology is designed to capture the nuanced language of parliamentary debates and
                     effectively classify speakers based on their political stance and party alignment. The results demonstrate the efficacy of
                     TF-IDF with SVM in handling the intricacies of political speech, providing a robust framework for further research in
                     computational political analysis.

                     Keywords
                     TF-IDF, SVM, Binary Classification


                         1. Introduction
                         Parliamentary debates are pivotal in shaping not only the national policies of a country but also
                         influencing global political landscapes. These debates, characterized by their indirect and
                         nuanced discourse, pose significant challenges for computational analysis. Understanding the
                         ideological stance and power alignment of speakers within these debates can provide valuable
                         insights into political dynamics and decision-making processes.

                         1.1. Objective
                         This paper addresses the task[1] of classifying two critical variables associated with speakers in
                         parlia- mentary debates: the political ideology of the speaker’s party and whether the speaker’s
                         party is in the governing coalition or in opposition. These tasks are formulated as binary
                         classification problems, necessitating sophisticated natural language processing (NLP) techniques
                         to handle the complexity and variability of political speech.
                            The data for this study is derived from the ParlaMint corpus, a multilingual and comparable
                         dataset of parliamentary debates across various countries. The corpus has been curated to
                         minimize confounding variables, ensuring that the analysis focuses on the content and context of
                         the speeches rather than extraneous factors such as speaker identity. The provided data includes
                         both the original speeches and their English translations, facilitating the development of
                         multilingual models and cross-linguistic analyses.
                            To tackle the classification tasks, we employ a combination of Term Frequency-Inverse
                         Document Fre- quency (TF-IDF) vectorization and Support Vector Machines (SVM). TF-IDF is
                         utilized to convert textual data into numerical representations that capture the importance of
                         words within the speeches, while SVM is used for its effectiveness in handling high-dimensional
                         feature spaces and binary classification problems.


                          CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
                             adnankhurshid251@gmail.com (A. Khurshid); dipankar.dipnil2005@gmail.com (D. Das);
CEUR
Workshop
                          rajdeepkhaskel@gmail.com (R. Khaskel); sumidatta769@gmail.com (S. Datta)
                  ceur-ws.org
              ISSN 1613-0073
Proceedings                       © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1.2. Contribution
This study contributes to the field of computational political analysis by offering a robust
framework for identifying political ideology and power alignment in parliamentary debates. By
leveraging advanced NLP techniques, we aim to enhance the understanding of political discourse
and provide a foundation for further research in this area. The results of our analysis demonstrate
the potential of TF-IDF and SVM in addressing the challenges posed by the indirect nature of
parliamentary speech, paving the way for more accurate and insightful political analysis tools.


2. Background
In today’s digital era, parliamentary debates have transcended the confines of legislative
chambers to include online platforms, fundamentally reshaping discourse. Within these digital
spaces, social networks wield significant influence over opinions and provide valuable data for
sentiment analysis and power identification. Understanding the intricate power dynamics at play
is essential for decoding how influence is disseminated and policies are formulated. By harnessing
the capabilities of sentiment analysis and computational techniques, we can shed light on the
underlying power structures and sentiment trends, ultimately enhancing decision-making
processes. This multifaceted approach involves analyzing various factors such as speaking
patterns, party contributions, responses to arguments, social network connections, and media
coverage to unveil influential actors and dominant dynamics within parliamentary debates.


3. System Overview
The system developed for parliamentary power identification involves several steps, including
data preprocessing, feature extraction, and classification. The primary goal is to classify
parliamentary text data using machine learning techniques. Below is a detailed overview of the
system components and processes.

3.1. Data Preprocessing
We employed automated English translations for our experiments. The raw textual data
underwent rigorous preprocessing to facilitate feature extraction and classification. This
preprocessing pipeline encompassed the following steps:
  Lowercasing: All text is converted to lowercase to ensure uniformity. HTML Tag Removal:
HTML tags are removed using regular expressions to clean the text. The English translated text
provided in the dataset contained HTML commands, necessitating their removal to ensure
accurate preprocessing and avoid interference with subsequent analyses. Punctuation Removal:
All punctuation marks are removed to reduce noise. Stopword Removal: Common stop words are
removed using NLTK’s stopword list, which helps in focusing on the meaningful words in the text.
Lemmatization: Words are lemmatized to their base or dictionary form using the WordNet
lemmatizer. This involves: Tokenizing the text. Tagging each word with its part of speech.
Mapping the POS tag to WordNet’s POS tag format. Lemmatizing each word based on its POS tag.
This preprocessing ensures that the text data is clean, normalized, and stripped of irrelevant parts,
making it suitable for feature extraction.

3.2. Feature Extraction
After preprocessing, the text data is transformed into numerical features using the TF-IDF (Term
Frequency-Inverse Document Frequency) vectorization technique. This method converts the text
into a matrix of TF-IDF features, which reflects the importance of words in the corpus:
  TF-IDF Vectorization: This technique is used to convert the preprocessed text data into numeri-
cal vectors. It captures the importance of a word in a document relative to the entire corpus. The
Tfidf Vectorizer from scikit-learn is used with default parameters.
 3.3. Model Building and Training

For the classification task, a Support Vector Machine (SVM) model with a linear kernel is initially
employed. The SVM classifier is chosen for its effectiveness in high-dimensional spaces and its
capability to handle large feature sets resulting from TF-IDF vectorization. The dataset is split into
training and testing sets, with 80% of the data used for training and 20% for testing. The SVM model
is trained on the TF-IDF vectors of the training set.

Initially, we had tried using bi-grams and n-grams with SVM but did not observe relevant
improvements in performance, hence we focused solely on uni-gram TF-IDF representations. To
optimize the SVM model, hyper-parameter tuning is performed using RandomizedSearchCV. This
approach is selected over GridSearchCV due to its ability to efficiently explore a wide range of
parameter combinations with fewer iterations, thus reducing computational burden while still
providing robust parameter estimates. Given our system's limited computational power,
RandomizedSearchCV is configured with 5 iterations and 2-fold cross-validation.

Hyper-parameter tuning is performed using RandomizedSearchCV with the following parameter
distribution:

   ●   C: [0.1, 1, 10, 100, 1000]
   ●   kernel: ['linear', 'rbf', 'sigmoid']
   ●   probability: [True]
   ●   gamma: ['scale', 'auto']
   ●   coef0: [0.0, 0.1, 0.5, 1.0]

After tuning, the best parameters for the SVM model are found to be:

   ●   C: 10
   ●   kernel: 'rbf'
   ●   probability: True
   ●   gamma: 'scale'
   ●   coef0: 0.1

These parameters enhance the SVM model's performance significantly for our classification task.

 3.4. Model Evaluation
 The trained SVM model is evaluated using the test set. Several evaluation metrics are computed
 to assess the model’s performance:
   Classification Report: This includes precision, recall, and F1-score for each class, providing a
 detailed performance analysis. Confusion Matrix: The confusion matrix shows the true positive,
 true negative, false positive, and false negative counts, offering insights into the model’s
 prediction errors. ROC Curve and AUC: The Receiver Operating Characteristic (ROC) curve and
 the Area Under the Curve (AUC) score are computed to evaluate the model’s ability to discriminate
 between classes. The AUC score provides a single metric to summarize the model’s performance.
 4. Results

                                   Power                              Orientation

 Average F1                        0.629126846                        0.570052236

 Max Precision                     0.832305837                        0.876411242

 Max Recall                        0.824644263                        0.763864404

 Max F1                            0.82763257                         0.765286062


 Table 1
 Average F1, Max Precision, Max Recall, Max F1 Scores of Ideology and Power Identification Shared
 Task.


   The accuracy of the SVM classifier was in the range of 60-80 percent for different languages.


 4. Conclusion
 This study demonstrates the application of advanced natural language processing techniques to
 the analysis of parliamentary debates, focusing on identifying the political ideology of speakers
 and their party’s power status. By leveraging the ParlaMint corpus, which provides a rich and
 multilingual dataset of parliamentary speeches, we have developed a robust framework for
 addressing these binary classification tasks.
    Our approach involved experimenting with various methodologies. We initially utilized Term
 Frequency-Inverse Document Frequency (TF-IDF) vectorization combined with Support Vector
 Machines (SVM), which proved effective in handling the complexity and nuance of political
 discourse. The results highlight the capability of TF-IDF and SVM to capture significant features of
 parliamentary speeches. Each method demonstrated unique strengths, contributing to a
 comprehensive understanding of the political dynamics within parliamentary debates. This work
 not only provides valuable insights into the political dynamics within parliamentary debates but
 also sets the stage for further research in computational political analysis. Future studies can build
 on this foundation by refining these models, exploring ensemble methods, and expanding the
 scope to include additional political variables and more diverse datasets.

In conclusion, our study underscores the importance of computational approaches in
understanding political discourse and offers a promising methodology for analyzing parliamentary
debates. The techniques and findings presented here contribute to the broader field of political text
analysis, enhancing our ability to decipher and interpret the intricate language of politics.
References

  1. Erjavec, Tomaž, Maciej Ogrodniczuk, Petya Osenova, Nikola Ljubešić, Kiril Simov, Andrej
     Pančur, Michał Rudolf et al. "The ParlaMint corpora of parliamentary
     proceedings." Language resources and evaluation 57, no. 1 (2023): 415-448.

  2. Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand
      Thirion, Olivier Grisel, Mathieu Blondel et al. "Scikit-learn: Machine learning in Python."
      the Journal of machine Learning research 12 (2011): 2825-2830.

  3. Çöltekin, Çağrı, Matyáš Kopp, Katja Meden, Vaidas Morkevicius, Nikola Ljubešić, and
      Tomaž Erjavec. "Multilingual Power and Ideology Identification in the Parliament: a
      Reference Dataset and Simple Baselines." arXiv preprint arXiv:2405.07363 (2024).

  4. Russo, Daniel, Salud María Jiménez-Zafra, José Antonio García-Díaz, Tommaso Caselli,
      Marco Guerini, L. Alfonso Ureña-López, and Rafael Valencia-García. "PoliticIT at EVALITA
      2023: Overview of the Political Ideology Detection in Italian Texts Task." (2023).

  5. Tarkka, Otto, Jaakko Koljonen, Markus Korhonen, Juuso Laine, Kristian Martiskainen,
      Kimmo Elo, and Veronika Laippala. "Automated Emotion Annotation of Finnish
      Parliamentary Speeches Using GPT-4." In Proceedings of the IV Workshop on Creating,
      Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN)@ LREC-
      COLING 2024, pp. 70-76. 2024.

  6. Mochtak, Michal, Peter Rupnik, and Nikola Ljubešić. "The ParlaSent multilingual training
     dataset for sentiment identification in parliamentary proceedings." arXiv preprint
     arXiv:2309.09783 (2023).

  7. Eskişar, Gül M. Kurtoğlu, and Çağrı Çöltekin. "Emotions running high? a synopsis of the
      state of turkish politics through the parlamint corpus." In Proceedings of the Workshop
      ParlaCLARIN III within the 13th Language Resources and Evaluation Conference, pp. 61-70.
      2022.

  8. J. Kiesel, Ç. Çöltekin, M. Heinrich, M. Fröbe, M. Alshomary, B. D. Longueville, T. Erjavec, N.
     Handke, M. Kopp, N. Ljubešić, K. Meden, N. Mirzakhmedova, V. Morkevičius, T. Reitis-
     Munstermann, M. Scharfbillig, N. Stefanovitch, H. Wachsmuth, M. Potthast, B. Stein,
     Overview of Touché 2024: Argumentation Systems, in: L. Goeuriot, P. Mulhem, G. Quénot,
     D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N.
     Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction.
     Proceedings of the Fifteenth International Confer- ence of the CLEF Association (CLEF
     2024), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2024.