<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>GIL_UNAM_Iztacala at Touché: Benchmarking Classical Models for Multilingual Political Stance and Power Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luis A. H. Miranda</string-name>
          <email>luisheml16@comunidad.unam.mx</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jesús Vázquez-Osorio</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adrián Juárez-Pérez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gerardo Sierra</string-name>
          <email>GSierraM@iingen.unam.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gemma Bel-Enguix</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Grupo de Ingeniería Lingüística - UNAM, Instituto de Ingeniería</institution>
          ,
          <addr-line>Circuito Escolar -, 04510 Mexico City</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad Nacional Autónoma de México, Facultad de Estudios Superiores Iztacala, Avenida de los Barrios 1</institution>
          ,
          <addr-line>54090 Estado de México</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this article, we present a methodology developed to address the challenges of the Touché shared task on Ideology and Power Identification in Parliamentary Debates, which consists of three sub-tasks: determining the ideological orientation of a speaker's party, identifying whether the party is in government or opposition, and classifying the party's stance on the populist-pluralist spectrum. To tackle these tasks, we implemented a comprehensive pipeline to train, evaluate, and compare several classical machine learning models, including Bernoulli Naive Bayes, Logistic Regression, Support Vector Machines, and Random Forest. Our results show strong and consistent performance in Sub-tasks 1 and 2 across multiple languages, with macro F1-scores indicating reliable generalization. However, Sub-task 3 presented greater challenges, with lower and more variable performance, suggesting the increased complexity involved in modeling populism in multilingual parliamentary discourse.</p>
      </abstract>
      <kwd-group>
        <kwd>Political discourse</kwd>
        <kwd>Machine learning</kwd>
        <kwd>Multilingual text classification</kwd>
        <kwd>Political NLP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Parliamentary debates are a rich source of political expression, offering insight into the ideological leanings and governing positions of elected representatives. As both parliaments and citizens engage in discussions on critical issues, language becomes a key medium through which political positions are conveyed, often reflecting the broader semantic and cultural context of a region or country. In this setting, the analysis of multilingual parliamentary speeches presents significant challenges due to structural and semantic differences across cultural boundaries. The way in which power, disagreement, or support for the government is expressed varies greatly between political cultures; even the same linguistic pattern can have different meanings depending on the country or historical context. Understanding these nuances is essential for building robust models that generalize across languages and political systems. To address these challenges, we adopt a hybrid evaluation framework that explores combinations of classical text representations (TF-IDF and count-based) with multiple machine learning models. This approach allows us to systematically assess which configurations generalize best in multilingual political discourse.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>The line separating populism from other stances is thin, and populism is also entangled with strong ideologies such as socialism (left wing) and nationalism (right wing). This may be crucial to understanding the difference between some types of populism (Trump vs. Evo Morales): both are classified as populist, but their main ideological reference fields are completely opposite. While Trump is associated with a strong right-wing ideology, Morales is tied to the left wing [?]. Like any president, any citizen can be represented as part of an ideology that strongly shapes their behavior [?]. This distinction matters because each ideology frames the world differently, shaping who is seen as the problem and who deserves protection.</p>
      <p>Kitchener et al. [?] state that political or ideological behaviors are reflected in physical interactions with society, and even on the Internet. Every interaction on the Internet leaves a digital footprint that can be traced, and Taulli [1] characterized these massive amounts of data as Big Data. For some companies and political parties, it is also important to understand how people interact and what they say on social media.</p>
      <p>Recent years have witnessed the development of several models addressing the problem of political
ideology identification. Notably, Iyyer et al. [ 2] applied a Recursive Neural Network (RNN) to this
task using a sentence-based approach. Their work utilized the Ideological Book Corpus (IBC), a
dataset composed of annotated U.S. Congressional debates, labeled for ideological bias (Republican or
Democrat) at both the sentence and phrase levels. Their model outperformed traditional methods, such
as bag-of-words classifiers, particularly when applied to sentence-level annotated data, highlighting the
advantages of leveraging syntactic structures in ideological classification.</p>
      <p>More recently, Andruszak [3] contributed to the 2023 Power Identification shared task by implementing four distinct approaches. The first two relied on Large Language Models (LLMs), specifically fine-tuning BERT and LLaMA 3, as well as employing prompt engineering techniques with LLaMA 3. The third approach used a statistical method based on a Z-score summation, which combined the mean and standard deviation of token-level representations within a given text. The fourth involved training a Support Vector Machine (SVM) classifier. While none of the methods significantly outperformed the others, the results suggested that performance could be enhanced through the integration of rule-based components tailored to each parliamentary context within a structured pipeline.</p>
      <p>This interest in political ideology is not merely academic; ideological orientation influences voting
behavior, social media engagement, and policy support. As political expression increasingly takes place
in digital environments, understanding how ideological cues manifest in text becomes essential for
both political science and computational modeling.</p>
    </sec>
    <sec id="sec-3">
      <title>3. System Overview</title>
      <sec id="sec-3-1">
        <title>3.1. Data Overview</title>
        <p>This work was carried out as part of the Touché Lab at CLEF 2025, specifically for the task Ideology and Power Identification in Parliamentary Debates 2025 [4]. This task consists of three subtasks: 1) identify the ideology of the speaker's party, 2) identify whether the speaker's party is currently governing or in opposition, and 3) identify the position of the speaker's party on the populist-pluralist scale.</p>
        <p>The dataset comprises 29 subsets, each representing a European language or language variant, and is provided by Fröbe et al. [5]. Each instance consists of:
• id: A unique identifier for each individual speech instance.
• speaker: An identifier for a unique person. There may be multiple speeches from the same speaker.
• sex: The biological sex of the speaker, classified as Female, Male, or Unknown.
• text: The original speech text in the speaker's native European language.
• text_en: The English translation of the political speech.
• orientation: A binary label indicating the speaker's ideological stance (0: left, 1: right).
• power: A binary label reflecting the speaker's political role (0: opposition; 1: coalition or governing party).
• populism: An ordinal variable capturing the degree of populism in the speaker's party, on a four-point scale (1: Strongly Pluralist, 2: Moderately Pluralist, 3: Moderately Populist, 4: Strongly Populist).</p>
        <p>To ensure a reliable modeling process, it is necessary to explore the class distribution across all the subtasks, as class imbalance can significantly affect the performance and generalizability of classification models. Figure 1 shows the class distribution for Sub-task 1 (Ideological Classification). The bars represent the percentage of each class within a given language. Class 0 denotes left-wing ideology (gold), while Class 1 denotes right-wing ideology (blue). Visual inspection reveals a pronounced imbalance, with right-wing parties being more frequently represented than left-wing ones. A chi-square goodness-of-fit test confirms that this imbalance is statistically significant, χ²(1) = 5344.14, p &lt; .001. This suggests that the training data may introduce bias into models trained for this sub-task, particularly favoring the dominant class.</p>
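        <p>The goodness-of-fit test above can be sketched in a few lines of Python. The label counts below are hypothetical stand-ins, since the exact per-class totals are not reproduced here; the statistic and p-value for the real Sub-task 1 distribution are those reported in the text.</p>

```python
import math

# Hypothetical left/right label counts (illustrative only; the paper reports
# chi2(1) = 5344.14, p < .001 for the real Sub-task 1 distribution).
observed = [12000, 20000]
total = sum(observed)
expected = [total / 2] * 2  # null hypothesis: the two classes are balanced

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Survival function of the chi-square distribution with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"chi2(1) = {chi2:.2f}, p = {p_value:.3g}")
```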
        <p>Figure 2 illustrates the distribution of power roles across languages (Sub-task 2 - Governing vs. Opposition). Fewer languages are represented in Sub-task 2, likely due to missing information on whether parties were in government or opposition. The golden bars represent parties in opposition, while blue bars denote coalition or governing parties. For instance, Serbia and Croatia exhibit a strong dominance of coalition-class samples, whereas Spain and the Basque Country show the opposite pattern. A chi-square goodness-of-fit test indicates that this imbalance is statistically significant, χ²(1) = 558.45, p &lt; .001. This result indicates that the training data may introduce bias into models developed for this sub-task, potentially favoring the majority class.</p>
        <p>Figure 3 displays the distribution of party positions along the populist-pluralist spectrum across languages (Sub-task 3 - Populism). The bars represent four classes: class 0 (Strongly Pluralist), class 1 (Moderately Pluralist), class 2 (Moderately Populist), and class 3 (Strongly Populist). Most languages show great variability and a notable underrepresentation of some labels. A chi-square goodness-of-fit test confirms that this imbalance is statistically significant, χ²(3) = 11,465.05, p &lt; .001. This suggests that models trained on this data may be biased toward the more frequent categories, potentially underperforming for less represented ideological positions. These insights from the exploratory analysis inform the modeling strategy presented in the next section, with a specific focus on mitigating class imbalance and ensuring fair performance across all categories.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Proposed Model</title>
        <p>For the binary classifications in Sub-task 1 (ideology) and Sub-task 2 (government vs. opposition), a systematic experimentation framework was implemented to compare different machine learning algorithms. Four commonly used classifiers were selected for experimentation: Bernoulli Naive Bayes, Logistic Regression, Support Vector Machines (SVM), and Random Forest. Each model was tested in combination with two vectorization techniques, CountVectorizer and TfidfVectorizer, and across two n-gram ranges: unigrams (1,1) and bigrams (1,2). Furthermore, a preprocessing option involving lowercasing and punctuation removal was toggled on and off to evaluate its impact on model performance. In total, the grid of experiments included:
• Classifier (4 options)
• Vectorizer type (2 options)
• Preprocessing (lowercasing and punctuation removal: on/off)
• N-gram range (2 options)
To ensure reproducibility and robust comparison, each configuration was trained using GridSearchCV for each model. The macro-averaged F1-score was used as the main evaluation metric, due to the class imbalance and the need to treat all classes equally. The training set was divided using stratified sampling to preserve the distribution of ideological labels. For Sub-task 3 (populism classification), the same procedure was extended to multi-class classification, where the labels were treated as ordered categories and the macro F1-score was again selected to reflect performance across all levels of the populism scale.</p>
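        <p>The grid described above can be sketched with scikit-learn objects roughly as follows. This is a minimal, illustrative sketch: data loading is omitted, and the names texts and labels stand in for one language's speeches and target labels.</p>

```python
# Minimal sketch of the experiment grid: 4 classifiers x 2 vectorizers
# x 2 n-gram ranges x lowercasing on/off, scored with macro-averaged F1.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([("vec", TfidfVectorizer()), ("clf", SVC())])

param_grid = [{
    "vec": [CountVectorizer(), TfidfVectorizer()],
    "vec__ngram_range": [(1, 1), (1, 2)],   # unigrams vs. unigrams+bigrams
    "vec__lowercase": [True, False],        # preprocessing toggle
    "clf": [BernoulliNB(), LogisticRegression(max_iter=1000),
            SVC(), RandomForestClassifier()],
}]

search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=5)
# search.fit(texts, labels)   # one search per language (data not shown)
# print(search.best_params_, search.best_score_)
```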
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Methodology</title>
        <p>This section describes the purpose of each component used in the modeling pipeline.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Vectorization Techniques</title>
          <p>• CountVectorizer transforms a corpus of text into a matrix of token counts. Its parameters can also assist in text preprocessing tasks, such as stop-word removal, word count thresholds (i.e., maximums and minimums), vocabulary limits, n-gram creation, and more.
• TfidfVectorizer gives more weight to words that are more important and distinctive in a document. In addition, it takes weight away from words that do not help distinguish one class from another.</p>
        </sec>
        <sec id="sec-3-3-ngram">
          <title>3.3.2. N-gram Range</title>
          <p>An n-gram is a sequence of n consecutive elements in a text. These elements can be individual words, or unigrams (1,1), or pairs of words, most often called bigrams (1,2). The n-gram range helps the model identify a landscape of ideological expressions through the way the text is represented.</p>
        </sec>
        <sec id="sec-3-3-preproc">
          <title>3.3.3. Text Preprocessing</title>
          <p>This process includes lowercasing and the removal of special characters and numerical digits. It removes unnecessary noise so that the algorithms focus only on relevant language patterns.</p>
        </sec>
        <sec id="sec-3-3-algos">
          <title>3.3.4. Classification Algorithms</title>
          <p>• Bernoulli Naive Bayes is a predictive classification model based on Bayes' theorem, which assumes that the presence of one feature is independent of the presence of any other. As more data are observed, the presence of each feature can be associated (independently of the others) with the target class.
• Logistic Regression predicts the probability of an event occurring. It performs binary classification, with the output being a likelihood between 0 and 1.
• Support Vector Machines (SVM) seek the optimal hyperplane that separates the two classes with maximum margin.
• Random Forest is an ensemble of decision trees built from binary (yes/no) rules. It is robust to noise and nonlinear patterns, but may overfit if not properly tuned.</p>
          <p>Each algorithm was evaluated in various configurations to identify the most stable and high-performing approach per language. The decision to train models separately for each language, rather than on a combined dataset, was informed by initial tests.</p>
        </sec>
        <sec id="sec-3-3-metrics">
          <title>3.3.5. Metrics Performance</title>
          <p>• Precision measures the proportion of positive predictions that were correct.
• Recall evaluates the model's ability to detect all true positive cases.
• Accuracy reflects the total percentage of correct predictions the model made, both positive and negative.
• Macro F1-score combines precision and recall into a single measure using their harmonic mean, computed for each class individually and then averaged without weighting by class size. This mitigates class imbalance by giving equal importance to all classes regardless of their frequency.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.6. Implementation and Tools</title>
          <p>All experiments were conducted using Python and the scikit-learn library, which supported model selection, text vectorization, classifier implementation (e.g., BernoulliNB, LogisticRegression, RandomForestClassifier, SVC), and performance evaluation through metrics such as accuracy, precision, and F1-score. The use of Pipeline ensured reproducibility and modular design across experiments.</p>
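          <p>As a toy illustration of why the macro-averaged F1-score was preferred under class imbalance (our example, not from the paper): a classifier that always predicts the majority class can look strong on a weighted average while the macro average exposes its failure on the minority class.</p>

```python
# Majority-class guesser on an imbalanced toy set: weighted F1 looks
# acceptable, but macro F1 reveals the ignored minority class.
from sklearn.metrics import f1_score

y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]   # imbalanced: 8 vs. 2
y_pred = [1] * 10                          # always predict the majority class

weighted = f1_score(y_true, y_pred, average="weighted")
macro = f1_score(y_true, y_pred, average="macro")
print(f"weighted F1 = {weighted:.3f}, macro F1 = {macro:.3f}")
# weighted F1 ~ 0.711, macro F1 ~ 0.444
```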
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>The performance metrics for Sub-task 1 (Ideology Classification) are depicted in Figure 4. Panel A shows a violin plot above the baseline (0.5), with a median value of 0.76. A descriptive analysis of the Macro F1-scores was conducted to understand the overall distribution: the variance was 0.0079 and the standard deviation was 0.089. A Shapiro-Wilk test indicated that the Macro F1-scores were not normally distributed across languages (p &lt; 0.05). Based on this, a bootstrap-based one-sample t-test (10,000 iterations) was conducted to assess whether the average F1-score exceeds chance performance (0.5). The result was highly significant, t(24) = 88.70, p &lt; .001, indicating that the model performs substantially better than random guessing in the Ideology Classification task.</p>
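      <p>The bootstrap procedure above can be sketched with the standard library alone. The scores below are synthetic stand-ins drawn around the reported median; the real per-language Macro F1-scores are those shown in Figure 4.</p>

```python
import random
import statistics

random.seed(0)

# Synthetic stand-in for the 25 per-language Macro F1-scores
# (illustrative only; the real scores appear in Figure 4).
scores = [random.gauss(0.76, 0.09) for _ in range(25)]

baseline = 0.5
observed = statistics.mean(scores) - baseline

# Bootstrap: resample the scores with replacement 10,000 times and
# record the mean difference from the baseline for each resample.
boot_diffs = []
for _ in range(10_000):
    resample = random.choices(scores, k=len(scores))
    boot_diffs.append(statistics.mean(resample) - baseline)

# One-sided p-value: share of bootstrap means at or below the baseline.
p_value = sum(1 for d in boot_diffs if d <= 0) / len(boot_diffs)
print(f"mean difference = {observed:.3f}, bootstrap p = {p_value:.4f}")
```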
      <p>Panel B displays all the evaluation metrics per language. The interquartile range spans from 0.70 (Q1) to 0.83 (Q3), indicating strong general performance. Languages such as Catalonia (es-ct), Galicia (es-ga), and the Basque Country (es-pv) consistently surpassed the third quartile in all metrics. In contrast, languages like Bosnia and Herzegovina, Belgium, and Croatia consistently scored below the first quartile across metrics, which may point to language-specific challenges in the classification process or imbalanced data representation.</p>
      <p>Figure 5 shows performance metrics for Sub-task 2 (Government vs. Opposition) broken down by language. Panel A is a raincloud plot of the Macro F1-scores for all languages. The typical Macro F1-score is 0.76 (median), and a large proportion of the languages lie above this value. The variance is 0.007 and the standard deviation is 0.08, representing slight variation in the data. A normality test did not find a normal distribution across the data (p &lt; 0.05). A bootstrap-based one-sample t-test was performed to assess whether the average Macro F1-score for Sub-task 2 exceeded a baseline of 0.5. The results were statistically significant, t(24) = 84.50, p &lt; .001, indicating that the model performed significantly better than random guessing in the Government vs. Opposition task.</p>
      <p>In Panel B, most languages perform well. The interquartile range spans from 0.72 to 0.83. Languages like Catalonia, Galicia, the Basque Country, Greece, Hungary, and Turkey have all their metrics consistently above the upper quartile, suggesting robust and balanced performance in precision, recall, accuracy, and F1-score. In contrast, there are important variations between languages: Belgium and Croatia exhibit consistently lower performance, with all metrics falling below the first quartile.</p>
      <p>Sub-task 3 (Populism Scale Classification, Figure 6) involved a multi-class classification problem, aimed at classifying the degree of populism exhibited by the speaker's party. Panel A shows the Macro F1-scores across languages, with a median value of 0.66. The violin plot shows a concentration of values in the higher performance range; however, it also displays a non-negligible spread below the baseline (0.5), indicating that in some cases the model's predictions may approximate chance level. This spread is reflected in the variance (0.0220) and standard deviation (0.148). A Shapiro-Wilk test indicated no significant deviation from normality (W = 0.9600, p = 0.3285), supporting the validity of subsequent parametric analysis. A bootstrap-based t-test was calculated to assess whether the average Macro F1-score is significantly higher than the 0.5 baseline. The results were significant, t(24) = 38.25, p &lt; .001, confirming that the model significantly outperforms random guessing.</p>
      <p>Panel B shows the evaluation metrics per language. Outcomes exhibit greater variability among languages than in the previous sub-tasks. Some languages drop below the first quartile in all their metrics, such as Estonia, Finland, France, Great Britain, the Netherlands, and Norway, indicating that this task is more challenging for the model. Despite this, some languages perform outstandingly, with values greater than the third quartile, such as Catalonia, Galicia, the Basque Country, and Greece, even with metrics close to or equal to 1.0.</p>
      <p>The results of hyperparameter tuning for Sub-task 1 are shown in Table 1. Support Vector Machine (SVM) combined with TF-IDF and either unigrams or bigrams was the most robust configuration across languages. Second, Logistic Regression performed best with regional languages (e.g., Catalan, Galician, Slovenian) when using preprocessing and bigrams. For all vectorizers in all sub-tasks, min_df=2 and max_features=10,000 were set to control vocabulary size and reduce noise.</p>
      <p>The results of hyperparameter tuning for Sub-task 2 are shown in Table 2. Support Vector Machine (SVM) combined with TF-IDF and either unigrams or bigrams consistently outperformed other classifiers in 15 of the 25 languages (e.g., Austria, Czechia, Denmark, Spain, and France). The second most used algorithm was Logistic Regression, selected for languages such as Turkish, Slovenian, and Galician.</p>
      <p>In Sub-task 3 (populism detection), Support Vector Machine (SVM) with TF-IDF was again the most consistent configuration, used in the majority of languages. Logistic Regression remained competitive, and Naive Bayes combined with CountVectorizer was effective for some languages. In contrast with the other subtasks, preprocessing (lowercasing and punctuation removal) was less frequently applied, suggesting that case sensitivity and punctuation may carry stylistic signals relevant to populist expression.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this work, a pipeline was designed to train four classical machine learning models and identify the best F1-score. This pipeline consisted of preprocessing the text and systematically varying the vectorization method, the n-gram range, and the machine learning model. The optimal model and the corresponding parameter set were selected for each language on the basis of performance.</p>
      <p>The pipeline demonstrated strong results across all three subtasks. For Sub-task 1, overall performance was strong, with a median Macro F1-score of 0.76 across languages. Similarly, in Sub-task 2, results were generally satisfactory; notably, languages such as Catalan, Galician, and Basque consistently exceeded the third quartile across all evaluation metrics. Sub-task 3 revealed that, for certain languages, model predictions were not significantly better than chance.</p>
      <p>Statistical validation using bootstrap-based t-tests confirmed that the observed performance in all
three sub-tasks was significantly above chance (p &lt; .001). These findings suggest that even classical
models, when systematically optimized, can deliver robust results in multilingual political classification
tasks.</p>
      <p>The best-performing configuration across the majority of languages involved the use of TF-IDF vectorization, bigrams (n-gram range = (1,2)), and Support Vector Machines (SVM) with default hyperparameters. This combination consistently yielded higher macro F1-scores, particularly in Sub-task 1 and Sub-task 2. In contrast, for languages with smaller datasets or higher class imbalance, Logistic Regression with L2 regularization and CountVectorizer showed more stability, suggesting that performance may vary depending on language-specific characteristics such as vocabulary richness and class distribution.</p>
      <p>Overall, the results are encouraging. We acknowledge that training classical models is considerably
less time-consuming compared to fine-tuning large language models (LLMs), which provides us with
the flexibility to conduct deeper analyses in the future for those languages with lower performance.
Additionally, we remain open to exploring LLM-based approaches for these cases if needed.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was partially supported by UNAM PAPIIT project IG400725, and by the Mexican Government
through SECIHTI Project FC-2023-G-64. The first author additionally acknowledges support from
SECIHTI (CVU: 123456).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used GPT-4 solely for grammar and spelling checks. After using this tool, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Taulli</surname>
          </string-name>
          , Artificial Intelligence Basics: A Non-Technical Introduction, 1st ed.,
          <source>Apress</source>
          , Berkeley, CA,
          <year>2019</year>
          . URL: https://link.springer.com/book/10.1007/978-1-4842-5028-0. doi:10.1007/978-1-4842-5028-0.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Iyyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Enns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Boyd-Graber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Resnik</surname>
          </string-name>
          ,
          <article-title>Political ideology detection using recursive neural networks</article-title>
          , in: K. Toutanova, H. Wu (Eds.),
          <source>Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <source>Association for Computational Linguistics</source>
          , Baltimore, Maryland,
          <year>2014</year>
          , pp.
          <fpage>1113</fpage>
          -
          <lpage>1122</lpage>
          . URL: https://aclanthology.org/P14-1105/. doi:10.3115/v1/P14-1105.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Andruszak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Alhamzeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Egyed-Zsigmond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Carlsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leydet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Otiefy</surname>
          </string-name>
          ,
          <article-title>Team INSA Passau at Touché: Multi-lingual parliamentary speech classification</article-title>
          ,
          <source>in: Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2024</year>
          . URL: https://api.semanticscholar.org/CorpusID:271843942.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Erjavec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kopp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ljubešić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kuzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rayson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Osenova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ogrodniczuk</surname>
          </string-name>
          , Ç. Çöltekin,
          <string-name>
            <given-names>D.</given-names>
            <surname>Koržinek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Meden</surname>
          </string-name>
          , et al.,
          <article-title>ParlaMint II: advancing comparable parliamentary corpora across Europe</article-title>
          ,
          <source>Language Resources and Evaluation</source>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kolyada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Grahm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elstner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Loebe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <article-title>Continuous integration for reproducible shared tasks with tira.io</article-title>
          , in: J. Kamps, L. Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.),
          <source>Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes in Computer Science</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>236</fpage>
          -
          <lpage>241</lpage>
          . doi:10.1007/978-3-031-28241-6_20.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>