Qualitative Comparison of Native and
Machine-Translated Parliamentary Debates
Ajda Pretnar Žagar1
1
    Institute of Contemporary History, Privoz 11, 1000 Ljubljana, Slovenia


                  Abstract
                  Machine translation (MT) models have become increasingly accurate and widely accessible for multiple
                  languages in recent years. They can potentially lift the barriers to applying NLP tools and methods to
                  previously unsupported languages and boost comparative cross-lingual research in digital humanities.
                  This study empirically contrasts topic-modelling results obtained on the source Slovenian ParlaMint
                  corpus of parliamentary debates and on its machine-translated English version. It qualitatively
                  compares three steps in topic interpretation:
                  topic description, topic significance in subcorpora, and marginal topic distribution. The results indicate
                  that the topic modelling on the target corpus only partially replicates the topic modelling on the source
                  corpus, but the overlap is sufficient to provide a starting point for the cross-country comparison.

                  Keywords
                  topic modelling, LDA, parliamentary data, machine translation, qualitative evaluation




1. Introduction
The proliferation of linguistically annotated, well-structured corpora enables in-depth philologi-
cal, cultural, historical, and political analyses. When resources are available in several languages,
as is the case of ParlaMint corpora on parliamentary speeches from 17 European countries
[1], they also enable cross-country comparisons of discourses, topics, political agendas, and
language development. However, comparative research of ParlaMint data requires language
proficiency in more than a single language, which significantly limits transnational research.
   Fortunately, machine translation (MT) models are increasingly accurate and freely available
to the research community. They are a cost-efficient and fast method for converting almost any
corpus to a language the researcher could understand. With state-of-the-art models approaching
or sometimes even surpassing human accuracy [2], machine translation helps alleviate language
barriers for comparative research in multilingual text collections.
The main research question of this paper is to what extent bag-of-words results, specifically
topic modelling, on machine-translated corpora correspond to the results on the native corpora.
Given that topic modelling relies on word distributions rather than on word order, proper
grammar, or correct pronouns, in-context word-for-word translation accuracy is the most
important requirement for MT, and existing approaches already achieve it at a relatively high level.
Digital Parliamentary Data in Action (DiPaDA 2022) workshop, Uppsala, Sweden, March 15, 2022.
ajda.pretnar@inz.si (A. P. Žagar)
https://github.com/ajdapretnar (A. P. Žagar)
ORCID: 0000-0002-5927-4538 (A. P. Žagar)
    © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR Workshop Proceedings (CEUR-WS.org)




Similar research has already shown that machine translations can successfully capture topics
similar to those of the source corpus [3], and that target corpora can be used in comparative research
[4, 5].
   We chose to work with a recently published ParlaMint corpus, which provides rich linguistic
annotation and metadata. Moreover, we chose the Slovenian ParlaMint corpus as we are native
speakers of the language and can accurately interpret the results. Slovenian generally has
fewer language resources than English and is morphologically rich, resulting in poorer machine
translation.
   The paper qualitatively compares the outcomes of source topic models with their counterparts
obtained from the target corpus. Topic overlap is estimated not in terms of topic-term similarity,
but in terms of how similar the analytical results would be if the target corpus were used. Evaluation is done by
comparing: a) topic interpretation in the source and target topic model, b) significant topics for
pre-COVID and COVID period, and c) marginal topic distribution of both models.


2. Related work
Various techniques can be applied for a comparative analysis of topic models of multilingual
corpora. Mimno et al. [6] propose Polylingual Topic Models (PTM), which can extract topics
for corpora in many languages, but they require an initial set of comparable documents. PTM
can be extended to unaligned documents, but not all corpora contain comparable documents.
Boyd-Graber and Blei [7] further this idea by proposing multilingual topic models for unaligned
documents. When the documents in different languages do not cover the same topics, which
is often the case, Yang, Boyd-Graber and Resnik [8] propose a multilingual topic model to
match the learned topics partially. However, all of these approaches require knowledge of the
languages of the corpus.
   The alternative is using machine-translated corpora and computing topic models on those.
However, the results depend heavily on the quality of the translation. There are numerous
approaches to automatically estimating machine translation quality. Most rely on quantitative
assessment against a reference text, such as BLEU [9] and NIST [10]. Establishing community-
accepted automatic scoring methods boosted research in machine translation models as it
enabled fast and cost-effective evaluation of model improvements. That said, certain criticism
has been raised against such evaluations. Turian, Shea and Melamed [11] argue that the
correlation between human evaluation and MT quality estimates is low. Others point to the
inability of such measures to capture translation improvements in syntactic and semantic quality
[12]. Hence a qualitative estimation of topic model similarity between target and source texts
can be a viable alternative to quantitative scores.
   However, such studies are few. Reber [4] compares Google Translate and DeepL MT models
on online discourses on climate change from Germany, the United Kingdom and the United
States. The author uses Structural Topic Models on target corpus to compare topic prevalence in
different national discourses. Maier et al. [13] empirically assess the difference in topic modelling
results between machine-translated texts and multilingual dictionaries. They note the utility
of both approaches but warn of method-specific differences in the results. Nevertheless, both
studies demonstrate that it is possible to apply topic modelling in multilingual settings.




  This contribution extends the paper by de Vries, Schoonvelde and Schumacher [3], which
compares a gold-standard human translation of the Europarl data set with a machine-translated
corpus. The authors use LDA topic modelling to compare both text sets via the generated
term-document matrices, explicitly showing that target corpora can be successfully used for
extracting topics.
  We likewise estimate machine translation quality empirically, but in contrast to the above
paper, we take a qualitative perspective: we leverage our native knowledge of the Slovenian
language to estimate how closely the interpretation of the topic model of the target corpus
would match that of the source corpus.


3. Data
The first data set1 is ParlaMint-SI, a linguistically annotated corpus of parliamentary speeches
from the Slovenian parliament from 2014 onward [14]. We will refer to it as the “source
corpus”. The corpus contains 414 transcribed recordings of parliamentary sessions, equipped with
corresponding metadata on the parliamentary speakers and linguistic annotations of utterances,
including lemmas, POS tags, and named entities.
    We took the data from 2019-01-01 onward, encompassing about a year of pre-COVID and
a year of COVID speeches. We parsed the corpus into 18,476 utterances, each representing a
single speech given in a session. We kept only speeches given by regular MPs, as these would
correspond best to topics discussed in the parliament. We also removed speeches (utterances)
shorter than 50 words, as these would typically be procedural remarks [15]. In the end, the
filtered corpus contained 6861 speeches.
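The filtering rules above can be sketched in a few lines of Python. The field names (`speaker_role`, `text`) are hypothetical stand-ins; the actual ParlaMint TEI encoding uses different attribute names:

```python
# Sketch of the utterance filtering described above (hypothetical field
# names; the real ParlaMint metadata is encoded differently).
def filter_speeches(utterances, min_words=50):
    """Keep only speeches by regular MPs that are at least `min_words` long."""
    kept = []
    for u in utterances:
        if u["speaker_role"] != "MP":
            continue  # drop chairs, guests, and other non-MP speakers
        if len(u["text"].split()) < min_words:
            continue  # drop short procedural remarks
        kept.append(u)
    return kept
```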
    The second data set2 is a machine-translated version of ParlaMint-SI version 2.0. We will
refer to it as the “target corpus”. Machine translation was performed with the opus-mt-zls-en model3.

    The source corpus already contains lemmas and POS tags attained with the CLASSLA pipeline
[1]. We lemmatised the target corpus with the Lemmagen lemmatiser [16] and tagged it with
the Averaged Perceptron Tagger from the NLTK library. The choice of lemmatiser and tagger
undoubtedly introduces additional noise, resulting from imperfect preprocessing models and
not the machine translation model4.
    We kept only lemmatised nouns, thus removing a large portion of tokens. The reasoning is
that nouns sufficiently reflect topics in parliamentary speeches, and they are easier to interpret
than, say, verbs [17]. When we tested the pipeline with nouns and verbs, there was always
at least one topic with only verbs as characteristic topic words. We also removed tokens that
appear in less than ten documents, as they are too niche and do not represent a topic sufficiently.
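A minimal sketch of this token filtering, assuming each document has already been reduced to a list of (lemma, POS) pairs by the lemmatiser and tagger:

```python
from collections import Counter

def filter_tokens(docs, min_df=10):
    """Keep only noun lemmas that occur in at least `min_df` documents.

    `docs` is a list of documents; each document is a list of
    (lemma, pos) pairs, e.g. [("parliament", "NOUN"), ("speak", "VERB")].
    """
    # Keep nouns only, as in the preprocessing described above.
    noun_docs = [[lemma for lemma, pos in doc if pos == "NOUN"] for doc in docs]
    # Document frequency: in how many documents does each lemma appear?
    df = Counter(lemma for doc in noun_docs for lemma in set(doc))
    # Drop lemmas that appear in fewer than `min_df` documents.
    return [[lemma for lemma in doc if df[lemma] >= min_df] for doc in noun_docs]
```

The usage example below lowers the threshold to 2 documents so the behaviour is visible on a toy corpus; the analysis itself uses a threshold of 10.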
    In the end, the source corpus retained 3695 types, while the target corpus had 5127. More than
30% difference in types in the target corpus is already a significant discrepancy. The difference
can be attributed to the different lemmatiser (personal names were lemmatised correctly in the
    1 10.6084/m9.figshare.19248812
    2 10.6084/m9.figshare.19258814
    3 https://huggingface.co/Helsinki-NLP/opus-mt-zls-en
    4 A manual inspection of token differences between the source and target corpora revealed the difficulties of the
lemmatiser in dealing with Slovenian proper nouns and acronyms.




source corpus, but not in the target one), different POS tagger (which tagged certain words
erroneously), and, indeed, faulty machine translation model (which sometimes creates random
repetitions of words)5.
   Compared with de Vries, Schoonvelde and Schumacher, we were stricter with the prepro-
cessing. We kept only nouns from the source corpus, which empirically gave the best results6
and is similar to related work [18]. It is necessary to note that our analysis is performed at the
utterance level instead of the entire session transcription. Utterance-level models result in more
coherent topics and enable later comparison between speakers.


4. Research Design
We compare the practical efficiency of machine translations for comparative research of multi-
lingual corpora on topic modelling results7 . The choice of topic modelling is in line with related
work. However, we extend this with a qualitative comparison of the results. Namely, we wish
to determine whether machine-translated corpus would give similar results to the native corpus.
We use the Latent Dirichlet Allocation (LDA) topic model, a generative model that extracts
topics based on word distributions. We compare the topic interpretation of the source and
target topic model, the ranking of topic significance for pre-COVID and COVID subcorpus, and
the marginal topic distribution of the two topic models.
   The tasks generally correspond to typical analytical workflows in topic modelling research,
namely topic identification, contrasting topic frequencies in different periods or between parties,
and estimation of topic importance [19, 15, 18].
   Furthermore, we aim to explore the quality of the target topic model for a language with
lesser resources, namely Slovene. With the proliferation of freely available yet high-quality
MT models, such as the OPUS collection from Helsinki NLP group [20] and Facebook’s MBart
models [21], it is now possible to translate even smaller languages successfully. We intentionally
use a freely available model from the Hugging Face repository to demonstrate open-source
models’ increasing accuracy and accessibility.


5. Results
We extracted 20 topics with Latent Dirichlet Allocation on a TF-IDF-weighted bag-of-words matrix.
Twenty topics is a sufficiently large number to cover the wide array of topics that can be discussed
in the parliament, while remaining small enough to allow interpretation. Zhao et al. [22]
and Rosa, Gudowsky and Repo [23] corroborate the decision for 20 topics.
   The results of topic modelling with the top 10 words describing each topic are detailed in
Table 3 for the target corpus and in Table 4 for the source corpus (see Appendix). We manually
      5 Machine translation accuracy cannot be estimated on the ParlaMint corpus due to the lack of a gold standard.
However, the authors report a BLEU score of 25.6 and a character n-gram F-score of 0.407 for Slovenian-to-English
translation on the Tatoeba corpus.
      6 We tried topic modelling on all tokens, NOUN+VERB+ADJECTIVE, NOUN+VERB, and only NOUN, and
compared the results for 5, 10, 20 and 50 topics. The pipeline that yielded the best results was nouns only with 20
topics.
      7 The Orange data mining workflow for reproducing the analysis is available at 10.6084/m9.figshare.19248806.




[Figure 1 image omitted: t-SNE scatter plot of source and target topics.]

Figure 1: Topic similarity between the source and target corpus. The figure shows a t-SNE projection
of topics, which were embedded on their 10 most descriptive words with a FastText word embedding
model and aggregated by mean into document (topic) vectors.
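The projection described in the caption can be sketched as follows, with random vectors standing in for the FastText word embeddings:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for FastText: random 100-dimensional vectors for the 10 most
# descriptive words of each of the 20 source + 20 target topics.
word_vectors = rng.normal(size=(40, 10, 100))

# Aggregate each topic's word vectors by mean into one topic vector.
topic_vectors = word_vectors.mean(axis=1)  # shape (40, 100)

# Project the topic vectors to 2-D with t-SNE for plotting.
projection = TSNE(n_components=2, perplexity=5,
                  random_state=0).fit_transform(topic_vectors)
print(projection.shape)
```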


translated the results into English and assigned topic names. We use small caps to denote
topics from the target corpus and bold to denote topics from the source corpus.

5.1. Comparison of topic modelling results
Topics extracted from the source data are semantically more cohesive, as topics 1, 3, 18, and 19
from the target corpus represent a mix of two subtopics. For example, Topic 3 (Table 3) from
the target corpus mixes discussions on family policy (child, family, parent, allowance) with
those on electoral process (election, constituency, voter). Also, Topic 6 is a “junk” topic with
unspecific words (i, t, something, someone).
   However, topic modelling on the target corpus is able to identify certain overarching topics,
namely the discussions on Sunday working hours of shops, issues on education, and taxes.
Other topics have partial overlap (Figure 1), such as epidemic (Topic 14 and Topic 4), judiciary
(Topic 11 and Topic 1), agriculture (Topic 5 and Topic 13), and credit management (Topic
17 and Topic 7). There are two pairs of topics which display high similarity in t-distributed
Stochastic Neighbour Embedding (t-SNE) projection, but we were unable to determine why
they would be deemed similar, namely Topic 2 and Topic 20, and Topic 8 and Topic 19.

5.2. Evaluation of topic significance in subcorpora
Apart from determining topic overlap, we were interested in how the target topic model can
replicate a more complex analytical result. Such a task can be comparing the differences between




                                    Target                       Source
                              Topic 14: epidemic            Topic 4: epidemic
                           Topic 7: disabilities act      Topic 3: health care
                               Topic 11: judicial         Topic 11: firefighters
                             Topic 15: migration           Topic 17: migration
                        Topic 18: pensions & transport    Topic 1: judicial
Table 1
The top five most significant topics in the target and source corpus based on topic probabilities in the
pre-COVID and COVID period.




          (a) Topic 14 from target corpus                     (b) Topic 4 from source corpus
Figure 2: Comparison of the topics with the highest t-test scores, computed on the reference (pre-
COVID) and COVID subcorpora. Both topics describe the epidemic and share a similar test statistic.


two subcorpora. The ParlaMint data set contains rich metadata, with pre-COVID and COVID
speeches annotated. Thus we chose to compare topic prevalence in these two time periods. We
determined which topics were more significant for the pre-COVID (label Reference) period and
the pandemic period (label COVID) with the source corpus.
   We used Student’s t-test to compare the differences in the topic distribution in the reference
and COVID subcorpora. Topics with the highest test statistic denote more strongly represented
topics in a specific period. For example, Figure 2a shows that in the target corpus Topic 14,
which is about the epidemic, was more frequent in the COVID subcorpus compared to the
reference subcorpus. The same is true for the source corpus, where Topic 4 represents the
epidemic and is also the highest-ranked topic. The two topics even share a similar test statistic,
which shows that, at least for this topic, relative word frequencies were successfully retained in
the machine translation.
   Topic ranks are listed in Table 1. Besides the epidemic, judiciary and migration topics were
among the five highest-ranked topics. The overlap of topic ranks is partial. Three out of five
top-ranked topics were identified in both the target and the source corpus. Health care formed
a separate topic from the epidemic in the source corpus, while in the target corpus, pensions
and transport were collated in a single topic—certainly, even small shifts in word distributions
affect topic models and the results extracted from them.
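The subcorpus comparison can be sketched as follows, with synthetic per-document topic probabilities standing in for the real model output (the COVID sample is drawn with a higher mean, mimicking a topic that gains prominence during the pandemic):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Hypothetical per-document probabilities of one topic in each subcorpus;
# the topic is more strongly represented in the COVID period.
reference = rng.beta(2, 18, size=300)  # pre-COVID (reference) subcorpus
covid = rng.beta(6, 14, size=300)      # COVID subcorpus

# Student's t-test: a large positive statistic means the topic is more
# strongly represented in the COVID subcorpus.
t_stat, p_value = ttest_ind(covid, reference)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```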




          (a) Topic 7 from target corpus                   (b) Topic 3 from source corpus
Figure 3: Comparison of the topics with the second-highest t-test scores, computed on the reference
(pre-COVID) and COVID subcorpora. Topic 7 from the target corpus describes the debate around the
Pension and Disabilities Insurance Act, while Topic 3 covers health care.


5.3. Comparison of marginal topic frequencies
Finally, we observed marginal topic frequencies, showing which topics appear with greater
probability in each corpus (Table 2). The target corpus suffers greatly from an overestimation
of meaningless topics. At the top is Topic 16 containing procedural words, such as “law”,
“article”, “amendment”, and “draft”. Topic 6 includes uninformative words, such as “t”, “thing”,
“something”, and “someone”. Certain topics seem to be affected greatly by shifting word
frequencies, such as epidemic, judicial and budget allocation, the latter completely disappearing
from the target topic model.
   Topic frequencies paint a less promising picture of the target topic model. As this particular
target topic model often mixes two topics into one, its topic frequencies overlap less with those
of the source model. The most
represented topics in the source corpus correspond well to the parliamentary agenda, namely
budget allocation, the COVID epidemic, infrastructure issues and pensions. Target topics do
not reveal the same agenda, giving a less-than-clear picture of parliamentary discussions.
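Computing marginal topic frequencies from a document-topic matrix is straightforward; the sketch below uses a synthetic matrix in place of the real LDA output:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical document-topic matrix: each row is one speech's topic
# distribution (rows sum to 1), as produced by LDA.
doc_topics = rng.dirichlet(alpha=np.ones(20), size=500)

# Marginal topic frequency: average probability of each topic over all
# documents, as reported in Table 2.
marginal = doc_topics.mean(axis=0)
ranking = np.argsort(marginal)[::-1]  # topics ranked by marginal probability
for k in ranking[:3]:
    print(f"Topic {k + 1}: {marginal[k]:.4f}")
```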


6. Conclusion
While certainly imperfect, machine translation can help researchers explore corpora in their
non-native languages to some extent. The findings imply that machine-translated corpora can
be used by researchers who are not fluent in a specific language, albeit with limited success. In
terms of topic modelling, LDA extracted topics generally comparable with those of the source
corpus, thus enabling a cross-country semantic comparison of parliamentary data sets in
English.
   On the example of the ParlaMint-SI corpus, the topic model of the target corpus identified three
topics identical to those in the source topic model, while many others at least partially overlapped.
Certain topics in the target model were still relevant parliamentary topics, such as railway
infrastructure, migration, and bank audits, even though they were not identified in the source
corpus. Machine-translated word frequencies were different enough that LDA could not capture
the same topics, but it did offer other relevant sub-topics.
   That said, the target topic model reveals a high bias towards topics with generic words,
overestimating the importance of procedural words in topic identification. It also merges




                           Target                                         Source
             Topic 16: 0.0990851 (procedural)             Topic 15: 0.081623 (budget allocation)
              Topic 6: 0.0767897 (- no topic -)                Topic 3: 0.070251 (epidemic)
               Topic 17: 0.0706852 (housing)               Topic 5: 0.0692074 (infrastructure)
              Topic 15: 0.0610067 (migration)                 Topic 6: 0.0667152 (pensions)
            Topic 10: 0.0589544 (Sunday work)               Topic 16: 0.064667 (Sunday work)
                Topic 9: 0.0589093 (media)                     Topic 1: 0.0602505 (judicial)
               Topic 4: 0.0516202 (economy)                  Topic 4: 0.0568706 (health care)
        Topic 18: 0.049224 (pension and transport)          Topic 10: 0.0556865 (procedural)
      Topic 8: 0.0482725 (infrastructure and ecology)        Topic 19: 0.0498313 (inspection)
             Topic 12: 0.0481855 (bank audit)               Topic 7: 0.0488244 (bank system)
             Topic 13: 0.0461726 (health care)               Topic 20: 0.046444 (countryside)
                Topic 19: 0.0438473 (army)                      Topic 9: 0.0451169 (police)
             Topic 7: 0.0429779 (disability act)             Topic 17: 0.0449125 (migration)
               Topic 11: 0.0422667 (judicial)                 Topic 12: 0.044142 (education)
              Topic 5: 0.0385346 (agriculture)          Topic 2: 0.0436232 (regional development)
         Topic 1: 0.0362681 (sport and education)               Topic 18: 0.0416208 (army)
              Topic 14: 0.0361004 (epidemic)              Topic 8: 0.0327569 (hazardous waste)
        Topic 2: 0.0314906 (regional development)           Topic 13: 0.0271814 (agriculture)
               Topic 20: 0.0288681 (railways)               Topic 11: 0.0266281 (firefighters)
         Topic 3: 0.0268099 (family and election)             Topic 14: 0.0192786 (railways)
Table 2
Topic modelling results for target and source data, ranked by their marginal topic probabilities.


specific topics into one, which makes the identified topic difficult to interpret (e.g. family and
election, sport and education). Stronger preprocessing could be applied to the target corpus to
remove key MT errors, such as duplicating words and erroneous translation of personal names.
   In the future, we plan to provide annotated machine-translated ParlaMint corpora for all 16
languages (the UK corpus is already in English). The annotated corpus will enable a more accu-
rate comparison of the results with identical preprocessing. Nevertheless, machine translation
can be a viable first option for non-fluent researchers even in its imperfect current form.


Acknowledgments
The work described in this paper was funded by the Slovenian Research Agency research
programme P6-0436: Digital Humanities: resources, tools and methods (2022-2027) and the
Research Infrastructure CLARIN ERIC flagship project ParlaMint (2020-2023).


References
 [1] T. Erjavec, M. Ogrodniczuk, P. Osenova, N. Ljubešić, K. Simov, V. Grigorova, M. Rudolf,
     A. Pančur, M. Kopp, S. Barkarson, S. Steingrímsson, H. van der Pol, G. Depoorter, J. de Does,
     B. Jongejan, D. Haltrup Hansen, C. Navarretta, M. Calzada Pérez, L. D. de Macedo, R. van
     Heusden, M. Marx, Ç. Çöltekin, M. Coole, T. Agnoloni, F. Frontini, S. Montemagni,




     V. Quochi, G. Venturi, M. Ruisi, C. Marchetti, R. Battistoni, M. Sebők, O. Ring, R. Darģis,
     A. Utka, M. Petkevičius, M. Briedienė, T. Krilavičius, V. Morkevičius, S. Diwersy, G. Luxardo,
     P. Rayson, The ParlaMint corpora of parliamentary proceedings, 2022. (in press).
 [2] M. Popel, M. Tomkova, J. Tomek, Ł. Kaiser, J. Uszkoreit, O. Bojar, Z. Žabokrtský, Trans-
     forming machine translation: a deep learning system reaches news translation quality
     comparable to human professionals, Nature Communications 11 (2020) 1–15.
 [3] E. de Vries, M. Schoonvelde, G. Schumacher, No longer lost in translation: Evidence that
     google translate works for comparative bag-of-words text applications, Political Analysis
     26 (2018) 417–430. URL: https://www.jstor.org/stable/26563863.
 [4] U. Reber, Overcoming language barriers: Assessing the potential of machine translation
     and topic modeling for the comparative analysis of multilingual text corpora, Communi-
     cation Methods and Measures 13 (2019) 102–125. doi:10.1080/19312458.2018.1555798.
 [5] J. Schwalbach, C. Rauh, Collecting large-scale comparative text data on legislative debates,
     in: H. Bäck, M. Debus, J. M. Fernandes (Eds.), The Politics of Legislative Debates, Oxford
     University Press, Oxford, 2021, pp. 91–109.
 [6] D. Mimno, H. M. Wallach, J. Naradowsky, D. A. Smith, A. McCallum, Polylingual topic
     models, in: Proceedings of the 2009 Conference on Empirical Methods in Natural Language
     Processing, Association for Computational Linguistics, Singapore, 2009, pp. 880–889. URL:
     https://aclanthology.org/D09-1092.
 [7] J. Boyd-Graber, D. M. Blei, Multilingual topic models for unaligned text, in: Proceedings
     of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI ’09, AUAI
     Press, Arlington, Virginia, USA, 2009, p. 75–82.
 [8] W. Yang, J. Boyd-Graber, P. Resnik, A multilingual topic model for learning weighted topic
     links across corpora with low comparability, in: Proceedings of the 2019 Conference on Em-
     pirical Methods in Natural Language Processing and the 9th International Joint Conference
     on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Lin-
     guistics, Hong Kong, China, 2019, pp. 1243–1248. URL: https://aclanthology.org/D19-1120.
     doi:10.18653/v1/D19-1120.
 [9] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of
     machine translation, in: Proceedings of the 40th annual meeting of the Association for
     Computational Linguistics, 2002, pp. 311–318.
[10] G. Doddington, Automatic evaluation of machine translation quality using n-gram co-
     occurrence statistics, in: Proceedings of the second international conference on Human
     Language Technology Research, 2002, pp. 138–145.
[11] J. P. Turian, L. Shea, I. D. Melamed, Evaluation of machine translation and its evaluation,
     Technical Report, New York University, 2006.
[12] J. Giménez, L. Márquez, Linguistic measures for automatic machine translation evaluation,
     Machine Translation 24 (2010) 209–240. URL: http://www.jstor.org/stable/41410948.
[13] D. Maier, C. Baden, D. Stoltenberg, M. D. Vries-Kedem, A. Waldherr, Machine translation
     vs. multilingual dictionaries: Assessing two strategies for the topic modeling of multilingual
     text collections, Communication Methods and Measures (2021) 1–20. doi:10.1080/
     19312458.2021.1955845.
[14] T. Erjavec, M. Ogrodniczuk, P. Osenova, N. Ljubešić, K. Simov, V. Grigorova, M. Rudolf,
     A. Pančur, M. Kopp, S. Barkarson, S. Steingrímsson, H. van der Pol, G. Depoorter, J. de Does,
     B. Jongejan, D. Haltrup Hansen, C. Navarretta, M. Calzada Pérez, L. D. de Macedo, R. van
     Heusden, M. Marx, Ç. Çöltekin, M. Coole, T. Agnoloni, F. Frontini, S. Montemagni,
     V. Quochi, G. Venturi, M. Ruisi, C. Marchetti, R. Battistoni, M. Sebők, O. Ring, R. Darģis,
     A. Utka, M. Petkevičius, M. Briedienė, T. Krilavičius, V. Morkevičius, S. Diwersy, G. Luxardo,
     P. Rayson, Multilingual comparable corpora of parliamentary debates ParlaMint 2.1, 2021.
     URL: http://hdl.handle.net/11356/1432, Slovenian language resource repository CLARIN.SI.
[15] B. Curran, K. Higham, E. Ortiz, D. Vasques Filho, Look who’s talking: Two-mode networks
     as representations of a topic model of New Zealand parliamentary speeches, PLOS ONE 13
     (2018) e0199072.
[16] M. Juršič, I. Mozetič, T. Erjavec, N. Lavrač, Lemmagen: Multilingual lemmatisation with
     induced ripple-down rules, Journal of Universal Computer Science 16 (2010) 1190–1214.
[17] F. Martin, M. Johnson, More efficient topic modelling through a noun only approach, in:
     Proceedings of the Australasian Language Technology Association Workshop 2015, 2015,
     pp. 111–115.
[18] M. Moilanen, S. Østbye, Doublespeak? Sustainability in the Arctic—a text mining analysis
     of Norwegian parliamentary speeches, Sustainability 13 (2021) 9397.
[19] T. Sakamoto, H. Takikawa, Cross-national measurement of polarization in political dis-
     course: Analyzing floor debate in the U.S. and the Japanese legislatures, 2017 IEEE Interna-
     tional Conference on Big Data (Big Data) (2017) 3104–3110.
[20] J. Tiedemann, S. Thottingal, OPUS-MT – building open translation services for the world,
     in: Proceedings of the 22nd Annual Conference of the European Association for Machine
     Translation, 2020, pp. 479–480.
[21] A. Conneau, G. Lample, Cross-lingual language model pretraining, Advances in Neural
     Information Processing Systems 32 (2019) 7059–7069.
[22] W. Zhao, J. J. Chen, R. Perkins, Z. Liu, W. Ge, Y. Ding, W. Zou, A heuristic approach to
     determine an appropriate number of topics in topic modeling, BMC Bioinformatics 16
     (2015) 1–10.
[23] A. B. Rosa, N. Gudowsky, P. Repo, Sensemaking and lens-shaping: Identifying citizen
     contributions to foresight through comparative topic modelling, Futures 129 (2021) 102733.
Table 3
LDA with 20 topics on the target corpus

Topic 1: sport, school, student, child, education, athlete, parent, organisation, science, holiday
Topic 2: perspective, development, cohesion, policy, kilometer, home, union, minister, debate, variant
Topic 3: child, election, family, constituency, voter, van, parent, allowance, compensation, cost
Topic 4: budget, trader, government, coalition, shop, opposition, party, money, economy, sarca
Topic 5: animal, culture, food, agriculture, wood, park, book, technology, hunt, conservation
Topic 6: t, thing, i, virus, something, someone, m, everything, money, nothing
Topic 7: tax, income, right, ombudsman, relief, class, rate, veteran, deaf, contribution
Topic 8: project, water, waste, infrastructure, construction, municipality, packaging, investment, development, fund
Topic 9: tv, programme, interpretation, decision, court, janša, minister, member, rtv, janez
Topic 10: worker, sunday, work, trade, package, employee, employer, measure, job, day
Topic 11: investigation, candidate, judge, court, justice, prosecutor, prosecution, branch, crime, prison
Topic 12: bank, auditor, commission, investigation, procurement, dutb, court, report, audit, business
Topic 13: health, insurance, doctor, care, patient, heart, system, surgery, hospital, wait
Topic 14: equipment, mask, fan, purchase, reserve, march, vaccination, government, infection, minister
Topic 15: referendum, migrant, migration, woman, border, freedom, right, country, citizen, violence
Topic 16: law, article, amendment, draft, group, rule, proposal, procedure, provision, service
Topic 17: housing, fund, investment, apartment, budget, estate, credit, debt, crisis, guarantee
Topic 18: pension, transport, tachograph, vehicle, pensioner, driver, road, energy, safety, disability
Topic 19: army, defence, app, weapon, arm, security, police, application, border, soldier
Topic 20: railway, plant, road, traffic, station, transport, passenger, rail, crossing, land
Table 4
LDA with 20 topics on the source data

Topic 1: court, constitution, procedure, (judicial) decision, judge, authority, (making a) decision, law, justice, protection
Topic 2: perspective, candidate (m), candidate (f), drawings (of funds), culture, minister (f), cohesion, program, means, development
Topic 3: opposition, government, measure, human, epidemic, crisis, economy, TV show, medium, virus
Topic 4: doctor, equipment, healthcare, institution, health, mask, purchase, minister (male), hospital, patient
Topic 5: project, infrastructure, road, apartment, investment, municipality, source, construction, axis, supply
Topic 6: pension, insurance, pensioner, treasury, system, insurance company, euro, abolition, period, year
Topic 7: bank, fund, asset, investment, credit, management, obligation, claim, law, company
Topic 8: waste, medicinal product, park, agency, transport, product, society, tachograph, use, ton
Topic 9: weapon, act, punishment, police, prison, information, authorisation, prevention, victim, authority
Topic 10: article, amendment, committee, law, proposal, assembly, rules of procedure, session, opinion, job
Topic 11: vehicle, firefighter, driver, driving, society, exam, category, accident, training, centre
Topic 12: school, child, sport, education (process), parent, program, education (system), athlete, kindergarten, financing
Topic 13: animal, directive, (food) product, being, crop, power plant, energy, food, agriculture, law
Topic 14: passage, vaccination, track, train, VAT, harmonisation, service, railway, transport, book
Topic 15: budget, supplementary budget, tax, million, euro, paycheck, means, billion, municipality, income
Topic 16: worker, work, store, supplement, Sunday, employer, compensation, paycheck, student, family
Topic 17: water, border, migrant, human, migration, problem, area, country, land, nation
Topic 18: army, defence, soldier, member, minister, unit, interpellation, system, chief (female), commander
Topic 19: committee, member, organisation, report, ombudsman, medium, investigation, control, board, recommendation
Topic 20: accession, resolution, holiday, digitalisation, agriculture, homeland, technology, countryside, politics, return