=Paper=
{{Paper
|id=Vol-3133/paper05
|storemode=property
|title=Finnish Parliament on the Semantic Web: Using ParliamentSampo Data Service and Semantic Portal for Studying Political Culture and Language
|pdfUrl=https://ceur-ws.org/Vol-3133/paper05.pdf
|volume=Vol-3133
|authors=Eero Hyvönen,Petri Leskinen,Laura Sinikallio,Matti La Mela,Jouni Tuominen,Kimmo Elo,Senka Drobac,Mikko Koho,Esko Ikkala,Minna Tamper,Rafael Leal,Joonas Kesäniemi
|dblpUrl=https://dblp.org/rec/conf/dhn/HyvonenLSMTEDKI22
}}
==Finnish Parliament on the Semantic Web: Using ParliamentSampo Data Service and Semantic Portal for Studying Political Culture and Language==
Eero Hyvönen (1,2), Petri Leskinen (1,2), Laura Sinikallio (2,1), Matti La Mela (2),
Jouni Tuominen (1,2), Kimmo Elo (3), Senka Drobac (2,1), Mikko Koho (1,2), Esko Ikkala (1),
Minna Tamper (1,2), Rafael Leal (1,2) and Joonas Kesäniemi (1)
(1) Semantic Computing Research Group (SeCo), Department of Computer Science, Aalto University, Finland
(2) Helsinki Centre for Digital Humanities (HELDIG), University of Helsinki, Finland
(3) Centre for Parliamentary Studies, University of Turku, Finland
Abstract
This paper introduces the system ParliamentSampo – Parliament of Finland on the Semantic Web, a
Linked Open Data (LOD) service, data infrastructure, and semantic portal for studying Finnish political
culture, language, and networks of the Members of Parliament (MP). The article presents the vision
behind the system and the LOD service, and explores the possibilities of using it in research and application
development. A knowledge graph of linked data has been created based on ca. 962 000 speeches in all
plenary sessions of the Parliament of Finland in 1907–2021; the data is also available in XML format,
utilizing the new international Parla-CLARIN format. For the first time, the entire time series of the
Finnish parliamentary speeches has been converted into data and a data service in a unified format. In
addition, the speeches have been interlinked with another knowledge graph created from the database
of the MPs and enriched from other data sources into a broader ontology-based data service. The paper
shows how the LOD service SPARQL endpoint can be used to research parliamentary culture, the use
of political language, and networks of politicians through data analysis. The service endpoint can
also be used to develop applications for different user groups without programming skills, such as the
ParliamentSampo semantic portal, which is also introduced in the paper. This application aims to make political
decision making more transparent to the general public, media, politicians, and other end users.
Keywords
parliamentary studies, semantic portals, linked data, digital humanities
1. Introduction
The main tasks of parliaments are to enact new laws, oversee the work of the government,
and decide on the state budget; how the parliament works in Finland is documented in [1].
Parliamentary data are used in many areas of research [2], as they provide a wealth of information
on the state and functioning of democratic systems, political life and, more generally, language
and culture. The most prominent part of the work of parliaments is the public plenary sessions, in
which the Members of Parliament (MP) discuss and vote on issues on the agenda and other topics
Digital Parliamentary Data in Action (DiPaDa 2022) workshop, Uppsala, Sweden, March 15, 2022.
eero.hyvonen@aalto.fi (E. Hyvönen)
https://seco.cs.aalto.fi/u/eahyvone/ (E. Hyvönen)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
that arise. Parliaments draw up minutes of plenary sessions and make both the minutes and the
documents on which they are based available to the public. Openness and transparency in
the work of parliaments are important for the voters, media, researchers, and also parliaments
themselves: based on open data, they can look at the decision-making stages, views, and actions
expressed by parliamentarians in their work as legislators.
This paper argues, inspired by [3, 4], for publishing and using parliamentary data in Digital
Humanities (DH) research based on Semantic Web (SW) technologies1 and Linked Data (LD)
[5]. The LD approach for Cultural Heritage [6] has arguably many advantages: 1) Linked data
and ontologies [7] provide a framework for harmonizing heterogeneous distributed datasets
and combining them into larger and richer entities. 2) The SW is based on predicate logic
[8], which provides an opportunity to enrich data by inferring new information. 3) When the
machine “understands” the content of the data, intelligent web services and data analyses can
be implemented more easily. 4) Ready-made tools by other actors can be re-used for publishing,
processing and analysing the standardized data; the wheel doesn’t need to be reinvented.
In this paper, we test and demonstrate the above arguments in DH research on parliamentary
culture and language [2] by presenting the ParliamentSampo system, a Linked Open Data
(LOD) corpus and data service of Finnish parliamentary data and a semantic portal on top of it2 .
The paper presents the vision and first results of ParliamentSampo, extending our earlier
papers on creating the knowledge graphs for the speeches [9] and MP networks [10] and a
Finnish presentation on the project [11].
The paper first reviews related research on parliamentary data (Section 2). In Section 3, our
vision of publishing and using Finnish parliamentary linked data on the SW is presented. After
this, first results obtained in developing and using the ParliamentSampo system in different
ways are presented (Section 4). In conclusion, results of our work are summarized and using
parliamentary data in research is considered on a more general level (Section 5).
2. Related Work on Parliamentary Data
Large amounts of parliamentary material have been digitized in recent decades, arguably second only
to newspapers [12]. For example, the Royal Library of Sweden has digitized Swedish printed
parliamentary documents from 1521 to 1970. This collection3 is supplemented by the parliament’s
own digital materials and, e.g., by the Westac research project4 at Umeå University.
Digitization has improved the accessibility and usability of parliamentary materials for both
the public and the research community. Websites have been created that make it easy for users
to browse and download materials. Examples include the website of the Lipad project5 [13] that
digitized Canadian parliamentary materials, and the portal of the Italian House of Representatives6
that comprehensively presents the history of the Italian parliament in 1848–2018.
1 https://www.w3.org/standards/semanticweb/
2 See the project homepage for more details, videos, and publications: https://seco.cs.aalto.fi/projects/semparl/en/.
3 http://data.riksdagen.se
4 https://www.westac.se
5 https://lipad.ca
6 https://storia.camera.it
Several parliamentary corpora have been formed from the minutes of the plenary debates,
which make it possible to study the content of the speeches and their language; see, e.g., [14] and
the CLARIN list of parliamentary corpora7 . The TEI-based Parla-CLARIN scheme8 for session
minutes has been developed within the CLARIN infrastructure, providing a common way to
represent the corpora [15]. The related ParlaMint project9 brings together Parla-CLARIN-based
national corpora. Parliamentary materials have also been transformed into the form of LD when
creating the LinkedEP [3] system on the European Parliament’s data, the Italian Parliament10 ,
and the LinkedSaeima for the Latvian parliament [4].
The materials of the Parliament of Finland (PoF) have been digitized in various contexts but
are difficult to use, as they have been produced separately for different periods and stored in
different formats [9]. The usability of the materials is also hampered by their varying quality
and lack of descriptive data [16]. Language corpora have been published on parliamentary
debates, such as the Parliamentary Corpus of FIN-CLARIN’s Language Bank11 [17] which
covers the years 2008–2016. It contains the speeches in a linguistically annotated form and
also synchronized links to original plenary session videos [18]. The Voices of Democracy
project has produced a research corpus that includes plenary minutes in 1980–2018 annotated
grammatically as well as interviews of veteran MPs conducted by the PoF after 1988 [12]. The
minutes of the parliamentary debates from 1991 to 2015 can also be found in the International
Harvard Parlspeech Corpus [19], but we have identified gaps in the coverage in this corpus.
Digitized parliamentary materials offer a wide range of perspectives on different research
topics and have been used in a variety of fields, such as linguistics, political science, media
studies, economics, and history. The most important research material is the debates in
the parliaments, through which one can study language and its changes as well as
the underlying societal phenomena at large [20]. Metadata makes it possible to structure the
speeches, for example, by party, gender, or professional group. Blaxill and Beelen [21]
have examined the content of women’s parliamentary speeches, as well as the role of gender in
the speeches of MPs in the British Parliament. Parliamentary debates have been used in thematic
or conceptual analyses (cf., e.g., [22, 23, 24, 25, 26]) and to study the language and the opinions
of the parties or MPs (e.g., [27, 28]). Parliamentary debates have been used in translation studies
using, for example, the EuroParl Corpus12 of the European Parliament debates.
The digitized material of the Finnish Parliament has been utilized to some extent in digital
humanities and social scientific research. La Mela [16] and Kettunen and La Mela [26] have
studied the history of the concept of Everyman’s right, a Nordic right of public access to nature,
with the digitized minutes of the Parliament, and examined their quality in PDF format. The
digitized minutes have been utilized in the development of language technology methods, in this
case the Finnish Semantic Tagger [26]. Similarly, Andrushchenko et al. [12] have used their
grammatically structured corpus and a search tool to organize and analyze parliamentary debates
in various research cases. Simola [29] has examined the differences in political speech between
7 https://www.clarin.eu/resource-families/parliamentary-corpora
8 https://github.com/clarin-eric/parla-clarin
9 https://www.clarin.eu/content/parlamint-towards-comparable-parliamentary-corpora
10 http://data.camera.it
11 http://korp.csc.fi
12 https://www.statmt.org/europarl/
parties throughout the parliamentary period 1907–2018, for which she compiled a separate
research dataset combining the debates and the speaker data. Makkonen and Loukasmäki [30]
have studied the plenary speeches given in Parliament of Finland in 1999–2014 and their content
by using topic modeling. FIN-CLARIN’s Parliamentary Corpus has been used, for example, by
Lillqvist et al. [31] in their study on debates about public debt. Previous search applications for
Finnish parliamentary speech data are based mostly on traditional text search. Data analysis
tools for examining the results are few; one example is the concordance analysis of the Language Bank
Korp, where the words found are shown in their textual contexts together with some statistics
of word occurrences in the search results. These applications cover only a small part of the
entire time series of the Finnish parliamentary speeches.
3. ParliamentSampo Vision
The vision of the Semantic Parliament project [11] is to develop and implement, in a living
laboratory environment, the model shown in Fig. 1 for publishing and utilizing parliamentary
materials as LOD on the SW. The work focuses on two core datasets:
1. Minutes of Parliamentary Sessions. All Finnish parliamentary debates, totalling ca.
962 000 speeches and covering the existence of the PoF 1907–2021, have been transformed
into 1) a Linked Data knowledge graph and 2) Parla-CLARIN XML form [9].
2. Members of Parliament Data. A prosopographical knowledge graph has been created
for representing biographical data about all ca. 2800 Finnish MPs and other politicians
during the same time period (1907–2021) [10].
Figure 1: Vision: Linked Open Data publishing model of ParliamentSampo
The left side of Fig. 1 shows content providers that produce data related to the PoF in their
own local data silos, but in non-interoperable formats. For example, the LawSampo
system publishes Finnish legislation, the results of parliamentary discussions, and case law data
provided by the Ministry of Justice in Finland as the LOD service Semantic Finlex [32] and a
semantic portal13 [33].
In the middle of Fig. 1, the data is aggregated, harmonized, enriched, interlinked, and
published as a new FinnParla LOD service on the Linked Data Finland platform LDF.fi14 [34].
Its data model is based on 1) a new ontology describing the activities of the PoF and 2) a set of
related vocabularies and ontologies describing, for example, (historical) places15 , professions
[35], people, and organizations. Notice that many additional documents of the PoF processes,
such as propositions and bills, as well as voting data, could be interlinked with the ParliamentSampo
system in the future using its open data infrastructure.
The right side of Fig. 1 depicts the ways of utilizing the FinParla data service: 1) through a semantic
portal, 2) in research using computational tools, and 3) for developing new applications.
The knowledge graph of the parliamentary speeches (S-KG) (cf. Fig. 1) contains speeches
collected from all the minutes of the plenary sessions of the PoF since 1907. The S-KG was
compiled from several initial formats: 1) From 1907 until the middle of 1999, the minutes are
available only as scanned images embedded in PDF documents. This material was OCRed, with
minor manual corrections made. 2) From mid-1999 to the end of 2014, the material was available
in HTML format at the Parliament’s website16. 3) From 2015 onwards, the minutes are
available through the Finnish Parliament Open Data API17 in custom XML form. The data
quality of S-KG has been deemed satisfactory, although there were issues related to OCR errors
and the fact that there have been differences in how the transliteration and metadata of the
minutes have been produced in the PoF. The data model of S-KG and the data transformation
process are described in detail in [9].
The S-KG was interlinked to the MPs’ prosopographical knowledge graph P-KG (cf. Fig. 1). For
example, speakers and the parties they represent are resources with URI identifiers described
in the P-KG graph. The data publication about MPs is a knowledge graph (P-KG) covering all
MPs who have worked in Finland [10]. At its core is an RDF conversion of XML-formatted data
about MPs downloaded from the Open Data service18 of PoF. In addition to basic biographical
information, such as times and places of birth and death, the data includes detailed information
about the people’s life events, such as studying, working life, political career, and publications
written by the politicians.
The Finnish parliament’s open data source has been supplemented and enriched with infor-
mation extracted from the Finnish Government’s website19 and Wikidata: in addition to MPs,
some 200 other people with significant political history, such as presidents, ministers, and ombudsmen,
have been added into the knowledge graph. For example, Mauno Koivisto has served
as President and Prime Minister but never as an MP. The knowledge graph was also interlinked
with the BiographySampo system [36], yet another example of the mutually interlinked “Sampo”
13 LawSampo project: http://seco.cs.aalto.fi/projects/lawlod/
14 Linked Data Finland service online: https://ldf.fi/
15 https://seco.cs.aalto.fi/projects/histoplaces/
16 https://www.eduskunta.fi/FI/taysistunto/Sivut/Taysistuntojen-poytakirjat.aspx
17 https://avoindata.eduskunta.fi/#/fi/home
18 https://avoindata.eduskunta.fi/#/fi/dbsearch
19 https://valtioneuvosto.fi/hallitukset-ja-ministerit
systems and LOD infrastructure20 in use in Finland, which publishes biographies of ca. 13 600
significant Finnish persons as a LOD service and a semantic portal21, including biographies of
614 Finnish parliamentarians. The data model of P-KG, based on the CRM Bio extension [37] of
CIDOC CRM22 , is described in more detail in [10] including the transformation process of the
data sources into RDF. The transformation and linking could be done fairly accurately as the
primary data were already available in structured forms.
4. Using the ParliamentSampo LOD Service
The goal of the ParliamentSampo system is to provide the end users with flexible and rich
possibilities for searching, browsing, and analyzing the PoF data. The new possibilities are
offered by a standard SPARQL endpoint that supports 1) exporting the data for external use, 2) querying
the endpoint and studying the results, 3) data analysis using various tools and scripting, and
4) developing new external applications, such as the ParliamentSampo portal. These use
cases are explored next in more detail with examples.
4.1. Exporting the Data for External Use
A simple way for a researcher to use ParliamentSampo data is to download data from the data
service for local use and then apply one’s favourite tools for data analysis, such as spreadsheets,
the R23 environment for statistical analysis, or Gephi24 for network analysis. For filtering out
subsets of interest from the large dataset, SPARQL querying can be used in flexible ways. It is also
possible to install a local SPARQL server environment for linked data on one’s own computer,
for example Fuseki25, which is also used in the LDF.fi service. The materials in the LDF.fi service
are published using container technology (i.e., Docker26), which means that installing the data,
the server, and any versioned software packages is largely automated.
An example of using ParliamentSampo data externally is reported in [20]. For this case
study in political science, the Parla-CLARIN version was downloaded and a subset of the
speeches 1960–2020 was filtered out and analyzed further using custom XML-based tools. The
authors studied how the language used in discussing environmental politics has evolved in
Finland in the speeches of different parties. Eleven central environmental terms were selected
from a thesaurus27 used by the PoF library, speeches where these terms were used were then
extracted, and various quantitative analyses based on them were presented and compared with
the strategy plans of the parties with qualitative interpretations. The analyses showed, for
example, a constantly increasing intensity of environmental debates and a rhetorical shift of
language from protecting nature to issues of climate change.
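The term-based filtering step described above can be sketched roughly as follows. This is not the tooling used in [20]; it is a minimal illustration that assumes the speeches are encoded as TEI `<u>` (utterance) elements with a `who` attribute, as in the Parla-CLARIN scheme. The in-line sample document, the speaker identifiers, and the two term stems are invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI = "{http://www.tei-c.org/ns/1.0}"  # TEI namespace used by Parla-CLARIN

# Hand-made sample standing in for a downloaded file (hypothetical content).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div>
    <u who="#speaker_a"><seg>Ilmastonmuutos vaatii toimia.</seg></u>
    <u who="#speaker_b"><seg>Budjetti on tasapainossa.</seg></u>
    <u who="#speaker_a"><seg>Luonnonsuojelu ja ilmastonmuutos liittyvät yhteen.</seg></u>
  </div></body></text>
</TEI>"""

# Illustrative stems only, not the eleven terms selected in [20].
TERMS = ["ilmastonmuuto", "luonnonsuojelu"]

def matching_speeches(xml_text, terms):
    """Yield (speaker, text) for every utterance that mentions at least one term."""
    root = ET.fromstring(xml_text)
    for u in root.iter(TEI + "u"):
        # Collect all text inside the utterance and normalise whitespace.
        text = " ".join("".join(u.itertext()).split())
        if any(term in text.lower() for term in terms):
            yield u.get("who"), text

hits = list(matching_speeches(SAMPLE, TERMS))
per_speaker = Counter(who for who, _ in hits)
```

With real data, the same counts could be grouped by year or party instead of speaker, provided that the corresponding metadata is read from the file headers.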
20 LOD Infrastructure for Digital Humanities in Finland (LODI4DH): https://seco.cs.aalto.fi/projects/lodi4dh/
21 BiographySampo portal is available at https://biografiasampo.fi/.
22 https://cidoc-crm.org
23 https://www.r-project.org
24 https://gephi.org
25 https://jena.apache.org/documentation/fuseki2/
26 https://www.docker.com
27 EKS Subject Headings: https://www.eduskunta.fi/kirjasto/EKS/index.html?kieli=en
4.2. Querying the Endpoint and Studying Results
Figure 2: Number of speeches in different languages (y-axis) on the timeline (x-axis).
SPARQL is a flexible way to query RDF data. The search result is presented in a tabular format
that can be examined as it is, visualized, and used for application-specific analyses. For
example, Fig. 2 shows a visualization of the number of speeches (y-axis) in the S-KG graph by
language on a timeline from 1907 to 2021 (x-axis). Speeches in Finnish (’FI’ in the figure) have
clearly been the most numerous since the beginning (’Kaikki’ in the figure denotes all the speeches).
In the early years, there were more speeches in Swedish (’SV’ in the figure) than today, but their
number has remained very small throughout. The graphic was created using the YASGUI editor28 [38], which
can be used to edit SPARQL queries, target them to an online SPARQL endpoint, and show
the results using pre-implemented visualizations.
SPARQL is an expressive and flexible way to retrieve information from graph data, and
it is suitable for use by DH researchers. The SPARQL query used to generate Fig. 2 is shown
below:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> # For shortening URIs
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX semparls:
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?year (COUNT(?fin) AS ?FI) (COUNT(?swe) AS ?SV) # Variables in the result
(COUNT(?document_URI) AS ?ALL) WHERE {
?document_URI a semparls:Speech . # Graph pattern matched
?document_URI ?dateTime .
BIND(STR(YEAR(?dateTime)) AS ?year)
{
BIND( as ?swe)
?document_URI dct:language ?swe .
} UNION {
BIND( as ?fin)
?document_URI dct:language ?fin .
}
} GROUP BY ?year ORDER BY ASC(?year) # Grouping and ordering results yearly
The query above first introduces the namespaces used (PREFIX); they make the
URI references in the query syntactically shorter and simpler. In the SELECT part of the
query that follows, all speeches and their languages are retrieved using a graph pattern formed by variables
starting with ?, which are matched against the endpoint graph in all possible ways. The answer to
the query is a table of all possible value assignments for the variables that make the query
pattern match the underlying data. The results are finally grouped by year (GROUP BY),
sorted by year (ORDER BY), and counted (COUNT) to give the number of
speeches in Finnish, in Swedish, and in total. In the visualization, the variable ?year
forms the x-axis and the y-axis presents the annual number of speeches in different languages.
28 https://yasgui.triply.cc
When the speech graph was created, language recognition of speeches was done automatically.
Typically this could be done accurately. However, OCR errors, for example, can sometimes
make language recognition difficult, and therefore speeches whose language code could not be
identified were excluded automatically from the query result.
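The query mechanism described above can also be exercised from a script. The following is a minimal sketch using only the Python standard library; it assumes that the endpoint follows the standard SPARQL 1.1 protocol and returns the standard JSON result format. The endpoint address, the `semparls` namespace URI, and the use of `dct:date` for the speech date are assumptions made for this illustration and should be checked against the project documentation. The example result at the end uses made-up values.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint address (not confirmed by the text above).
ENDPOINT = "https://ldf.fi/semparl/sparql"

# Simplified query: yearly speech counts. The semparls namespace URI and
# the dct:date property are assumptions of this sketch.
QUERY = """
PREFIX semparls: <http://ldf.fi/schema/semparl/>
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?year (COUNT(?speech) AS ?n) WHERE {
  ?speech a semparls:Speech ;
          dct:date ?date .
  BIND(STR(YEAR(?date)) AS ?year)
} GROUP BY ?year ORDER BY ?year
"""

def run_query(endpoint, query):
    """POST a SPARQL query and return the decoded JSON result document."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint, data=data,
        headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)

def to_rows(result):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    variables = result["head"]["vars"]
    return [{v: b[v]["value"] for v in variables if v in b}
            for b in result["results"]["bindings"]]

# Offline example of the standard result shape (values are invented).
example = {"head": {"vars": ["year", "n"]},
           "results": {"bindings": [
               {"year": {"type": "literal", "value": "1907"},
                "n": {"type": "literal", "value": "12"}}]}}
rows = to_rows(example)
```

Against a live endpoint, the call would be `rows = to_rows(run_query(ENDPOINT, QUERY))`; the resulting rows can then be passed to any plotting or statistics library.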
4.3. Data-analysis by Scripting
Figure 3: Annual starting age of new MPs and relative proportion of women.
The PoF data can be examined computationally, for example, using Python scripting and
Jupyter notebooks in the Google Colab29 environment. One can then use plain HTTP
requests to perform SPARQL queries and analyze and visualize the query results using
tools provided by the programming environment, e.g., Python libraries.
For example, Fig. 3 shows the ages of persons elected as MPs for the first time each year [10].
In the figure, the blue solid line shows the age of all MPs, and the age of women is shown in red.
It can be seen from the graph that the starting age has remained almost constant throughout
the history of the parliament; since 1980, however, women have for some time been younger than
men when starting as MPs. The relative proportion of women in the PoF
is shown by a black dotted line. Before the 1960s, the proportion remained at an average of
29 https://colab.research.google.com
10%, but has since risen to 30–50%. The graphics in the image were implemented in Google
Colab using standard Python libraries for data analysis.
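The kind of aggregation behind such a figure can be sketched as follows. The records are invented, and the layout (year first elected, birth year, gender) is an assumption about what a query against the MP knowledge graph might return. As a simplification, the share of women is computed here among the newly elected MPs of each year rather than among all sitting MPs.

```python
from collections import defaultdict
from statistics import mean

# Invented records: (year first elected, birth year, gender).
RECORDS = [
    (1907, 1870, "M"), (1907, 1875, "F"), (1907, 1860, "M"),
    (1983, 1945, "F"), (1983, 1950, "F"), (1983, 1940, "M"),
]

def yearly_stats(records):
    """Return {year: (mean starting age, mean starting age of women, share of women)}."""
    by_year = defaultdict(list)
    for elected, born, gender in records:
        by_year[elected].append((elected - born, gender))
    stats = {}
    for year, entries in sorted(by_year.items()):
        ages = [age for age, _ in entries]
        women = [age for age, g in entries if g == "F"]
        stats[year] = (
            mean(ages),
            mean(women) if women else None,  # no women elected that year
            len(women) / len(entries),
        )
    return stats

stats = yearly_stats(RECORDS)
```

The resulting dictionary maps each year to three series that can be drawn as separate lines with any Python plotting library.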
Figure 4: Table of correlations between parties (y-axis) and MP professions (x-axis) [10]
Figure 4 shows a similarly formed tabular visualization of the correlation between the parties
and the occupations of the MPs. Here only the most popular parties and occupations over
the entire history of PoF are considered. The parties are presented in the horizontal rows of
the table and the number of representatives of each profession is indicated in the vertical column
corresponding to the occupation. The matrix shows, for example, that in the Centre Party,
the National Coalition Party, and the Swedish People’s Party the most common occupation is
Farmer. On the other hand, Entrepreneur has been the most common occupation in the
Finns Party.
The same visualization components can be reused in different contexts. For example, the
matrix visualization of Fig. 4 is re-used in Fig. 5 for analyzing interruptions of speeches of the
current PoF. The y-axis lists the most active speakers and the x-axis the MPs that have interrupted
their speeches. For example, of the interrupted speeches of MP Annika Saarikko (Centre Party),
the current Minister of Finance, 46% are due to MP Ben Zyskowicz, representing the
National Coalition Party in opposition, and 18% to MP Jukka Gustafsson, representing the
SDP, a party in the government, which possibly indicates differing opinions inside the government.
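Both matrices can be thought of as cross-tabulations of pairs of values. A minimal sketch of this idea, with invented placeholder pairs, is given below; the first function counts how often each column value occurs for each row value (as in Fig. 4), and the second converts the counts into percentages of each row total (as in the interruption shares quoted above). The function and variable names are hypothetical.

```python
from collections import Counter, defaultdict

# Invented (row, column) observations, e.g. (party, profession) or
# (interrupted speaker, interrupter); the names are placeholders.
PAIRS = [("A", "X"), ("A", "X"), ("A", "Y"), ("B", "Y")]

def cross_tabulate(pairs):
    """Count how often each column value occurs for each row value."""
    table = defaultdict(Counter)
    for row, column in pairs:
        table[row][column] += 1
    return table

def row_percentages(table):
    """Convert the counts of each row into percentages of the row total."""
    return {
        row: {col: 100 * n / sum(counts.values()) for col, n in counts.items()}
        for row, counts in table.items()
    }

table = cross_tabulate(PAIRS)
shares = row_percentages(table)
```

Because the two steps are independent of what the rows and columns denote, the same code serves both use cases, which mirrors the reuse of the visualization component described above.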
Figure 5: Table of correlations that indicate how the most active speakers (y-axis) of the current PoF
have been interrupted by other MPs (x-axis).
4.4. Using the ParliamentSampo Portal
The ParliamentSampo portal, based on the Sampo model [39] and the Sampo-UI framework
[40], demonstrates how the FinnParla data service can be used for developing applications for
DH research. In the portal, the data can be filtered using faceted search [41] based on ontologies,
and the results can then be analyzed with the help of seamlessly integrated visualization and
data analysis tools. The data can be accessed through two application views for studying 1)
speeches and 2) MPs. For example, in Fig. 6, the user has selected the Plenary Speeches view,
which shows the search facets Content, Speaker, Party, (Speech) Type, Language, and Date on
the left. The search result, i.e., the speeches found, is shown by default in tabular form on the
right. The user has written a query “suomettum*” in the Content text facet, in which case only
speeches that contain the word “suomettuminen” (Finlandization) in its various inflectional
forms have been filtered into the search result, as the wildcard “*” matches any string. The
user has also limited the result on the Date facet to speeches given since June 4, 1945, when
Parliament began to convene after World War II. The result in this case is 177 speeches,
shown in a table (with paging). Selecting the “Timeline” tab visualizes the yearly number of speeches
as a function of time.
In faceted search, the filtering selections can be made flexibly in any order, and the search
engine calculates a hit count for each subsequent facet selection, which tells how many results
would be obtained in the result set if the selection in question is made next. For example, in
Figure 6: Using faceted search to filter out speeches of interest.
the Speaker facet, a click on “Junnila, Tuure (1910-1999) [7]” selects MP Tuure Junnila’s seven
speeches that mention “Finlandization”. The selection facets are created automatically using the
parliamentary ontology and knowledge graphs of the FinParla data. The hit count allows the
user to be directed to selections that do not lead to dead ends where the result set is empty. In
addition, the hit numbers provide an opportunity to investigate the result set statistically along
different facet dimensions. For example, a click on the pie symbol of the Speaker facet opens the
pie chart of Fig. 7, which shows how the speeches mentioning “Finlandization” are distributed among
the speakers. The most active MPs in this case are Mr. Georg C. Ehrnrooth (21 speeches) and Mr.
Ben Zyskowicz (19 speeches), two active right-wing politicians concerned with the concept.
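The interplay of wildcard text search, facet selections, and hit counts can be illustrated with a small in-memory sketch. This is not how the portal is implemented (it operates on the SPARQL endpoint via the Sampo-UI framework); the mini-corpus, the field names, and the function names below are invented for the illustration.

```python
import fnmatch
from collections import Counter

# Invented mini-corpus; the field names are placeholders.
SPEECHES = [
    {"speaker": "MP 1", "party": "P1", "text": "Suomettuminen oli aikansa ilmiö."},
    {"speaker": "MP 1", "party": "P1", "text": "Suomettumisen aika on ohi."},
    {"speaker": "MP 2", "party": "P2", "text": "Puhuttiin suomettumisesta."},
    {"speaker": "MP 2", "party": "P2", "text": "Talousarvio hyväksyttiin."},
]

def text_matches(text, pattern):
    """True if any word of the text matches the wildcard pattern, ignoring case."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return any(fnmatch.fnmatchcase(w, pattern.lower()) for w in words)

def filter_speeches(speeches, pattern=None, **selections):
    """Apply the text facet and any exact-value facet selections."""
    result = []
    for s in speeches:
        if pattern and not text_matches(s["text"], pattern):
            continue
        if all(s[facet] == value for facet, value in selections.items()):
            result.append(s)
    return result

def hit_counts(result, facet):
    """Hit count per facet value: the result size if that value were selected next."""
    return Counter(s[facet] for s in result)

result = filter_speeches(SPEECHES, pattern="suomettum*")
speaker_counts = hit_counts(result, "speaker")
```

Since only values with a non-zero count appear in `speaker_counts`, offering exactly these values to the user avoids selections that would lead to an empty result set; the same counts are the input for a pie chart of the facet.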
In accordance with the Sampo model, a number of pre-implemented data analysis tools and
visualizations, similar to those shown in the figures above, can be integrated into the application
perspectives of the ParliamentSampo portal. In the future, the tools and visualizations will be
available alongside the table visualization in Fig. 6 on their own tabs in the same way as, for example,
in the AcademySampo’s user interface [42]; the components of the Sampo-UI framework [40]
are reused in the implementation of both portals. Through these tools and visualizations, the
project explores the potential of Artificial Intelligence for knowledge discovery in DH research
[43], i.e., how could ParliamentSampo assist a researcher in finding research problems, in
solving them, and also in explaining solutions?
5. Discussion
In the context of political research, the parliamentary speech is considered an important form of
political communication and political struggle. A parliamentary speech is not just any speech,
Figure 7: Speeches containing the word ”suomettuminen” in any form as a pie chart, calculated
according to the distribution of MPs on the Speaker facet. The speakers are listed on the right.
but has its own structure and its own rules, which at the same time reflect the general position
of the parliament. In addition, a parliamentary speech is an instrument of political struggle to
expose competing goals, challenge the views of an adversary, and unlock deadlocked settings.
Thus, a speech in a parliament is always also a political act, in which the words used are the
weapons of political decision-making and which not only tell about the issues under discussion,
but also reveal the different positions, values, and points of view of the speakers. [44]
Traditionally, parliamentary speeches have been studied by close reading and using content
analysis, discourse analysis, and various methods of rhetorical research. However, digitalization
has also entered this traditional area of research more and more, as data on parliamentary
debates in various countries have become increasingly available in the form of open data. In
the case of the PoF, the digitization of parliamentary documents has progressed at a reasonable
speed, and some of the material has also been available through the parliament’s open data
service. The availability and the quality of the available data have improved in
recent years, but there are still significant differences with similar data in different countries.
The work on ParliamentSampo is an important step in utilizing the plenary debates in PoF
as part of the field of humanities research. Although the materials have always been available
to researchers for manual use and, for some years, also electronically as digitized PDF documents, the
machine-readable data corpus now being prepared and published as a data service, together
with the ParliamentSampo portal, will integrate parliamentary plenary debates and other
open materials into the DH and national information infrastructure. This means in practice,
for example, the opportunity for political scientists, historians, and linguists to extract, model,
analyze, and visualize parliamentary speech through exploratory research, using a vast body of
data covering the entire period of the modern PoF since 1907.
The possibility of exploratory data analysis opens up completely new possibilities and per-
spectives for the study of parliamentary speech. In traditional close reading, the researcher is
forced to delimit the material strongly already at the collection stage, which usually happens
through either temporal or thematic delimitation – that is, either by focusing on a limited time
period or on limited themes. Digital methods make it possible to study the material without
such limitations, and thus to examine it, for example, with fully automatic or semi-automatic
classification methods. In this way, it may be possible to find, for example, new themes and
topics that have been sidelined in research in the past (cf., e.g., [45, 46]).
On the other hand, distant reading and classification of data without strong presuppositions
also allow for a critical examination of previous research results, since the themes and topics
generated by distant reading can be compared with the results obtained by other methods [47].
Another example of the possibilities offered by data is research on the language of politics
and its long-term change (e.g., [48, 49, 50, 51, 52, 53]). Parliamentary big data enables large-scale
and systematic application of language technology methods. Although parliamentary speech is
linguistically a special form of speech of its own, it also changes over time and
thus reflects both wider linguistic development and the social atmosphere of discussion, including
the word choices made in it [30]. At the same time, the extensive data offer an opportunity to
study changes in language use, for example, whether the climate of public debate has become polarized or
“brutalized”, as politicians and media actors have repeatedly suggested in recent years.
The third opportunity offered by parliamentary data relates to linking the use of language
more broadly to other social contexts of language users, such as education, age, and social
networks. Language can also be approached in policy research on the assumption that language
always reflects the wider world of values and ideas of its user, as well as his or her social status
and context. Discursive coalitions, which can be constructed based on the language use of the
speakers, thus offer an interesting opportunity to detach oneself from the frame of reference
set by, for example, the party background and to focus analytical attention on networks built
through the use of language. In previous studies, this type of approach has made it possible to connect
experts to different ideological positions by analyzing the content of their texts [54], and we
think it can be applied well to the classification of MPs.
The examples above illustrate how parliamentary data
could enable significant new openings in parliamentary research. However,
in the spirit of exploratory data analysis, it is also worth highlighting the as-yet-unknown possibilities
that gradually emerge as researchers begin to outline new hypotheses and research questions
by examining and analyzing data. Large datasets are surprising in their potential,
which on the one hand requires an open-minded attitude towards the data and on the other
hand underscores the growing responsibility of researchers working in data analysis. When it
is no longer possible for the researcher to know the material he or she is using thoroughly, he
or she must know the phenomena that are the subject of the material thoroughly. Only in this
way is it possible to assess which findings obtained through data mining, analysis, modeling,
or visualization are truly relevant.
Acknowledgements. Our work is funded by the Academy of Finland and is also related to
the EU project InTaVia30 and the EU COST action Nexus Linguarum31. The project uses the
computing resources of the CSC – IT Center for Science.
References
[1] M. Hidén, H. Honka-Hallila, Miten eduskunta toimii (How Parliament of Finland works),
Edita Publishing, Helsinki, 2006.
[2] C. Benoît, O. Rozenberg (Eds.), Handbook of Parliamentary Studies: Interdisciplinary Ap-
proaches to Legislatures, Edward Elgar Publishing, 2020. doi:10.4337/9781789906516.
[3] A. Van Aggelen, L. Hollink, M. Kemman, M. Kleppe, H. Beunders, The debates of the
European Parliament as Linked Open Data, Semantic Web – Interoperability, Usability,
Applicability 8 (2017) 271–281. doi:10.1007/s42001-019-00060-w.
[4] U. Bojārs, R. Darģis, U. Lavrinovičs, P. Paikens, LinkedSaeima: A linked open
dataset of Latvia’s parliamentary debates, in: Semantic Systems. The Power of AI
and Knowledge Graphs. SEMANTiCS 2019, Springer, 2019, pp. 50–56.
doi:10.1007/978-3-030-33220-4_4.
[5] T. Heath, C. Bizer, Linked Data: Evolving the Web into a Global Data Space (1st edition),
Synthesis Lectures on the Semantic Web: Theory and Technology, Morgan & Claypool,
2011. URL: http://linkeddatabook.com/editions/1.0/.
[6] E. Hyvönen, Publishing and Using Cultural Heritage Linked Data on the Semantic Web,
Synthesis Lectures on the Semantic Web: Theory and Technology, Morgan & Claypool,
Palo Alto, CA, USA, 2012.
[7] S. Staab, R. Studer (Eds.), Handbook on Ontologies (2nd Edition), Springer, 2009.
[8] P. Hitzler, M. Krötzsch, S. Rudolph, Foundations of Semantic Web technologies, Springer,
2010.
[9] L. Sinikallio, S. Drobac, M. Tamper, R. Leal, M. Koho, J. Tuominen, M. La Mela, E. Hyvönen,
Plenary debates of the Parliament of Finland as linked open data and in Parla-CLARIN
markup, in: 3rd Conference on Language, Data and Knowledge, LDK 2021, Schloss
Dagstuhl – Leibniz-Zentrum für Informatik GmbH, Dagstuhl Publishing, 2021, pp. 1–17.
URL: https://drops.dagstuhl.de/opus/volltexte/2021/14544/pdf/OASIcs-LDK-2021-8.pdf.
[10] P. Leskinen, E. Hyvönen, J. Tuominen, Members of Parliament in Finland knowledge graph
and its linked open data service, in: Proceedings of the 17th International Conference on Semantic
Systems, 6-9 September 2021, Amsterdam, The Netherlands, 2021, pp. 255–269. URL:
https://ebooks.iospress.nl/volumearticle/57420. doi:10.3233/SSW210049.
[11] E. Hyvönen, L. Sinikallio, P. Leskinen, S. Drobac, J. Tuominen, K. Elo, M. La Mela, M. Koho,
E. Ikkala, M. Tamper, R. Leal, J. Kesäniemi, Parlamenttisampo: eduskunnan aineistojen
linkitetyn avoimen datan palvelu ja sen käyttömahdollisuudet, Informaatiotutkimus 40
(2021). URL: https://doi.org/10.23978/inf.107899.
30 https://intavia.eu
31 https://nexuslinguarum.eu
[12] M. Andrushchenko, K. Sandberg, R. Turunen, J. Marjanen, M. Hatavara, J. Kurunmäki,
T. Nummenmaa, M. Hyvärinen, K. Teräs, J. Peltonen, J. Nummenmaa, Using parsed and
annotated corpora to analyze parliamentarians’ talk in Finland, Journal of the Association
for Information Science and Technology 185 (2021) 1–15. doi:10.1002/asi.24500.
[13] K. Beelen, T. A. Thijm, C. Cochrane, K. Halvemaan, G. Hirst, M. Kimmins, S. Lijbrink,
M. Marx, N. Naderi, L. Rheault, R. Polyanovsky, T. Whyte, Digitization of the Canadian
parliamentary debates, Canadian Journal of Political Science 50 (2017) 849–864.
doi:10.1017/S0008423916001165.
[14] E. Lapponi, M. G. Søyland, E. Velldal, S. Oepen, The Talk of Norway: a richly annotated
corpus of the Norwegian parliament, 1998–2016, Language Resources and Evaluation 52 (2018)
873–893. doi:10.1007/s10579-018-9411-5.
[15] A. Pančur, T. Erjavec, The siParl corpus of Slovene parliamentary proceedings, in:
Proceedings of the Second ParlaCLARIN Workshop, European Language Resources Association,
2020, pp. 28–34. URL: https://www.aclweb.org/anthology/2020.parlaclarin-1.6.
[16] M. La Mela, Tracing the emergence of Nordic allemansrätten through digitised
parliamentary sources, in: M. Fridlund, M. Oiva, P. Paju (Eds.), Digital histories: Emergent
approaches within the new digital history, Helsinki University Press, 2020, pp. 181–197.
doi:10.33134/HUP-5-11.
[17] M. Lennes, FIN-CLARIN and Language Bank parliamentary data, Workshop “Digital
parliamentary data and research”, 2019. URL:
https://www2.helsinki.fi/en/helsinki-centre-for-digital-humanities/workshop-digital-parliamentary-data-and-research.
[18] A. Mansikkaniemi, P. Smit, M. Kurimo, Automatic construction of the Finnish
parliament speech corpus, in: Proc. Interspeech 2017, 2017, pp. 3762–3766.
doi:10.21437/Interspeech.2017-1115.
[19] C. Rauh, P. De Wilde, J. Schwalbach, The ParlSpeech data set: Annotated full-text vectors
of 3.9 million plenary speeches in the key legislative chambers of seven European states
(V1), 2017. doi:10.7910/DVN/E4RSP9.
[20] K. Elo, J. Karimäki, Luonnonsuojelusta ilmastopolitiikkaan: Ympäristöpoliittisen
käsitteistön muutos parlamenttipuheessa 1960–2020, Politiikka 63 (2021).
URL: https://journal.fi/politiikka/article/view/109690. doi:10.37452/politiikka.109690.
[21] L. Blaxill, K. Beelen, A feminized language of democracy? The representation of women
at Westminster since 1945, Twentieth Century British History 27 (2016) 412–449.
doi:10.1093/tcbh/hww028.
[22] K. Quinn, B. Monroe, M. Colaresi, M. H. Crespin, D. R. Radev, How to analyze political
attention with minimal assumptions and costs, American Journal of Political Science 54
(2010) 209–228. doi:10.1111/j.1540-5907.2009.00427.x.
[23] H. Baker, B. V., M. T., Digitization of the Canadian parliamentary debates, in:
T. Säily, A. Nurmi, M. Palander-Collin, A. Auer (Eds.), Exploring future paths for historical
sociolinguistics, John Benjamins, Amsterdam, 2017, pp. 83–107.
doi:10.1017/S0008423916001165.
[24] J. Guldi, Parliament’s debates about infrastructure: An exercise in using dynamic topic
models to synthesize historical change, Technology and Culture 60 (2019) 1–33.
doi:10.1353/tech.2019.0000.
[25] P. Ihalainen, A. Sahala, Evolving conceptualisations of internationalism in the UK
parliament: Collocation analyses from the League to Brexit, in: M. Fridlund, M. Oiva, P. Paju
(Eds.), Digital histories: Emergent approaches within the new digital history, Helsinki
University Press, 2020, pp. 199–219. doi:10.33134/HUP-5-12.
[26] K. Kettunen, M. La Mela, Semantic tagging and the Nordic tradition of everyman’s rights,
Digital Scholarship in the Humanities (2021). doi:10.1093/llc/fqab052.
[27] G. Abercrombie, R. Batista-Navarro, Sentiment and position-taking analysis of
parliamentary debates: a systematic literature review, Journal of Computational Social Science 3
(2020) 245–270. doi:10.1007/s42001-019-00060-w.
[28] M. Magnusson, R. Öhrvall, K. Barrling, D. Mimno, Voices from the far right: a text analysis
of Swedish parliamentary debates, SocArXiv (2018). doi:10.31235/osf.io/jdsqc.
[29] S. Simola, A century of partisanship in Finnish political speech, 2020.
URL: https://sites.google.com/site/sallasimolaecon/home/research.
[30] K. Makkonen, P. Loukasmäki, Eduskunnan täysistunnon puheenaiheet 1999–2014: Miten
käsitellä LDA-aihemalleja?, Politiikka 61 (2019) 127–159.
URL: https://journal.fi/politiikka/article/view/77163.
[31] E. Lillqvist, I. K. Kavonius, M. Pantzar, “Velkakello tikittää”: Julkisyhteisöjen velka
suomalaisessa mielikuvastossa ja tilastoissa 2000–2020, Kansantaloudellinen Aikakauskirja 116
(2020) 581–607. URL: https://journal.fi/politiikka/article/view/77163.
[32] A. Oksanen, J. Tuominen, E. Mäkelä, M. Tamper, A. Hietanen, E. Hyvönen, Semantic
Finlex: Transforming, publishing, and using Finnish legislation and case law as linked
open data on the web, in: Knowledge of the Law in the Big Data Age, volume 317 of
Frontiers in Artificial Intelligence and Applications, IOS Press, 2019, pp. 212–228.
[33] E. Hyvönen, M. Tamper, E. Ikkala, S. Sarsa, A. Oksanen, J. Tuominen, A. Hietanen,
Publishing and using legislation and case law as linked open data on the semantic
web, in: The Semantic Web: ESWC 2020 Satellite Events, Springer, 2020, pp. 110–114.
doi:10.1007/978-3-030-62327-2_19.
[34] E. Hyvönen, J. Tuominen, M. Alonen, E. Mäkelä, Linked Data Finland: A 7-star model
and platform for publishing and re-using linked datasets, in: The Semantic Web: ESWC
2014 Satellite Events, Revised Selected Papers, Springer-Verlag, 2014, pp. 226–230. URL:
https://doi.org/10.1007/978-3-319-11955-7_24.
[35] M. Koho, L. Gasbarra, J. Tuominen, H. Rantala, I. Jokipii, E. Hyvönen, AMMO Ontology of
Finnish Historical Occupations, in: Proceedings of the First International Workshop
on Open Data and Ontologies for Cultural Heritage (ODOCH’19), volume 2375, CEUR
Workshop Proceedings, 2019, pp. 91–96. URL: http://ceur-ws.org/Vol-2375/.
[36] E. Hyvönen, P. Leskinen, M. Tamper, H. Rantala, E. Ikkala, J. Tuominen, K. Keravuori,
BiographySampo – publishing and enriching biographies on the semantic web for digital
humanities research, in: The Semantic Web. 16th International Conference, ESWC 2019,
Proceedings, Springer, 2019, pp. 574–589. doi:10.1007/978-3-030-21348-0.
[37] J. Tuominen, E. Hyvönen, P. Leskinen, Bio CRM: A data model for representing biographical
data for prosopographical research, in: Proceedings of the Second Conference on Biograph-
ical Data in a Digital World 2017 (BD2017), volume 2119, CEUR Workshop Proceedings,
2018, pp. 59–66. URL: http://ceur-ws.org/Vol-2119/paper10.pdf.
[38] L. Rietveld, R. Hoekstra, The YASGUI family of SPARQL clients, Semantic Web –
Interoperability, Usability, Applicability 8 (2017) 373–383. doi:10.3233/SW-150197.
[39] E. Hyvönen, Digital humanities on the Semantic Web: Sampo model and portal series,
Semantic Web – Interoperability, Usability, Applicability (2022). Accepted,
https://seco.cs.aalto.fi/publications/2021/hyvonen-sampo-model-2021.pdf.
[40] E. Ikkala, E. Hyvönen, H. Rantala, M. Koho, Sampo-UI: A full stack JavaScript framework
for developing semantic portal user interfaces, Semantic Web – Interoperability, Usability,
Applicability 13 (2022) 69–84. doi:10.3233/SW-210428.
[41] Y. Tzitzikas, N. Manolis, P. Papadakos, Faceted exploration of RDF/S datasets: a survey,
Journal of Intelligent Information Systems 48 (2017) 329–364.
[42] E. Hyvönen, P. Leskinen, H. Rantala, E. Ikkala, J. Tuominen, Akatemiasampo-portaali ja
-datapalvelu henkilöiden ja henkilöryhmien historialliseen tutkimukseen, Informaatio-
tutkimus 40 (2021) 28–56. URL: https://journal.fi/inf/article/view/102656/64169.
[43] E. Hyvönen, Using the Semantic Web in Digital Humanities: Shift from Data Publishing to
Data-analysis and Serendipitous Knowledge Discovery, Semantic Web – Interoperability,
Usability, Applicability 11 (2020) 187–193. doi:10.3233/SW-190386.
[44] K. Palonen, Eduskunnasta puhekunnaksi? Parlamentarismi retorisena politiikkana, Politi-
ikka 47 (2005) 141–148.
[45] D. Mimno, Topic Regression, Ph.D. thesis, University of Massachusetts Amherst, 2012.
URL: https://scholarworks.umass.edu/open_access_dissertations/520.
[46] T. R. Tangherlini, P. Leonard, Trawling in the sea of the great unread: Sub-corpus topic
modeling and humanities research, Poetics 41 (2013) 725–749.
doi:10.1016/j.poetic.2013.08.002.
[47] T. Ylä-Anttila, V. Eranti, Aihemallinnuksesta kehysmallinnukseen, Politiikka 60 (2018)
148–156. URL: http://elektra.helsinki.fi/se/p/politiikka/60/2/aihemall.pdf.
[48] P. DiMaggio, M. Nag, D. Blei, Exploiting affinities between topic modeling and the
sociological perspective on culture: Application to newspaper coverage of U.S. Government
arts funding, Poetics 41 (2013) 570–606. doi:10.1016/j.poetic.2013.08.004.
[49] C. Jacobi, W. van Atteveldt, K. Welbers, Quantitative analysis of large amounts of
journalistic texts using topic modelling, Digital Journalism 4 (2016) 89–106.
doi:10.1080/21670811.2015.1093271.
[50] S. Purhonen, A. Toikka, “Big Datan” haaste ja uudet laskennalliset tekstiaineistojen
analyysimenetelmät: esimerkkitapauksena aihemallianalyysi tasavallan presidenttien
uudenvuodenpuheista 1935–2015, Sosiologia 53 (2016) 6–27.
URL: http://elektra.helsinki.fi/se/s/0038-1640/53/1/bigdatan.pdf.
[51] S.-M. Laaksonen, M. Nelimarkka, Omat ja muiden aiheet: Laskennallinen analyysi vaali-
julkisuuden teemoista ja aiheomistajuudesta, Politiikka 60 (2018) 132–147.
[52] A. Törnberg, P. Törnberg, Muslims in social media discourse: Combining topic modeling
and critical discourse analysis, Discourse, Context and Media 13 (2016) 132–142.
doi:10.1016/j.dcm.2016.04.003.
[53] J. B. Mountford, Topic modeling the red pill, Social Sciences 7 (2018).
doi:10.3390/socsci7030042.
[54] Z. Jelveh, B. Kogut, S. Naidu, Detecting latent ideology in expert text: Evidence from
academic papers in economics, in: Proceedings of the Conference on Empirical Methods
in Natural Language Processing (EMNLP), ACL, 2018, pp. 1804–1809.