<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Towards Topic Modeling Swedish Housing Policies: Using Linguistically Informed Topic Modeling to Explore Public Discourse</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Anna</forename><surname>Lindahl</surname></persName>
							<email>annanlindahl@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Department of Philosophy</orgName>
								<orgName type="department" key="dep2">Linguistics and Theory of Science</orgName>
								<orgName type="institution">University of Gothenburg</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Love</forename><surname>Börjeson</surname></persName>
							<email>love.borjeson@hyresgastforeningen.se</email>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">Visiting Scholar</orgName>
								<orgName type="department" key="dep2">Graduate School of Education</orgName>
								<orgName type="institution">Stanford University</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Towards Topic Modeling Swedish Housing Policies: Using Linguistically Informed Topic Modeling to Explore Public Discourse</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">1B9E312626427E54D55B752B32F8A437</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T21:03+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>topic modeling</term>
					<term>housing policies</term>
					<term>LDA</term>
					<term>public discourse</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Topic modeling is an unsupervised method for finding topics in large collections of data. However, most studies that employ topic modeling make little use of linguistic information when preprocessing the data. This work therefore investigates what effect linguistically informed preprocessing has on topic modeling. Through human evaluation, filtering the data based on part of speech is found to have the largest effect on topic quality. Non-lemmatized topics are rated higher than lemmatized topics, while topics from filters based on dependency relations receive low ratings. To exemplify how topic modeling can be used to explore public discourse, the area of Swedish housing policies is chosen, as represented by documents from the Swedish parliament and Swedish newstexts. This subject is relevant to study because of the ongoing housing crisis in Sweden.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>In the humanities and social sciences, the use of computational methods has been argued for by many. Under the umbrella of Digital Humanities, the importance of tools for investigating both digital and printed texts is undeniable. However, as Viklund &amp; Borin <ref type="bibr" target="#b0">[1]</ref> argue, these techniques still need refinement and development to become both more accessible and more useful. Linguistic information is often disregarded, and there is a need to explore what incorporating it can do for the field. This issue is also raised by Tahmasebi et al. <ref type="bibr" target="#b1">[2]</ref>, who discuss the concept of culturomics and the need for good linguistic preprocessing to make it a successful field.</p><p>One popular method for investigating text is topic modeling, an unsupervised probabilistic method for finding topics in collections of data. It has proved successful in a wide range of areas for finding structure and topics in large quantities of text. For example, Hall et al. <ref type="bibr" target="#b2">[3]</ref> use it to study ideas in the computational semantics field over time, DiMaggio et al. <ref type="bibr" target="#b3">[4]</ref> investigate the news coverage of U.S. arts funding, and Jacobi et al. <ref type="bibr" target="#b4">[5]</ref> use it to follow trends in journalistic papers. The most commonly used topic model is Latent Dirichlet allocation (LDA), developed by Blei et al. <ref type="bibr" target="#b5">[6]</ref>; it is also the model used in most of the studies mentioned here.</p><p>However, many studies, including those above, differ in how they perform and report their preprocessing. 
Preprocessing is an important step in topic modeling; it ranges from formatting the data, such as removing punctuation, to more substantial interventions, such as removing all words of a certain part of speech. The effect of different preprocessing choices has not been studied systematically, and linguistic information is rarely used in the preprocessing.</p><p>Thus, the aim of the present work is twofold. The first is to investigate how topic modeling can be adapted and enriched with linguistic information and knowledge. The second is to exemplify and explore how this method can be applied to investigate the public discourse on Swedish housing policies. This area is chosen because of its relevance: the housing crisis in Sweden has been ongoing since the 1990s and has been a source of debate for just as long. The lack of housing continues to spread, with only a small rise in newly built houses in 2015-2016 <ref type="bibr" target="#b6">[7]</ref>, further adding to the relevance of this subject.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related work</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Linguistically informed topic modeling</head><p>A few studies report on the effect of linguistically informed topic modeling. Martin &amp; Johnson <ref type="bibr" target="#b7">[8]</ref> conclude that topic modeling is more informative and effective when using only nouns. Following Lau et al. <ref type="bibr" target="#b8">[9]</ref>, they also report that lemmatizing improves the results, but that it slows down the topic modeling. They use semantic coherence for evaluation (see the evaluation section) and find that the coherence of the topics improves when using only nouns. Jockers <ref type="bibr" target="#b9">[10]</ref> also reports good results for nouns only, but comments that using only nouns can remove some of the information sought after; for example, he argues that if one is looking for sentiment, adjectives probably need to be included.</p><p>There are also studies which use linguistic information to develop topic modeling for specific purposes. Fang et al. <ref type="bibr" target="#b10">[11]</ref> present a novel cross-perspective topic model which models both topics and opinions: the topics are modeled using only nouns from the corpora, while the opinions related to the topics are modeled using adjectives, adverbs and verbs. Guo <ref type="bibr" target="#b11">[12]</ref> uses dependency relations to filter words as a preprocessing step for LDA, and reports improved results for the specific task of detecting spoilers. Together with the studies mentioned above, this further motivates an investigation of how topic modeling can be improved by filtering the input in different ways based on linguistic information.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Data</head><p>The data used here comes from two domains of the public discourse: the Swedish parliament, the Riksdag, and Swedish newstexts. Both domains were automatically annotated with the help of Korp<ref type="foot" target="#foot_0">3</ref>, the corpus infrastructure tool of Språkbanken <ref type="bibr" target="#b12">[13]</ref>. The Riksdag data is already available through Korp, and the newstext data were annotated using the Sparv pipeline <ref type="foot" target="#foot_1">4</ref> , which is a part of Korp <ref type="bibr" target="#b13">[14]</ref>.</p><p>It should be noted that the language in the two domains differs: the Riksdag data is formal and contains many domain-specific words, while the language in the newstexts is closer to spoken, everyday Swedish.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">The Riksdag documents</head><p>All documents and records from the Riksdag's proceedings and correspondence are freely available online as Riksdagens öppna data (the Parliament's open data). <ref type="foot" target="#foot_2">5</ref> Here, however, the documents were downloaded through Korp.</p><p>The documents span from 1971 to the present day, with the exception of a few document categories missing from the earlier years. There are 20 different document categories, and from these seven were chosen: those deemed to cover debates, discussions and proposals. An overview of the selected documents can be seen in table <ref type="table" target="#tab_0">1</ref>. Only the first 3000-4000 words were used from the longer document types, except for the protocols, in the hope that this part covers each document's topics well enough. The protocols have topics distributed throughout the documents and were therefore kept long. The documents were split up according to parliamentary periods, both to be able to compare the periods and to avoid topic modeling over a long time span, since topics will have varied over time, which might affect the topic modeling. The parliamentary periods with their respective document and word counts can be seen in the table in Appendix A.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Newstexts</head><p>To analyze the media, newspaper and magazine articles were downloaded from the Media Archive provided by Retriever. <ref type="foot" target="#foot_3">6</ref> Access was provided by the Swedish Union of Tenants (SUT).</p><p>In order to find all the newstexts concerning housing policies, a search term list was compiled together with people from SUT who are knowledgeable about housing policies; see Appendix B for the search terms. All newstexts containing the Swedish word for housing, bostad, in any of its forms, together with at least one of the words in the search term list, were used. The selected search terms captured both relevant and irrelevant newstexts; the topic modeling helps us sort out the relevant ones for further analysis.</p><p>All the available newstexts were originally published on the web; no printed media is included. The time span of these newstexts is 2000-2015, as no newstexts are available before 2000. For the topic modeling, the data is split up into two 5-year periods and one 6-year period, to be able to compare the years and avoid too long a time span. These periods can be seen in table 2 together with the number of tokens and documents. In total the newstexts come from 1786 different sources; most of them contribute only a few newstexts, while a few sources dominate.</p></div>
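The selection step above can be sketched as follows. This is a minimal illustration, not the authors' code: the regular expression is a rough approximation of matching bostad "in all its forms", and the example search terms are illustrative stand-ins for the actual Appendix B list.

```python
import re

# Approximation of "bostad in all its forms": matches bostad, bostäder,
# bostadspolitik, etc. The [aä] alternation covers the plural stem change.
BOSTAD = re.compile(r"bost[aä]d\w*", re.IGNORECASE)

def select_newstexts(documents, search_terms):
    """Keep documents containing some form of 'bostad' plus at least
    one term from the search-term list."""
    terms = [t.lower() for t in search_terms]
    return [doc for doc in documents
            if BOSTAD.search(doc) and any(t in doc.lower() for t in terms)]
```

A document mentioning only the election, for instance, would be dropped even if it matched a search term, because the bostad requirement fails.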
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Method</head><p>In order to compare the effects of different linguistic preprocessing, a number of filters based on linguistic information were designed and applied to a test set of the data. A filter can, for example, select all words in the documents which are tagged as nouns, or all words participating in a specified dependency relation. The filters are described in more detail below.</p><p>A topic model was trained on each of the filtered versions of the test set, and the models were evaluated using semantic coherence and human judgement, see below.</p><p>The parliamentary period 2010-2014 from the Riksdag was chosen as the test set. The combination of filters resulting in the highest rated model on this test set was used for the rest of the parliamentary periods of the Riksdag data, which are then used for exploration of the data.</p><p>As previously stated, the language in the two data sets differs, and because of this the highest rated combination of filters for the Riksdag data is not simply reused for the newstexts. Instead, the top five highest rated combinations of filters from the Riksdag are tested on the newstexts, in the hope that the positive effects of these filters are general enough to be useful in the new domain. The five resulting models for the newstexts are then evaluated in the same way as the Riksdag data.</p></div>
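A linguistic filter of the kind described above can be sketched as a simple function over (word, tag) pairs. The tag names below follow the SUC tagset used by the Korp annotation (NN noun, VB verb, JJ adjective, PC participle); the function and constant names are our own.

```python
def pos_filter(tagged_tokens, keep_tags):
    """Keep only words whose part-of-speech tag is in keep_tags."""
    return [word for word, tag in tagged_tokens if tag in keep_tags]

# The POS2 filter of section 4.1: nouns, verbs, adjectives and participles.
POS2_TAGS = {"NN", "VB", "JJ", "PC"}
```

A nouns-only filter is then just `pos_filter(doc, {"NN"})`, and a dependency-relation filter would have the same shape, testing the token's relation label instead of its POS tag.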
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Preprocessing and linguistic filters</head><p>Punctuation and numbers are removed from all documents, and all words are lower-cased. Frequent and rare words are also removed: words which occur in 50% or more of the documents, and words which occur in fewer than 5 documents. This frequency filtering is referred to here as filter 1 and, unless stated otherwise, is applied to all documents.</p><p>A stop list, also defined as a filter, was used. It was based on a general stop list for Swedish, but it was necessary to manually add domain-specific words.</p><p>Through the Korp annotation there is information about lemma, part of speech and dependency relation for every token. From this, a lemmatization filter was made, which simply replaces each word with its lemma.</p><p>Three filters based on part of speech were tested. The first filter uses all parts of speech, called all POS. The second filter removes all words which are not nouns, verbs, adjectives or participles, from here on called POS2. The third, following <ref type="bibr" target="#b7">[8]</ref>, uses only nouns.</p><p>A filter based on dependency relations was also made. This filter only keeps words participating in seven specified dependency relations, chosen with the aim of finding the meaningful parts of the sentence. These relations are: agent, object adverbial, direct object, predicative attribute, place adverbial, subject predicative complement and other subjects.</p><p>Table <ref type="table" target="#tab_3">3</ref> shows an overview of the combinations of filters tested. If nothing else is stated, all filters had frequency filter 1 applied. All groups are tested without the frequency filter, with lemmatization, and with lemmatization and a stop list. The all POS and POS2 groups are also tested with filters based on dependency relations. The POS2 group was chosen for further investigation and thus has 5 more filter combinations applied to it. The linguistic filters applied to the newstext data can be seen in table <ref type="table">4</ref>. These filters were chosen based on the results from the topic modeling of the Riksdag data and manual inspection. Initial manual inspection showed that using only a frequency filter worked better for the newstext data than for the Riksdag data. The stop list for the Riksdag data consisted largely of domain-specific words and could not be reused. Because of this, instead of making a new stop list, a new frequency filter was made. This alternative filter, named filter 2, removes the 300 most frequent tokens in the data and tokens that occur in 75% of the documents.</p></div>
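The two frequency filters described above can be sketched as follows, with the thresholds taken from the text; the function names are our own, and the thresholds are exposed as parameters so the sketch also works on small examples.

```python
from collections import Counter

def frequency_filter_1(docs, min_docs=5, max_frac=0.5):
    """Filter 1: drop words occurring in 50% or more of the documents,
    or in fewer than 5 documents."""
    n = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # document frequency per word
    keep = {w for w, df in doc_freq.items()
            if df >= min_docs and max_frac > df / n}
    return [[w for w in doc if w in keep] for doc in docs]

def frequency_filter_2(docs, top_n=300, max_frac=0.75):
    """Filter 2: drop the 300 most frequent tokens and tokens
    occurring in 75% of the documents."""
    n = len(docs)
    term_freq = Counter(w for doc in docs for w in doc)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    banned = {w for w, _ in term_freq.most_common(top_n)}
    banned.update(w for w, df in doc_freq.items() if df / n >= max_frac)
    return [[w for w in doc if w not in banned] for doc in docs]
```

Filter 2 thus needs no hand-curated stop list: the most frequent tokens stand in for the stop words.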
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Table <ref type="table">4</ref>. Filters for the newstexts test set; filter 2 replaces the stop list. The combinations are: NN, Lemma; NN, Lemma, Filter 2; POS 2, Filter 1; POS 2, Filter 2; and POS 2, Filter 2, Deprel.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Topic modeling</head><p>The topic modeling was implemented using the Python library Gensim. <ref type="foot" target="#foot_4">7</ref> The LDA implementation in Gensim uses a modified version of variational Bayes, made to handle documents in a stream, which makes handling large corpora more efficient <ref type="bibr" target="#b14">[15]</ref> <ref type="bibr" target="#b15">[16]</ref>. Part of the evaluation was also carried out with methods in the library, see the next section.</p><p>When training an LDA model, the number of topics needs to be provided. Guided by previous papers, experiments were run with between 50 and 200 topics. After manual inspection, 75 topics were selected for the filter tests. Otherwise, the default configuration of Gensim was used.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Evaluation</head><p>There are several ways to evaluate a topic model. It has previously been shown that the held-out likelihood of a model does not always correspond to human judgement <ref type="bibr" target="#b16">[17]</ref>. Here the focus lies instead on the interpretability of the generated topics, which is evaluated both computationally and with humans. Using the coherence model available in Gensim, the two semantic coherence measures cv and npmi were calculated. These measures calculate the semantic coherence between the words in a topic using probabilities derived from word co-occurrence statistics. If a topic has high coherence between its words, it is presumably also a good topic. The two measures differ in how the probabilities are calculated, see <ref type="bibr" target="#b17">[18]</ref> for more details. <ref type="bibr" target="#b17">[18]</ref> finds cv to be the best measure, but is contradicted by <ref type="bibr" target="#b18">[19]</ref>, who finds npmi to be the best measure, and therefore the two are compared.</p><p>To assess the performance of the coherence measures and evaluate topic quality, human judgements were collected. Before this, a short manual inspection of the models was done by the authors. Two models were disregarded because they contained mostly useless topics; the remaining 16 models were kept. These models can be seen in table 6 in the next section.</p><p>Six evaluators each rated 8 models: three people rated one set of 8 models, and the other three rated the remaining 8. In total, there are human judgements for 16 models. The evaluators were between 20 and 30 years old, all native Swedish speakers, with an education level of undergraduate or above. There was an equal gender division.</p><p>Following <ref type="bibr" target="#b19">[20]</ref> and <ref type="bibr" target="#b8">[9]</ref>, the evaluators were asked to assess the understandability of the top 10 words from each topic. The instructions given for the rating, translated from Swedish, can be seen in table <ref type="table">5</ref>.</p><p>Rating 1: I don't find the words to belong together; I don't understand the topic. Rating 2: I find about half of the words to belong together; the topic is semi-understandable. Rating 3: I find the topic to be understandable; there is at most one word which doesn't belong.</p><p>Table <ref type="table">5</ref>. Instructions for the human evaluators.</p><p>For each topic, the mean of the human ratings was calculated, and the correlation between these ratings and the coherence measures was then calculated using Pearson's r. As stated in the previous section, five models, corresponding to the five top rated combinations of filters from the Riksdag test set, were chosen for this.</p></div>
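The correlation step above can be sketched as follows. The coherence scores themselves come from Gensim's coherence model, as stated in the text; here only the Pearson's r computation is shown, dependency-free, and both the function name and the example numbers are illustrative, not values from the paper.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Illustrative only: correlate hypothetical mean human ratings with
# hypothetical cv coherence scores for five models.
ratings = [2.49, 2.41, 2.35, 2.24, 2.20]
cv_scores = [0.61, 0.58, 0.55, 0.57, 0.52]
r = pearson_r(ratings, cv_scores)
```

A value of r near 1 would mean the coherence measure ranks the models much like the human evaluators do.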
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Results</head><p>Below, the results for the models trained on the filtered Riksdag data are presented. Table 6 shows all the models with their ratings, including the mean human rating and the number of 3's (from the mean rating) for each model's topics. The maximum number of 3's is 75, which would mean all human evaluators gave all topics a score of 3. The percentage of the original number of words is also shown; however, this number does not seem to have an effect on the ratings.</p><p>The highest rated model is the one with only nouns, a stop list and the frequency filter, filter 1 (words occurring in 50% or more of the documents and words occurring in fewer than 5 documents are removed). The words are also lemmatized. In second place comes the same model but without a stop list. The following top ranked models are from the POS2 group, but without lemmatization. The third highest rated model is also filtered based on dependency relations.</p><p>For the models using all parts of speech, using a stop list significantly improves the results, as expected. Applying frequency filter 1 also improves the results. In fact, in the POS2 group, the frequency filter has a greater effect than the stop list when used alone.</p><p>The dependency relations filters have mixed effects. This can be seen when comparing all parts of speech with and without dependency relations, where the dependency relations filters have a lower ranking; the same is seen in the corresponding POS2 models. However, the POS2 model without lemmatization, stop list and dependency relations has a high score, and so does the POS2 model with no filter except the dependency filter.</p><p>In the POS2 group, models using lemmatized words have lower ratings than their respective models without lemmatization. However, the NN models using lemmatized words have a higher score than all the POS2 models. 
The results from the human judgements for the newstexts can be seen in table <ref type="table" target="#tab_5">7</ref>. The highest rated models differ from those for the Riksdag data. Here, the highest rated model is from the POS2 group with frequency filter 2 and no lemmatization, as opposed to lemmatized nouns with a stop list, which had the highest score for the Riksdag. The second-ranked model is the same as for the Riksdag, but the rest of the models are ranked differently. Note that frequency filter 2 replaces the stop list here. The mean ratings and numbers of 3's are lower overall for the newstext data than for the Riksdag.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>When inspecting the topics from the different filters, a few patterns were found. In all topics, nouns were the most frequent part of speech, regardless of POS filter. Non-lemmatized topics had more repetition of the same words in different word forms. The dependency relations captured mostly nouns due to the nature of the chosen relations, but these topics were still not rated as high as the others.</p><p>The rankings from the two coherence measures, cv and npmi, did not correspond to the human rankings for the Riksdag test set; cv, however, ranks the human top rated model as its second best. The calculated correlation for the cv measure is almost always higher than for npmi, with mean correlations of 0.68 and 0.60, respectively. Both have their highest correlation for the model top ranked by humans, and both have lower correlations for the models with dependency relations filters, compared to the other models. See appendix C for more details.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Exploring the public discourse</head><p>The highest rated combination of filters from the Riksdag, lemmatized nouns with a stop list, was used on the rest of the data. The resulting models and classifications of documents are used here to exemplify how topic modeling can be used for examining public discourse. The same was done for the newstexts, but with the highest rated model for that data, the POS2 group with filter 2.</p><p>For the Riksdag, the topics for each period were manually inspected, and in every period a topic corresponding to housing policies was found; in some periods, two such topics were found. In the newstexts, more topics relating to housing policies were found than in the Riksdag data, due to the selection process.</p><p>With this information, one can track changes in the topic over time. For example, figure <ref type="figure" target="#fig_0">1</ref> shows the proportion of the motions in which the housing policies topic makes up more than 0.35 of the document; documents with a lower proportion of the topic were filtered out. Inspecting the figure, one can see that the topic peaks in 1998-2002 and 1976-1979. To further inspect the data, interactive plots were made with the help of the Python library Bokeh. <ref type="foot" target="#foot_5">8</ref> A static version is seen in figure <ref type="figure" target="#fig_1">2</ref>. It shows all documents, not just the ones containing the 'housing policies' topics. The documents on the y-axis are in chronological order. As can be seen in the screenshot, hovering the mouse over a square shows the name of the document it represents, in this case Livet efter skyddat boende (Life after protected housing). The topic is unnamed, but its top ten words are displayed; they include våld (violence), kvinna (woman) and barn (children). 
The proportion of the topic is also shown. Together with the title, one can assume that the document is classified correctly. This interactive visualization is thus both a way to explore the data and a way to examine how the model classifies documents.</p><p>With these kinds of plots, co-occurring topics can also be examined. Figure <ref type="figure" target="#fig_2">3</ref> is based on the newstexts and shows the mean of each topic for every month during 2014. Only newstexts containing a topic labeled the lack of housing are used, and the lack of housing topic itself (nr 25) is removed, to be able to see the other topics more clearly.</p><p>In the figure, topic nr 33, which is about student housing, co-occurs slightly more during July, August and September, possibly due to the start of the academic year in September. Topic nr 67, which concerns political parties and politics, has a strong peak in August; in September 2014, general elections were held in Sweden, which could explain this peak. Other frequent topics are nr 39 and 57: topic 39 is about investments and growth, and topic 57 consists of general words such as said.</p></div>
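The thresholding step described above can be sketched as follows; the function name is our own, and `doc_topics` is assumed to be one mapping of topic id to proportion per document (in Gensim such proportions come from the trained model's `get_document_topics` method).

```python
def share_above_threshold(doc_topics, topic_id, threshold=0.35):
    """Fraction of documents in which `topic_id` makes up more than
    `threshold` of the document's topic distribution."""
    hits = sum(1 for topics in doc_topics
               if topics.get(topic_id, 0.0) > threshold)
    return hits / len(doc_topics)
```

Computing this per parliamentary period, for the topic labeled 'housing policies', yields the time series plotted in figure 1.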
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusions</head><p>In this work we have shown how one can examine the discourse of Swedish housing policies with the help of topic modeling. The method is deemed suitable for the intended analysis, although more work is needed for a full analysis of the public discourse.</p><p>Using human evaluators, the effects of different kinds of linguistic preprocessing were investigated. Of the three categories investigated here, part of speech had the largest impact on the results. Using only nouns improved the topics. Models based on nouns, verbs, adjectives and participles also improved the topics; however, the most frequent part of speech in these models is still nouns. Lemmatized data is not rated as high as non-lemmatized data, but without lemmatization the same words are repeated in the topics. This might affect the topics' usefulness and interpretability, and it is thus unclear whether non-lemmatized data is preferable. Using data selected based on dependency relations does not result in topics with high ratings; however, this might change with a different set of dependency relations. The evaluation of the topic models showed that the cv measure correlates better with human judgements than the npmi measure. Both measures have their highest correlation for models using only nouns.</p><p>Appendix C -Top 5 models from the Riksdag compared to cv and npmi measures </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Proportion of documents with a proportion over 0.35 of the topics labeled 'housing policies' in the motions.</figDesc><graphic coords="11,134.77,422.77,345.83,254.92" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. A screen shot of an interactive plot. Columns represent documents and rows represent topics.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 3 .</head><label>3</label><figDesc>Fig. 3. A plot over the mean of each topic for all the newstexts containing the lack of housing topic for each month during 2014. The housing topic is removed.</figDesc><graphic coords="12,134.77,116.83,250.00,400.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Overview of the chosen document types.</figDesc><table><row><cell>Document type</cell><cell>Description</cell><cell>Nr of documents</cell><cell>Average document length</cell><cell>Period</cell></row><row><cell>Betänkande*</cell><cell>Committee reports with proposals for decisions in the Riksdag.</cell><cell>20 993</cell><cell>2332</cell><cell>1971-2016</cell></row><row><cell>Interpellation</cell><cell>A formal question from a member of the parliament to the government.</cell><cell>7384</cell><cell>357</cell><cell>1998-2016</cell></row><row><cell>Motion</cell><cell>A formal proposal by a parliament member, submitted once a year.</cell><cell>123 129</cell><cell>680</cell><cell>1971-2016</cell></row><row><cell>Protokoll</cell><cell>Protocols of the daily meetings in the parliament, including all debates.</cell><cell>6392</cell><cell>27866</cell><cell>1971-2016</cell></row><row><cell>Proposition*</cell><cell>Proposals for legislation from the Government.</cell><cell>6030</cell><cell>4906</cell><cell>1971-2016**</cell></row><row><cell>Statens offentliga utredningar*</cell><cell>Reports from committees of inquiry appointed by the Government, in preparation for submitting a proposal.</cell><cell>3169</cell><cell>3304</cell><cell>1994-2016</cell></row><row><cell>Skriftliga frågor</cell><cell>Shorter, written questions from a member of the parliament to the government.</cell><cell>26 402</cell><cell>228</cell><cell>1998-2016</cell></row></table><note>*Shortened documents are used. **Between the years 2006-2009 most of the documents are corrupted.</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2 .</head><label>2</label><figDesc>The different periods for the newstexts data.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3 .</head><label>3</label><figDesc>Filters for the Riksdag test set.</figDesc><table><row><cell>POS</cell><cell>POS2</cell><cell>NN</cell></row><row><cell cols="2">No frequency filter No frequency filter</cell><cell>No frequency filter</cell></row><row><cell>Lemma</cell><cell>Lemma</cell><cell>Lemma</cell></row><row><cell>Lemma, Stop</cell><cell>Lemma, Stop</cell><cell>Lemma, Stop</cell></row><row><cell cols="2">Lemma, Stop, Deprel Lemma, Stop, Deprel</cell><cell>-</cell></row><row><cell>Lemma,Deprel</cell><cell>Lemma, Deprel</cell><cell>-</cell></row><row><cell>-</cell><cell>Stop, Deprel</cell><cell>-</cell></row><row><cell>-</cell><cell>Deprel</cell><cell>-</cell></row><row><cell>-</cell><cell>Stop</cell><cell>-</cell></row><row><cell>-</cell><cell cols="2">Deprel, no frequency filter -</cell></row><row><cell>-</cell><cell>Only frequency filter 1</cell><cell>-</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 6 .</head><label>6</label><figDesc>Human ratings for all models.</figDesc><table><row><cell>Model</cell><cell>Filters</cell><cell>Mean human rating</cell><cell>Nr of 3's</cell><cell>% of all words used</cell></row><row><cell>All POS</cell><cell>No frequency filter</cell><cell>-</cell><cell>-</cell><cell>-</cell></row><row><cell>All POS</cell><cell>Lemma</cell><cell>-</cell><cell>-</cell><cell>-</cell></row><row><cell>All POS</cell><cell>Lemma, Stop</cell><cell>2.191</cell><cell>15</cell><cell>33</cell></row><row><cell>All POS</cell><cell>Lemma, Stop, Deprel</cell><cell>2.147</cell><cell>13</cell><cell>10</cell></row><row><cell>All POS</cell><cell>Lemma, Deprel</cell><cell>1.987</cell><cell>0</cell><cell>19</cell></row><row><cell>POS2</cell><cell>No frequency filter</cell><cell>1.978</cell><cell>0</cell><cell>48</cell></row><row><cell>POS2</cell><cell>Lemma</cell><cell>2.009</cell><cell>6</cell><cell>42</cell></row><row><cell>POS2</cell><cell>Lemma, Stop</cell><cell>2.200</cell><cell>9</cell><cell>29</cell></row><row><cell>POS2</cell><cell>Lemma, Stop, Deprel</cell><cell>1.938</cell><cell>5</cell><cell>9</cell></row><row><cell>POS2</cell><cell>Lemma, Deprel</cell><cell>1.858</cell><cell>6</cell><cell>12</cell></row><row><cell>POS2</cell><cell>Stop, Deprel</cell><cell>2.351</cell><cell>16</cell><cell>9</cell></row><row><cell>POS2</cell><cell>Deprel</cell><cell>2.058</cell><cell>5</cell><cell>12</cell></row><row><cell>POS2</cell><cell>Stop</cell><cell>2.236</cell><cell>13</cell><cell>28</cell></row><row><cell>POS2</cell><cell>Deprel, no frequency filter</cell><cell>2.231</cell><cell>10</cell><cell>14</cell></row><row><cell>POS2</cell><cell>Only frequency filter</cell><cell>2.249</cell><cell>14</cell><cell>43</cell></row><row><cell>NN</cell><cell>No frequency filter</cell><cell>2.102</cell><cell>6</cell><cell>24</cell></row><row><cell>NN</cell><cell>Lemma</cell><cell>2.409</cell><cell>24</cell><cell>23</cell></row><row><cell>NN</cell><cell>Lemma, Stop</cell><cell>2.489</cell><cell>27</cell><cell>18</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 7 .</head><label>7</label><figDesc>Results for the chosen models for the newstext data.</figDesc><table><row><cell>Model</cell><cell>Mean human rating</cell><cell>Nr of 3's</cell></row><row><cell>2</cell><cell>2.08</cell><cell>10</cell></row><row><cell>NN, Lemma</cell><cell>2.036</cell><cell>5</cell></row><row><cell>POS2, Filter 1</cell><cell>1.933</cell><cell>3</cell></row><row><cell>NN, Lemma, Filter 2</cell><cell>1.871</cell><cell>4</cell></row><row><cell>POS2, Filter 2, Deprel</cell><cell>1.636</cell><cell>0</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">https://spraakbanken.gu.se/swe/node/1535</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_1">https://spraakbanken.gu.se/swe/node/19799</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_2">https://data.riksdagen.se/data/dokument/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_3">https://www.retriever.se/product/nordens-storsta-mediaarkiv/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_4">https://radimrehurek.com/gensim/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_5">https://bokeh.pydata.org/en/latest/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_6">https://spraakbanken.gu.se/eng/culturomics</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_7">https://www.hyresgastforeningen.se/</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgments. This work has been supported in part by a framework grant for the project Towards a knowledge-based culturomics <ref type="bibr" target="#b8">9</ref> , awarded by the Swedish Research Council (contract 2012-5738).</p><p>This work has also been carried out with the support from the Swedish Union of Tenants 10 , which has provided part of the data used.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0" />			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">How Can Big Data Help Us Study Rhetorical History?</title>
		<author>
			<persName><forename type="first">J</forename><surname>Viklund</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Borin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Selected Papers from the CLARIN Annual Conference 2015</title>
				<meeting><address><addrLine>Wroclaw, Poland</addrLine></address></meeting>
		<imprint>
			<publisher>Linköping University Electronic Press</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="volume">123</biblScope>
			<biblScope unit="page" from="79" to="93" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Visions and open challenges for a knowledge-based culturomics</title>
		<author>
			<persName><forename type="first">N</forename><surname>Tahmasebi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Borin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Capannini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dubhashi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Exner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Forsberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Gossen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">D</forename><surname>Johansson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Johansson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kågebäck</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal on Digital Libraries</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="issue">2-4</biblScope>
			<biblScope unit="page" from="169" to="187" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Studying the history of ideas using topic models</title>
		<author>
			<persName><forename type="first">D</forename><surname>Hall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the conference on empirical methods in natural language processing</title>
				<meeting>the conference on empirical methods in natural language processing</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="363" to="371" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Exploiting affinities between topic modeling and the sociological perspective on culture: Application to newspaper coverage of US government arts funding</title>
		<author>
			<persName><forename type="first">P</forename><surname>Dimaggio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nag</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Blei</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Poetics</title>
		<imprint>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="570" to="606" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Quantitative analysis of large amounts of journalistic texts using topic modelling</title>
		<author>
			<persName><forename type="first">C</forename><surname>Jacobi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Van Atteveldt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Welbers</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Digital Journalism</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="89" to="106" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Latent dirichlet allocation</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Blei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Y</forename><surname>Ng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">I</forename><surname>Jordan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of machine Learning research</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="993" to="1022" />
			<date type="published" when="2003-01">2003. Jan</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Därför kan byggboomen inte lösa bostadskrisen</title>
		<author>
			<persName><forename type="first">H</forename><surname>Höjer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Forskning och Framsteg</title>
		<imprint>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="61" to="74" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">More efficient topic modelling through a noun only approach</title>
		<author>
			<persName><forename type="first">F</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Johnson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Australasian Language Technology Association Workshop</title>
				<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page">111</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">H</forename><surname>Lau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Newman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Baldwin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">EACL</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="530" to="539" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Macroanalysis: Digital methods and literary history</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Jockers</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2013">2013</date>
			<publisher>University of Illinois Press</publisher>
			<biblScope unit="page" from="128" to="133" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Mining contrastive opinions on political texts using cross-perspective topic model</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Fang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Si</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Somasundaram</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the fifth ACM international conference on Web search and data mining</title>
				<meeting>the fifth ACM international conference on Web search and data mining</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="63" to="72" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Using Dependency Parses to Augment Feature Construction for Text Mining</title>
		<author>
			<persName><forename type="first">S</forename><surname>Guo</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
		<respStmt>
			<orgName>Virginia Polytechnic Institute and State University</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Korp: the corpus infrastructure of Språkbanken</title>
		<author>
			<persName><forename type="first">L</forename><surname>Borin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Forsberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Roxendal</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">LREC</title>
				<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="474" to="478" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Sparv: Språkbanken&apos;s corpus annotation pipeline infrastructure</title>
		<author>
			<persName><forename type="first">L</forename><surname>Borin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Forsberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hammarstedt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Rosén</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Schumacher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Schäfer</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Software framework for topic modelling with large corpora</title>
		<author>
			<persName><forename type="first">R</forename><surname>Rehurek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sojka</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Online learning for latent dirichlet allocation</title>
		<author>
			<persName><forename type="first">M</forename><surname>Hoffman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">R</forename><surname>Bach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Blei</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="856" to="864" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Reading tea leaves: How humans interpret topic models</title>
		<author>
			<persName><forename type="first">J</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Boyd-Graber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gerrish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Blei</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">NIPS</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="page" from="1" to="9" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Exploring the space of topic coherence measures</title>
		<author>
			<persName><forename type="first">M</forename><surname>Röder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Both</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hinneburg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the eighth ACM international conference on Web search and data mining</title>
				<meeting>the eighth ACM international conference on Web search and data mining</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="399" to="408" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Topic Coherence for Dutch</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Van Der Zwaan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Marx</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kamps</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Automatic evaluation of topic coherence</title>
		<author>
			<persName><forename type="first">D</forename><surname>Newman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">H</forename><surname>Lau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Grieser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Baldwin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics</title>
				<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="100" to="108" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
