<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Topic Modeling Swedish Housing Policies: Using Linguistically Informed Topic Modeling to Explore Public Discourse</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anna Lindahl</string-name>
          <email>annanlindahl@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Love Borjeson</string-name>
          <email>love.borjeson@hyresgastforeningen.se</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Philosophy, Linguistics and Theory of Science, University of Gothenburg</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Visiting Scholar, Graduate School of Education, Stanford University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Topic modeling is an unsupervised method for finding topics in large collections of data. However, most studies that employ topic modeling make little use of linguistic information when preprocessing the data. This work therefore investigates what effect linguistically informed preprocessing has on topic modeling. Through human evaluation, filtering the data based on part of speech is found to have the largest effect on topic quality. Non-lemmatized topics are rated higher than lemmatized topics. Topics from filters based on dependency relations receive low ratings. To exemplify how topic modeling can be used to explore public discourse, the area of Swedish housing policies is chosen, as represented by documents from the Swedish parliament and Swedish newstexts. This subject is relevant to study because of the current housing crisis in Sweden.</p>
      </abstract>
      <kwd-group>
        <kwd>topic modeling</kwd>
        <kwd>housing policies</kwd>
        <kwd>LDA</kwd>
        <kwd>public discourse</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In the humanities and social sciences, the use of computational
methods has been argued for by many. In the field commonly referred to as Digital Humanities,
the importance of tools for investigating both digital and printed texts is
undeniable. However, as Viklund &amp; Borin [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] argue, these techniques still need
refinement and development to become both accessible and more useful. Often,
linguistic information is disregarded, and there is a need to explore what
incorporating it can do for the field. This issue is also raised by Tahmasebi et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], who discuss the concept of culturomics and the need for good linguistic
preprocessing to make it a successful field.
      </p>
      <p>
        One popular method for investigating text is topic modeling, an
unsupervised probabilistic method for finding topics in collections of data. It has
proved successful in a wide range of areas for finding structure
and topics in large quantities of text. For example, Hall et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] use it to study
ideas within the computational semantics field over time, DiMaggio et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
investigate the news coverage of U.S. arts funding, and Jacobi et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] use it for
following trends in journalistic texts. The most commonly used topic model is
Latent Dirichlet allocation (LDA), developed by Blei et al.
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This is also the model used in most of the studies mentioned here.
      </p>
      <p>However, many studies, including those above, differ in their use and reporting
of preprocessing. Preprocessing is an important step in topic modeling; it
includes formatting of the data, such as removing punctuation, but it
can also include removing all words of a certain part of speech. The effect of
different preprocessing choices has not been studied systematically, and linguistic
information is rarely used in the preprocessing.</p>
      <p>
        Thus, the aim of the present work is twofold. The first is to investigate how
one can adapt and enrich topic modeling with linguistic information and
knowledge. The second is to exemplify and explore how one can apply this method to
investigate the public discourse of Swedish housing policies. This area is chosen
because of its relevance: the housing crisis in Sweden has been ongoing since the
1990s and has been a source of debate for just as long. Lack of housing is
still becoming more widespread, with only a small rise in newly built houses in
2015–2016 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], further adding to the relevance of this subject.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <sec id="sec-2-1">
        <title>Linguistically informed topic modeling</title>
        <p>
          There are a few studies reporting on the effect of linguistically informed topic
modeling. Martin &amp; Johnson [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] conclude that topic modeling is more informative
and effective using only nouns. Following Lau et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], they also report that
lemmatizing improves the results, but that it slows down the topic modeling.
They use semantic coherence for evaluation (see the evaluation section) and find
that the coherence of the topics improves using only nouns. Jockers [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] also
reports good results for nouns only, but comments that using only nouns can
remove some of the information sought after. For example, he argues that if one
is looking for sentiment, adjectives probably need to be incorporated.
        </p>
        <p>
          There are also studies which use linguistic information to develop topic
modeling for specific purposes. Fang et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] present a novel cross-perspective
topic model which models topics and opinions. The topics are modeled using
only nouns from the corpora, while the opinions related to the topics are modeled
using adjectives, adverbs and verbs. Guo [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] uses dependency parsing relations
to filter words as a preprocessing step for LDA, and reports improved results
for their specific task of detecting spoilers. This, together with the studies
mentioned above, further motivates an investigation of how topic modeling can be
improved by filtering the input in different ways, based on linguistic information.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Data</title>
      <p>
        The data used here comes from two domains of the public discourse: the Swedish
parliament, the Riksdag, and Swedish newstexts. Both domains were
automatically annotated with the help of Korp, the corpus infrastructure tool of Språkbanken
 [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The Riksdag data is already available through Korp, and the newstext
data were annotated using the Sparv pipeline, which is a part of Korp [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>It should be noted that the language in the two domains differs: the Riksdag
data is formal and contains many domain-specific words, while the language in
the newstexts is more similar to spoken language and the vocabulary is closer
to everyday Swedish.</p>
      <sec id="sec-3-1">
        <title>The Riksdag documents</title>
        <p>All documents and records from the Riksdag's proceedings and correspondence
are freely available online as Riksdagens öppna data (the Parliament's
open data). Here, however, the documents were downloaded through Korp.</p>
        <p>The documents span from 1971 to the present day, with the exception of a
few document categories missing from the earlier years. There are 20 different
document categories, and from these, seven were chosen: documents deemed to
cover debates, discussions and proposals. An overview of the selected
documents can be seen in table 1. Only the first 3000-4000 words were used from
the longer document types, except for the protocols. This was done with the hope
that this part covers the document's topics well enough. The protocols have
topics distributed throughout the documents and were therefore kept long.</p>
        <p>[Table 1: document type, description, number of documents, average document length and period for the selected document categories.]</p>
        <p>The documents were split up according to parliamentary periods. This is to
be able to compare the terms, but also to avoid doing topic modeling over a long
time span: topics will have varied over time, and this might affect the topic
modeling. The parliamentary periods with their respective document and word counts
can be seen in the table in appendix A.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Newstexts</title>
        <p>To analyze the media, newspaper and magazine articles were downloaded
from the Media Archive provided by Retriever. The access was provided by
the Swedish Union of Tenants (SUT).</p>
        <p>In order to find all the newstexts concerning housing policies, a search term
list was made together with people from SUT who are knowledgeable about housing
policies. See appendix B for the search terms. All newstexts containing the
Swedish word for housing, bostad, in all its forms, and at least one of the words
in the search term list were used. The selected search terms captured
both relevant and irrelevant newstexts; the topic modeling helps us sort out the
relevant ones for further analysis.</p>
        <p>All the available newstexts were originally published on the web; no printed
media is included. The time span of these newstexts is 2000–2015; before 2000
there are no newstexts available. For the topic modeling, the data is split up into
two 5-year periods and one 6-year period, to be able to compare the years and
avoid too long a time span. These periods can be seen in table 2, together with
the number of tokens and documents. In total the newstexts come from 1786
different sources. Most of these sources contribute only a few newstexts,
and there are a few dominant sources.
In order to compare the effects of different linguistic preprocessing, a number of
filters based on linguistic information were designed and applied to a test set of
the data. An example of a filter can be selecting all words in the documents which</p>
        <sec id="sec-3-2-1">
          <title>6 https://www.retriever.se/product/nordens-storsta-mediaarkiv/</title>
          <p>are tagged as nouns or words participating in a speci ed dependency relation.
The lters are described in more detail below.</p>
          <p>A topic model was trained on each of the filtered versions of the test set, and
the models were evaluated using semantic coherence and human judgement, as
described below.</p>
          <p>The parliamentary period 2010–2014 from the Riksdag was chosen as the
test set. The combination of filters resulting in the highest rated model on
this test set was used for the rest of the parliamentary periods of the Riksdag
data, which are then used for exploration of the data.</p>
          <p>As previously stated, the language in the two data sets differs, and because of
this the highest rated combination of filters for the Riksdag data is not used for
the newstexts. Instead, the top five highest rated combinations of filters from
the Riksdag are tested on the newstexts, with the hope that the positive effects
of these filters are general enough to be useful in this new domain. The five
resulting models from the newstexts are then evaluated in the same way as the
Riksdag data.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>Preprocessing and linguistic filters</title>
        <p>Punctuation and numbers are removed from all documents, and all words are
lower-cased. Frequent and rare words are also removed: words which occur in 50%
or more of the documents, and words which occur in fewer than 5 documents.
Here, this frequency filtering is referred to as filter 1. Unless stated
otherwise, it is applied to all documents.</p>
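        <p>As a rough illustration, filter 1 can be sketched in plain Python. This is a stdlib-only analogue of what Gensim's Dictionary.filter_extremes does; the function name and the thresholds-as-arguments are our own:</p>
        <preformat>
```python
from collections import Counter

def frequency_filter(docs, min_docs=5, max_doc_share=0.5):
    """Filter 1: drop tokens appearing in fewer than min_docs documents
    or in at least max_doc_share (here 50%) of all documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    n = len(docs)
    keep = {w for w, df in doc_freq.items()
            if df >= min_docs and max_doc_share > df / n}
    return [[w for w in doc if w in keep] for doc in docs]
```
        </preformat>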
        <p>A stop list was used, also defined as a filter. This list was made from a general
stop list for Swedish, but it was necessary to manually add domain-specific words
to it.</p>
        <p>Through the Korp annotation there is information about the lemma, part of
speech and dependency relation of every token. From this, a lemma filter was
constructed, which simply replaces words with their lemmas.</p>
        <p>
          Three filters based on part of speech were tested. The first filter uses all
parts of speech, called all POS. The second filter removes all words which are
not nouns, verbs, adjectives or participles, from here on called POS2. The
third, following [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], uses only nouns.
        </p>
        <p>A filter based on dependency relations was also made. This filter only uses
words participating in seven specified dependency relations, chosen with the
aim of finding the meaningful parts of the sentence. These relations are: agent,
object adverbial, direct object, predicative attribute, place adverbial, subject
predicative complement and other subjects.</p>
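        <p>The part-of-speech and dependency filters can be sketched as follows, assuming each annotated token is available as a (word, pos, deprel) triple. The SUC-style tag names and relation labels in the example are illustrative assumptions, not the exact Korp labels:</p>
        <preformat>
```python
# Each annotated token: (word, part-of-speech tag, dependency relation).
# Tag and relation names below are illustrative, SUC-like assumptions.
POS2_TAGS = {"NN", "VB", "JJ", "PC"}  # nouns, verbs, adjectives, participles

def pos_filter(tagged_doc, keep_tags):
    """Keep only words whose part-of-speech tag is in keep_tags."""
    return [word for word, pos, deprel in tagged_doc if pos in keep_tags]

def deprel_filter(tagged_doc, keep_rels):
    """Keep only words participating in one of the given dependency relations."""
    return [word for word, pos, deprel in tagged_doc if deprel in keep_rels]
```
        </preformat>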
        <p>In table 3 an overview of the combinations of filters tested is shown. If nothing
else is stated, all filters had the frequency filter 1 applied. All groups are tested
without the frequency filter, with lemmatization, and with lemmatization and a stop
list. The all POS and POS2 groups are also tested with filters based on
dependency relations. The POS2 group was chosen for further investigation and
thus has 5 more filters applied to it.</p>
        <p>
          The linguistic filters applied to the newstext data can be seen in table 4. These
filters were chosen based on the results from the topic modeling of the Riksdag
data and manual inspection. Through the initial manual inspection, using only
a frequency filter was found to work better for the newstext data than for the
Riksdag data. The stop list for the Riksdag data was also made up of domain-specific
words and couldn't be reused. Because of this, instead of making a new stop list, a
new frequency filter was made. The alternative filter, named filter 2, removes
the 300 most frequent tokens in the data and tokens that occur in 75% of the
documents.
The topic modeling was implemented using the Python library Gensim. The
LDA implementation in Gensim uses a modified version of variational Bayes,
made to handle documents in a stream, which makes handling large corpora
more effective [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ][
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Part of the evaluation was also carried out with methods
in the library; see the next section.
        </p>
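        <p>For orientation, the corpus format that Gensim's LdaModel consumes, each document as a sparse list of (token id, count) pairs, can be sketched with the standard library alone. The build_vocab and doc2bow functions here are simplified stand-ins for Gensim's Dictionary class and its doc2bow method:</p>
        <preformat>
```python
from collections import Counter

def build_vocab(docs):
    """Assign each token an integer id (a stand-in for gensim's Dictionary)."""
    return {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}

def doc2bow(doc, vocab):
    """One document as sparse (token_id, count) pairs, the format
    Gensim's LdaModel consumes."""
    counts = Counter(doc)
    return sorted((vocab[w], c) for w, c in counts.items() if w in vocab)
```
        </preformat>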
        <p>When training an LDA model, the number of topics needs to be provided.
Guided by previous papers, experiments were run with between 50 and 200 topics. After</p>
        <sec id="sec-3-3-1">
          <title>7 https://radimrehurek.com/gensim/</title>
          <p>manual inspection, 75 topics were selected for the filter tests. Other
than this, the default configurations of Gensim were used.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>Evaluation</title>
        <p>
          There are several ways to evaluate a topic model. It has previously been shown
that the held-out likelihood of a model doesn't always correspond to human
judgement [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. Here the focus lies instead on the interpretability of the generated
topics. This is evaluated both computationally and with humans. Using the
coherence model available in Gensim, the two semantic coherence measures cv
and npmi were calculated. These measures calculate the semantic coherence
between the words in a topic by using probabilities derived from word co-occurrence
statistics. If a topic has high coherence between its words, it is presumably also
a good topic. The two measures differ in how the probabilities are calculated;
see [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] for more details. [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] also finds cv to be the best measure, but is
contradicted by [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], who finds npmi to be the best measure, and therefore these are
compared.
        </p>
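        <p>A simplified sketch of the npmi idea: average the normalized pointwise mutual information over all word pairs in a topic. Probabilities are estimated here from boolean document co-occurrence, which is a simplification; Gensim's CoherenceModel estimates them with sliding windows and other segmentations:</p>
        <preformat>
```python
import math
from itertools import combinations

def npmi_coherence(topic_words, docs):
    """Average NPMI over all word pairs in a topic, with probabilities
    estimated from boolean document co-occurrence."""
    sets = [set(d) for d in docs]
    n = len(sets)

    def p(*words):
        # probability that all given words occur together in a document
        return sum(all(w in s for w in words) for s in sets) / n

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = p(w1), p(w2), p(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # words never co-occur: minimum NPMI
        elif p12 == 1.0:
            scores.append(1.0)   # words always co-occur: maximum NPMI
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores)
```
        </preformat>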
        <p>To assess the performance of the coherence measures and evaluate topic
quality, human judgements were collected. Before this, a short manual inspection of
the models was done by the authors. This resulted in two models being
disregarded because they contained mostly useless topics. The rest of the models
were kept, 16 in total. These models can be seen in table 6 in the next section.</p>
        <p>Six evaluators each rated 8 models, with three people rating the same 8
and the other three rating the other 8. In total, there are human judgements for 16
models. The evaluators were between 20 and 30 years old, all native Swedish speakers,
with an education level of undergraduate or above. There was an equal
gender division.</p>
        <p>
          Following [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] and [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], the evaluators were asked to assess the
understandability of the top 10 words from each topic. The instructions given for the rating
can be seen in table 5. The instructions are translated from Swedish.
        </p>
        <p>For each topic, the mean of the human ratings was calculated, and the
correlation between these ratings and the coherence measures was then calculated
using Pearson's r. As stated in the previous section, five models corresponding to
the five top rated combinations of filters from the Riksdag test set were chosen
for this.</p>
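        <p>The correlation computation amounts to the following stdlib-only sketch of Pearson's r; in practice one would use e.g. scipy.stats.pearsonr:</p>
        <preformat>
```python
import math

def pearson_r(xs, ys):
    """Pearson's r between mean human ratings and a coherence measure."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```
        </preformat>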
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>Below, the results for the models trained on the filtered Riksdag data are
presented. In table 6 all the models with their ratings are shown. The table also
shows the mean human rating together with the number of 3's (from the
mean rating) for each of the topics. The maximum number of 3's is 75, which
would mean all human evaluators gave all topics a score of 3. The percentage of
the original number of words is also shown; however, this number doesn't seem
to have an effect on the ratings.</p>
      <p>The highest rated model is the one with only nouns, a stop list and the
frequency filter, filter 1 (words occurring in more than 50% of the documents
and words occurring in fewer than 5 documents are removed). The words are also
lemmatized. In second place comes the same model, but without a stop list.
The following top ranked models are from the POS2 group, but without
lemmatization. The third highest rated model is also filtered based on dependency
relations.</p>
      <p>For the models using all parts of speech, using a stop list significantly
improves the results, as expected. Applying frequency filter 1 also improves the
results. In fact, in the POS2 group, the frequency filter has a better effect than
the stop list when used alone.</p>
      <p>The dependency relations filters have varying effects. This can be seen by
comparing all parts of speech with and without dependency relations, where the
dependency relations filters have a lower ranking. The same is seen when comparing
the corresponding models in the POS2 group. However, the POS2 model with a stop
list and dependency relations but without lemmatization has a high score. The
POS2 model with no filter except the dependency filter also has a high score.</p>
      <p>In the POS2 group, models using lemmatized words have lower ratings than
their respective models without lemmatization. However, the NN models using
lemmatized words have a higher score than all the POS2 models.</p>
      <p>[Table 6: filter combinations (all POS, POS2 and NN groups, with and without frequency filter, lemmatization, stop list and dependency relations) with mean human rating, number of 3's and percentage of all words used.]</p>
      <p>The results from the human judgements for the newstexts can be seen in
table 7. The highest rated models differ from those for the Riksdag data. Here, the highest
rated model is from the POS2 group, with frequency filter 2 and no lemmatization,
as opposed to lemmatized nouns with a stop list, which had the highest scores
for the Riksdag. The second place is the same as for the Riksdag, but the rest of the
models have different rankings. Note that frequency filter 2 replaces a stop
list here. The mean ratings and numbers of 3's are lower overall for the newstext
data than for the Riksdag data.</p>
      <p>When inspecting the topics from the different filters, a few patterns were
found. In all topics, nouns were the most frequent part of speech, regardless of
POS filter. Non-lemmatized topics had more repetition of the same words in
different word forms. The dependency relations filters captured mostly nouns due to
the nature of the chosen relations, but these topics were still not rated as high
as the others.</p>
      <p>The rankings from the two coherence measures, cv and npmi, did not
correspond to the human rankings for the Riksdag test set; cv, however, has the top
ranked model as the second best model. The calculated correlation for the cv
measure is almost always higher than for npmi, with mean correlations of
0.68 and 0.60, respectively. Both have the highest correlation for the model ranked
top by humans, and both have lower correlations for the models with
dependency relations filters, compared to the other models. See appendix C for more
details.</p>
      <sec id="sec-4-1">
        <title>Exploring the public discourse</title>
        <p>The highest rated combination of filters from the Riksdag data, lemmatized
nouns with a stop list, was used on the rest of the data. The resulting models
and classifications of documents are used here to exemplify how one can use topic
modeling for examining public discourse. The same was done for the newstexts,
but with the highest rated model for this data, the POS2 group with filter 2.</p>
        <p>For the Riksdag, the topics for each period were manually inspected, and in
every period a topic corresponding to housing policies was found; in some of the
periods, two topics were found. In the newstexts, more topics relating
to housing policies were found than in the Riksdag data, due to the selection process.</p>
        <p>With this information, one can track changes in the topic over time. For
example, figure 1 shows the proportion of documents in all the motions which
contain more than 0.35 of this topic; documents with a lower proportion of the
housing policies topic were filtered out. Inspecting the figure, one can see that
the topic has peaks in 1998–2002 and 1976–1979.</p>
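        <p>The filtering step behind figure 1 can be sketched as follows, assuming each document's topic distribution is available as a mapping from topic id to proportion (the data structure is an assumption for illustration):</p>
        <preformat>
```python
def topic_share(doc_topics, topic_id, threshold=0.35):
    """Share of documents whose proportion of `topic_id` is at least
    `threshold`, i.e. the quantity plotted per period in figure 1."""
    hits = sum(1 for dist in doc_topics if dist.get(topic_id, 0.0) >= threshold)
    return hits / len(doc_topics)
```
        </preformat>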
        <p>To further inspect the data, interactive plots were made with the help of the
Python library Bokeh. A static version is seen in figure 2. It shows all
documents, not just the ones containing the 'housing policies' topics. The
documents on the y-axis are in chronological order. As can be seen in the screenshot,
when hovering the mouse over a square, the name of the document it represents
is shown, in this case Livet efter skyddat boende (Life after protected housing).
The topic is unnamed, but the top ten words of the topic are displayed. They
include våld (violence), kvinna (woman) and barn (children). The proportion
of the topic is also shown. Together with the title, one can assume that the
document is classified in a correct way. This interactive plot or visualization is
thus both a way to explore the data and a way to examine how the model
classifies documents.</p>
        <p>With these kinds of plots, co-occurring topics can also be examined. Figure 3
is based on newstexts and shows the mean of each topic for every month during
2014. Only newstexts containing a topic labeled the lack of housing are used.
The lack of housing topic itself (nr 25) is removed, to be able to see the other
topics more clearly.</p>
        <p>In the figure, topic nr 33, which is about student housing, co-occurs slightly
more during July, August and September, possibly due to the start of the
academic year in September. Topic nr 67, which concerns political parties
and politics, has a strong peak in August. In September 2014, general elections
were held in Sweden, which could explain this peak. Other frequent topics are
nr 39 and 57: 39 is about investments and growth, and 57 is a topic of
general words such as said.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>In this work we have shown how one can examine the discourse of Swedish
housing policies with the help of topic modeling. The method is deemed suitable
for the intended analysis, although there is more work to do for a full analysis
of the public discourse.</p>
      <p>By using human evaluators, the effects of different kinds of linguistic
preprocessing were investigated. Of the three categories investigated here, part of
speech had the largest impact on the results. Using only nouns improved the topics.
Models based on verbs, adjectives, participles and nouns also improved the topics;
however, the most frequent part of speech in these models is nouns. Lemmatized
data is not rated as high as non-lemmatized data, but without
lemmatization the same words are repeated in the topics. This might have an effect on the
topics' usefulness and interpretability, and it is thus unclear if non-lemmatized</p>
      <sec id="sec-5-1">
        <title>8 https://bokeh.pydata.org/en/latest/</title>
        <p>data is preferred. Using data selected based on dependency relations does not
result in topics with high ratings; however, this might change if one uses different
dependency relations. The evaluation of the topic models showed that the cv
measure has a better correlation with human judgements than the npmi
measure. Both of the measures have the highest correlation for models using only
nouns.</p>
        <p>Acknowledgments. This work has been supported in part by a framework
grant for the project Towards a knowledge-based culturomics, awarded by the
Swedish Research Council (contract 2012-5738).</p>
        <p>This work has also been carried out with the support of the Swedish Union
of Tenants, which has provided part of the data used.</p>
      </sec>
      <sec id="sec-5-2">
        <title>9 https://spraakbanken.gu.se/eng/culturomics 10 https://www.hyresgastforeningen.se/</title>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Appendix A - Parliamentary periods for the Riksdag data</title>
      <p>Appendix B - Search terms for newspapers and magazines</p>
    </sec>
    <sec id="sec-7">
      <title>Appendix C - Top 5 models from the Riksdag compared to cv and npmi measures</title>
      <p>Top 5 models (human judgement); mean human rating; nr of 3's:
NN, Lemma, Stop 2.489 27
NN, Lemma 2.409 24
POS 2, Stop, Deprel 2.351 16
POS 2, only freq filter 2.249 14
POS 2, Stop 2.236 13</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Viklund</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Borin</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <source>How Can Big Data Help Us Study Rhetorical History? In Selected Papers from the CLARIN Annual Conference 2015, October</source>
          <volume>14</volume>
          –
          <fpage>16</fpage>
          ,
          <year>2015</year>
          , Wroclaw, Poland, number
          <volume>123</volume>
          (pp.
          <volume>79</volume>
          –
          <fpage>93</fpage>
          ).: Linkoping University Electronic Press.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Tahmasebi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Borin</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Capannini</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubhashi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Exner</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Forsberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gossen</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johansson</surname>
            ,
            <given-names>F. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johansson</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , Kageback,
          <string-name>
            <surname>M.</surname>
          </string-name>
          , et al. (
          <year>2015</year>
          ).
          <article-title>Visions and open challenges for a knowledge-based culturomics</article-title>
          .
          <source>International Journal on Digital Libraries</source>
          ,
          <volume>15</volume>
          (
          <issue>2-4</issue>
          ),
          <volume>169</volume>
          –
          <fpage>187</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>Studying the history of ideas using topic models</article-title>
          .
          <source>In Proceedings of the conference on empirical methods in natural language processing</source>
          (pp.
          <volume>363</volume>
          –
          <fpage>371</fpage>
          ).:
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>DiMaggio</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nag</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Exploiting affinities between topic modeling and the sociological perspective on culture: Application to newspaper coverage of US government arts funding</article-title>
          .
          <source>Poetics</source>
          ,
          <volume>41</volume>
          (
          <issue>6</issue>
          ),
          <fpage>570</fpage>
          -
          <lpage>606</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Jacobi</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Atteveldt</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Welbers</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Quantitative analysis of large amounts of journalistic texts using topic modelling</article-title>
          .
          <source>Digital Journalism</source>
          ,
          <volume>4</volume>
          (
          <issue>1</issue>
          ),
          <fpage>89</fpage>
          -
          <lpage>106</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A. Y.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M. I.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>Latent Dirichlet allocation</article-title>
          .
          <source>Journal of Machine Learning Research, 3(Jan)</source>
          ,
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Hojer, H. (
          <year>2017</year>
          ).
          <article-title>Darfor kan byggboomen inte losa bostadskrisen</article-title>
          .
          <source>Forskning och Framsteg, (2)</source>
          ,
          <fpage>61</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>More efficient topic modelling through a noun only approach</article-title>
          .
          <source>In Australasian Language Technology Association Workshop 2015</source>
          (pp.
          <fpage>111</fpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Lau</surname>
            ,
            <given-names>J. H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Newman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Baldwin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality</article-title>
          . In EACL (pp.
          <fpage>530</fpage>
          -
          <lpage>539</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Jockers</surname>
            ,
            <given-names>M. L.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Macroanalysis: Digital methods and literary history</article-title>
          . University of Illinois Press. pp:
          <fpage>128</fpage>
          -
          <lpage>133</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Fang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Si</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Somasundaram</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Mining contrastive opinions on political texts using cross-perspective topic model</article-title>
          .
          <source>In Proceedings of the fifth ACM international conference on Web search and data mining</source>
          (pp.
          <fpage>63</fpage>
          -
          <lpage>72</lpage>
          ). ACM.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Using Dependency Parses to Augment Feature Construction for Text Mining</article-title>
          . Virginia Polytechnic Institute and State University.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Borin</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Forsberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Roxendal</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Korp - the corpus infrastructure of Språkbanken</article-title>
          . In LREC (pp.
          <fpage>474</fpage>
          -
          <lpage>478</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Borin</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Forsberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hammarstedt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schumacher</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp; Schafer, R. (
          <year>2016</year>
          ).
          <article-title>Sparv: Språkbanken's corpus annotation pipeline infrastructure</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Řehůřek</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Sojka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Software framework for topic modelling with large corpora</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Hoffman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bach</surname>
            ,
            <given-names>F. R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Online learning for latent dirichlet allocation</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          (pp.
          <fpage>856</fpage>
          -
          <lpage>864</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boyd-Graber</surname>
            ,
            <given-names>J. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gerrish</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>Reading tea leaves: How humans interpret topic models</article-title>
          .
          <source>In NIPS</source>
          , volume
          <volume>31</volume>
          (pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18. Roder,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Both</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            , &amp;
            <surname>Hinneburg</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Exploring the space of topic coherence measures</article-title>
          .
          <source>In Proceedings of the eighth ACM international conference on Web search and data mining</source>
          (pp.
          <fpage>399</fpage>
          -
          <lpage>408</lpage>
          ). ACM.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>van der Zwaan</surname>
            ,
            <given-names>J. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marx</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kamps</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Topic Coherence for Dutch</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Newman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lau</surname>
            ,
            <given-names>J. H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grieser</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Baldwin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Automatic evaluation of topic coherence</article-title>
          .
          <source>In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics</source>
          (pp.
          <fpage>100</fpage>
          -
          <lpage>108</lpage>
          ). Association for Computational Linguistics.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>