<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Finding Niche Topics using Semi-Supervised Topic Modeling via Word Embeddings</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gerald Conheady</string-name>
          <email>gerry.conheady@ucdconnect.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Derek Greene</string-name>
          <email>derek.greene@ucd.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computer Science, University College Dublin</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Topic modeling techniques generally focus on the discovery of the predominant thematic structures in text corpora. In contrast, a niche topic is made up of a small number of documents related to a common theme. Such a topic may have so few documents relative to the overall corpus size that it fails to be identified when using standard techniques. This paper proposes a new process, called Niche+, for finding these kinds of niche topics. It assumes interactions with a user who can provide a strictly limited level of supervision, which is subsequently employed in semi-supervised matrix factorization. Furthermore, word embeddings are used to provide additional weakly-labeled data. Experimental results show that documents in niche topics can be successfully identified using Niche+. These results are further supported via a use case that explores a real-world company email database.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In certain text corpus exploration tasks, users will be primarily interested in the
predominant topics that naturally appear in the data. At other times, users will
aim to discover documents related to a selection of topics of particular interest.
We define a niche topic as a small set of documents from a corpus that the
user considers to be linked together by a highly-coherent theme. It is expected
that a user can provide example documents and typical words for a niche topic,
i.e. a limited level of supervision. An example of this might be a ‘data breach’
topic, where a user is interested in the discovery of unacceptable leaks of patient
data through a health organization’s email database. There might be millions
of emails in the database and the number of data breaches would be expected
to be low, therefore we could naturally regard this as a niche topic within the
overall dataset. Ideally, a user investigating this data could provide
a small sample of emails and terms related to data breaches, so as to help to
discover other related content. A second related example might be a ‘Product
Functionality Query’ topic, where a user is interested in the discovery of content
from customers who have been querying the functionality of a product.
Unsupervised algorithms, such as Non-negative Matrix Factorization (NMF) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
have been used to uncover the underlying topical structure in unlabeled text
corpora [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. NMF might potentially identify a topic associated with a small
number of documents, when a very large number of topics is specified. However,
this has computational implications and also leads to challenges in interpreting
the resulting topic model. Specifically, it is often impractical to ask a user to
scan through hundreds or thousands of topics in order to find one or two niche
topics. Semi-supervised NMF (SS-NMF) algorithms have been proposed which
use background information, in the form of word and document constraints, to
guide the factorization process in order to produce more useful topic models [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
The Niche+ process described later in this paper uses SS-NMF techniques.
Word embeddings have been applied in a range of natural language processing
tasks, where words are represented by vectors in a vector space [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Words with
related meanings will tend to be close together in this space. In Section 3 we
apply the Weak+ approach to weak supervision for topic modeling [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which uses word
embeddings to generate additional “weakly-labeled” data. This supervision takes
the form of a list of candidate words that are semantically related to a small
number of “strong” words supplied by an expert to describe a topic.
Our experiments on annotated corpora in Section 4 show that, when this weak
supervision is fed to Niche+, other example documents from the niche topic
can be found. In Section 5 we describe a use case involving a real-world email
corpus provided by an enterprise software manufacturer. We show that, by using
highly-limited supervision, the Niche+ process can successfully identify specific
topics of interest from among a larger set of more general topics.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <sec id="sec-2-1">
        <title>Topic Modeling</title>
        <p>
          Topic modeling allows for the discovery of themes in a collection of documents
in an unsupervised manner. It differs from keyword search, which tries to match
documents directly to a particular subset of words or phrases; in topic
modeling, themes are instead based on grouping documents that share a similar usage of
words. While probabilistic approaches have often been used for topic modeling,
approaches based on NMF [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] have also been successful.
        </p>
        <p>
          Semi-supervised learning often involves using limited labeled data to improve
the performance of algorithms which are normally unsupervised. For instance,
methods have been proposed for incorporating constraints into matrix
factorization [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. For text data, this typically involves providing supervision in the form
of constraints imposed on documents and terms, suggested by a human expert
who is often referred to as the “oracle”. The Utopian system, which implements
this approach, has demonstrated improved topic modeling results [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
Labeled data for semi-supervised learning is best provided through continuous
human interaction with the algorithm [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. In topic modeling, a key practical
challenge is to provide a user with an easy way to explore a large collection of
text. The user needs to be free to select from the output of the topic model to
highlight areas for improvement or further analysis. The ability to manipulate
both documents and terms in a topic is needed. This approach is used in our
experiments where the oracle is asked to provide topic documents and topic words
for supervision, in order to guide the topic modeling process. The oracle is also
asked to provide feedback on the documents found in the first run to determine
whether they belong to the topic or not. This information is used to provide
negative supervision during the second run. In other words, the feedback is used
to exclude the documents and words from the niche topic.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Word Embeddings</title>
        <p>
          NMF does not directly take into account semantic associations between words.
Related meanings of words, such as between ‘car’ and ‘bus’, do not explicitly
influence the factorization process. Techniques based on word embeddings attempt
to take into account the semantic relatedness between pairs of words, as derived
from a large corpus of text. Many applications of word embeddings are based
on the use of a neural network as in the original word2vec model [
          <xref ref-type="bibr" rid="ref9">9</xref>
]. The input
and output layers have one entry for each of the n words in the vocabulary, while the
hidden layer has d entries, corresponding to the embedding dimension. The weights
learned between the layers yield an n × d matrix in which each row is a word vector.
This representation captures the semantic associations between words in a corpus.
        </p>
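<p>As a toy illustration of this idea (the vocabulary, dimensions and values below are invented, not taken from a trained model), the relatedness of two words can be read off as the cosine similarity of their rows in the n × d matrix:</p>
<preformat>
```python
import numpy as np

vocab = {"car": 0, "bus": 1, "banana": 2}   # toy vocabulary (n = 3)
E = np.array([[0.9, 0.1, 0.0],              # n x d embedding matrix (d = 3)
              [0.8, 0.2, 0.1],
              [0.0, 0.1, 0.9]])

def similarity(w1, w2):
    # Cosine similarity between the two words' embedding rows.
    a, b = E[vocab[w1]], E[vocab[w2]]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```
</preformat>
<p>Here similarity("car", "bus") is high while similarity("car", "banana") is close to zero, mirroring the ‘car’/‘bus’ example above.</p>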
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Methods</title>
      <sec id="sec-3-1">
        <title>Characterizing Niche Topics</title>
        <p>The characteristics of a niche topic might typically include both its
distinctiveness compared to the overall corpus and the heterogeneity of the niche. The
distinctiveness influences how easy it is to find documents in the niche topic and
can be measured using the cosine ratio. Given a corpus of documents assigned to
k topics, where each document is assigned to one topic, we quantify the cosine
ratio as follows. Firstly, we compute a topic-topic similarity matrix S, where an
off-diagonal entry Sij indicates the mean cosine similarity between all pairs of
documents in topic i and topic j, and a diagonal entry Sii indicates the mean
cosine similarity between all pairs of documents in the same topic i. We refer
to Sii as the within-topic similarity for topic i. The between-topic similarity for
topic i is the average of the values Sij where j ≠ i. The cosine ratio for topic i is
its within-topic similarity divided by its between-topic similarity. A higher value
for this ratio indicates a niche topic which is more coherent and well-separated
relative to the rest of the topics present in the corpus.</p>
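<p>The cosine ratio described above can be sketched as follows (a hypothetical helper, not the authors' code; X holds L2-normalized TF-IDF rows and labels assigns each document to one of k topics):</p>
<preformat>
```python
import numpy as np

def cosine_ratio(X, labels, k):
    # Topic-topic similarity matrix S: mean cosine similarity between all
    # pairs of documents in topic i and topic j (rows of X pre-normalized).
    sims = X @ X.T
    S = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            block = sims[np.ix_(labels == i, labels == j)]
            if i == j:
                n = block.shape[0]
                # exclude self-similarities on the diagonal of the block
                S[i, i] = (block.sum() - n) / max(n * n - n, 1)
            else:
                S[i, j] = block.mean()
    # within-topic similarity divided by mean between-topic similarity
    within = np.diag(S)
    between = (S.sum(axis=1) - within) / (k - 1)
    return within / between
</preformat>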
        <p>The heterogeneity of a niche topic can be established by a manual review of the
sub-themes of documents within the niche. For instance, sub-themes of a topic
such as “sport” could relate to soccer, rugby and tennis. Although clearly part
of the “sports” topic, the vocabulary of the documents would be specific to each
sub-theme. For a small group of niche documents, it is expected that the more
sub-themes it contains, the more heterogeneous it is, and the more difficult
its documents are to find.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Weak+</title>
        <p>
          The Weak+ approach has been proposed to provide a form of limited
supervision for topic modeling, where word embeddings are used to generate additional
“weakly-labeled” data. The Wikipedia word2vec [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] model provides an
excellent source of generic semantic relationships of words. However, it cannot fully
reflect the idiosyncratic semantic relationships between words within individual
subject domains. In order to overcome this limitation, supervision words are
first chosen from a word2vec model generated from the corpus. These words are
only selected if they also appear in the top 500 similar words coming from the
Wikipedia word2vec model.
        </p>
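<p>A minimal sketch of this selection rule (illustrative only: the helper name and the toy neighbor lists are invented, and in practice the ranked neighbor lists would come from the two trained word2vec models):</p>
<preformat>
```python
def weak_plus_candidates(seed_words, corpus_neighbors, wiki_neighbors,
                         corpus_top=10, wiki_top=500):
    # For each seed word, keep a corpus-model neighbor only if the generic
    # Wikipedia model also ranks it among the seed's top similar words.
    candidates = []
    for seed in seed_words:
        wiki_set = set(wiki_neighbors.get(seed, [])[:wiki_top])
        for word in corpus_neighbors.get(seed, [])[:corpus_top]:
            if word in wiki_set and word not in candidates:
                candidates.append(word)
    return candidates
```
</preformat>
<p>This keeps corpus-specific neighbors that are also generically plausible, filtering out idiosyncratic corpus noise.</p>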
      </sec>
      <sec id="sec-3-3">
        <title>Niche+</title>
        <p>
          We now discuss the Niche+ approach for identifying niche topics in a corpus.
It uses a semi-supervised strategy based on a simplified version of the
Utopian algorithm [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The oracle-provided documents and words and the Weak+
“weakly-labeled” words are input to Niche+ to provide supervision. The
relevant notation used for this discussion is summarized in the table below.
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>Notation</title>
        <p>m: number of documents in the corpus
n: number of words in the corpus
k: user-specified number of topics
A ∈ R^(m×n): document-term matrix
W ∈ R^(m×k): document by topic matrix
H ∈ R^(k×n): topic by word matrix
Wr ∈ R^(m×k): supervision matrix with topic weights for documents in W
Hr ∈ R^(k×n): supervision matrix with topic weights for words in H
MW ∈ R^(k×k): masking matrix for W, with cells set to 1 for supervised topics
MH ∈ R^(n×n): masking matrix for H, with cells set to 1 for supervised topics
DH ∈ R^(n×n): diagonal matrix used for automatic scaling</p>
      </sec>
      <sec id="sec-3-5">
        <title>Objective Function and Updates</title>
        <sec id="sec-3-5-1">
          <title>Simplified Factorization</title>
          <p>The Utopian matrix factorization algorithm minimizes the objective in (1):

min_{W,H ≥ 0} ||A − WH||²_F + ||(W − Wr)MW||²_F + ||(H − Hr)DH MH||²_F   (1)

This requires the H matrix to be recalculated column by column in every
iteration until the stopping criterion is reached, which demands large resources
for a corpus with a large vocabulary. As a result, a simpler form of the objective
was adopted, as in (2), which only requires the H matrix to be updated once per
iteration. The diagonal matrix DH is no longer required and is eliminated:

min_{W,H ≥ 0} ||A − WH||²_F + ||(W − Wr)MW||²_F + ||(H − Hr)MH||²_F   (2)

The non-negativity constrained least squares solver with active-set method and
column grouping [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], nnlsm_activeset, is used. The W matrix continues to be updated
as in (3), per the Utopian algorithm, where [X; Y] denotes vertical stacking:

W ← argmin_{W ≥ 0} || [Hᵀ; MW] Wᵀ − [Aᵀ; MW Wrᵀ] ||²_F   (3)

The update process for the H matrix is changed, as in (4):

H ← argmin_{H ≥ 0} || [W; MW] H − [A; Hr MH] ||²_F   (4)

The nnlsm_activeset algorithm solves for X in min_{X ≥ 0} ||AX − B||²_F, with the
non-negativity constraint applied element-wise, and is used to solve for both
W and H. Positive supervision is implemented by setting the supervision weights
for documents and words to positive values, and negative supervision by setting
the weights to 0. Niche+, building on the original SS-NMF process, is carried
out using the following steps:</p>
        </sec>
        <sec id="sec-3-5-2">
          <title>Process Steps</title>
          <p>1. Normalize the document-term matrix using TF-IDF.
2. Reconstitute the documents from the document-term matrix.
3. Generate an extended list of supervision words using the Weak+ process.
4. Apply the SS-NMF algorithm on the document-term matrix A:
(a) Initialise the W and H matrices with random values. Initialize the Wr,
Hr, MW and MH matrices with zeros.
(b) Set the Wr and W matrix weights for each document for the topic to
be supervised.
(c) Set the Hr and H matrix weights for each word to be supervised.
(d) Set the MW and MH weights for the topic to be supervised.
(e) Select another topic and set its weights to be the reverse of the supervised
topic.
(f) Repeat until the objective converges or the maximum number of iterations
is reached:
i. Using nnlsm_activeset, solve for H and then for W.
ii. Recalculate the objective.</p>
          <p>The Niche+ process guides the discovery of niche topic documents using the
words and documents provided by the user, along with the extended list of
semantically linked words provided by Weak+.</p>
        </sec>
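<p>Step 4(f)i can be sketched as follows, with a simple projected gradient solver standing in for the nnlsm_activeset routine (the function names are illustrative; only the stacked form of update (4) follows the text):</p>
<preformat>
```python
import numpy as np

def nnls_pg(A, B, iters=5000):
    # Solve min_{X >= 0} ||A X - B||_F^2 by projected gradient descent,
    # a simple stand-in for the active-set solver used in the paper.
    X = np.zeros((A.shape[1], B.shape[1]))
    step = 1.0 / np.linalg.norm(A.T @ A, 2)   # 1/L, L = largest eigenvalue
    for _ in range(iters):
        grad = A.T @ (A @ X - B)
        X = np.maximum(X - step * grad, 0.0)  # project onto the feasible set
    return X

def update_H(A, W, Hr, MW, MH):
    # Stacked least squares system from update (4): [W; MW] H vs. [A; Hr MH]
    return nnls_pg(np.vstack([W, MW]), np.vstack([A, Hr @ MH]))
```
</preformat>
<p>The W update in (3) has the same stacked form with the roles of W and H transposed.</p>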
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <sec id="sec-4-1">
        <title>Experimental Setup</title>
        <p>
          The aim of our experiments is to investigate whether the Niche+ process can be
used successfully to find niche topics. The corpora listed in Table 1 are used for
the evaluation. They were chosen as they come with a ground truth and provide
niche topics that meet our definition of being a ‘small set of documents from the
corpus that the user considers to be linked together’.
The EU-PR dataset consists of press releases describing activities relating to
the European Parliament across 12 different policy areas [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], where some policy
areas are naturally covered more frequently than others. Our second corpus is
the widely-used 20 Newsgroups (20-NG ) collection of approximately 18K posts
from 20 Usenet newsgroups. The Complaints corpus is a collection of over 66K
records from the US Consumer Financial Complaints dataset provided by
Kaggle, categorized into 90 different types, such as ‘Data Privacy’, ‘Bankruptcy’ and
‘Foreign Currency Exchange’.
        </p>
        <p>For each corpus, we use the topic(s) with the lowest cosine ratio as these are
the most difficult to find. Niche topics are created for the EU-PR and 20-NG
topics by deleting all the documents for those topics, except the first 30. The
Complaints dataset is much larger, and its topics have sizes ranging
from a single document to over 6,000 documents. The ‘Adding Money’, ‘Data
Privacy’ and ‘Bankruptcy’ topics are selected as naturally occurring niche topics.
It is important to minimize the user burden of providing labeled documents.
To simulate this, only the first five unique documents in each topic are used as
oracle documents. We calculate centroid vectors for each annotated ground truth
topic, and then rank the corresponding words based on their centroid weights. In
this way we can select the top five words as the oracle-given words for the niche.
These words are used to construct word2vec embeddings on each corpus using
a skip-gram model, with vectors of 100 dimensions and the document frequency
threshold set to a minimum of 5.</p>
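<p>The centroid-based selection of oracle words can be sketched as follows (a hypothetical helper; X is the document-term matrix and vocab maps column indices to words):</p>
<preformat>
```python
import numpy as np

def oracle_words(X, doc_ids, vocab, top=5):
    # Centroid vector of the ground truth topic's documents, then the
    # words ranked by their centroid weights.
    centroid = np.asarray(X[doc_ids]).mean(axis=0)
    ranked = np.argsort(centroid)[::-1][:top]
    return [vocab[i] for i in ranked]
```
</preformat>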
        <p>We next run the Niche+ process with step 4 repeated 50 times. We fix the
number of topics k to be the number of ground truth topics in each corpus.
Weights are set to 10 for the oracle-given documents and words, and to 1 for
the Weak+ generated words, so that the latter are less influential than the
oracle-given words. The mask weights are set to 1 for the supervised topics. All
other weights are set to 0.</p>
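<p>A sketch of how the supervision and masking matrices might be populated with the weights above (the weight values follow the text; the helper itself and its arguments are illustrative):</p>
<preformat>
```python
import numpy as np

def build_supervision(m, n, k, topic, oracle_docs, oracle_words, weak_words):
    Wr = np.zeros((m, k)); Hr = np.zeros((k, n))
    MW = np.zeros((k, k)); MH = np.zeros((n, n))
    for d in oracle_docs:
        Wr[d, topic] = 10.0          # oracle-given documents
    for w in oracle_words:
        Hr[topic, w] = 10.0          # oracle-given words
    for w in weak_words:
        Hr[topic, w] = 1.0           # Weak+ words: less influential
    MW[topic, topic] = 1.0           # mask weight for the supervised topic
    for w in oracle_words + weak_words:
        MH[w, w] = 1.0               # mask entries for the supervised words
    return Wr, Hr, MW, MH
```
</preformat>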
        <p>In order to simulate further feedback from the oracle, the initial runs are followed
by an ‘exclusion’ run. Documents found in the original run that are not part of
the niche topic are subject to negative supervision by setting their weights to 0.
The weights of 5 prominent words from these documents are set to 0 to provide
further negative supervision. The percentage of documents found using Niche+
is compared to a word frequency search based on the oracle words.
</p>
      </sec>
      <sec id="sec-4-2">
        <title>Results</title>
        <p>
          Firstly, we use Normalized Mutual Information (NMI) [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] to measure the
accuracy of document assignments arising from the topic models, relative to the
ground truth document assignments. This is done by counting the number of
correct documents found for each topic using the ground truth labels. The
results show little difference in NMI scores between the runs with and without
niche topic supervision. This is explained by the fact that the Niche+ process
concentrates on improving accuracy for a single topic only. The NMI scores for
the EU-PR and 20-NG corpora range from 0.65 to 0.82 and for the Complaints
corpus from 0.35 to 0.36.
        </p>
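<p>NMI compares the topic-model document assignments against the ground truth partition; scikit-learn's implementation is shown here for illustration:</p>
<preformat>
```python
from sklearn.metrics import normalized_mutual_info_score

truth = [0, 0, 1, 1, 2, 2]
model = [1, 1, 0, 0, 2, 2]   # same partition, permuted topic ids
score = normalized_mutual_info_score(truth, model)
```
</preformat>
<p>NMI is invariant to how topic identifiers are permuted, so a perfect recovery of the partition scores 1.0 regardless of labeling.</p>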
        <p>Next we use the percentage of documents found for each niche topic as a measure
of success. Weightings by topic for each document are output from Niche+.
They are ranked into the top 100 documents for each topic. Figure 1 shows the
percentage of documents that are labeled as being part of the niche topic. An
oracle-based word frequency search is run using the words generated by centroid
calculation for the oracle documents. This is all that can be done when there is
no ground truth, as in our later use case. The results show 70% of documents
are found in the EU PR ‘Antitrust’ topic, from 3% to 23% documents for the 20
Newsgroups topics and from 6% to 10% for the Complaints topics. A ‘best case’
niche topic word frequency search is run using words generated by a separate
centroid calculation for all the documents in the niche topic. This can be done
as we have a ground truth. Improved results show that 77% of documents are
found in the EU PR ‘Antitrust’ topic, from 27% to 40% of documents for the
20 Newsgroups topics and from 6% to 17% for the Complaints topics. Niche+
oracle supervision results are as high as 87% for the ‘Antitrust’ topic, 40% for
the 20 Newsgroups topics and 22% for the Complaints topics. Niche+ oracle
supervision with exclusions showed further improvement reaching 87% for the
‘Antitrust’ topic, 48% for the 20 Newsgroups topics and 30% for the Complaints
topics.
The precision interval analysis in Figure 2 presents the number of niche
documents for different levels of precision, considering the top 10 to 100 documents
found. We see that typically over 80% of documents are found within the first
30 documents.
It is expected that the results will be most successful where the oracle
documents are similar to the niche topic and the niche topic is distinct from
the corpus, i.e. a low oracle-to-niche cosine ratio and a high niche-to-corpus
cosine ratio will find more niche documents. These ratios are shown in Figure 3.
The “antitrust” topic in the EU-PR dataset has the lowest oracle to niche
documents cosine ratio of 3.3. This is reflected in its high precision and recall scores.
The “electronics” topic has the lowest precision and recall scores for the 20-NG
dataset. Its oracle to niche documents cosine ratio is high at 15.8, showing that
the oracle documents do not represent the niche well. The highest 20-NG topic
results are for the “med” topic with an oracle to niche documents cosine ratio
of 7.1. This is slightly higher than the scores of 6.9 and 6.8 for the “religion”
and “politics” topics. However, the cosine ratio for the niche documents to the
corpus documents is 6.8 compared to 5.9 and 5.8 for the “religion” and “politics”
topics, implying that the niche topic for “med” is more distinct than the others.
A similar pattern is seen in the Complaints dataset where the “adding money”
topic has the lowest oracle to niche documents ratio and the “bankruptcy” topic
the highest. The combination of how well the oracle documents reflect the niche
and how distinct the niche is in the corpus determines the success level.
A further manual analysis of the “med” topic reveals a high level of heterogeneity
as defined in Section 3.1. It can be divided into distinct sub-themes such as
‘back pain’, ‘lactose intolerance’, ‘smoking’ and others. All the documents are
clearly linked to the “med” topic. The oracle documents include one related to
the ‘lactose intolerance’ sub-theme and the results include similar documents.
However, the oracle documents do not include any relating to the ‘smoking’
sub-theme and none of the ‘smoking’ documents in the niche topic are found.
Although the relationship between the sub-themes is easily detected manually,
the Niche+ process does not make the connection. The “electronics” topic shows
a few clear sub-themes such as ‘searches for circuits’, ‘data transmission’ and ‘car
radar’. The oracle documents do not contain any documents relating to these
sub-themes and this may explain the poorer performance.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Use Case: Enterprise Email Corpus</title>
        <p>Enterprise email archives can contain hundreds of millions of emails. The ability
to discover niche topics in archives can assist enterprises to audit and manage
business processes. A software manufacturer has extracted 279K emails from
their email archive for our real-world use case. The emails are unlabeled and
cover twelve years of activity from their customer support department. The
niche topics provided for analysis relate to a ‘visa application’, an ‘accounting
package upgrade’ and the ‘moving of email archive volumes’.</p>
        <p>An initial clean-up of the emails is required. Only the subject and the main
body of the email text are used. Details removed include forwarded messages,
original messages, confidentiality notices, signatures, URLs, and email addresses.
Only emails with at least 50 characters are selected for the creation of a
TF-IDF normalized document-term matrix. Words are filtered based on a minimum
document frequency of 30.</p>
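<p>An illustrative clean-up sketch under stated assumptions: the marker strings for forwarded/original messages and confidentiality notices below are examples, not the manufacturer's actual formats.</p>
<preformat>
```python
import re

MARKERS = re.compile(
    r"(?s)(-----Original Message-----|"
    r"---------- Forwarded message ----------|"
    r"This email and any attachments are confidential).*")
SIGNATURE = re.compile(r"(?s)\n--\s*\n.*")   # conventional signature delimiter
URL = re.compile(r"https?://\S+")
EMAIL = re.compile(r"\S+@\S+")

def clean_email(subject, body, min_chars=50):
    # Keep only subject plus cleaned body; drop emails that end up too short.
    text = MARKERS.sub("", body)
    text = SIGNATURE.sub("", text)
    text = URL.sub(" ", text)
    text = EMAIL.sub(" ", text)
    text = (subject + " " + text).strip()
    return text if len(text) >= min_chars else None
```
</preformat>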
        <p>The Weak+ process generates 95 extra words for supervision from the original
user provided words. Table 2 shows the first 10 words generated by user word
for the ‘accounting package’ topic. The words generated are semantically close
to the user words in the context of an accounting package upgrade. The words
selected for ‘quote’ relate to seeking and providing of information relating to the
cost of the accounting package upgrade. This reflects the user’s domain whereas
a more generic approach may have interpreted ‘quote’ as relating to citations
from literature.</p>
        <p>Niche+ is then run to find 20 topics. This choice is based on an inspection of
the data. The supervision weights of the five documents and words supplied by
the user are set to 10. The supervision weights of the Weak+ generated words
are set to 1.
Based on the topic model produced by our approach, the first 100 documents for
each topic are ranked based on their weightings as in Section 4.2. The judgment
of a user expert, who is familiar with the data, is that 94% of the documents
for the ‘Visa’ topic, 49% for the ‘Accounting Package’ and 29% for the ‘Archive’
relate to the topic, as seen in Figure 4. A word frequency search of the email
corpus, as used in Section 4.2, with the user given words, results in finding 12%
of the documents for the ‘Visa’ topic and 14% for the ‘Accounting Package’ and
40% for the ‘Archive’ topic.</p>
        <p>In the case of the ‘Accounting Package’ topic, many of the off-topic documents
relate to other package upgrades, such as Microsoft Windows upgrades. The
documents the user excludes from the ‘Archive’ topic include many relating to
a similar product, from a competitor, that is not of user interest. All documents
and 5 words identified as not belonging to the topic are used for exclusion runs
for each topic, as described in Section 4.1. The number of documents found
increases in all supervision/exclusion runs to 95% for the ‘Visa’ topic, 59% for
the ‘Accounting Package’ and 55% for the ‘Archive’ topic.</p>
        <p>Overall, this use case shows that the Niche+ process can successfully find niche
topics in real-world datasets, such as a large email corpus.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>This paper has shown that input from an oracle (e.g. a “human-in-the-loop”)
during topic modeling can improve results. In particular, when trying to identify
small niche topics in a large unstructured text corpus, a user’s domain expertise
can be essential. An initial set of inputs from a user helps the discovery of such
niche topic documents. A second round of input, in the form of either inclusions
or exclusions, can further improve the results.</p>
      <p>It has also been shown that the cosine ratio is a good predictor of the number of
niche documents that are found. This opens up the opportunity to guide users
in their selection of suitable documents for niche topic supervision by looking at
the cosine ratios for the oracle documents.</p>
      <p>However, the Niche+ process is not always successful in finding documents
relating to sub-themes in the niche that lack oracle examples, such as
the ‘smoking’ sub-theme seen in Section 4.2. The process cannot
currently reach out to semantically linked sub-themes. This will be an area of
further investigation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Sanjeev</given-names>
            <surname>Arora</surname>
          </string-name>
          , Rong Ge, and
          <string-name>
            <given-names>Ankur</given-names>
            <surname>Moitra</surname>
          </string-name>
          .
          <article-title>Learning topic models - Going beyond SVD</article-title>
          .
          <source>Proceedings - Annual IEEE Symposium on Foundations of Computer Science</source>
          , FOCS, pages
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Gerald</given-names>
            <surname>Conheady</surname>
          </string-name>
          and
          <string-name>
            <given-names>Derek</given-names>
            <surname>Greene</surname>
          </string-name>
          .
          <article-title>Weak Supervision for Semi-supervised Topic Modeling via Word Embeddings</article-title>
          .
          <source>Language, Data, and Knowledge (LDK 2017)</source>
          , pages
          <fpage>150</fpage>
          -
          <lpage>155</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>J.P.</given-names>
            <surname>Cross</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Greene</surname>
          </string-name>
          .
          <article-title>Capturing and explaining the policy agenda of the European Commission between 1986-2016: A quantitative text analysis approach</article-title>
          .
          <source>Under review</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Jingu</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <article-title>nonnegfac-python</article-title>
          . github.com.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Patrik</given-names>
            <surname>Ehrencrona Kjellin</surname>
          </string-name>
          .
          <source>A Survey On Interactivity in Topic Models</source>
          .
          <volume>7</volume>
          (
          <issue>4</issue>
          ):
          <fpage>456</fpage>
          -
          <lpage>461</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>D</given-names>
            <surname>Kuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J</given-names>
            <surname>Choo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H</given-names>
            <surname>Park</surname>
          </string-name>
          .
          <article-title>Nonnegative Matrix Factorization for Interactive Topic Modeling and Document Clustering</article-title>
          .
          <source>Partitional Clustering Algorithms</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>28</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>D D</given-names>
            <surname>Lee</surname>
          </string-name>
          and
          <string-name>
            <given-names>H S</given-names>
            <surname>Seung</surname>
          </string-name>
          .
          <article-title>Learning the parts of objects by non-negative matrix factorization</article-title>
          .
          <source>Nature</source>
          ,
          <volume>401</volume>
          (
          <issue>6755</issue>
          ):
          <fpage>788</fpage>
          -
          <lpage>91</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Tao</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chris</given-names>
            <surname>Ding</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Michael I</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <article-title>Solving Consensus and Semi-supervised Clustering Problems Using Nonnegative Matrix Factorization</article-title>
          .
          <source>In Seventh IEEE International Conference on Data Mining (ICDM</source>
          <year>2007</year>
          ), volume
          <volume>2</volume>
          , pages
          <fpage>577</fpage>
          -
          <lpage>582</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Greg Corrado, Kai Chen, and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          .
          <source>Proceedings of the International Conference on Learning Representations (ICLR</source>
          <year>2013</year>
          ), pages
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Radim</given-names>
            <surname>Rehurek</surname>
          </string-name>
          .
          <article-title>gensim 1.0.0rc1</article-title>
          . Python Package Index.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Strehl</surname>
          </string-name>
          and
          <string-name>
            <given-names>Joydeep</given-names>
            <surname>Ghosh</surname>
          </string-name>
          .
          <article-title>Cluster Ensembles - A Knowledge Reuse Framework for Combining Multiple Partitions</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>3</volume>
          :
          <fpage>583</fpage>
          -
          <lpage>617</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>