Interactive Visualization for Topic Model Curation

Guoray Cai, Penn State University, University Park, PA, USA (cai@ist.psu.edu)
Feng Sun, Penn State University, University Park, PA, USA (fzs122@psu.edu)
Yongzhong Sha, Lanzhou University, Lanzhou, Gansu, China (shayzh@lzu.edu.cn)


ABSTRACT
Understanding the content of a large text corpus can be assisted by topic modeling methods, but the discovered topics often do not make clear sense to human analysts. Interactive topic modeling addresses such problems by allowing a human to steer the topic model curation process (generate, interpret, diagnose, and refine). However, humans have limited ability to work with the artifacts of computational topic models since they are difficult to interpret and harvest. This paper explores the nature of such challenges and provides a visual analytic solution in the context of supporting political scientists in understanding the thematic content of online petition data. We use interactive topic modeling of the White House online petition data as a lens to bring up key points of discussion and to highlight the unsolved problems as well as the potential utility of visual analytics methods.

ACM Classification Keywords
H.5.2 Information Interfaces and Presentation: User Interfaces: visual analytics; H.4.2 Information Systems: visual analytic systems

Author Keywords
Topic models; Information visualization; visual analytics

©2018. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. ESIDA '18, March 11, 2018, Tokyo, Japan.

INTRODUCTION
Topic modeling has been advanced as a solution to the challenge of making sense of large corpora of textual data. With the help of machines, valuable themes buried in a large document collection can emerge and provide a better representation of the documents. The most popular topic modeling techniques, LDA (Latent Dirichlet Allocation) [4] and its variants, such as supervised LDA [26] and supervised anchor LDA [3], have proven useful in many applications [29, 25], including online petition analysis [14]. Topic modeling assists qualitative and quantitative research over user-generated texts coming from blogs or social media. By studying the set of topics learned from social media conversations over some period of time, it may become possible to find out what users are talking about, identify underlying topical trends, and follow them through time. Topic similarities among documents also help to identify the most relevant documents for a specific topic. Ideally, an analyst may be able to draw conclusions from word distributions for topics and use such insight to conduct a more in-depth study on documents with high affinities for specific topics.

Despite such advances, topic models have not been widely adopted by data analysts for the practical task of understanding large corpora [23]. Topic sets discovered by LDA and other algorithms often contain both "good" and "bad" topics as judged by users. Topics can be bad because (1) they confuse two or more themes into one topic; (2) they produce two different topics that are (nearly) duplicates to a human; (3) they are nonsense topics [18]; (4) they contain too many generic words (e.g., "people, like, mr") [5]; (5) they contain disparate or poorly connected words [22]; (6) they are misaligned with human interpretation [9]; (7) they are irrelevant [27]; (8) they miss associations between topics and documents [11]; and (9) multiple topics are highly similar [5]. The presence of poor-quality topics has been cited as the primary obstacle to the acceptance of statistical topic models outside of the machine learning community [22]. The root of these problems lies in the fact that the objective function that topic models optimize does not always correlate well with human judgments of topic quality [7]. Due to these problems, the use of topic models to analyze domain-specific texts often requires manual validation of the latent topics to ensure that they are meaningful [16].

To address the above issues and make topic models usable by analysts who are not machine learning experts, a variety of human-in-the-loop methods have been proposed that allow analysts to manipulate and incrementally refine a topic model of a target text corpus [17, 18, 19, 2]. These methods typically involve the use of interactive visualization and direct manipulation of topic models to diagnose poor topics and fix them through operations such as adding or removing words in topics, adjusting the weights of words within topics, splitting generic topics, and merging similar topics [17]. For example, ITM [18] allows users to add, emphasize, and ignore words within topics, while UTOPIAN [8] allows users to adjust the weights of words within topics, merge and split topics, and create new topics. Additionally, iVisClustering [19] lets users manually create or remove topics, merge or split topics, and reassign documents to another topic, with the help of visually exploring topic-document associations in a scatter plot.

While these operations can be supported by direct manipulation and algorithmic extensions, it is more challenging to diagnose the quality concerns of machine-discovered topics and to assess whether a refinement strategy results in topic
improvement. This is where interactive visualization methods are most helpful. Topic Browser [6] uses a tabular visualization technique to assist in assessing term orders within each topic, and Termite [10] focuses on supporting effective evaluation of term distributions associated with LDA topics through visualizations. TopicNets [13] uses a web-based interactive visual interface that enables users to discover topics of increasing granularity through an informed selection of relevant subsets of documents.

While these visualization tools help users to assess and refine static topic models, they fall short of supporting the whole topic curation process. Topic model curation goes beyond human validation of machine-generated topics to include the whole human-directed process of discovering topics that are useful for a specific domain of application. For example, public opinion researchers may be interested in discovering the range of policy preferences expressed in blog-spheres. Crisis managers may be interested in conversations in social media that are especially informative to their decisions on how to allocate resources and dispatch rescue teams. For such applications, the use of topic models is not a one-shot process but a broader process of seeking, assessing, relating, and structuring topics with the help of supervised and unsupervised topic models. A typical topic curation process starts with a vanilla topic model (a purely unsupervised probabilistic model such as LDA) and lets users conduct full diagnostics to recognize good and bad topics. Good topics are collected and kept in a "bag", while bad topics are improved or removed. For the set of bad topics, users may explore multiple ways to adjust the topic model (merging/splitting topics, adding/removing words from a topic, modifying orders or weights of words in a topic). Depending on the consequences of the imposed correlations and constraints, a new round of modeling and refinement can be initiated to explore the topic space of the document collection either in breadth or in depth.

Towards supporting topic curation, this paper focuses on understanding the specific challenges of topic curation in the context of analyzing online petition data. We gained insight by actually practicing interactive topic modeling on the petition data we collected from the White House online petition website "We the People". This data set is considered a unique source for understanding citizens' policy concerns and preferences [15]. The insight gained from this practice is used to inform the design of a visual analytic system that supports topic model diagnostics, refinement, and evaluation. We reflect on the use of visual analytic methods to enable users to interactively curate topic models.

INTERACTIVE TOPIC MODELING OF PETITION DATA
Electronic petitioning (e-petitioning) is becoming a prevalent form of political action for enabling direct democratic engagement [20]. The data used for this study comes from the online petitioning platform "We the People", hosted by the White House. It contains 5,177 petitions accumulated over the course of six years (2011-2016). We further selected the 4,095 petitions that are in English. Each petition has four fields: (1) a petition ID, (2) a title, (3) a description, and (4) category tags.

As topic models treat documents as "bags of words", the first step of preparation before model training is tokenization, which splits each petition into a set of words. As words may have various forms, lemmatization is then applied to transform them into a common base form. Compared with the stemming technique that shares a similar goal, lemmatization takes advantage of vocabulary analysis and thus can produce the dictionary form of words that users can interpret. Bigrams are also used here for performance purposes [32]. Finally, stopwords are removed from the texts, as well as the overly common terms that appear most frequently (top 50), since such terms carry little discriminative power. The resulting corpus contains 11,189 unique terms.
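A minimal sketch of such a preprocessing pipeline is shown below. The paper does not publish its implementation, so the choice of spaCy for lemmatization, gensim's Phrases for bigram detection, and all function and parameter names are assumptions about one plausible realization of the steps just described.

    # Sketch: tokenize, lemmatize, add bigrams, and drop stopwords and the
    # most frequent terms before topic modeling (assumed libraries: spaCy, gensim).
    from collections import Counter
    import spacy
    from gensim.models.phrases import Phrases

    def preprocess(petitions, top_k_common=50):
        nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
        # Tokenize and lemmatize each petition, dropping stopwords and punctuation.
        docs = [[tok.lemma_.lower() for tok in nlp(text)
                 if tok.is_alpha and not tok.is_stop]
                for text in petitions]

        # Detect frequent bigrams and merge them into single tokens
        # (e.g., "law enforcement" becomes "law_enforcement").
        bigram = Phrases(docs, min_count=10, threshold=10.0)
        docs = [bigram[doc] for doc in docs]

        # Remove the top-k most frequent terms, which carry little discriminative power.
        counts = Counter(tok for doc in docs for tok in doc)
        too_common = {w for w, _ in counts.most_common(top_k_common)}
        return [[tok for tok in doc if tok not in too_common] for doc in docs]

The thresholds (min_count, threshold, top_k_common) are illustrative; the paper only states that the top 50 most frequent terms were removed.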
System Design
Figure 1 shows the user interface for interacting with petition documents and topic words. The system has two functional areas. The lower part is a topic-word visualization that supports direct manipulation of word-to-topic correlations.

Figure 1: User interface for interactive topic modeling during exploration of petitions

The upper part is designed for exploring topic quality from the perspective of how the petitions (documents) are clustered in the space defined by the topics. The point cloud map provides a visual overview of the petition space in which topically similar petitions are positioned adjacently. It is generated using t-SNE (t-distributed stochastic neighbor embedding) [8] to reduce the high-dimensional petition data to a 2-D vector space that humans can perceive easily. Because t-SNE is nondeterministic, repeated runs usually map a high-dimensional data point to different 2-D coordinates; however, the relationships between the data points remain almost the same. An example of visualized petitions is shown in Figure 1.
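The point cloud map just described can be approximated in a few lines. The following is a sketch under assumptions (scikit-learn's t-SNE, matplotlib, and a document-topic weight matrix here called doc_topic), not the system's actual rendering code.

    # Minimal sketch: project petitions from topic space to a 2-D point cloud.
    # Assumes `doc_topic` is an (n_petitions x n_topics) matrix of topic weights.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    def plot_petition_map(doc_topic, random_state=42):
        # t-SNE is nondeterministic unless the random seed is fixed.
        coords = TSNE(n_components=2, random_state=random_state).fit_transform(doc_topic)
        # Color each petition by its most salient topic.
        dominant_topic = np.argmax(doc_topic, axis=1)
        plt.scatter(coords[:, 0], coords[:, 1], c=dominant_topic, cmap="tab20", s=8)
        plt.title("Petition overview map (t-SNE of document-topic weights)")
        plt.show()

Fixing the random seed keeps successive renderings comparable, which matters when users inspect the map before and after a model revision.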
Each petition is assigned to one cluster based on its most salient topic and is color-coded accordingly. Users can apply filters and highlighters on topics to manipulate the petition overview map. Highlighting enables users to review petitions in context, while filtering allows users to focus on the petitions of interest. When hovering over a document point, a pop-up window displays the title, body, and topics of the document. In the meantime, the topic distribution (in terms of weights) of the selected document is visualized as a bar chart. By clicking a topic label, its topic-word distribution is visualized as color-coded bars.

At the back end of the system, we choose Correlation Explanation (CorEx) [30] as the topic modeling algorithm for interactive topic curation. Built on the theory of Correlation Explanation [31] in information science, CorEx strives to represent the latent information in a document collection in a way that maximizes the informativeness of the data. Due to its fast training time and its support for anchoring, CorEx can be easily tailored to incorporate human-imposed correlations or constraints for semi-supervised topic modeling, making it an ideal choice for supporting interactive topic modeling [12]. Using CorEx, users can anchor multiple words to one topic, anchor one word to multiple topics, or use any other creative combination of anchors in order to discover topics that do not naturally emerge. By leveraging CorEx's capability of topic seeding through anchor words in our system, human analysts can incorporate their knowledge and insights into the process of refining topic models.

TOPIC CURATION
Using our system for topic curation involves three phases of activity, with a number of iterations.

Topic Discovery
The first step is to use the topic modeling algorithm with random seeds to run an unsupervised discovery of topics. The user must specify how many topics are to be produced, with the understanding that different numbers of topics can be chosen to analyze the petition data at different levels of granularity and are likely to generate different sets of topics [14, 24]. After initial unsupervised topic modeling with CorEx, users assess the topic model and conduct diagnostic analysis on topics. In particular, users inspect topics, both individually and as a group, to evaluate their quality by examining topic words. Topics that users recognize as good ones should be kept. For the bad ones, users can file complaints and come up with one or more strategies to address them.
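A minimal sketch of this unsupervised discovery step is given below. It assumes the open-source corextopic package and scikit-learn's CountVectorizer; the function name, the seed parameter, and the binary bag-of-words choice are assumptions about one plausible implementation rather than the authors' published code.

    # Sketch: unsupervised topic discovery over the preprocessed petition corpus.
    from sklearn.feature_extraction.text import CountVectorizer
    from corextopic import corextopic as ct

    def discover_topics(petition_texts, n_topics=20, n_words=15):
        # Binary bag-of-words representation of the petitions.
        vectorizer = CountVectorizer(binary=True)
        doc_word = vectorizer.fit_transform(petition_texts)
        vocab = list(vectorizer.get_feature_names_out())

        # Train a vanilla (unanchored) CorEx topic model; fixing the seed is
        # assumed to make repeated runs reproducible.
        model = ct.Corex(n_hidden=n_topics, seed=42)
        model.fit(doc_word, words=vocab)

        # Top words per topic, for the diagnostic inspection described above.
        return model, model.get_topics(n_words=n_words)

The returned topic-word lists are what the analyst examines to decide which topics to keep and which to repair.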
Topic Refinement
Topic refinement is achieved through manipulating topic-word representations in the bottom part of Figure 1. We included an anchoring mechanism coupled with the CorEx models. It allows users to anchor one or more words to one topic, anchor one word to multiple topics, and anchor one or more words to some topics but not others. With this anchoring mechanism, topic revision interactions are supported by operations such as splitting a topic, merging by joining, and merging by absorbing (following [17]). More complicated operations can be achieved through a combination of the above basic operations. For example, investigating more fine-grained topics can be accomplished by splitting topics iteratively.

Split a topic
If a topic is considered "bad" based on the observation that it confuses two or more meaningful topics into one, a solution is to split the topic into two or more topics. To do so, the user checks the topic he/she intends to split and then clicks the "split" button. Before applying the operation, the user is given the option to configure the number of resulting topics. Once confirmed, the underlying model training re-runs under the new constraint that only the selected topic is decomposed while the others remain the same in terms of word allocation. Updated results are then generated and visualized.

In the backend, splitting a topic into n topics involves training a word2vec model to produce word embeddings [21]. The resulting model is used to calculate the semantic similarity between words. After that, a similarity matrix of the words within this topic is produced, and spectral clustering is applied to the matrix to categorize the words into n clusters. The n clusters of words are encoded into the previous model as anchor words and produce n new topics to replace the original one. A minimal sketch of this backend procedure is given after the merge operations below.

Merge topics by joining
If several topics are judged to have something in common in their semantic meaning, they can be merged into one topic. This is accomplished by selecting these topics and then clicking the "apply" button. The system automatically applies the constraint that words assigned to the topics being merged have to appear in the resulting topic; the underlying model is updated and, accordingly, the visualization is re-rendered. In the backend, the words that appeared in the two topics are now anchored under the same one.

Merge topics by absorption
If one or more words in a topic are considered intruders and fit better in a different topic, the user can re-allocate topic words through drag-and-drop operations. Specifically, a user can select a word that is considered incorrectly allocated and move it to a more related topic. After the reallocation of words is done, the petition view updates to reflect the modification. In the back end, merging topics by absorption is basically a reallocation process in which selected words in one topic are anchored to the other topic and a new model is trained. The rest of the topic-word assignments remain the same through anchoring as well.
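The sketch below is a rough illustration of the anchoring backend behind these operations. It is not the authors' published code; the use of gensim's Word2Vec, scikit-learn's SpectralClustering, the corextopic anchor interface, and all function and variable names are assumptions about one plausible implementation.

    # Sketch: split one topic into n sub-topics via word embeddings plus spectral
    # clustering, then re-train an anchored CorEx model (assumed packages:
    # gensim, scikit-learn, corextopic).
    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.cluster import SpectralClustering
    from corextopic import corextopic as ct

    def split_topic(topic_words, tokenized_docs, doc_word, vocab, other_anchors,
                    n_sub=2, n_topics=20, anchor_strength=3):
        # 1. Train word2vec on the corpus and build a word-word similarity matrix
        #    for the words of the topic being split.
        w2v = Word2Vec(sentences=tokenized_docs, vector_size=100, min_count=1)
        vecs = np.array([w2v.wv[w] for w in topic_words])
        vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
        similarity = np.clip(vecs @ vecs.T, 0, None)  # non-negative affinities

        # 2. Spectral clustering groups the topic's words into n_sub clusters.
        labels = SpectralClustering(n_clusters=n_sub,
                                    affinity="precomputed").fit_predict(similarity)
        new_anchors = [[w for w, l in zip(topic_words, labels) if l == k]
                       for k in range(n_sub)]

        # 3. Re-train CorEx with the word clusters (plus the kept topics' words,
        #    passed in as `other_anchors`) as anchors, so that only the selected
        #    topic is decomposed.
        model = ct.Corex(n_hidden=n_topics + n_sub - 1)
        model.fit(doc_word, words=vocab,
                  anchors=other_anchors + new_anchors,
                  anchor_strength=anchor_strength)
        return model, new_anchors

Merging by joining or by absorption can be expressed with the same final step: the words to be kept together are listed in a single anchor group (and an intruder word is simply moved from one group to another) before re-fitting.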
Evaluating Topics Interactively
Evaluating the quality of the topics in the current model is necessary both for diagnosing good/bad topics and for assessing the impact of topic revisions. Topic quality is assessed along two dimensions: (1) are the words in a topic coherent and contributing to some collective meaning? (2) are the topics aligned with the information needs of the intended application? As such, we designed the interface in Figure 1, which visualizes topically represented petitions, to support the following functions for evaluating the quality of topics:

Inspecting the quality of every single topic. Users can evaluate a topic by looking at the coherence of its component words and their relative weights (see the bars next to words). Topics are also color-coded in the visualization window. Clicking on the legend of a topic highlights all the petitions with sufficient weight on that topic (while other petitions are dimmed). These functions allow users to explore how petitions of the same topic cluster. A good topic tends to create a cluster of petitions that is not heavily mixed with petitions of other topics.

Comparing topics. Users can evaluate one or more topics together by observing semantic relations to spatially close or remote topics, and by looking at the spatial relationships (overlapping clusters, adjacent clusters, non-intersecting clusters) between the petitions of two topics. Applying filters to leave fewer topics in the figure helps reduce visual clutter.

TOPIC MODEL CURATION SCENARIO
We practiced the topic curation process on the online petition dataset to experience how well our system supports topic diagnostics and refinement. First, we ran CorEx topic modeling and generated 20 topics. A fixed random seed was used to make sure the same results can be reproduced. Table 1 shows 5 samples out of the 20 topics produced. The initial result from CorEx topic modeling reveals interesting topic clusters in the data set. In the provided samples, topic 0 mainly talks about "disease", topic 4 generally discusses "economy", topic 5 describes "election", and topic 16 represents "law enforcement". The bottom part of the table shows the results after applying certain topic revision operations.

Table 1: Selected topics (#topics = 20); topic words (top 15) per topic id

0:    disease, patient, cancer, treatment, doctor, cure, disorder, medication, pain, awareness, symptom, illness, medicine, diagnosis, disability
4:    health, economy, tax, cost, benefit, increase, company, money, market, pay, healthcare, fund, research, dollar, debt
5:    election, investigation, vote, voter, candidate, hillary_clinton, voting, campaign, department_justice, fbi, ballot, office, corruption, violation, democrat
6:    internet, consumer, energy, information, technology, provider, service, device, car, access, fuel, safety, standard, road, vehicle
16:   officer, police, law_enforcement, evidence, police_officer, county, aircraft, judge, governor_chris, killing, conviction, department, scene, cat, chief
0':   health, treatment, disease, condition, patient, doctor, cancer, awareness, pain, illness, medicine, disability, disorder, cure, medication
4':   money, benefit, company, pay, economy, business, cost, fund, tax, industry, dollar, budget, study, market, increase
6.1:  service, information, com, access, standard, technology, internet, consumer, provider, content, http, privacy, https_facebook, internet_service, customer
6.2:  safety, vehicle, energy, car, device, accident, fuel, road, aviation, forest, traffic, emission, faa, air, carbon
5+16: investigation, vote, election, officer, police, law_enforcement, campaign, candidate, corruption, voter

Moving Intruder Words
By examining the table above, we find that topic 4 contains a word, "health", that is clearly different from the other words (see Figure 2). We also find that some petitions related to health but having nothing to do with "economy" are assigned to this topic during the petition exploration phase. One example petition is "place mental health as a required course in junior high and middle schools". In order to correct this topic assignment, we performed topic refinement by moving the intruder word "health" from topic 4 to topic 0. The re-generated topic words are shown in Table 1 as topic 0' and topic 4'.

Figure 2: Move topic word "health" from topic 4 to topic 0. (a) Original topic words; (b) New topic words

In order to assess whether such a refinement strategy has led to a better outcome, we rendered the petition clusters in relation to the new topic definition; the result is shown in Figure 3. From this figure, we can clearly see how topic groups are isolated and separated. Compared with Figure 1, outliers are nicely scattered apart and small clusters of outliers disappear. This result suggests that changing the topic model by moving "health" from topic 4 to topic 0 was a good move. The claim is further confirmed by a calculated metric of topic coherence based on word context vectors [1]. This metric has been demonstrated to have the highest correlation with the interpretability of topics [28]. The topic coherence of topic 4 increased from 0.453 to 0.555 after removing the word intruder, and the overall topic coherence increased from 0.431 to 0.443.
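Coherence numbers like those above can be produced with an off-the-shelf implementation. Below is a minimal sketch using gensim's CoherenceModel; the paper cites context-vector-based coherence [1, 28], and the "c_v" measure used here is an assumption about a comparable, commonly used variant rather than the authors' exact metric.

    # Sketch: score topic quality with a word-context-vector coherence measure.
    # Assumes `topics` is a list of word lists (e.g., top-15 words per topic) and
    # `tokenized_docs` is the preprocessed petition corpus.
    from gensim.corpora import Dictionary
    from gensim.models.coherencemodel import CoherenceModel

    def topic_coherence(topics, tokenized_docs):
        dictionary = Dictionary(tokenized_docs)
        cm = CoherenceModel(topics=topics, texts=tokenized_docs,
                            dictionary=dictionary, coherence="c_v")
        # Per-topic scores help spot the weakest topic; the mean summarizes the model.
        return cm.get_coherence_per_topic(), cm.get_coherence()

Comparing the per-topic scores before and after a revision gives the kind of before/after numbers quoted in this section.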
Figure 3: A comparison of visualized petitions before and after moving words between topic 0 and topic 4. (a) Petitions of topic 0 and topic 4; (b) Petitions of topic 0' and topic 4'

Split a Multi-theme Topic
Observations show that the distribution of petitions of topic 6 is scattered in the reduced-dimensional space: there are several small clusters of petitions. By sampling some of them for a detailed inspection of petition contents, we found that some semantically unrelated petitions are placed adjacently in the visualization, e.g., "Prevent the FCC from ruining the Internet" and "Put a fee on carbon-based fuels and return revenue to households"; the former is about the Internet and information technology, while the latter is related to energy. This finding can also be validated by examining the topic words of topic 6: "internet", "information", and "technology" are clearly incoherent with "energy", "fuel", and "safety". Therefore, we believe topic 6 is of low quality since it conflates several sub-topics and needs to be split.

To address the quality concerns of topic 6, we split topic 6 into two topics (by clicking on topic 6 and choosing the "Split" button).

Figure 4: Split topic 6 into two topics, topic 6 (6.1) and topic 7 (6.2). (a) Original topic words; (b) New topic words

Figure 5: A comparison of visualized petitions before and after splitting topic 6. (a) Petitions of topic 6; (b) Petitions of topic 6 (6.1) and topic 7 (6.2)

The modified version of the topic model is shown in Table 1 as 6.1 and 6.2, and in Figure 4 as topics 6 and 7. The figure shows that the weights of the first several topic words have increased, indicating that these words better represent the topics. It is also apparent from Figure 5 that the distributions of petitions for topic 6 and topic 7 become more focused, indicating that the petition documents within the same clusters are more topically homogeneous. After the new topic model is applied, the example petitions above are allocated to the correct topics, resulting in an increase of the overall coherence value from 0.431 to 0.441. Specifically, the original topic 6 has an individual coherence score of 0.341, while the scores of the newly produced topic 6 and topic 7 are 0.594 and 0.419 respectively.

Merge Semantically Similar Topics
If the number of topics is set to a large value, the CorEx algorithm will generate topics at a finer granularity. This can create situations where words that contribute to a single theme end up in separate topics. Under such circumstances, a merging operation is necessary to make sure that petitions of similar topics are grouped together. To demonstrate this situation, we trained another topic model with the number of topics set to 50 (relatively large); the topic words are shown in Figure 6a. By looking at the topic words, topic 1 and topic 7 both describe "healthcare" but appear as different topics.

The topic words after merging these two topics are shown in Figure 6b. Petitions of these two topics are now grouped into one cluster as well. Subsequently, these petitions can be processed and analyzed as a whole, e.g., summarized and forwarded to the Department of Health and Human Services.
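Such a merge can be realized through the same anchoring mechanism used throughout the system. The sketch below shows one plausible way to do this with the open-source corextopic package; the function and variable names, and the choice to re-fit the full model with one fewer latent topic, are assumptions rather than the authors' published implementation.

    # Sketch: merge two CorEx topics by anchoring their combined top words to a
    # single topic and re-fitting the model with one fewer latent topic.
    from corextopic import corextopic as ct

    def merge_topics(doc_word, vocab, words_a, words_b, n_topics, anchor_strength=3):
        # Union of the two topics' top words, order-preserving.
        merged_anchor = list(dict.fromkeys(words_a + words_b))
        model = ct.Corex(n_hidden=n_topics - 1)
        model.fit(doc_word, words=vocab,
                  anchors=[merged_anchor], anchor_strength=anchor_strength)
        return model

In the scenario above, the anchor would be the union of the top words of the two "healthcare" topics; the remaining topic assignments are left to the model.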
Figure 6: Merging topic 0 and topic 9. (a) Original topic words; (b) New topic words

Figure 7: A comparison of visualized petitions before and after merging topic 9. (a) Petitions of topic 0 and topic 9; (b) Petitions of topic 9 (0 + 9)

Merging topics is also useful when a small number of topics is used. Referring to the aforementioned topic model of 20 topics, we found that topic 5 contains the words "investigation" and "justice" that may be related to topic 16. Therefore, we performed a merge by joining on these two topics, which leads to a more general topic denoted as 5+16. Although the coherence value after merging the two topics remains almost the same, it is noteworthy that a new word, "corruption", is prioritized: it could serve as a bridge connecting the two topics represented as "election" and "law enforcement" (e.g., a petition titled "Arrest and prosecute officials who tried to suppress the vote in the 2012 election"), showing that merging topics has the potential to reveal latent relationships among them.

Topics that are difficult to interpret may still exist even after several iterations of topic refinement. On the other hand, some petitions are complicated in that they have multiple equally important aspects, and even people have difficulty identifying the most representative one. For those documents that are related to "bad" topics and cannot be fixed in this round of analysis, the system can collect them into a subset of data to be fed into the next round of analysis.

DISCUSSION
Our work on analyzing the topic structures of online petitions is still work in progress, but we have gained several lessons about interacting with topic modeling tools. First, users have to deal with tremendous uncertainty when deciding what is the proper strategy for tuning the topic model. Visualizing the impact of multiple strategies, and providing interaction capabilities to assess the quality of topics and to compare the document clusters before and after model tuning, will be critically important.

Another finding from this exercise is that there is a need to construct a topic hierarchy from unsupervised topic models in order to align with the way political scientists perceive the world of petition data. However, the topics discovered by the CorEx algorithm have a flat structure, and they tend to be biased towards those topic branches that have more detailed data. We will continue to explore our visual analytic approach for incremental refinement of topic structures and demonstrate how such an approach can be used to uncover a topic hierarchy of petitions that best reflects the human conception of the domain. Further work is required to evaluate the usability and effectiveness of this method. While we used dimension-reduction-based visualization, other petition exploration and analysis approaches should be investigated as well.

ACKNOWLEDGEMENT
The authors would like to acknowledge funding support from the National Science Foundation under award #IIS-1211059, and from a grant funded by the Chinese Natural Science Foundation under award 71373108.

REFERENCES
1. Nikolaos Aletras and Mark Stevenson. 2013. Evaluating Topic Coherence Using Distributional Semantics. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013). 13–22.

2. David Andrzejewski, Xiaojin Zhu, and Mark Craven. 2009. Incorporating Domain Knowledge into Topic Modeling via Dirichlet Forest Priors. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09). ACM, New York, NY, USA, 25–32.

3. Sanjeev Arora, Rong Ge, Yonatan Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. 2013. A practical algorithm for topic modeling with provable guarantees. In International Conference on Machine Learning. 280–288.
4. David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research 3, Jan (2003), 993–1022.

5. Jordan Boyd-Graber, David Mimno, and David Newman. 2014. Care and feeding of topic models: Problems, diagnostics, and improvements. In Handbook of Mixed Membership Models and Its Applications. Chapman & Hall, Chapter 12, 225–254.

6. Allison J. B. Chaney and David M. Blei. 2012. Visualizing Topic Models. In Proceedings of the Sixth International AAAI Conference on Weblogs and Social Media. 419–422.

7. J. Chang, J. Boyd-Graber, C. Wang, S. Gerrish, and D. M. Blei. 2009. Reading tea leaves: How humans interpret topic models. In Proceedings of Advances in Neural Information Processing Systems. 288–296.

8. Jaegul Choo, Changhyun Lee, Chandan K Reddy, and Haesun Park. 2013. UTOPIAN: User-Driven Topic Modeling Based on Interactive Nonnegative Matrix Factorization. IEEE Transactions on Visualization and Computer Graphics 19, 12 (2013), 1992–2001.

9. Jason Chuang, Sonal Gupta, Christopher D Manning, and Jeffrey Heer. 2013. Topic Model Diagnostics: Assessing Domain Relevance via Topical Alignment. In Proceedings of the 30th International Conference on Machine Learning. 612–620.

10. Jason Chuang, Christopher D. Manning, and Jeffrey Heer. 2012. Termite: Visualization Techniques for Assessing Textual Topic Models. In Proceedings of the International Working Conference on Advanced Visual Interfaces (AVI '12). 74.

11. Hal Daumé. 2009. Markov random topic fields. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers. 293–296.

12. Ryan J Gallagher, Kyle Reing, David Kale, and Greg Ver Steeg. 2016. Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge. arXiv preprint arXiv:1611.10277 (2016).

13. Brynjar Gretarsson, John O'Donovan, Svetlin Bostandjiev, Tobias Höllerer, Arthur Asuncion, David Newman, and Padhraic Smyth. 2012. TopicNets: Visual Analysis of Large Text Corpora with Topic Modeling. ACM Trans. Intell. Syst. Technol. 3, 2, Article 23 (Feb. 2012), 26 pages.

14. Loni Hagen, Ozlem Uzuner, Christopher Kotfila, Teresa M. Harrison, and Dan Lamanna. 2015. Understanding Citizens' Direct Policy Suggestions to the Federal Government: A Natural Language Processing and Topic Modeling Approach. In 2015 48th Hawaii International Conference on System Sciences. IEEE, 2134–2143.

15. Scott A Hale, Helen Margetts, and Taha Yasseri. 2013. Petition growth and success rates on the UK No. 10 Downing Street website. In Proceedings of the 5th Annual ACM Web Science Conference. ACM, 132–138.

16. David Hall, Daniel Jurafsky, and Christopher D. Manning. 2008. Studying the history of ideas using topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08). 363–371.

17. Enamul Hoque and Giuseppe Carenini. 2016. Interactive Topic Modeling for Exploring Asynchronous Online Conversations. ACM Transactions on Interactive Intelligent Systems 6, 1 (Feb. 2016), 1–24.

18. Yuening Hu, Jordan Boyd-Graber, Brianna Satinoff, and Alison Smith. 2014. Interactive topic modeling. Machine Learning 95, 3 (2014), 423–469.

19. Hanseung Lee, Jaeyeon Kihm, Jaegul Choo, John Stasko, and Haesun Park. 2012. iVisClustering: An Interactive Visual Document Clustering via Topic Modeling. Computer Graphics Forum 31, 3pt3 (2012), 1155–1164.

20. Ralf Lindner and Ulrich Riehm. 2009. Electronic petitions and institutional modernization. International parliamentary e-petition systems in comparative perspective. JeDEM - eJournal of eDemocracy and Open Government 1, 1 (2009), 1–11.

21. Tomas Mikolov, Quoc V Le, and Ilya Sutskever. 2013. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168 (2013).

22. David Mimno, Hanna M. Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. 2011. Optimizing semantic coherence in topic models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. 262–272.

23. Sergey I. Nikolenko, Sergei Koltcov, and Olessia Koltsova. 2017. Topic modelling for qualitative studies. Journal of Information Science 43, 1 (2017), 88–102.

24. Paul Hitlin. 2016. 'We the People': Five Years of Online Petitions. Technical Report. Pew Research Center.

25. Daniel Ramage, Susan Dumais, and Dan Liebling. 2010. Characterizing Microblogs with Topic Models. In Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media. 1–8.

26. Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D Manning. 2009a. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1. Association for Computational Linguistics, 248–256.

27. Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D Manning. 2009b. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. 248–256.

28. Michael Röder, Andreas Both, and Alexander Hinneburg. 2015. Exploring the Space of Topic Coherence Measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (WSDM '15). ACM, New York, NY, USA, 399–408.

29. Amin Sorkhei, Kalle Ilves, and Dorota Glowacka. 2017. Exploring Scientific Literature Search Through Topic Models. In Proceedings of the 2017 ACM Workshop on Exploratory Search and Interactive Data Analytics (ESIDA '17). ACM, 65–68.

30. Greg Ver Steeg and Aram Galstyan. 2014. Discovering Structure in High-Dimensional Data Through Correlation Explanation. In Advances in Neural Information Processing Systems (NIPS '14).

31. Greg Ver Steeg and Aram Galstyan. 2014. Discovering structure in high-dimensional data through correlation explanation. In Advances in Neural Information Processing Systems. 577–585.

32. Sida Wang and Christopher D Manning. 2012. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2. Association for Computational Linguistics, 90–94.