                  Finding Hierarchy of Topics from Twitter Data

                         Nghia Duong-Trung, Nicolas Schilling, and Lars Schmidt-Thieme

                               Information Systems and Machine Learning Lab (ISMLL)
                                    Universitätsplatz 1, 31141 Hildesheim, Germany
                           {duongn,schilling,schmidt-thieme}@ismll.uni-hildesheim.de
                                        http://www.ismll.uni-hildesheim.de



Abstract. Topic modeling of text collections is rapidly gaining importance for a wide variety of applications, including information retrieval and automatic multimedia indexing. Our motivation is to exploit hierarchical topic selection via nonnegative matrix factorization to capture the underlying content of text posted on Twitter. This paper explores the use of an effective framework to automatically discover hidden topics and their sub-topics. As input, the framework uses textual data; the output is the discovered structure of topics. We introduce a conceptual topic modeling approach based on the idea of stability analysis to detect a hierarchy of topics in a given text source. In this process, we apply stability measurement in conjunction with nonnegative matrix factorization and WordNet to excavate hidden topics via scores of conceptual similarity. To demonstrate its effectiveness and generality, we apply the approach to a large-scale Twitter dataset and investigate the content topics. We also address the problem that several state-of-the-art topic modeling approaches are unable to handle such a large dataset.

Keywords: Unsupervised Learning, Semantics in Text Mining, Conceptual Stability, Hierarchy of Topics


                 1    Introduction

Nonnegative matrix factorization (NMF) with nonnegativity constraints has been considered an efficient representation and an emerging technique for text mining and document clustering [22,2,17,23,9,11]. For any desired low rank K, the NMF algorithm groups the data into clusters. The key issue is whether a given low rank K helps to decompose the data into appropriately separated clusters. Therefore, the problem we study in this paper is how to effectively and efficiently discover the most appropriate structure of topics for a given text corpus by exploiting semantic meaning and conceptual stability. In general, the stability of a clustering model refers to its ability to consistently replicate similar solutions on data randomly generated from the same source. In practice, this involves repeatedly re-sampling the data, applying a topic selection model, and evaluating the results with a stability metric that measures the level of agreement between the resulting clusterings.




    We start from previous work on the idea of random sub-sampling and stability analysis via consensus clustering to discover the number of clusters that best describes the data [16,4,13]. The basic assumption of stability in the context of consensus clustering is very intuitive: if we perturb the observed data into different random variations and they produce the same cluster composition, or consensus, without radical differences, we can confidently consider that these clusters represent real structure. Consensus clustering captures exactly this procedure. Further work by [4] improved the consensus clustering technique by adding a quantitative evaluation of the robustness of the decomposition. They adopted a measure based on the cophenetic correlation coefficient, which indicates the dispersion of the consensus matrix. The coefficient is calculated as the Pearson correlation of two distance matrices: the consensus matrix, which captures the distance between data samples, and the average connectivity matrix over many clustering runs. Subsequently, [12,13] formulated the idea of the consensus matrix in the latent space learned by NMF.
    However, the computation of the consensus matrix, an n × n matrix where n is the number of tweets/documents, is very costly, as it requires a large amount of RAM. For instance, applying the previous method to the Twitter dataset that we describe later in this paper would require 1400GB of RAM to store the consensus matrix during the model's computation. Hence, the method provided by [12,13] is impractical or even impossible for large datasets. To overcome the drawbacks of constructing the consensus matrix, we propose a topic selection approach, called conceptual stability analysis, that integrates smoothly with NMF and can be applied to large datasets effectively.
    Moreover, we also evaluate several state-of-the-art topic modeling approaches based on Latent Dirichlet Allocation (LDA) [3]. The first baseline is the topic selection method implemented by [1]; the second baseline is proposed by [6]. We implement the baselines using [10,18]. However, these methods threw exceptions during computation because of the size of the dataset; each approach consumed up to 65GB of RAM before an exception occurred.
    With these limitations in mind, we introduce an unsupervised topic selection method that enhances the accuracy and effectiveness of NMF-based models in the context of document clustering and topic modeling. We show that our proposed method works effectively on a large dataset within acceptable computing resources in terms of required RAM and computation time.


2     Theoretical Aspects and Proposed Framework

2.1   Nonnegative Matrix Factorization

Consider a dataset X ∈ R^{n×m} containing a set of n documents, where each document is described by m features. The document features are mapped from a dictionary that comprises all words/terms/tokens in the dataset. Each positive entry X_{ij} is either a raw term frequency or a term frequency - inverse document frequency (TF-IDF) score. By r and τ we denote the sampling rate and the number of subsets generated from X, respectively. Each subset X_τ ∈ R^{n'×m} is then a sample without replacement from X.
    Given a desired number of topics k, the NMF algorithm iteratively computes an approximation
\[
  X \approx WH, \qquad (1)
\]
where W ∈ R^{n×k} and H ∈ R^{k×m} are nonnegative matrices. The conventional technique to approximate W and H is to minimize the difference between X and WH such that
\[
  \min_{W \ge 0,\, H \ge 0} f(W,H) = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{m}\bigl(X_{ij} - (WH)_{ij}\bigr)^2 + \phi(W) + \theta(H), \qquad (2)
\]

where φ(·) and θ(·) are regularization terms that are set as follows:
\[
  \phi(W) = \alpha \lVert W \rVert_F^2 \quad \text{and} \quad \theta(H) = \beta \sum_{i=1}^{m} \lVert H(:,i) \rVert_1^2, \qquad (3)
\]

where H(:, i) indicates the i-th column of H. The L1 norm term of θ(H) promotes
sparsity on the rows of H while the Frobenius norm term of φ(W ) prevents W
from growing too large. Scalar parameters α and β are used to control the
strength of regularization. The matrices W and H are found by minimizing
Equation (2) via estimating W and H in an alternating fashion using projected
gradients or coordinate descent [5].
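
To make the factorization step concrete, the following is a minimal sketch in Python using scikit-learn; the helper name factorize and its parameters are our own illustrative choices, and scikit-learn's alpha_W/alpha_H/l1_ratio regularization is only a close analogue of Equation (3), not an exact reimplementation of it.

```python
# Illustrative sketch: build X (TF-IDF) and factorize it with NMF, analogous to
# Equations (1)-(3). Assumes scikit-learn >= 1.0 (alpha_W/alpha_H parameters).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

def factorize(docs, k, alpha=0.1, beta=0.1):
    """Return (W, H, vocabulary) for a list of raw text documents."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)           # X in R^{n x m}, nonnegative TF-IDF scores
    model = NMF(n_components=k, init="nndsvd",   # k topics, SVD-based initialization
                alpha_W=alpha, alpha_H=beta,     # regularization strengths (cf. alpha, beta)
                l1_ratio=0.5,                    # mix of L1 and L2 penalties
                max_iter=400, random_state=0)
    W = model.fit_transform(X)                   # W in R^{n x k}: document-topic weights
    H = model.components_                        # H in R^{k x m}: topic-term weights
    return W, H, vectorizer.get_feature_names_out()
```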


                     Table 1. Summary of topics discovered.

            N  Topic                           Number of    Number of   Share
                                               sub-topics   documents    (%)
            1  Student Life and Relationship        5          76,415   17.78
            2  Information and Networking           8          17,649    4.11
            3  Business and Current Affairs         2          72,534   16.88
            4  Routine Activities                   4          21,960    5.11
            5  Leisure and Entertainment            2          81,469   18.96
            6  Sport and Games                      3          31,812    7.40
            7  Pessimism and Negativity             4          78,329   18.23
            8  Wishes and Gratitude                 5          36,618    8.52
            9  Transport and Travel                 2          12,887    3.01
               Total                               35         429,673  100.00




2.2   Conceptual Stability Computation
We now discuss our approach to computing stability based on the WordNet hypernym hierarchy [8,15]. Given two tokens c_p and c_q,
                  Table 2. Labels of all 9 topics and 35 sub-topics.

1 Student Life and Relationship: 1.1 Friends and Relationship, 1.2 Study Life, 1.3 Worry and Confusion, 1.4 Conversations, 1.5 Social Media and Connections
2 Information and Networking: 2.1 News, 2.2 Life, 2.3 Mood and Reflections, 2.4 Greetings, 2.5 Current Regional Events, 2.6 Mates, 2.7 Informal Chat, 2.8 Religion
3 Business and Current Affairs: 3.1 Working-day Activities, 3.2 Events and Socializing
4 Routine Activities: 4.1 Feelings, 4.2 Sleep, 4.3 Work and People, 4.4 Social Media
5 Leisure and Entertainment: 5.1 Television and Cinema, 5.2 Reactions
6 Sport and Games: 6.1 Sport Opinions and Discussion, 6.2 American Football, 6.3 TV Sport Programs
7 Pessimism and Negativity: 7.1 Hate and Anger, 7.2 Daily Problems and Complaints, 7.3 School Routines, 7.4 Life and Changes
8 Wishes and Gratitude: 8.1 Friends, 8.2 People, 8.3 Anticipation, 8.4 Thanks and Affection, 8.5 Celebrations
9 Transport and Travel: 9.1 Landmarks, 9.2 Journeys



wup(c_p, c_q) is the Wu-Palmer similarity [21], a scoring method based on how similar the token senses are and where they occur relative to each other in the WordNet hierarchy. The Wu-Palmer similarity is calculated as
\[
  wup(c_p, c_q) = \frac{2d}{d_1 + d_2 + 2d}, \qquad (4)
\]
where d_1 and d_2 are the distances that separate the concepts c_p and c_q from their closest common ancestor and d is the distance that separates the closest common ancestor of c_p and c_q from the root node.
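
As a minimal sketch of how wup(c_p, c_q) could be computed in practice, one can use NLTK's WordNet interface. Taking the best-scoring pair of senses is our own assumption, since the paper does not state how word senses are selected.

```python
# Sketch: Wu-Palmer similarity between two tokens via NLTK's WordNet corpus reader.
# Sense disambiguation is not specified in the paper; here we take the maximum
# similarity over all sense pairs (an assumption).
from nltk.corpus import wordnet as wn

def wup(token_p, token_q):
    scores = [sp.wup_similarity(sq)
              for sp in wn.synsets(token_p)
              for sq in wn.synsets(token_q)]
    scores = [s for s in scores if s is not None]   # wup_similarity may return None
    return max(scores, default=0.0)
```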
    Each row of the low-rank matrix H represents one of the k topics and consists of scores for each term. However, we only consider the top t ≪ m terms, as they contribute most to the semantic meaning of a topic. In practice, the contribution of each token to topic i is represented by the scores in the i-th row of the matrix H generated by NMF. By sorting each row of H, we can obtain the top t terms for each topic. The set of top t tokens for all topics of a given H is denoted by S = {R_1, . . . , R_k}, where R_i ∈ R^t is the i-th topic represented by its top t tokens. Within a topic, we calculate the conceptual stability score as follows:
\[
  sim(R_v) = \frac{2}{t(t-1)} \sum_{i=1}^{t-1} \sum_{j=i+1}^{t} wup(R_{v_i}, R_{v_j}) \qquad (5)
\]

    Similarly, the conceptual stability score between two topics R_u and R_v is calculated in the same fashion:
\[
  sim(R_u, R_v) = \frac{1}{t^2} \sum_{i=1}^{t} \sum_{j=1}^{t} wup(R_{u_i}, R_{v_j}) \qquad (6)
\]
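
A small sketch of Equations (5) and (6), reusing the hypothetical wup helper above and plain Python lists of the top-t tokens per topic:

```python
# Sketch of Eqs. (5) and (6): average pairwise Wu-Palmer similarity within one
# topic and between two topics, given lists of top-t tokens.
from itertools import combinations

def sim_within(tokens):
    pairs = list(combinations(tokens, 2))            # all unordered token pairs
    return 2.0 * sum(wup(a, b) for a, b in pairs) / (len(tokens) * (len(tokens) - 1))

def sim_between(tokens_u, tokens_v):
    t = len(tokens_u)
    return sum(wup(a, b) for a in tokens_u for b in tokens_v) / (t * t)
```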
    Finally, we consider the problem of measuring the conceptual stability between two different K-way topic clusterings S_w and S_l, where each ranked list contains the top t tokens that contribute the most semantic meaning to its topic. The conceptual stability between S_w and S_l is calculated as
\[
  con(S_w, S_l) = \frac{1}{K} \sum_{k=1}^{K} sim\bigl(R_{w_k}, \pi(R_{l_k})\bigr), \qquad (7)
\]

where π(R_{l_k}) denotes the ranked list in S_l matched to the ranked list R_{w_k} by the permutation π. The optimal permutation π is found by solving the minimal-weight bipartite matching problem with the Hungarian method [14].
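
Equation (7) can be sketched with SciPy's implementation of the Hungarian algorithm; since the scores here are similarities, the sketch negates them so that the minimum-cost assignment corresponds to the best-matching topic pairs (a standard equivalent formulation, reusing sim_between from the sketch above).

```python
# Sketch of Eq. (7): match topics of S_w to topics of S_l with the Hungarian
# method (scipy.optimize.linear_sum_assignment) and average the matched similarities.
import numpy as np
from scipy.optimize import linear_sum_assignment

def con_between(S_w, S_l):
    K = len(S_w)
    sim = np.array([[sim_between(S_w[i], S_l[j]) for j in range(K)] for i in range(K)])
    rows, cols = linear_sum_assignment(-sim)    # negate: maximize total similarity
    return sim[rows, cols].mean()
```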
    Moreover, the problem of measuring the conceptual stability within the K-way topic clustering S_w itself is also considered. This conceptual stability is calculated as follows:
\[
  con(S_w) = \frac{1}{K} \sum_{k=1}^{K} sim(R_{w_k}) \qquad (8)
\]
    We now consider the conceptual stability at a particular number of topics k. First, we apply NMF to the complete dataset X to obtain the factor matrix H, which we treat as the reference ranked lists. Let us define S_X as the reference K-way topic clustering, containing K ranked lists S_X = {R_{X_1}, . . . , R_{X_K}}.
    Subsequently, we randomly resample the documents of the original X τ times with sampling rate r to obtain random subsets of X, which we denote by X_τ. We then apply NMF to each X_τ to obtain the factor matrix H_τ. This results in τ sets {S_1, . . . , S_τ}, where each set contains k ranked lists S_j = {R_{j_1}, . . . , R_{j_k}}. Finally, we calculate the overall conceptual stability at k as follows:
\[
  stability(k) = \frac{1}{\tau} \cdot \frac{\sum_{i=1}^{\tau} con(S_X, S_i) - \sum_{i=1}^{\tau} con(S_i)}{\max\Bigl(\sum_{i=1}^{\tau} con(S_X, S_i),\; \sum_{i=1}^{\tau} con(S_i)\Bigr)} \qquad (9)
\]

    The maximum stability score is achieved if and only if each set of top t tokens appears in only one topic. Conversely, the minimum stability score is obtained if the top t tokens overwhelmingly appear in every topic.
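
Combining the pieces, Equations (8) and (9) could be sketched as follows, again reusing the hypothetical helpers above; the reconstruction of Equation (9) follows the formula as printed.

```python
# Sketch of Eqs. (8) and (9): within-clustering similarity and the overall
# conceptual stability of a reference clustering S_X against tau resampled ones.
def con_within(S):
    return sum(sim_within(R) for R in S) / len(S)

def stability(S_X, resampled):
    tau = len(resampled)
    between = sum(con_between(S_X, S_i) for S_i in resampled)
    within = sum(con_within(S_i) for S_i in resampled)
    return (between - within) / (tau * max(between, within))
```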
    This process is repeated for a range of numbers of topics k. The most appropriate value of k is identified by the highest stability(k) score. However, the scores also reveal a possible range of k for further investigation. With the k topics finalized at the first level, the dataset is split into sub-datasets in which each document is assigned to the topic with the highest score, e.g. through the W matrix:
\[
  \hat{k}_{X_i} = \operatorname*{argmax}_{k} \, W_{ik} \qquad (10)
\]
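
Equation (10) corresponds to a single argmax over the rows of W; a small illustrative sketch of the assignment and the resulting split into sub-datasets:

```python
# Sketch of Eq. (10): assign each document to its highest-weighted topic and
# split the corpus into sub-datasets accordingly.
import numpy as np

def split_by_topic(docs, W):
    labels = np.argmax(W, axis=1)                  # Eq. (10): one topic label per document
    return {h: [d for d, l in zip(docs, labels) if l == h]
            for h in range(W.shape[1])}
```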
    Then, the process is repeated to discover sub-topics in each sub-dataset. In general, we can expand the procedure deeper in the hierarchy. First, we calculate the most appropriate number of topics L for the whole dataset X. Then, for each topic l ∈ {1, . . . , L}, the corresponding subset of X is extracted. The stability(k) score is, in turn, calculated for each subset to find its best number of sub-topics.
Algorithm 1 The conceptual stability analysis approach with a 2-level hierarchy
Input: Dataset X ∈ R^{n×m}, range of numbers of topics [K', . . . , K''], number of top tokens t, sampling rate r, number of subsets τ

 1: /* find k at the first level of the hierarchy */
 2: for k ∈ K', . . . , K'' do
 3:   find W ∈ R^{n×k}, H ∈ R^{k×m} with X ≈ WH
 4:   get S_X ∈ R^{k×t} from H
 5:   for j ∈ 1, . . . , τ do
 6:     draw X_j ∈ R^{n'×m} from X
 7:     find W_j ∈ R^{n'×k}, H_j ∈ R^{k×m} with X_j ≈ W_j H_j
 8:     get S_{X_j} ∈ R^{k×t} from H_j
 9:   end for
10:   calculate stability(k), Equation (9)
11: end for
12: L = argmax_k stability(k)
13: /* find k at the second level of the hierarchy */
14: for h ∈ 1, . . . , L do
15:   X^h = ∅
16:   for each document X_i with h = argmax_k (W_{ik}) do
17:     X^h = X^h ∪ {X_i}
18:   end for
19:   for k ∈ K', . . . , K'' do
20:     find W^h ∈ R^{p×k}, H^h ∈ R^{k×m} with X^h ≈ W^h H^h
21:     get S_{X^h} ∈ R^{k×t} from H^h
22:     for j ∈ 1, . . . , τ do
23:       draw X^h_j ∈ R^{p'×m} from X^h
24:       find W^h_j ∈ R^{p'×k}, H^h_j ∈ R^{k×m} with X^h_j ≈ W^h_j H^h_j
25:       get S_{X^h_j} ∈ R^{k×t} from H^h_j
26:     end for
27:     calculate stability(k), Equation (9)
28:   end for
29:   L^h = argmax_k stability(k)
30: end for


    For ease of interpretation, we conduct the experiments with a 2-level hierarchy. An overview of the whole procedure is given in Algorithm (1).
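
As a rough end-to-end sketch of the first level of Algorithm 1, under the assumptions of the previous snippets (the helper names factorize, stability, top_tokens, and select_k are our own, not part of the paper):

```python
# Sketch of Algorithm 1, level 1: scan a range of k, resample tau subsets at
# rate r, and pick the k with the highest conceptual stability score.
import numpy as np

def top_tokens(H, vocab, t=20):
    """Top-t tokens per topic from the rows of H."""
    return [[vocab[j] for j in np.argsort(row)[::-1][:t]] for row in H]

def select_k(docs, k_range, t=20, r=0.8, tau=25):
    rng = np.random.default_rng(0)
    scores = {}
    for k in k_range:
        W, H, vocab = factorize(docs, k)
        S_X = top_tokens(H, vocab, t)              # reference ranked lists
        resampled = []
        for _ in range(tau):
            idx = rng.choice(len(docs), size=int(r * len(docs)), replace=False)
            _, H_s, vocab_s = factorize([docs[i] for i in idx], k)
            resampled.append(top_tokens(H_s, vocab_s, t))
        scores[k] = stability(S_X, resampled)
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

The same routine would then be run on each sub-dataset produced by split_by_topic to obtain the second level.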


3     Empirical Results

3.1   Datasets, Experiment Setup, and Baselines

The North America dataset is a large dataset of tweets that was originally used
for the geolocation prediction problem [19,20,7]. A document in this dataset is
Fig. 1. Experiment results on the Tweets dataset: (a) the first level of the hierarchy, (b) Student Life and Relationship, (c) Information and Networking, (d) Business and Current Affairs, (e) Routine Activities, (f) Leisure and Entertainment. Figure (1(a)) shows the discovered topics at the first level; the other panels show the discovered sub-topics at the second level. The appropriate numbers of topics k are identified by peaks in the plots, and the vertical lines mark the highest peaks.



the concatenation of all tweets by a single user. There were in total 38 million tweets posted by 430k users. The tweets were located inside a bounding box covering the contiguous United States, part of Canada, and part of Mexico. The final
Fig. 2. Experiment results on the discovered topics at the second level: (a) Sport and Games, (b) Pessimism and Negativity, (c) Wishes and Gratitude, (d) Transport and Travel. The appropriate numbers of sub-topics k are identified by peaks in the plots, and the vertical lines mark the highest peaks.




dataset after preprocessing is a very sparse text file that requires 2.2GB of storage and contains 430,000 rows (the number of documents) and 59,368 columns (the vocabulary size).
    In our experiments, we set the required model parameters as follows. The range of explored topics at the first level is {5, . . . , 25}. We expect the number of sub-topics at the second level to be smaller, so the range of explored sub-topics is {2, . . . , 12}. The number of top tokens that characterize a specific topic is set to t = 20. The sampling rate is set to r = 0.8 and the number of subsets is set to τ = 25 to cancel out random effects. Our experiments were conducted on a Xeon E5-2670v2 with a 2.5GHz clock speed and 128GB of RAM. However, our model requires at most 5GB of RAM and takes 4 days to complete.
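
For reference, with the hypothetical helper sketches from Section 2, the settings above would correspond to a call such as:

```python
# Hypothetical invocation of the first-level selection with the reported settings;
# `docs` would hold the 430,000 preprocessed user documents.
best_k, scores = select_k(docs, k_range=range(5, 26), t=20, r=0.8, tau=25)
```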
    As mentioned in the introduction, during our experiments we also compared our method with several state-of-the-art NMF-based and LDA-based topic modeling approaches [13,1,6]. However, all of these models either cannot handle a large dataset or throw resource exceptions during computation.
3.2   Topics Discovery

The framework identifies distinctive topics and their sub-topics in the documents based on the output stability scores. In theory, the deepest hierarchy of topics is reached when documents are recursively classified until each topic contains only one document. We do not specify the exact number of topics beforehand but rather a range of desired topics, and the model figures out the most appropriate values itself. In other words, the model takes (1) a very large textual dataset, (2) a desired range for the expected number of topics, and (3) a desired level of hierarchy. The hierarchy of topics is then discovered by considering the conceptual stability scores.
    Figures (1) and (2) present the potential numbers of topics and sub-topics at the first and second levels, respectively. Table (1) summarizes the topics and their sub-topics. As we can see in Table (1), the topics at the first level can be divided into two groups based on their % share. People are concerned the most with Pessimism and Negativity, Leisure and Entertainment, Student Life and Relationship, and Business and Current Affairs. We now describe all the discovered topics and their sub-topics in more detail.
    At the first level, the highest peak is found at k = 9, which means that the most distinctive number of topics for the North America tweet dataset is 9. However, we also see potentially high peaks at k = 11 and k = 7 if we wish to manually expand or condense the clustering results, respectively. Consequently, the whole dataset at the first level is divided into 9 sub-datasets in which the model continues to discover sub-topics.
    Next we consider the sub-topics. Figure (1(b)) shows that the highest peak is at k = 5, where we clearly see a Λ shape. Similarly, we see the same Λ shape in the 4th, 6th, 7th and 8th topics, which are presented in Figures (1(e), 2(a), 2(b) and 2(c)), respectively. The peak of the Λ shape indicates the number of sub-topics discovered by the model. Interestingly, the 2nd topic, Figure (1(c)), contains only 4.11% of the documents but can be divided into k = 8 distinctive sub-topics. The 3rd and 9th topics, Figures (1(d), 2(d)) respectively, show an obvious peak indicating that the most suitable number of sub-topics is k = 2, the leftmost bound of the experimented range. The 5th topic, Figure (1(f)), presents two candidates with high-magnitude peaks at k = 2 and k = 7. Although the highest peak, e.g. k = 2, is selected as the output for the sub-topics, users can manually choose the other peak as the desired output.


3.3   Topics Labeling

Having extracted the hierarchical topic structure, we next present the associated labels. Table (2) summarizes our labeling scheme. All topics and sub-topics were subjectively labeled to ease understanding and interpretation in the subsequent spatial distribution analysis. The labels were validated and assigned based on the meaning of the top tokens that characterize a specific topic or sub-topic.
    More generally, questions of accuracy can be raised about the representativeness of the labels as a description of the topics. In each discovered topic
Fig. 3. Word cloud of the 9 discovered topics. The size of each word depends on the score calculated by the topic model. Offensive language has been removed from the word cloud. The colors of the circles and words are red, yellow, green, blue, purple, brown, gray, white and turquoise for the Student Life and Relationship, Information and Networking, Business and Current Affairs, Routine Activities, Leisure and Entertainment, Sport and Games, Pessimism and Negativity, Wishes and Gratitude, and Transport and Travel topics, respectively.


and sub-topic, after collecting the top tokens based on their contribution to the meaning, a number of heuristic labeling schemes are considered to render each topic representative and distinctive. After the labels are generated, a random selection of documents is reviewed and the labels are re-validated if needed. This loop is required to ensure that the assigned labels are acceptably appropriate. It is important to note that the labeling results in this paper reflect Twitter users' opinions at the time the data was collected, not the population at large. The revealed Twitter topics were also visualized using a comparison word cloud of the top tokens in all topics and sub-topics, e.g. Figure (3). We report a principal component analysis to inspect the subjective distinctiveness of the topics in Figure (4).


4    Conclusion
In this paper, we propose a topic selection approach that integrates smoothly with NMF and can be applied to large datasets effectively. The model automatically discovers the most distinctive topics and sub-topics across as many levels of hierarchy as desired by considering conceptual stability scores. The conceptual analysis helps guide the selection of the appropriate number of topics and their sub-topics. The main strength of our approach is that it is entirely unsupervised and does not require any training step. We also demonstrate the practicability of our framework for gaining a better understanding of a textual source. Starting from
Fig. 4. Bubble plot of the 35 discovered sub-topics. The size of each bubble corresponds to the % share assigned to that sub-topic. For ease of interpretation, we report the bubble plot as a 1-component principal component analysis.


addressing the drawbacks of consensus matrix models that have existed for more than a decade, we have provided an effective and powerful framework for large-scale text mining and document clustering via NMF. We also show that several state-of-the-art LDA-based topic modeling approaches are unable to handle such a large dataset.


References

 1. Arun, R., Suresh, V., Madhavan, C.V., Murthy, M.N.: On finding the natural
    number of topics with latent dirichlet allocation: Some observations. In: Pacific-
    Asia Conference on Knowledge Discovery and Data Mining. pp. 391–402. Springer
    (2010)
 2. Berry, M.W., Browne, M.: Email surveillance using non-negative matrix factoriza-
    tion. Computational & Mathematical Organization Theory 11(3), 249–264 (2005)
 3. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. Journal of machine
    Learning research 3(Jan), 993–1022 (2003)
 4. Brunet, J.P., Tamayo, P., Golub, T.R., Mesirov, J.P.: Metagenes and molecular
    pattern discovery using matrix factorization. Proceedings of the national academy
    of sciences 101(12), 4164–4169 (2004)
 5. Cichocki, A., Anh-Huy, P.: Fast local algorithms for large scale nonnegative ma-
    trix and tensor factorizations. IEICE transactions on fundamentals of electronics,
    communications and computer sciences 92(3), 708–721 (2009)
 6. Deveaud, R., SanJuan, E., Bellot, P.: Accurate and effective latent concept mod-
    eling for ad hoc information retrieval. Document numérique 17(1), 61–84 (2014)
 7. Duong-Trung, N., Schilling, N., Schmidt-Thieme, L.: Near real-time geolocation
    prediction in twitter streams via matrix factorization based regression. In: Proceed-
    ings of the 25th ACM International on Conference on Information and Knowledge
    Management. pp. 1973–1976. ACM (2016)
 8. Fellbaum, C.: WordNet. Wiley Online Library (1998)
 9. Gillis, N.: The why and how of nonnegative matrix factorization. Regularization,
    Optimization, Kernels, and Support Vector Machines 12(257) (2014)
10. Hornik, K., Grün, B.: topicmodels: An r package for fitting topic models. Journal
    of Statistical Software 40(13), 1–30 (2011)
11. Kim, H., Choo, J., Kim, J., Reddy, C.K., Park, H.: Simultaneous discovery of
    common and discriminative topics via joint nonnegative matrix factorization. In:
    Proceedings of the 21th ACM SIGKDD International Conference on Knowledge
    Discovery and Data Mining. pp. 567–576. ACM (2015)
12. Kim, J., Park, H.: Sparse nonnegative matrix factorization for clustering (2008)
13. Kuang, D., Choo, J., Park, H.: Nonnegative matrix factorization for interactive
    topic modeling and document clustering. In: Partitional Clustering Algorithms,
    pp. 215–243. Springer (2015)
14. Kuhn, H.W.: The hungarian method for the assignment problem. Naval research
    logistics quarterly 2(1-2), 83–97 (1955)
15. Miller, G.A.: Wordnet: a lexical database for english. Communications of the ACM
    38(11), 39–41 (1995)
16. Monti, S., Tamayo, P., Mesirov, J., Golub, T.: Consensus clustering: a resampling-
    based method for class discovery and visualization of gene expression microarray
    data. Machine learning 52(1-2), 91–118 (2003)
17. Pauca, V.P., Shahnaz, F., Berry, M.W., Plemmons, R.J.: Text mining using non-
    negative matrix factorizations. In: SDM. vol. 4, pp. 452–456. SIAM (2004)
18. Porteous, I., Newman, D., Ihler, A., Asuncion, A., Smyth, P., Welling, M.: Fast
    collapsed gibbs sampling for latent dirichlet allocation. In: Proceedings of the 14th
    ACM SIGKDD international conference on Knowledge discovery and data mining.
    pp. 569–577. ACM (2008)
19. Roller, S., Speriosu, M., Rallapalli, S., Wing, B., Baldridge, J.: Supervised text-
    based geolocation using language models on an adaptive grid. In: Proceedings of
    the 2012 Joint Conference on Empirical Methods in Natural Language Processing
    and Computational Natural Language Learning. pp. 1500–1510. Association for
    Computational Linguistics (2012)
20. Wing, B., Baldridge, J.: Hierarchical discriminative classification for text-based
    geolocation. In: EMNLP. pp. 336–348 (2014)
21. Wu, Z., Palmer, M.: Verbs semantics and lexical selection. In: Proceedings of the
    32nd annual meeting on Association for Computational Linguistics. pp. 133–138.
    Association for Computational Linguistics (1994)
22. Xie, B., Song, L., Park, H.: Topic modeling via nonnegative matrix factorization on
    probability simplex. In: NIPS workshop on topic models: computation, application,
    and evaluation (2013)
23. Xu, W., Liu, X., Gong, Y.: Document clustering based on non-negative matrix
    factorization. In: Proceedings of the 26th annual international ACM SIGIR con-
    ference on Research and development in informaion retrieval. pp. 267–273. ACM
    (2003)