Sense-Level Semantic Clustering of Hashtags in Social Media

Ali Javed and Byung Suk Lee
Department of Computer Science, University of Vermont
Burlington, Vermont 05405, U.S.A.
ajaved@uvm.edu, bslee@uvm.edu

Abstract

We enhance the accuracy of the currently available semantic hashtag clustering method, which leverages hashtag semantics extracted from dictionaries such as Wordnet and Wikipedia. While immune to the uncontrolled and often sparse usage of hashtags, the current method distinguishes hashtag semantics only at the word level. Unfortunately, a word can have multiple senses, each a candidate for the exact semantics of a hashtag, and, therefore, word-level semantic clustering fails to disambiguate the true sense-level semantics of hashtags and, as a result, may generate incorrect clusters. This paper shows how this problem can be overcome through sense-level clustering and demonstrates its impacts on clustering behavior and accuracy.

1 Introduction

Hashtags are used in major social media (e.g., Twitter, Facebook, Tumblr, Instagram, Google+, Pinterest) for various purposes – to tell jokes, follow topics, place advertisements, collect consumer feedback, etc. For instance, McDonald's created the hashtag #Mcdstories to collect consumer feedback; #OccupyWallStreet, #ShareaCoke, and #NationalFriedChickenDay are only a few examples of many successful hashtag campaigns.

Twitter was the first social media platform to introduce hashtags, and it is used as the representative social media in this paper. Most tweets contain one or more hashtags in their texts of up to 140 characters.

Clustering is commonly used as a text classification technique, and clustering of hashtags is the first step in the classification of tweets given that hashtags are used to index those tweets. Therefore, admittedly, classification of tweets benefits from accurate clustering of hashtags.

Social media is arguably the best source of timely information. On Twitter alone, for example, an average of 6,000 micro-messages are posted per second (Internet Live Stats, last viewed in May 2016). Thus, social media analysts use clusters of hashtags as the basis for more complex tasks (Muntean et al., 2012) such as retrieving relevant tweets (Muntean et al., 2012; Park and Shin, 2014), tweet ranking, sentiment analysis (Wang et al., 2011), data visualization (Bhulai et al., 2012), semantic information retrieval (Teufl and Kraxberger, 2011), and user characterization. Therefore, the accuracy of hashtag clustering is important to the quality of the resulting information in those tasks.

The popular approach to hashtag clustering has been to leverage the tweet texts accompanying hashtags (Costa et al., 2013; Teufl and Kraxberger, 2011; Tsur et al., 2012; Tsur et al., 2013; Bhulai et al., 2012; Muntean et al., 2012; Rosa et al., 2011) by identifying their "contextual" semantics (Saif et al., 2012). There are two prominent problems with this approach, however. First, a majority of hashtags are not used frequently enough to find sizable tweet texts accompanying them, thus causing a sparsity problem. Second, tweet texts are open-ended, with no control over their contents at all, and therefore often exhibit poor linguistic quality. (According to Pear Analytics, 40.1% of tweets are "pointless babble" (Kelly, last viewed in May 2016).) These problems make text-based techniques ineffective for hashtag clustering. Hence, methods that utilize other means to identify the semantics of hashtags are needed.
In this regard, the focus of this paper is on leveraging dictionary metadata to identify the semantics of hashtags. We adopt the pioneering work done by Vicient and Moreno (2014). Their approach identifies the "lexical" semantics of hashtags from external resources (e.g., Wordnet, Wikipedia) independent of the tweet messages themselves. To the best of our knowledge, their work is the only one that uses this metadata-based approach. This approach has the advantage of being immune to the sparsity and poor linguistic quality of tweet messages, and the results of their work demonstrate it.

On the other hand, their work has a major drawback, in that it makes clustering decisions at the word level while the correct decision can be made at the sense (or "concept") level. It goes without saying that the correct use of metadata is critical to the performance of any metadata-based approach, and indeed clustering hashtags based on their word-level semantics has been shown to erroneously put hashtags of different senses in the same cluster (more on this in Section 4).

In this paper, we devise a more accurate sense-level metadata-based semantic clustering algorithm. The critical area of improvement is in the construction of the similarity matrix between pairs of hashtags, which is then input to a clustering algorithm. The immediate benefits are shown in the accuracy of the resulting clusters, and we demonstrate them using a toy example. Experimental results using gold standard testing show a 26% gain in clustering accuracy in terms of the weighted average pairwise maximum f-score (Equation 5), where the weight is the size of a ground truth cluster. Despite the gain in clustering accuracy, we were able to keep the run-time and space overheads of similarity matrix construction within a constant factor (e.g., 5 to 10) through a careful implementation scheme.

The remainder of this paper is organized as follows. Section 2 provides some background knowledge. Section 3 describes the semantic hashtag clustering algorithm designed by Vicient and Moreno (2014). Section 4 discusses the proposed sense-level semantic enhancement to the clustering algorithm, and Section 5 presents its evaluation against the word-level semantic clustering. Section 6 discusses other work related to semantic hashtag clustering. Section 7 summarizes the paper and suggests future work.

2 Background

2.1 Wordnet – synset hierarchy and similarity measure

Wordnet is a free and publicly available lexical database of the English language. It groups English words into sets of synonyms called synsets. Each word in Wordnet must point to at least one synset, and each synset must point to at least one word. Hence, there is a many-to-many relationship between synsets and words (Vicient, 2014). Synsets in Wordnet are interlinked by their semantic and lexical relationships, which results in a network of meaningfully related words and concepts.

Concept       Meaning
desert.n.01   arid land with little or no vegetation
abandon.v.05  leave someone who needs or counts on you; leave in the lurch
defect.v.01   desert (a cause, a country or an army), often in order to join the opposing cause, country, or army
desert.v.03   leave behind

Table 1: Example synset for the word "desert".

Table 1 shows an example synset. The synset contains 4 different concepts, where a concept is a specific sense of a word – e.g., "desert" meaning "arid land with little or no vegetation" and "desert" meaning "to leave someone who needs or counts on you". All of these concepts are linked to each other using the semantic and lexical relationships mentioned. For example, "oasis.n.01" (meaning "a fertile tract in a desert") is a meronym of "desert.n.01", i.e., "oasis.n.01" is a part of "desert.n.01".
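For illustration, the concept listing in Table 1 can be reproduced programmatically. The following is a minimal sketch assuming NLTK's Wordnet corpus reader; our implementation is in Python, but the specific Wordnet API is not prescribed by this paper.

# A sketch of enumerating the senses behind Table 1, assuming the
# NLTK Wordnet corpus reader; any Wordnet API would work similarly.
from nltk.corpus import wordnet as wn

for synset in wn.synsets('desert'):
    print(synset.name(), '-', synset.definition())
# desert.n.01 - arid land with little or no vegetation
# abandon.v.05 - leave someone who needs or counts on you; ...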
Given this network of relationships, Wordnet is frequently used in automatic text analysis through its application program interface (API). There are different API functions that allow for the calculation of semantic similarity between synsets, and the Wu-Palmer similarity measure (Wu and Palmer, 1994) is used in this paper in order to stay consistent with the baseline algorithm by Vicient and Moreno (2014). In a lexical database like the Wordnet synset database, where concepts are organized in a hierarchical structure, the Wu-Palmer similarity between two concepts $C_1$ and $C_2$, denoted as $sim_{WP}(C_1, C_2)$, is defined as

$$sim_{WP}(C_1, C_2) = \frac{2 \cdot depth(LCS(C_1, C_2))}{depth(C_1) + depth(C_2)} \quad (1)$$

where $LCS(C_1, C_2)$ is the least common subsumer of $C_1$ and $C_2$ in the hierarchy of synsets.

This Wordnet functionality is used to calculate the semantic similarity between hashtags in this paper, that is, by grounding hashtags to specific concepts (called "semantic grounding") and calculating the similarity between the concepts.
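As a concrete illustration of Equation 1 and of the relationships above, the sketch below computes the Wu-Palmer similarity between two grounded concepts, again assuming NLTK's Wordnet interface.

# Wu-Palmer similarity (Equation 1) between two concepts, assuming
# the NLTK Wordnet interface.
from nltk.corpus import wordnet as wn

c1 = wn.synset('desert.n.01')
c2 = wn.synset('oasis.n.01')
print(c1.wup_similarity(c2))   # a value in (0, 1]
print(c1.part_meronyms())      # the meronym relationship mentioned above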
2.2 Wikipedia – auxiliary categories

Wikipedia is by far the most popular crowd-sourced encyclopedia. Not all hashtags can be grounded semantically using Wordnet because many of them are simply not legitimate terms found in Wordnet (e.g., #Honda). This situation is where Wikipedia can be used to look up those hashtags. Wikipedia provides auxiliary categories for each article. For example, when Wikipedia is queried for categories related to the page titled "Honda", it returns the following auxiliary categories.

['Automotive companies of Japan',
 'Companies based in Tokyo',
 'Boat builders',
 'Truck manufacturers',
 ...
]

Auxiliary categories can be thought of as the categories the page belongs to. In this example, if we are unable to look up the word "Honda" on Wordnet, then, through the help of these auxiliary categories, we can relate the term to Japan, Automotive, Company, etc. There are several open-source Wikipedia APIs available to achieve this purpose – for example, the Python library "wikipedia".
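For example, the auxiliary categories above can be retrieved with a few lines of code. This is a sketch using the "wikipedia" library just mentioned; the error handling shown is our assumption, not part of the original algorithm.

# Retrieving auxiliary categories with the Python "wikipedia" library.
import wikipedia

try:
    page = wikipedia.page("Honda")
    print(page.categories[:4])   # ['Automotive companies of Japan', ...]
except wikipedia.exceptions.PageError:
    pass                         # no matching Wikipedia article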
2.3 Hierarchical clustering

Hierarchical clustering is a viable approach to cluster analysis, and is particularly suitable for the purpose of hashtag clustering in this paper for a few reasons. First, the approach does not require a priori information about the number of clusters. (The number of output clusters is not known in most real applications.) Second, it is suited to the taxonomic nature of language semantics. Third, it facilitates a fair comparison with the algorithm by Vicient and Moreno (2014), which also uses hierarchical clustering.

There are two popular strategies for hierarchical clustering – bottom-up (or agglomerative) and top-down (or divisive). In the bottom-up strategy, each element starts in its own cluster and two clusters are merged to form one larger cluster as the clustering process moves up the hierarchy. In the top-down strategy, all elements start in one cluster and one cluster is split into two smaller clusters as the clustering process moves down the hierarchy. The bottom-up strategy is used in this paper because it is conceptually simpler than top-down (Manning et al., 2008).

For the bottom-up strategy, several distance measurement methods are available to provide linkage criteria for building up a hierarchy of clusters. Among them, the single-linkage method and the unweighted pair group method with arithmetic mean (UPGMA) are used most commonly, and both are used in this paper. The single-linkage method calculates the distance between two clusters $C_u$ and $C_v$ as

$$d(C_u, C_v) = \min_{u_i \in C_u \wedge v_j \in C_v} dist(u_i, v_j) \quad (2)$$

and UPGMA calculates the distance as

$$d(C_u, C_v) = \frac{\sum_{u_i \in C_u, v_j \in C_v} dist(u_i, v_j)}{|C_u| \times |C_v|} \quad (3)$$

where $|C_u|$ and $|C_v|$ denote the number of elements in clusters $C_u$ and $C_v$, respectively.

To generate output clusters, "flat clusters" are extracted from the hierarchy. There are multiple possible criteria to do that (SciPy.org, last viewed in May 2016), and in this paper we use the "distance criterion" – that is, given either of the distance measures discussed above, flat clusters are formed from the hierarchy such that the items in each cluster are no farther apart than a distance threshold.
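The sketch below shows this pipeline with SciPy, whose flat-cluster extraction is cited above; the 3-by-3 toy distance matrix is hypothetical.

# Bottom-up hierarchical clustering with the "distance criterion",
# using SciPy (see the fcluster documentation cited above). The toy
# distance matrix below is hypothetical.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

dist = np.array([[0.0, 0.2, 0.9],
                 [0.2, 0.0, 0.8],
                 [0.9, 0.8, 0.0]])      # 1's complement of similarities

Z = linkage(squareform(dist), method='single')   # 'average' for UPGMA
print(fcluster(Z, t=0.5, criterion='distance'))  # e.g., [1 1 2]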
3 Semantic Hashtag Clustering

We adopted the semantic clustering approach proposed by Vicient and Moreno (2014) specifically for hashtags. This approach uses Wordnet and Wikipedia as the metadata for identifying the lexical semantics of a hashtag. The source code of their algorithms was not available, and therefore we implemented the approach described in Vicient's PhD dissertation (Vicient, 2014) to the best of our abilities using the algorithms and descriptions provided.

There are three major steps in their semantic clustering algorithm: (a) semantic grounding, (b) similarity matrix construction, and (c) semantic clustering. Algorithm 1 summarizes the steps.

Input: list H of hashtags
Output: clusters
Stage 1 (Semantic grounding):
  Step 1: For each hashtag h ∈ H perform Step 1a.
  Step 1a: Look up h in Wordnet. If h is found, then append the synset of h to a list (LCh). Otherwise, segment h into multiple words, drop the leftmost word, and then try Step 1a again using the reduced h until either a match is found in Wordnet or no more words are left in h.
  Step 2: For each h ∈ H that has an empty list LCh, look up h in Wikipedia. If an article matching h is found in Wikipedia, acquire the list of auxiliary categories for the article, extract main nouns from the auxiliary categories, and then, for each main noun extracted, go to Step 1a using the main noun as h.
Stage 2 (Similarity matrix construction):
  Discard any hashtag h that has an empty LCh. Calculate the maximum pairwise similarity between each pair of lists LChi and LChj (i ≠ j) using any ontology-based similarity measure.
Stage 3 (Clustering):
  Perform clustering on the distance matrix (1's complement of the similarity matrix) resulting from Stage 2.

Algorithm 1: Semantic hashtag clustering (Vicient and Moreno, 2014).

In the first stage (i.e., semantic grounding), each hashtag is looked up in Wordnet. If there is a direct match, that is, the hashtag is found in Wordnet, then it is added as a single candidate synset, and, accordingly, all the concepts (or senses) (see Section 2.1) belonging to the synset are saved in the form of a list of candidate concepts related to the hashtag. We call this list LCh. If, on the other hand, the hashtag is not found in Wordnet, then the hashtag is split into multiple terms (using a word segmentation technique) and the leftmost term is dropped sequentially until either a match is found in Wordnet or there is no more term left.

Each hashtag that was not found in Wordnet in Step 1 (i.e., of which the LCh is empty) is looked up in Wikipedia. If a match is found in Wikipedia, the auxiliary categories (see Section 2.2) of the article are acquired. Main nouns from the auxiliary categories are then looked up in Wordnet, and if a match is found, we save the concepts by appending them to the list LCh; this step is repeated for each main noun.

In the second stage (i.e., similarity matrix construction), first, hashtags associated with an empty list of concepts are discarded; in other words, hashtags that did not match any Wordnet entry, either by themselves or by using the word segmentation technique, and also had no entry found in Wikipedia are discarded. Then, using the remaining hashtags (each of whose LCh contains at least one concept), the semantic similarity is calculated between each pair of them. Any ontology-based measure can be used, and the Wu-Palmer measure (see Section 2.1) has been used in our work to stay consistent with the original work by Vicient and Moreno (2014). Specifically, the similarity between two hashtags, hi and hj, is calculated as the maximum pairwise similarity (based on the Wu-Palmer measure) between one set of concepts in LChi and another set of concepts in LChj. Calculating the similarity this way is expected to find the correct sense of a hashtag (among all the senses/concepts in LCh).

Finally, in the third stage (i.e., clustering), any clustering algorithm can be used to cluster hashtags based on the similarity matrix obtained in the second stage. As mentioned earlier, in this paper we use hierarchical clustering, which was used in the original work by Vicient and Moreno (2014).
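To make Stage 1 concrete, the following is a minimal sketch of semantic grounding under stated assumptions: segment() and main_noun() are naive hypothetical placeholders standing in for the word segmentation and main-noun extraction steps, which Algorithm 1 leaves unspecified.

# A sketch of Stage 1 (semantic grounding) of Algorithm 1. The two
# helpers below are naive placeholders, not part of the original work.
from nltk.corpus import wordnet as wn
import wikipedia

def segment(term):
    # Placeholder: a real implementation would use a word segmenter
    # (e.g., to split 'friedchicken' into ['fried', 'chicken']).
    return [term]

def main_noun(category):
    # Placeholder: naively take the first word of the category name.
    return category.split()[0]

def ground_in_wordnet(term):
    """Step 1a: look up a term, dropping the leftmost word on failure."""
    words = segment(term)
    while words:
        synsets = wn.synsets('_'.join(words))
        if synsets:
            return synsets                # candidate concept list LCh
        words = words[1:]                 # drop the leftmost word, retry
    return []

def ground(hashtag):
    """Stage 1 for one hashtag (the '#' is assumed stripped)."""
    concepts = ground_in_wordnet(hashtag)
    if not concepts:                      # Step 2: Wikipedia fallback
        try:
            for category in wikipedia.page(hashtag).categories:
                concepts += ground_in_wordnet(main_noun(category))
        except wikipedia.exceptions.PageError:
            pass                          # discarded later, in Stage 2
    return concepts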
4 Sense-Level Semantic Hashtag Clustering

In this section, we describe the enhancement made to the word-level semantic hashtag clustering and showcase its positive impact using a toy example. Both Stage 1 (i.e., semantic grounding) and Stage 3 (i.e., clustering) of the sense-level semantic clustering algorithm are essentially the same as those in the word-level semantic clustering algorithm (see Algorithm 1 in Section 3). So, here, we discuss only Stage 2 (i.e., similarity matrix construction) of the algorithm, with a focus on the difference in the calculation of maximum pairwise similarity.

4.1 Similarity matrix construction

4.1.1 Word-level versus sense-level similarity matrix

As mentioned in Section 3, the similarity between two hashtags hi and hj is defined as the maximum pairwise similarity between one set of senses in LChi and another set of senses in LChj. (Recall that LCh denotes a list of senses retrieved from Wordnet to semantically ground a hashtag h.) This maximum pairwise similarity is an effective choice for disambiguating the sense of a hashtag and was used to achieve a positive effect in the word-based approach by Vicient and Moreno (2014).

However, we have observed many instances where a hashtag word has multiple senses and this introduces an error in the clustering result. That is, the word-level algorithm does not distinguish among different senses of the same word when constructing a similarity matrix and, as a result, two hashtags are misjudged to be semantically similar (because they are similar to a third hashtag in two different senses) and are included in the same cluster. Moreover, a false triangle that violates the triangle inequality property may be formed at the word level. (Note that this property is required of any distance metric like Wu-Palmer.) See Figure 1 for an illustration. As a side effect, we have observed that a cluster tends to be formed centered around a hashtag that takes on multiple senses.

[Figure 1: An illustration of clustering at the word level versus the sense level. Panels: (a) sense level; (b) word level. Edge weights denote similarity values (= 1 − distance). Assume the minimum similarity threshold is 0.5. Then, at the sense level (a), two clusters ({H1, H2} and {H1, H3}) should be formed because H2 and H3 are not similar (note 0.1 < 0.5), but, at the word level (b), one cluster {H1, H2, H3} is formed because it appears as if H2 and H3 were similar via H1. Moreover, the false triangle that appears to be formed at the word level violates the triangle inequality because dist(H1, H2) + dist(H1, H3) < dist(H2, H3).]

Thus, we chose to explicitly record the sense in which a hashtag is close to another hashtag when constructing the similarity matrix. This sense-level handling of hashtag semantic distance ensures that the incorrect clustering problem of word-level clustering does not happen. Accordingly, it avoids the formation of clusters that are centered around a hashtag that has multiple senses.

4.1.2 Word-level similarity matrix construction

Algorithm 2 outlines the steps of calculating the maximum pairwise similarity between hashtags in the word-level algorithm. One maximum pairwise similarity value is calculated for each pair of hashtags semantically grounded in the previous stage (i.e., Stage 1) and is entered into the similarity matrix. The similarity matrix size is |H|^2, where H is the set of hashtags that have at least one sense (i.e., a nonempty LCh). Note that the pairwise similarity comparison is still done at the sense level, considering all senses of the hashtags that are compared. (In the listing below, the initialization of maxSim has been moved inside the pair loop, correcting an apparent oversight in the original listing.)

Input: set H of hashtags h with nonempty LCh.
Output: pairwise hashtag similarity matrix.
1  Initialize an empty similarity matrix M[|H|, |H|].
2  for each pair (hi, hj) of hashtags in H do
3    // Calculate the maximum pairwise similarity between hi and hj.
4    Initialize maxSim to 0.
5    for each sp ∈ LChi do
6      for each sq ∈ LChj do
7        Calculate the similarity sim between sp and sq.
8        if sim > maxSim then
9          Update maxSim to sim.
10       end
11     end
12   end
13   Enter maxSim into M[i, j].
14 end

Algorithm 2: Word-level construction of the semantic similarity matrix.
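Under the assumption that each hashtag's LCh is available as a list of NLTK synsets (e.g., from the grounding sketch in Section 3), Algorithm 2 can be sketched in a few lines; the function and variable names are ours.

# A sketch of Algorithm 2: word-level similarity matrix construction.
# `grounded` maps each hashtag to its nonempty concept list LCh.
import numpy as np

def word_level_matrix(grounded):
    tags = list(grounded)
    M = np.zeros((len(tags), len(tags)))
    for i, hi in enumerate(tags):
        for j in range(i + 1, len(tags)):
            # Maximum pairwise (Wu-Palmer) similarity over all sense
            # pairs; wup_similarity() returns None for incomparable
            # parts of speech, which we treat as 0.
            max_sim = max((sp.wup_similarity(sq) or 0.0)
                          for sp in grounded[hi]
                          for sq in grounded[tags[j]])
            M[i, j] = M[j, i] = max_sim
    return tags, M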
4.1.3 Sense-level similarity matrix construction

Algorithm 3 outlines the steps of constructing the similarity matrix in the sense-level algorithm. Unlike in the word-level algorithm, entries in the similarity matrix are between the senses that make the maximum similarity pairs between pairs of hashtags. Since these senses are not known until the maximum pairwise similarity calculations are completed, the construction of the similarity matrix is deferred until then. In the first phase (Lines 2-16), for each pair of hashtags, the algorithm saves the pair of senses (hi.sp, hj.sq) in the maximum similarity pair and the maximum similarity value in the list LHs. Then, in the second phase (Lines 18-22), for each triplet element (hi.sp, hj.sq, maxSim) in LHs, the algorithm enters the maximum similarity value maxSim at the matrix index corresponding to the pair of senses (hi.sp, hj.sq).

Input: set H of hashtags h with nonempty LCh.
Output: pairwise hashtag similarity matrix.
1  Create an empty list LHs of (hashtag sense pair, pairwise maximum similarity).
2  for each pair (hi, hj) of hashtags in H do
3    // Calculate the maximum pairwise similarity between hi and hj.
4    Initialize maxSim to 0.
5    Initialize maxSimPair to (null, null).
6    for each sp ∈ LChi do
7      for each sq ∈ LChj do
8        Calculate the similarity sim between sp and sq.
9        if sim > maxSim then
10         Update maxSim to sim.
11         Update maxSimPair to (hi.sp, hj.sq).
12       end
13     end
14   end
15   Add (maxSimPair, maxSim) to LHs.
16 end
17 // Construct the similarity matrix.
18 Count the number |Ŝ| of distinct hashtag senses in LHs.
19 Initialize a similarity matrix M[|Ŝ|, |Ŝ|] as a 0 matrix.
20 for each triplet (hi.sp, hj.sq, maxSim) in LHs do
21   Update M[m, n] to maxSim, where (m, n) is the matrix index for (hi.sp, hj.sq).
22 end

Algorithm 3: Sense-level construction of the semantic similarity matrix.

This two-phase construction of the similarity matrix brings two advantages. First, it enables the algorithm to use exactly the needed number of matrix entries for those senses that are distinct among all senses that constitute pairwise maximum similarities between hashtags. The size of the matrix, therefore, is |Ŝ|^2, where Ŝ is the set of distinct senses in LHs (see Lines 18-19). Second, it enables the algorithm to add exactly the needed number of entries, that is, |H|^2 entries (i.e., one for each pair of hashtags (see Lines 20-22)), into a matrix of size |Ŝ|^2, where |Ŝ|^2 > |H|^2. (The remaining entries are initialized to 0 and remain 0, as they are for pairs of senses that do not represent a maximum similarity pair between any hashtags.) Our observation is that the ratio |Ŝ|/|H| is limited to between 5 and 10 for most individual hashtags, which is consistent with Vicient's statement (Vicient, 2014) that, out of 903 semantically grounded hashtags, almost 100 of them have only 2 senses and very few have more than 5 senses.

Since what is clustered are hashtags, although their similarities are measured at the sense level, a number of interesting points hold. First, we do not need to add similarities between all pairs of senses to the similarity matrix. Second, a hashtag may appear in multiple clusters, where each cluster is formed based on distinct senses of the hashtag, and therefore the resulting clusters are overlapping.
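The two phases of Algorithm 3 can be sketched as follows, again assuming NLTK synsets; identifying a sense by the pair (hashtag, synset name) is our choice of representation.

# A sketch of Algorithm 3: sense-level similarity matrix construction.
import numpy as np

def sense_level_matrix(grounded):
    # Phase 1 (Lines 2-16): record, per hashtag pair, the sense pair
    # attaining the maximum similarity (the list LHs of the paper).
    tags = list(grounded)
    LHs = []
    for i, hi in enumerate(tags):
        for hj in tags[i + 1:]:
            max_sim, pair = 0.0, None
            for sp in grounded[hi]:
                for sq in grounded[hj]:
                    sim = sp.wup_similarity(sq) or 0.0
                    if sim > max_sim:
                        max_sim = sim
                        pair = ((hi, sp.name()), (hj, sq.name()))
            if pair is not None:
                LHs.append((pair, max_sim))

    # Phase 2 (Lines 18-22): index the distinct hashtag senses and
    # enter only the maximum-similarity values; the rest stay 0.
    senses = sorted({s for pair, _ in LHs for s in pair})
    index = {s: k for k, s in enumerate(senses)}
    M = np.zeros((len(senses), len(senses)))
    for (p, q), max_sim in LHs:
        M[index[p], index[q]] = M[index[q], index[p]] = max_sim
    return senses, M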
4.2 A toy example

To demonstrate the merit of clustering at the sense level as opposed to the word level, we made a toy set of hashtags and ran the metadata-based semantic clustering algorithm at both the word level and the sense level. The hashtags used are #date, #august, #tree, and #fruit. From Wordnet, we found that there were 3 senses associated with the word august, 13 senses with date, 5 senses with fruit, and 7 senses with tree.

Using the Wu-Palmer similarity measure (explained in Section 2.1) at the word level, we obtained the distance matrix shown below.

Hashtag  august  date   fruit  tree
august   0.000   0.200  0.500  0.667
date     0.200   0.000  0.100  0.400
fruit    0.500   0.100  0.000  0.556
tree     0.667   0.400  0.556  0.000

Then, to perform clustering using the word-level distance matrix as the input, we used both single-linkage and UPGMA (see Section 2.3) as the measures to calculate the distance between newly formed clusters, and the distance threshold for extracting flat clusters from the hierarchical clusters was set to 0.5.

Table 2 shows the clusters obtained using word-level clustering. We see that #august, #date, and #fruit are included in the same cluster for both distance measures. This example demonstrates a case in which #date takes on multiple sense identities and glues together #august and #fruit in the same cluster at the word level, although these two are not similar at the sense level, as shown next.

Hashtag  Cluster using single-linkage  Cluster using UPGMA
august   1                             1
date     1                             1
fruit    1                             1
tree     1                             2

Table 2: Cluster assignment at the word level.

Now, using sense-level clustering, out of a total of 28 senses associated with the four hashtags, the algorithm picked the 10 senses shown in Table 3. These 10 senses were picked as a result of the maximum pairwise similarity calculations between the two sets of senses belonging to each pair of hashtags. (With 4 hashtags, there is a maximum of 12 senses that can be obtained for the 6 (= C(4, 2)) maximum similarity pairs, and in this example case, there were duplicate senses, consequently giving 10 distinct senses.) As mentioned earlier, each of these senses represents the semantics of the hashtag word it belongs to, and thus makes an entry into the similarity (or distance) matrix input to the hierarchical clustering algorithm.

Sense        Semantics
august.n.01  the month following July and preceding September
august.a.01  of or befitting a lord
corner.v.02  force a person or animal into a position from which he cannot escape
date.n.02    a participant in a date
date.n.06    the particular day, month, or year (usually according to the Gregorian calendar) that an event occurred
date.n.08    sweet edible fruit of the date palm with a single long woody seed
fruit.n.01   the ripened reproductive body of a seed plant
fruit.v.01   cause to bear fruit
tree.n.01    a tall perennial woody plant having a main trunk and branches forming a distinct elevated crown; includes both gymnosperms and angiosperms
yield.n.03   an amount of product

('n' stands for noun, 'v' for verb, and 'a' for adjective.)

Table 3: Senses and their semantics (source: Wordnet).

The distance matrix obtained from the 10 senses is shown in Figure 2. The off-diagonal entries smaller than 1.000 are the maximum similarity values entered (shown in bold face in the original figure). Note that a distance of 1.000 means a similarity of 0.000.

Hashtag sense (hashtag)  aug.n.01  aug.a.01  cor.v.02  date.n.02  date.n.06  date.n.08  fru.n.01  fru.v.01  tree.n.01  yie.n.03
august.n.01 (august)     0.000     1.000     1.000     1.000      0.200      1.000      1.000     1.000     1.000      1.000
august.a.01 (august)     1.000     0.000     0.667     1.000      1.000      1.000      1.000     0.500     1.000      1.000
corner.v.02 (tree)       1.000     0.667     0.000     1.000      1.000      1.000      1.000     1.000     1.000      1.000
date.n.02 (date)         1.000     1.000     1.000     0.000      1.000      1.000      1.000     1.000     0.400      1.000
date.n.06 (date)         0.200     1.000     1.000     1.000      0.000      1.000      1.000     1.000     1.000      1.000
date.n.08 (date)         1.000     1.000     1.000     1.000      1.000      0.000      0.100     1.000     1.000      1.000
fruit.n.01 (fruit)       1.000     1.000     1.000     1.000      1.000      0.100      0.000     1.000     1.000      1.000
fruit.v.01 (fruit)       1.000     0.500     1.000     1.000      1.000      1.000      1.000     0.000     1.000      1.000
tree.n.01 (tree)         1.000     1.000     1.000     0.400      1.000      1.000      1.000     1.000     0.000      0.556
yield.n.03 (fruit)       1.000     1.000     1.000     1.000      1.000      1.000      1.000     1.000     0.556      0.000

Figure 2: Distance matrix in the toy example.

Table 4 shows the resulting cluster assignments. (The outcome is the same for both distance measures, which we believe is coincidental.) We see that #august and #date are together in the same cluster, and so are #date and #fruit, but, unlike in the word-level clustering result, the three of #august, #date, and #fruit are not all together in the same cluster. This separation is because, at the sense level, #date can no longer take on multiple identities as it did at the word level.

Hashtag  Hashtag sense  Cluster using single-linkage  Cluster using UPGMA
date     date.n.02      1                             1
tree     tree.n.01      1                             1
fruit    yield.n.03     2                             2
fruit    fruit.v.01     3                             3
august   august.a.01    3                             3
tree     corner.v.02    4                             4
fruit    fruit.n.01     5                             5
date     date.n.08      5                             5
august   august.n.01    6                             6
date     date.n.06      6                             6

Table 4: Cluster assignment at the sense level.

5 Evaluation

In this evaluation, all algorithms were implemented in Python and the experiments were performed on a computer with the OS X operating system, a 2.6 GHz Intel Core i5 processor, and 8 GB of 1600 MHz DDR3 memory.

5.1 Experiment setup

5.1.1 Performance metric

Evaluating clustering output is known to be a "black art" (Jain and Dubes, 1988) with no objective accuracy criterion. It is particularly challenging for the semantic hashtag clustering addressed in this paper. Sense-level clustering generates more items to be clustered than word-level clustering, and the output clusters are overlapping. Therefore, internal measures (e.g., Silhouette coefficient, SSE) are not desirable because they simply consider the cohesion and separation among the output clusters without regard to the semantic accuracy of the items clustered. For this reason, our evaluation is done with an external "standard" (i.e., a gold standard test). To this end, we use the f-score, which is commonly used in conjunction with recall and precision to evaluate clusters in reference to ground truth clusters, as the accuracy metric.

In our evaluation, the f-score is calculated for each pair of a cluster in the ground truth cluster set and a cluster in the evaluated algorithm's output cluster set. Then, the final f-score resulting from the comparison of the two cluster sets is obtained in two different ways, depending on the purpose of the evaluation. For the purpose of evaluating individual output clusters, the pairwise maximum (i.e., "best match") f-score, denoted as the fm-score, is used as the final score. Given a ground truth cluster $G_i$ matched against an output cluster set $C$, the fm-score is obtained as

$$f_m\text{-}score(G_i, C) = \max_{C_j \in C \,\wedge\, f\text{-}score(G_i, C_j) > 0} f\text{-}score(G_i, C_j) \quad (4)$$

where the pairwise matching is one-to-one between $G$ and $C$.

On the other hand, for comparing the overall accuracy of the entire set of clusters, the weighted average of the pairwise maximum f-scores, denoted as the fa-score, is used instead. Given a ground truth cluster set $G$ and an output cluster set $C$, the fa-score is calculated as

$$f_a\text{-}score(G, C) = \frac{\sum_{G_i \in G} \left( f_m\text{-}score(G_i, C) \times |G_i| \right)}{\sum_{G_i \in G} |G_i|} \quad (5)$$
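For reference, Equations 4 and 5 amount to the following computation over clusters represented as sets of hashtags (a sketch that, for brevity, omits the one-to-one matching constraint mentioned above):

# A sketch of the fm-score (Equation 4) and fa-score (Equation 5).
# Clusters are represented as Python sets of hashtags.
def f_score(g, c):
    overlap = len(g & c)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(c), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def fm_score(g_i, output_clusters):                 # Equation 4
    return max((f_score(g_i, c) for c in output_clusters), default=0.0)

def fa_score(ground_truth, output_clusters):        # Equation 5
    total = sum(len(g) for g in ground_truth)
    return sum(fm_score(g, output_clusters) * len(g)
               for g in ground_truth) / total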
5.1.2 Dataset

With the focus of the evaluation on comparing the sense-level and word-level versions of the same clustering algorithm, deliberate choices were made in the selection of the datasets and the number of hashtags used in the experiments so that they match those used in the evaluation of word-level clustering by Vicient and Moreno (2014). They used tweet messages from the Symplur website, and so did we.

We manually gathered a tweet dataset from the Symplur web site (http://www.symplur.com). The dataset consists of 1,010 unique hashtags that are included in 2,910 tweets. The median number of tweets per hashtag was only two. (The distribution of the number of tweet messages per hashtag generally follows the power law (Muntean et al., 2012).)

5.2 Experiment: gold standard test

In order to enable the gold standard test, we prepared a ground truth based on observed hashtag semantics. Out of the 1,010 hashtags, we manually annotated the semantics to choose 230 hashtags and classified them into 15 clusters. The remaining hashtags were classified as noise. Figure 3 shows the sizes of the 15 ground truth clusters.

[Figure 3: Sizes of ground truth clusters.]

The distance threshold for determining flat clusters in hierarchical clustering was set using the "best result" approach. That is, we tried both distance measures (i.e., single-linkage and UPGMA) and different distance threshold values, and picked the measure and value that produced the best result based on the weighted average f-score measure.

Figure 4 shows the accuracies achieved by the metadata-based semantic clustering at the word level and the sense level. Table 5 shows more details, including precision and recall for individual clusters. From the results we see that every sense-level cluster outperforms its word-level counterpart (except cluster 1, due to a rounding-off difference). Notably, the fm-scores are zero for word-level clusters 6, 14, and 15, thus bringing the performance gain to "infinity". (Word-level clustering did not generate any cluster of size 3 or larger with a best-match f-score to clusters 6, 14, and 15 greater than 0.1.) Further, when all 15 clusters are considered together, the weighted average of the maximum pairwise f-scores, the fa-score, is 0.43 for sense-level clustering and 0.34 for word-level clustering – a 26% gain.

[Figure 4: Maximum pairwise f-scores of output clusters from word-level and sense-level semantic clustering.]

     Ground truth    Sense-level clusters                 Word-level clusters
Id   Size            Recall  Precision  fm-score  Size    Recall  Precision  fm-score  Size
1    32              0.63    0.65       0.63      31      0.63    0.67       0.65      30
2    26              0.35    0.39       0.37      23      0.31    0.35       0.33      23
3    23              0.39    0.43       0.41      21      0.35    0.19       0.24      43
4    23              0.91    0.84       0.88      25      0.83    0.76       0.79      25
5    22              0.41    0.45       0.43      20      0.41    0.20       0.27      44
6    14              0.21    0.18       0.19      17      n/a     n/a        n/a       n/a
7    14              0.64    0.50       0.56      18      0.64    0.50       0.56      18
8    12              0.25    0.43       0.32      7       0.50    0.24       0.32      25
9    11              0.82    0.39       0.53      23      0.82    0.08       0.14      118
10   11              0.18    0.11       0.14      18      0.09    0.17       0.12      6
11   10              0.40    0.27       0.32      15      0.50    0.08       0.14      59
12   9               0.11    0.25       0.15      4       0.11    0.17       0.13      6
13   9               0.22    0.33       0.27      6       0.22    0.29       0.25      7
14   8               0.13    0.25       0.17      4       n/a     n/a        n/a       n/a
15   6               0.17    0.20       0.18      5       n/a     n/a        n/a       n/a

The fa-score (weighted average of fm-scores) is 0.43 for the sense-level clusters and 0.34 for the word-level clusters.

Table 5: Details of gold standard test results.

6 Related Work

There are several works on semantic clustering of hashtags that focused on the contextual semantics of hashtags (Tsur et al., 2012; Tsur et al., 2013; Muntean et al., 2012; Rosa et al., 2011; Stilo and Velardi, 2014) by using the bag-of-words model to represent the texts accompanying a hashtag.
Tsur et al. (2012; 2013) and Muntean et al. (2012) appended the tweets that belonged to each unique hashtag into a unique document called a "virtual document". These documents were then represented as vectors in the vector space model. Rosa et al. (2011) used hashtag clusters to achieve topical clustering of tweets, where they compared the effects of expanding URLs found in tweets. Stilo and Velardi (2014) clustered hashtag "senses" based on their temporal co-occurrence with other hashtags. The term "sense" in their work is different from the lexical sense used in this paper.

Lacking the ability to form lexical semantic sense-level clusters of hashtags has been a major shortcoming of the current approaches. To the best of our knowledge, the work by Vicient and Moreno (2014) is the only one that opened research in this direction. They used Wordnet and Wikipedia as the metadata sources for clustering hashtags at the word level.

7 Conclusion

In this paper, we enhanced the current metadata-based semantic hashtag clustering algorithm by determining the semantic similarity between hashtags at the sense level as opposed to the word level. This sense-level decision on clustering avoids incorrectly putting hashtags of different senses in the same cluster. The result was significantly higher accuracy of semantic clusters without increasing the complexity of the algorithm in practice. A gold standard test showed that the sense-level algorithm produced significantly more accurate clusters than the word-level algorithm, with an overall gain of 26% in the weighted average of the maximum pairwise f-scores.

For future work, new metadata sources can be added to give the metadata-based semantic hashtag clustering algorithm more abilities. For example, to understand hashtags of a different language, online translation services like Google Translate (https://translate.google.com) can be a good source, since empirical evidence suggests that they can be very effective in identifying spelling errors, abbreviations, etc. Additionally, crowdsourced websites like Urban Dictionary (www.urbandictionary.com), which specializes in informal human communication, can be a helpful metadata source for decoding the lexical semantics of hashtags. Internet search engines also provide rich information on the semantics of hashtags.

Acknowledgments

This project was supported by a Fulbright Program grant sponsored by the Bureau of Educational and Cultural Affairs of the United States Department of State and administered by the Institute of International Education.

References

Sandjai Bhulai, Peter Kampstra, Lidewij Kooiman, Ger Koole, Marijn Deurloo, and Bert Kok. 2012. Trend visualization on Twitter: What's hot and what's not? In Proceedings of the 1st International Conference on Data Analytics, pages 43–48.

Joanna Costa, Catarina Silva, Mário Antunes, and Bernardete Ribeiro. 2013. Defining semantic meta-hashtags for twitter classification. Lecture Notes in Computer Science, 7824:226–235.

Internet Live Stats. last viewed in May 2016. Twitter usage statistics. http://www.internetlivestats.com/twitter-statistics/.

Anil K. Jain and Richard C. Dubes. 1988. Algorithms for Clustering Data. Prentice-Hall.

Ryan Kelly. last viewed in May 2016. Twitter study reveals interesting results about usage - 40% is pointless babble. http://pearanalytics.com/blog/2009/twitter-study-reveals-interesting-results-40-percent-pointless-babble/.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.
Cristina Muntean, Gabriel Morar, and Darie Moldovan. 2012. Exploring the meaning behind twitter hashtags through clustering. Lecture Notes in Business Information Processing, 127:231–242.

Suzi Park and Hyopil Shin. 2014. Identification of implicit topics in twitter data not containing explicit search queries. In Proceedings of the 25th International Conference on Computational Linguistics, pages 58–68.

Kevin Dela Rosa, Rushin Shah, and Bo Lin. 2011. Topical clustering of tweets. In Proceedings of the 3rd Workshop on Social Web Search and Mining, pages 133–138, July.

Hassan Saif, Yulan He, and Harith Alani. 2012. Semantic sentiment analysis of twitter. In Proceedings of the 11th International Conference on The Semantic Web - Volume Part I, pages 508–524.

SciPy.org. last viewed in May 2016. Hierarchical clustering flat cluster parameters. http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.fcluster.html.

Giovanni Stilo and Paola Velardi. 2014. Temporal semantics: Time-varying hashtag sense clustering. In Proceedings of the 19th International Conference on Knowledge Engineering and Knowledge Management, pages 563–578, November.

Peter Teufl and Stefan Kraxberger. 2011. Extracting semantic knowledge from twitter. In Proceedings of the 3rd IFIP WG 8.5 International Conference on Electronic Participation, pages 48–59.

Oren Tsur, Adi Littman, and Ari Rappoport. 2012. Scalable multi stage clustering of tagged micro-messages. In Proceedings of the 21st International Conference on World Wide Web, pages 621–622, April.

Oren Tsur, Adi Littman, and Ari Rappoport. 2013. Efficient clustering of short messages into general domains. In Proceedings of the 7th International AAAI Conference on Weblogs and Social Media.

Carlos Vicient and Antonio Moreno. 2014. Unsupervised semantic clustering of twitter hashtags. In Proceedings of the 21st European Conference on Artificial Intelligence, pages 1119–1120, August.

Carlos Vicient. 2014. Moving Towards The Semantic Web: Enabling New Technologies through the Semantic Annotation of Social Contents. Ph.D. thesis, Universitat Rovira i Virgili, December.

Xiaolong Wang, Furu Wei, Xiaohua Liu, Ming Zhou, and Ming Zhang. 2011. Topic sentiment analysis in Twitter: a graph-based hashtag sentiment classification approach. In Proceedings of the 20th ACM Conference on Information and Knowledge Management, pages 1031–1040, October.

Zhibiao Wu and Martha Palmer. 1994. Verbs semantics and lexical selection. In Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics, pages 133–138.