Sense-Level Semantic Clustering of Hashtags in Social Media

Ali Javed and Byung Suk Lee
Department of Computer Science, University of Vermont
Burlington, Vermont 05405, U.S.A.
ajaved@uvm.edu, bslee@uvm.edu

Abstract

We enhance the accuracy of the currently available semantic hashtag clustering method, which leverages hashtag semantics extracted from dictionaries such as Wordnet and Wikipedia. While immune to the uncontrolled and often sparse usage of hashtags, the current method distinguishes hashtag semantics only at the word level. Unfortunately, a word can have multiple senses, each a candidate for the exact semantics of a hashtag, and, therefore, word-level semantic clustering fails to disambiguate the true sense-level semantics of hashtags and, as a result, may generate incorrect clusters. This paper shows how this problem can be overcome through sense-level clustering and demonstrates its impacts on clustering behavior and accuracy.

1 Introduction

Hashtags are used in major social media (e.g., Twitter, Facebook, Tumblr, Instagram, Google+, Pinterest) for various purposes – to tell jokes, follow topics, place advertisements, collect consumer feedback, etc. For instance, McDonald's created the hashtag #Mcdstories to collect consumer feedback; #OccupyWallStreet, #ShareaCoke, and #NationalFriedChickenDay are only a few examples of many successful hashtag campaigns.

Twitter was the first social media platform to introduce hashtags, and it is used as the representative social media in this paper. Most tweets contain one or more hashtags in their texts of up to 140 characters.

Clustering is commonly used as a text classification technique, and clustering of hashtags is the first step in the classification of tweets given that hashtags are used to index those tweets. Therefore, admittedly, classification of tweets benefits from accurate clustering of hashtags.

Social media is arguably the best source of timely information. On Twitter alone, for example, an average of 6,000 micro-messages are posted per second (Internet Live Stats, last viewed in May 2016). Thus, social media analysts use clusters of hashtags as the basis for more complex tasks (Muntean et al., 2012) such as retrieving relevant tweets (Muntean et al., 2012; Park and Shin, 2014), tweet ranking, sentiment analysis (Wang et al., 2011), data visualization (Bhulai et al., 2012), semantic information retrieval (Teufl and Kraxberger, 2011), and user characterization. Therefore, the accuracy of hashtag clustering is important to the quality of the resulting information in those tasks.

The popular approach to hashtag clustering has been to leverage the tweet texts accompanying hashtags (Costa et al., 2013; Teufl and Kraxberger, 2011; Tsur et al., 2012; Tsur et al., 2013; Bhulai et al., 2012; Muntean et al., 2012; Rosa et al., 2011) by identifying their "contextual" semantics (Saif et al., 2012). There are two prominent problems with this approach, however. First, a majority of hashtags are not used frequently enough to find sizable tweet texts accompanying them, thus causing a sparsity problem. Second, tweet texts are open-ended, with no control over their contents at all, and therefore often exhibit poor linguistic quality. (According to Pear Analytics, 40.1% of tweets are "pointless babble" (Kelly, last viewed in May 2016).) These problems make text-based techniques ineffective for hashtag clustering. Hence, methods that utilize other means to identify the semantics of hashtags are needed.
In this regard, the focus of this paper is on leveraging dictionary metadata to identify the semantics of hashtags. We adopt the pioneering work done by Vicient and Moreno (2014). Their approach identifies the "lexical" semantics of hashtags from external resources (e.g., Wordnet, Wikipedia) independent of the tweet messages themselves. To the best of our knowledge, their work is the only one that uses this metadata-based approach. This approach has the advantage of being immune to the sparsity and poor linguistic quality of tweet messages, and the results of their work demonstrate it.

On the other hand, their work has a major drawback, in that it makes clustering decisions at the word level while the correct decision can be made at the sense (or "concept") level. It goes without saying that the correct use of metadata is critical to the performance of any metadata-based approach, and indeed clustering hashtags based on their word-level semantics has been shown to erroneously put hashtags of different senses in the same cluster (more on this in Section 4).

In this paper, we devise a more accurate sense-level metadata-based semantic clustering algorithm. The critical area of improvement is in the construction of the similarity matrix between pairs of hashtags, which is then input to a clustering algorithm. The immediate benefits are shown in the accuracy of the resulting clusters, and we demonstrate them using a toy example. Experimental results using gold standard testing show a 26% gain in clustering accuracy in terms of the weighted average pairwise maximum f-score (Equation 5), where the weight is the size of a ground truth cluster. Despite the gain in clustering accuracy, we were able to keep the run-time and space overheads of similarity matrix construction within a constant factor (e.g., 5 to 10) through a careful implementation scheme.

The remainder of this paper is organized as follows. Section 2 provides some background knowledge. Section 3 describes the semantic hashtag clustering algorithm designed by Vicient and Moreno (2014). Section 4 discusses the proposed sense-level semantic enhancement to the clustering algorithm, and Section 5 presents its evaluation against the word-level semantic clustering. Section 6 discusses other work related to semantic hashtag clustering. Section 7 summarizes the paper and suggests future work.

2 Background

2.1 Wordnet – synset hierarchy and similarity measure

Wordnet is a free and publicly available lexical database of the English language. It groups English words into sets of synonyms called synsets. Each word in Wordnet must point to at least one synset, and each synset must point to at least one word. Hence, there is a many-to-many relationship between synsets and words (Vicient, 2014). Synsets in Wordnet are interlinked by their semantic and lexical relationships, which results in a network of meaningfully related words and concepts.

Concept       Meaning
desert.n.01   arid land with little or no vegetation
abandon.v.05  leave someone who needs or counts on you; leave in the lurch
defect.v.01   desert (a cause, a country or an army), often in order to join the opposing cause, country, or army
desert.v.03   leave behind

Table 1: Example synset for the word "desert".

Table 1 shows an example synset. The synset contains 4 different concepts, where a concept is a specific sense of a word – e.g., "desert" meaning "arid land with little or no vegetation" and "desert" meaning "to leave someone who needs or counts on you". All of these concepts are linked to each other using the semantic and lexical relationships mentioned. For example, "oasis.n.01" (meaning "a fertile tract in a desert") is a meronym of "desert.n.01", i.e., "oasis.n.01" is a part of "desert.n.01".
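For illustration, the concept listing in Table 1 can be reproduced programmatically. The following is a minimal sketch assuming NLTK's Wordnet corpus reader; our implementation is in Python, but the specific Wordnet API is not prescribed by this paper.

# A sketch of enumerating the senses behind Table 1, assuming the
# NLTK Wordnet corpus reader; any Wordnet API would work similarly.
from nltk.corpus import wordnet as wn

for synset in wn.synsets('desert'):
    print(synset.name(), '-', synset.definition())
# desert.n.01 - arid land with little or no vegetation
# abandon.v.05 - leave someone who needs or counts on you; ...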
Given this network of relationships, Wordnet is frequently used in automatic text analysis through its application program interface (API). There are different API functions that allow for the calculation of semantic similarity between synsets, and the Wu-Palmer similarity measure (Wu and Palmer, 1994) is used in this paper in order to stay consistent with the baseline algorithm by Vicient and Moreno (2014). In a lexical database like the Wordnet synset database, where concepts are organized in a hierarchical structure, the Wu-Palmer similarity between two concepts $C_1$ and $C_2$, denoted as $sim_{WP}(C_1, C_2)$, is defined as

$$sim_{WP}(C_1, C_2) = \frac{2 \cdot depth(LCS(C_1, C_2))}{depth(C_1) + depth(C_2)} \quad (1)$$

where $LCS(C_1, C_2)$ is the least common subsumer of $C_1$ and $C_2$ in the hierarchy of synsets.

This Wordnet functionality is used to calculate the semantic similarity between hashtags in this paper, that is, by grounding hashtags to specific concepts (called "semantic grounding") and calculating the similarity between the concepts.
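As a concrete illustration of Equation 1 and of the relationships above, the sketch below computes the Wu-Palmer similarity between two grounded concepts, again assuming NLTK's Wordnet interface.

# Wu-Palmer similarity (Equation 1) between two concepts, assuming
# the NLTK Wordnet interface.
from nltk.corpus import wordnet as wn

c1 = wn.synset('desert.n.01')
c2 = wn.synset('oasis.n.01')
print(c1.wup_similarity(c2))   # a value in (0, 1]
print(c1.part_meronyms())      # the meronym relationship mentioned above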
2.2 Wikipedia – auxiliary categories

Wikipedia is by far the most popular crowd-sourced encyclopedia. Not all hashtags can be grounded semantically using Wordnet because many of them are simply not legitimate terms found in Wordnet (e.g., #Honda). This situation is where Wikipedia can be used to look up those hashtags. Wikipedia provides auxiliary categories for each article. For example, when Wikipedia is queried for categories related to the page titled "Honda", it returns the following auxiliary categories.

['Automotive companies of Japan',
 'Companies based in Tokyo',
 'Boat builders',
 'Truck manufacturers',
 ...
]

Auxiliary categories can be thought of as the categories the page belongs to. In this example, if we are unable to look up the word "Honda" on Wordnet, then, through the help of these auxiliary categories, we can relate the term to Japan, Automotive, Company, etc. There are several open-source Wikipedia APIs available to achieve this purpose – for example, the Python library "wikipedia".
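For example, the auxiliary categories above can be retrieved with a few lines of code. This is a sketch using the "wikipedia" library just mentioned; the error handling shown is our assumption, not part of the original algorithm.

# Retrieving auxiliary categories with the Python "wikipedia" library.
import wikipedia

try:
    page = wikipedia.page("Honda")
    print(page.categories[:4])   # ['Automotive companies of Japan', ...]
except wikipedia.exceptions.PageError:
    pass                         # no matching Wikipedia article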
2.3 Hierarchical clustering

Hierarchical clustering is a viable approach to cluster analysis, and is particularly suitable for the purpose of hashtag clustering in this paper for a few reasons. First, the approach does not require a priori information about the number of clusters. (The number of output clusters is not known in most real applications.) Second, it is suited to the taxonomic nature of language semantics. Third, it facilitates a fair comparison with the algorithm by Vicient and Moreno (2014), which also uses hierarchical clustering.

There are two popular strategies for hierarchical clustering – bottom-up (or agglomerative) and top-down (or divisive). In the bottom-up strategy, each element starts in its own cluster and two clusters are merged to form one larger cluster as the clustering process moves up the hierarchy. In the top-down strategy, all elements start in one cluster and one cluster is split into two smaller clusters as the clustering process moves down the hierarchy. The bottom-up strategy is used in this paper because it is conceptually simpler than top-down (Manning et al., 2008).

For the bottom-up strategy, several distance measurement methods are available to provide linkage criteria for building up a hierarchy of clusters. Among them, the single-linkage method and the unweighted pair group method with arithmetic mean (UPGMA) are used most commonly, and both are used in this paper. The single-linkage method calculates the distance between two clusters $C_u$ and $C_v$ as

$$d(C_u, C_v) = \min_{u_i \in C_u \wedge v_j \in C_v} dist(u_i, v_j) \quad (2)$$

and UPGMA calculates the distance as

$$d(C_u, C_v) = \frac{\sum_{u_i \in C_u, v_j \in C_v} dist(u_i, v_j)}{|C_u| \times |C_v|} \quad (3)$$

where $|C_u|$ and $|C_v|$ denote the number of elements in clusters $C_u$ and $C_v$, respectively.

To generate output clusters, "flat clusters" are extracted from the hierarchy. There are multiple possible criteria to do that (SciPy.org, last viewed in May 2016), and in this paper we use the "distance criterion" – that is, given either of the distance measures discussed above, flat clusters are formed from the hierarchy such that the items in each cluster are no farther apart than a distance threshold.
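The sketch below shows this pipeline with SciPy, whose flat-cluster extraction is cited above; the 3-by-3 toy distance matrix is hypothetical.

# Bottom-up hierarchical clustering with the "distance criterion",
# using SciPy (see the fcluster documentation cited above). The toy
# distance matrix below is hypothetical.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

dist = np.array([[0.0, 0.2, 0.9],
                 [0.2, 0.0, 0.8],
                 [0.9, 0.8, 0.0]])      # 1's complement of similarities

Z = linkage(squareform(dist), method='single')   # 'average' for UPGMA
print(fcluster(Z, t=0.5, criterion='distance'))  # e.g., [1 1 2]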
3 Semantic Hashtag Clustering

We adopted the semantic clustering approach proposed by Vicient and Moreno (2014) specifically for hashtags. This approach uses Wordnet and Wikipedia as the metadata for identifying the lexical semantics of a hashtag. The source code of their algorithms was not available, and therefore we implemented the approach described in Vicient's PhD dissertation (Vicient, 2014) to the best of our abilities using the algorithms and descriptions provided.

There are three major steps in their semantic clustering algorithm: (a) semantic grounding, (b) similarity matrix construction, and (c) semantic clustering. Algorithm 1 summarizes the steps.

Input: list H of hashtags
Output: clusters
Stage 1 (Semantic grounding):
  Step 1: For each hashtag h ∈ H perform Step 1a.
  Step 1a: Look up h in Wordnet. If h is found, then append the synset of h to a list (LCh). Otherwise, segment h into multiple words, drop the leftmost word, and then try Step 1a again using the reduced h until either a match is found in Wordnet or no more words are left in h.
  Step 2: For each h ∈ H that has an empty list LCh, look up h in Wikipedia. If an article matching h is found in Wikipedia, acquire the list of auxiliary categories for the article, extract main nouns from the auxiliary categories, and then, for each main noun extracted, go to Step 1a using the main noun as h.
Stage 2 (Similarity matrix construction):
  Discard any hashtag h that has an empty LCh. Calculate the maximum pairwise similarity between each pair of lists LChi and LChj (i ≠ j) using any ontology-based similarity measure.
Stage 3 (Clustering):
  Perform clustering on the distance matrix (1's complement of the similarity matrix) resulting from Stage 2.

Algorithm 1: Semantic hashtag clustering (Vicient and Moreno, 2014).

In the first stage (i.e., semantic grounding), each hashtag is looked up in Wordnet. If there is a direct match, that is, the hashtag is found in Wordnet, then it is added as a single candidate synset, and, accordingly, all the concepts (or senses) (see Section 2.1) belonging to the synset are saved in the form of a list of candidate concepts related to the hashtag. We call this list LCh. If, on the other hand, the hashtag is not found in Wordnet, then the hashtag is split into multiple terms (using a word segmentation technique) and the leftmost term is dropped sequentially until either a match is found in Wordnet or there is no more term left.

Each hashtag that was not found in Wordnet in Step 1 (i.e., of which the LCh is empty) is looked up in Wikipedia. If a match is found in Wikipedia, the auxiliary categories (see Section 2.2) of the article are acquired. Main nouns from the auxiliary categories are then looked up in Wordnet, and if a match is found, we save the concepts by appending them to the list LCh; this step is repeated for each main noun.

In the second stage (i.e., similarity matrix construction), first, hashtags associated with an empty list of concepts are discarded; in other words, hashtags that did not match any Wordnet entry, either by themselves or by using the word segmentation technique, and also had no entry found in Wikipedia are discarded. Then, using the remaining hashtags (each of whose LCh contains at least one concept), the semantic similarity is calculated between each pair of them. Any ontology-based measure can be used, and the Wu-Palmer measure (see Section 2.1) has been used in our work to stay consistent with the original work by Vicient and Moreno (2014). Specifically, the similarity between two hashtags, hi and hj, is calculated as the maximum pairwise similarity (based on the Wu-Palmer measure) between one set of concepts in LChi and another set of concepts in LChj. Calculating the similarity this way is expected to find the correct sense of a hashtag (among all the senses/concepts in LCh).

Finally, in the third stage (i.e., clustering), any clustering algorithm can be used to cluster hashtags based on the similarity matrix obtained in the second stage. As mentioned earlier, in this paper we use hierarchical clustering, which was used in the original work by Vicient and Moreno (2014).
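To make Stage 1 concrete, the following is a minimal sketch of semantic grounding under stated assumptions: segment() and main_noun() are naive hypothetical placeholders standing in for the word segmentation and main-noun extraction steps, which Algorithm 1 leaves unspecified.

# A sketch of Stage 1 (semantic grounding) of Algorithm 1. The two
# helpers below are naive placeholders, not part of the original work.
from nltk.corpus import wordnet as wn
import wikipedia

def segment(term):
    # Placeholder: a real implementation would use a word segmenter
    # (e.g., to split 'friedchicken' into ['fried', 'chicken']).
    return [term]

def main_noun(category):
    # Placeholder: naively take the first word of the category name.
    return category.split()[0]

def ground_in_wordnet(term):
    """Step 1a: look up a term, dropping the leftmost word on failure."""
    words = segment(term)
    while words:
        synsets = wn.synsets('_'.join(words))
        if synsets:
            return synsets                # candidate concept list LCh
        words = words[1:]                 # drop the leftmost word, retry
    return []

def ground(hashtag):
    """Stage 1 for one hashtag (the '#' is assumed stripped)."""
    concepts = ground_in_wordnet(hashtag)
    if not concepts:                      # Step 2: Wikipedia fallback
        try:
            for category in wikipedia.page(hashtag).categories:
                concepts += ground_in_wordnet(main_noun(category))
        except wikipedia.exceptions.PageError:
            pass                          # discarded later, in Stage 2
    return concepts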
4 Sense-Level Semantic Hashtag Clustering

In this section, we describe the enhancement made to the word-level semantic hashtag clustering and showcase its positive impact using a toy example. Both Stage 1 (i.e., semantic grounding) and Stage 3 (i.e., clustering) of the sense-level semantic clustering algorithm are essentially the same as those in the word-level semantic clustering algorithm (see Algorithm 1 in Section 3). So, here, we discuss only Stage 2 (i.e., similarity matrix construction) of the algorithm, with a focus on the difference in the calculation of maximum pairwise similarity.

4.1 Similarity matrix construction

4.1.1 Word-level versus sense-level similarity matrix

As mentioned in Section 3, the similarity between two hashtags hi and hj is defined as the maximum pairwise similarity between one set of senses in LChi and another set of senses in LChj. (Recall that LCh denotes a list of senses retrieved from Wordnet to semantically ground a hashtag h.) This maximum pairwise similarity is an effective choice for disambiguating the sense of a hashtag and was used to achieve a positive effect in the word-based approach by Vicient and Moreno (2014).

However, we have observed many instances where a hashtag word has multiple senses and this introduces an error in the clustering result. That is, the word-level algorithm does not distinguish among different senses of the same word when constructing a similarity matrix and, as a result, two hashtags are misjudged to be semantically similar (because they are similar to a third hashtag in two different senses) and are included in the same cluster. Moreover, a false triangle that violates the triangle inequality property may be formed at the word level. (Note that this property is required of any distance metric like Wu-Palmer.) See Figure 1 for an illustration. As a side effect, we have observed that a cluster tends to be formed centered around a hashtag that takes on multiple senses.

[Figure 1: An illustration of clustering at the word level versus the sense level. Panels: (a) sense level; (b) word level. Edge weights denote similarity values (= 1 − distance). Assume the minimum similarity threshold is 0.5. Then, at the sense level (a), two clusters ({H1, H2} and {H1, H3}) should be formed because H2 and H3 are not similar (note 0.1 < 0.5), but, at the word level (b), one cluster {H1, H2, H3} is formed because it appears as if H2 and H3 were similar via H1. Moreover, the false triangle that appears to be formed at the word level violates the triangle inequality because dist(H1, H2) + dist(H1, H3) < dist(H2, H3).]

Thus, we chose to explicitly record the sense in which a hashtag is close to another hashtag when constructing the similarity matrix. This sense-level handling of hashtag semantic distance ensures that the incorrect clustering problem of word-level clustering does not happen. Accordingly, it avoids the formation of clusters that are centered around a hashtag that has multiple senses.

4.1.2 Word-level similarity matrix construction

Algorithm 2 outlines the steps of calculating the maximum pairwise similarity between hashtags in the word-level algorithm. One maximum pairwise similarity value is calculated for each pair of hashtags semantically grounded in the previous stage (i.e., Stage 1) and is entered into the similarity matrix. The similarity matrix size is |H|^2, where H is the set of hashtags that have at least one sense (i.e., a nonempty LCh). Note that the pairwise similarity comparison is still done at the sense level, considering all senses of the hashtags that are compared. (In the listing below, the initialization of maxSim has been moved inside the pair loop, correcting an apparent oversight in the original listing.)

Input: set H of hashtags h with nonempty LCh.
Output: pairwise hashtag similarity matrix.
1  Initialize an empty similarity matrix M[|H|, |H|].
2  for each pair (hi, hj) of hashtags in H do
3    // Calculate the maximum pairwise similarity between hi and hj.
4    Initialize maxSim to 0.
5    for each sp ∈ LChi do
6      for each sq ∈ LChj do
7        Calculate the similarity sim between sp and sq.
8        if sim > maxSim then
9          Update maxSim to sim.
10       end
11     end
12   end
13   Enter maxSim into M[i, j].
14 end

Algorithm 2: Word-level construction of the semantic similarity matrix.
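Under the assumption that each hashtag's LCh is available as a list of NLTK synsets (e.g., from the grounding sketch in Section 3), Algorithm 2 can be sketched in a few lines; the function and variable names are ours.

# A sketch of Algorithm 2: word-level similarity matrix construction.
# `grounded` maps each hashtag to its nonempty concept list LCh.
import numpy as np

def word_level_matrix(grounded):
    tags = list(grounded)
    M = np.zeros((len(tags), len(tags)))
    for i, hi in enumerate(tags):
        for j in range(i + 1, len(tags)):
            # Maximum pairwise (Wu-Palmer) similarity over all sense
            # pairs; wup_similarity() returns None for incomparable
            # parts of speech, which we treat as 0.
            max_sim = max((sp.wup_similarity(sq) or 0.0)
                          for sp in grounded[hi]
                          for sq in grounded[tags[j]])
            M[i, j] = M[j, i] = max_sim
    return tags, M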
4.1.3 Sense-level similarity matrix construction

Algorithm 3 outlines the steps of constructing the similarity matrix in the sense-level algorithm. Unlike in the word-level algorithm, entries in the similarity matrix are between the senses that make the maximum similarity pairs between pairs of hashtags. Since these senses are not known until the maximum pairwise similarity calculations are completed, the construction of the similarity matrix is deferred until then. In the first phase (Lines 2-16), for each pair of hashtags, the algorithm saves the pair of senses (hi.sp, hj.sq) in the maximum similarity pair and the maximum similarity value in the list LHs. Then, in the second phase (Lines 18-22), for each triplet element (hi.sp, hj.sq, maxSim) in LHs, the algorithm enters the maximum similarity value maxSim at the matrix index corresponding to the pair of senses (hi.sp, hj.sq).

Input: set H of hashtags h with nonempty LCh.
Output: pairwise hashtag similarity matrix.
1  Create an empty list LHs of (hashtag sense pair, pairwise maximum similarity).
2  for each pair (hi, hj) of hashtags in H do
3    // Calculate the maximum pairwise similarity between hi and hj.
4    Initialize maxSim to 0.
5    Initialize maxSimPair to (null, null).
6    for each sp ∈ LChi do
7      for each sq ∈ LChj do
8        Calculate the similarity sim between sp and sq.
9        if sim > maxSim then
10         Update maxSim to sim.
11         Update maxSimPair to (hi.sp, hj.sq).
12       end
13     end
14   end
15   Add (maxSimPair, maxSim) to LHs.
16 end
17 // Construct the similarity matrix.
18 Count the number |Ŝ| of distinct hashtag senses in LHs.
19 Initialize a similarity matrix M[|Ŝ|, |Ŝ|] as a 0 matrix.
20 for each triplet (hi.sp, hj.sq, maxSim) in LHs do
21   Update M[m, n] to maxSim, where (m, n) is the matrix index for (hi.sp, hj.sq).
22 end

Algorithm 3: Sense-level construction of the semantic similarity matrix.

This two-phase construction of the similarity matrix brings two advantages. First, it enables the algorithm to use exactly the needed number of matrix entries for those senses that are distinct among all senses that constitute pairwise maximum similarities between hashtags. The size of the matrix, therefore, is |Ŝ|^2, where Ŝ is the set of distinct senses in LHs (see Lines 18-19). Second, it enables the algorithm to add exactly the needed number of entries, that is, |H|^2 entries (i.e., one for each pair of hashtags (see Lines 20-22)), into a matrix of size |Ŝ|^2, where |Ŝ|^2 > |H|^2. (The remaining entries are initialized to 0 and remain 0, as they are for pairs of senses that do not represent a maximum similarity pair between any hashtags.) Our observation is that the ratio |Ŝ|/|H| is limited to between 5 and 10 for most individual hashtags, which is consistent with Vicient's statement (Vicient, 2014) that, out of 903 semantically grounded hashtags, almost 100 of them have only 2 senses and very few have more than 5 senses.

Since what is clustered are hashtags, although their similarities are measured at the sense level, a number of interesting points hold. First, we do not need to add similarities between all pairs of senses to the similarity matrix. Second, a hashtag may appear in multiple clusters, where each cluster is formed based on distinct senses of the hashtag, and therefore the resulting clusters are overlapping.
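The two phases of Algorithm 3 can be sketched as follows, again assuming NLTK synsets; identifying a sense by the pair (hashtag, synset name) is our choice of representation.

# A sketch of Algorithm 3: sense-level similarity matrix construction.
import numpy as np

def sense_level_matrix(grounded):
    # Phase 1 (Lines 2-16): record, per hashtag pair, the sense pair
    # attaining the maximum similarity (the list LHs of the paper).
    tags = list(grounded)
    LHs = []
    for i, hi in enumerate(tags):
        for hj in tags[i + 1:]:
            max_sim, pair = 0.0, None
            for sp in grounded[hi]:
                for sq in grounded[hj]:
                    sim = sp.wup_similarity(sq) or 0.0
                    if sim > max_sim:
                        max_sim = sim
                        pair = ((hi, sp.name()), (hj, sq.name()))
            if pair is not None:
                LHs.append((pair, max_sim))

    # Phase 2 (Lines 18-22): index the distinct hashtag senses and
    # enter only the maximum-similarity values; the rest stay 0.
    senses = sorted({s for pair, _ in LHs for s in pair})
    index = {s: k for k, s in enumerate(senses)}
    M = np.zeros((len(senses), len(senses)))
    for (p, q), max_sim in LHs:
        M[index[p], index[q]] = M[index[q], index[p]] = max_sim
    return senses, M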
4.2 A toy example

To demonstrate the merit of clustering at the sense level as opposed to the word level, we made a toy set of hashtags and ran the metadata-based semantic clustering algorithm at both the word level and the sense level. The hashtags used are #date, #august, #tree, and #fruit. From Wordnet, we found that there were 3 senses associated with the word august, 13 senses with date, 5 senses with fruit, and 7 senses with tree.

Using the Wu-Palmer similarity measure (explained in Section 2.1) at the word level, we obtained the distance matrix shown below.

Hashtag  august  date   fruit  tree
august   0.000   0.200  0.500  0.667
date     0.200   0.000  0.100  0.400
fruit    0.500   0.100  0.000  0.556
tree     0.667   0.400  0.556  0.000

Then, to perform clustering using the word-level distance matrix as the input, we used both single-linkage and UPGMA (see Section 2.3) as the measures to calculate the distance between newly formed clusters, and the distance threshold for extracting flat clusters from the hierarchical clusters was set to 0.5.

Table 2 shows the clusters obtained using word-level clustering. We see that #august, #date, and #fruit are included in the same cluster for both distance measures. This example demonstrates a case in which #date takes on multiple sense identities and glues together #august and #fruit in the same cluster at the word level, although these two are not similar at the sense level, as shown next.

Hashtag  Cluster using single-linkage  Cluster using UPGMA
august   1                             1
date     1                             1
fruit    1                             1
tree     1                             2

Table 2: Cluster assignment at the word level.

Now, using sense-level clustering, out of a total of 28 senses associated with the four hashtags, the algorithm picked the 10 senses shown in Table 3. These 10 senses were picked as a result of the maximum pairwise similarity calculations between the two sets of senses belonging to each pair of hashtags. (With 4 hashtags, there is a maximum of 12 senses that can be obtained for the 6 (= C(4, 2)) maximum similarity pairs, and in this example case, there were duplicate senses, consequently giving 10 distinct senses.) As mentioned earlier, each of these senses represents the semantics of the hashtag word it belongs to, and thus makes an entry into the similarity (or distance) matrix input to the hierarchical clustering algorithm.

Sense        Semantics
august.n.01  the month following July and preceding September
august.a.01  of or befitting a lord
corner.v.02  force a person or animal into a position from which he cannot escape
date.n.02    a participant in a date
date.n.06    the particular day, month, or year (usually according to the Gregorian calendar) that an event occurred
date.n.08    sweet edible fruit of the date palm with a single long woody seed
fruit.n.01   the ripened reproductive body of a seed plant
fruit.v.01   cause to bear fruit
tree.n.01    a tall perennial woody plant having a main trunk and branches forming a distinct elevated crown; includes both gymnosperms and angiosperms
yield.n.03   an amount of product

('n' stands for noun, 'v' for verb, and 'a' for adjective.)

Table 3: Senses and their semantics (source: Wordnet).

The distance matrix obtained from the 10 senses is shown in Figure 2. The off-diagonal entries smaller than 1.000 are the maximum similarity values entered (shown in bold face in the original figure). Note that a distance of 1.000 means a similarity of 0.000.

Hashtag sense (hashtag)  aug.n.01  aug.a.01  cor.v.02  date.n.02  date.n.06  date.n.08  fru.n.01  fru.v.01  tree.n.01  yie.n.03
august.n.01 (august)     0.000     1.000     1.000     1.000      0.200      1.000      1.000     1.000     1.000      1.000
august.a.01 (august)     1.000     0.000     0.667     1.000      1.000      1.000      1.000     0.500     1.000      1.000
corner.v.02 (tree)       1.000     0.667     0.000     1.000      1.000      1.000      1.000     1.000     1.000      1.000
date.n.02 (date)         1.000     1.000     1.000     0.000      1.000      1.000      1.000     1.000     0.400      1.000
date.n.06 (date)         0.200     1.000     1.000     1.000      0.000      1.000      1.000     1.000     1.000      1.000
date.n.08 (date)         1.000     1.000     1.000     1.000      1.000      0.000      0.100     1.000     1.000      1.000
fruit.n.01 (fruit)       1.000     1.000     1.000     1.000      1.000      0.100      0.000     1.000     1.000      1.000
fruit.v.01 (fruit)       1.000     0.500     1.000     1.000      1.000      1.000      1.000     0.000     1.000      1.000
tree.n.01 (tree)         1.000     1.000     1.000     0.400      1.000      1.000      1.000     1.000     0.000      0.556
yield.n.03 (fruit)       1.000     1.000     1.000     1.000      1.000      1.000      1.000     1.000     0.556      0.000

Figure 2: Distance matrix in the toy example.

Table 4 shows the resulting cluster assignments. (The outcome is the same for both distance measures, which we believe is coincidental.) We see that #august and #date are together in the same cluster, and so are #date and #fruit, but, unlike in the word-level clustering result, the three of #august, #date, and #fruit are not all together in the same cluster. This separation is because, at the sense level, #date can no longer take on multiple identities as it did at the word level.

Hashtag  Hashtag sense  Cluster using single-linkage  Cluster using UPGMA
date     date.n.02      1                             1
tree     tree.n.01      1                             1
fruit    yield.n.03     2                             2
fruit    fruit.v.01     3                             3
august   august.a.01    3                             3
tree     corner.v.02    4                             4
fruit    fruit.n.01     5                             5
date     date.n.08      5                             5
august   august.n.01    6                             6
date     date.n.06      6                             6

Table 4: Cluster assignment at the sense level.

5 Evaluation

In this evaluation, all algorithms were implemented in Python and the experiments were performed on a computer with the OS X operating system, a 2.6 GHz Intel Core i5 processor, and 8 GB of 1600 MHz DDR3 memory.

5.1 Experiment setup

5.1.1 Performance metric

Evaluating clustering output is known to be a "black art" (Jain and Dubes, 1988) with no objective accuracy criterion. It is particularly challenging for the semantic hashtag clustering addressed in this paper. Sense-level clustering generates more items to be clustered than word-level clustering, and the output clusters are overlapping. Therefore, internal measures (e.g., Silhouette coefficient, SSE) are not desirable because they simply consider the cohesion and separation among the output clusters without regard to the semantic accuracy of the items clustered. For this reason, our evaluation is done with an external "standard" (i.e., a gold standard test). To this end, we use the f-score, which is commonly used in conjunction with recall and precision to evaluate clusters in reference to ground truth clusters, as the accuracy metric.

In our evaluation, the f-score is calculated for each pair of a cluster in the ground truth cluster set and a cluster in the evaluated algorithm's output cluster set. Then, the final f-score resulting from the comparison of the two cluster sets is obtained in two different ways, depending on the purpose of the evaluation. For the purpose of evaluating individual output clusters, the pairwise maximum (i.e., "best match") f-score, denoted as the fm-score, is used as the final score. Given a ground truth cluster $G_i$ matched against an output cluster set $C$, the fm-score is obtained as

$$f_m\text{-}score(G_i, C) = \max_{C_j \in C \,\wedge\, f\text{-}score(G_i, C_j) > 0} f\text{-}score(G_i, C_j) \quad (4)$$

where the pairwise matching is one-to-one between $G$ and $C$.

On the other hand, for comparing the overall accuracy of the entire set of clusters, the weighted average of the pairwise maximum f-scores, denoted as the fa-score, is used instead. Given a ground truth cluster set $G$ and an output cluster set $C$, the fa-score is calculated as

$$f_a\text{-}score(G, C) = \frac{\sum_{G_i \in G} \left( f_m\text{-}score(G_i, C) \times |G_i| \right)}{\sum_{G_i \in G} |G_i|} \quad (5)$$
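For reference, Equations 4 and 5 amount to the following computation over clusters represented as sets of hashtags (a sketch that, for brevity, omits the one-to-one matching constraint mentioned above):

# A sketch of the fm-score (Equation 4) and fa-score (Equation 5).
# Clusters are represented as Python sets of hashtags.
def f_score(g, c):
    overlap = len(g & c)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(c), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def fm_score(g_i, output_clusters):                 # Equation 4
    return max((f_score(g_i, c) for c in output_clusters), default=0.0)

def fa_score(ground_truth, output_clusters):        # Equation 5
    total = sum(len(g) for g in ground_truth)
    return sum(fm_score(g, output_clusters) * len(g)
               for g in ground_truth) / total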
5.1.2 Dataset

With the focus of the evaluation on comparing the sense-level and word-level versions of the same clustering algorithm, deliberate choices were made in the selection of the datasets and the number of hashtags used in the experiments so that they match those used in the evaluation of word-level clustering by Vicient and Moreno (2014). They used tweet messages from the Symplur website, and so did we.

We manually gathered a tweet dataset from the Symplur web site (http://www.symplur.com). The dataset consists of 1,010 unique hashtags that are included in 2,910 tweets. The median number of tweets per hashtag was only two. (The distribution of the number of tweet messages per hashtag generally follows the power law (Muntean et al., 2012).)

5.2 Experiment: gold standard test

In order to enable the gold standard test, we prepared a ground truth based on observed hashtag semantics. Out of the 1,010 hashtags, we manually annotated the semantics to choose 230 hashtags and classified them into 15 clusters. The remaining hashtags were classified as noise. Figure 3 shows the sizes of the 15 ground truth clusters.

[Figure 3: Sizes of ground truth clusters.]

The distance threshold for determining flat clusters in hierarchical clustering was set using the "best result" approach. That is, we tried both distance measures (i.e., single-linkage and UPGMA) and different distance threshold values, and picked the measure and value that produced the best result based on the weighted average f-score measure.

Figure 4 shows the accuracies achieved by the metadata-based semantic clustering at the word level and the sense level. Table 5 shows more details, including precision and recall for individual clusters. From the results we see that every sense-level cluster outperforms its word-level counterpart (except cluster 1, due to a rounding-off difference). Notably, the fm-scores are zero for word-level clusters 6, 14, and 15, thus bringing the performance gain to "infinity". (Word-level clustering did not generate any cluster of size 3 or larger with a best-match f-score to clusters 6, 14, and 15 greater than 0.1.) Further, when all 15 clusters are considered together, the weighted average of the maximum pairwise f-scores, the fa-score, is 0.43 for sense-level clustering and 0.34 for word-level clustering – a 26% gain.

[Figure 4: Maximum pairwise f-scores of output clusters from word-level and sense-level semantic clustering.]

     Ground truth    Sense-level clusters                 Word-level clusters
Id   Size            Recall  Precision  fm-score  Size    Recall  Precision  fm-score  Size
1    32              0.63    0.65       0.63      31      0.63    0.67       0.65      30
2    26              0.35    0.39       0.37      23      0.31    0.35       0.33      23
3    23              0.39    0.43       0.41      21      0.35    0.19       0.24      43
4    23              0.91    0.84       0.88      25      0.83    0.76       0.79      25
5    22              0.41    0.45       0.43      20      0.41    0.20       0.27      44
6    14              0.21    0.18       0.19      17      n/a     n/a        n/a       n/a
7    14              0.64    0.50       0.56      18      0.64    0.50       0.56      18
8    12              0.25    0.43       0.32      7       0.50    0.24       0.32      25
9    11              0.82    0.39       0.53      23      0.82    0.08       0.14      118
10   11              0.18    0.11       0.14      18      0.09    0.17       0.12      6
11   10              0.40    0.27       0.32      15      0.50    0.08       0.14      59
12   9               0.11    0.25       0.15      4       0.11    0.17       0.13      6
13   9               0.22    0.33       0.27      6       0.22    0.29       0.25      7
14   8               0.13    0.25       0.17      4       n/a     n/a        n/a       n/a
15   6               0.17    0.20       0.18      5       n/a     n/a        n/a       n/a

The fa-score (weighted average of fm-scores) is 0.43 for the sense-level clusters and 0.34 for the word-level clusters.

Table 5: Details of gold standard test results.

6 Related Work

There are several works on semantic clustering of hashtags that focused on the contextual semantics of hashtags (Tsur et al., 2012; Tsur et al., 2013; Muntean et al., 2012; Rosa et al., 2011; Stilo and Velardi, 2014) by using the bag-of-words model to represent the texts accompanying a hashtag.
Tsur et al. (2012; 2013) and Muntean et al. (2012) appended the tweets that belonged to each unique hashtag into a unique document called a "virtual document". These documents were then represented as vectors in the vector space model. Rosa et al. (2011) used hashtag clusters to achieve topical clustering of tweets, where they compared the effects of expanding URLs found in tweets. Stilo and Velardi (2014) clustered hashtag "senses" based on their temporal co-occurrence with other hashtags. The term "sense" in their work is different from the lexical sense used in this paper.

Lacking the ability to form lexical semantic sense-level clusters of hashtags has been a major shortcoming of the current approaches. To the best of our knowledge, the work by Vicient and Moreno (2014) is the only one that opened research in this direction. They used Wordnet and Wikipedia as the metadata sources for clustering hashtags at the word level.

7 Conclusion

In this paper, we enhanced the current metadata-based semantic hashtag clustering algorithm by determining the semantic similarity between hashtags at the sense level as opposed to the word level. This sense-level decision on clustering avoids incorrectly putting hashtags of different senses in the same cluster. The result was significantly higher accuracy of semantic clusters without increasing the complexity of the algorithm in practice. A gold standard test showed that the sense-level algorithm produced significantly more accurate clusters than the word-level algorithm, with an overall gain of 26% in the weighted average of the maximum pairwise f-scores.

For future work, new metadata sources can be added to give the metadata-based semantic hashtag clustering algorithm more abilities. For example, to understand hashtags of a different language, online translation services like Google Translate (https://translate.google.com) can be a good source, since empirical evidence suggests that they can be very effective in identifying spelling errors, abbreviations, etc. Additionally, crowdsourced websites like Urban Dictionary (www.urbandictionary.com), which specializes in informal human communication, can be a helpful metadata source for decoding the lexical semantics of hashtags. Internet search engines also provide rich information on the semantics of hashtags.

Acknowledgments

This project was supported by a Fulbright Program grant sponsored by the Bureau of Educational and Cultural Affairs of the United States Department of State and administered by the Institute of International Education.

References

Sandjai Bhulai, Peter Kampstra, Lidewij Kooiman, Ger Koole, Marijn Deurloo, and Bert Kok. 2012. Trend visualization on Twitter: What's hot and what's not? In Proceedings of the 1st International Conference on Data Analytics, pages 43–48.

Joanna Costa, Catarina Silva, Mário Antunes, and Bernardete Ribeiro. 2013. Defining semantic meta-hashtags for twitter classification. Lecture Notes in Computer Science, 7824:226–235.

Internet Live Stats. last viewed in May 2016. Twitter usage statistics. http://www.internetlivestats.com/twitter-statistics/.

Anil K. Jain and Richard C. Dubes. 1988. Algorithms for Clustering Data. Prentice-Hall.

Ryan Kelly. last viewed in May 2016. Twitter study reveals interesting results about usage - 40% is pointless babble. http://pearanalytics.com/blog/2009/twitter-study-reveals-interesting-results-40-percent-pointless-babble/.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.
Cristina Muntean, Gabriel Morar, and Darie Moldovan. 2012. Exploring the meaning behind twitter hashtags through clustering. Lecture Notes in Business Information Processing, 127:231–242.

Suzi Park and Hyopil Shin. 2014. Identification of implicit topics in twitter data not containing explicit search queries. In Proceedings of the 25th International Conference on Computational Linguistics, pages 58–68.

Kevin Dela Rosa, Rushin Shah, and Bo Lin. 2011. Topical clustering of tweets. In Proceedings of the 3rd Workshop on Social Web Search and Mining, pages 133–138, July.

Hassan Saif, Yulan He, and Harith Alani. 2012. Semantic sentiment analysis of twitter. In Proceedings of the 11th International Conference on The Semantic Web - Volume Part I, pages 508–524.

SciPy.org. last viewed in May 2016. Hierarchical clustering flat cluster parameters. http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.fcluster.html.

Giovanni Stilo and Paola Velardi. 2014. Temporal semantics: Time-varying hashtag sense clustering. In Proceedings of the 19th International Conference on Knowledge Engineering and Knowledge Management, pages 563–578, November.

Peter Teufl and Stefan Kraxberger. 2011. Extracting semantic knowledge from twitter. In Proceedings of the 3rd IFIP WG 8.5 International Conference on Electronic Participation, pages 48–59.

Oren Tsur, Adi Littman, and Ari Rappoport. 2012. Scalable multi stage clustering of tagged micro-messages. In Proceedings of the 21st International Conference on World Wide Web, pages 621–622, April.

Oren Tsur, Adi Littman, and Ari Rappoport. 2013. Efficient clustering of short messages into general domains. In Proceedings of the 7th International AAAI Conference on Weblogs and Social Media.

Carlos Vicient and Antonio Moreno. 2014. Unsupervised semantic clustering of twitter hashtags. In Proceedings of the 21st European Conference on Artificial Intelligence, pages 1119–1120, August.

Carlos Vicient. 2014. Moving Towards The Semantic Web: Enabling New Technologies through the Semantic Annotation of Social Contents. Ph.D. thesis, Universitat Rovira i Virgili, December.

Xiaolong Wang, Furu Wei, Xiaohua Liu, Ming Zhou, and Ming Zhang. 2011. Topic sentiment analysis in Twitter: a graph-based hashtag sentiment classification approach. In Proceedings of the 20th ACM Conference on Information and Knowledge Management, pages 1031–1040, October.

Zhibiao Wu and Martha Palmer. 1994. Verbs semantics and lexical selection. In Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics, pages 133–138.