ClaimFinder: A Framework for Identifying Claims in Microblogs

Wee Yong Lim, Mong Li Lee, Wynne Hsu
Department of Computer Science, National University of Singapore
a0109697@u.nus, {leeml,whsu}@comp.nus.edu.sg

ABSTRACT
Twitter is a microblogging platform that allows users to post short public messages. Posts shared by users pertaining to real-world events or themes can provide a rich "on-the-ground" live update of the events for the benefit of everyone. Unfortunately, the posted information may not all be credible, and rumours can spread over this platform. Existing credibility assessment work has focused on identifying features for discriminating the credibility of messages at the tweet level. However, it does not handle tweets that contain multiple pieces of information, each of which may have a different level of credibility. In this work, we introduce the notion of a claim based on subject and predicate terms, and propose a framework to identify claims from a corpus of tweets related to some major event or theme. Specifically, we draw upon work done in open information extraction to extract from tweets tuples that comprise subjects and their predicates. Then we cluster these tuples to identify claims such that each claim refers to only one aspect of the event. Tweets corresponding to the tuples in each cluster serve as evidence supporting the subsequent credibility assessment task. Extensive experiments on two real-world datasets show the effectiveness of the proposed approach in identifying claims.

Copyright (c) 2016 held by author(s)/owner(s); copying permitted only for private and academic purposes. Published as part of the proceedings of the 6th Workshop on Making Sense of Microposts (#Microposts2016, @WWW2016), available online as CEUR Vol-1691 (http://ceur-ws.org/Vol-1691). #Microposts2016, Apr 11th, 2016, Montréal, Canada. ACM ISBN 978-1-4503-2138-9. DOI: 10.1145/1235

1. INTRODUCTION
Communications over the web have increasingly become user-driven, with multiple platforms for users to post messages that can be seen by the general public. Unfortunately, unlike traditional news media, there is little or no mechanism to ensure the credibility of the posted messages. Take the popular microblogging platform Twitter as an example, where users can freely post or re-post any short messages, known as tweets, from their mobile accounts. Such a platform allows for the fast dissemination of first-hand and repeated information. When a major event occurs, many tweets are generated or re-tweeted containing messages that may be true, false or speculative.

In fact, our observation of collected tweets related to major events indicates that a majority of tweets were forwarded (re-tweeted) by multiple users with little or no change to the content of the message. Given these minimal changes, the primary motivation of these users stems from their desire to disseminate the information in the tweet. Such dissemination would indeed serve a social utility if the information is true, but would be detrimental if the information is false or even speculative.

Research in information credibility has been gaining momentum in recent years [4, 5, 18, 10]. Figure 1 shows the steps involved in a credibility assessment framework. Collecting a set of tweets related to a major event can be done manually using keywords relevant to natural disaster, terrorist or shooting incident events [10], or automatically via some event detection method, e.g. TwitterMonitor [12]. These tweets are then analyzed to identify topics for subsequent credibility classification [4, 5, 18]. Features used to help identify suspicious tweets include sentiment [15], location [22] and message propagation characteristics [14], amongst others.

Figure 1: Credibility assessment framework involving tweet collection, claims identification and classification.

Methods to find topics in a corpus of tweets can be broadly divided into feature-based and topic modeling based approaches. The former extracts features such as keywords from each tweet and clusters the tweets based on these features [2].
Each cluster of tweets defines a topic. For topic modeling based approaches, a topic is represented by a word distribution. The work in [23] observes that "a single tweet is usually about a single topic" and designs a TwitterLDA model where the words in a tweet are chosen either from a topic or from background noise words.

We observe that tweets typically contain multiple claims, and argue that current approaches which cluster tweets based on topics are too coarse-grained to identify all the claims in the tweets. Take for example the following tweet on the Nashville flood:

"Middle TN (Nashville) has been hit by a terrible flood. Text 90999 to make $10 donation to the REDCROSS disaster relief. #nashvilleflood"

This tweet has two claims: (1) Nashville has been hit by a flood, and (2) one can make a $10 donation by texting to 90999. It is important to identify both claims for subsequent credibility assessment. This is because while the first claim is likely to be true, the second claim appears highly suspicious. Existing credibility assessment work that utilizes tweet-level features will only give a single credibility score to this tweet, and does not differentiate between the two claims.

In this work, we formalize the concept of a "claim" in a corpus of tweets related to some major event. Our goal is to design a framework to identify the set of claims such that each claim refers to only one aspect of the event. Subsequently, the credibility of these claims can be verified against official sources. Note that the credibility assessment task itself is beyond the scope of this work.

We draw upon work done in the field of Open Information Extraction (IE) to extract entities in the tweets and the relationships between these entities. Then we construct <subject, predicate> tuples from these entities/relationships. Finally, we cluster the tuples to form claims. The tweets that correspond to the cluster of tuples can be regarded as evidence supporting any subsequent credibility classification task. Extensive experiments on two real-world datasets of tweets demonstrate the effectiveness of our proposed approach in identifying meaningful claims.

The paper is organized as follows. Section 2 defines the problem. Section 3 describes the proposed approach, and Section 4 gives an incremental method to identify claims. We present experiment results in Section 5, followed by related work in Section 6, and conclude in Section 7.

2. PROBLEM DEFINITION
The objective of this work is to identify claims by grouping the tweets related to some major event such that the tweets in each group refer to the same claim, which can be true, false, speculative, conversational or simply spam in nature. We introduce the concept of a claim as follows:

Definition 1. A claim is the assertion of a subject and the corresponding predicate expression for the subject. It has the structure (S, P), where S is the set of words that refer to the same subject, and P is the set of words that express the same predicate on S.

The set of words that refer to the same subject/predicate is very much context dependent. For example, in a corpus of tweets on the missing flight MH370 incident, the words "plane" and "MH370 aircraft" are likely to reflect the same subject, whilst this may not be true in other contexts involving multiple planes, such as news reports on manoeuvres between military planes¹. Here, we assume that the major event provides the context for the claims, and we want to identify the claims within the event.

¹ http://edition.cnn.com/2014/08/22/world/asia/us-china-air-encounter/

Since we do not assume that a tweet contains only one claim, we use an Open Information Extraction (OpenIE) tool [6] to extract from each tweet zero or more triples of the form (E1, R, E2), where E1 and E2 are each a set of words referring to real-world entities, while R is a set of words describing the relationship between the entities E1 and E2. Each triple is mapped to a subject-predicate tuple <S, P> that has a structure similar to a claim, where S = E1 ∪ E2 and P = R. Thus, a tweet is associated with a set of subject-predicate tuples {t1, t2, ...}.

Problem Statement. Let D be a corpus of tweets related to a major event, where the ith tweet in D is mapped to a set of tuples {ti1, ti2, ...}, 1 ≤ i ≤ |D|. Let T be the set of subject-predicate tuples obtained from all the tweets in D. The goal is to obtain a partitioning C of the tuples in T such that C identifies the most number of claims in D.

By partitioning the tuples, we obtain a soft clustering of the corresponding tweets, since a tweet can contain more than one claim. The tweets that correspond to the tuples in each cluster provide evidence for the credibility assessment of the claim.

Example. To provide an intuition of the tuple clustering and claim identification process, Table 1 shows the OpenIE triples and the subject-predicate tuples obtained for 3 tweets. To simplify discussion, let us cluster these tuples based on the similarity of their subject words. For each cluster, we construct a claim by taking the union of the words in S and P respectively. Table 2 shows the clusters obtained and the corresponding claims. Note that our approach identifies the multiple claims contained in the tweets. For example, tweet 1 has two claims (c1 and c2), tweet 2 has two claims (c2 and c3), while tweet 3 has three claims (c3, c4 and c5).

Tweet 1: "MAS CEO confirms SAR ops and says airline is working to verify speculation that the mh370 may have landed in Nanning."
  OpenIE triples: (mas ceo, confirm, sar ops); (mh370, land, nanning)
  Tuples: <{mas,ceo,sar,ops}, {confirm}>; <{mh370,nanning}, {land}>

Tweet 2: "MH370 landing safely in Nanming is pure speculation. No distress signal or call was received at all"
  OpenIE triples: (mh370, land, nanming); (distress signal call, receive)
  Tuples: <{mh370,nanming}, {land}>; <{distress,signal,call}, {receive}>

Tweet 3: "So you want me to believe that mh370 has crashed in water, Aussies found debris but still no signals captured"
  OpenIE triples: (mh370, crash, water); (aussie, found, debris); (signal, capture)
  Tuples: <{mh370,water}, {crash}>; <{aussie,debris}, {found}>; <{signal}, {capture}>

Table 1: Subject-predicate tuples obtained from sample tweets.

We will elaborate on our approach to identify claims in the next section.

3. CLAIMS IDENTIFICATION
Different from the past tweet clustering work reviewed in Section 6, this work focuses on claim identification by clustering tuples mapped from OpenIE extractions of the tweets. We propose a 3-step ClaimFinder method (see Algorithm 1) which comprises:

1. Preprocessing. We preprocess each tweet to remove known noise and tokenize the sentences prior to applying the OpenIE process.

2. Subject-predicate tuple extraction. We use the state-of-the-art OpenIE technique ClausIE [6] to extract basic semantic units of information from the content of each tweet. Each extraction is mapped to a subject-predicate tuple <S, P>.

3. Clustering subject-predicate tuples. We define a similarity measure to compute the distance between the tuples. Then we can utilize methods such as agglomerative or spectral clustering [16] to cluster the tuples. Each cluster of tuples forms a claim.
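The triple-to-tuple mapping (S = E1 ∪ E2, P = R) and the claim construction by union can be sketched in a few lines. This is a minimal illustration over the tuples of Table 1; the function names and the hard-coded cluster are our own, not code from the paper.

```python
# Sketch of the triple-to-tuple mapping and claim construction described
# above. Function names and the hard-coded cluster are illustrative only.

def to_tuple(e1, r, e2):
    """Map an OpenIE triple (E1, R, E2) to a tuple <S, P> with
    S = E1 union E2 and P = R."""
    return (frozenset(e1) | frozenset(e2), frozenset(r))

def claim_from_cluster(cluster):
    """Form a claim (S, P) as the union of the tuples' S and P sets."""
    S = frozenset().union(*(s for s, _ in cluster))
    P = frozenset().union(*(p for _, p in cluster))
    return (S, P)

# The two "land" tuples of Table 1, which form cluster c2 in Table 2:
t1 = to_tuple({"mh370"}, {"land"}, {"nanning"})
t2 = to_tuple({"mh370"}, {"land"}, {"nanming"})
S, P = claim_from_cluster([t1, t2])
print(sorted(S), sorted(P))  # ['mh370', 'nanming', 'nanning'] ['land']
```

Note that the union deliberately discards the subject/object distinction of the triple, for the interchangeability reason discussed in Section 5.3.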
c1: { <{mas,ceo,sar,ops}, {confirm}> }
    Claim: ({mas,ceo,sar,ops}, {confirm}), i.e. MAS CEO confirms SAR ops
c2: { <{mh370,nanning}, {land}>, <{mh370,nanming}, {land}> }
    Claim: ({mh370,nanning,nanming}, {land}), i.e. MH370 has landed in Nanning/Nanming
c3: { <{distress,signal,call}, {receive}>, <{signal}, {capture}> }
    Claim: ({distress,signal,call}, {receive,capture}), i.e. Signal received/captured
c4: { <{mh370,water}, {crash}> }
    Claim: ({mh370,water}, {crash}), i.e. MH370 crashed in water
c5: { <{aussie,debris}, {found}> }
    Claim: ({aussie,debris}, {found}), i.e. Australia found debris

Table 2: Claims obtained by clustering the tuples in Table 1.

Algorithm 1 ClaimFinder
Input: corpus D of tweets; number of clusters N
Output: set C of clusters of tuples
 1: T = ∅                          // initialise set of tuples
 2: for twt ∈ D do
 3:   F = OpenIE(Preprocess(twt))
 4:   for triple (E1, R, E2) ∈ F do
 5:     T ← T ∪ {<(E1 ∪ E2), R>}
 6:   end for
 7: end for
 8: C ← Cluster(T, N)              // cluster the tuples
 9: return C

We describe each step in the following subsections.

3.1 Preprocessing
This phase corresponds to the function Preprocess in Algorithm 1 line 3. We preprocess each tweet via a series of data cleaning operations to reduce the noise that may affect the subsequent OpenIE extraction. These include removing "rt" keywords (which indicate a retweeted message), URLs, user mentions, emoticons, colons, quote marks and the "#" signs of hashtags. The tweet content is tokenized using the twokenizer tool designed for Twitter content².

3.2 Subject-Predicate Tuple Extraction
After preprocessing the tweets, each sentence is fed to an OpenIE tool to generate a list of relation triples. This step corresponds to the OpenIE function call in Algorithm 1 line 3. We chose to use ClausIE, the state-of-the-art OpenIE technique, in this work. ClausIE takes as input each sentence in a tweet and identifies the entities E1 and E2, as well as their relationship R. The output is a triple (E1, R, E2). Each triple (E1, R, E2) is then mapped to a subject-predicate tuple (Algorithm 1 lines 4-5).

3.3 Clustering Subject-Predicate Tuples
At this juncture, we have obtained a set T of subject-predicate tuples from the original corpus of tweets D. We use the popular Porter stemmer [17] to stem the words in S and P, and filter the most frequent and infrequent words from the tuples.

We define the similarity between each pair of subject-predicate tuples ti = <Si, Pi> and tj = <Sj, Pj> as follows:

similarity(ti, tj) = w · |Si ∩ Sj| / |Si ∪ Sj| + (1 − w) · |Pi ∩ Pj| / |Pi ∪ Pj|   (1)

where w is a weight, 0 ≤ w ≤ 1, which is empirically determined. Note that this similarity metric is based on the Jaccard index between the corresponding sets of the two tuples. This allows tuple comparison operations to be approximated and scaled up (see Section 4).

We can now apply existing clustering techniques to cluster the tuples in T. Here, we choose two commonly used methods, namely agglomerative and spectral clustering, in our evaluation. Agglomerative clustering is a bottom-up hierarchical clustering approach, which initializes each subject-predicate tuple as a cluster by itself and successively merges the most similar pair of clusters at each step, till the specified number of clusters has been generated. Each cluster c is represented by a tuple tc formed by taking the union of the respective S and P terms of the tuples in the cluster, that is,

tc = < S1 ∪ ... ∪ Sn, P1 ∪ ... ∪ Pn >   for all <Si, Pi> ∈ c

On the other hand, spectral clustering takes in a similarity matrix between all pairs of tuples and constructs a Laplacian matrix. Then it performs an eigendecomposition to obtain the top m eigenvectors, effectively reducing the dimensionality to m. Finally, we use k-means to cluster these eigenvectors to obtain the desired clusters.

The output of ClaimFinder is a set C of tuple clusters. This corresponds to lines 8-9 in Algorithm 1.
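The weighted similarity of Equation 1 is straightforward to sketch. The value of w below is the best MH370 setting reported later in Section 5.3, used here purely for illustration.

```python
# Sketch of the tuple similarity in Equation 1: a weighted sum of the
# Jaccard indices of the subject sets and of the predicate sets.

def jaccard(a, b):
    """Jaccard index |a & b| / |a | b| of two word sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def similarity(ti, tj, w=0.6):
    """Equation 1; w trades off subject terms against predicate terms."""
    (si, pi), (sj, pj) = ti, tj
    return w * jaccard(si, sj) + (1 - w) * jaccard(pi, pj)

ti = ({"mh370", "nanning"}, {"land"})
tj = ({"mh370", "nanming"}, {"land"})
print(round(similarity(ti, tj), 2))  # 0.6 * 1/3 + 0.4 * 1 = 0.6
```

A precomputed matrix of these pairwise scores is exactly what a similarity-based clusterer (agglomerative or spectral) consumes.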
Each cluster corresponds to a claim. For each tuple in the cluster, we can retrieve the corresponding tweets from which the tuple is derived. This forms a grouping of the tweets that can provide evidence to verify the credibility of the claim. Note that a tweet can belong to more than one grouping, as it may contain multiple claims.

² http://www.cs.cmu.edu/~ark/TweetNLP/

4. INCREMENTAL APPROACH
Considering the streaming nature of tweets, especially for ongoing controversial major events rife with the propagation of rumours, we also propose an incremental approach to quickly identify claims from incoming tweets. Algorithm 2 gives the details of the ClaimFinderINC method.

Algorithm 2 ClaimFinderINC
Input: incoming tweet twt; split threshold thres
Output: set of buckets B = {b1, b2, ...}
 1: F = OpenIE(Preprocess(twt))
 2: for triple ∈ F do
 3:   extract <S, P> tuple from triple
 4:   i = LSH(MinHash(<S, P>))
 5:   bi ← bi ∪ {<S, P>}
 6: end for
 7: if |bi| ≥ thres then
 8:   Split(bi) into c1 and c2
 9:   let tc1 and tc2 be the representative tuples of c1 and c2 respectively
10:   initialize bi = ∅
11:   j = LSH(MinHash(tc1))
12:   bj ← bj ∪ {c1}
13:   k = LSH(MinHash(tc2))
14:   bk ← bk ∪ {c2}
15: end if

Each incoming tweet is preprocessed and the tuples constructed as described in Sections 3.1 and 3.2. We create a set of empty buckets and assign each tuple to the bucket determined by a Locality Sensitive Hashing (LSH) function with MinHash (lines 2-6 of Algorithm 2). LSH allows us to quickly estimate the similarity between the set of subject and predicate words in the tuple and those in the bucket.

Let us first consider the subject term S in a tuple t. Since S is an arbitrary-sized set of words, we choose its top n most frequent corpus words, giving a set S′, and apply m hash functions to S′. For each hash function hi, we obtain the minimum hash value among the n words, denoted by min(hi(S′)). With this, we form a vector

( min(h1(S′)), ..., min(hm(S′)) )

Similarly, we form a second vector based on the predicate term P as

( min(h1(P′)), ..., min(hm(P′)) )

where P′ is the set of top n most frequent words in P. These two vectors form the MinHash signature of a tuple.

Next, we apply LSH on the MinHash signatures. Tuples with similar subject and predicate terms will be hashed to the same bucket. This is because if the word attaining the minimum hash value is present in both sets Si and Sj, then min(h(Si)) = min(h(Sj)). This eliminates the need to perform pairwise similarity computations between a tuple from an incoming tweet and each cluster. The tuples whose MinHash signatures have been mapped to the same bucket are subsequently merged into a cluster by taking the union of their S and P terms respectively.

Our incremental approach provides a mechanism to re-adjust the clusters should the size of a cluster increase beyond some threshold (lines 7-15 of Algorithm 2). This is achieved by treating the cluster as a mini-corpus to be further partitioned via standard clustering methods based on the similarity measure defined in Equation 1. After the adjustment, a merging operation may be applied to re-group the clusters into the specified number of clusters.

5. PERFORMANCE STUDIES
We implement the proposed algorithms ClaimFinder and ClaimFinderINC in Python, and carry out experiments on a 2.3 GHz CPU with 8 GB RAM running Ubuntu 14.04. Our concept of claims is based on subject-predicate tuples. We also compare with the following representations:

• tweet: the full text of the tweet.

• keywords: a bag-of-words containing the nouns, verbs, hashtags and cardinal numbers present in a tweet. The Stanford POS tagger, using a trained model for tweets [7], is used to identify these keywords.

• ngrams: the set of n consecutive words in the tweet, ignoring stop words. We use n = 3 as it has been shown to best capture the semantics in a tweet [1], generating 7,691 ngrams for the MH370 dataset and 3,998 ngrams for the Castillo dataset. Note that the similarity between a pair of ngrams is based on the Jaccard index (like Equation 1) rather than the fraction of overlapping tweets that contain both ngrams as used in [1].

5.1 Datasets
We try to identify the claims in two real-world datasets:

• MH370 Dataset. We crawled and collected tweets on the crash of Malaysia Airlines flight MH370 in 2014 for our experiments. This event involves the mysterious disappearance of a Boeing 777 plane en route from Kuala Lumpur to Beijing on 8 March 2014. Perceived mishandling of the public communication of the situation created an unfortunately conducive environment for the proliferation of various rumours related to MH370, with sustained public interest in the status of the flight and the cause of the disappearance. Such rumours range from the absurd, such as alien abduction, to more plausible ones, such as the plane's safe landing in China during the early stage of the crisis. The location of the plane and the cause of the disappearance remain unknown today. The tweet corpus was collected using the keyword "MH370" via Twitter's REST API. In total, 510,433 tweets from 8 March to 9 April were collected. We extracted a subset of tweets from the MH370 dataset using keywords of 6 known rumour and credible claims. Overall, 3,764 tweets were identified and manually labeled with the corresponding claims. Table 3 gives the details. These claims form the ground truth.

• Castillo Dataset. We also obtain a subset of tweets with specific claims from 6 annotated topics in the Castillo dataset [5]. Table 4 shows the 6 claims, which pertain to President Obama. There are altogether 1,336 tweets, of which 811 are unique. The nomenclature of the claims follows that of the original annotated topics in [5], but with the prefix "T" instead of "TM" to indicate a filtered subset. We use these claims as ground truth.

Claim  Description                 #tweets  #unique tweets
M1     MH370 landed in Nanning        1393     271
M2     Pilot commit suicide            312     242
M3     Plane change course             203      78
M4     MH370 off course               1070     207
M5     Alien abduct MH370              538     398
M6     MH370 sighted in Maldives       248      50

Table 3: Groundtruth claims in MH370 dataset.

Claim  Description                                                   #tweets  #unique tweets
T269   President Obama visiting the Gulf of Mexico                       168      85
T876   President Obama sending troops to the US-Mexico border            466     283
T1494  President Obama praising/hailing lawmakers for a bill              48      39
T2370  President Obama signing the bill related to border security       212     104
T2384  President Obama supporting/endorsing building of a mosque
       near ground zero                                                  373     233
T2499  President Obama is Muslim                                          69      67

Table 4: Groundtruth claims in Castillo dataset.

5.2 Evaluation Metric
We evaluate the performance of the algorithms based on the proportion of claims they are able to identify. Let G be the set of ground truth claims and Dg be the set of tweets corresponding to a claim g ∈ G. The output of our algorithm is a set of tuple clusters, denoted C, where each cluster c ∈ C refers to a claim. In other words, C is the set of claims identified by an algorithm. For each tuple cluster c ∈ C, we retrieve all the tweets associated with the tuples in c, denoted by Dc.

We define a match function to compute the fraction of tweets common to both Dc and Dg as follows:

match(c, g) = 2 × |Dc ∩ Dg| / (|Dc| + |Dg|)   (2)

Note that when Dc and Dg are identical sets of tweets, we have match(c, g) = 1. On the other hand, when Dc and Dg are totally disjoint sets of tweets, match(c, g) = 0. Given a claim c, we say that c sufficiently covers a ground truth claim g if match(c, g) ≥ 0.8.

We introduce a metric called Coverage to measure the ability of a method to identify claims as follows:

Coverage = |Cmatch| / |G|   (3)

where Cmatch = {g ∈ G | ∃ c ∈ C, match(c, g) ≥ 0.8}. The set Cmatch contains the ground truth claims that have been covered by some cluster in C.

5.3 Performance of ClaimFinder
We have two versions of ClaimFinder depending on the clustering technique used. ClaimFinder(Agglomerative) implements the bottom-up agglomerative clustering in line 8 of Algorithm 1, while ClaimFinder(Spectral) utilizes spectral clustering.

We run an initial set of experiments on each of the datasets to find the parameter settings that achieve the best coverage results (Figures 2 and 3) for ClaimFinder. These parameters are the input number of clusters N and the weight w in Equation 1 that controls the relative importance of the S and P terms when computing the similarity scores between tuples. For the MH370 dataset, we have N = 18 and w = 0.6, whereas for the Castillo dataset, N = 6 and w = 0.8. In addition, words occurring in less than 3% or more than 30% of the tweets are filtered prior to clustering the MH370 dataset. For the smaller Castillo dataset, a higher minimum threshold of 4% is used. These thresholds are determined empirically based on the frequencies of words in the groundtruth claims.

Figures 2 and 3 show the coverage for ClaimFinder using the different representations and clustering techniques. Spectral clustering gives better performance on both datasets, while keywords and ngrams generally give lower coverage regardless of the clustering technique employed.

We observe that the proposed subject-predicate tuples consistently identify more claims in both datasets, and argue that this effectiveness indicates merit in discriminating the entity and relation terms using different weights for the different types of terms. This is not possible using keywords or ngrams. In addition, it is not effective to discriminate between the subject and object entities obtained directly from the OpenIE triple, due to the interchangeability of the positions of the entities in the sentence (e.g. plane abducted by alien vs alien abducts plane).

Figure 2: Performance of ClaimFinder (MH370).
Figure 3: Performance of ClaimFinder (Castillo).

5.3.1 Comparison with TwitterLDA
TwitterLDA [23] is designed for identifying topics in tweets. These topics are used to cluster the tweets for credibility assessment. We compare the performance of TwitterLDA using various tweet representations, namely full tweet, keywords, and subject-predicate tuples.

In addition to the original TwitterLDA model, we also experimented with its variants using author pooling and temporal pooling. For the MH370 dataset, there are 3,764 tweets from 3,557 authors. These tweets are posted across a period of 15 days, and thus a daily (24-hour) time frame is chosen for its temporal pooling. For the Castillo dataset, there are 1,336 tweets from 1,100 authors, posted between 1 May and 20 August 2010. The longer timeframe motivates the use of a weekly (7-day) time frame for temporal pooling.

The implementation of the TwitterLDA based approaches is based on the publicly available code³, run with the default 100 iterations. TwitterLDA requires the number of topics as an input parameter. Our initial experiments show that the best performance is achieved when the number of topics is 12 for both datasets. We use this setting to obtain the coverage of the various TwitterLDA models.

³ https://github.com/minghui/TwitterLDA

Figure 4: Performance of TwitterLDA (MH370).
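The match and Coverage measures of Section 5.2, used throughout this evaluation, reduce to a few lines over sets of tweet IDs. The toy cluster and ground-truth sets below are invented purely for illustration.

```python
# Sketch of the evaluation metrics: match(c, g) is the Dice coefficient of
# the tweet sets Dc and Dg (Equation 2), and Coverage is the fraction of
# ground truth claims matched by some cluster at threshold 0.8 (Equation 3).

def match(dc, dg):
    """Equation 2 over two sets of tweet ids."""
    return 2 * len(dc & dg) / (len(dc) + len(dg))

def coverage(clusters, truths, thres=0.8):
    """Equation 3: fraction of ground truth claims sufficiently covered."""
    covered = {g for g, dg in truths.items()
               if any(match(dc, dg) >= thres for dc in clusters)}
    return len(covered) / len(truths)

clusters = [{1, 2, 3, 4}, {5, 6}]               # tweet ids per identified claim
truths = {"g1": {1, 2, 3, 4, 7}, "g2": {8, 9}}  # tweet ids per ground truth claim
print(coverage(clusters, truths))  # match = 8/9 >= 0.8 for g1 only, so 0.5
```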
Figure 5: Performance of TwitterLDA (Castillo).

Figures 4 and 5 show the results. We observe that using the subject-predicate tuples representation always achieves the best coverage regardless of the TwitterLDA model used. This indicates that the subject-predicate tuples are able to capture the underlying semantics of a claim. Using keywords generally yields better coverage than using ngrams or the full text of the tweet. Using the full tweet results in relatively bad coverage, indicating that when there are multiple claims in a tweet, some of these claims may be missed.

Overall, the best performance is obtained when the proposed subject-predicate tuples are used in conjunction with TwitterLDA(Weekly Pooled). This is because there is a temporal correlation among the claims, that is, posts containing the same claims are likely to be sent within similar time windows. In contrast, TwitterLDA(Author Pooled) does not perform well due to the low tweet-to-author ratio in both datasets.

When we compare the coverage of the best performing variant of TwitterLDA, i.e. TwitterLDA(Weekly Pooled) in Figures 4 and 5, against the best performing ClaimFinder version, i.e. ClaimFinder(Spectral) with subject-predicate tuples, we see that the latter significantly increases the number of claims identified in both datasets. We note that the MH370 dataset is noisier (it has a more diverse set of words) than the Castillo dataset, and believe that the larger improvement for the former is an indication of the weakness of TwitterLDA in dealing with the noise.

5.4 Effectiveness of ClaimFinder
As a case study on the effectiveness of the proposed claim identification approach, we retrieve the sets of subject-predicate tuples in the clusters that match some ground truth claim, as well as their corresponding tweets. The identified claims and sample tweets obtained using ClaimFinder(Spectral) are shown in Tables 5 and 6 for the MH370 and Castillo datasets respectively. We see that the tweets retrieved based on the clusters found by ClaimFinder closely match the description of the ground truth claim, indicating that the subject-predicate tuples are able to capture the semantics of a claim.

M5 (Alien abduct MH370):
  "CNN has yet to rule out the theory that MH370 was abducted by aliens. Muldar, where are you?"
  "The #MH370 was abducted by aliens? How come?"
  "Rumors: Malaysia Airline MH370 Abducted by Aliens? - News - Bubblews"
  "What if the plane is abducted by the aliens?"
  "#MH370 if a mysterious island (Lost) can happen, so does an alien spaceship."
  "Has somebody floated alien abduction theory for MH370?"

M6 (MH370 sighted in Maldives):
  "BREAKING: Malaysia transport minister says reports of missing plane sighted over Maldives are untrue"
  "Minister: Maldives says it's not true that the plane was sighted in its airspace #MH370"
  "MH370: Reports that plane sighted in #Maldives not true"
  "RT Yahoo MY: Plane sighted in Maldives? Not true, says Hishammuddin"
  "RT TODAYonline: #MH370 press con: Reports of plane sighted at Maldives are not true; forensic work underway to look at data deleted from..."

Table 5: Sample claims found in MH370 dataset.

T269 (President Obama visiting the Gulf of Mexico):
  "President Obama will visit the Gulf of Mexico in the next 48 hours to check out the oil spill and response, per a White House official."
  "RT @CNN: President Obama will visit the Gulf of Mexico in the next 48 hours to check out the oil spill and response."
  "President Obama to visit Gulf of Mexico region in next 48 hours to check oil spill response, White House says."
  "RT @GWPStudio: President Obama to visit site of oil spill in the Gulf of Mexico in next 48 hours http://bit.ly/cZ0q73 #oilspill"
  "RT @CNN: Just in: President Barack Obama will visit the Gulf of Mexico oil spill area on Sunday morning."

T2384 (President Obama supports building of a mosque near ground zero):
  "RT @croedemeierAP: WASHINGTON (AP) - President Obama supports allowing mosque to be built near ground zero in Manhattan."
  "President Obama supports allowing mosque to be built near ground zero"
  "Obama backs Mosque near ground zero (AP): AP - President Barack Obama on Friday forcefully endorsed building ..."
  "Breaking news: President Obama backs mosque near ground zero"
  "Looks interesting: Obama backs mosque near ground zero: President Obama threw his support behind a controversial p..."

Table 6: Sample claims found in Castillo dataset.

5.5 Scalability of ClaimFinderINC
Finally, we evaluate the scalability of the proposed incremental method ClaimFinderINC to identify claims. We use 100 hash functions to generate the MinHash values, and spectral clustering for the splitting and merging operations. There are two parameters in ClaimFinderINC, namely the number of LSH vectors and the threshold to split a cluster. We use 50 LSH vectors for both the MH370 and Castillo datasets. The split threshold is 10 and 30 tuples for the MH370 and Castillo datasets respectively.

Figure 6 shows the runtime of ClaimFinderINC compared to ClaimFinder under spectral clustering and ClaimFinder under agglomerative clustering (in log scale). We observe that ClaimFinderINC is several orders of magnitude faster than both versions of ClaimFinder and remains scalable as the number of tweets increases.
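The MinHash-plus-LSH bucketing at the core of ClaimFinderINC (Section 4) can be sketched as below. The md5-derived hash family and the banding scheme are generic stand-ins, not the paper's exact configuration.

```python
import hashlib

# Sketch of ClaimFinderINC's bucketing (Section 4): compute a MinHash
# signature over a tuple's word set, then band the signature so that
# tuples agreeing on any band fall into the same LSH bucket. The hash
# family and band size below are illustrative assumptions.

def h(i, word):
    """The i-th hash function, derived from a single keyed digest."""
    return int(hashlib.md5(f"{i}:{word}".encode()).hexdigest(), 16)

def minhash(words, m=100):
    """MinHash signature: the minimum hash value under each of m functions."""
    return tuple(min(h(i, w) for w in words) for i in range(m))

def lsh_keys(sig, band=2):
    """Bucket keys: tuples that agree on any one band share a bucket."""
    return {(b, sig[b:b + band]) for b in range(0, len(sig), band)}

sig1 = minhash({"mh370", "nanning", "land"})
sig2 = minhash({"mh370", "nanming", "land"})
# Word sets with high Jaccard similarity agree on many signature positions,
# so with high probability they share at least one bucket:
print(len(lsh_keys(sig1) & lsh_keys(sig2)))
```

Because bucket keys can be looked up in a hash table, an incoming tuple is routed to candidate clusters without any pairwise similarity computation, which is the source of the speedup seen in Figure 6.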
For future work, we plan to in- vestigate existing features as well as information from other sources for credibility assessment. 8. REFERENCES [1] L.M. Aiello, G. Petkos, and C. Martin et al. Sensing trending topics in twitter. IEEE Transactions on Multimedia, 15(6), 2013. [2] H. Becker, M. Naaman, and L. Gravano. Beyond trending topics: Real-world event identification on twitter. In AAAI Conference on Weblogs and Social Media, 2011. [3] D.M. Blei, A.Y. Ng, and M.I. Jordan et al. Latent dirichlet allocation. Journal of Machine Learning Research, 2003. Figure 6: Scalability of ClaimF inderIN C (MH370). [4] C. Castillo, M. Mendoza, and B. Poblete. Information credibility on twitter. In WWW, 2011. [5] C. Castillo, M. Mendoza, and B. Poblete. Predicting 6. RELATED WORK information credibility in time-sensitive social media. There are two main approaches to cluster tweets, namely Internet Research, 2013. features-based and topic modeling based clustering. Feature- [6] L.D. Corro and R. Gemulla. Clausie: Clause-based open information extraction. In WWW, 2013. based approach typically represent each tweet as a vector or [7] L. Derczynski, A. Ritter, and S. Clark et. al. Twitter set of features from which a similarity measure can then part-of-speech tagging for all: Overcoming sparse and noisy be used to quantify the distance between any given pair data. In Recent Advances in NLP, 2013. of tweets. A commonly used set of features is the TFIDF [8] I.S. Dhillon. Co-clustering documents and words using scores of the words present within the tweet content. Other bipartite spectral graph partitioning. In ACM SIGKDD, features useful for differentiating individual tweet to their 2001. event include references to temporal, geographical and user [9] E. Ferrara, M. JafariAsbagh, and O. Varol et. al. Clustering information extracted from the tweet content [21]. These memes in social media. In Advances in Social Networks Analysis and Mining, 2013. 
features are then used to cluster the tweets [9, 20, 8]. [10] A. Gupta, P. Kumaraguru, C. Castillo, and P. Meier. The alternative to features-based clustering is the genera- Tweetcred: Real-time credibility assessment of content on tive topic modeling approaches, e.g., LDA [3]. However, the twitter. In Social Informatics. 2014. limited number of words present in microblog pose a major [11] L. Hong and B.D. Davison. Empirical study of topic problem due to the lack of word co-occurrence within the modeling in twitter. In SIGKDD Workshop on Social tweets [11]. Empirical studies show that aggregating tweets Media Analytics, 2010. such that each document is the concatenation of tweets from [12] M. Mathioudakis and N. Koudas. Twittermonitor: Trend a user, hashtag or time window improves the topic cluster- detection over the twitter stream. In ACM SIGMOD, 2010. ing results [11][19][13]. The work in [23] assume that “a [13] R. Mehrotra, S. Sanner, W. Buntine, and L. Xie. Improving lda topic models for microblogs via tweet pooling and single tweet is usually about a single topic” and propose automatic labeling. In ACM SIGIR, 2013. the TwitterLDA model where words in a tweet are either [14] M. Mendoza, B. Pobletey, and C. Castillo. Twitter Under chosen from a topic or are background noise words. The Crisis: Can we trust what we RT? In 1st Workshop on TwitterLDA model is able to generate more coherent repre- Social Media Analytics,, 2010. sentative topic words compared to a standard LDA model. [15] J. O’Donovan, B. Kang, and G. Meyer et. al. Credibility in To date, prior work on tweet or keywords clustering are context: An analysis of feature distributions in twitter. In designed mainly for topic or event detection, of which are International Conference on Social Computing, 2012. overly encompassing in nature for the credibility assessment [16] F. Pedregosa, G. Varoquaux, and A. Gramfort et. al. Scikit-learn: Machine learning in Python. Journal of task. 
For example, an entity-oriented sample topic in [23] Machine Learning Research, 12:2825–2830, 2011. ‘ ‘iphone6, #iphone, apple, app” correspond to tweets refer- [17] M.F. Porter. An algorithm for suffix stripping. Program, ring to the iPhone and/or the technology company while a 14:130–137, 1980. event-oriented topic “health, flu, swine, #h1n1, #swineflu” [18] V. Qazvinian, E. Rosengren, D.R. Radev, and Q. Mei. correspond to tweets referring to the virus outbreak. The Rumor has it: Identifying misinformation in microblogs. In problem that there are multiple claims of varying credibility EMNLP, 2011. made within the tweets in each cluster remains unaddressed. [19] Y. Wang, J. Liu, and J. Qu et. al. Hashtag graph based topic model for tweet mining. In IEEE Data Mining, 2013. [20] C. Wartena and R. Brussee. Topic detection by clustering 7. CONCLUSION keywords. In DEXA, 2008. In this work, we observed that tweets may contain mul- [21] Y. Xia, X. Yang, and C. Wu et. al. Information credibility tiple claims and define a claim as comprising of subjects and on twitter in emergency situation. In Pacific Asia Conference on Intelligence and Security Informatics, 2012. predicates terms. We described a method called ClaimF inder [22] F. Yang, Y. Liu, X. Yu, and M. Yang. Automatic detection to identify claims in a corpus of tweets related to some real of rumor on sina weibo. In ACM SIGKDD Workshop on world event. In particular, we use OpenIE techniques to Mining Data Semantics, 2012. identify entities and their relationships in tweets and map [23] W. Zhao, J. Jiang, and J. Weng et. al. Comparing twitter them to subject-predicate tuples. These tuples are then clus- and traditional media using topic models. In European tered such that each cluster refers to a claim. We further in- Conference on Advances in Information Retrieval, 2011. 20 · #Microposts2016 · 6th Workshop on Making Sense of Microposts · @WWW2016
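The conclusion summarizes the pipeline: OpenIE-style subject-predicate tuples are extracted from tweets and then clustered so that each cluster corresponds to one claim. The paper does not spell out the vectorization behind its spectral clustering step, so the following is only a minimal sketch using scikit-learn (which the paper cites [16]); the toy tuples, the TF-IDF representation and the cluster count are illustrative assumptions, not the authors' exact setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import SpectralClustering

# Hypothetical subject-predicate tuples flattened to token strings;
# in the paper these would come from OpenIE extraction over tweets.
tuples = [
    "plane sighted maldives not true",
    "reports plane sighted maldives not true",
    "obama visit gulf of mexico oil spill",
    "obama visit gulf of mexico in 48 hours",
]

# TF-IDF rows are L2-normalized by default, so X @ X.T is the
# cosine similarity between tuples, usable as a spectral affinity.
X = TfidfVectorizer().fit_transform(tuples)
affinity = (X @ X.T).toarray()

# One cluster per claim; n_clusters=2 is an assumption for this toy data.
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(affinity)
# Tuples about the same claim end up with the same label.
```

The tweets whose tuples fall in the same cluster would then serve as the evidence set for that claim in the downstream credibility assessment step.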
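Section 5.5 gives only the parameters of ClaimFinderINC (100 MinHash functions, 50 LSH vectors, a split threshold), not its implementation. Below is a hedged sketch of the standard MinHash-signature plus banded-LSH scheme over token sets that those parameters suggest; the function names, hash family and toy tuples are all illustrative.

```python
import hashlib
import random
from collections import defaultdict

P = (1 << 61) - 1  # large Mersenne prime for the universal hash family

def minhash_signature(tokens, num_hashes=100, seed=7):
    """MinHash: for each hash function h_i(x) = (a*x + b) % P,
    keep the minimum value over the token set."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(num_hashes)]
    ints = [int(hashlib.md5(t.encode()).hexdigest(), 16) % P for t in tokens]
    return tuple(min((a * x + b) % P for x in ints) for a, b in params)

def lsh_buckets(signatures, num_bands=50):
    """Split each signature into bands; tuples sharing any band become
    candidates for the same claim cluster, avoiding all-pairs comparison."""
    rows = len(next(iter(signatures.values()))) // num_bands
    buckets = defaultdict(set)
    for tid, sig in signatures.items():
        for b in range(num_bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(tid)
    return buckets

# Toy subject-predicate tuples as token sets.
tuples = {
    "t1": {"obama", "visit", "gulf", "mexico"},
    "t2": {"obama", "visit", "gulf", "mexico"},   # same claim as t1
    "t3": {"plane", "sighted", "maldives"},
}
sigs = {tid: minhash_signature(toks) for tid, toks in tuples.items()}
candidates = lsh_buckets(sigs)  # t1 and t2 share every bucket
```

Presumably, buckets growing past the split threshold (10 tuples for MH370, 30 for Castillo) would then be re-clustered with the spectral splitting and merging operations mentioned in Section 5.5.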