Introducing a Framework for Automatically Differentiating Witness Accounts of Events from Social Media

Marie Truelove, Maria Vasardani, and Stephan Winter
Department of Infrastructure Engineering, The University of Melbourne, Australia
Emails: truelove@student.unimelb.edu.au (M.T.); maria.vasardani@unimelb.edu.au (M.V.); winter@unimelb.edu.au (S.W.)

SUMMARY

Identifying Witnesses of events from social media is an opportunity to crowdsource real-time information for numerous applications, including emergency response in a crisis, source filtering for journalism, and marketing services. Using a sporting event broadcast live to an audience proportionally much larger than the venue crowd, this research demonstrates a significant increase in the number of Witnesses identified as posting from the event venue, in comparison to the number identified from geotags alone. This is achieved by considering the text and image content of micro-blogs as additional evidence. The paper also reports progress towards the automatic categorisation of this additional text and image evidence, and towards modelling and testing the evidence for corroboration or conflict using the Dempster-Shafer Theory of Evidence.

Keywords: Crowdsourcing, Social Media, Witness Accounts, Supervised Machine Learning, Dempster-Shafer Theory of Evidence

INTRODUCTION

Crowdsourcing information about events from social networks such as Twitter is recognised as an opportunity to harvest detailed real-time information, for example enhancing situational awareness for emergency response and management [18] and creating news summaries of large sporting spectacles [19]. However, these opportunities come with many problems to solve, including detecting the fraction of relevant micro-blogs, and assessing the credibility and location of the micro-bloggers who posted them. This research makes a unique contribution by proposing a framework for distinguishing those micro-blogs which are Witness Accounts (WA) of events.
WA are defined as those micro-blogs which contain an observation of the event or its effects [17], for example the statement "I see the bushfire smoke!" or an image conveying the same information. The micro-blogger who posted the WA is considered a potential Witness to the event, and it can be inferred that they are on-the-ground (OTG) [15], that is, in close proximity to the event [17]. Impact Accounts (IA) are defined as those micro-blogs which do not contain an observation of the event, but from which it can nevertheless be inferred that the micro-blogger who posted them is OTG. IA statements may be as explicit as "I'm being evacuated from my home due to the bushfire." Formally modelling the witnessing fundamentals of observation and spatial relationship separately enables a generic model for a range of event types, from unpredicted natural disasters to scheduled events broadcast live from dedicated venues, such as the case study presented in this paper. All micro-bloggers who post observations of the event, whether viewed directly from the grandstands or via television, are by definition Witnesses. The research in this paper asks whether it is possible to differentiate those Witnesses who are physically at the event from those watching a broadcast. Such differentiation is supported by micro-blogs with geotags, but geotags are typically present in only a small fraction of micro-blogs, for example 1% [1]. This research demonstrates that, by including the text content and linked images as evidence, the sample of micro-blogs identified as posted from the event location can be increased significantly beyond those identified by geotags alone. Additionally, this research asks whether text content and linked images can be automatically categorised and used to test whether they corroborate the inference that they were posted from the event.

Proc. of the 3rd Annual Conference of Research@Locate
In order to automatically differentiate those micro-blogs which are WA or IA, and to test the Witness categorisation of the micro-bloggers who posted them, a framework is proposed with the following parts:

1. Machine learning approaches to categorise micro-blogs with text and linked images that are likely WA or IA;
2. Combining the evidence extracted for each individual micro-blog to determine those which can be ranked as containing corroborating or conflicting evidence;
3. For each micro-blogger found to have posted micro-blogs containing evidence, combining this evidence to rank their likely status as a Witness OTG; and
4. For likely Witnesses, seeking further evidence, for example from the micro-blogging history posted during the event.

This paper presents progress to date on parts 1) and 2). To demonstrate part 1), supervised machine learning approaches are used to categorise the text and image content. A model of the micro-blog text, linked images and geotags using the Dempster-Shafer Theory of Evidence [3] is developed to demonstrate part 2). The results indicate a significant improvement in the recognition rate of micro-blogs posted from an event compared with geotags alone. Where multiple pieces of evidence are present for an individual micro-blog, their combination produces intuitive results, including identifying conflict due to GPS error. Enhancements and alternative approaches to those presented in this paper, as well as parts 3) and 4) of the framework, are the subject of future work.

BACKGROUND

Communication technologies have been described as space-adjusting techniques [14], as they enable events to be witnessed by proportionally much larger audiences than the capacity of the venues in which they are held. In these scenarios, unlike previous case studies such as those in [16], it is not possible to infer that a Witness is OTG for the dominating category of observations, that is, observations of the play on the field [19].
It has also been determined that the live broadcast delay of approximately 12 seconds cannot be detected in micro-blogs, ruling this feature out as a method to distinguish those witnessing via a broadcast [19]. In addition to sport, differentiating Witnesses of crisis events has gained much interest from researchers. A journalistic approach describes extracting observation features from text to identify Witnesses [2], whereas spatial presence in the city of the event is the criterion in other work [10].

Supervised Machine Learning for Categorisation

Natural language processing (NLP) using bag-of-words approaches built from unigram, bigram and parts-of-speech (POS) models can be utilised as baseline text categorisation features [10] [18]. These studies report success comparable in many scenarios to more sophisticated features [10] [18]. A visual bag-of-words approach to categorise images linked to micro-blogs has also been tested [9]. The disadvantage of bag-of-words approaches is that although the methodology can be applied generically, the resulting model is not generic; for example, a model developed from training data for a football game cannot be used for a bushfire. Approaches which extract semantic meaning, for example locative expressions from text [7], would enable a generic model, but their success to date is limited in domains such as social media [7]. Detecting micro-blogs posted from OTG is also recognised as an unbalanced class problem [15]. Approaches taken to mitigate class imbalance typically involve balancing the data via sampling [10] [5] [18], or algorithmically introducing a misclassification cost for the under-represented class [15].

Dempster-Shafer Theory of Evidence

The Dempster-Shafer Theory of Evidence is one method that has found application in classifier fusion, and in managing uncertainty and incomplete reasoning [3]. The theory models the power set of the frame of discernment of the hypothesis [3].
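For the two-class frame used later in this paper, the basic quantities of the theory can be written out explicitly. The following is a notational sketch based on standard presentations of Dempster-Shafer theory, not quoted from [3]:

```latex
% Frame of discernment and its power set
\Theta = \{\text{E-OTG}, \text{E-NOTG}\}, \qquad
2^{\Theta} = \bigl\{\emptyset,\ \{\text{E-OTG}\},\ \{\text{E-NOTG}\},\ \Theta\bigr\}

% A mass function distributes unit belief over the power set
m : 2^{\Theta} \to [0,1], \qquad m(\emptyset) = 0, \qquad
\sum_{A \subseteq \Theta} m(A) = 1

% Belief and plausibility bound the support for a hypothesis A;
% the belief interval is [Bel(A), Pl(A)]
\mathrm{Bel}(A) = \sum_{B \subseteq A} m(B), \qquad
\mathrm{Pl}(A) = \sum_{B \cap A \neq \emptyset} m(B)
```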
A mass function is assigned to each subset in the power set, from which the belief interval can be derived [3]. The mass function can be assigned from various classifier results, including the overall accuracy, class statistics or individual instances [12]. The mass functions for independent evidence can then be combined [3]. Dempster's Rule of Combination has been shown to produce unintuitive results in scenarios with conflict [13], resulting in many enhancements being proposed, including PCR6 [13], which is based on proportional conflict redistribution.

METHODOLOGY

Data Collection and Training Set Creation

The case study event is an Australian Football League (AFL) match played at the Melbourne Cricket Ground (MCG) on the annual ANZAC Day public holiday. In 2015, this match attracted a near capacity crowd of 88,398 (1) and television ratings of 1.298 million (2). The corpus was collected using the AFL's promoted hashtag #afldonspies, utilising the Twitter Data Analytics software package [6]. Pre-processing samples the micro-blogs to those which can be identified as individual and original, that is, not retweets or posts by a non-individual such as the media [17]. To collect a sample of linked images, all micro-blogs in the corpus with a URL to Twitter or Instagram were inspected, as these are more likely to contain WA [16]. To create the training set, two expert annotators coded the tweet text and linked images with one of three categories, examples of which are presented in Table 1. The three categories are:

1. No Evidence (NE), when no evidence of being posted from OTG or another place could be detected;
2. Evidence posted from OTG (E-OTG), when evidence of being posted from OTG is detected; or
3. Evidence not posted from OTG (E-NOTG), when counter-evidence indicating that the micro-blog was not posted from OTG is detected.

Table 1. Example text and image content for each category. (Source: twitter.com, access date: 25-26/04/15.)
No Evidence (NE):          Fletcher goes bang with a 60 metre monster! #AFLDonsPies
Evidence OTG (E-OTG):      Not the best seats in the house but just glad to be here at @MCG #AFLDonsPies
Evidence not OTG (E-NOTG): In front of TV with chips for next 3 hours! #AFLDonsPies

Supervised Machine Learning for Categorisation of Text and Images

Pre-processing of the text included word tokenisation and parts-of-speech tagging using Ark NLP [11]. WEKA's [4] default pre-processing filters were used to experiment with unigram and bigram models. WEKA's default feature selection filters were utilised to reduce the number of redundant dimensions, and to experiment with a range of classifiers indicated by previous research, including Naive Bayes, Random Forest and Support Vector Machines (SVM). All experiments were completed with 10-fold cross validation. As expected, class imbalance was an issue, in particular for the text corpus. Sampling to micro-blogs posted by micro-bloggers with at least one piece of evidence detected in the training set was used to mitigate the imbalance. The classifier selected was the one that maximised precision of the E-OTG and E-NOTG classes, at the expense of recall if necessary, to minimise conflict due to misclassification in the Dempster-Shafer modelling.

Categorisation of Geotags

Geotag evidence was categorised as E-OTG or E-NOTG based on whether it was contained within or in the immediate vicinity of the MCG, the place of the event. It was necessary to create a decision boundary for this categorisation, which was informed primarily by the boundaries of places bordering the MCG, for example train lines, roads and other venues.

(1) twitter.com/MCG/status/591859347891748865
(2) http://footyindustry.com/files/afl/media/tvratings/2015/2015AFLRatings.png

Dempster-Shafer Modelling of Evidence Extracted from Micro-blogs

The frame of discernment is modelled as {E-OTG, E-NOTG}, with power set {null, {E-OTG}, {E-NOTG}, {E-OTG, E-NOTG}}. The NE categorisation is not modelled in the frame of discernment. For example, if a micro-blog has a geotag categorised as E-OTG, and text and image categorised as NE, the text and image neither corroborate nor conflict with the E-OTG categorisation provided by the geotag. For demonstration, mass functions are set manually, with derivation from classifier results left to future work. The mass functions assigned to geotags represent greater certainty than those assigned to images, which in turn are greater than those assigned to text. The combination rule PCR6, as implemented in Matlab [8], is then used to compute the combinations for analysis; decision algorithm testing is again left for future work.

RESULTS AND DISCUSSION

The corpus contained 3260 micro-blogs, 265 with linked images and 133 with geotags. Table 2 presents the categorisation results, both in the training data and as predicted by the classifiers. Annotator agreement for the text and image content was high, with Cohen's Kappa of 0.895 and 0.929 respectively. Combining the three content sources, or evidence, from the training data indicates that the number of micro-blogs categorised as E-OTG and E-NOTG can be increased significantly beyond those with geotags alone. The increase for E-OTG is from 21 to 176 micro-blogs, and the increase for E-NOTG is from 112 to 241 micro-blogs. This corresponds to an additional 125 potential Witnesses OTG, up from 16 identified by geotags alone. 54 tweets had more than one piece of evidence which could be checked for conflict. Conflict did exist for a fraction of tweets, and was found to be due to GPS error: the geotag indicated the micro-blog was posted from a nearby venue, when the image and text indicated it was posted from the MCG.
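The PCR6 combination step can be illustrated with a short sketch. The paper uses the Matlab implementation of [8]; what follows is an independent Python re-implementation for the two-element frame (for two sources PCR6 coincides with PCR5). The text mass function is the one assigned in Table 4; the geotag mass function is a hypothetical value of our own, since the assigned geotag masses are not listed in the paper:

```python
from itertools import product

# Frame of discernment {E-OTG, E-NOTG}; focal elements as frozensets
OTG = frozenset({"E-OTG"})
NOTG = frozenset({"E-NOTG"})
THETA = OTG | NOTG  # total ignorance {E-OTG, E-NOTG}

def pcr6(m1, m2):
    """Combine two mass functions: conjunctive combination, then
    proportional redistribution of each partial conflict back to the
    two focal elements that produced it (PCR5 = PCR6 for two sources)."""
    out = {OTG: 0.0, NOTG: 0.0, THETA: 0.0}
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:  # compatible focal elements reinforce their intersection
            out[inter] += ma * mb
        else:      # conflicting pair: split ma*mb proportionally to ma, mb
            out[a] += ma ** 2 * mb / (ma + mb)
            out[b] += mb ** 2 * ma / (ma + mb)
    return out

# Assigned text mass from Table 4; hypothetical geotag mass (assumption)
m_text = {OTG: 0.70, NOTG: 0.15, THETA: 0.15}
m_geotag = {OTG: 0.90, NOTG: 0.05, THETA: 0.05}

combined = pcr6(m_text, m_geotag)
for subset in (OTG, NOTG, THETA):
    print(sorted(subset), round(combined[subset], 3))
```

Corroborating evidence raises the combined E-OTG mass above either input mass, while a conflicting pair (for example text E-OTG against geotag E-NOTG) leaves the mass split between the two singletons, mirroring the corroboration and conflict behaviour reported for Table 4.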
The combined results correctly predicted are fewer than in the training data, but still a significant increase over those with geotags alone. The number of micro-blogs categorised as E-OTG increased from 21 to 125, and the number categorised as E-NOTG increased from 112 to 182. This corresponded to an additional 77 potential Witnesses posting from OTG and an additional 50 potential Witnesses via the broadcast. From the predicted results of the classifiers, 26 micro-blogs had more than one piece of evidence, with five in conflict. In addition to GPS error, these conflicts are now also attributed to misclassification.

Table 2. Number of micro-blogs categorised for each content source individually and in combination. The number of misclassified micro-blogs in each class is presented in parentheses.

Content      | E-OTG                | E-NOTG               | NE
             | Training  Predicted  | Training  Predicted  | Training  Predicted
Geotag       | 21        -          | 112       -          | -         -
Image        | 95        95 (11)    | 26        17 (10)    | 146       173 (27)
Text         | 99        34 (10)    | 129       66 (6)     | 3032      1088 (143)
Combination  | 176       125 (15)   | 241       182 (17)   | 2328      876 (132)

From the classifier experimentation it was found that the WEKA default SVM, feature selection filter, and a unigram model maximised precision of the E-OTG and E-NOTG classes for text content. Using the methodology described by [9], the SVM classifier was also selected for the image content. The precision and recall results for each class are presented in Table 3. These results indicate the image categorisation failed for the E-NOTG class, which is attributed to the insufficient number of samples in the training data. For future experiments this category could be removed, or the sample increased with data from other events; both options are to be tested in future work. In comparison, the E-OTG category proved acceptable for both precision and recall.
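The text categorisation above relies on WEKA's SVM with unigram features; to make the bag-of-words idea concrete, the sketch below trains a minimal unigram Naive Bayes classifier from scratch. This is a stand-in illustration, not the paper's pipeline, and the training texts and labels are invented examples loosely modelled on Table 1:

```python
import math
from collections import Counter, defaultdict

def train(texts, labels):
    """Count unigrams per class (a bag-of-words model)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # class -> word -> count
    vocab = set()
    for text, label in zip(texts, labels):
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict(model, text):
    """Multinomial Naive Bayes with Laplace smoothing."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total)  # class prior
        n_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # add-one smoothed unigram likelihood
            score += math.log((word_counts[label][word] + 1)
                              / (n_words + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

# Invented training examples, loosely following Table 1
texts = [
    "not the best seats but glad to be here at the mcg",
    "queuing at the gates almost at our seats",
    "in front of the tv with chips for the next 3 hours",
    "watching the broadcast from my couch at home",
]
labels = ["E-OTG", "E-OTG", "E-NOTG", "E-NOTG"]

model = train(texts, labels)
print(predict(model, "glad to be at the mcg"))  # prints E-OTG
```

As the paper notes, such a model is corpus-specific: a vocabulary learned for a football match will not transfer to a bushfire.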
The better precision for text E-NOTG compared to E-OTG can in part be explained by the topics in these micro-blogs being dominated by explicit statements critiquing the television coverage or the medium via which the broadcast was being viewed, enabling a more representative unigram model. In comparison, there was no dominant topic for E-OTG. More robust features based on previously identified witnessing characteristics [16] will be developed in future work. Additionally, the results indicate improvements could be made if the class imbalance were further addressed.

Table 3. Class precision and recall results for the adopted classifiers.

Corpus | Statistic | E-OTG | E-NOTG | NE
Text   | Precision | 0.706 | 0.909  | 0.869
       | Recall    | 0.242 | 0.465  | 0.984
Image  | Precision | 0.844 | 0.412  | 0.844
       | Recall    | 0.896 | 0.280  | 0.896

Table 4 presents example mass functions manually assigned for text and image evidence, and combination mass functions corresponding to predicted combination results. Manual analysis concludes that these results appear intuitive: for example, when multiple pieces of evidence supporting a categorisation are present, the increased values indicate corroboration. Additionally, when conflict exists the values reflect this, suggesting the conflict redistribution of the PCR6 algorithm is appropriate for the modelled scenario. Hybrid methods for deriving the mass functions, which can model the uncertainty of the evidence in addition to the automatic classification results, are in progress.

Table 4. Example mass functions assigned and PCR6 combination results for evidence combinations detected in micro-blogs. (- indicates no data.)

Evidence categorisation    | Mass function                   | Comment
Text    Image    Geotag    | E-OTG  E-NOTG  {E-OTG, E-NOTG}  |
E-OTG   -        -         | 0.70   0.15    0.15             | Assigned mass function
E-OTG   -        E-OTG     | 0.95   0.04    0.01             |
E-OTG   E-OTG    E-OTG     | 0.97   0.02    0.01             |
E-OTG   E-OTG    E-NOTG    | 0.55   0.43    0.02             | GPS error example
E-OTG   -        E-NOTG    | 0.35   0.64    0.01             | Misclassified text
-       E-NOTG   -         | 0.10   0.80    0.10             | Assigned mass function
E-NOTG  E-NOTG   -         | 0.07   0.91    0.02             |

CONCLUSION

This paper presented progress on a framework to automatically extract WA and IA of events from social media. Baseline supervised machine learning techniques to categorise text and images were demonstrated, enabling micro-blogs posted from OTG or via the broadcast to be identified in significantly greater numbers than with geotags alone. Additionally, a method based on the Dempster-Shafer Theory of Evidence was demonstrated to combine the extracted evidence to test for corroboration or conflict in the categorisation of the micro-blogs. Many areas for enhancement are identified, including machine learning approaches that further mitigate class imbalance and enable generic model development. In addition to pursuing these enhancements, future work will include modelling the combination of evidence from multiple micro-blogs to identify the status of potential Witnesses.

REFERENCES

[1] Cheng, Z., Caverlee, J., and Lee, K. You are where you tweet: A content-based approach to geo-locating Twitter users. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (2010), ACM, pp. 759–768.
[2] Diakopoulos, N., De Choudhury, M., and Naaman, M. Finding and assessing social media information sources in the context of journalism. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems (2012), pp. 2451–2460.
[3] Gargiulo, F., Mazzariello, C., and Sansone, C. Multiple Classifier Systems: Theory, Application and Tools. Springer-Verlag, 2013, ch. 10, pp. 335–378.
[4] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. The WEKA data mining software: An update.
SIGKDD Explorations 11, 1 (2009), 10–18.
[5] Kumar, S., Hu, X., and Liu, H. A behaviour analytics approach to identifying tweets from crisis regions. In Proceedings of the 25th ACM Conference on Hypertext and Social Media (2014), pp. 255–260.
[6] Kumar, S., Morstatter, F., and Liu, H. Twitter Data Analytics. Springer, 2014.
[7] Liu, F., Vasardani, M., and Baldwin, T. Automatic identification of locative expressions from social media text: A comparative analysis. In Proceedings of the 4th International Workshop on Location and the Web (LocWeb) (2014), pp. 9–16.
[8] Martin, A. Implementing general belief function framework with a practical codification for low complexity. In Advances and Applications of DSmT for Information Fusion, F. Smarandache and J. Dezert, Eds., vol. 3. American Research Press, Rehoboth, 2009, pp. 217–273.
[9] McLean, S. Identifying witness accounts in social media using imagery. Master's thesis, The University of Melbourne, 2015.
[10] Morstatter, F., Lubold, N., Pon-Barry, H., Pfeffer, J., and Liu, H. Finding eyewitness tweets during crises. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science (2014).
[11] Owoputi, O., O'Connor, B., Dyer, C., Gimpel, K., Schneider, N., and Smith, N. A. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of NAACL 2013 (2013).
[12] Parikh, C. R., Pont, M. J., and Jones, N. B. Application of Dempster-Shafer theory in condition monitoring applications: a case study. Pattern Recognition Letters 22 (2001), 777–785.
[13] Smarandache, F., and Dezert, J. On the consistency of PCR6 with the averaging rule and its application to probability estimation. In Proceedings of the 16th International Conference on Information Fusion (2013), pp. 1119–1126.
[14] Spencer, J. E., and Thomas, W. L. J. Cultural Geography. John Wiley & Sons, Inc., 1969.
[15] Starbird, K., Muzny, G., and Palen, L. Learning from the crowd: Collaborative filtering techniques for identifying on-the-ground Twitterers during mass disruptions. In Proceedings of the 9th International ISCRAM Conference (2012), pp. 1–10.
[16] Truelove, M., Vasardani, M., and Winter, S. Testing a model of witness accounts in social media. In Proceedings of the 8th Workshop on Geographic Information Retrieval (2014), no. 10.
[17] Truelove, M., Vasardani, M., and Winter, S. Towards credibility of micro-blogs: characterising witness accounts. GeoJournal 80, 3 (2015), 339–359.
[18] Verma, S., Vieweg, S., Corvey, W. J., Palen, L., Martin, J. H., Palmer, M., Schram, A., and Anderson, K. M. Natural language processing to the rescue? Extracting "situational awareness" tweets during mass emergency. In Proceedings of the 5th International AAAI Conference on Weblogs and Social Media (2011), pp. 385–392.
[19] Zhao, S., Zhong, L., Wickramasuriya, J., and Vasudevan, V. Human as real-time sensors of social and physical events: A case study of Twitter and sports games. Tech. rep., Rice University and Motorola Labs, 2011.