Seven Years of Social Sensors

Mario Cataldi, Université Paris 8, m.cataldi@iut.univ-paris8.fr
Luigi Di Caro, University of Turin, dicaro@di.unito.it
Claudio Schifanella, University of Turin, schi@di.unito.it

SIDEWAYS 2017, Prague, Czech Republic. Copyright held by the author(s).

ABSTRACT
The aim of this paper is to review seven years of research on a specific vision of social media, that of social sensors, i.e., alternative information systems able to detect and characterize interesting and yet unreported information and events in real time, crossing topics, locations and language barriers. In particular, we present a computational exercise based on a Topic Modeling technique over a set of papers citing what is probably the first contribution on the conceptualization and formalization of the social sensor keyword. By extracting topics from 367 (English) titles and correlating them with metadata such as the year of publication and the number of received citations, we try to shed light on interesting aspects and research directions in the social media mining community.

Keywords
Social Network Analysis, Data Mining, Social Media, Social Networks, Topic Detection, Event Detection, Social Sensors

1. INTRODUCTION
Nowadays, social platforms have become the most popular communication system all over the world. Due to the short format of messages and the accessibility of these systems, users tend to shift from traditional communication tools (such as blogs, web sites and mailing lists) to social networks for various purposes. Billions of messages appear daily on services such as Twitter, Tumblr and Facebook. The authors of these messages share content about their private life, exchange opinions on a variety of topics and discuss a wide range of news. Microblogging services also exploit the immediacy of handy smart devices.

In [8], and later in [6], we conceptualized this powerful communication channel as a social sensor, which can be used to detect and follow interesting and yet unreported information, and specifically unknown / interesting / anomalous events, facts, and topics in real time, crossing domains, locations and language barriers. Future technologies built on this connectivity may also provide applications with automatic techniques for the generation of news (filtered over user profiles), offering a sideways alternative to the existing authoritative information media.

The fairly high impact of this view in the literature motivated the organization of a workshop on its related aspects. The international workshop named SIDEWAYS, which currently counts three editions, has received interesting material ranging from socio-cultural contributions to computational approaches. In detail, the past two editions [7, 4] focused on the following subtopics:

• detect emerging events, facts, topics [21, 25, 20]
• track the evolution over time of events, facts and topics [27]
• enrich them with contextual information like categories and named entities [21]
• identify communities and analyse large scale online/offline social networks [22]
• unravel behaviours in social networks [19]
• retrieve participatory decision making on civic social networks [26]
• understand key social and psychological factors and problems [23, 10, 9, 11]
• find relationships with other events and sources of information [26]
• analyze privacy issues [16]

However, Social Sensor analysis may involve other fields of study such as visualization [13], collaboration networks [14], semantic annotation [3], influence analysis [5], Sentiment Analysis [15, 24], irony detection [17], TV content analysis [1], and others.

The aim of this paper is to review those research works that based their ideas, motivations and concepts on this social sensor view. In the light of this, we carried out a classic Topic Modeling exercise over the collection of papers that have cited our original conceptualization [8, 6, 14]. We thus collected around 368 publication titles with their relative metadata, such as the type of publication (journal or proceedings), the publication year and the number of received citations. We then extracted topics from titles and abstracts and correlated them along these dimensions, highlighting some useful insights and historical perspectives for future research.

2. BACKGROUND ON TOPIC MODELING
Topic models are fundamental tools for the extraction of regularities and patterns, providing automatic ways to organize, search and give sense to large data collections. The shared basic assumption is that documents have a latent semantic structure that can be inferred from word-document distributions.

Latent Semantic Analysis (LSA) [12] is a linear algebra-based method that maps a word-document co-occurrence matrix into a reduced space such that words which are close in the new space are similar. Its probabilistic and generative version (pLSA) [18] adds a latent context variable to each word occurrence, which explicitly accounts for polysemy.

Latent Dirichlet Allocation (LDA) [2] is a fully Bayesian probabilistic version of LSA. Given a corpus of documents, the idea underlying LDA is that all documents share the same set of topics, but each document exhibits those topics in different proportions depending on the words which are present in that document. Topics, in turn, are defined as different probability distributions over the words of a fixed vocabulary, but they are interpreted by restricting attention to words with the highest estimated frequency. Only documents are observed, while the topics, the per-document topic distributions and the per-document per-word topic assignments are latent structures inferred from the data.

3. TOPICS FROM SOCIAL-SENSORS LITERATURE
In this section, we show the results of an LDA topic modeling exercise applied on the abstracts of the papers citing [8]. As already mentioned, that paper represents one of the first works to recognize (and formalize) the role of social media as social sensors.

Table 1 shows the 4 most significant topics (we experimented with other numbers of topics, which gave less interpretable results), which we tried to label in the last row. We decided not to pre-process the texts with advanced natural language techniques (such as lemmatization, Named Entity Recognition and Word Sense Disambiguation) in favor of a simple experiment bringing to light the naturally-observed linguistic variability. Only English stopwords have been filtered out, to make the topics easier to interpret.

Table 1: Topics extracted from the 367 English-based papers citing [8] on the social-sensor view.

TOPIC 1                          TOPIC 2                   TOPIC 3                    TOPIC 4
social                           topic                     events                     data
information                      topics                    event                      time
media                            text                      twitter                    model
users                            news                      detection                  social
twitter                          twitter                   stream                     networks
user                             emerging                  streams                    detection
data                             information               time                       mining
research                         paper                     real-time                  trends
analysis                         clustering                information                network
network                          data                      tweets                     patterns
content                          results                   real                       problem
people                           detection                 detect                     online
paper                            microblog                 temporal                   topics
networks                         tweets                    sentiment                  microblogs
social media: content and users  emerging topic detection  real-time event detection  network mining

The results seem to show a quite clear map, where the main scientific effort is divided among 1) the analysis of social media (role, impact, contents, and user profiles), 2) the detection of emerging topics or 3) events, and 4) network mining approaches involving community detection techniques.

4. SOCIAL-SENSOR TOPIC TRENDS
In this section, we present a correlation study between the extracted topics (see previous section) and metadata such as the year of publication, the number of received citations and the type of publication (journal or not). Figure 1 shows the whole result of the study.

Figure 1: Correlation between the 4 topics extracted from the 367 English-based papers citing [8] with (a) year of publication, (b) number of citations and (c) type of publication.

4.1 Social-Sensor Topics and Time
As can be seen in Figure 1 (a), the total amount of research in the field grew from 2011 to 2015, when it reached a kind of convergence (2017 had only a few data records). However, the topic "Event Detection" is the only one that kept growing in 2016 as well. It is possible that part of the community working on topic detection shifted its focus to events at a certain point, since Social Media is known to contain much more event-based information than other sources of information. This is actually one of the key motivations of the social sensor view.

4.2 Social-Sensor Topics and Impact
Another interesting aspect was to analyze the impact of the extracted topics in terms of citations received from the research community. Figure 1 (b) shows that social-sensor papers with low citation numbers are more about topic detection and social media than about the other two topics. Highly-cited papers, instead, are also about event detection, while topic detection papers disappear on the right side of the plot. This is quite interesting, since topic detection is the top-2 topic. In a sense, it seems that most of the work is on topic detection, though this does not translate linearly into impact on future and contextual research.

4.3 Social-Sensor Topics and Journals
With this analysis, we tried to understand whether social-sensor topics are distributed in conferences and workshops as they are in journals. What we found, as shown in Figure 1 (c), is that the distribution on journals flattens the total number of papers across the different topics. This can probably be interpreted as a quality-based natural filtering.

5. REFERENCES
[1] A. Antonini, L. Vignaroli, C. Schifanella, R. G. Pensa, and M. L. Sapino. MeSoOnTV: a media and social-driven ontology-based TV knowledge management system. In Proceedings of the 24th ACM Conference on Hypertext and Social Media, pages 208–213. ACM, 2013.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
[3] G. Boella and L. Di Caro. Extracting definitions and hypernym relations relying on syntactic dependencies and support vector machines. In ACL (2), pages 532–537, 2013.
[4] L. D. Caro, M. Cataldi, and C. Schifanella, editors. Proceedings of the 2nd International Workshop on Social Media World Sensors, SIDEWAYS 2016, co-located with 10th International Conference on Language Resources and Evaluation (LREC 2016), Portoroz, Slovenia, May 24, 2016, volume 1696 of CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[5] M. Cataldi and M.-A. Aufaure. The 10 million follower fallacy: audience size does not prove domain-influence on Twitter. Knowledge and Information Systems, 44(3):559–580, 2015.
[6] M. Cataldi, L. D. Caro, and C. Schifanella. Personalized emerging topic detection based on a term aging model. ACM Transactions on Intelligent Systems and Technology (TIST), 5(1):7, 2013.
[7] M. Cataldi, L. D. Caro, and C. Schifanella, editors. Proceedings of the 1st ACM Workshop on Social Media World Sensors, SIdEWayS@HT 2015, Guzelyurt, TRNC, Cyprus, September 1, 2015. ACM, 2015.
[8] M. Cataldi, L. Di Caro, and C. Schifanella. Emerging topic detection on Twitter based on temporal and social terms evaluation. In Proceedings of the Tenth International Workshop on Multimedia Data Mining, page 4. ACM, 2010.
[9] M. D. Choudhury. Social media for mental illness risk assessment, prevention and support. In Cataldi et al. [7], page 1.
[10] C. Colella. Distrusting science on communication platforms: Socio-anthropological aspects of the science-society dialectic within a phytosanitary emergency. In Caro et al. [4], pages 19–24.
[11] L. H. M. S. S. K. Dane Bell, Daniel Fried. Challenges for using social media for early detection of T2DM. In Caro et al. [4].
[12] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391, 1990.
[13] L. Di Caro, K. S. Candan, and M. L. Sapino. Navigating within news collections using tag-flakes. Journal of Visual Languages & Computing, 22(2):120–139, 2011.
[14] L. Di Caro, M. Cataldi, and C. Schifanella. The d-index: Discovering dependences among scientific collaborators from their bibliographic data records. Scientometrics, 93(3):583–607, 2012.
[15] L. Di Caro and M. Grella. Sentiment analysis via dependency parsing. Computer Standards & Interfaces, 35(5):442–453, 2013.
[16] C. Ellwein and B. Noller. Social media mining: Impact of the business model and privacy settings. In Cataldi et al. [7], pages 3–8.
[17] A. Gianti, C. Bosco, V. Patti, A. Bolioli, and L. Di Caro. Annotating irony in a novel Italian corpus for sentiment analysis. In Proceedings of the 4th Workshop on Corpora for Research on Emotion Sentiment and Social Signals, pages 1–7, 2012.
[18] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50–57. ACM, 1999.
[19] R. Kikas, M. Dumas, and A. Saabas. Explaining international migration in the Skype network: The role of social network features. In Cataldi et al. [7], pages 17–22.
[20] T. Kreutz and M. Nissim. Catching events in the Twitter stream: A showcase of student projects. In Caro et al. [4], pages 14–18.
[21] K. S. C. R. P. M. L. S. Luca Vignaroli, Claudio Schifanella. Tracking and analyzing the "second life" of TV content: a media and social-driven framework. In Caro et al. [4].
[22] P. S. Ludu. Inferring latent attributes of an Indian Twitter user using celebrities and class influencers. In Cataldi et al. [7], pages 9–15.
[23] C. F. U. K. Massimo Poesio, Ayman Alhelbawy. Exploiting social media to address fundamental human rights. In Caro et al. [4].
[24] L. Robaldo and L. Di Caro. OpinionMining-ML. Computer Standards & Interfaces, 35(5):454–469, 2013.
[25] E. D. Rosa and A. Durante. App2Check: a machine learning-based system for sentiment analysis of app reviews in Italian language. In Caro et al. [4], pages 8–13.
[26] A. Ruggeri and G. Boella. Gibsonian modeling of users in social networks. In Caro et al. [4], pages 25–31.
[27] G. Siragusa. Place as topics: Analysis of spatial and temporal evolution of topics from social networks data. In Caro et al. [4], pages 32–35.
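As a rough illustration of the topic-modeling exercise described in Section 3 (stopword filtering with no further pre-processing, followed by LDA with a fixed number of topics), the following minimal Python sketch may help. It is not the implementation used in the paper, which does not state its toolkit, hyperparameters or stopword list. Several things here are assumptions made for illustration: the tiny stopword set, the values of `alpha`, `beta` and `iterations`, the function names, and the four invented titles. Inference uses collapsed Gibbs sampling, a common alternative to the variational method of [2], chosen only because it fits in a few lines.

```python
import random
from collections import Counter

# Tiny illustrative stopword list (an assumption; the paper does not name its list).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    words = (w.strip(".,:;()[]\"'").lower() for w in text.split())
    return [w for w in words if w and w not in STOPWORDS]


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.

    Returns (doc_topic, topic_word): per-document topic counts and
    per-topic word counts after the final sweep.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # p(z = k | rest) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


def top_words(topic_word, n=3):
    """Highest-count words per topic, analogous to the columns of Table 1."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]


# Hypothetical titles, invented for illustration only.
titles = [
    "Emerging topic detection on Twitter",
    "Event detection in the Twitter stream",
    "Community detection in social networks",
    "Mining patterns in online social networks",
]
docs = [tokenize(t) for t in titles]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
print(top_words(topic_word))
```

Under the same assumptions, the dominant topic of each document (the largest entry in its `doc_topic` row) could then be tallied against metadata such as publication year, citation count or publication type, in the spirit of the correlations of Section 4.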