Seven Years of Social Sensors

Mario Cataldi
Université Paris 8
m.cataldi@iut.univ-paris8.fr

Luigi Di Caro
University of Turin
dicaro@di.unito.it

Claudio Schifanella
University of Turin
schi@di.unito.it


SIDEWAYS 2017, Prague, Czech Republic
Copyright held by the author(s).

ABSTRACT
The aim of this paper is to review seven years of research on a specific
vision of social media, that of social sensors, i.e., alternative information
systems able to detect and characterize interesting and yet unreported
information and events in real time, crossing topics, locations and language
barriers. In particular, we present a computational exercise based on a Topic
Modeling technique over a set of papers citing what is probably the first
contribution on the conceptualization and formalization of the social sensor
keyword. By extracting topics from 367 (English) titles and correlating them
with metadata such as the year of publication and the number of received
citations, we try to shed light on interesting aspects and research
directions in the social media mining community.

Keywords
Social Network Analysis, Data Mining, Social Media, Social Networks, Topic
Detection, Event Detection, Social Sensors

1.   INTRODUCTION
Nowadays, social platforms have become the most popular communication system
all over the world. In fact, due to the short format of messages and the
accessibility of these systems, users tend to shift from traditional
communication tools (such as blogs, web sites and mailing lists) to social
networks for various purposes. Billions of messages appear daily on services
such as Twitter, Tumblr and Facebook. The authors of these messages share
content about their private life, exchanging opinions on a variety of topics
and discussing a wide range of news. Microblogging services also exploit the
immediacy of handheld smart devices.

In [8], and later in [6], we conceptualized the vision of this powerful
communication channel as a social sensor, which can be used to detect and
follow interesting and yet unreported information, and specifically unknown /
interesting / anomalous events, facts, and topics in real time, crossing
domains, locations and language barriers. Future technologies built on this
connectivity may also provide applications with automatic techniques for
generating news (filtered over user profiles), offering a sideways
alternative to the existing authoritative information media.

The considerable impact of this view in the literature motivated the
organization of a workshop on its related aspects. The international
workshop, named SIDEWAYS, which currently counts three editions, received
contributions ranging from socio-cultural studies to computational
approaches. In detail, the past two editions [7, 4] focused on the following
subtopics:

   • detect emerging events, facts, topics [21, 25, 20]
   • track the evolution over time of events, facts and topics [27]
   • enrich them with contextual information like categories and named
     entities [21]
   • identify communities and analyse large scale online/offline social
     networks [22]
   • unravel behaviours in social networks [19]
   • retrieve participatory decision making on civic social networks [26]
   • understand key social and psychological factors and problems
     [23, 10, 9, 11]
   • find relationships with other events and sources of information [26]
   • analyze privacy issues [16]

However, social sensor analysis may involve other fields of study, such as
visualization [13], collaboration networks [14], semantic annotation [3],
influence analysis [5], sentiment analysis [15, 24], irony detection [17], TV
content analysis [1], and others.

The aim of this paper is to review those research works that based their
ideas, motivations and concepts on such a social sensor view. In the light of
this, we carried out a classic Topic Modeling exercise over the collection of
papers that have cited our original conceptualization [8, 6, 14]. We thus
collected around 368 publication titles with their relative metadata, such as
the type of publication (journal or proceedings), the publication year and
the number of received citations. We then extracted topics from titles and
abstracts, correlating them along these dimensions and highlighting some
useful insights and historical perspectives for future research.

2.   BACKGROUND ON TOPIC MODELING
Topic models are fundamental tools for the extraction of regularities and
patterns, providing automatic ways to organize, search and give sense to
large data collections. The shared basic assumption is that documents have a
latent semantic structure that can be inferred from word-document
distributions.

Latent Semantic Analysis (LSA) [12] is a linear algebra-based method that
projects a word-document co-occurrence matrix into a reduced space such that
words which are close in the new space are similar. Its probabilistic and
generative version (pLSA) [18] adds a latent context variable to each word
occurrence, which explicitly accounts for polysemy.

Latent Dirichlet Allocation (LDA) [2] is a fully Bayesian probabilistic
version of LSA. Given a corpus of documents, the idea underlying LDA is that
all documents share the same set of topics, but each document exhibits those
topics in different proportions depending on the words which are present in
that document. Topics, in turn, are defined as different probability
distributions over the words of a fixed vocabulary, but they are interpreted
by restricting attention to the words with the highest estimated probability.
Only documents are observed, while the topics, the per-document topic
distributions and the per-document per-word topic assignments are latent
structures inferred from the data.

3.   TOPICS FROM SOCIAL-SENSORS LITERATURE
In this section, we show the results of an LDA topic modeling exercise
applied to the abstracts of the papers citing [8]. As already mentioned, this
paper represents one of the first works which recognized (and formalized) the
social-sensor role of social media.

Table 1: Topics extracted from the 367 English-based papers citing [8] on the
social-sensor view.

   TOPIC 1          TOPIC 2        TOPIC 3        TOPIC 4
   ------------------------------------------------------------
   social           topic          events         data
   information      topics         event          time
   media            text           twitter        model
   users            news           detection      social
   twitter          twitter        stream         networks
   user             emerging       streams        detection
   data             information    time           mining
   research         paper          real-time      trends
   analysis         clustering     information    network
   network          data           tweets         patterns
   content          results        real           problem
   people           detection      detect         online
   paper            microblog      temporal       topics
   networks         tweets         sentiment      microblogs
   ------------------------------------------------------------
   social media:    emerging       real-time      network
   content          topic          event          mining
   and users        detection      detection

Table 1 shows the 4 most significant topics¹, which we tried to label in the
last row. We decided not to pre-process the texts with advanced natural
language techniques (such as lemmatization, Named Entity Recognition and Word
Sense Disambiguation) in favor of a simple experiment bringing to light the
naturally-observed linguistic variability. Only English stopwords have been
filtered out, to ease the comprehension of the topics.

The results seem to show a quite clear map, where the main scientific effort
is divided among 1) the analysis of social media (role, impact, contents, and
user profiles), 2) the detection of emerging topics, 3) the detection of
events, and 4) network mining approaches involving community detection
techniques.

4.   SOCIAL-SENSOR TOPICS TRENDS
In this section, we present a correlation study between the extracted topics
(see previous section) and metadata such as the year of publication, the
number of received citations and the type of publication (journal or not).
Figure 1 shows the complete results of the study.

4.1    Social-Sensor Topics and Time
As can be noticed, the total amount of research in the field grew from 2011
to 2015, when it reached a kind of plateau (only few data records were
available for 2017). However, the topic "Event Detection" is the only one
that kept growing in 2016 as well. It is plausible that part of the community
working on topic detection shifted its focus to events at a certain point,
since social media are known to contain much more event-based information
than other sources of information. This is actually one of the key
motivations of the social sensor view.

4.2    Social-Sensor Topics and Impact
Another interesting aspect is the impact of the extracted topics in terms of
citations received from the research community. Figure 1 (b) shows that
social-sensor papers with low citation numbers are more about topic detection
and social media than about the other two topics. Highly-cited papers,
instead, are also about event detection, while topic detection papers
disappear on the right side of the plot. This is quite interesting, since
topic detection is among the top two topics. In a sense, it seems that most
of the work is on topic detection, although its impact on subsequent and
related research does not grow proportionally.

4.3    Social-Sensor Topics and Journals
With this analysis, we tried to understand whether social-sensor topics are
distributed in conferences and workshops as they are in journals. What we
found, as shown in Figure 1 (c), is that the distribution over journals
flattens the total number of papers across the different topics. This can
probably be interpreted as a natural quality-based filtering.

¹ We experimented with other numbers of topics, which gave less interpretable
results.
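
As a rough illustration of the setup described in Section 3 (LDA with a fixed number of topics, English stopword filtering only, no lemmatization), the following is a minimal sketch and not the code used to produce Table 1. It assumes a collapsed Gibbs sampler, which the text does not specify; the tiny stopword list, the hyperparameters `alpha` and `beta`, the function names and the toy titles are all illustrative assumptions.

```python
import random
from collections import Counter

# Tiny illustrative stopword list (an assumption; the paper does not name its list).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "with", "from"}

def tokenize(text):
    """Lowercase, strip punctuation, drop stopwords; deliberately no lemmatization."""
    tokens = (w.strip(".,:;()\"'").lower() for w in text.split())
    return [w for w in tokens if w and w not in STOPWORDS]

def lda_gibbs(docs, n_topics=4, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.

    Returns per-document topic counts, per-topic word counts and the vocabulary.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []
    # Random initial topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment, then resample it.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word, vocab

def top_words(topic_word, n=5):
    """Most frequent words per topic, as used to read and label a topic."""
    return [[w for w, c in tw.most_common(n) if c > 0] for tw in topic_word]

# Toy, made-up titles standing in for the collected corpus.
titles = [
    "Emerging topic detection on Twitter",
    "Event detection in the Twitter stream",
    "Community detection in social networks",
    "Mining patterns in online social networks",
]
docs = [tokenize(t) for t in titles]
doc_topic, topic_word, vocab = lda_gibbs(docs, n_topics=2, n_iter=50)
for k, words in enumerate(top_words(topic_word)):
    print(f"TOPIC {k + 1}:", ", ".join(words))
```

On a corpus this small the resulting topics are not meaningful; the sketch only shows where the number of topics and the stopword filter enter the procedure.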

Figure 1: Correlation between the 4 topics extracted from the 367 English-based papers citing [8] with (a)
year of publication, (b) number of citations and (c) type of publication.
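
The kind of cross-tabulation summarized in Figure 1 could be computed along the following lines. This is only a sketch under stated assumptions: each paper is assigned to its single highest-weighted topic, citation counts are grouped into arbitrary bins, and the record fields (`year`, `citations`, `journal`, `topics`) and all values are hypothetical. The paper does not describe how the plots were actually built.

```python
from collections import Counter, defaultdict

def dominant_topic(weights):
    """Index of the topic with the largest weight (assumed assignment rule)."""
    return max(range(len(weights)), key=lambda k: weights[k])

def citation_bin(n, edges=(10, 50)):
    """Map a citation count to a coarse bin; the edges are an arbitrary assumption."""
    low, high = edges
    if n < low:
        return f"<{low}"
    if n < high:
        return f"{low}-{high - 1}"
    return f">={high}"

def crosstab(papers, key):
    """Count papers per (metadata value, dominant topic)."""
    table = defaultdict(Counter)
    for paper in papers:
        table[key(paper)][dominant_topic(paper["topics"])] += 1
    return dict(table)

# Hypothetical records: metadata plus per-topic weights for each paper.
papers = [
    {"year": 2014, "citations": 3, "journal": False, "topics": [0.1, 0.6, 0.2, 0.1]},
    {"year": 2014, "citations": 120, "journal": True, "topics": [0.1, 0.1, 0.7, 0.1]},
    {"year": 2016, "citations": 15, "journal": False, "topics": [0.2, 0.1, 0.6, 0.1]},
]

by_year = crosstab(papers, lambda p: p["year"])                         # panel (a)
by_citations = crosstab(papers, lambda p: citation_bin(p["citations"]))  # panel (b)
by_type = crosstab(papers, lambda p: p["journal"])                      # panel (c)

for year in sorted(by_year):
    print(year, dict(by_year[year]))
```

Each resulting table maps a metadata value to a count of papers per topic index, which is the shape of data a stacked bar chart per panel would need.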
5.   REFERENCES
 [1] A. Antonini, L. Vignaroli, C. Schifanella, R. G. Pensa, and M. L. Sapino.
     Mesoontv: a media and social-driven ontology-based TV knowledge
     management system. In Proceedings of the 24th ACM Conference on
     Hypertext and Social Media, pages 208–213. ACM, 2013.
 [2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation.
     Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
 [3] G. Boella and L. Di Caro. Extracting definitions and hypernym relations
     relying on syntactic dependencies and support vector machines. In
     ACL (2), pages 532–537, 2013.
 [4] L. Di Caro, M. Cataldi, and C. Schifanella, editors. Proceedings of the
     2nd International Workshop on Social Media World Sensors, SIDEWAYS 2016,
     co-located with 10th International Conference on Language Resources and
     Evaluation (LREC 2016), Portoroz, Slovenia, May 24, 2016, volume 1696 of
     CEUR Workshop Proceedings. CEUR-WS.org, 2016.
 [5] M. Cataldi and M.-A. Aufaure. The 10 million follower fallacy: audience
     size does not prove domain-influence on Twitter. Knowledge and
     Information Systems, 44(3):559–580, 2015.
 [6] M. Cataldi, L. Di Caro, and C. Schifanella. Personalized emerging topic
     detection based on a term aging model. ACM Transactions on Intelligent
     Systems and Technology (TIST), 5(1):7, 2013.
 [7] M. Cataldi, L. Di Caro, and C. Schifanella, editors. Proceedings of the
     1st ACM Workshop on Social Media World Sensors, Guzelyurt, SIdEWayS@HT
     2015, TRNC, Cyprus, September 1, 2015. ACM, 2015.
 [8] M. Cataldi, L. Di Caro, and C. Schifanella. Emerging topic detection on
     Twitter based on temporal and social terms evaluation. In Proceedings of
     the Tenth International Workshop on Multimedia Data Mining, page 4.
     ACM, 2010.
 [9] M. D. Choudhury. Social media for mental illness risk assessment,
     prevention and support. In Cataldi et al. [7], page 1.
[10] C. Colella. Distrusting science on communication platforms:
     Socio-anthropological aspects of the science-society dialectic within a
     phytosanitary emergency. In Di Caro et al. [4], pages 19–24.
[11] L. H. M. S. S. K. Dane Bell, Daniel Fried. Challenges for using social
     media for early detection of T2DM. In Di Caro et al. [4].
[12] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and
     R. Harshman. Indexing by latent semantic analysis. Journal of the
     American Society for Information Science, 41(6):391, 1990.
[13] L. Di Caro, K. S. Candan, and M. L. Sapino. Navigating within news
     collections using tag-flakes. Journal of Visual Languages & Computing,
     22(2):120–139, 2011.
[14] L. Di Caro, M. Cataldi, and C. Schifanella. The d-index: Discovering
     dependences among scientific collaborators from their bibliographic data
     records. Scientometrics, 93(3):583–607, 2012.
[15] L. Di Caro and M. Grella. Sentiment analysis via dependency parsing.
     Computer Standards & Interfaces, 35(5):442–453, 2013.
[16] C. Ellwein and B. Noller. Social media mining: Impact of the business
     model and privacy settings. In Cataldi et al. [7], pages 3–8.
[17] A. Gianti, C. Bosco, V. Patti, A. Bolioli, and L. Di Caro. Annotating
     irony in a novel Italian corpus for sentiment analysis. In Proceedings
     of the 4th Workshop on Corpora for Research on Emotion Sentiment and
     Social Signals, pages 1–7, 2012.
[18] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of
     the 22nd Annual International ACM SIGIR Conference on Research and
     Development in Information Retrieval, pages 50–57. ACM, 1999.
[19] R. Kikas, M. Dumas, and A. Saabas. Explaining international migration in
     the Skype network: The role of social network features. In Cataldi et
     al. [7], pages 17–22.
[20] T. Kreutz and M. Nissim. Catching events in the Twitter stream: A
     showcase of student projects. In Di Caro et al. [4], pages 14–18.
[21] K. S. C. R. P. M. L. S. Luca Vignaroli, Claudio Schifanella. Tracking
     and analyzing the "second life" of TV content: a media and social-driven
     framework. In Di Caro et al. [4].
[22] P. S. Ludu. Inferring latent attributes of an Indian Twitter user using
     celebrities and class influencers. In Cataldi et al. [7], pages 9–15.
[23] C. F. U. K. Massimo Poesio, Ayman Alhelbawy. Exploiting social media to
     address fundamental human rights. In Di Caro et al. [4].
[24] L. Robaldo and L. Di Caro. Opinionmining-ml. Computer Standards &
     Interfaces, 35(5):454–469, 2013.
[25] E. D. Rosa and A. Durante. App2check: a machine learning-based system
     for sentiment analysis of app reviews in Italian language. In Di Caro et
     al. [4], pages 8–13.
[26] A. Ruggeri and G. Boella. Gibsonian modeling of users in social
     networks. In Di Caro et al. [4], pages 25–31.
[27] G. Siragusa. Place as topics: Analysis of spatial and temporal evolution
     of topics from social networks data. In Di Caro et al. [4], pages 32–35.