=Paper= {{Paper |id=Vol-1749/paper40 |storemode=property |title=FB-NEWS15: A Topic–Annotated Facebook Corpus for Emotion Detection and Sentiment Analysis |pdfUrl=https://ceur-ws.org/Vol-1749/paper40.pdf |volume=Vol-1749 |authors=Lucia C. Passaro,Alessandro Bondielli,Alessandro Lenci |dblpUrl=https://dblp.org/rec/conf/clic-it/PassaroBL16 }} ==FB-NEWS15: A Topic–Annotated Facebook Corpus for Emotion Detection and Sentiment Analysis== https://ceur-ws.org/Vol-1749/paper40.pdf
             FB-NEWS15: A Topic-Annotated Facebook Corpus for
                 Emotion Detection and Sentiment Analysis
                Lucia C. Passaro, Alessandro Bondielli and Alessandro Lenci
                CoLing Lab, Dipartimento di Filologia, Letteratura e Linguistica
                                   University of Pisa (Italy)
                            lucia.passaro@for.unipi.it
                         alessandro.bondielli@gmail.com
                             alessandro.lenci@unipi.it

Abstract

    English. In this paper we present the FB-NEWS15 corpus, a new Italian
    resource for sentiment analysis and emotion detection. The corpus has
    been built by crawling the Facebook pages of the most important
    newspapers in Italy and it has been organized into topics using LDA.
    In this work we provide a preliminary analysis of the corpus,
    including the most debated news in 2015.

    Italiano. In questo lavoro presentiamo il corpus FB-NEWS15, un corpus
    italiano creato per scopi di sentiment analysis ed emotion detection.
    Il corpus è stato costruito scaricando le pagine Facebook delle
    maggiori testate giornalistiche in Italia e successivamente
    organizzato in topic utilizzando LDA. In questo articolo forniamo una
    analisi preliminare del corpus, e mostriamo le notizie più discusse
    nel 2015.

1   Introduction

The use of Social Network (SN) platforms like Facebook and Twitter has
grown overwhelmingly in recent years. SN are exploited for different
purposes, ranging from sharing content among friends and useful contacts
to gathering news about domains such as politics and sports (Ahmad, 2010;
Ahmad, 2013; Sheffer and Schultz, 2010). Many journalists indeed use SN
platforms for professional reasons (Oriella, 2013; Hermida, 2013).
   Several recent studies provide insights on how the popularity of blogs
and other user-generated content has affected the way in which news is
consumed and reported. Picard (2009) states that SN platforms provide an
easy and affordable way to take part in discussions with larger groups of
people and, consequently, the bond between SN and information is becoming
increasingly strong.
   Mass information is gradually moving towards general platforms, and
official websites are losing their lead position in providing
information. As noted by Newman et al. (2012), even though the use of the
internet grew in the years 2009-2012, the same is not reflected in the
consumption of online newspapers, probably because of the increasing use
of SN for news diffusion and gathering. If on the one hand this apparent
decline of the traditional news platforms may lead to a decline in
quality and news coverage (Chyi and Lasorsa, 2002), on the other hand the
rise of SN as platforms to spread news promotes a more fervid debate
between users (Shah et al., 2005). This issue is central for the present
work. In fact, users’ comments very often contain their own opinions
about a certain issue. In addition, because of the colloquial style of
the comments, they contain large amounts of words and collocations with a
high subjective content, mostly concerning the author’s emotive stance.
   Facebook is one of the most popular online SN in the world, with 1
billion active users per month, and it offers the possibility to collect
data from people of different ages, educational levels and cultures. From
a linguistic point of view, previous studies (Lin and Qiu, 2013)
demonstrated that the language in Facebook is more emotional and
interpersonal compared, for example, to the language in Twitter. This is
probably due to the fact that in Facebook there is a stronger
psychological closeness between the author and the audience because of
the different structure (bidirectional vs. unidirectional graphs) of the
SNs.
   In this paper we present the FB-NEWS15 corpus, a new Italian resource
for sentiment analysis and emotion detection. The FB-NEWS15 corpus can be
freely downloaded at colinglab.humnet.unipi.it/resources/ under the
Creative Commons Attribution License creativecommons.org/licenses/by/2.0
[1].

    [1] All data collected have been processed anonymously for scientific
    purposes, without storing personal information.

   The debate among users in commenting news and posts on Facebook offers
a lot of subjective material to study the way in which people express
their own opinions and emotions about a target event. In fact, in
FB-NEWS15 we find linguistic items expressing the whole range of positive
and negative emotions. In analyzing a news corpus, however, it is not
simple to aggregate the posts on the basis of a certain fact, since
several posts relate to the same event. For this reason, we decided to
organize the corpus into clusters of topically related news identified
with Latent Dirichlet Allocation (LDA: Blei et al. (2003)). This approach
allows us to infer the most debated news in the corpus and, in a second
step, to discover the readers’ sentiment about a particular topic.
   The paper is organized as follows: Section 2 describes the creation of
the corpus, from crawling (2.1) to linguistic annotation (2.2), and
finally provides basic corpus statistics (2.3). Section 3 reports on the
automatic topic extraction with LDA.

2   FB-NEWS15

For the creation of the corpus we followed the most important Italian
newspapers. Since we were interested in building a corpus as
heterogeneous as possible, we decided to focus on major newspapers with
different political orientations, and which have in general heterogeneous
readers.
   Facebook allows users to post statuses, links, photos and videos on
their own wall. In general, users can be divided into two
macro-categories: People and Pages. People are often individuals, and the
interaction with them is usually bidirectional (user A can read what user
B publishes if A and B have a friendship relation). Conversely, Pages are
typically used to represent organizations, public figures (web stars),
companies or, as in our case, newspapers. In this case, the relationship
is unidirectional, in the sense that user A can access the timeline of
the page P by putting a “Like” on P. Unlike a single user, who usually
publishes photos, videos and links about their private life, the timeline
of a newspaper Facebook page in general contains news titles with a link
to the official website of the newspaper, where the user can read the
full article. The corpus keeps track of the threefold hierarchical
structure of Facebook, which includes the news posts by the newspaper,
the users’ comments to the posts and the replies to the comments. In this
context, it is clear that the emotive content of the post is often
neutral, but this post can inspire long discussions among readers, which
can become useful material for sentiment analysis and emotion detection.
Figure 1 shows a post, with some of its comments and replies.

Figure 1: Example of a Facebook post with its comments and replies.

   In order to create FB-NEWS15, we decided to download the timeline of
the following newspapers, from 1 January to 31 December 2015: La
Repubblica, Il Giornale, L’Avvenire, Libero, Il Fatto Quotidiano,
Rainews24, Corriere Della Sera, Huffington Post Italia.

2.1   Crawling

Facebook offers developers Application Programming Interfaces (APIs) for
creating apps with Facebook’s native functionalities. In order to develop
the crawler, we exploited the Graph API, which provides a simple view of
the Facebook social graph by showing the objects in the graph and the
connections between them. The Graph API allows us to navigate through the
graph of the social network, which is organized into nodes
(Users, Pages, Photos and Comments) and edges (connections such as
Friendship or Likes). The graph is navigated through HTTP requests, which
may be implemented in any programming language. The native APIs offered
by Facebook have some drawbacks: i) the maintenance of the app, since the
APIs change over time, making it necessary to update the code of the
crawler; ii) only public data can be accessed without requiring the
user’s consent; iii) Facebook places limitations on the number of
requests within a given period of time. For each post, comment and reply,
we stored the message (text), the story (presence of photos and links
tags), its timestamp, the type (post, comment, reply), the parent
post/comment, and the number of likes, shares and replies (Figure 2).

    Un business truffaldino [E ora finitela con l’eco-balla dei
    controlli sulle emissioni]
    (A fraudulent business [And now stop it with the eco-hoax of the
    emission checks])

Figure 2: Example of crawled text.

2.2   Linguistic annotation

A very basic preprocessing phase has been applied to the corpus before
linguistic annotation, to replace URLs with the tag URL. The text has
subsequently been fed to a pipeline of general-purpose NLP tools. In
particular, it has been POS-tagged with the Part-Of-Speech tagger
described in Dell’Orletta (2009) and dependency-parsed with the DeSR
parser (Attardi et al., 2009). In addition, complex terms like forze
dell’ordine (security force) or toccare il fondo (hit rock bottom) have
been identified using the EXTra term extraction tool (Passaro and Lenci,
2015).

2.3   Corpus Analysis

Except for Avvenire and Rainews24, for which we downloaded very little
data, the other newspapers are attested in the corpus in a balanced way.
In general, the number of posts is very low compared to the number of
comments and replies. [...] distribution for each Newspaper.

    NEWSPAPER              N. OF TEXTS
    La Repubblica            4,558,829
    Avvenire                    91,824
    Il Giornale              3,497,610
    Libero                   2,436,246
    Il Fatto Quotidiano      4,900,314
    Rainews24                  369,834
    Huffington Post          1,552,042
    Corriere della Sera      3,553,966
    OVERALL                 20,960,665

Table 1: Number of texts aggregated by Newspaper in FB-NEWS15.

   Table 2 shows the total number of tokens for each page and the average
number of texts produced for each post on each page. We can notice that
the most followed newspapers on Facebook are Il Fatto Quotidiano and La
Repubblica.

    NEWSPAPER                   TOKENS    TEXTS/POSTS
    La Repubblica           96,059,756         182.61
    Avvenire                 2,611,899          12.65
    Il Giornale             64,345,260          77.93
    Libero                  41,166,457          81.87
    Il Fatto Quotidiano     99,025,541         193.33
    Rainews24                7,735,908          10.21
    Huffington Post         32,587,065          84.06
    Corriere della Sera     64,197,579          95.01
    OVERALL                407,729,465          94.83

Table 2: Tokens and Texts/Posts ratio per page.

Figure 3: Cumulative distribution of posts, comments and replies in
FB-NEWS15 for each Newspaper.

3   Topics in FB-NEWS15

FB-NEWS15 contains texts referring to a large variety of events. In order
to organize the corpus into clusters of thematically related news, we
used LDA (Blei et al., 2003). LDA represents documents as random mixtures
over latent topics, where each topic is characterized by a distribution
over words. These random mixtures express a document’s semantic content,
and document similarity can be estimated by looking at how similar the
corresponding topic mixtures are. For the topic identification we used
the software Mallet (McCallum, 2002).
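
The idea that document similarity can be read off the topic mixtures can
be illustrated with a minimal sketch. The paper does not state which
measure is used to compare mixtures, so the Jensen-Shannon divergence
below is an assumption, and the function names and the four-topic example
mixtures are hypothetical illustrations rather than data from FB-NEWS15.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms where p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two topic mixtures.

    With base-2 logarithms the value lies in [0, 1]:
    0 for identical mixtures, 1 for mixtures with disjoint support.
    """
    if len(p) != len(q):
        raise ValueError("topic mixtures must have the same number of topics")
    # Midpoint distribution; it is non-zero wherever p or q is non-zero,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return (kl_divergence(p, m) + kl_divergence(q, m)) / 2


def topic_similarity(p, q):
    """Similarity score in [0, 1] derived from the divergence (assumed choice)."""
    return 1.0 - js_divergence(p, q)


# Hypothetical topic mixtures over four topics for three posts.
post_a = [0.70, 0.20, 0.05, 0.05]  # mostly topic 0
post_b = [0.60, 0.25, 0.10, 0.05]  # also mostly topic 0
post_c = [0.05, 0.05, 0.10, 0.80]  # mostly topic 3

print("a vs b:", round(topic_similarity(post_a, post_b), 3))
print("a vs c:", round(topic_similarity(post_a, post_c), 3))
```

Under these assumptions, the two posts dominated by the same topic
receive a higher similarity score than the pair dominated by different
topics. Other measures over probability distributions (for instance the
Hellinger distance) could be substituted without changing the overall
idea.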


3.1   Selecting the vocabulary

Since we were interested in extracting the topics from the news articles,
we built the model on the portion of FB-NEWS15 containing the posts
(FB-NEWS15 posts) published by the newspapers. In particular, we used
entropy (Dumais, 1990) as a global term weighting and we selected for
training the terms (nouns, adjectives, verbs and complex terms) with a
high informative value (threshold fixed to 0.3), while using the
remaining words as stopwords in Mallet (McCallum, 2002).

3.2   Extracting topics from posts

In order to determine the most debated topics in 2015, we used LDA to
assign 50 topics to the posts in FB-NEWS15 posts and we navigated the
graph to assign the topics to the comments and the replies. Later, we
restricted the topics associated with a post P to the topics T having a
probability higher than the 90th percentile of the topic distribution of
P. In this way, each post has been assigned, on average, to 3.06 topics.
Finally, comments and replies have inherited the probability of belonging
to the topic T from their parent post. Among the extracted topics, ranked
according to the sum of these probabilities, we can find national and
foreign politics, terrorism and church, but also food, football, cinema
and weather forecast. We report some topics below, with the number of
texts and the relative ranking (i.e., rank 1 is given to the topic with
the highest number of texts).

 NATIONAL POLITICS (2,516,640 TEXTS, RANK 1): {Renzi, presidente,
    premier, Mattarella, riforma, Alfano, senato, camera, Boschi, aula}
    (Renzi, president, premier, Mattarella, reform, Alfano, senate,
    chamber, Boschi, hall)

 SCHOOL (1,707,145 TEXTS, RANK 2): {scuola, giovane, studente, protesta,
    corso, mancare, sospendere, inglese, spiegare, lezione} (school,
    young, student, protest, class, lack, suspend, English, explain,
    lesson)

 CRIME (1,543,735 TEXTS, RANK 7): {uccidere, polizia, arrestare, fermare,
    sparare, uomo, poliziotto, colpo, ferire, agente} (kill, police,
    detain, stop, open fire, man, policeman, shot, wound, police officer)

 ISIS (1,267,749 TEXTS, RANK 16): {Isis, guerra, siria, minaccia, U.S.A.,
    Libia, colpire, islamico, usare, jihadisti} (Isis, war, Syria,
    threat, U.S.A., Libya, damage, islamic, use, jihadist)

 FOOD (949,520 TEXTS, RANK 40): {mangiare, ricetta, cibo, preparare,
    consiglio, evitare, perfetto, trucco, salute, semplice} (eat,
    recipe, food, prepare, advice, avoid, perfect, trick, health, simple)

 FOOTBALL (606,560 TEXTS, RANK 50): {seguire la diretta, guardare il
    video, campo, calcio, serie, Napoli, Milan, segnare, battere,
    partita} (follow the live broadcast, watch the video, football field,
    football, league, Naples, Milan, score, beat, match)

4   Conclusions and ongoing work

As one of the most widespread social networks, Facebook offers the
possibility to collect opinionated pieces of text from people of
different ages, cultures and education. The composition of FB-NEWS15, in
which each comment is explicitly associated with a particular post,
allows us to study the differences in terms of readers’ perceptions about
a particular topic. Unlike other social media such as Twitter, Facebook
contains longer texts including a lot of subjective expressions that are
very useful for the construction of sentiment and emotive lexicons.
   Starting from previous works (Passaro et al., 2015; Passaro and Lenci,
2016), we plan to use this corpus to build lexical resources for
sentiment analysis and emotion detection, which will include both words
and complex terms. In addition, we plan to optimize the topic modeling
phase and to investigate the possibility of using the extracted topics as
a prior for inferring the sentiment orientation of a particular comment.

References

A. Ahmad. 2010. Is twitter a useful tool for journalists? Journal of
  Media Practice, 11(2):145–155.

A. Ahmad. 2013. Whats in a tweet? foreign correspondents use of social
  media. Journalism Practice, 7(1):33–46.

G. Attardi, F. Dell’Orletta, M. Simi, and J. Turian. 2009. Accurate
  dependency parsing with a stacked multilayer perceptron. In EVALITA
  2009 Evaluation of NLP and Speech Tools for Italian 2009, LNCS, Reggio
  Emilia (Italy). Springer.

D. M. Blei, A. Y. Ng, and M. I. Jordan. 2003. Latent dirichlet
  allocation. The Journal of Machine Learning Research, 3:993–1022.

H. Chyi and D. L. Lasorsa. 2002. An explorative study on the market
  relation between online and print newspapers. Journal of Media
  Economics, 15(2):91–106.

F. Dell’Orletta. 2009. Ensemble system for part-of-speech tagging. In
  EVALITA 2009 Evaluation of NLP and Speech Tools for Italian 2009, LNCS,
  Reggio Emilia (Italy). Springer.

S. T. Dumais. 1990. Enhancing performance in latent semantic indexing
  (lsi) retrieval. Technical Report TM-ARH-017527.

A. Hermida. 2013. #journalism. reconfiguring journalism research about
  twitter, one tweet at a time. Digital Journalism.

H. Lin and L. Qiu. 2013. Two sites, two voices: Linguistic differences
  between facebook status updates and tweets. In P. L. Patrick Rau,
  editor, Cross-Cultural Design. Cultural Differences in Everyday Life:
  5th International Conference, CCD 2013, Held as Part of HCI
  International 2013, volume 2, pages 432–440, Las Vegas (USA). Springer
  Berlin Heidelberg.

A. K. McCallum. 2002. Mallet: A machine learning for language toolkit.
  http://mallet.cs.umass.edu.

N. Newman, W. H. Dutton, and G. Blank. 2012. Social media in the changing
  ecology of news: The fourth and fifth estate in britain. Internet
  Science, 7(1):6–22.

Oriella. 2013. The new normal for news. have global media changed
  forever? The 6th Annual Oriella Digital Journalism Survey.

L. C. Passaro and A. Lenci. 2015. Extracting terms with extra. In
  Proceedings of the EUROPHRAS 2015 Computerised and Corpus-based
  Approaches to Phraseology: Monolingual and Multilingual Perspectives,
  pages 188–196, Malaga (Spain).

L. C. Passaro and A. Lenci. 2016. Evaluating context selection strategies
  to build emotive vector space models. In Proceedings of the Tenth
  International Conference on Language Resources and Evaluation (LREC
  2016). European Language Resources Association (ELRA), May.

L. C. Passaro, L. Pollacci, and A. Lenci. 2015. Item: A vector space
  model to bootstrap an italian emotive lexicon. In Proceedings of the
  second Italian Conference on Computational Linguistics CLiC-it 2015,
  pages 215–220, Trento (Italy).

R. Picard. 2009. Blogs, tweets, social media, and the news business.
  Nieman Reports, 63(3):10–12.

D. V. Shah, J. Cho, W. P. Eveland, and N. Kwak. 2005. Information and
  expression in a digital age. Communication Research, 32(10):531–565.

M. L. Sheffer and B. Schultz. 2010. Paradigm shift or passing fad?
  twitter and sports journalism. International journal of Sport
  Communication, 3(4):472–484.