<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards a more Systematic Analysis of Twitter Data: A Framework for the Analysis of Twitter Communities</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessia Antelmi</string-name>
          <email>aless.antelmi@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Josephine Griffith</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karen Young</string-name>
          <email>karen.youngg@nuigalway.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Insight Centre for Data Analytics, Data Science Institute</institution>
          ,
          <addr-line>Galway</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National University of Ireland Galway</institution>
          ,
          <addr-line>Galway</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Università degli Studi di Salerno</institution>
          ,
          <addr-line>Fisciano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>To date, many studies have used the social media platform Twitter to gather insights into real-life events. The current literature focuses on patterns around isolated case studies and their dynamics happening on the platform, but it still lacks standard techniques for comparing behavioural and interaction patterns within and across Twitter communities. To fill this gap, we present a framework for characterizing online Twitter communities from a quantitative and a semantic point of view. We then discuss an example of the application of the framework to compare two distinct Twitter fan communities. This case study application clearly illustrates the benefits of the framework, while also highlighting potential areas for improvement and further extensions.</p>
      </abstract>
      <kwd-group>
        <kwd>Online Social Network Analysis</kwd>
        <kwd>Twitter</kwd>
        <kwd>Communities Analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The ever-increasing use of the Internet produces a huge amount of structured
and unstructured data that can be mined and analysed to gather insights into
several domains. In this context, online social networks represent a rich
opportunity to collect real user data, especially from Twitter<sup>4</sup>, which is well-suited to
the task of discovering opinions, ideas and events [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. With 335 million monthly
active users as reported by the Statista website, the microblogging platform
Twitter has been widely studied in contexts of political, crisis and brand
communication and user engagement around shared experiences such as TV shows
and everyday interpersonal exchanges [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Bruns and Stieglitz [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] proposed a
catalogue of standard, replicable metrics for studying hashtagged Twitter
conversations, motivated by the absence in previous work of such metrics to compare
one hashtagged event with another. However, the literature still lacks standard
techniques for comparing behavioural and interaction patterns within and across
Twitter communities. This prevents researchers from developing a
comprehensive perspective about how Twitter is used by brands to engage with fans and
critics and how this use changes over time. In this work, we present a framework
for characterizing online Twitter communities from a quantitative and a
semantic point of view, using data retrieved from both the profile and the timeline of
the users. In Section 2 we describe in detail the proposed framework, while also
outlining related work. In Section 3 we introduce two use cases showing how
the framework can be applied to analyse and compare users' behaviour within
and across Twitter communities. In experiments, we ensure that the collected
data is cleaned so that any spam/bot content is removed prior to analysis [
        <xref ref-type="bibr" rid="ref17 ref5">5,
17</xref>
        ]. Section 4 discusses the results obtained and ideas for future work.
        <fn><p>This paper has been funded in part by Science Foundation Ireland under grant number SFI/12/RC/2289 (Insight).</p></fn>
        <fn><label>4</label><p>https://twitter.com</p></fn>
      </p>
    </sec>
    <sec id="sec-2">
      <title>A Framework for Analysing Twitter Communities</title>
      <p>
        In this work, we consider a Twitter community as a set of Twitter users who
share a common interest (e.g. some followers of a TV series' Twitter account),
motivated by the research of Java et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Our framework is made up of
two principal components to deal with the User Generated Content (UGC) - in
terms of topics, sentiment and emotions expressed - and the user's interaction
behaviours and posting patterns, as shown in Figure 1. We will describe the
semantic component in Section 2.1, and the quantitative component in Section 2.2.
      </p>
      <p>[Figure 1: the two components of the framework applied to the UGC (community and general), with a dashboard for presentation of results.]</p>
      <sec id="sec-2-1">
        <title>Semantic</title>
        <p>• Topic Modelling
• Sentiment Analysis
• Cognitive Analysis</p>
      </sec>
      <sec id="sec-2-2">
        <title>Quantitative</title>
        <p>• Activity Metrics
• Visibility Metrics
• Metadata Metrics</p>
        <p>
Semantic analysis enables insights into the content produced by the
community of interest. In our framework we propose a three-level semantic analysis
approach, exploring the topics discussed, the sentiment and the cognitive sphere
of the posts. Where the given community is selected according to a specific
interest/topic, it can be useful to split the UGC into two subsets: (i) the first
containing all the activities related to the interest/topic chosen and (ii) the
remaining ones. Splitting the dataset in this way enables the comparison of
behavioural patterns across the same set of users regarding the topic of interest
and the other remaining activities. The analyses described can then be run
independently on both subsets, indicating differences and similarities that exist
within a community in comparison to general discussions. This division can be
done using a keyword list based on the chosen topic.</p>
        <p>
          Topic Modelling Level. Topic modelling is a machine learning technique that
looks for patterns in the use of words and attempts to inject semantic meaning
into vocabulary, whereby a topic consists of a cluster of words that frequently
occur together [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Previous work [
          <xref ref-type="bibr" rid="ref1 ref9">1, 9</xref>
          ] presents a survey of tools and approaches
for topic detection from Twitter streams, exploring different types of topic
detection techniques and evaluating their performance. Lau et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and Jonsson
et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] focus on the evaluation of the Latent Dirichlet Allocation (LDA) topic
modelling algorithm and its variants, while Musto et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] implement a pipeline
of entity linking algorithms. Entity linking algorithms automatically incorporate
stopword removal, bigram recognition, entity identification and disambiguation.
They can also enrich the representation with features which do not explicitly
occur in a text: for example, if an entity is mapped to a Wikipedia page, it is
possible to browse the Wikipedia category tree to further enrich the content
representation by introducing the most relevant ancestor categories of that page. Discovering
the topics discussed in the UGC is the first step in detecting the interests of a
community.
        </p>
        <p>
          Sentiment Analysis Level. The study of the tweets' polarity (examination
of the sentiment of the tweets) can give important insights into what is
happening in the real world and what people think about a given event. Bollen et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
found that events in the social, political, cultural and economic sphere do have
a significant, immediate and highly specific effect on the various dimensions of
public mood, suggesting that large-scale analyses of mood can provide a solid
platform to model collective emotive trends in terms of their predictive value
with regards to existing social as well as economic indicators. Martinez et al.
offer a survey [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] of the state-of-the-art techniques used to explore sentiment
analysis on Twitter. It is worth highlighting that, because tweets are
rich in emojis, some studies focus on extracting the sentiment using
this piece of information [
          <xref ref-type="bibr" rid="ref19 ref20">20, 19</xref>
          ].
        </p>
        <p>
          Cognitive Analysis Level. Exploiting the cognitive sphere of the UGC gives
deeper knowledge about the emotional aspects of the content and the personality
of its author [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Several works explore this dimension on the Twitter platform.
Qiu et al. [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] study the relationship between personality and the microblog,
pointing out the potential of using social media for personality research.
Tumasjan et al. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] use cognitive analysis to investigate whether Twitter is used
as a forum for political deliberation and whether online messages on Twitter
validly mirror offline political sentiment. The synergistic use of Twitter and the
analysis of the cognitive sphere can also help in the health domain, where simple
natural language processing can yield insights into specific disorders [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and into
the level of stress during workdays and weekends [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. Linguistic Inquiry and
Word Count<sup>5</sup> (LIWC) - text analysis software developed to assess emotional,
cognitive and structural components of text samples using a psychometrically
validated internal dictionary - is one of the most used tools in cognitive analysis,
thanks to its ease of use and its broad range of social and psychological insights.
Other tools are ANEW<sup>6</sup>, a dictionary focused on academic text, and GI<sup>7</sup>, a
computer-assisted approach for content analyses of textual data.
        </p>
        <sec id="sec-2-2-1">
          <title>Quantitative Analysis</title>
          <p>While the semantic analysis provides a way to analyse the content posted by the
users, quantitative analysis of user tweets provides insights into user behaviours
and interaction patterns. We identify three typologies of quantitative metrics:
activity, visibility and metadata.</p>
          <p>Activity metrics. Activity metrics describe the daily activity pattern of the
community - in terms of the content posted or liked - and the number of
different types of activities, i.e. the total amount of tweets, quotes, retweets,
comments and likes. Evaluating the number of daily activities is useful in
identifying any spike in the interaction pattern and its potential reason (e.g.
political election, movie premiere). Assessing the number of different
activities that exist can help in finding out the proportional distribution between
information-providing and information-seeking users. This information can
be helpful when evaluating information propagation strategies.</p>
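          <p>The activity metrics described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation used in this work: each activity is assumed to be a dictionary with hypothetical <monospace>created_at</monospace> (ISO 8601 timestamp) and <monospace>type</monospace> keys, and a day is flagged as a spike when its count exceeds a hypothetical multiple (<monospace>factor</monospace>) of the mean daily count.</p>
          <preformat>
```python
from collections import Counter
from datetime import datetime
from statistics import mean


def daily_counts(activities):
    """Count activities per calendar day (keys are datetime.date objects).

    Assumes a hypothetical "created_at" key holding an ISO 8601 timestamp.
    """
    return Counter(
        datetime.fromisoformat(a["created_at"]).date() for a in activities
    )


def type_counts(activities):
    """Count activities per type (hypothetical "type" key, e.g. tweet, like)."""
    return Counter(a["type"] for a in activities)


def spikes(counts, factor=2.0):
    """Return the days whose count exceeds `factor` times the mean daily count.

    The threshold rule is an illustrative assumption, not taken from the paper.
    """
    if not counts:
        return []
    threshold = factor * mean(counts.values())
    return sorted(day for day, n in counts.items() if n > threshold)
```
          </preformat>
          <p>A spike returned by <monospace>spikes</monospace> would then be inspected manually to identify its potential reason, such as a premiere or an election.</p>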
          <p>Visibility metrics. Visibility metrics count the number of retweets and likes
received and they can help in understanding the visibility of the users within
the community and Twitter in general.</p>
          <p>Metadata metrics. Metadata metrics are evaluated on the metadata field
retrieved from the Twitter JSON object describing a user's activity. These
metrics enable the identification of the most used hashtags, posting devices
and attached media (in terms of photos, videos and gifs) as well as the
location of the activity posted.</p>
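          <p>As an illustration of the metadata metrics, the sketch below counts the most used hashtags and the share of each language. It assumes that every status follows the layout of the Twitter REST API v1.1 JSON object, i.e. an <monospace>entities.hashtags</monospace> list whose items carry a <monospace>text</monospace> field, and a <monospace>lang</monospace> field; the function names are hypothetical and the code is not the implementation used in this work.</p>
          <preformat>
```python
from collections import Counter


def top_hashtags(statuses, n=5):
    """Return the n most frequent hashtags, lower-cased, with their counts.

    Assumes the v1.1 layout: status["entities"]["hashtags"][i]["text"].
    """
    counts = Counter()
    for status in statuses:
        for tag in status.get("entities", {}).get("hashtags", []):
            counts["#" + tag["text"].lower()] += 1
    return counts.most_common(n)


def language_shares(statuses):
    """Return the fraction of statuses per language code.

    Statuses without a "lang" field are grouped under "und" (undetermined).
    """
    counts = Counter(s.get("lang", "und") for s in statuses)
    total = sum(counts.values())
    return {lang: c / total for lang, c in counts.items()} if total else {}
```
          </preformat>
          <p>Hashtags are lower-cased before counting so that variants such as #GoT and #got are treated as the same tag; this normalization is a design choice of the sketch.</p>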
        </sec>
        <sec id="sec-2-2-2">
          <title>Summary of the Framework</title>
          <p>We have outlined the organization of our framework, designed to provide a
structured overview of approaches and techniques that best suit the analysis of the
community of interest. The framework serves to guide the choice of these,
exploring some of the most common tools and techniques used and describing
whether they can be useful or not according to the insights they offer about the
data (e.g., understanding the cause of a peak in the user activity and the
associated reaction). The high modularity of the framework allows
the use of some, or all, of its components, depending on the depth of analysis
required. A visualization component, where we outline several techniques to plot
the outcomes obtained from the analyses, is also included in the framework. We
do not discuss this component in this work, but we provide a link<sup>8</sup> to a
dashboard visualizing all results from the use cases described. To understand how
the framework can be used to answer specific research questions, we tested it by
applying it to two real Twitter communities. These two use cases are described
in the following section.
<fn><label>5</label><p>http://liwc.wpengine.com</p></fn>
<fn><label>6</label><p>http://www.newacademicwordlist.org</p></fn>
<fn><label>7</label><p>http://www.wjh.harvard.edu/~inquirer</p></fn></p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Use Cases</title>
      <p>The application of the proposed framework to Twitter communities is trialled in
two separate use cases - individually initially, followed by a comparative analysis
across the two communities. While both communities were analysed using all
six components of the framework, we will present only the most interesting
outcomes here. It is worth noting that there are many tools available for each task
described; we chose the ones that have been widely used in the literature and that
are, at the same time, easy to obtain and do not require a significant coding
effort - thus making the framework accessible to more people. Furthermore,
thanks to the modular design of the framework, different tools can be added,
replaced and removed as new technologies are developed.</p>
      <p>We consider two Twitter fan communities - one related to the TV show Game
of Thrones (GoT), the other to the British rock band Coldplay - and we apply
the same methodology to both of them. We chose the GoT community because
the Game of Thrones TV show has established a reputation for being
widely discussed on social media channels, specifically on the Twitter platform,
which allows dynamic, real-time engagement and interaction with viewers. We
picked a Coldplay community for a similar reason: with a gross of 523 million
dollars and 5.39 million fans attending their tour in 2017, the pop/rock band
Coldplay is one of the most famous of the last decade. In both cases, Twitter
plays the role of an important medium to engage fans. We retrieved the complete
followers list of the GoT Twitter account and then randomly picked 350,000
users from among them. We collected data - timeline and profile information -
from the 130,951 users among this random group whose timeline was publicly
shared over a 6-month timeframe from June 3rd to December 3rd, 2017. We
followed the same process to select a random subset of the Coldplay official
Twitter account followers, obtaining a final 121,306 users. We first divided each
dataset into two subsets, one containing all posts about the community topic (GoT
or Coldplay), the other consisting of all the remaining activities within
the dataset. To accomplish this task we used a keyword list and we searched for
these keywords in the text, hashtags and user mentions status fields. The GoT
keyword list was based on the GoT world in general, HBO.com (e.g. promotional
campaigns, episodes' titles) and the GoT books (e.g. titles, characters), while the
Coldplay keyword list was based on the band members' names, Coldplay songs'
and albums' titles and tours' names. This process yielded a final total of 404,650
GoT user activities (with 47,682 users who posted an update about the TV
show) and 16,160,878 generic GoT-unrelated posts within the GoT dataset, and
6,814 Coldplay user activities (with 2,103 users who posted an update about the
band) and 4,270,077 generic posts within the Coldplay dataset. It is interesting
to note that there is a considerable imbalance in the number of activities posted
by the two communities, probably due to the different nature of the
community-related events analysed (TV series and concerts). All the analyses that follow
have been run independently on all four sub-datasets; the semantic analysis
approach has only been applied to the English tweets.
<fn><label>8</label><p>http://www.alessiaantelmi.it/framework/production</p></fn></p>
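      <p>The keyword-based division described above can be sketched as follows. This is a minimal illustration and not the code used in this work: it assumes statuses in the Twitter REST API v1.1 JSON layout (<monospace>text</monospace>, <monospace>entities.hashtags</monospace>, <monospace>entities.user_mentions</monospace>), it matches keywords case-insensitively as whole words (an assumption, made here so that a short keyword such as "got" does not match "forgot"), and the function names are hypothetical.</p>
      <preformat>
```python
import re


def build_matcher(keywords):
    """Compile a case-insensitive pattern matching any keyword as a whole word."""
    if not keywords:
        raise ValueError("the keyword list must not be empty")
    # Longest keywords first, so multi-word keywords win over their prefixes.
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def searchable_text(status):
    """Join the status text, hashtags and user mentions into one string."""
    entities = status.get("entities", {})
    parts = [status.get("text", "")]
    parts += [h["text"] for h in entities.get("hashtags", [])]
    parts += [m["screen_name"] for m in entities.get("user_mentions", [])]
    return " ".join(parts)


def split_by_topic(statuses, keywords):
    """Split statuses into (topic_related, remaining) using the keyword list."""
    matcher = build_matcher(keywords)
    related, remaining = [], []
    for status in statuses:
        target = related if matcher.search(searchable_text(status)) else remaining
        target.append(status)
    return related, remaining
```
      </preformat>
      <p>Running <monospace>split_by_topic</monospace> once per community, each with its own keyword list, would produce the four sub-datasets on which the analyses are run independently.</p>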
      <sec id="sec-3-1">
        <title>Semantic Analysis</title>
        <p>Topic Analysis. The topic modelling tool we chose was the machine
learning toolkit MALLET<sup>9</sup>, which provides an efficient way to build up topic models
based on the LDA algorithm. We found that 4 was the optimal number of clusters
for the GoT-related activities, corresponding to the following topics: broadcast
of a new episode, season premiere/finale, trailers/scenes (e.g., videos on
YouTube) and episodes' content (e.g., lines spoken by GoT TV series
characters). To evaluate the topics for the generic activities in the GoT dataset,
we split them according to their creation date and we analysed the topics per
month. Due to the huge amount of posts to evaluate, we randomly picked three
different subsets (around 50,000 posts each) from among them on which to run
the topic analysis. We found that the most discussed topics across the whole
six months are: politics (e.g., Brexit, Trump, Obama, Catalonia), sport (e.g.,
cricket, NBA, tennis, football), special events (Father's Day, Thanksgiving)
and news (e.g., Hurricanes Harvey and Irma, Mexico Earthquake, Las Vegas
shooting). To verify if the use of an entity linking algorithm can improve the
evaluation of the topics, we also used Tag.me APIs<sup>10</sup> that enabled us to identify
Wikipedia entities referred to in the text of the content analysed. The
investigation of the Wikipedia entities in the GoT-related dataset helped in sharpening
the broader topics obtained through LDA, showing that they mainly refer to (i)
GoT storyline characters: Jon Snow was the most discussed, followed by Arya
Stark, Daenerys Targaryen and Sansa Stark and (ii) locations: either described
in the books or used as filming sets (identified under the Wikipedia entity World
of A Song of Ice and Fire). The Wikipedia entities retrieved from the generic
activities in the GoT dataset didn't add any further information; the most common
entities found are Father's Day, Theresa May, Israel cricket team, Manchester
United F.C., racism, Houston, hurricane season, Thanksgiving, Donald Trump.</p>
        <p>
          Analysis of the Coldplay dataset found the following topics in the
Coldplay-related activities: A Head Full of Dreams Tour, Houston concert and
hurricane, Album anniversary, Albums/songs advertising. As in the
GoT case, the Wikipedia entities collected gave us further information about
these topics, such as most tweeted songs (A Sky Full of Stars and Fix You) and
the location of the Houston concert (NRG Stadium). Frequently discussed topics
in the generic activities within the Coldplay dataset (evaluated as in the GoT
case) are mostly music-related with references to the Teen Choice Awards
and to the MTV Video Music Awards - the evaluation of the number of the
Wikipedia entities found in this dataset highlights the same result. Other topics
are related to politics (e.g., Obama, Trump, racism), TV-series (e.g., Game
of Thrones, Gotham), sport (e.g., Premier League, NBA, cricket) and special
events/news (e.g., Harry Potter, Father's Day, Houston Hurricane, Diwali). It
is interesting to note that both communities are engaged with almost the same
set of generic topics, such as politics, sports, news and special events.
<fn><label>9</label><p>http://mallet.cs.umass.edu</p></fn>
<fn><label>10</label><p>http://tagme.di.unipi.it</p></fn>
        </p>
        <p>
          Sentiment Analysis. To evaluate the sentiment of the collected activities we
used the freely available lexicon and rule-based classifier VADER [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], "specifically
attuned to sentiments expressed in social media"<sup>11</sup>. We studied the sentiment
expressed in the whole GoT dataset from the 1st July until the 31st August,
when the majority of the GoT-related activities happened (see Figure 4a), while
we observed the sentiment expressed in the Coldplay dataset from August 15th
until August 31st, corresponding to a peak in the number of Coldplay-related
activities (see Figure 2d). Figures 2a and 2c show the average daily sentiment
within the GoT and the Coldplay dataset, respectively. The sentiment is predominantly
positive for both communities, with a few points touching zero, which indicates a larger
number of negative posts lowering the average sentiment value on those days.
To find the events associated with these valleys in the sentiment, we combined the
GoT-related activities and the associated sentiment in Figure 2b. We discovered
that lower values in the sentiment are related to the third, fourth, fifth and
seventh episodes of the GoT TV series (where an episode is represented by a spike
in the activities). Lower sentiment values are also observable (Figure 2a) for the
generic activities posted during the second week of August; they are mostly
related to the nationalist march in Charlottesville and to the question of racism in
general. Figure 2d shows the same analysis for Coldplay-related activities, where
a peak in the activities corresponds to the lowest daily sentiment; this event is
related to the Coldplay concert in Houston that was cancelled because of Hurricane
Harvey. This also explains the many tweets related to the NRG Stadium, the
location of this concert.
        </p>
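        <p>The average daily sentiment plotted in Figure 2 can be computed along the following lines. This is a hedged sketch rather than the code used in this work: it assumes that every activity has already been scored by a classifier and carries a hypothetical <monospace>compound</monospace> value in the range from -1 to 1 (the range of the compound score produced by VADER), together with a hypothetical <monospace>created_at</monospace> ISO 8601 timestamp.</p>
        <preformat>
```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def average_daily_sentiment(scored):
    """Map each calendar day to the mean sentiment score of its activities.

    Assumes hypothetical "created_at" and "compound" keys on every item.
    """
    by_day = defaultdict(list)
    for item in scored:
        day = datetime.fromisoformat(item["created_at"]).date()
        by_day[day].append(item["compound"])
    return {day: mean(values) for day, values in sorted(by_day.items())}


def lowest_sentiment_day(daily):
    """Return the (day, score) pair with the lowest average sentiment."""
    return min(daily.items(), key=lambda pair: pair[1])
```
        </preformat>
        <p>Comparing the day returned by <monospace>lowest_sentiment_day</monospace> with the daily activity counts is one way to relate a valley in the sentiment to a spike in the activities.</p>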
        <p>Cognitive Analysis. We used the LIWC dictionary as a cognitive analysis
tool. The analysis identified (see Figures 3a and 3c) that both communities have
a positive and confident style (a high value for the Tone and Clout variables,
respectively), expressed with a distanced form of discourse (a low Authentic value).
Figures 3b and 3d show the outcomes for the other LIWC dimensions analysed.
The GoT-related activities are not only the more negative in terms of sadness,
anxiety and anger, but many of them also refer to status, dominance and social
hierarchies (a high value for the Power variable) and they include several references
to other people (a high value for the Affiliation variable). Both these values are
reflected in the Drives dimension. This result is not surprising due to the topics on
which the GoT TV series is based (e.g., battles for power). Both communities are
focused on present events, given the use of present-tense verbs, while the GoT
one also refers to past events with past-tense verbs. Moderately-high values for
the Informal, Netspeak and Assent variables reflect the writing style of social
media, such as the use of basic punctuation-based emoticons and abbreviations
like LOL. Examples of Assent words are: agree, OK, yes. The moderately-high
value for the CogProc dimension highlights how users are willing to express their
opinions in their tweets; this aspect is supported by the use of verbs like think,
consider and should. It is also interesting to note the absence of Perceptual
processes, i.e. the absence of the massive use of verbs indicating seeing, hearing and
feeling, even though many activities within the GoT community refer to the TV
show and many posts in the Coldplay community refer to music.
<fn><label>11</label><p>https://github.com/cjhutto/vaderSentiment</p></fn></p>
        <p>[Figure 2, panels recovered from extraction: (c) Average daily sentiment from the 15th August until the 31st August (Coldplay dataset); series: Coldplay avg sentiment, General avg sentiment. (d) Comparison between the Coldplay-related activities and the sentiment expressed; series: Coldplay avg sentiment, Coldplay activities. Axes: sentiment score, activities, date.]</p>
      </sec>
      <sec id="sec-3-2">
        <title>Quantitative Analysis</title>
        <p>[Figure 3: (a) LIWC summary variables - GoT dataset. (b) Other LIWC dimensions analysed - GoT dataset. (c) LIWC summary variables - Coldplay dataset. (d) Other LIWC dimensions analysed - Coldplay dataset. Summary variables: Analytic, Authentic, Tone, Clout. Other dimensions: power, affiliation, netspeak, informal, focuspresent, focuspast, assent, posemo, negemo, affect, cogproc, percept, drives. Series: GoT, GoT general, Coldplay, Coldplay general.]</p>
        <p>A study of the daily activity pattern of all datasets
reveals that the GoT community is far more active than the Coldplay one. In
particular, Figures 4a and 4b show clear peaks of user activity related to the
GoT show throughout the TV series' seventh season. Specifically, the highest
levels of user activity are evident in the season finale (last spike), followed closely
by the premiere (third spike). Other events that stimulated interest are the final
trailer (second spike) and the Twitter Emoji Engine release (first spike). By
contrast, no clear pattern emerges from the Generic activity set. The study of the
daily activity pattern for the Coldplay community (Figures 4c and 4d) shows a
huge peak of generic activities during August and the second half of the observed
period. This is likely due to some events happening in the month of August, as
a result of hurricane Harvey (and the following hurricane Irma) and the MTV
Video Music Awards held in California. Figure 4d reveals a single major peak in
the Coldplay-related activities happening on the 25th August corresponding to
the cancelled Houston concert. The other highest levels of user activity refer to
the concerts performed in Chicago (17th August), Cleveland (19th August) and
Miami (28th August), while other minor peaks in October and November refer
to concerts as well.</p>
        <p>The final metadata analysis reveals that the most used hashtags related
to GoT are standard, generic hashtags like the name of the show,
#gameofthrones, and different abbreviated versions of it, like #got and #got7, followed
by #winterishere, #thronesyall, #gotmvp, #gameofthronesfinale and
#prepareforwinter. The top five hashtags found in the generic activities within the GoT
dataset are #giveaway, #win, #tvtime, #mufc (Manchester United F.C.) and
#sdcc (San Diego Comic-Con). The most used hashtags related to the Coldplay
activities are associated with the A Head Full of Dreams Tour, like
#coldplaytoronto, #coldplaychicago and #coldplayhouston, in addition to generic ones
such as #coldplay and #chrismartin. The top five hashtags for the generic
activities are: #pushawardskathniels (it refers to the Push Awards 2017 contest
which recognises top online influencers), #mersal, #missuniverse (Miss Universe
Philippines), #philippines and #newprofilepic. Both communities share a clear
predominance of mobile activity, as the majority of activities are generated from
an iPhone or an Android device (more than 70% in total). The most
common language is English (more than 60%), followed by Spanish, Portuguese
and French in both cases.</p>
        <p>[Figure 4: daily number of activities, Jul '17 to Dec '17; recovered series labels: GoT, General.]</p>
      </sec>
      <sec id="sec-3-3">
        <title>Discussion</title>
        <p>The proposed framework allows a larger problem, namely the analysis of
behavioural and interaction patterns of a Twitter community, to be broken into
sub-problems so that some or all of the different components described in
Section 2 can be considered when analysing data from a new community. Its
application to two use cases illustrates how a standard approach in analysing
communities makes it easier to acquire insights into them, especially when combining
and/or comparing the several outcomes obtained. The quantitative analysis can
offer an overview of the user behaviour, in terms of interaction patterns and
typology of content posted, and clearly illustrates the activity level of a
community, while the evaluation of the most used hashtags provides insights into the
most discussed events within a community, like the San Diego Comic-Con for the
GoT community and Miss Universe Philippines for the Coldplay one. Merging
the outcomes from the metadata analysis (such as tweeting locations, most used
languages and posting devices) provides insights regarding any events
happening in a community, as, for example, in the Coldplay community,
where most of the activities were from the USA during the band's tour.
Further insights can be obtained by merging the quantitative information with the
outcomes from the semantic analysis: Figure 2 shows how combining the daily
activity pattern with the sentiment and the topics expressed in the posts yields
the most discussed topics and the reaction associated with them. The study
of the cognitive dimension can further improve the analysis of the emotional
component, extending the binary positive/negative sentiment categorization to
several categories, for instance in terms of anger, anxiety or happiness. This can
be useful to compare the reaction to different events or the way communities
express themselves; for example, we found that the GoT community was more
negative in terms of anger and anxiety in comparison to the Coldplay dataset.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>
        This work presents a framework for the analysis of User Generated Content
(UGC) using Twitter communities and applies the framework to two different
case studies. The framework comprises two main components - semantic and
quantitative - with each component comprising three sub-components. The
development of a standard framework for the analysis of Twitter communities
provides a simplified approach to compare and correlate outcomes across a range of
different case studies. This can be used to find the similarities and differences in
behavioural and interaction patterns within and across communities. The
presence of a dashboard to interactively visualize the results from the analyses and
the user insights produced can be another useful tool to acquire knowledge about
the dataset and will be discussed in future work. To further investigate the
communities of interest, several additional research methods can be employed. For
instance, the work described by Bruns and Stieglitz [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which focused on
hashtagged conversations, can be included to deepen the understanding of how
hashtags contribute to sharing knowledge and discussing events. The framework could also
be further extended to consider both a static snapshot of the network structure
of the community and its dynamic evolution over time. Finally, another point
of interest could be evaluating the framework on different types of datasets,
such as real-time data collected from the Twitter stream, data strictly related to
an event (e.g. the World Cup) or only retweet data (for instance, to compare
retweet-only data with tweet-only data).
      </p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>Alessia Antelmi thanks the Erasmus+ grant.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Atefeh</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khreich</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>A survey of techniques for event detection in twitter</article-title>
          .
          <source>Comput. Intell</source>
          .
          <volume>54</volume>
          (
          <issue>31</issue>
          ),
          <fpage>132</fpage>
          –
          <lpage>164</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Barnaghi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , et al.:
          <article-title>Opinion mining and sentiment polarity on twitter and correlation between events and sentiment</article-title>
          .
          <source>In: 2016 IEEE Second International Conference on Big Data Computing Service and Applications</source>
          . pp.
          <fpage>52</fpage>
          –
          <lpage>57</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bollen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pepe</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mao</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena</article-title>
          .
          <source>CoRR</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bruns</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stieglitz</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Towards more systematic twitter analysis: Metrics for tweeting activities</article-title>
          .
          <volume>16</volume>
          ,
          <fpage>91</fpage>
          –
          <lpage>108</lpage>
          (03
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Chu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          , et al.:
          <article-title>Detecting automation of twitter accounts: Are you a human, bot, or cyborg?</article-title>
          <source>IEEE Transactions on Dependable and Secure Computing</source>
          <volume>9</volume>
          (
          <issue>6</issue>
          ),
          <fpage>811</fpage>
          –
          <lpage>824</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Greene</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , et al.:
          <article-title>How many topics? stability analysis for topic models</article-title>
          .
          <source>In: Machine Learning and Knowledge Discovery in Databases</source>
          . pp.
          <fpage>498</fpage>
          –
          <lpage>513</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Harman</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Quantifying mental health signals in twitter</article-title>
          2014 (01
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Hutto</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gilbert</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Vader: A parsimonious rule-based model for sentiment analysis of social media text</article-title>
          .
          <source>In: ICWSM</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Ibrahim</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , et al.:
          <article-title>Tools and approaches for topic detection from twitter streams: survey</article-title>
          .
          <source>Knowledge and Information Systems</source>
          <volume>54</volume>
          (
          <issue>3</issue>
          ),
          <fpage>511</fpage>
          –
          <lpage>539</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Java</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <article-title>Why we twitter: An analysis of a microblogging community</article-title>
          .
          <source>In: Advances in Web Mining and Web Usage Analysis</source>
          . pp.
          <fpage>118</fpage>
          –
          <lpage>138</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Jonsson</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stolee</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>An evaluation of topic modelling techniques for twitter</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Lau</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Collier</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baldwin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>On-line trend analysis with topic models: #twitter trends detection topic model online</article-title>
          .
          <source>In: COLING</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Martinez-Camara</surname>
          </string-name>
          , et al.:
          <article-title>Sentiment analysis in twitter</article-title>
          .
          <source>Natural Language Engineering</source>
          <volume>20</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          –
          <lpage>28</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Musto</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , et al.:
          <article-title>Crowdpulse: A framework for real-time semantic analysis of social streams</article-title>
          .
          <source>Information Systems</source>
          <volume>54</volume>
          ,
          <fpage>127</fpage>
          –
          <lpage>146</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Qiu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramsay</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>You are what you tweet: Personality expression and perception on twitter</article-title>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Tumasjan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <article-title>Predicting elections with twitter: What 140 characters reveal about political sentiment</article-title>
          .
          <source>In: ICWSM</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>A.H.</given-names>
          </string-name>
          :
          <article-title>Don't follow me: Spam detection in twitter</article-title>
          .
          <source>In: 2010 International Conference on Security and Cryptography (SECRYPT)</source>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>10</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , et al.:
          <article-title>Twitter analysis: Studying us weekly trends in work stress and emotion</article-title>
          2016,
          <fpage>355</fpage>
          –
          <lpage>378</lpage>
          (01
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Wolny</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Emotion analysis of twitter data that use emoticons and emoji ideograms</article-title>
          .
          <source>In: ISD</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Wood</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Emoji as emotion tags for tweets</article-title>
          .
          <source>Emotion and Sentiment Analysis Workshop</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>