<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop,
Glasgow, Scotland</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Extracting Resources that Help Tell Events' Stories</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Carlo Andrea Conte</string-name>
          <email>carloante@msn.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raphael Troncy</string-name>
          <email>raphael.troncy@eurecom.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mor Naaman</string-name>
          <email>mor.naaman@cornell.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cornell Tech and Mahaya, Inc.</institution>
          ,
          <addr-line>New York</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>EURECOM</institution>
          ,
          <addr-line>Biot</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Mahaya Inc.</institution>
          ,
          <addr-line>New York</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <volume>0</volume>
      <fpage>1</fpage>
      <lpage>04</lpage>
      <abstract>
        <p>Social media platforms constitute a valuable source of information regarding real-world happenings. In particular, user generated content on mobile-oriented platforms like Twitter allows for real-time narrations thanks to the instantaneous nature of publishing. A common practice for users is to include in their tweets links pointing to articles, media files and other resources. In this paper, we are interested in how the resources shared in a stream of tweets for an event can be analyzed, and how they can help tell the event story. We describe a system that extracts, resolves, and eventually filters the resources shared in tweet content according to two different ranking functions. We are interested in how these two ranking functions perform (with respect to speed and accuracy) for discovering important and relevant resources that will tell the event story. We describe an experiment on a sample set of events where we evaluate those functions. We finally comment on the stories we obtained and we provide statistics that give meaningful insights for improving the system.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>For many events and real-world happenings, Twitter,
Facebook, Instagram, and other social media
platforms provide a continuous stream of user-contributed
messages and media. Very often, the messages posted
include hyperlinks pointing to content outside the
platform where the message was originally posted. The
nature of these external resources varies: they may
be images, news articles, real-time video streams and
many other types of content. Our goal is to identify
resources shared via hyperlinks in social media streams
that are highly relevant and important to an ongoing
event. Our next goal is to extract these relevant
resources, and assemble them in a storyline with the
objective of producing a rich narration of the event using
a very diverse set of media.</p>
      <p>Copyright © by the paper's authors. Copying permitted only
for private and academic purposes.</p>
      <p>In this paper, we describe a system that aims at
extracting a timeline of resources from a stream of
tweets<sup>1</sup> about an event. These links will be ranked
and filtered in near real-time, in order to identify
relevant and valuable information as soon as possible
after being shared. In addition, the system extracts
descriptive metadata from the referenced pages that can
be used to represent the resource in the event
timeline in an intelligible way. The collection of items is
normalized according to the referenced resources, so
that their relevance score will result from the
aggregation of the social items referencing them. We will
eventually analyse and compare the storylines that are
produced in order to identify particular characteristics
that could help improve future versions of this
system. Our contributions include: a) a general
architecture for extracting resources from a stream of tweets;
b) the development of two scoring methods to rank
the importance of those resources in contributing to
the storyline of an event; and c) the representation of
meaningful statistics from the resulting story.</p>
      <p>The rest of this paper is organized as follows. In
Section 2, we present some related work. We then
describe our generic system architecture for
extracting resources in a stream of tweets (Section 3). In
Section 4, we present two ranking methods based on
volume and velocity. We show a simple interface we
have developed to visualize the storylines that are
created based on some filtering parameters (Section 5).</p>
      <p><sup>1</sup>We observe that tweets can (and often do) contain links
pointing to other social networks such as Instagram, YouTube,
etc.</p>
      <p>We discuss the results of our experiments in Section 6.
Finally, we conclude and outline future work in
Section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>The system described in this paper is one of many
services that gather social media content shared about
particular events. These systems generally aim to provide a
comprehensive view of an event in order to support the
user in following or understanding the event. The
primary motivation of our system is to rapidly mine large
volumes of social media content so that a user receives
relevant information in a timely manner, for example
for news reporting or stock market investing.</p>
      <p>
        There has been a lot of work on the exploitation
of user generated content and social media for telling
events' stories. In this paper, we focus on collecting
and mining resources to compose in real-time a
storyline about a specific event. Gathering and
analyzing contents is the main focus of [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] where the
authors provide techniques for automatically identifying
posts shared by users on social platforms for a-priori
known events, and describe different querying
techniques. In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the focus is shifted towards identifying
the users contributing to the social content available
on a particular happening. In [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], the authors select
the best tweets for a news article, a related (but
somewhat reversed) problem to the one we are presenting
here.
      </p>
      <p>
        Closer to our work, Shamma et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] have looked at
summarization and extraction of stories from streams
of media, including for example, [
        <xref ref-type="bibr" rid="ref10 ref5">5, 10</xref>
        ] amongst
others. The user-friendly representation of these resources
is addressed in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], where videos provided by the users
are assembled in new video streams personalized for
each viewer, while in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], both the retrieval and
clustering of media items shared on social networks are
assembled in the so-called MediaFinder application.
In this paper, we address the problem of building a
storyline while the event is still in progress.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Architecture</title>
      <p>In this section, we describe the overall architecture for
the system that extracts and ranks resources that are
linked from Twitter messages about an event. The
system we propose needs to run in real-time, in order to
build storylines of events as they are unfolding. Our
goal is to be able to identify key resources with the
smallest delay possible. These requirements impose
the adoption of an efficient, flexible concurrency
process that would be easily scalable according to the data
flow. In Section 6, we report on tests conducted on
three events that have happened in the past, with the
aim of simulating the same near real-time algorithm
that would be applied to ongoing events.</p>
      <p>For the purpose of this work, we assume our input
is a stream of Twitter messages that has been
identified as relevant to the event being tracked. In our case,
these streams are generated by using a hashtag that is
associated with each event, but the system described
here is agnostic to how the Twitter content is
identified. We assume that a separate process retrieves
the Twitter content, and keeps updating a database at
regular intervals with raw data from Twitter.</p>
      <p>The different computational steps performed in the
resource extraction process are:
1. Extracting links from a collection of tweets for an
event,
2. Resolving these links to their canonical form in
order to identify duplicate resources,
3. Ranking links and applying a first basic filter,
4. Collecting useful metadata from the pages
referenced by the links which have been selected,
5. Outputting a timeline of resources that can be
further filtered using a simple web interface.</p>
      <p>We now detail the general architecture and the
major building blocks of this system (Figure 1). The
complete processing for each link includes two severe
bottlenecks, each requiring network calls that may take
a couple of seconds to complete: the resolution of the
URL, and the scraping of the referenced pages. In order
to overcome this limitation and to build a more efficient
system, we split this processing into two systems
that are intended to run in parallel.</p>
      <p>
        The first issue addressed by the system is the
fact that, due to URL shorteners and non-canonical
URL formats, links to the same resource (e.g.
http://www.example.com/some_article) can take different
forms: a bit.ly link, a URL that includes a query
string, etc. The links dispatcher is the first step of this
processing chain. It retrieves all the event tweets for
a recent temporal window from the content database.
After loading the raw data, the links dispatcher
extracts every link from the tweets' entities [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and
queries the links mappings data to check if the URL
has already been resolved to a canonical form. If the
link was not yet resolved, the system adds the URL to
the links resolutor queue. If it was already resolved, the
system adds the link, together with the data about the
tweet, to the links appearances database. The links
resolutor queue is implemented as a jobs queue [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
The advantage of a jobs queue approach is that many
worker instances can be run at the same time, and
their number and behavior can be adjusted according
to the workload. This allows for great flexibility in
the way the system can be scaled depending on the
amount of data. All the URLs in the queue are resolved
by the links resolutor function. This function will first
look up the URL in the links mappings. If it is not
found, the URL will be resolved in order to save a new
mapping. For all links which have been successfully
resolved, there is an entry in the links appearances
dataset. The links appearances dataset is therefore
a collection of all the links which have been resolved.
Every item contains information about the URL, the
tweet that contains this link and the event identifier
to which the tweet and the URL are attached.
      </p>
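      <p>The dispatcher and resolver logic described above can be sketched in Python as follows. This is a minimal, hypothetical illustration rather than the implementation used in the system: the in-memory dictionary and list stand in for the links mappings and links appearances databases, queue.Queue stands in for the jobs queue, and both the canonicalize rules (dropping fragments and utm_* parameters) and the resolve callable (which would follow HTTP redirects in a real deployment) are assumptions.</p>
      <preformat><![CDATA[
```python
from queue import Queue
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical in-memory stand-ins for the databases named in the text.
links_mappings = {}        # raw URL -> canonical URL
links_appearances = []     # one record per resolved link occurrence
resolutor_queue = Queue()  # stand-in for the jobs queue


def canonicalize(url):
    """Assumed normalization: lowercase host, drop fragment and utm_* parameters."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if not k.startswith("utm_")]
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def record_appearance(canonical_url, tweet, event_id):
    """Store one appearance of a resolved link together with its tweet data."""
    links_appearances.append(
        {"url": canonical_url, "tweet_id": tweet["id"],
         "time": tweet["time"], "event": event_id}
    )


def dispatch(tweets, event_id):
    """Record links that are already resolved; queue the others for resolution."""
    for tweet in tweets:
        for url in tweet["urls"]:
            if url in links_mappings:
                record_appearance(links_mappings[url], tweet, event_id)
            else:
                resolutor_queue.put((url, tweet, event_id))


def resolve_worker(resolve):
    """Drain the queue; `resolve` maps a raw URL to its final target URL."""
    while not resolutor_queue.empty():
        url, tweet, event_id = resolutor_queue.get()
        if url not in links_mappings:  # look up the mapping before resolving
            links_mappings[url] = canonicalize(resolve(url))
        record_appearance(links_mappings[url], tweet, event_id)
```
]]></preformat>
      <p>In this sketch several workers could call resolve_worker concurrently on the same queue, which mirrors the scaling argument made for the jobs queue; a production system would also need persistent storage and error handling for links that fail to resolve.</p>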
      <p>The second issue is due to the huge amount of
resources that are returned, which requires deciding
which resources will actually be scraped and used
to build a storyline. Links appearances are
periodically accessed by the decider, which is responsible for
ranking the resources and filtering them as needed.
The computation is always done on a sliding
temporal window whose fixed size is a parameter. The
scoring system relies on links score processors for its
decision process: different score processors can be
implemented for testing different score functions. We
detail the links score processors (LSPs) used for our
experiments in Section 4. The LSPs we use implement
a very basic filtering in order to enrich the event links
with features useful for ranking. Additional filtering
possibilities are provided within our front-end
interface.</p>
      <p>The third issue is caused by the transformation of
a set of URLs into a more human-readable representation.
This requires the scraping of additional information
from the pages pointed to by the selected links. If a
link is selected for publication, the decider queries the
pages metadata database to check whether the system
has the available metadata for that link. If no
metadata is available, the link appearance is added to the
Links Metadata Queue, otherwise it is saved together
with its metadata as an event link. The Links
Metadata Queue is processed by the metadata scraper. This
function extracts the domain from the URL and selects
a particular scraper class accordingly: it loads the
referenced page and extracts pieces of information from
it (e.g. a title, a description and a representative
image). Different scrapers look for different tags in the
DOM structure, as different web sites usually expose
different information in different ways. In our tests, we
use a generic scraper that collects information stored
in the Open Graph<sup>2</sup> and Twitter Cards<sup>3</sup> meta-tags.
Only the links for which enough metadata is found are
saved as event links, and the pages metadata database
is updated accordingly.</p>
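      <p>A generic scraper of the kind described above could look like the following Python sketch, which relies only on the standard library HTML parser. It is an illustration under stated assumptions, not the scraper used in the experiments: the class and function names are hypothetical, the preference for Open Graph over Twitter Cards values is an arbitrary choice, and the rule deciding when metadata is "enough" (a title plus a description or an image) is assumed. Fetching the page and selecting a domain-specific scraper are left out.</p>
      <preformat><![CDATA[
```python
from html.parser import HTMLParser


class MetaTagScraper(HTMLParser):
    """Collect Open Graph (property="og:*") and Twitter Cards (name="twitter:*") tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("property") or attributes.get("name") or ""
        content = attributes.get("content")
        if content and key.startswith(("og:", "twitter:")):
            # Keep the first value found for each key.
            self.meta.setdefault(key, content)


def extract_metadata(html):
    """Return title, description and image, preferring Open Graph values."""
    scraper = MetaTagScraper()
    scraper.feed(html)
    result = {}
    for field in ("title", "description", "image"):
        value = scraper.meta.get("og:" + field) or scraper.meta.get("twitter:" + field)
        if value:
            result[field] = value
    return result


def has_enough_metadata(metadata):
    # Assumed acceptance rule: a title plus at least a description or an image.
    return "title" in metadata and ("description" in metadata or "image" in metadata)
```
]]></preformat>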
      <p>The final output of the system is stored in the event
links database. This dataset holds the final set of
resources, together with their score, and other attributes
inferred by the link score processor and used to rank
the resource (e.g. total volume, highest volume in a
time-window duration). In addition, the record
contains the metadata extracted from the referenced page.</p>
    </sec>
    <sec id="sec-4">
      <title>Ranking Methods</title>
      <p>In this section, we describe two methods for ranking
the links that appear in the Twitter stream. These
methods have to enable a robust detection of relevant
and important links with the smallest delay possible
from their publication. The reason why we want to
filter out irrelevant resources is the quantity of data
that is usually shared: it largely outgrows the
number of items that could be used for building a story.</p>
      <sec id="sec-4-1">
        <title>Volume based LSP</title>
        <p>Organizing a collection of every single appearance of
every link gives us the possibility to obtain information
about the volume of re-share reached by a link during
an event. As an example, Figure 2 shows the
number of appearances of the link pointing to President
Obama's Vine for Batkid<sup>4</sup>.</p>
        <p>The volume based LSP assigns to every link a score
equal to the cumulative function of its volume
throughout the event. A very basic filtering step is
implemented by a manually chosen volume threshold that
is only meant to exclude the noise of links that did not
trigger any interest at all in the audience, and can be
considered to be background noise (e.g. those links
with only one appearance). Such a threshold should
be set according to the general volume of an event:
we heuristically tune these thresholds after extracting
from the database aggregated statistics regarding the
volume of these links.</p>
        <p><sup>2</sup>https://developers.facebook.com/docs/opengraph/</p>
        <p><sup>3</sup>https://dev.twitter.com/docs/cards</p>
        <p><sup>4</sup>An initiative of the Make-A-Wish Foundation for a child
affected by leukemia that attracted the attention of
social media:
http://abcnews.go.com/US/batkids-maketransforming-san-francisco-gotham/story?id=20899254;
President Obama posted a Vine response at
https://vine.co/v/htbdjZAPrAX</p>
        <p>A link that has been shared at a nearly constant rate
during the analyzed time range will be more likely to
appear in our timeline than a link that reached a very
high volume at a particular point in time. The display
time for links ranked by this LSP is chosen to be the
time of the earliest appearance that passed the
elementary filtering (e.g. the second appearance of a link).
Even if the precision of this parameter will suffer from
setting very selective volume filters, we are assuming
that high-volume events imply a faster growth of the
volume of relevant links, thus introducing only smaller
delays. We expect this method to produce more
robust rankings. However, this LSP will take longer to
recognize important links as the event unfolds, and
results will not be as reliable while the event is happening
as when it is over.</p>
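        <p>The volume based scoring described in this section can be illustrated with a short Python sketch. It is a simplified reading of the text and not the code of the system: the function name and the (url, timestamp) input format are hypothetical, the threshold is assumed to be a positive integer, and the display time is taken to be the appearance at which a link first reaches the threshold.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict


def volume_lsp(appearances, threshold):
    """Score each link by its cumulative number of appearances.

    `appearances` is an iterable of (url, timestamp) pairs. Links with fewer
    than `threshold` appearances are dropped. The display time is the time of
    the appearance at which the link first reaches the threshold.
    """
    times = defaultdict(list)
    for url, timestamp in appearances:
        times[url].append(timestamp)

    ranked = []
    for url, stamps in times.items():
        if len(stamps) < threshold:
            continue
        stamps.sort()
        ranked.append(
            {"url": url, "score": len(stamps), "display_time": stamps[threshold - 1]}
        )
    # Highest cumulative volume first; ties broken by earliest display time.
    ranked.sort(key=lambda item: (-item["score"], item["display_time"]))
    return ranked
```
]]></preformat>
        <p>With a threshold of 2, for instance, a link shared at times 10, 15 and 30 receives a score of 3 and a display time of 15, the time of its second appearance.</p>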
      </sec>
      <sec id="sec-4-2">
        <title>Velocity based LSP</title>
        <p>The velocity based LSP computes links' scores as the
appearances volume reached by a link within one
decider processing time window. The decider's time
windows start every 20 minutes and are 30 minutes long,
thus allowing a 10-minute overlap between consecutive
windows.</p>
        <p>This LSP implements the same filter mechanism
described in Section 4.1, with the only difference that the
threshold is compared with the current time window's
volume. The display time is defined as the first time
a link appearance survives the filtering
in a time window. However, the score is always
updated to the highest volume the link has reached
within a window.</p>
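        <p>A minimal Python sketch of the velocity based scoring follows. It uses the window parameters given above (a new 30-minute window every 20 minutes) but is otherwise an assumption-laden illustration: the function name, the input format with timestamps in seconds, the half-open window boundaries and the exact choice of display time are not taken from the system.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict

WINDOW_STEP = 20 * 60    # a new window starts every 20 minutes (in seconds)
WINDOW_LENGTH = 30 * 60  # each window is 30 minutes long (in seconds)


def velocity_lsp(appearances, threshold, start, end):
    """Score each link by its highest volume inside any single window.

    `appearances` is an iterable of (url, timestamp) pairs with timestamps in
    seconds; `start` and `end` bound the event. Windows are half-open
    intervals [window_start, window_start + WINDOW_LENGTH).
    """
    appearances = sorted(appearances, key=lambda pair: pair[1])
    results = {}
    window_start = start
    while window_start < end:
        window_end = window_start + WINDOW_LENGTH
        in_window = defaultdict(list)
        for url, timestamp in appearances:
            if window_start <= timestamp < window_end:
                in_window[url].append(timestamp)
        for url, stamps in in_window.items():
            if len(stamps) < threshold:
                continue
            if url not in results:
                # Assumed display time: the appearance that reaches the
                # threshold in the first window where the link qualifies.
                results[url] = {"url": url, "score": 0,
                                "display_time": stamps[threshold - 1]}
            # The score always tracks the highest volume seen in a window.
            results[url]["score"] = max(results[url]["score"], len(stamps))
        window_start += WINDOW_STEP
    return sorted(results.values(), key=lambda item: -item["score"])
```
]]></preformat>
        <p>Unlike the cumulative score of Section 4.1, a link whose appearances are spread thinly over the whole event never qualifies here, while a short burst inside one window is enough to be selected.</p>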
        <p>The velocity based method will recognize an
important link as soon as it quickly grows in volume,
regardless of what happened throughout the rest of the event,
thus representing a better choice for real-time use. On
the other hand, this system can also produce noisier
results.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Front-end Interface</title>
      <p>The results of the resource extraction process can be
displayed in a simple web interface forming a
"storyline": the list of resources is sorted chronologically
according to a display time field defined by the LSP.
Every resource is represented by its title, description
and an image extracted from the referenced page. We
expect this arrangement to automatically provide the
reader with a narration of what was happening. This
has not been evaluated though. A possible
evaluation methodology could consist in generating various
timelines according to different thresholds. First, users
could be prompted to answer questions related to their
understanding of the event's chronological narration or
the quality of the storyline using a Likert scale.
Second, user clicks on timelines could be collected in order
to get insights on the number of interesting links
included in those summaries. We leave this study as
future work.</p>
      <p>A filtering functionality allows a user to select a
cutoff score for the links to be visualized. When clicking
on a link, that link is marked as false-positive. This
marking functionality is used to plot the number of
links satisfying a certain volume threshold versus the
true-positives satisfying the same requirements. This
interface also draws a pie chart, representing the source
domains of the links displayed (taking into account the
filtering parameter). Figure 3 shows a simple
example of a storyline. This example uses the information
scraped for each resource to create a visual
representation of the content, and an "information feeling"
for users to decide whether they would want to click
through or not.</p>
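      <p>The data behind the plot of selected links versus true-positives can be computed as in the following Python sketch. The function name and the input structures (a mapping from URL to score and a set of URLs marked as false-positive) are hypothetical and only meant to make the counting explicit.</p>
      <preformat><![CDATA[
```python
def threshold_sweep(links, false_positives, thresholds):
    """For each threshold, count the selected links and the true positives.

    `links` maps a URL to its score; `false_positives` is the set of URLs
    that were manually marked through the interface.
    """
    rows = []
    for threshold in thresholds:
        selected = [url for url, score in links.items() if score >= threshold]
        true_positives = [url for url in selected if url not in false_positives]
        rows.append((threshold, len(selected), len(true_positives)))
    return rows
```
]]></preformat>
      <p>Each returned row holds a threshold, the number of links that satisfy it, and how many of those were not marked as false-positive.</p>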
    </sec>
    <sec id="sec-6">
      <title>Experiments</title>
      <p>In this section, we describe the experiments made to
evaluate the efficiency of the storylines built using our
method. We aim to compare the results obtained with
the two different LSP ranking functions. We have
selected three events that feature very diverse
characteristics:</p>
      <sec id="sec-6-1">
        <p>Kanye West at Barclays Center. A concert of
the "Resurrects Yeezus" tour that took place in a
new New York venue. This event has a relatively
low volume of 794 links shared on Twitter during
5 hours. We will refer to it as "#kanye" (Table 1).</p>
        <p>Tech Crunch Disrupt. The 2013 conference held in
San Francisco. This is a longer event that lasted [...]</p>
        <p>San Francisco's Batkid. An event organized by the
Make-A-Wish Foundation that transformed San
Francisco into Gotham City for one day, letting a
child affected by leukemia help Batman fight
crime. This is a very high volume event, with
8842 links shared during a timespan of 9 hours.
We will refer to it as "#SFBatkid" (Table 3).</p>
        <p>For each of those three events, we run the two LSP
scoring functions described in Section 4 using the
minimum threshold possible: 1 for #Kanye and
#TCDisrupt, and 4 for #SFBatkid in order to obtain a number
of links that could be handled by the JavaScript
interface. For #SFBatkid, we extracted the data regarding
lower thresholds by directly querying the database.</p>
        <table-wrap>
          <label>Table 1</label>
          <table>
            <thead>
              <tr><th>#</th><th>Extracted Title</th><th>URL</th></tr>
            </thead>
            <tbody>
              <tr><td>1</td><td>Kanye West-Bound 2 (Explicit)</td><td>http://www.youtube.com/watch?v=BBAtAM7vtgc</td></tr>
              <tr><td>2</td><td>sashahecht's video on Instagram</td><td>http://instagram.com/p/g65f0pvrJ_/</td></tr>
              <tr><td>3</td><td>angelonfire's photo on Instagram</td><td>http://instagram.com/p/g68ZQ1vXXf/</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Media shared on social networks are usually
non-permanent and many of the links analyzed by our
system [...] When collecting data to compare the number of
true-positives against the number of false-positives, we
first filter the displayed links according to a volume
threshold which is high enough to select fewer than
100 links (a number that we considered to be optimal for
obtaining an enjoyable storyline), and then we
proceed with the false-positives marking process. We
always mark compromised resources as false-positives.
We do not necessarily mark duplicates
as false-positives. During our experiments, we
also look at the variety of internet domains
generating all the links, and the way they vary according to
different filter settings. This approach provides useful
information for automatically improving the quality
of the storylines, for example, by emphasizing the
diversity of the sources.
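</p>
        <table-wrap>
          <label>Table 2</label>
          <table>
            <thead>
              <tr><th>#</th><th>Extracted Title</th></tr>
            </thead>
            <tbody>
              <tr><td>1</td><td>TechCrunch Disrupt Kicks Off with "Titstare" App and Fake Masturbation</td></tr>
              <tr><td>2</td><td/></tr>
              <tr><td>3</td><td/></tr>
              <tr><td>4</td><td/></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>
The data used in our experiments is provided by Seen<sup>5</sup>,
a service offered by Mahaya Inc. that aims at
organizing social media by building automatic summaries of
events [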
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. An event can be defined on this system by
specifying a set of hashtags and a time range. Once
these parameters are known, the contents database
(Figure 1) is constantly updated with new raw data
gathered from different social platforms at regular
intervals in time until the end of the event. No
particular filtering on language is performed, making Seen a
language-agnostic service for collecting tweets.
        </p>
        <p>Events' metadata is saved to a collection in a
different database (the web databases layer in Figure 1).
Initially, a user creating an event in Seen specifies
basic metadata for the event being tracked (dates, title,
hashtags). Later on, this description is automatically
enriched with meaningful information inferred by the
service. In our experiment, we only use the metadata
specified by the user to retrieve the right subset of
raw contents from the contents database. However, we
only select events that already exist in the platform,
so that we can first acknowledge the general
characteristics of each of them in terms of data flow.</p>
        <p><sup>5</sup>http://seen.co</p>
        <table-wrap>
          <label>Table 3</label>
          <table>
            <thead>
              <tr><th>#</th><th>Extracted Title</th></tr>
            </thead>
            <tbody>
              <tr><td>1</td><td>SF Morphs Into Gotham City for "Batkid" Battling Leukemia | NBC Bay Area</td></tr>
              <tr><td>2</td><td/></tr>
              <tr><td>3</td><td/></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>We provide a qualitative description of the volume
based narrations obtained for these three events.
Before conducting this analysis, we filtered the links using a
threshold that reduces the number of items
composing a storyline to under 100.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Kanye West at Barclays Center</title>
        <p>We chose a threshold of 2 in order to obtain a
storyline of 43 items. Data appears to be noisy, since many
links are related to the artist in general instead of this
event in particular. Some examples are links to his
music video that apparently came out in the same period
when the concert took place (e.g. the first link in
Table 1). This problem affects the timeline until around
8 PM. At that point, the concert must have actually
started, because between 8 PM and 11 PM the
storyline is only populated by Instagram posts depicting various
moments of the performance. This strong visual
component is a feature that probably characterizes most
performance-related events.</p>
      </sec>
      <sec id="sec-6-3">
        <title>Tech Crunch Disrupt</title>
        <p>All items were filtered with a threshold of 5. The 66
links telling the story of the #TCDisrupt conference
are particularly effective in describing what happened
at different levels of detail. They are often news
articles coming with very illustrative images, titles and
descriptions. The first day of the conference contains
most of the links, because it includes a number of
general references to the event itself and to the hype for
its beginning. The time references seem to be correct:
for example, the first item is about the first application
that was pitched, and according to some following
resources, this project caused a sexist scandal, requiring
Tech Crunch to officially apologize (see Table 2).</p>
        <p>Further down in the timeline, the links/day
ratio shrinks, which increases the storyline quality as it
mostly includes specific articles about the
presentations held on days two and three in chronological
order. In particular, the last item closes the storyline
by declaring the winner of #TCDisrupt (Row 4 in
Table 2).</p>
      </sec>
      <sec id="sec-6-4">
        <title>San Francisco's Batkid</title>
        <p>This event has a very particular configuration: it
contains a huge amount of content (tens of thousands of
tweets) shared in a relatively short period of time,
mostly as an echo response to mass media. The
resulting narration, when filtered down to a readable
length of 99 items (using a threshold of 32), is very
general and redundant. It is mostly composed of
articles that describe the event as a whole. Instant media
(e.g. Instagrams) has been drowned out by the huge
number of re-shares achieved by sources such as CNN and
NBC. As a result, this timeline is almost free of
noise but it is much less effective for narrating the
event (see Table 3).</p>
      </sec>
      <sec id="sec-6-6">
        <title>Selection and Ranking Quality of Different LSPs</title>
        <p>We filtered the storylines resulting from the velocity
and volume based processors until we obtained fewer
than 100 items. We marked the false-positive results
and we plotted the number of results and the number
of true-positives obtained while increasing the
threshold. The first important difference we noticed between
the two LSPs is the quality of the ranking: while the
velocity based LSP tends to concentrate most of the
results in the left-most part of the plot (thus in the
lower part of the ranking), the volume based one
distributes the results better. This strong difference can
be seen in Figure 4.</p>
        <p>Figure 4 also indicates that our method is efficient in
selecting true-positive results when filtering the output
of the volume-based LSP. This was not observed with
the other two events, where the performance difference
between the two LSPs in this respect was
negligible. A characteristic of the plot made on
velocity-based results is that there is usually a very limited
number of highly referenced links that are highlighted
by their distance from lower-ranked resources. This
can be seen in Figure 4 as well as in Figure 5 where the
top ranked item is a Vine of President Barack Obama
congratulating the young superhero. Those few
outliers with very high values happened to
be, in all our experiments, true positives.</p>
        <p>The same characteristic is common to all the
top ranked elements obtained with the velocity-based
method, thus making this selection system a good
option for choosing elements to recommend as interesting
highlights.</p>
        <p>The analysis of the source domains underlined how
different categories of events can have a very
different "fingerprint". Figure 6 clearly shows how a
performance-centered event mostly received data from
Instagram and YouTube, while a breaking news event
and a technological conference are described by a wider
variety of newspapers and magazines operating in the
respective fields. This information could be used to
automatically detect events' categories, or to
implement a smarter ranking function that assigns different
importance to links coming from different sources,
when the category of the event is known. We also
noticed how some of the biggest generators of social
content (i.e. Instagram, Facebook) tend to disappear
from the pie chart when raising the volume threshold
above one. This underlines the importance of an
additional dimension, the volume, in defining a "category
fingerprint" in this particular space of the source
domains.</p>
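        <p>The source-domain distribution discussed here can be computed with a few lines of Python. The sketch below is illustrative only: the function name, the mapping from URL to score, and the folding of a leading "www." into the bare domain are assumptions, not details of the system.</p>
        <preformat><![CDATA[
```python
from collections import Counter
from urllib.parse import urlsplit


def domain_fingerprint(links, threshold=1):
    """Share of each source domain among links whose score reaches `threshold`.

    `links` maps a canonical URL to its score (e.g. its volume).
    """
    counts = Counter()
    for url, score in links.items():
        if score >= threshold:
            host = urlsplit(url).netloc.lower()
            if host.startswith("www."):  # fold www.example.com into example.com
                host = host[4:]
            counts[host] += 1
    total = sum(counts.values())
    if not total:
        return {}
    return {domain: count / total for domain, count in counts.items()}
```
]]></preformat>
        <p>Calling the function with increasing thresholds shows how the composition changes with volume, which is the additional dimension mentioned above.</p>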
        <p>This analysis can also help to automatically identify
official sources for a given event. In fact, if the category
fingerprint of an event is given, official sources can
sometimes be identified as outliers: this can be clearly
seen in Figure 7 where the "techcrunch" domain has
a remarkably outsized cardinality compared with the
other ones.</p>
        <p>Our original goal was to explore the data produced
by the system we have implemented. Therefore, while
the results we are reporting are constrained by the
settings we have chosen, they serve well the purpose
of unveiling interesting patterns that should be further
investigated by experimenting on larger sets of events.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusion and Future Work</title>
      <p>In this paper, we presented a system for transforming
links shared on social media into narratives of what
happened at specific events. These narratives are
composed of the set of pages referenced in these links,
filtered according to a score. We evaluated the storylines
obtained and we extracted some information that can
help design new methods to improve their quality.
The collection of resources produced by this method
calls for the analysis of social activity in a new space,
where the single tweets shared are aggregated into a
feature of the resources they reference. Not only can this
approach produce meaningful real-time narratives
(as these resources become very descriptive thanks to
the page scraping process), it can also help
extract useful insights, for example the composition of the
source domains.</p>
      <p>We noticed how different characteristics of an event
can affect the efficiency of this system: while we
obtained a good narration of a technology conference,
the results obtained for a breaking news event were
more disappointing. Further research should be
conducted on this topic, in order to define better
score functions tailored to the characteristics of each
category of event (e.g. considering alternatives to
volume based score functions for a breaking news event).
We also defined a useful feature based on the evolution
of the composition of the source domains with
increasing volume thresholds. This can help identify the
category of an event and it could also provide the
information necessary to automatically identify official
sources, when these are particularly active on social
channels.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>The research leading to this paper was partially
supported by the European Union's 7th Framework
Programme via the projects LinkedTV (GA 287911) and
MediaMixer (GA 318101).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Iter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Naaman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Gravano</surname>
          </string-name>
          .
          <article-title>Identifying Content for Planned Events Across Social Media Sites</article-title>
          .
          <source>In 5th International ACM Conference on Web Search and Data Mining</source>
          , Seattle, Washington, USA,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Diakopoulos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Naaman</surname>
          </string-name>
          .
          <article-title>Unfolding the Event Landscape on Twitter: Classification and Exploration of User Categories</article-title>
          .
          <source>In 15th ACM Conference on Computer Supported Cooperative Work</source>
          , Seattle, Washington, USA,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V.</given-names>
            <surname>Driessen</surname>
          </string-name>
          . Redis Queues Python Library. http://python-rq.org,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Champ</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Letessier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hervé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Buisson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Viaud</surname>
          </string-name>
          .
          <article-title>Visual-Based Transmedia Events Detection</article-title>
          .
          <source>In 20th ACM International Conference on Multimedia (MM'12)</source>
          , Nara, Japan,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.-R.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sundaram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Kelliher</surname>
          </string-name>
          .
          <article-title>Temporal Patterns in Social Media Streams: Theme Discovery and Evolution Using Joint Analysis of Content and Context</article-title>
          .
          <source>In IEEE International Conference on Multimedia and Expo (ICME'09)</source>
          , pages
          <fpage>1456</fpage>
          –
          <lpage>1459</lpage>
          , Piscataway, NJ, USA,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V.</given-names>
            <surname>Milicic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Redondo García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rizzo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          .
          <article-title>Tracking and Analyzing The 2013 Italian Election</article-title>
          .
          <source>In 10th Extended Semantic Web Conference (ESWC'13)</source>
          , Demo Session, Montpellier, France,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V.</given-names>
            <surname>Milicic</surname>
          </string-name>
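          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rizzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Redondo García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          , and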
          <string-name>
            <given-names>T.</given-names>
            <surname>Steiner</surname>
          </string-name>
          .
          <article-title>Live Topic Generation from Event Streams</article-title>
          .
          <source>In 22nd World Wide Web Conference (WWW'13)</source>
          , Demo Session, Rio de Janeiro, Brazil,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Rizzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Steiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Verborgh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Redondo García</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. V.</given-names>
            <surname>de Walle</surname>
          </string-name>
          .
          <article-title>What Fresh Media Are You Looking For? Retrieving Media Items from Multiple Social Networks</article-title>
          .
          <source>In International Workshop on Socially-aware Multimedia (SAM'12)</source>
          , Nara, Japan,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Shamma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kennedy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E. F.</given-names>
            <surname>Churchill</surname>
          </string-name>
          .
          <article-title>Conversational Shadows: Describing Live Media Events Using Short Messages</article-title>
          .
          <source>In 4th International Conference on Weblogs and Social Media (ICWSM'10)</source>
          , Washington, USA,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Shamma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kennedy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E. F.</given-names>
            <surname>Churchill</surname>
          </string-name>
          .
          <article-title>Peaks and Persistence: Modeling the Shape of Microblog Conversations</article-title>
          .
          <source>In International Conference on Computer Supported Cooperative Work (CSCW'11)</source>
          , pages
          <fpage>355</fpage>
          –
          <lpage>358</lpage>
          , New York, NY, USA,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Steiner</surname>
          </string-name>
          .
          <article-title>A Meteoroid on Steroids: Ranking Media Items Stemming from Multiple Social Networks</article-title>
          .
          <source>In 22nd World Wide Web Conference (WWW'13)</source>
          , Demo Session, Rio de Janeiro, Brazil,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Štajner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thomee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-M.</given-names>
            <surname>Popescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pennacchiotti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Jaimes</surname>
          </string-name>
          .
          <article-title>Automatic Selection of Social Media Responses to News</article-title>
          .
          <source>In 19th International ACM Conference on Knowledge Discovery and Data Mining (KDD'13)</source>
          , Chicago, Illinois, USA,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Tate</surname>
          </string-name>
          .
          <article-title>The Next Big Thing You Missed: Recreate Live Events with Twitter and Instagram</article-title>
          . http://www.wired.com/business/2013/11/seen-is-real-life-instant-replay/
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Twitter</surname>
          </string-name>
          .
          <article-title>Twitter's REST API Documentation - Tweets</article-title>
          . https://dev.twitter.com/docs/platform-objects/tweets,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>V.</given-names>
            <surname>Zsombori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Frantzis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Guimarães</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Ursu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cesar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Craigie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. C. A.</given-names>
            <surname>Bulterman</surname>
          </string-name>
          .
          <article-title>Automatic Generation of Video Narratives from Shared UGC</article-title>
          .
          <source>In 22nd ACM Conference on Hypertext and Hypermedia</source>
          , Eindhoven, The Netherlands,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>