      Extracting Resources that Help Tell Events’ Stories

    Carlo Andrea Conte                      Raphaël Troncy                             Mor Naaman
        Mahaya Inc.                           EURECOM                          Cornell Tech and Mahaya, Inc.
      New York, USA                           Biot, France                            New York, USA
    carloante@msn.com                  raphael.troncy@eurecom.fr                 mor.naaman@cornell.edu



                        Abstract

Social media platforms constitute a valuable source of information regarding real-world happenings. In particular, user generated content on mobile-oriented platforms like Twitter allows for real-time narrations thanks to the instantaneous nature of publishing. A common practice for users is to include in their tweets links pointing to articles, media files and other resources. In this paper, we are interested in how the resources shared in a stream of tweets for an event can be analyzed, and how they can help tell the event's story. We describe a system that extracts, resolves, and eventually filters the resources shared in tweet content according to two different ranking functions. We are interested in how these two ranking functions perform (with respect to speed and accuracy) for discovering important and relevant resources that will tell the event story. We describe an experiment on a sample set of events where we evaluate those functions. We finally comment on the stories we obtained and provide statistics that give meaningful insights for improving the system.

1    Introduction

For many events and real-world happenings, Twitter, Facebook, Instagram, and other social media platforms provide a continuous stream of user-contributed messages and media. Very often, the messages posted include hyperlinks pointing to content outside the platform where the message was originally posted. The nature of these external resources varies: they may be images, news articles, real-time video streams and many other types of content. Our goal is to identify resources shared via hyperlinks in social media streams that are highly relevant and important to an ongoing event. Our next goal is to extract these relevant resources and assemble them into a storyline, with the objective of producing a rich narration of the event using a very diverse set of media.

In this paper, we describe a system that aims at extracting a timeline of resources from a stream of tweets about an event (we observe that tweets can, and often do, contain links pointing to other social networks such as Instagram, YouTube, etc.). These links are ranked and filtered in near real-time, in order to identify relevant and valuable information as soon as possible after being shared. In addition, the system extracts descriptive metadata from the referenced pages that can be used to represent the resource in the event timeline in an intelligible way. The collection of items is normalized according to the referenced resources, so that the relevance score of a resource results from the aggregation of the social items referencing it. We eventually analyse and compare the storylines that are produced in order to identify particular characteristics that could help improve future versions of this system. Our contributions include: a) a general architecture for extracting resources from a stream of tweets; b) the development of two scoring methods to rank the importance of those resources in contributing to the storyline of an event; and c) the representation of meaningful statistics from the resulting story.

The rest of this paper is organized as follows. In Section 2, we present related work. We then describe our generic system architecture for extracting resources from a stream of tweets (Section 3). In Section 4, we present two ranking methods based on volume and velocity. We show a simple interface we have developed to visualize the storylines that are created based on some filtering parameters (Section 5).

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: S. Papadopoulos, P. Cesar, D. A. Shamma, A. Kelliher, R. Jain (eds.): Proceedings of the SoMuS ICMR 2014 Workshop, Glasgow, Scotland, 01-04-2014, published at http://ceur-ws.org
We discuss the results of our experiments in Section 6. Finally, we conclude and outline future work in Section 7.

2    Related Work

The system described in this paper is one of many services that gather social media shared about particular events. These systems generally aim to provide a comprehensive view of an event in order to support the user in following or understanding it. The primary motivation of our system is to rapidly mine huge volumes of social media content so that a user receives relevant information in a timely manner, for example for news reporting or stock market investing.

There has been a lot of work on the exploitation of user generated content and social media for telling events' stories. In this paper, we focus on collecting and mining resources to compose in real-time a storyline about a specific event. Gathering and analyzing content is the main focus of [1], where the authors provide techniques for automatically identifying posts shared by users on social platforms for a-priori known events, and describe different querying techniques. In [2], the focus is shifted towards identifying the users contributing to the social content available on a particular happening. In [12], the authors select the best tweets for a news article, a related (but somewhat reversed) problem to the one we are presenting here.

Closer to our work, Shamma et al. [9] have looked at summarization and extraction of stories from streams of media, including for example [5, 10] amongst others. The user-friendly representation of these resources is addressed in [15], where videos provided by the users are assembled into new video streams personalized for each viewer, while in [8], both the retrieval and clustering of media items shared on social networks are assembled in the so-called MediaFinder application. In this paper, we address the problem of building a storyline while the event is progressing.

3    Architecture

In this section, we describe the overall architecture of the system that extracts and ranks resources that are linked from Twitter messages about an event. The system we propose needs to run in real-time, in order to build storylines of events as they are unfolding. Our goal is to be able to identify key resources with the smallest delay possible. These requirements impose the adoption of an efficient, flexible concurrency process that is easily scalable according to the data flow. In Section 6, we report on tests conducted on three events that happened in the past, with the aim of simulating the same near real-time algorithm that would be applied to ongoing events.

For the purpose of this work, we assume our input is a stream of Twitter messages that has been identified as relevant to the event being tracked. In our case, these streams are generated by using a hashtag that is associated with each event, but the system described here is agnostic to how the Twitter content is identified. We assume that a separate process retrieves the Twitter content and keeps updating a database at regular intervals with raw data from Twitter.

The different computational steps performed in the resource extraction process are:

 1. Extracting links from a collection of tweets for an event,

 2. Resolving these links to their canonical form in order to identify duplicate resources,

 3. Ranking links and applying a first basic filter,

 4. Collecting useful metadata from the pages referenced by the links which have been selected,

 5. Outputting a timeline of resources that can be further filtered using a simple web interface.

We now detail the general architecture and the major building blocks of this system (Figure 1). The complete processing of each link includes two very severe bottlenecks that require network calls that may take a couple of seconds to complete: the resolution of the URL, and the scraping of the referenced pages. In order to overcome this limitation and to build a more efficient system, we split this processing into two subsystems that are intended to run in parallel.

Figure 1: Architecture of the link processing system
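The split into cooperating subsystems can be sketched as worker loops around a shared jobs queue. This is a minimal single-process sketch, not the production system: the in-memory dictionaries stand in for the links mappings and links appearances databases of Figure 1, and `resolve_remote` is a hypothetical stand-in for the slow network call that expands shorteners and follows redirects.

```python
import queue
import threading

# Illustrative in-memory stand-ins for the stores shown in Figure 1.
link_mappings = {}       # raw URL -> canonical URL
link_appearances = []    # (canonical URL, tweet id, event id)
resolver_queue = queue.Queue()

def resolve_remote(url):
    # Hypothetical stand-in for the network call that follows
    # redirects; here it simply strips the query string.
    return url.split("?")[0]

def dispatcher(tweets, event_id):
    """Extract links from tweet entities: already-known URLs are
    recorded directly, unknown ones are handed to the resolver workers."""
    for tweet in tweets:
        for url in tweet["entity_urls"]:
            if url in link_mappings:
                link_appearances.append(
                    (link_mappings[url], tweet["id"], event_id))
            else:
                resolver_queue.put((url, tweet["id"], event_id))

def resolver_worker():
    """Resolve queued URLs to canonical form and record the appearance;
    several such workers can run in parallel on the same queue."""
    while True:
        item = resolver_queue.get()
        if item is None:  # poison pill: stop this worker
            break
        url, tweet_id, event_id = item
        canonical = link_mappings.setdefault(url, resolve_remote(url))
        link_appearances.append((canonical, tweet_id, event_id))

tweets = [{"id": 1, "entity_urls": ["http://ex.org/a?utm_source=tw"]},
          {"id": 2, "entity_urls": ["http://ex.org/a?utm_source=tw"]}]
dispatcher(tweets, event_id="kanye")
resolver_queue.put(None)
worker = threading.Thread(target=resolver_worker)
worker.start()
worker.join()
```

Scaling the resolution stage then amounts to starting more worker threads (or processes) on the same queue, which mirrors the flexibility argument made below for the jobs queue.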
The first issue addressed by the system is the fact that, due to URL shorteners and non-canonical URL formats, links to the same resource (e.g. http://www.example.com/some_article) can take different forms: a bit.ly link, a URL that includes a query string, etc. The links dispatcher is the first step of this processing chain. It retrieves all the event tweets for a recent temporal window from the content database. After loading the raw data, the links dispatcher extracts every link from the tweets' entities [14], and queries the links mappings data to check if the URL has already been resolved to a canonical form. If the link has not yet been resolved, the system adds the URL to a link resolutor queue. If it has already been resolved, the system adds the link, together with the data about the tweet, to the links appearances database. The links resolutor queue is implemented as a jobs queue [3]. The advantage of a jobs queue approach is that many worker instances can be run at the same time, and their number and behavior can be adjusted according to the workload. This allows for great flexibility in the way the system can be scaled depending on the amount of data. All the URLs in the queue are resolved by the links resolutor function. This function first looks up the URL in the links mappings. If it is not found, the URL is resolved in order to save a new mapping. For every link which has been successfully resolved, there is an entry in the links appearances dataset; this dataset is therefore a collection of all the links which have been resolved. Every item contains information about the URL, the tweet that contains this link, and the event identifier to which the tweet and the URL are attached.

The second issue is due to the huge amount of resources that are returned, which requires deciding and filtering what resources will actually be scraped and used to build a storyline. Links appearances are periodically accessed by the decider, which is responsible for ranking the resources and filtering them as needed. The computation is always done on a sliding temporal window whose fixed size is a parameter. The scoring system relies on links score processors for its decision process: different score processors can be implemented for testing different score functions. We detail the links score processors (LSPs) used for our experiments in Section 4. The LSPs we use implement a very basic filtering in order to enrich the event links with features useful for ranking. Additional filtering possibilities are provided within our front-end interface.

The third issue is caused by the transformation of a set of URLs into a more human-readable representation. This requires the scraping of additional information from the pages pointed to by the selected links. If a link is selected for publication, the decider queries the pages metadata database to check whether the system already has metadata available for that link. If no metadata is available, the link appearance is added to the Links Metadata Queue, otherwise it is saved together with its metadata as an event link. The Links Metadata Queue is processed by the metadata scraper. This function extracts the domain from the URL and selects a particular scraper class accordingly: it loads the referenced page and extracts pieces of information from it (e.g. a title, a description and a representative image). Different scrapers look for different tags in the DOM structure, as different web sites usually expose different information in different ways. In our tests, we use a generic scraper that collects information stored in the Open Graph (https://developers.facebook.com/docs/opengraph/) and Twitter Cards (https://dev.twitter.com/docs/cards) meta-tags. Only the links for which enough metadata is found are saved as event links, and the pages metadata database is updated accordingly.

The final output of the system is stored in the event links database. This dataset holds the final set of resources, together with their score and other attributes inferred by the link score processor and used to rank the resource (e.g. total volume, highest volume in a time-window duration). In addition, each record contains the metadata extracted from the referenced page.

4    Ranking Methods

In this section, we describe two methods for ranking the links that appear in the Twitter stream. These methods have to enable a robust detection of relevant and important links with the smallest delay possible from their publication. The reason why we want to filter out irrelevant resources is the quantity of data that is usually shared: it usually largely outgrows the number of items that can be used for building a story.

4.1    Volume based LSP

Organizing a collection of every single appearance of every link gives us the possibility to obtain information about the volume of re-shares reached by a link during an event. As an example, Figure 2 shows the number of appearances of the link pointing to president Obama's Vine for Batkid (an initiative of the Make-A-Wish Foundation for a child affected by leukemia that attracted the attention of social media: http://abcnews.go.com/US/batkids-make-transforming-san-francisco-gotham/story?id=20899254; President Obama posted a Vine response at https://vine.co/v/htbdjZAPrAX).

The volume based LSP assigns to every link a score equal to the cumulative function of its volume throughout the event. A very basic filtering step is implemented by a manually chosen volume threshold that is only meant to exclude links that did not trigger any interest at all in the audience and can be considered background noise (e.g. those links with only one appearance).
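Both scoring functions reduce to counting appearances of a canonical link over time. The following is a minimal sketch, not Seen's implementation: it assumes appearances are (url, timestamp-in-seconds) pairs and uses the window step and length given for the velocity based LSP in Section 4.2.

```python
from collections import Counter

def volume_scores(appearances, threshold):
    """Volume based LSP: a link's score is the cumulative count of its
    appearances over the whole event; links under the threshold are
    treated as background noise and dropped."""
    counts = Counter(url for url, _ in appearances)
    return {url: n for url, n in counts.items() if n >= threshold}

def velocity_scores(appearances, threshold, step=20 * 60, length=30 * 60):
    """Velocity based LSP: the score is the highest appearance count the
    link reaches within any single window; windows open every `step`
    seconds and last `length` seconds (20 and 30 minutes by default)."""
    if not appearances:
        return {}
    t0 = min(t for _, t in appearances)
    t_end = max(t for _, t in appearances)
    best = Counter()
    start = t0
    while start <= t_end:
        window = Counter(url for url, t in appearances
                         if start <= t < start + length)
        for url, n in window.items():
            best[url] = max(best[url], n)
        start += step
    return {url: n for url, n in best.items() if n >= threshold}

# Appearances as (canonical URL, timestamp in seconds): link "a" is
# spread out over the event, link "b" spikes within a single window.
apps = [("a", 0), ("a", 100), ("a", 5000), ("b", 0), ("b", 10), ("b", 20)]
print(volume_scores(apps, threshold=2))    # both links pass: a -> 3, b -> 3
print(velocity_scores(apps, threshold=3))  # only the spiking link "b" passes
```

With this sample data the two functions already diverge: the volume score keeps both links, while the velocity score at the same overall count keeps only the link whose appearances are concentrated in one window, which is the behavior the two LSPs are designed to trade off.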
Figure 2: Volume of shares of president Obama's Vine video for Batkid

Such a threshold should be set according to the general volume of an event: we heuristically tune these thresholds after extracting from the database aggregated statistics regarding the volume of these links.

A link that has been shared at a nearly constant rate during the analyzed time range will be more likely to appear in our timeline than a link that reached a very high volume at a particular point in time. The display time for links ranked by this LSP is chosen to be the time of the earliest appearance that passed the elementary filtering (e.g. the second appearance of a link). Even if the precision of this parameter suffers from setting very selective volume filters, we assume that high-volume events imply a faster growth of the volume of relevant links, thus introducing only smaller delays. We expect this method to produce more robust rankings. However, this LSP will take longer to recognize important links as the event unfolds, and results will not be as reliable while the event is happening as when it is over.

4.2    Velocity based LSP

The velocity based LSP computes a link's score as the appearance volume reached by the link within one decider processing time window. The decider's time windows occur every 20 minutes and are 30 minutes long, thus allowing a 10 minute overlap between consecutive windows.

This LSP implements the same filter mechanism described in Section 4.1, with the only difference that the threshold is compared with the current time window's volume. The display time is defined as the first time a link appearance survives the filtering in a time window. However, the score is always updated to the highest volume the link has reached within a window.

The velocity based method will recognize an important link as soon as it quickly grows in volume, regardless of what happened throughout the rest of the event, thus representing a better choice for real-time use. On the other hand, this method can also produce noisier results.

5    Front-end Interface

The results of the resource extraction process can be displayed in a simple web interface forming a "storyline": the list of resources is sorted chronologically according to a display time field defined by the LSP. Every resource is represented by its title, description and an image extracted from the referenced page. We expect this arrangement to automatically provide the reader with a narration of what was happening. This has not been evaluated though. A possible evaluation methodology could consist in generating various timelines according to different thresholds. First, users could be prompted to answer questions related to their understanding of the event's chronological narration or the quality of the storyline using a Likert scale. Second, user clicks on timelines could be collected in order to get insights on the number of interesting links included in those summaries. We leave this study as future work.

A filtering functionality allows a user to select a cut-off score for the links to be visualized. When clicking on a link, that link is marked as a false-positive. This marking functionality is used to plot the number of links satisfying a certain volume threshold versus the true-positives satisfying the same requirements. The interface also draws a pie chart representing the source domains of the links displayed (taking into account the filtering parameter). Figure 3 shows a simple example of a storyline. This example uses the information scraped for each resource to create a visual representation of the content, and an "information feeling" for users to decide whether they would want to click through or not.

Figure 3: Front-end interface easing the process of extracting and visualizing data

6    Experiments

In this section, we describe the experiments made to evaluate the efficiency of the storylines built using our method. We aim to compare the results obtained with the two different LSP ranking functions. We have selected three events that feature very diverse characteristics:

Kanye West at Barclays Center. A concert of the "Resurrects Yeezus" tour that took place in a new New York venue. This event has a relatively low volume of 794 links shared on Twitter during 5 hours. We will refer to it as "#kanye" (Table 1).

Tech Crunch Disrupt. The 2013 conference held in San Francisco. This is a longer event that lasted
3 days (80 hours of content gathered) with a total of 1201 links. We will refer to this event as "#TCDisrupt" (Table 2).

San Francisco's Batkid. An event organized by the Make-A-Wish Foundation that transformed San Francisco into Gotham City for one day, letting a child affected by leukemia help Batman fight crime. This is a very high volume event, with 8842 links shared during a timespan of 9 hours. We will refer to it as "#SFBatkid" (Table 3).

For each of those three events, we ran the two LSP scoring functions described in Section 4 using the minimum threshold possible: 1 for #kanye and #TCDisrupt, and 4 for #SFBatkid, in order to obtain a number of links that could be handled by the JavaScript interface. For #SFBatkid, we extracted the data regarding lower thresholds by directly querying the database.

  1  Kanye West - Bound 2 (Explicit)
     http://www.youtube.com/watch?v=BBAtAM7vtgc
  2  sashahecht's video on Instagram
     http://instagram.com/p/g65f0pvrJ_/
  3  angelonfire's photo on Instagram
     http://instagram.com/p/g68ZQ1vXXf/

Table 1: Example of Links from "Kanye West at Barclays Center" by order of appearance

Media shared on social networks are usually non-permanent and many of the links analyzed by our system are broken. They can either trigger a 404 answer (in this case, they are discarded in the process, without affecting the final storyline presentation), or they can point to items which have been removed but for which there is still a page that may contain some metadata. In the latter case, those items will typically appear in the timeline with no description and/or with meaningless titles (e.g. "No Title"). It is important to mention that the older the event becomes, the more likely this type of compromised resource occurs.

When collecting data to compare the number of true-positives against the number of false-positives, we first filter the displayed links according to a volume threshold which is high enough to select fewer than 100 links (a number that we considered optimal for obtaining an enjoyable storyline), and then we proceed with the false-positive marking process. We always mark compromised resources as false-positives. We do not, however, necessarily mark duplicates as false-positives. During our experiments, we also look at the variety of internet domains generating all the links, and the way this varies according to different filter settings. This approach provides useful information for automatically improving the quality of the storylines, for example by emphasizing the diversity of the sources.
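The cut-off used here can itself be automated as the smallest volume threshold that keeps the storyline under the desired length. A small sketch with invented scores (only the 100-item cap comes from the text above):

```python
def smallest_threshold(scores, max_items=100):
    """Return the smallest integer threshold such that the number of
    links whose score passes it stays below `max_items`."""
    threshold = 1
    while sum(1 for s in scores.values() if s >= threshold) >= max_items:
        threshold += 1
    return threshold

# Invented distribution: 120 background-noise links shared once, plus a
# handful of popular mass-media links.
scores = {f"link{i}": 1 for i in range(120)}
scores.update({"cnn": 40, "nbc": 35, "vine": 30})
t = smallest_threshold(scores, max_items=100)
# With these scores the smallest workable threshold is 2, keeping 3 links.
```

This mirrors the manual procedure used in the experiments: raise the threshold until the displayed storyline drops below 100 items, then mark false-positives among the survivors.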
  1  TechCrunch Disrupt Kicks Off with "Titstare" App and Fake Masturbation
     http://valleywag.gawker.com/techcrunch-disrupt-kicks-off-with-titstare-app-and-fa-1274394925
  2  An Apology From TechCrunch | TechCrunch
     http://techcrunch.com/2013/09/08/an-apology-from-techcrunch/
  3  Meet 'Titstare,' the Tech World's Latest 'Joke' from the Minds of Brogrammers - The Wire
     http://www.thewire.com/technology/2013/09/titstare-tech-worlds-latest-brogrammer-joke-techcrunch-disrupt/69171/
  4  And The Winner Of TechCrunch Disrupt SF 2013 Is... Layer! | TechCrunch
     http://techcrunch.com/2013/09/11/and-the-winner-of-techcrunch-disrupt-sf-2013-is-layer/

Table 2: Example of Links from "Tech Crunch Disrupt" by order of appearance

6.1    Dataset

The data used in our experiments is provided by Seen (http://seen.co), a service offered by Mahaya Inc. that aims at organizing social media by building automatic summaries of events [13]. An event can be defined on this system by specifying a set of hashtags and a time range. Once these parameters are known, the contents database (Figure 1) is constantly updated with new raw data gathered from different social platforms at regular intervals until the end of the event. No particular filtering on language is performed, making Seen a language agnostic service for collecting tweets.

Events' metadata is saved to a collection in a different database (the web databases layer in Figure 1). Initially, a user creating an event in Seen specifies basic metadata for the event being tracked (dates, title, hashtags). Later on, this description is automatically enriched with meaningful information inferred by the service. In our experiment, we only use the metadata specified by the user to retrieve the right subset of raw contents from the contents database. However, we only select events that already exist in the platform, so that we can first acknowledge the general characteristics of each of them in terms of data flow.

  1  SF Morphs Into Gotham City for "Batkid" Battling Leukemia - NBC Bay Area
     http://www.nbcbayarea.com/news/local/SF-Morphs-Into-Gotham-City-for-Batkid-Battling-Leukemia-232054521.html
  2  White House Video's post on Vine
     https://vine.co/v/htbdjZAPrAX
  3  BatKid saves transformed 'Gotham City' - CNN.com Video
     http://www.cnn.com/video/data/2.0/video/us/2013/11/15/dnt-simon-batkid-dream-gotham-city-rescue.cnn.html

Table 3: Example of Links from "San Francisco's Batkid" by order of appearance

6.2    Qualitative Analysis of the Storylines

We provide a qualitative description of the volume based narrations obtained for these three events. Before conducting this analysis, we applied a threshold chosen to reduce the number of items composing a storyline to under 100.

6.2.1    Kanye West at Barclays Center

We chose a threshold of 2 in order to obtain a storyline of 43 items. The data appears to be noisy, since many links are related to the artist in general instead of this event in particular. Some examples are links to his music video that apparently came out in the same period when the concert took place (e.g. the first link in Table 1). This problem affects the timeline until around 8PM. At that point, the concert must have effectively started, because between 8PM and 11PM the storyline is only populated by Instagram posts depicting various moments of the performance. This strong visual component is a feature that probably characterizes most performance-related events.

6.2.2    Tech Crunch Disrupt

All items were filtered with a threshold of 5. The 66 links telling the story of the #TCDisrupt conference are particularly effective in describing what happened at different levels of detail. They are often news articles coming with very illustrative images, titles and descriptions. The first day of the conference contains most of the links, because it includes a number of general references to the event itself and to the hype for its beginning. The time references seem to be correct: for example, the first item is about the first application that was pitched, and according to some following resources, this project caused a sexist scandal, requiring
Tech Crunch to officially apologize (see Table 2).
   Further down in the timeline, the links/day ratio shrinks, which increases the storyline quality: it mostly includes specific articles about the presentations held on days two and three, in chronological order. In particular, the last item closes the storyline by declaring the winner of #TCDisrupt (Row 4 in Table 2).

6.2.3    San Francisco's Batkid

This event has a very particular configuration: it contains a huge amount of content (tens of thousands of tweets) shared in a relatively short period of time, mostly as an echo response to the mass media. The resulting narration, when filtered down to a readable length of 99 items (using a threshold of 32), is very general and redundant. It is mostly composed of articles that describe the event as a whole. Instant media (e.g. Instagrams) have been drowned out by the huge number of re-shares achieved by sources such as CNN and NBC. As a result, this timeline is almost free of noise, but it is much less effective at narrating the event (see Table 3).
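The "readable length" filtering used above (e.g. 99 items at a threshold of 32) can be sketched as finding the smallest integer threshold that brings a storyline under a target size. The score values and the `threshold_for_length` helper below are illustrative assumptions, not the system's actual data or API:

```python
def threshold_for_length(scores, max_items=100):
    """Return the smallest integer threshold such that at most
    `max_items` links survive the score-based filtering."""
    th = 1
    while sum(1 for s in scores.values() if s >= th) > max_items:
        th += 1
    return th

# Hypothetical per-link scores (e.g. share counts).
scores = {"a": 40, "b": 12, "c": 7, "d": 3}
print(threshold_for_length(scores, max_items=2))  # → 8
```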

6.3    Selection and Ranking Quality of Different LSPs

We filtered the storylines resulting from the velocity-based and volume-based processors until we obtained fewer than 100 items. We marked the false-positive results and plotted the number of results and the number of true positives obtained while increasing the threshold. The first important difference we noticed between the two LSPs is the quality of the ranking: while the velocity-based LSP tends to concentrate most of the results in the left-most part of the plot (thus in the lower part of the ranking), the volume-based one distributes the results better. This strong difference can be seen in Figure 4.

Figure 4: Number of results and number of true-positive results obtained using the volume-based LSP and the velocity-based LSP for the TCDisrupt event while increasing the filtering thresholds
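The evaluation sweep described above can be sketched as follows; the scores, the true-positive labels, and the `threshold_sweep` helper are illustrative assumptions rather than the paper's actual data:

```python
def threshold_sweep(items, labels, thresholds):
    """For each filtering threshold, count how many items survive
    and how many of the survivors were marked as true positives.
    `items` maps a link to its score; `labels` flags true positives."""
    curve = []
    for th in thresholds:
        kept = [link for link, score in items.items() if score >= th]
        tp = sum(1 for link in kept if labels.get(link, False))
        curve.append((th, len(kept), tp))
    return curve

# Hypothetical scores and manual true-positive annotations.
scores = {"a": 40, "b": 12, "c": 7, "d": 3}
labels = {"a": True, "b": True, "c": False, "d": False}
for th, n, tp in threshold_sweep(scores, labels, [1, 5, 10, 32]):
    print(th, n, tp)
```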
   Figure 4 also indicates that our method is efficient in selecting true-positive results when filtering the output of the volume-based LSP. This was not observed with the other two events, where the performance difference between the two LSPs from this point of view was negligible. A characteristic of the plots made on velocity-based results is that there is usually a very limited number of highly referenced links, set apart by their distance from lower-ranked resources. This can be seen in Figure 4 as well as in Figure 5, where the top-ranked item is a Vine of President Barack Obama congratulating the young superhero. In all our experiments, these few outliers with very high values happened to be true positives.

Figure 5: Number of results and number of true-positive results obtained using the velocity-based LSP for the SFBatkid event while increasing the filtering thresholds
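One simple way to operationalize this observation, assuming plain numeric per-link scores, is to flag links whose score is an outlier with respect to the score distribution. The mean-plus-k-standard-deviations rule and the `pick_highlights` helper below are our illustrative choices, not the paper's actual selection method:

```python
import statistics

def pick_highlights(scores, k=1.5):
    """Flag links whose score lies more than k population standard
    deviations above the mean score, i.e. the isolated, highly
    referenced outliers that stand apart from lower-ranked resources."""
    values = list(scores.values())
    cutoff = statistics.mean(values) + k * statistics.pstdev(values)
    return [link for link, s in scores.items() if s > cutoff]

# Hypothetical scores: one dominant item far above the rest.
scores = {"vine": 120, "article1": 8, "article2": 6, "photo": 5, "post": 4}
print(pick_highlights(scores))  # → ['vine']
```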
   The same characteristic is common to all the top-ranked elements obtained with the velocity-based method, thus making this selection system a good option for choosing elements to recommend as interesting highlights.

6.4    Domains Composition

The analysis of the source domains underlined how different categories of events can have very different "fingerprints". Figure 6 clearly shows how a performance-centered event mostly received data from Instagram and Youtube, while a breaking news event and a technological conference are described by a wider variety of newspapers and magazines operating in the respective fields.

Figure 6: Source domains composition for the SFBatkid and the Kanye events. Volume thresholds have been chosen as the highest that still allowed enough results to produce a meaningful analysis

This information could be used to automatically detect events' categories, or to implement a smarter ranking function that assigns different importance to links coming from different sources when the category of the event is known. We also noticed how some of the biggest generators of social content (i.e. Instagram, Facebook) tend to disappear from the pie chart when raising the volume threshold above one. This underlines the importance of an additional dimension, the volume, in defining a "category fingerprint" in this particular space of the source domains.
   This analysis can also help to automatically identify official sources for a given event. In fact, if the category fingerprint of an event is given, official sources can sometimes be identified as outliers: this can be clearly seen in Figure 7, where the "techcrunch" domain has a remarkably outsized cardinality compared with the other ones.

Figure 7: Source domains of the TCDisrupt event computed on results obtained without any filtering

   Our original goal was to explore the data produced by the system we have implemented. Therefore, while the results we are reporting are constrained by the settings we have chosen, they serve well the purpose of unveiling interesting patterns that should be further investigated by experimenting on larger sets of events.

7    Conclusion and Future Work

In this paper, we presented a system for transforming links shared on social media into narratives of what happened at specific events. These narratives are composed of the set of pages referenced in these links, filtered according to a score. We evaluated the resulting storylines and extracted some information that can help in designing new methods to improve their quality. The collection of resources produced by this method requires analyzing social activity in a new space, where the single tweets shared are aggregated into features of the resources they reference. Not only can this approach produce meaningful real-time narratives (as these resources become very descriptive thanks to the page scraping process), it can also help extract useful insights, for example the composition of the source domains.
   We noticed how different characteristics of an event can affect the efficiency of this system: while we obtained a good narration of a technology conference, the results obtained for a breaking news event were more disappointing. Further research should be conducted on this topic, in order to define better score functions tailored to the characteristics of each category of event (e.g. considering alternatives to volume-based score functions for breaking news events). We also defined a useful feature based on the evolution of the composition of the source domains with increasing volume thresholds. This can help identify the category of an event, and it could also provide the information necessary to automatically identify official sources, when these are particularly active on social channels.

Acknowledgments

The research leading to this paper was partially supported by the European Union's 7th Framework Programme via the projects LinkedTV (GA 287911) and MediaMixer (GA 318101).