=Paper=
{{Paper
|id=Vol-1198/conte
|storemode=property
|title=Extracting Resources that Help Tell Events' Stories
|pdfUrl=https://ceur-ws.org/Vol-1198/conte.pdf
|volume=Vol-1198
|dblpUrl=https://dblp.org/rec/conf/mir/ConteTN14
}}
==Extracting Resources that Help Tell Events' Stories==
Carlo Andrea Conte, Mahaya Inc., New York, USA (carloante@msn.com)
Raphaël Troncy, EURECOM, Biot, France (raphael.troncy@eurecom.fr)
Mor Naaman, Cornell Tech and Mahaya, Inc., New York, USA (mor.naaman@cornell.edu)
Abstract

Social media platforms constitute a valuable source of information regarding real-world happenings. In particular, user generated content on mobile-oriented platforms like Twitter allows for real-time narrations thanks to the instantaneous nature of publishing. A common practice for users is to include in their tweets links pointing to articles, media files and other resources. In this paper, we are interested in how the resources shared in a stream of tweets for an event can be analyzed, and how they can help tell the event story. We describe a system that extracts, resolves, and eventually filters the resources shared in tweets according to two different ranking functions. We are interested in how these two ranking functions perform (with respect to speed and accuracy) for discovering important and relevant resources that will tell the event story. We describe an experiment on a sample set of events where we evaluate those functions. We finally comment on the stories we obtained and provide statistics that give meaningful insights for improving the system.

1 Introduction

For many events and real-world happenings, Twitter, Facebook, Instagram, and other social media platforms provide a continuous stream of user-contributed messages and media. Very often, the messages posted include hyperlinks pointing to content outside the platform where the message was originally posted. The nature of these external resources varies: they may be images, news articles, real-time video streams and many other types of content. Our goal is to identify resources shared via hyperlinks in social media streams that are highly relevant and important to an ongoing event. Our next goal is to extract these relevant resources and assemble them into a storyline, with the objective of producing a rich narration of the event using a very diverse set of media.

In this paper, we describe a system that aims at extracting a timeline of resources from a stream of tweets about an event (we observe that tweets can, and often do, contain links pointing to other social networks such as Instagram, YouTube, etc.). These links will be ranked and filtered in near real-time, in order to identify relevant and valuable information as soon as possible after being shared. In addition, the system extracts descriptive metadata from the referenced pages that can be used to represent the resource in the event timeline in an intelligible way. The collection of items is normalized according to the referenced resources, so that their relevance score results from the aggregation of the social items referencing them. We will eventually analyse and compare the storylines that are produced in order to identify particular characteristics that could help improve future versions of this system. Our contributions include: a) a general architecture for extracting resources from a stream of tweets; b) the development of two scoring methods to rank the importance of those resources in contributing to the storyline of an event; and c) the representation of meaningful statistics from the resulting story.

The rest of this paper is organized as follows. In Section 2, we present some related work. We then describe our generic system architecture for extracting resources in a stream of tweets (Section 3). In Section 4, we present two ranking methods based on volume and velocity. We show a simple interface we have developed to visualize the storylines that are created based on some filtering parameters (Section 5).

Copyright © by the paper's authors. Copying permitted only for private and academic purposes. In: S. Papadopoulos, P. Cesar, D. A. Shamma, A. Kelliher, R. Jain (eds.): Proceedings of the SoMuS ICMR 2014 Workshop, Glasgow, Scotland, 01-04-2014, published at http://ceur-ws.org
We discuss the results of our experiments in Section 6. Finally, we conclude and outline future work in Section 7.

2 Related Work

The system described in this paper is one out of many services that gather social media shared about particular events. These systems generally aim to provide a comprehensive view of an event in order to support the user in following or understanding the event. The primary motivation of our system is to rapidly mine huge volumes of social media content so that a user receives relevant information in a timely manner, for example for news reporting or stock market investing.

There has been a lot of work on the exploitation of user generated content and social media for telling events' stories. In this paper, we focus on collecting and mining resources to compose in real-time a storyline about a specific event. Gathering and analyzing content is the main focus of [1], where the authors provide techniques for automatically identifying posts shared by users on social platforms for a-priori known events, and describe different querying techniques. In [2], the focus is shifted towards identifying the users contributing to the social content available on a particular happening. In [12], the authors select the best tweets for a news article, a related (but somewhat reversed) problem to the one we are presenting here.

Closer to our work, Shamma et al. [9] have looked at summarization and extraction of stories from streams of media, including for example [5, 10] amongst others. The user-friendly representation of these resources is addressed in [15], where videos provided by the users are assembled in new video streams personalized for each viewer, while in [8], both the retrieval and clustering of media items shared on social networks are assembled in the so-called MediaFinder application. In this paper, we address the problem of building a storyline while the event is progressing.

3 Architecture

In this section, we describe the overall architecture of the system that extracts and ranks resources that are linked from Twitter messages about an event. The system we propose needs to run in real-time, in order to build storylines of events as they are unfolding. Our goal is to be able to identify key resources with the smallest delay possible. These requirements impose the adoption of an efficient, flexible concurrency process that can easily be scaled according to the data flow. In Section 6, we report on tests conducted on three events that happened in the past, with the aim of simulating the same near real-time algorithm that would be applied to ongoing events.

For the purpose of this work, we assume our input is a stream of Twitter messages that has been identified as relevant to the event being tracked. In our case, these streams are generated by using a hashtag that is associated with each event, but the system described here is agnostic to how the Twitter content is identified. We assume that a separate process retrieves the Twitter content, and keeps updating a database at regular intervals with raw data from Twitter.

The different computational steps performed in the resource extraction process are:

1. Extracting links from a collection of tweets for an event,

2. Resolving these links to their canonical form in order to identify duplicate resources,

3. Ranking links and applying a first basic filter,

4. Collecting useful metadata from the pages referenced by the links which have been selected,

5. Outputting a timeline of resources that can be further filtered using a simple web interface.

We now detail the general architecture and the major building blocks of this system (Figure 1). The complete processing for each link includes two very severe bottlenecks that require network calls that may take a couple of seconds to complete: the resolution of the url, and the scraping of the referenced pages. In order to overcome this limitation and to build a more efficient system, we split this processing into two systems that are intended to run in parallel.

Figure 1: Architecture of the link processing system

The first issue addressed by the system is the fact that, due to URL shorteners and non-canonical URL formats, links to the same resource (e.g. http://www.example.com/some_article) can take different forms: a bit.ly link, a URL that includes a query string, etc.
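Step 2 of the pipeline, resolving links to a canonical form, can be sketched as follows. This is an illustrative reimplementation in Python, not the paper's code; the list of tracking parameters to strip is an assumption:

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

# Query parameters that identify a sharing campaign, not a resource (assumed list).
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"}

def canonicalize(url: str) -> str:
    """Normalize a resolved URL so that variants of the same resource
    collapse to a single key: lowercase the scheme and host, drop the
    fragment and tracking parameters."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    return urlunparse((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        parts.params,
        urlencode(query),
        "",  # drop the fragment
    ))

# Shortened links (bit.ly, t.co, etc.) would first be expanded by following
# HTTP redirects over the network before being canonicalized; that step is
# the expensive one the paper isolates behind a jobs queue.
```

With this, `http://www.example.com/some_article?utm_source=twitter#top` and `http://WWW.EXAMPLE.COM/some_article` map to the same canonical key.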
The links dispatcher is the first step of this processing chain. It retrieves all the event tweets for a recent temporal window from the content database. After loading the raw data, the links dispatcher extracts every link from the tweets' entities [14], and it queries the links mappings data to check if the URL has already been resolved to a canonical form. If the link was not yet resolved, the system adds the URL to a link resolutor queue. If it was already resolved, the system adds the link, together with the data about the tweet, to the links appearances database. The links resolutor queue is implemented as a jobs queue [3]. The advantage of a jobs queue approach is that many worker instances can be run at the same time, and their number and behavior can be adjusted according to the workload. This allows for great flexibility in the way the system can be scaled depending on the amount of data. All the urls in the queue are resolved by the links resolutor function. This function will first look up the url in the links mappings. If it is not found, the url will be resolved in order to save a new mapping. For all links which have been successfully resolved, there is an entry in the links appearances dataset. The links appearances dataset is therefore a collection of all the links which have been resolved. Every item contains information about the url, the tweet that contains this link, and the event identifier to which the tweet and the url are attached.

The second issue is due to the huge amount of resources that are returned, which requires deciding and filtering what resources will actually be scraped and used to build a storyline. Links appearances are periodically accessed by the decider, which is responsible for ranking the resources and filtering them as needed. The computation is always done on a sliding temporal window whose fixed size is a parameter. The scoring system relies on links score processors for its decision process: different score processors can be implemented for testing different score functions. We detail the links score processors (LSPs) used for our experiments in Section 4. The LSPs we use implement a very basic filtering in order to enrich the event links with features useful for ranking. Additional filtering possibilities are provided within our front-end interface.

The third issue is caused by the transformation of a set of urls into a more human-readable representation. This requires the scraping of additional information from the pages pointed to by the selected links. If a link is selected for publication, the decider queries the pages metadata database to check whether the system has the available metadata for that link. If no metadata is available, the link appearance is added to the Links Metadata Queue, otherwise it is saved together with its metadata as an event link. The Links Metadata Queue is processed by the metadata scraper. This function extracts the domain from the url and selects a particular scraper class accordingly: it loads the referenced page and extracts pieces of information from it (e.g. a title, a description and a representative image). Different scrapers look for different tags in the DOM structure, as different web sites usually expose different information in different ways. In our tests, we use a generic scraper that collects information stored in the Open Graph (https://developers.facebook.com/docs/opengraph/) and Twitter Cards (https://dev.twitter.com/docs/cards) meta-tags. Only the links for which enough metadata is found are saved as event links, and the pages metadata database is updated accordingly.

The final output of the system is stored in the event links database. This dataset holds the final set of resources, together with their score and other attributes inferred by the link score processor and used to rank the resource (e.g. total volume, highest volume in a time-window duration). In addition, the record contains the metadata extracted from the referenced page.

4 Ranking Methods

In this section, we describe two methods for ranking the links that appear in the Twitter stream. These methods have to enable a robust detection of relevant and important links with the smallest delay possible from their publication. The reason why we want to filter out irrelevant resources is the quantity of data that is usually shared: it largely outgrows the number of items that could be used for building a story.

4.1 Volume based LSP

Organizing a collection of every single appearance of every link gives us the possibility to obtain information about the re-share volume reached by a link during an event. As an example, Figure 2 shows the number of appearances of the link pointing to President Obama's Vine for Batkid (an initiative of the Make-A-Wish Foundation for a child affected by leukemia that attracted the attention of social media: http://abcnews.go.com/US/batkids-make-transforming-san-francisco-gotham/story?id=20899254; President Obama posted a Vine response at https://vine.co/v/htbdjZAPrAX).

The volume based LSP assigns to every link a score equal to the cumulative function of its volume throughout the event. A very basic filtering step is implemented by a manually chosen volume threshold that is only meant to exclude the noise of links that did not trigger any interest at all in the audience, and can be considered to be background noise (e.g. those links with only one appearance).
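The cumulative score and elementary volume filter just described can be sketched as follows. This is a minimal, hypothetical reimplementation; names like `LinkStats` do not come from the paper:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LinkStats:
    """All appearances (timestamps, in seconds) of one canonical URL."""
    timestamps: List[float] = field(default_factory=list)

def volume_score(stats: LinkStats) -> int:
    # Score = cumulative number of appearances throughout the event.
    return len(stats.timestamps)

def display_time(stats: LinkStats, threshold: int) -> Optional[float]:
    """Time of the earliest appearance that passes the volume filter
    (e.g. the second appearance of a link for threshold 2); links
    below the threshold are treated as background noise and dropped."""
    if len(stats.timestamps) < threshold:
        return None
    return sorted(stats.timestamps)[threshold - 1]
```

With `threshold=2`, a link that appeared at t=10, t=40 and t=300 gets score 3 and enters the timeline at t=40, while a link seen only once is filtered out.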
Figure 2: Volume of President Obama's Vine video shares for Batkid

Such a threshold should be set according to the general volume of an event: we heuristically tune these thresholds after extracting from the database aggregated statistics regarding the volume of these links.

A link that has been shared at a nearly constant rate during the analyzed time range will be more likely to appear in our timeline than a link that reached a very high volume at a particular point in time. The display time for links ranked by this LSP is chosen to be the time of the earliest appearance that passed the elementary filtering (e.g. the second appearance of a link). Even if the precision of this parameter will suffer from setting very selective volume filters, we are assuming that high-volume events imply a faster growth of the volume of relevant links, thus introducing only smaller delays. We expect this method to produce more robust rankings. However, this LSP will take longer to recognize important links as the event unfolds, and results will not be as reliable while the event is happening as when it is over.

4.2 Velocity based LSP

The velocity based LSP computes links' scores as the appearances volume reached by a link within one decider processing time window. The decider's time windows occur every 20 minutes and are 30 minutes long, thus allowing a 10 minutes overlap between consecutive windows.

This LSP implements the same filter mechanism described in Section 4.1, with the only difference that the threshold is compared with the current time window's volume. The display time is defined as the first time a link appearance survives the filtering in a time window. However, the score is always updated to the highest volume the link has reached within a window.

The velocity based method will recognize an important link as soon as it quickly grows in volume, regardless of what happened throughout the rest of the event, thus representing a better choice for real-time use. On the other hand, this system can also produce noisier results.

5 Front-end Interface

The results of the resource extraction process can be displayed in a simple web interface forming a "storyline": the list of resources is sorted chronologically according to a display time field defined by the LSP. Every resource is represented by its title, description and an image extracted from the referenced page. We expect this arrangement to automatically provide the reader with a narration of what was happening. This has not been evaluated though. A possible evaluation methodology could consist of generating various timelines according to different thresholds. First, users could be prompted to answer questions related to their understanding of the event's chronological narration or the quality of the storyline using a Likert scale. Second, user clicks on timelines could be collected in order to get insights on the number of interesting links included in those summaries. We leave this study as future work.

A filtering functionality allows a user to select a cut-off score for the links to be visualized. When clicking on a link, that link is marked as a false-positive. This marking functionality is used to plot the number of links satisfying a certain volume threshold versus the true-positives satisfying the same requirements. This interface also draws a pie chart representing the source domains of the links displayed (taking into account the filtering parameter). Figure 3 shows a simple example of a storyline. This example uses the information scraped for each resource to create a visual representation of the content, and an "information feeling" for users to decide whether they would want to click through or not.

6 Experiments

In this section, we describe the experiments made to evaluate the efficiency of the storylines built using our method. We aim to compare the results obtained with the two different LSP ranking functions. We have selected three events that feature very diverse characteristics:

Kanye West at Barclays Center. A concert of the "Resurrects Yeezus" tour that took place in a new New York venue. This event has a relatively low volume of 794 links shared on Twitter during 5 hours. We will refer to it as "#kanye" (Table 1).

Tech Crunch Disrupt. The 2013 conference held in San Francisco. This is a longer event that lasted 3 days (80 hours of content gathered) with a total of 1201 links. We will refer to this event as "#TCDisrupt" (Table 2).

San Francisco's Batkid. An event organized by the Make-A-Wish Foundation that transformed San Francisco into Gotham City for one day, letting a child affected by leukemia help Batman fight crime. This is a very high volume event, with 8842 links shared during a timespan of 9 hours. We will refer to it as "#SFBatkid" (Table 3).

Figure 3: Front-end interface easing the process of extracting and visualizing data

For each of those three events, we run the two LSP scoring functions described in Section 4 using the minimum threshold possible: 1 for #kanye and #TCDisrupt, and 4 for #SFBatkid, in order to obtain a number of links that could be handled by the JavaScript interface. For #SFBatkid, we extracted the data regarding lower thresholds by directly querying the database.

# | Extracted Title | URL
1 | Kanye West - Bound 2 (Explicit) | http://www.youtube.com/watch?v=BBAtAM7vtgc
2 | sashahecht's video on Instagram | http://instagram.com/p/g65f0pvrJ_/
3 | angelonfire's photo on Instagram | http://instagram.com/p/g68ZQ1vXXf/

Table 1: Example of links from "Kanye West at Barclays Center" by order of appearance

Media shared on social networks are usually non-permanent and many of the links analyzed by our system are broken. They can either trigger a 404 answer (in this case, they are discarded in the process, without affecting the final storyline presentation), or they can point to items which have been removed but for which there is still a page that may contain some metadata. In the latter case, those items will typically appear in the timeline with no description and/or with meaningless titles (e.g. "No Title"). It is important to mention that the older the event becomes, the more likely this type of compromised resource occurs.

When collecting data to compare the number of true-positives against the number of false-positives, we first filter the displayed links according to a volume threshold which is high enough to select fewer than 100 links (a number that we considered to be optimal for obtaining an enjoyable storyline), and then we proceed with the false-positives marking process. We always mark compromised resources as false-positives. We do not consider duplicates to be necessarily marked as false-positives. During our experiments, we also look at the variety of internet domains generating all the links, and the way they varied according to different filter settings. This approach provides useful information for automatically improving the quality of the storylines, for example by emphasizing the diversity of the sources.
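The threshold search described above can be sketched as follows. This is illustrative code, not the authors' evaluation scripts; `appearance_counts` is a hypothetical mapping from canonical URL to its number of appearances:

```python
def pick_threshold(appearance_counts, max_items=100):
    """Return the smallest volume threshold whose surviving link set
    contains fewer than max_items links (the size considered readable
    for a storyline)."""
    threshold = 1
    while sum(1 for c in appearance_counts.values() if c >= threshold) >= max_items:
        threshold += 1
    return threshold
```

A link survives a threshold when its appearance count meets or exceeds it, so raising the threshold monotonically shrinks the storyline until it fits the display budget.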
# | Extracted Title | URL
1 | TechCrunch Disrupt Kicks Off with "Titstare" App and Fake Masturbation | http://valleywag.gawker.com/techcrunch-disrupt-kicks-off-with-titstare-app-and-fa-1274394925
2 | An Apology From TechCrunch | TechCrunch | http://techcrunch.com/2013/09/08/an-apology-from-techcrunch/
3 | Meet 'Titstare,' the Tech World's Latest 'Joke' from the Minds of Brogrammers - The Wire | http://www.thewire.com/technology/2013/09/titstare-tech-worlds-latest-brogrammer-joke-techcrunch-disrupt/69171/
4 | And The Winner Of TechCrunch Disrupt SF 2013 Is... Layer! | TechCrunch | http://techcrunch.com/2013/09/11/and-the-winner-of-techcrunch-disrupt-sf-2013-is-layer/

Table 2: Example of links from "Tech Crunch Disrupt" by order of appearance

# | Extracted Title | URL
1 | SF Morphs Into Gotham City for "Batkid" Battling Leukemia - NBC Bay Area | http://www.nbcbayarea.com/news/local/SF-Morphs-Into-Gotham-City-for-Batkid-Battling-Leukemia-232054521.html
2 | White House Video's post on Vine | https://vine.co/v/htbdjZAPrAX
3 | BatKid saves transformed 'Gotham City' - CNN.com Video | http://www.cnn.com/video/data/2.0/video/us/2013/11/15/dnt-simon-batkid-dream-gotham-city-rescue.cnn.html

Table 3: Example of links from "San Francisco's Batkid" by order of appearance

6.1 Dataset

The data used in our experiments is provided by Seen (http://seen.co), a service offered by Mahaya Inc. that aims at organizing social media by building automatic summaries of events [13]. An event can be defined on this system by specifying a set of hashtags and a time range. Once these parameters are known, the contents database (Figure 1) is constantly updated with new raw data gathered from different social platforms at regular intervals in time until the end of the event. No particular filtering on language is performed, making Seen a language agnostic service for collecting tweets.

Events' metadata is saved to a collection in a different database (the web databases layer in Figure 1). Initially, a user creating an event in Seen specifies basic metadata for the event being tracked (dates, title, hashtags). Later on, this description is automatically enriched with meaningful information inferred by the service. In our experiment, we only use the metadata specified by the user to retrieve the right subset of raw contents from the contents database. However, we only select events that already exist in the platform, so that we can first acknowledge the general characteristics of each of them in terms of data flow.

6.2 Qualitative Analysis of the Storylines

We provide a qualitative description of the volume based narrations obtained for these three events. Before conducting this analysis, we applied a threshold to reduce the number of items composing a storyline to under 100.

6.2.1 Kanye West at Barclays Center

We chose a threshold of 2 in order to obtain a storyline of 43 items. The data appears to be noisy, since many links are related to the artist in general instead of this event in particular. Some examples are links to his music video that apparently came out in the same period when the concert took place (e.g. the first link in Table 1). This problem affects the timeline until around 8PM. At that point, the concert must have effectively started, because between 8PM and 11PM the storyline is only populated by Instagrams depicting various moments of the performance. This strong visual component is a feature that probably characterizes most performance-related events.

6.2.2 Tech Crunch Disrupt

All items were filtered with a threshold of 5. The 66 links telling the story of the #TCDisrupt conference are particularly effective in describing what happened at different levels of detail. They are often news articles coming with very illustrative images, titles and descriptions. The first day of the conference contains most of the links, because it includes a number of general references to the event itself and to the hype for its beginning. The time references seem to be correct: for example, the first item is about the first application that was pitched, and according to some following resources, this project caused a sexist scandal, requiring Tech Crunch to officially apologize (see Table 2). Further down in the timeline, the links/day ratio shrinks, which increases the storyline quality as it mostly includes specific articles about the presentations held on days two and three in chronological order. In particular, the last item closes the storyline by declaring the winner of #TCDisrupt (Row 4 in Table 2).
6.2.3 San Francisco's Batkid

This event has a very particular configuration: it contains a huge amount of content (tens of thousands of tweets) shared in a relatively short period of time, mostly as an echo response to mass media. The resulting narration, when filtered down to a readable length of 99 items (using a threshold of 32), is very general and redundant. It is mostly composed of articles that describe the event as a whole. Instant media (e.g. Instagrams) has been drowned out by the huge number of re-shares achieved by sources such as CNN and NBC. As a result, this timeline is almost exempt from noise but it is much less effective for narrating the event (see Table 3).

6.3 Selection and Ranking Quality of Different LSPs

We filtered the storylines resulting from the velocity and volume based processors until we obtained fewer than 100 items. We marked the false-positive results and we plotted the number of results and the number of true-positives obtained while increasing the threshold. The first important difference we noticed between the two LSPs is the quality of the ranking: while the velocity based LSP tends to concentrate most of the results in the left-most part of the plot (thus in the lower part of the ranking), the volume based one distributes the results better. This strong difference can be seen in Figure 4.

Figure 4: Number of results and number of true-positive results obtained using the volume-based LSP and the velocity-based LSP for the TCDisrupt event while increasing the filtering thresholds

Figure 4 also indicates that our method is efficient in selecting true-positive results when filtering the output of the volume-based LSP. This was not observed with the other two events, where the performance difference between the two LSPs from this point of view was irrelevant. A characteristic of the plots made on velocity-based results is that there is usually a very limited number of highly referenced links that are underlined by their distance from lower-ranked resources. This can be seen in Figure 4 as well as in Figure 5, where the top ranked item is a Vine of President Barack Obama congratulating the young super-hero; those few outliers with very high values happened to be, in all our experiments, true positives.

Figure 5: Number of results and number of true-positive results obtained using the velocity-based LSP for the SFBatkid event while increasing the filtering thresholds

The same characteristic is common to all the top ranked elements obtained with the velocity-based method, thus making this selection system a good option for choosing elements to recommend as interesting highlights.

6.4 Domains Composition

The analysis of the source domains underlined how different categories of events can have a very different "fingerprint". Figure 6 clearly shows how a performance-centered event mostly received data from Instagram and YouTube, while a breaking news event and a technological conference are described by a wider variety of newspapers and magazines operating in the respective fields. This information could be used to automatically detect events' categories, or to implement a smarter ranking function that assigns different importance to links coming from different sources, when the category of the event is known. We also noticed how some of the biggest generators of social content (i.e. Instagram, Facebook) tend to disappear from the pie chart when raising the volume threshold above one. This underlines the importance of an additional dimension, the volume, in defining a "category fingerprint" in this particular space of the source domains.

Figure 6: Source domains composition for the SFBatkid and the Kanye events. Volume thresholds have been chosen as the highest that still allowed enough results to produce a meaningful analysis

This analysis can also help to automatically identify official sources for a given event. In fact, if the category fingerprint of an event is given, official sources can sometimes be identified as outliers: this can be clearly seen in Figure 7, where the "techcrunch" domain has a remarkably outsized cardinality compared with the other ones.

Figure 7: Source domains of the TCDisrupt event computed on results obtained without any filtering

Our original goal was to explore the data produced by the system we have implemented. Therefore, while the results we are reporting are constrained by the settings we have chosen, they well serve the purpose of unveiling interesting patterns that should be further investigated by experimenting on larger sets of events.

7 Conclusion and Future Work

In this paper, we presented a system for transforming links shared on social media into narratives of what happened at specific events. These narratives are composed of the set of pages referenced in these links, filtered according to a score. We evaluated the storylines obtained and we extracted some information that can help in designing new methods to improve their quality. The collection of resources produced by this method imposes the analysis of social activity in a new space, where the single tweets shared are aggregated into a feature of the resources they reference. Not only can this approach produce real-time meaningful narratives (as these resources become very descriptive thanks to the page scraping process), it can also help extract useful insights, for example the composition of the source domains.

We noticed how different characteristics of an event can affect the efficiency of this system: while we obtained a good narration of a technology conference, the results obtained for a breaking news event were more disappointing. Further research should be conducted on this topic, in order to define better score functions tailored to the characteristics of each category of event (e.g. considering alternatives to volume based score functions for a breaking news event). We also defined a useful feature based on the evolution of the composition of the source domains with increasing volume thresholds. This can help identify the category of an event and it could also provide the information necessary to automatically identify official sources, when these are particularly active on social channels.

Acknowledgments

The research leading to this paper was partially supported by the European Union's 7th Framework Programme via the projects LinkedTV (GA 287911) and MediaMixer (GA 318101).

References

[1] H. Becker, D. Iter, M. Naaman, and L. Gravano. Identifying Content for Planned Events Across Social Media Sites. In 5th International ACM Conference on Web Search and Data Mining, Seattle, Washington, USA, 2012.

[2] M. D. Choudhury, N. Diakopoulos, and M. Naaman. Unfolding the Event Landscape on Twitter: Classification and Exploration of User Categories. In 15th ACM Conference on Computer Supported Cooperative Work, Seattle, Washington, USA, 2012.

[3] V. Driessen. Redis Queues Python Library. http://python-rq.org, 2013.

[4] A. Joly, J. Champ, P. Letessier, N. Hervé, O. Buisson, and M. Viaud. Visual-Based Transmedia Events Detection. In 20th ACM International Conference on Multimedia (MM'12), Nara, Japan, 2012.

[5] Y.-R. Lin, H. Sundaram, M. D. Choudhury, and A. Kelliher. Temporal Patterns in Social Media Streams: Theme Discovery and Evolution Using Joint Analysis of Content and Context. In IEEE International Conference on Multimedia and Expo (ICME'09), pages 1456-1459, Piscataway, NJ, USA, 2009.

[6] V. Milicic, J. L. Redondo García, G. Rizzo, and R. Troncy. Tracking and Analyzing The 2013 Italian Election. In 10th Extended Semantic Web Conference (ESWC'13), Demo Session, Montpellier, France, 2013.

[7] V. Milicic, G. Rizzo, J. L. Redondo García, R. Troncy, and T. Steiner. Live Topic Generation from Event Streams. In 22nd World Wide Web Conference (WWW'13), Demo Session, Rio de Janeiro, Brazil, 2013.

[8] G. Rizzo, T. Steiner, R. Troncy, R. Verborgh, J. L. Redondo García, and R. V. de Walle. What Fresh Media Are You Looking For? Retrieving Media Items from Multiple Social Networks. In International Workshop on Socially-Aware Multimedia (SAM'12), Nara, Japan, 2012.

[9] D. A. Shamma, L. Kennedy, and E. F. Churchill. Conversational Shadows: Describing Live Media Events Using Short Messages. In 4th International Conference on Weblogs and Social Media (ICWSM'10), Washington, USA, 2010.

[10] D. A. Shamma, L. Kennedy, and E. F. Churchill. Peaks and Persistence: Modeling the Shape of Microblog Conversations. In International Conference on Computer Supported Cooperative Work (CSCW'11), pages 355-358, New York, NY, USA, 2011.

[11] T. Steiner. A Meteoroid on Steroids: Ranking Media Items Stemming from Multiple Social Networks. In 22nd World Wide Web Conference (WWW'13), Demo Session, Rio de Janeiro, Brazil, 2013.

[12] T. Štajner, B. Thomee, A.-M. Popescu, M. Pennacchiotti, and A. Jaimes. Automatic Selection of Social Media Responses to News. In 19th International ACM Conference on Knowledge Discovery and Data Mining (KDD'13), Chicago, Illinois, USA, 2013.

[13] R. Tate. The Next Big Thing You Missed: Recreate Live Events with Twitter and Instagram. http://www.wired.com/business/2013/11/seen-is-real-life-instant-replay/, 2013.

[14] Twitter. Twitter's REST API Documentation - Tweets. https://dev.twitter.com/docs/platform-objects/tweets, 2014.

[15] V. Zsombori, M. Frantzis, R. L. Guimaraes, M. F. Ursu, P. Cesar, I. Kegel, R. Craigie, and D. C. A. Bulterman. Automatic Generation of Video Narratives from Shared UGC. In 22nd ACM Conference on Hypertext and Hypermedia, Eindhoven, The Netherlands, 2011.