Qlusty: Quick and Dirty Generation of Event Videos from Written Media Coverage

Alberto Barrón-Cedeño, Giovanni Da San Martino, Yifan Zhang, Ahmed Ali, and Fahim Dalvi
{albarron, gmartino, yzhang, amali, faimaduddin}@hbku.edu.qa
Qatar Computing Research Institute, HBKU Research Complex, Doha, Qatar

Copyright © 2018 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: D. Albakour, D. Corney, J. Gonzalo, M. Martinez, B. Poblete, A. Vlachos (eds.): Proceedings of the NewsIR'18 Workshop at ECIR, Grenoble, France, 26-March-2018, published at http://ceur-ws.org

Abstract

Qlusty automatically generates videos describing the coverage of the same event by different news outlets. Throughout four modules it identifies events, de-duplicates notes, ranks according to coverage, and queries for images to generate an overview video. In this manuscript we present our preliminary models, including quantitative evaluations of the former two and a qualitative analysis of the latter two. The results show the potential for achieving our main aim: contributing to breaking the information bubble, so common in the current news landscape.

1 Introduction

Event reporting in digital media spans from the re-use of contents from news agencies to the direct coverage and shaping of a story. The point of view, aspects, and storytelling of the same and related events can be diverse from medium to medium, depending on their editorial line (e.g., left vs right), target audience (e.g., quality vs tabloid), house style, or mere interest in an event. Qlusty aims at presenting consumers with a short video overview of the facts with contrasting coverage of the same news event by different news outlets, the overall aim being to break the information bubble.

Our video-production architecture consists of four modules: event identification, de-duplication, coverage diversification, and image gathering. Such modules can be translated into IR and NLP problems: document clustering, near-duplicate identification, ranking, and query generation. We present a quantitative analysis of the clustering and de-duplication modules, taking advantage of the METER corpus for text re-use analysis. The clustering strategy we use, DBSCAN, outperforms k-means even though in the former no information about the number of clusters is known in advance: F1 values in the range of 0.71 vs 0.60. A qualitative analysis, carried out on the News Corpus G and SignalMedia 1M corpora, shows the potential of our diversification and query generation modules in the generation of attractive videos.

2 News Corpora

We use three corpora to tune and test our models both quantitatively and qualitatively.

METER Corpus [CGP02]. It includes documents covering events as published by one news agency and nine newspapers from the British press. This characteristic allows for the tuning of models for event identification. Each newspaper document can be wholly-, partially-, or non-derived from a news agency report. Therefore METER is also useful to test de-duplication models. It is relatively small, 1.7k documents, but it is manually annotated by expert journalists. Twenty-five percent of the newspaper notes are wholly-derived from an agency wire. Whether derived or not, in general the notes are modified to stick to editorial focus, style, and readability standards.

News Corpus G [Gas17]. It was originally intended for the development of news recommendation models. G does not contain full articles, only titles. The article's content can be downloaded from the provided URL, pointing to the original publisher. We stick to using only the titles to assess the robustness of our models when dealing with very short texts. G is significantly larger: 423k documents covering 7,231 events. Such events are as provided by Google News and we do not consider them as ground truth.

SignalMedia 1M Corpus [CAMM16]. This corpus is significantly more diverse. Besides including documents from major news agencies and papers, 1M contains material from magazines and blog entries, among others. We discard blog entries and focus on items identified as news. This dataset is particularly challenging because it is only lightly curated; it may contain noisy text (e.g., with HTML tags) and even content-less entries.¹ Due to the regular querying of articles, verbatim duplicates also exist in the collection (i.e., the same article may exist a number of times with a different unique id). As stressed in [CAMM16], this real-life dataset prevents the over-estimation of performance usually obtained on clean data. 1M does not include any event-related information.

¹ For instance, the content of entry f4edd16d-df59-41f9-ae01-d4dee076b0d5 is "Your access to this site has been temporarily blocked. This block will be automatically removed shortly".

3 Architecture and Models

Our architecture consists of four modules plus the video-generation stage, which are described next.

3.1 Clustering for Event Identification

The input to this module is a batch of news articles from a fixed time period. The output is the articles organised within a non-specified number of events. Traditionally, for this task the input data is treated as a continuous stream of documents. Hierarchical [SCK+06] and partitional clustering [AS12, AY10] are popular approaches. Still, we use DBSCAN [EKSX96]. The main reasons are that, at this stage, we are not interested in news streams but in temporal batches and, perhaps more importantly, that DBSCAN does not require information related to the expected number of events. As a result, no knowledge is necessary about the distribution of the input documents.

DBSCAN does require setting two hyper-parameters. The first one is the maximum distance under which two elements can be considered as part of the same cluster. The second one is the minimum number of elements in a cluster. Items can belong to no neighbourhood at all, and be considered as noisy entries. We fix the minimum size of a cluster to 2 news articles, thus considering singletons as noise. As for the maximum distance, we use the METER corpus to set it empirically. The experiments are described in Section 4.1.

We opt for doc2vec embeddings [LM14] for document representation, pre-trained on articles from the Associated Press [LB16]. The pair-wise distances are computed as 1 minus the cosine similarity. The use of doc2vec for representing documents is appealing due to its semantic properties.

3.2 Near-Duplicate Detection for De-Duplication

The input of this module is the articles belonging to a single event, as identified by the clustering module. The output is such articles after discarding near-duplicates. We opt for standard text re-use identification approaches based on word n-gram comparison [LBM04]. We represent the texts as bags of word n-grams after standard pre-processing: casefolding, tokenisation, and stopword removal. Tokens shorter than 2 characters are discarded as well. We use the Jaccard coefficient [Jac01] to compute the similarities.

The value for n, as well as the threshold above which two documents are considered near-duplicates, are set empirically, once again on the METER corpus. The experiments are discussed in Section 4.2.

3.3 Ranking for Diversification

The input of this module is the de-duplicated articles from a specific event, as filtered by the de-duplication module. The output is a ranked list of the documents. One of the premises of our system is breaking a user's bubble. We aim at presenting a news event including points of view as diverse as possible. The idea is that those articles which are most dissimilar to the rest covering the event are those which contain the most diverse contents.

In a k-means-like model, finding such dissimilarity would be as straightforward as computing the similarity of each article against the centroid. Nevertheless, no centroid exists in a DBSCAN-generated cluster. Therefore, our ranking function consists of computing the average similarity between an article and the rest of the articles in the cluster:

    score(d) = (−1/|c|) Σ_{d′ ∈ c, d′ ≠ d} sim(d, d′)    (1)

where d (d′) is a document in cluster c and |c| represents the size of c. Once again, we use cosine similarity on doc2vec representations. The articles will be presented to the user according to this ranking, from top to bottom. We use −1 because we want the most different articles to appear first. The opening article is an exception: it is the last one according to the scoring function (i.e., the most similar to the rest of the cluster members). The reason is that we consider this article the best one to open the video and give a good overview of the event. This module requires no tuning. Section 4.3 shows a qualitative analysis.

3.4 Query Generation for Image Gathering

Finally, we query a search engine to gather illustrations for each of the articles. The input to this module is the ranked list of texts from the diversification module and the output is one query per article.

We explore three alternatives to generate the query. Model q1 uses the news title. Models q2 and q3 follow a common mechanism. Firstly, all sub-chunks for all texts are extracted and tf-idf-ranked. For each document in the list, the chunk with the highest score is selected as the query. Once a chunk has been used, it is discarded from the list of candidates to avoid grabbing duplicate images. For model q2 we use word 2-grams, whereas for q3 we use named entities (NE). Regardless of the contents of the first article in the ranking, its query consists of the top NE.

The so-generated chunks are queried to a search engine, one at a time, and the top-5 pictures grabbed for integration in the video. In this version we use Google's search engine.

4 Experiments and Results

4.1 Clustering Tuning

Our first experiment intends to tune our event identification model (cf. Section 3.1). Our objective is identifying the best DBSCAN neighbourhood maximum distance (eps) for a random number of events and their associated articles. We are interested in two factors: high quality and stability for different document volumes.

First we formalise the problem and describe the performance measures. Let D be a collection of documents covering a set of events E. We refer to the number of events in E as |E|. For each d ∈ D, let e(d) be the set of documents belonging to the same event as d. Analogously, let c(d) be the set of documents that the model assigns to the same cluster as d. For any E′ ⊆ E, let D|E′ be the subset of D whose documents' events are in E′. We use BCubed-F1 as clustering performance measure [AGAV09]. We define δ(s1, s2) = 1 if the sets s1 and s2 are identical, 0 otherwise. Let

    TP(d) = Σ_{d′} δ(e(d), e(d′)) · δ(c(d), c(d′))    (2)

be a function counting the number of documents belonging to the same event as d which have been put together in the same cluster by DBSCAN. BCubed-F1 is the harmonic mean between BCubed precision P and BCubed recall R:

    P = (1/|D|) Σ_{d ∈ D} TP(d)/|c(d)|,    R = (1/|D|) Σ_{d ∈ D} TP(d)/|e(d)|    (3)

We estimate parameter eps as follows. For each i, 10 ≤ i ≤ |E|, we randomly select a subset E′ ⊆ E of |E′| = i events and run our clustering algorithm on D|E′. We perform 10 random repetitions to assess the stability of the outcome on increasing numbers of gold clusters. Figure 1 shows, for each eps value, the BCubed-F1 measure averaged over the 10 runs, together with the standard deviation. We include the performance of k-means to give perspective to the results. In principle, k-means has the advantage of including the expected number of clusters as a parameter, and we always assign the right number.

[Figure 1: Evolution of BCubed F1-measure when dealing with increasing numbers of events and documents. Average outcome over 10 random experiments; x-axis: gold clusters (10 to 250); y-axis: F-measure (0 to 1); curves: k-means and DBSCAN with eps = 0.40, 0.45, 0.50, 0.55, 0.60.]

Values of eps ≥ 0.55 yield the best results when dealing with small numbers of clusters, but drop drastically when facing larger numbers of events. Lower values yield a relatively stable performance, regardless of the number of events in the dataset. An analysis focused on BCubed precision and recall values (not reported) indicates that the drop observed for eps ≥ 0.55 is pulled by precision; the clusters tend to be larger than they should, including noisier entries. As a compromise between stability and purity, we select eps = 0.55.

4.2 Near-Duplicate Identification Tuning

Our second experiment intends to tune the model for near-duplicate identification (cf. Section 3.2). The purpose is tuning two parameters: the value of n (the word n-gram level) and the similarity threshold above which documents are considered near-duplicates and hence one can be discarded from the final output. In the sibling task of text re-use detection, setting n to {2, 3} [BCR09] and even 5 [KBK09] is considered standard. As we are interested in discarding whole documents to reduce redundancy to a minimum, we explore low values to allow for a more flexible comparison: n = {1, 2}.

Once again we use the METER corpus and its text re-use annotation. We adopt two settings. In the simple setting, we consider a news agency-newspaper pair of documents as positive iff the latter is labelled as wholly-derived and both cover the same event. In the complex setting we consider an additional triangular relationship: a newspaper-newspaper pair is considered as positive iff both are labelled as wholly-derived from the same news agency article. We restrict our similarity comparison to all those articles published on the same day, resulting in 38k and 48k comparisons in the simple and complex settings, respectively. As a consequence of this volume of comparisons, a high imbalance exists in the dataset: most pairs are negative instances. We evaluate this experiment on the basis of the F1-measure for binary classification: F1 = 2·tp / (2·tp + fp + fn), where tp, fp, and fn stand for the number of true positives, false positives, and false negatives.

Table 1 shows the results. Firstly, a more flexible comparison based on word 1-grams results in the best performance (this may imply documents which are not complete duplicates are discarded; we prefer this over including very similar notes). In both simple and complex settings the best F1 is obtained with τ = 0.25 and we select this threshold. This supports the concept of co-derivatives and reflects that the threshold is valid for both news agency-newspaper and newspaper-newspaper comparisons.

Table 1: Evolution of F1 for different similarity thresholds τ in the near-duplicate identification task.

            simple            complex
    τ       n=1     n=2       n=1     n=2
    0.10    0.397   0.645     0.422   0.679
    0.15    0.581   0.497     0.621   0.518
    0.20    0.720   0.373     0.758   0.381
    0.25    0.752   0.274     0.787   0.268
    0.30    0.723   0.205     0.755   0.192
    0.35    0.657   0.147     0.686   0.133
    0.40    0.575   0.107     0.599   0.096
    0.45    0.496   0.072     0.511   0.064
    0.50    0.422   0.049     0.429   0.043

4.3 Articles Ranking

Now we make a qualitative analysis. Table 2 shows the titles of the articles of three events ranked on the basis of our diversification model (cf. Section 3.3). Instance A tells the story of Libyan rebels and their impact on oil. The top article does summarise the event, referring to a rebel attack on naval forces. As expected, the topic of article 2 is not as close: it is about the plans to sink a ship transporting illegal oil, currently besieged by the Libyan Navy. Whereas the third article still refers to oil, rebel attacks, and even to the chances for a conflict, the latter two refer to the dismissal of the Libyan PM by the parliament. That is, we are indeed looking at a story from different angles.

Something similar occurs with instances B and C. Instance B is about the listing of a mansion. After an introductory first article, further details appear, such as price or location. Instance C tells the story of the decease of a former girlfriend of actor Jim Carrey. It is worth noting article 2, about a different event. Our event detection module got confused because this article is about the girlfriend of an actor. Whether this is relevant for a user is arguable.

4.4 Query Generation

Table 2 also shows the queries as generated by the three variations of our generator (q1, q2, and q3) plus a fourth variation: q4 = q2 + q3 (cf. Section 3.4). The NE-based q3 seems far from perfect when dealing with the titles of instances A and B. The cause is that the camel-casing is confusing the NER. The simple n-gram-based approach seems to produce sensible queries. When having the full article at hand, the NE-based model works slightly better.

Figure 2 shows the photograms of videos generated with these four kinds of queries for Table 2 instance B. Each subfigure refers to one video and each row to one news article, which can include up to five images. The whole titles from strategy q1 provide a good visual overview of the event: the listed house and its owners. Still, due to contents overlapping, some images appear more than once: coordinates {1,4; 2,2}, {1,5; 5,3; 7,2}, and {5,1; 7,1}. The chunk-level strategies result in less repetition. Strategy q3, based on NEs, is more varied: focusing on football player Tom Brady for the first two titles, moving towards the main event (the listing of a house for sale in Los Angeles), and finally the second person involved: top model Gisele Bündchen. Something similar occurs with q2's 2-grams: non-duplicated photograms centred on the couple and the listed house. Still, q2 has a problem: "The Brady report" is an Arizona radio show and the resulting photograms refer to it. Even with this mistake in mind, it seems like q2 provides a good balance between relevance and diversity. Combining NEs and 2-grams into q4 reduces variation (photograms {5,3; 7,3} and {5,5; 7,1} are the same).

Table 2: Three examples of diversification-ranked news and the generated queries: q1, q2, and q3.

A  News Corpus G, 11 March 2014
      News title (q1)                                                               | 2-grams (q2)      | NE (q3)
    1 Rebels Fired at Libyan Naval Forces: Minister                                 | Libyan            | Libyan
    2 Brincat gives no details as Libya tanker standoff continues                   | brincat gives     | Brincat
    3 Rebel group manoeuvres over Libya's oil could lead to renewed civil conflict  | rebel group       | Rebel
    4 URGENT - Libya PM                                                             | Libya PM          | URGENT
    5 Libyan parliament dismisses PM                                                | libyan parliament | –

B  News Corpus G, 20 April 2014
    1 Gisele Bundchen and Tom Brady sell home for £30million                        | Tom Brady         | Tom Brady
    2 Tom Brady wants $50m - for his mega mansion in LA                             | brady wants       | Brady
    3 For Sale: Gisele's $50 Million LA Chateau                                     | sale gisele       | Sale
    4 Tom and Gisele SELLING Mega-Estate in LA!!!!!                                 | gisele selling    | Tom
    5 Tom and Gisele's LA home is listed at $50m                                    | gisele home       | LA

C  SignalMedia 1M, 29 September 2015
    1 Cathriona White, ex-girlfriend of actor Jim Carrey, committed suicide on Monday. | Cathriona White | Cathriona White
    2 Twilight star Robert Pattinson says that those who send negative comments [. . . ] | twilight star | Twilight
    3 A former girlfriend of actor Jim Carrey has died of an apparent suicide, the [. . . ] | girlfriend actor | Ed Winter
    4 Jim Carrey and Cathriona White are spotted hand in hand leaving their [. . . ]  | posted photo      | N.Y.
    5 Jim Carrey's Irish ex-girlfriend Cathriona White has been found dead.         | 2015 /raw         | Cathriona Cappawhite

[Figure 2: Photograms of videos as generated with queries q1 to q4. One row corresponds to one article, for which up to 5 images are integrated. The top and left numbers are the photogram coordinates, for easy location.]

5 Final Remarks and Ongoing Work

We presented our first efforts on breaking the news bubble. We integrated a system for the automatic generation of videos consisting of four modules: event identification, de-duplication, diversification, and image gathering. The outcome comes in the form of short illustrated videos aiming at providing a user with different points of view in the coverage of the same event.

Departing from this architecture, we aim at using more sophisticated text representation and event identification technology. We are particularly interested in storyline generation [MSA+15, VCK15].

References

[AGAV09] Enrique Amigó, Julio Gonzalo, Javier Artiles, and Felisa Verdejo. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval, 12(4):1–32, 2009.

[AS12] Joel Azzopardi and Christopher Staff. Incremental Clustering of News Reports. Algorithms, 5(4):364–378, 2012.

[AY10] Charu C. Aggarwal and Philip S. Yu. On Clustering Massive Text and Categorical Data Streams. Knowledge and Information Systems, pages 171–196, 2010.

[BCR09] Alberto Barrón-Cedeño and Paolo Rosso. On Automatic Plagiarism Detection based on n-grams Comparison. In Advances in Information Retrieval. Proceedings of the 31st European Conference on IR Research, LNCS 5478, pages 696–700. Springer-Verlag, 2009.

[CAMM16] David Corney, Dyaa Albakour, Miguel Martinez, and Samir Moussa. What do a Million News Articles Look like? In NewsIR 2016: Recent Trends in News Information Retrieval, pages 42–47, Padua, Italy, 2016.

[CGP02] Paul Clough, Robert Gaizauskas, and Scott Piao. Building and Annotating a Corpus for the Study of Journalistic Text Reuse. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), volume V, pages 1678–1691, Las Palmas, Spain, 2002. ELRA.

[EKSX96] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pages 226–231. AAAI Press, 1996.

[Gas17] Fabio Gasparetti. Modeling User Interests from Web Browsing Activities. Data Mining and Knowledge Discovery, 31(2):502–547, 2017.

[Jac01] Paul Jaccard. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:547–579, 1901.

[KBK09] Jan Kasprzak, Michal Brandejs, and Miroslav Křipač. Finding Plagiarism by Evaluating Document Similarities. CEUR-WS.org, volume 502, pages 24–28, San Sebastián, Spain, 2009.

[LB16] Jey Han Lau and Timothy Baldwin. An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 78–86, 2016.

[LBM04] Caroline Lyon, Ruth Barret, and James Malcolm. A Theoretical Basis to the Automated Detection of Copying Between Texts, and its Practical Implementation in the Ferret Plagiarism and Collusion Detector. In Plagiarism: Prevention, Practice and Policies Conference, Newcastle upon Tyne, UK, 2004.

[LM14] Quoc Le and Tomas Mikolov. Distributed Representations of Sentences and Documents. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014.

[MSA+15] Anne-Lyse Minard, Manuela Speranza, Eneko Agirre, Itziar Aldabe, Marieke van Erp, Bernardo Magnini, German Rigau, and Ruben Urizar. SemEval-2015 Task 4: TimeLine: Cross-document event ordering. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 778–786. ACL, 2015.

[SCK+06] Nachiketa Sahoo, Jamie Callan, Ramayya Krishnan, George Duncan, and Rema Padman. Incremental hierarchical clustering of text documents. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management, CIKM '06, pages 357–366, New York, NY, 2006. ACM.

[VCK15] Piek Vossen, Tommaso Caselli, and Yiota Kontzopoulou. Storylines for structuring massive streams of news. In Proceedings of the First Workshop on Computing News Storylines, pages 40–49, Beijing, China, 2015. ACL.
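As an illustration of the event identification module (Section 3.1), the following minimal Python sketch clusters a small batch of document vectors with DBSCAN under cosine distance, with eps = 0.55 and a minimum neighbourhood size of 2 as selected in Section 4.1. This is not the authors' implementation: the use of scikit-learn and the toy 2-d vectors (standing in for doc2vec embeddings) are our assumptions.

```python
# Sketch of Section 3.1: DBSCAN-based event identification over document
# vectors, using cosine distance, eps=0.55, and min_samples=2 so that
# singleton documents are labelled as noise (-1).
import numpy as np
from sklearn.cluster import DBSCAN

def identify_events(doc_vectors, eps=0.55, min_samples=2):
    """Cluster a batch of document vectors into events.

    Returns one cluster label per document; -1 marks noise (singletons).
    """
    clusterer = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine")
    return clusterer.fit_predict(np.asarray(doc_vectors))

if __name__ == "__main__":
    # Toy vectors: two tight groups plus one unrelated singleton.
    docs = [[1.0, 0.0], [0.99, 0.05],   # "event A"
            [0.0, 1.0], [0.05, 0.99],   # "event B"
            [-1.0, -1.0]]               # singleton -> noise
    print(identify_events(docs))
```

Because DBSCAN, unlike k-means, receives no cluster count, the number of detected events is driven entirely by eps and the density of the batch, which is exactly the property Section 3.1 relies on.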
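The de-duplication module of Section 3.2 can likewise be sketched in a few lines: documents become bags of word 1-grams after casefolding, stopword removal, and dropping tokens shorter than 2 characters, and a pair is a near-duplicate when its Jaccard coefficient exceeds τ = 0.25 (the values tuned in Section 4.2). The tiny stopword list and the example titles are invented for illustration.

```python
# Sketch of Section 3.2: near-duplicate removal via Jaccard similarity
# over bags of word n-grams (n=1, tau=0.25 as tuned on METER).
import re

STOPWORDS = {"the", "a", "an", "of", "in", "on", "at", "and", "to"}  # stand-in list

def ngram_bag(text, n=1):
    tokens = [t for t in re.findall(r"\w+", text.lower())
              if len(t) >= 2 and t not in STOPWORDS]
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def deduplicate(articles, n=1, tau=0.25):
    """Keep an article only if it is not a near-duplicate of a kept one."""
    kept, bags = [], []
    for text in articles:
        bag = ngram_bag(text, n)
        if all(jaccard(bag, other) <= tau for other in bags):
            kept.append(text)
            bags.append(bag)
    return kept

if __name__ == "__main__":
    event = ["Rebels fired at Libyan naval forces",
             "Libyan rebels fired at naval forces on Monday",  # near-duplicate
             "Libyan parliament dismisses PM"]
    print(deduplicate(event))
```

With n = 1 the comparison is deliberately loose, matching the paper's preference for discarding very similar notes over keeping them.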
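The diversification ranking of Section 3.3 (Eq. 1) admits a direct sketch: each article's score is the negated average similarity to the other cluster members (divided by |c|, as in the paper), articles are listed most-dissimilar first, and the article most similar to the rest is moved to the front as the opener. The article ids and toy vectors below are invented; this is an illustration of Eq. 1, not the authors' code.

```python
# Sketch of Section 3.3: diversification ranking per Eq. 1,
# score(d) = -(1/|c|) * sum of sim(d, d') over the other cluster members,
# with the most representative article promoted to open the video.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_for_diversity(cluster):
    """cluster: list of (article_id, vector) pairs.

    Returns article ids: the opener (most similar to the rest) first,
    then the remaining articles from most to least dissimilar."""
    def score(i):
        others = [v for j, (_, v) in enumerate(cluster) if j != i]
        return -sum(cosine(cluster[i][1], v) for v in others) / len(cluster)

    ranked = sorted(range(len(cluster)), key=score, reverse=True)
    opener = ranked.pop()  # lowest score = most similar to the rest
    return [cluster[i][0] for i in [opener] + ranked]
```

For example, in a three-article cluster where articles "a" and "b" are near-identical and "c" is an outlier, "b" (closest to everything) opens, followed by the outlier "c", then "a".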
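Finally, the BCubed measures used for clustering evaluation in Section 4.1 (Eqs. 2 and 3) can be computed directly from gold event labels and predicted cluster labels: TP(d) counts the documents sharing both d's event and d's cluster, and P and R average TP(d)/|c(d)| and TP(d)/|e(d)| over all documents. The minimal sketch below assumes label dictionaries; it is an illustration of the formulas, not the evaluation code used in the paper.

```python
# Sketch of Eqs. 2-3: BCubed precision, recall, and F1 for a clustering,
# given gold event labels and predicted cluster labels per document.
def bcubed(gold, pred):
    """gold, pred: dicts mapping doc id -> label. Returns (P, R, F1)."""
    docs = list(gold)
    P = R = 0.0
    for d in docs:
        e_d = {x for x in docs if gold[x] == gold[d]}  # e(d): same gold event
        c_d = {x for x in docs if pred[x] == pred[d]}  # c(d): same predicted cluster
        tp = len(e_d & c_d)                            # TP(d), Eq. 2
        P += tp / len(c_d)
        R += tp / len(e_d)
    P, R = P / len(docs), R / len(docs)
    return P, R, 2 * P * R / (P + R)
```

A perfect clustering yields P = R = F1 = 1, while merging everything into one cluster keeps R at 1 but drags P down, which mirrors the precision-driven drop reported for large eps values in Section 4.1.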