Qlusty: Quick and Dirty Generation of Event Videos from Written Media Coverage

Alberto Barrón-Cedeño, Giovanni Da San Martino, Yifan Zhang, Ahmed Ali, and Fahim Dalvi
{albarron, gmartino, yzhang, amali, faimaduddin}@hbku.edu.qa
Qatar Computing Research Institute, HBKU Research Complex, Doha, Qatar

Copyright © 2018 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: D. Albakour, D. Corney, J. Gonzalo, M. Martinez, B. Poblete, A. Vlachos (eds.): Proceedings of the NewsIR'18 Workshop at ECIR, Grenoble, France, 26-March-2018, published at http://ceur-ws.org

Abstract

Qlusty automatically generates videos describing the coverage of the same event by different news outlets. Throughout four modules it identifies events, de-duplicates notes, ranks according to coverage, and queries for images to generate an overview video. In this manuscript we present our preliminary models, including quantitative evaluations of the former two and a qualitative analysis of the latter two. The results show the potential for achieving our main aim: contributing to breaking the information bubble, so common in the current news landscape.

1 Introduction

Event reporting in digital media spans from the re-use of contents from news agencies to the direct coverage and shaping of a story. The point of view, aspects, and storytelling of the same and related events can be diverse from medium to medium, depending on their editorial line (e.g., left vs right), target audience (e.g., quality vs tabloid), house style, or mere interest in an event. Qlusty aims at presenting consumers with a short video overview of the facts with contrasting coverage of the same news event by different news outlets, the overall aim being to break the information bubble.

Our video-production architecture consists of four modules: event identification, de-duplication, coverage diversification, and image gathering. Such modules can be translated into IR and NLP problems: document clustering, near-duplicate identification, ranking, and query generation. We present a quantitative analysis of the clustering and de-duplication modules, taking advantage of the METER corpus for text re-use analysis. The clustering strategy we use, DBSCAN, outperforms k-means even though in the former no information about the number of clusters is known in advance: F1 values in the range of 0.71 vs 0.60. A qualitative analysis, carried out on the News Corpus G and SignalMedia 1M corpora, shows the potential of our diversification and query generation modules in the generation of attractive videos.

2 News Corpora

We use three corpora to tune and test our models both quantitatively and qualitatively.

METER Corpus [CGP02]. It includes documents covering events as published by one news agency and nine newspapers from the British press. This characteristic allows for the tuning of models for event identification. Each newspaper document can be wholly-, partially-, or non-derived from a news agency report. Therefore METER is also useful to test de-duplication models. It is relatively small, 1.7k documents, but it is manually annotated by expert journalists. Twenty-five percent of the newspaper notes are wholly-derived from an agency wire. Whether derived or not, in general the notes are modified to stick to editorial focus, style, and readability standards.

News Corpus G [Gas17]. It was originally intended for the development of news recommendation models. G does not contain full articles, only titles. The article's content can be downloaded from the provided URL, pointing to the original publisher. We stick to using only the titles to assess the robustness of our models when dealing with very short texts. G is significantly larger: 423k documents covering 7,231 events. Such events are as provided by Google News and we do not consider them as ground truth.

SignalMedia 1M Corpus [CAMM16]. This corpus is significantly more diverse. Besides including documents from major news agencies and papers, 1M contains material from magazines and blog entries, among others. We discard blog entries and focus on items identified as news. This dataset is particularly challenging because it is only lightly curated; it may contain noisy text (e.g., with HTML tags) and even content-less entries.¹ Due to the regular querying of articles, verbatim duplicates also exist in the collection (i.e., the same article may exist a number of times with a different unique id). As stressed in [CAMM16], this real-life dataset prevents the over-estimation of performance usually obtained on clean data. 1M does not include any event-related information.

¹ For instance, the content of entry f4edd16d-df59-41f9-ae01-d4dee076b0d5 is "Your access to this site has been temporarily blocked. This block will be automatically removed shortly".

3 Architecture and Models

Our architecture consists of four modules plus the video-generation stage, which are described next.

3.1 Clustering for Event Identification

The input to this module is a batch of news articles from a fixed time period. The output is the articles organised within a non-specified number of events. Traditionally, for this task the input data is treated as a continuous stream of documents. Hierarchical [SCK+06] and partitional clustering [AS12, AY10] are popular approaches. Still, we use DBSCAN [EKSX96]. The main reasons are that, at this stage, we are not interested in news streams but in temporal batches and, perhaps more importantly, that DBSCAN does not require information related to the expected number of events. As a result, no knowledge is necessary about the distribution of the input documents.

DBSCAN does require setting two hyper-parameters. The first one is the maximum distance under which two elements can be considered as part of the same cluster. The second one is the minimum number of elements in a cluster. Items can belong to no neighbourhood at all, and be considered as noisy entries. We fix the minimum size of a cluster to 2 news articles, thus considering singletons as noise. As for the maximum distance, we use the METER corpus to set it empirically. The experiments are described in Section 4.1.

We opt for doc2vec embeddings [LM14] for document representation, pre-trained on articles from the Associated Press [LB16]. The pair-wise distances are computed as 1 minus the cosine similarity. The use of doc2vec for representing documents is appealing due to its semantic properties.

3.2 Near-Duplicate Detection for De-Duplication

The input of this module is the articles belonging to a single event, as identified by the clustering module. The output is such articles after discarding near-duplicates. We opt for standard text re-use identification approaches based on word n-gram comparison [LBM04]. We represent the texts as bags of word n-grams after standard pre-processing: casefolding, tokenisation, and stopword removal. Tokens shorter than 2 characters are discarded as well. We use the Jaccard coefficient [Jac01] to compute the similarities.

The value for n, as well as the threshold above which two documents are considered near-duplicates, are set empirically, once again on the METER corpus. The experiments are discussed in Section 4.2.

3.3 Ranking for Diversification

The input of this module is the de-duplicated articles from a specific event, as filtered by the de-duplication module. The output is a ranked list of the documents. One of the premises of our system is breaking a user's bubble. We aim at presenting a news event including points of view as diverse as possible. The idea is that those articles which are most dissimilar to the rest covering the event are those which contain the most diverse contents.

In a k-means-like model, finding such dissimilarity would be as straightforward as computing the similarity of each article against the centroid. Nevertheless, no centroid exists in a DBSCAN-generated cluster. Therefore, our ranking function consists of computing the average similarity between an article and the rest of the articles in the cluster:

    score(d) = (−1/|c|) Σ_{d′ ∈ c, d′ ≠ d} sim(d, d′)    (1)

where d (d′) is a document in cluster c and |c| represents the size of c. Once again, we use cosine similarity on doc2vec representations. The articles will be presented to the user according to this ranking, from top to bottom. We use −1 because we want the most different articles to appear first. The opening article is an exception: it is the last one according to the scoring function (i.e., the most similar to the rest of the cluster members). The reason is that we consider this article the best one to open the video and give a good overview of the event. This module requires no tuning. Section 4.3 shows a qualitative analysis.

3.4 Query Generation for Image Gathering

Finally, we query a search engine to gather illustrations for each of the articles. The input to this module is the ranked list of texts from the diversification module and the output is one query per article.

We explore three alternatives to generate the query. Model q1 uses the news title. Models q2 and q3 follow a common mechanism. Firstly, all sub-chunks for all texts are extracted and tf-idf-ranked. For each document in the list, the chunk with the highest score is selected as the query. Once a chunk has been used, it is discarded from the list of candidates to avoid grabbing duplicate images. For model q2 we use word 2-grams, whereas for q3 we use named entities (NE). Regardless of the contents of the first article in the ranking, its query consists of the top NE.

The so-generated chunks are queried to a search engine, one at a time, and the top-5 pictures grabbed for integration in the video. In this version we use Google's search engine.

4 Experiments and Results

4.1 Clustering Tuning

Our first experiment intends to tune our event identification model (cf. Section 3.1). Our objective is identifying the best DBSCAN neighbourhood maximum distance (eps) for a random number of events and their associated articles. We are interested in two factors: high quality and stability for different document volumes.

First we formalise the problem and describe the performance measures. Let D be a collection of documents covering a set of events E. We refer to the number of events in E as |E|. For each d ∈ D, let e(d) be the set of documents belonging to the same event as d. Analogously, let c(d) be the set of documents that the model assigns to the same cluster as d. For any E′ ⊆ E, let D|E′ be the subset of D whose documents' events are in E′. We use BCubed-F1 as clustering performance measure [AGAV09]. We define δ(s1, s2) = 1 if the sets s1 and s2 are identical, 0 otherwise. Let

    TP(d) = Σ_{d′} δ(e(d), e(d′)) · δ(c(d), c(d′))    (2)

be a function counting the number of documents belonging to the same event as d which have been put together in the same cluster by DBSCAN. BCubed-F1 is the harmonic mean between BCubed precision P and BCubed recall R:

    P = (1/|D|) Σ_{d ∈ D} TP(d)/|c(d)|,    R = (1/|D|) Σ_{d ∈ D} TP(d)/|e(d)|    (3)

We estimate parameter eps as follows. For each i, 10 ≤ i ≤ |E|, we randomly select a subset E′ ⊆ E of |E′| = i events and run our clustering algorithm on D|E′. We perform 10 random repetitions to assess the stability of the outcome on increasing numbers of gold clusters. Figure 1 shows, for each eps value, the BCubed-F1 measure averaged over the 10 runs, together with the standard deviation. We include the performance of k-means to give perspective to the results. In principle, k-means has the advantage of including the expected number of clusters as a parameter, and we always assign the right number.

[Figure 1: Evolution of BCubed F1-measure when dealing with increasing numbers of events and documents. Average outcome over 10 random experiments; x-axis: gold clusters (10 to 250); y-axis: F-measure (0 to 1); curves: k-means and DBSCAN with eps = 0.40, 0.45, 0.50, 0.55, 0.60.]

Values of eps ≥ 0.55 yield the best results when dealing with small numbers of clusters, but drop drastically when facing larger numbers of events. Lower values yield a relatively stable performance, regardless of the number of events in the dataset. An analysis focused on BCubed precision and recall values (not reported) indicates that the drop observed for eps ≥ 0.55 is pulled by precision; the clusters tend to be larger than they should, including noisier entries. As a compromise between stability and purity, we select eps = 0.55.

4.2 Near-Duplicate Identification Tuning

Our second experiment intends to tune the model for near-duplicate identification (cf. Section 3.2). The purpose is tuning two parameters: the value of n (the word n-gram level) and the similarity threshold above which documents are considered near-duplicates and hence one can be discarded from the final output. In the sibling task of text re-use detection, setting n to {2, 3} [BCR09] and even 5 [KBK09] is considered standard. As we are interested in discarding whole documents to reduce redundancy to a minimum, we explore low values to allow for a more flexible comparison: n = {1, 2}.

Once again we use the METER corpus and its text re-use annotation. We adopt two settings. In the simple setting, we consider a news agency-newspaper pair of documents as positive iff the latter is labelled as wholly-derived and both cover the same event. In the complex setting we consider an additional triangular relationship: a newspaper-newspaper pair is considered as positive iff both are labelled as wholly-derived from the same news agency article. We restrict our similarity comparison to all those articles published on the same day, resulting in 38k and 48k comparisons in the simple and complex settings, respectively. As a consequence of this volume of comparisons, a high imbalance exists in the dataset: most pairs are negative instances. We evaluate this experiment on the basis of the F1-measure for binary classification: F1 = 2·tp / (2·tp + fp + fn), where tp, fp, and fn stand for the number of true positives, false positives, and false negatives.

Table 1 shows the results. Firstly, a more flexible comparison based on word 1-grams results in the best performance (this may imply documents which are not complete duplicates are discarded; we prefer this over including very similar notes). In both simple and complex settings the best F1 is obtained with τ = 0.25 and we select this threshold. This supports the concept of co-derivatives and reflects that the threshold is valid for both news agency-newspaper and newspaper-newspaper comparisons.

Table 1: Evolution of F1 for different similarity thresholds τ in the near-duplicate identification task.

            simple            complex
    τ       n=1     n=2       n=1     n=2
    0.10    0.397   0.645     0.422   0.679
    0.15    0.581   0.497     0.621   0.518
    0.20    0.720   0.373     0.758   0.381
    0.25    0.752   0.274     0.787   0.268
    0.30    0.723   0.205     0.755   0.192
    0.35    0.657   0.147     0.686   0.133
    0.40    0.575   0.107     0.599   0.096
    0.45    0.496   0.072     0.511   0.064
    0.50    0.422   0.049     0.429   0.043

4.3 Articles Ranking

Now we make a qualitative analysis. Table 2 shows the titles of the articles of three events ranked on the basis of our diversification model (cf. Section 3.3). Instance A tells the story of Libyan rebels and their impact on oil. The top article does summarise the event, referring to a rebel attack on naval forces. As expected, the topic of article 2 is not as close: it is about the plans to sink a ship transporting illegal oil, currently besieged by the Libyan Navy. Whereas the third article still refers to oil, rebel attacks, and even to the chances for a conflict, the latter two refer to the dismissal of the Libyan PM by the parliament. That is, we are indeed looking at a story from different angles.

Something similar occurs with instances B and C. Instance B is about the listing of a mansion. After an introductory first article, further details appear, such as price or location. Instance C tells the story of the decease of a former girlfriend of actor Jim Carrey. It is worth noting article 2, about a different event. Our event detection module got confused because this article is about the girlfriend of an actor. Whether this is relevant for a user is arguable.

4.4 Query Generation

Table 2 also shows the queries as generated by the three variations of our generator (q1, q2, and q3) plus a fourth variation: q4 = q2 + q3 (cf. Section 3.4). The NE-based q3 seems far from perfect when dealing with the titles of instances A and B. The cause is that the camel-casing is confusing the NER. The simple n-gram-based approach seems to produce sensible queries. When having the full article at hand, the NE-based model works slightly better.

Figure 2 shows the photograms of videos generated with these four kinds of queries for Table 2 instance B. Each subfigure refers to one video and each row to one news article, which can include up to five images. The whole titles from strategy q1 provide a good visual overview of the event: the listed house and its owners. Still, due to contents overlapping, some images appear more than once: coordinates {1,4; 2,2}, {1,5; 5,3; 7,2}, and {5,1; 7,1}. The chunk-level strategies result in less repetition. Strategy q3, based on NEs, is more varied: focusing on football player Tom Brady for the first two titles, moving towards the main event (the listing of a house for sale in Los Angeles), and finally the second person involved: top model Gisele Bündchen. Something similar occurs with q2's 2-grams: non-duplicated photograms centred on the couple and the listed house. Still, q2 has a problem: "The Brady report" is an Arizona radio show and the resulting photograms refer to it. Even with this mistake in mind, it seems like q2 provides a good balance between relevance and diversity. Combining NEs and 2-grams into q4 reduces variation (photograms {5,3; 7,3} and {5,5; 7,1} are the same).

Table 2: Three examples of diversification-ranked news and the generated queries: q1, q2, and q3.

A  News Corpus G, 11 March 2014
      News title (q1)                                                               | 2-grams (q2)      | NE (q3)
    1 Rebels Fired at Libyan Naval Forces: Minister                                 | Libyan            | Libyan
    2 Brincat gives no details as Libya tanker standoff continues                   | brincat gives     | Brincat
    3 Rebel group manoeuvres over Libya's oil could lead to renewed civil conflict  | rebel group       | Rebel
    4 URGENT - Libya PM                                                             | Libya PM          | URGENT
    5 Libyan parliament dismisses PM                                                | libyan parliament | –

B  News Corpus G, 20 April 2014
    1 Gisele Bundchen and Tom Brady sell home for £30million                        | Tom Brady         | Tom Brady
    2 Tom Brady wants $50m - for his mega mansion in LA                             | brady wants       | Brady
    3 For Sale: Gisele's $50 Million LA Chateau                                     | sale gisele       | Sale
    4 Tom and Gisele SELLING Mega-Estate in LA!!!!!                                 | gisele selling    | Tom
    5 Tom and Gisele's LA home is listed at $50m                                    | gisele home       | LA

C  SignalMedia 1M, 29 September 2015
    1 Cathriona White, ex-girlfriend of actor Jim Carrey, committed suicide on Monday. | Cathriona White | Cathriona White
    2 Twilight star Robert Pattinson says that those who send negative comments [. . . ] | twilight star | Twilight
    3 A former girlfriend of actor Jim Carrey has died of an apparent suicide, the [. . . ] | girlfriend actor | Ed Winter
    4 Jim Carrey and Cathriona White are spotted hand in hand leaving their [. . . ]  | posted photo      | N.Y.
    5 Jim Carrey's Irish ex-girlfriend Cathriona White has been found dead.         | 2015 /raw         | Cathriona Cappawhite

[Figure 2: Photograms of videos as generated with queries q1 to q4. One row corresponds to one article, for which up to 5 images are integrated. The top and left numbers are the photogram coordinates, for easy location.]

5 Final Remarks and Ongoing Work

We presented our first efforts on breaking the news bubble. We integrated a system for the automatic generation of videos consisting of four modules: event identification, de-duplication, diversification, and image gathering. The outcome comes in the form of short illustrated videos aiming at providing a user with different points of view in the coverage of the same event.

Departing from this architecture, we aim at using more sophisticated text representation and event identification technology. We are particularly interested in storyline generation [MSA+15, VCK15].

References

[AGAV09] Enrique Amigó, Julio Gonzalo, Javier Artiles, and Felisa Verdejo. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval, 12(4):1–32, 2009.

[AS12] Joel Azzopardi and Christopher Staff. Incremental Clustering of News Reports. Algorithms, 5(4):364–378, 2012.

[AY10] Charu C. Aggarwal and Philip S. Yu. On Clustering Massive Text and Categorical Data Streams. Knowledge and Information Systems, pages 171–196, 2010.

[BCR09] Alberto Barrón-Cedeño and Paolo Rosso. On Automatic Plagiarism Detection based on n-grams Comparison. In Advances in Information Retrieval. Proceedings of the 31st European Conference on IR Research, LNCS 5478, pages 696–700. Springer-Verlag, 2009.

[CAMM16] David Corney, Dyaa Albakour, Miguel Martinez, and Samir Moussa. What do a Million News Articles Look like? In NewsIR 2016: Recent Trends in News Information Retrieval, pages 42–47, Padua, Italy, 2016.

[CGP02] Paul Clough, Robert Gaizauskas, and Scott Piao. Building and Annotating a Corpus for the Study of Journalistic Text Reuse. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), volume V, pages 1678–1691, Las Palmas, Spain, 2002. ELRA.

[EKSX96] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pages 226–231. AAAI Press, 1996.

[Gas17] Fabio Gasparetti. Modeling User Interests from Web Browsing Activities. Data Mining and Knowledge Discovery, 31(2):502–547, 2017.

[Jac01] Paul Jaccard. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:547–579, 1901.

[KBK09] Jan Kasprzak, Michal Brandejs, and Miroslav Křipač. Finding Plagiarism by Evaluating Document Similarities. CEUR-WS.org, volume 502, pages 24–28, San Sebastián, Spain, 2009.

[LB16] Jey Han Lau and Timothy Baldwin. An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 78–86, 2016.

[LBM04] Caroline Lyon, Ruth Barret, and James Malcolm. A Theoretical Basis to the Automated Detection of Copying Between Texts, and its Practical Implementation in the Ferret Plagiarism and Collusion Detector. In Plagiarism: Prevention, Practice and Policies Conference, Newcastle upon Tyne, UK, 2004.

[LM14] Quoc Le and Tomas Mikolov. Distributed Representations of Sentences and Documents. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014.

[MSA+15] Anne-Lyse Minard, Manuela Speranza, Eneko Agirre, Itziar Aldabe, Marieke van Erp, Bernardo Magnini, German Rigau, and Ruben Urizar. SemEval-2015 Task 4: TimeLine: Cross-document event ordering. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 778–786. ACL, 2015.

[SCK+06] Nachiketa Sahoo, Jamie Callan, Ramayya Krishnan, George Duncan, and Rema Padman. Incremental hierarchical clustering of text documents. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management, CIKM '06, pages 357–366, New York, NY, 2006. ACM.

[VCK15] Piek Vossen, Tommaso Caselli, and Yiota Kontzopoulou. Storylines for structuring massive streams of news. In Proceedings of the First Workshop on Computing News Storylines, pages 40–49, Beijing, China, 2015. ACL.
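As an illustration of the event identification module (Section 3.1), the following minimal Python sketch clusters a small batch of document vectors with DBSCAN under cosine distance, with eps = 0.55 and a minimum neighbourhood size of 2 as selected in Section 4.1. This is not the authors' implementation: the use of scikit-learn and the toy 2-d vectors (standing in for doc2vec embeddings) are our assumptions.

```python
# Sketch of Section 3.1: DBSCAN-based event identification over document
# vectors, using cosine distance, eps=0.55, and min_samples=2 so that
# singleton documents are labelled as noise (-1).
import numpy as np
from sklearn.cluster import DBSCAN

def identify_events(doc_vectors, eps=0.55, min_samples=2):
    """Cluster a batch of document vectors into events.

    Returns one cluster label per document; -1 marks noise (singletons).
    """
    clusterer = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine")
    return clusterer.fit_predict(np.asarray(doc_vectors))

if __name__ == "__main__":
    # Toy vectors: two tight groups plus one unrelated singleton.
    docs = [[1.0, 0.0], [0.99, 0.05],   # "event A"
            [0.0, 1.0], [0.05, 0.99],   # "event B"
            [-1.0, -1.0]]               # singleton -> noise
    print(identify_events(docs))
```

Because DBSCAN, unlike k-means, receives no cluster count, the number of detected events is driven entirely by eps and the density of the batch, which is exactly the property Section 3.1 relies on.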
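The de-duplication module of Section 3.2 can likewise be sketched in a few lines: documents become bags of word 1-grams after casefolding, stopword removal, and dropping tokens shorter than 2 characters, and a pair is a near-duplicate when its Jaccard coefficient exceeds τ = 0.25 (the values tuned in Section 4.2). The tiny stopword list and the example titles are invented for illustration.

```python
# Sketch of Section 3.2: near-duplicate removal via Jaccard similarity
# over bags of word n-grams (n=1, tau=0.25 as tuned on METER).
import re

STOPWORDS = {"the", "a", "an", "of", "in", "on", "at", "and", "to"}  # stand-in list

def ngram_bag(text, n=1):
    tokens = [t for t in re.findall(r"\w+", text.lower())
              if len(t) >= 2 and t not in STOPWORDS]
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def deduplicate(articles, n=1, tau=0.25):
    """Keep an article only if it is not a near-duplicate of a kept one."""
    kept, bags = [], []
    for text in articles:
        bag = ngram_bag(text, n)
        if all(jaccard(bag, other) <= tau for other in bags):
            kept.append(text)
            bags.append(bag)
    return kept

if __name__ == "__main__":
    event = ["Rebels fired at Libyan naval forces",
             "Libyan rebels fired at naval forces on Monday",  # near-duplicate
             "Libyan parliament dismisses PM"]
    print(deduplicate(event))
```

With n = 1 the comparison is deliberately loose, matching the paper's preference for discarding very similar notes over keeping them.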
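The diversification ranking of Section 3.3 (Eq. 1) admits a direct sketch: each article's score is the negated average similarity to the other cluster members (divided by |c|, as in the paper), articles are listed most-dissimilar first, and the article most similar to the rest is moved to the front as the opener. The article ids and toy vectors below are invented; this is an illustration of Eq. 1, not the authors' code.

```python
# Sketch of Section 3.3: diversification ranking per Eq. 1,
# score(d) = -(1/|c|) * sum of sim(d, d') over the other cluster members,
# with the most representative article promoted to open the video.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_for_diversity(cluster):
    """cluster: list of (article_id, vector) pairs.

    Returns article ids: the opener (most similar to the rest) first,
    then the remaining articles from most to least dissimilar."""
    def score(i):
        others = [v for j, (_, v) in enumerate(cluster) if j != i]
        return -sum(cosine(cluster[i][1], v) for v in others) / len(cluster)

    ranked = sorted(range(len(cluster)), key=score, reverse=True)
    opener = ranked.pop()  # lowest score = most similar to the rest
    return [cluster[i][0] for i in [opener] + ranked]
```

For example, in a three-article cluster where articles "a" and "b" are near-identical and "c" is an outlier, "b" (closest to everything) opens, followed by the outlier "c", then "a".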
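Finally, the BCubed measures used for clustering evaluation in Section 4.1 (Eqs. 2 and 3) can be computed directly from gold event labels and predicted cluster labels: TP(d) counts the documents sharing both d's event and d's cluster, and P and R average TP(d)/|c(d)| and TP(d)/|e(d)| over all documents. The minimal sketch below assumes label dictionaries; it is an illustration of the formulas, not the evaluation code used in the paper.

```python
# Sketch of Eqs. 2-3: BCubed precision, recall, and F1 for a clustering,
# given gold event labels and predicted cluster labels per document.
def bcubed(gold, pred):
    """gold, pred: dicts mapping doc id -> label. Returns (P, R, F1)."""
    docs = list(gold)
    P = R = 0.0
    for d in docs:
        e_d = {x for x in docs if gold[x] == gold[d]}  # e(d): same gold event
        c_d = {x for x in docs if pred[x] == pred[d]}  # c(d): same predicted cluster
        tp = len(e_d & c_d)                            # TP(d), Eq. 2
        P += tp / len(c_d)
        R += tp / len(e_d)
    P, R = P / len(docs), R / len(docs)
    return P, R, 2 * P * R / (P + R)
```

A perfect clustering yields P = R = F1 = 1, while merging everything into one cluster keeps R at 1 but drags P down, which mirrors the precision-driven drop reported for large eps values in Section 4.1.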