Semi-Supervised Events Clustering in News Retrieval Jack G. Conrad Michael Bender Thomson Reuters Thomson Reuters Corporate Research & Development Thomson Reuters Global Resources Saint Paul, Minnesota 55123 USA Baar, Zug 6340 Switzerland jack.g.conrad@thomsonreuters.com michael.bender@thomsonreuters.com 1 Introduction Abstract 1.1 Motivations Thomson Reuters has been exploring alternative mod- The presentation of news articles to meet els for organizing and rendering articles found in its research needs has traditionally been a news repository. Whether the users are editors, finan- document-centric process. Yet users often cial analysts, lawyers or other professional researchers, want to monitor developing news stories based a more effective means of examining a set of event- on an event, rather than by examining an related news articles beyond that of a ranked list of exhaustive list of retrieved documents. In documents was expressly sought. The presentation of this work, we illustrate a news retrieval sys- news articles based on events aligns well with contem- tem, eventNews, and an underlying algorithm porary research use cases, such as those arising in the which is event-centric. Through this system, finance and risk sectors, where there is a salient need news articles are clustered around a single for more effectively organized news content through news event or an event and its sub-events. The the lens of events. Other news organizations such as algorithm presented can leverage the creation Google have experimented with news clustering, but of new Reuters stories and their compact la- in the absence of the concrete use cases of Thomson bels as seed documents for the clustering pro- Reuters’ professional users. cess. The system is configured to generate This project uses semi-supervised clustering capa- top-level clusters for news events based on an bilities in order to group news documents based upon editorially supplied topical label, known as a shared news events. Germinal Reuters stories with ed- ‘slugline,’ and to generate sub-topic-focused itorially assigned labels (a.k.a. ‘sluglines’) are used as clusters based on the algorithm. The system seed documents for event identification and organiza- uses an agglomerative clustering algorithm to tion. This task addresses the fundamental aim of the gather and structure documents into distinct project. result sets. Decisions on whether to merge re- lated documents or clusters are made accord- ing to the similarity of evidence derived from two distinct sources, one, relying on a digital signature based on the unstructured text in the document, the other based on the presence of named entity tags that have been assigned to the document by a named entity tagger, in this case Thomson Reuters’ Calais engine. Copyright c 2016 for the individual papers by the paper’s au- thors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: M. Martinez, U. Kruschwitz, G. Kazai, D. Corney, F. Hopf- gartner, R. Campos and D. Albakour (eds.): Proceedings of the NewsIR’16 Workshop at ECIR, Padua, Italy, 20-March-2016, Figure 1: News Events Clustering Process published at http://ceur-ws.org 1.2 Objectives gorithms necessary to cluster real-time news articles [5]. But they have focused largely on the math behind The main objective of this project is to develop an the clustering rather than the use case and practition- event-centric news paradigm that solves the challenge ers benefitting from it. of event validation and event story clustering at scale. Some of the earliest work in this area was pursued This goal is in response to feedback received from con- under DARPA and NIST funding and resulted in re- sumers on news in their products. In addition to or- ports written by various forums created to advance the ganizing news results around events rather than docu- state of the art in event detection [3, 1]. ments, another goal of this study is to provide a mech- There have also been research group work and dis- anism for clustering third-party (non-Reuters) news sertations on the subject of topic detection and track- documents together with corresponding Reuters arti- ing resulting from the above research [12, 11]. Sub- cles around common news events. This is aided by sequent work has attempted to capture some of the leveraging metadata tags that exist in Reuters news structure of events and their dependencies in a news articles about the same topical event. Since these tags topic by creating a model of events, a.k.a. ‘event distinguish Reuters news documents from third-party threading’ [10]. Yet more recently there have been content, it is possible to consider using them as the actual forums under large umbrella organizations like basis for grouping news articles together. The ini- ACL focusing on automatically computing news sto- tial plan for this project was developed in conjunc- ries (and their titles) [2, 14]. tion with R&D’s partner, the news asset owner and subject matter expert (SME), to use the initial or There is also another field of research that addresses top-level story labels known as primary sluglines (e.g., event extraction in the ACE tradition1 that is rel- VOLKSWAGEN-EMISSION-FRAUD/ ) as an orga- evant to the context of our current work, e.g., [9]. nizing principle for top-level clusters, and an algo- What is distinct about our present project, however, rithmic means for creating lower-level clusters which is the use of SME-defined seed stories and labels in a can incorporate second tier story labels known as sec- semi-supervised manner and the subsequent clustering ondary sluglines (e.g., VOLKSWAGEN-EMISSION- stages at scale for real world news streams. FRAUD/COMPENSATION). Worth noting is that one of the building blocks of the current work is represented by an initial form of 1.3 Workflow Illustration ‘local’ clustering that involves the identification and In Figure 1, we see an example involving the “General grouping of exact and fuzzy duplicate documents [8]. Motors Recall” for faulty ignition switches. Through This takes place in the stage immediately preceding regular editorial practices, journalists write and tag the final, aggregated clustering step. event-related stories. The first story with the first 3 Data Resources “GM Recall” tag serves as the seed story for initiating the cluster. As Reuters writes and tags more stories The news repository under examination in this effort about the GM Recall, the set of tags and text defin- is known as NewsRoom. It is a Thomson Reuters news ing the GM Recall event expands. As it expands, so aggregation platform. It consists of approximately 15- too does the algorithm’s grasp of the event, helping 30 million documents per year from 12,000 indepen- it to better identify cluster candidates, in particular, dent news sources which consist of national and lo- within third-party news. Both the editorially gener- cal newspapers, periodic journals, radio program tran- ated slugline responsible for the birth of the cluster scriptions, etc. From 2012 to 2015, NewsRoom con- and the algorithmic identification and population of sisted of approximately 80 million news articles. These subsequent sub-clusters are depicted in the figure. were the target of our investigation for this project (Table 1).2 2 Previous Work In order to test our news workflow and the cluster- Previous work published on the topic of news events ing algorithms that support it, we focus on chunks of structuring has been largely academic in nature, for data representing approximately three months of doc- example, as in Borglund [6]. This thesis includes three uments at a time. contributions: a survey of known clustering methods, Having investigated baseline news clusters in earlier an evaluation of human versus human results when research efforts (i.e., baseline algorithm, its granular- grouping news articles in an event-centric manner, and ity, speed and complexity) we have subsequently pur- lastly an evaluation of an incremental clustering algo- sued improvements and efficiencies to help us approach rithm to see if it is possible to consider a reduced input 1 http://www.itl.nist.gov/iad/mig/tests/ace/ size and still get a sufficient result. 2 Thomson Reuters has long made comparably large In addition, there have been journal articles that news collections available for external research: http: have explored the computational complexity of the al- //trec.nist.gov/data/reuters/reuters.html uments in the repository, or some user-defined sub- Table 1: NewsRoom Integrated Data Sources set of them. Since the repository contains substantial Year Sources Document Count numbers of Reuters and non-Reuters financial doc- 2012 Reuters / Diverse 14.6M uments, for example, some stories are largely non- 2013 ” 20.3M textual, e.g., containing tabular information only; very 2014 ” 27.8M short, e.g., stubs for stories in progress stories; or 2015 ” 20.0M meta-data snippets for topics that were not substan- Total ” 82.7M tiated. These types of documents would be consid- ered non-recommendable and thus are not retrieved our objectives more effectively. for subsequent processing. In general, over half of the 4 Methods documents in the repository would be classified as rec- ommendable for this use case. The NewsRoom envi- Given our substantial data resources and our goal to ronment comes with a recommendation classifier. Ad- build a flexible experimental retrieval environment, we ditional details beyond those provided above would be have established three stages for processing and clus- beyond the scope of our current focus. tering a large set of news documents around news The extraction process results in all recommend- events (Figure 2). These stages include: (1) document able documents being loaded from the repository to extraction (Reuters and non-Reuters articles) from our an Apache Derby JDBC relational database. The news repository; (2) local clustering based on duplicate tabular data structures that store the documents and document detection of identical and fuzzy duplicates subsequent clusters contain basic information such as [7]; and (3) aggregate clustering performed over the doc id, dataset name, doc date, title, article source, result set from stage 2. We have determined empiri- source url (if applicable), body, body length, together cally that the local clustering stage works highly effec- with tens of additional features that can be used to dis- tively [8]. It is the aggregate clustering stage that has criminate and used by various classifiers, e.g., primary spawned ongoing research, evaluation and refinement. news code, short sentence count, ticker count, quan- This stage consists of the application of hierarchical tity of numbers, quantity all-caps, quantity of press agglomerative clustering, where different types of clus- releases, etc. These additional features are available ter centroid representations were examined. Although for subsequent downstream processing such as classi- we provide descriptions of each of the three processing fication, routing or clustering. stages below, it is the third of these stages that is the principal focus of our latest efforts and this research 4.2 Local Clustering Stage report. The next process, local clustering, is designed to rapidly and efficiently identify initial clusters based on documents that satisfy criteria for identical or fuzzy duplicates. Documents are compared using two types of digital signatures that harness the most discrimi- nating terms, one, smaller and more compact lever- aging O(10) terms, is used to identify identical du- plicates; another, more expansive, leveraging O(100) terms, is used to identify fuzzy duplicates. The pro- cess being executed uses techniques reported on in [8]. For this application, a rolling window of n days is used, where (n < 10). Documents falling within this window are compared. Heuristics relying on features such as Figure 2: News Events Clustering Functional Stages doc length, are also invoked to reduce the number of comparisons required. For example, when a document 4.1 Document Extraction Stage exceeds the length of another by 20% or more, though they may satisfy a containment relationship, accord- The document extraction process can be customized ing to our definition, they would not be considered to facilitate experimentation such as that undertaken ‘duplicates.’ for this study. NewsRoom represents a news repos- itory of both Reuters and non-Reuters sources cov- 4.3 Aggregate Clustering Stage ering roughly 12,000 news sources. Given a date range, e.g., [20141001T0000000Z 20141231T235959Z], During the third, aggregate clustering stage, the clus- one can extract all of the ‘recommendable’ news doc- ters are initiated via seminal Reuters articles contain- ing slugline tags. These tags are distinct from head- clusters. It consists of two data structures, both repe- lines, as shown in Figure 3. The articles with sluglines sented in vector form. The first is a term-based vec- may be singletons or they may exist in one of the lo- tor. It is used to determine the degree of overlap be- cal clusters formed in preceding stage. Both of these tween two cluster centroids, constituted by two central ‘objects’ qualify to serve as a cluster ‘seed.’ ‘documents’ (e.g., longest, most recent, true centroid, etc.). The second is a tag-based vector, representing a set of Calais tags present in the cluster’s documents. The similarity measures used in each of these cases is thresholded, with the threshold determined empiri- cally. In the case of the term vectors for the unstruc- Figure 3: Reuters Article - Slugline Illustration tured text, the thresholds are set high, although not as Two main challenges confronted when implementing high as those for duplication detection used in stage 2. this hierarchical, agglomerative clustering stage were, In the case of the set of Calais tags for the structured first, finding the best set of features and metrics to text, a weighted sum is used, whereby various combi- decide whether a pair of singletons or local clusters nations of named entities can be assembled to satisfy justify merging into larger clusters while still remain- the threshold for merging. ing sufficiently cohesive, and, second, identifying the optimal sequence for comparing these clusters when Table 2: Experimental Processing (4 QTR 2014) considering merging (Figure 4). Stage Name Type Count 1. Document Extract documents 3.63M 2. Local Clustering clusters 2.10M 3. Agglom. Clustering clusters 1.67M 5 Evaluation Given the objectives of this study with respect to retrieval performance and organizational structure, evaluation is an essential piece of the validation process. After having conducted a number of trials Figure 4: Approaches to Source to Target Merging to establish various thresholds (document or cluster similarity, named entity similarity, etc.), we conducted Based upon observations made by subject matter a trial which focused on a number of news events experts who created exemplar news clusters to sup- chosen by subject matter experts (SMEs) from the port the project, we determined that there were two, final quarter of 2014. We focused on the set of often independent, means by which documents could high-level news events shown below. be identified as belonging to the same news event. One involves the unstructured text of an article; the other 1. Halliburton Buying Baker Hughes (Nov. 13, 15) involves the structured text, in our case, documents 2. Defense Secretary Hagel Resigns (Nov. 24, 25) that have been tagged by the Calais named-entity tag- 3. Air Asia Crash (Dec. 28, 31) ging engine [13, 4]. Given that articles involving news 4. Pope Urges Tolerance in Turkey (Nov. 28) events can be found to be similar based on either of 5. Lufthansa Braces for Next Strike (Dec. 3) these two feature spaces, our approach to aggregate 6. Iran Rouhani Says Will Try to Clinch Nuclear (stage 3) clustering is robust: a decision to merge two Deal in Talks (Dec. 15) of these documents or local clusters can be based on 7. Alstom Nearing $700M Bribery Settlement (Dec. the similarity between the unstructured text of two ob- 16) jects, the tagged named entities that have been iden- tified by Calais (listed below), or both. For each of the events identified, result sets were created and stored in worksheets (Table 2 presents • People – person name entities dataset details). The result sets consisted of numer- • Reuters Instrument Codes (RICs) – for companies ous clusters on the subject of the event (often involving • Reuters Classification System (RCS) – for topics named entities such as Halliburton, Hagel, the Pope, & industries Rouhani, Alstom, etc.), some of which are on the topic • Topics – domain independent topical phrases of the news event, some of which address the entity in • Smart Terms – topical taxonomy terms other contexts. For those that were on the subject of Operationally, the hybrid feature set described the event, the clusters represent sub-topical (second- above is used to decide whether or not to merge two level) clusters (see VW example in Section 1.2). Re- garding the result worksheets, in addition to doc ids, couple of the outliers found in the list of events, i.e., they included local cluster and batch cluster ids, date nos. 2 and 7. In the case of the latter, there was and time stamp, document title, document length and greater variety in the news sources and articles report- URL link to the complete news article (if available). ing on the statements coming from the Iranian leader, The worksheets were presented to two evaluators, both and as a result, the algorithm may not have captured subject matter experts from the news domain.3 the overarching similarity among the documents. In Two metrics were used to evaluate these experi- addition, there was a greater variety of persons men- ments. First, the assessors scored each cluster for co- tioned in these articles who were responding to Presi- herence and accuracy, making sure that all of the doc- dent Rouhani. uments that belong to a specific cluster were present, Regarding the queuing strategy and its impact on and that all of the documents that didn’t belong were agglomerative clustering and merging (Figure 4), we not present. The cluster database was queried broadly, conducted a series of experiments that involved differ- e.g., ‘Defense Secretary Hagel’, in order to permit the ent strategies, including least-recently-used and most- assessors to have access to clusters both about and not recently-used. Other strategies tended to have a signif- about the event in question, again, in order to inspect icant impact on computational complexity insofar as those documents that belong in the relevant clusters it was necessary to perform real-time tracking of dy- and those that do not. For this task, they used a five- namic cluster characteristics. Although the spectrum point Likert scale, A (very good) thru F (very weak), of considerations involved in those experiments may codified as 5-to-1.4 Secondly, the assessors determined be beyond the scope of the current reporting space, a ‘cluster edit distance’ for each cluster solution, indi- we found that the most-recently-used was as effective cating which sub-clusters they would merge and which a queuing strategy as the majority of others investi- they would split, if any, to achieve an optimal solution. gated. Each merge or split step would be the cluster equiv- There is clearly room for improved performance and alent of an ‘edit’ in the standard character-based edit additional evaluation. One way of addressing some of distance measure. The results of this assessment task the disparities revealed above is by tuning the joint are presented in the Table 3. thresholds for document signature and named entities In general, we see that with few exceptions, the ma- tagged. Alternatively, one could have the thresholds jority of clusters returned for our queries were about learned and optimized depending on features associ- the underlying event(s) (Table 3, column 4). In ad- ated with the documents (e.g., range of idfs in the dition, the coherence/accuracy scores for the clusters signatures, number and type of entities in the docu- reviewed were in the 4.0 or ‘B’ range, some higher, ment). Moreover, one could use a variable weighted some lower. When the same entities, but out-of- sum of the similarity scores, depending on the contri- event clusters are included (column 3), their scores are bution of the named entities and distinguishing terms slighty higher, still in the 4.0 or ‘B’ range.5 In terms present in the articles being compared. of the cluster edit distances measured, for the seven 6 Conclusions news events represented in the table, the mean num- ber of ‘splits’ required for each cluster set was λ=1.15 The news events clustering efforts summarized in this (σ=1.2) while the mean number of merges was λ=4.7 report and depicted in Figure 1 represent a combi- (σ=4.3). nation of semi-supervised clustering techniques and Clearly the larger numbers appearing in the con- human-generated, labeled data. They aim to deliver text of merges have been influenced significantly by a an effective solution by leveraging Reuters’ labels and validating the scope of events at scale. The ultimate 3 The first SME assessed the quality of both types of goal of the study is to determine to what extent com- clusters, those about the event and those not; the second bined human-computer resources can produce event- SME assessed the quality of the event clusters only. based clusters that are considerably more useful – i.e., 4 The five grades used in the American educational sys- more effective – than exhaustive lists of unstructured tem are A-B-C-D-F, which range from exceptional (A) to documents. In addition, third-party content can be failure (F). E is not used. gathered and organized around existing clustered con- 5 Although in aggregate, the mean of the grades as- tent based upon Reuters’ own editorially labeled and signed the clusters by the two SMEs were comparable, when we calculated the weighted Kappa score for inter- classified news events. The variety of challenges con- reviewer agreement, we found that they were not as uni- fronted – using Reuters’ metadata, getting the granu- form, as the scores generally fell into the bottom quartile. larity right, and scaling the solution – all depend on The reviewers assigned identical grades in only about a the right mix within this integration. By tracking the third of the cases. In the majority of the other cases, they steps outlined above, we anticipate having a more ro- were one and sometimes two grades apart. bust working model available for evaluation in the near Table 3: Graded Assessments of News Events Clusters No. Event Title No. No. Mean Avg Mean Avg Score Clusters Clusters Score for for Event Clusters on Event All Clusters SME #1 SME #2 1. Halliburton Buying Baker Hughes 6 5 4.33 4.20 4.00 2. Defense Secretary Hagel Resigns 24 17 4.02 3.94 2.94 3. Air Asia Crash 14 7 3.93 3.64 3.50 4. Pope Urges Tolerance in Turkey 7 6 4.29 4.17 4.33 5. Lufthansa Braces for Next Strike 5 2 4.00 3.00 4.50 6. Iran Rouhani Tries to Secure 59 47 3.99 3.86 3.93 Nuclear Deal 7. Alstom Nearing $700M Bribery 5 3 3.80 3.50 4.50 Settlement T. Total Avg = 4.05 Avg = 3.73 Avg = 3.95 future. Anticipated amendments or extensions of the [6] Jon Borglund. Event-centric clustering of news arti- model are addressed below. cles. Masters thesis, University of Uppsala, Sweden, Oct. 2013. 7 Future Work [7] Jack G. Conrad, Joanne C. Claussen, and Jie Lin. In- In future work, we will extend our evaluations by com- formation retrieval systems with duplicate document paring our results with exemplar clusters identified by detection and presentation functions. U.S. Patent our SMEs, both in terms of granularity and in terms #7,809,695, Oct. 2010. of completeness, at the top, topical cluster level and [8] Jack G. Conrad, Xi S. Guo, and Cindy P. Schriber. lower, sub-topical level of resulting clusters. This form Online duplicate document detection: Signature reli- of assessment addresses overall cluster precision. We ability in a dynamic retrieval environment. In Pro- will also need to conduct tests that approach evaluat- ceedings of the 12th Conference on Information and ing recall, i.e., of all the possible news events in the Knowledge Management (CIKM03), pages 243–252. data set or sample, how many do we capture and rep- ACM Press, Nov. 2003. resent at top and lower levels of the shallow hierarchy? [9] Qi Li, Heng Ji, and Liang Huang. Joint event extrac- tion via structured prediction with global features. In 8 Acknowledgments Proceedings of the 51st Annual Meeting of the ACL, The authors thank Sarah Edmonds at TRGR for her pages 73–82. Association for Computational Linguis- diligent work assessing result sets. We are also grateful tics, Aug. 2013. to Brian Romer with Reuters Data Innovation Lab for [10] Ramesh Nallipati, Ao Feng, Fuchun Peng, and James his innovative work on the UI and demo (to be shown Allan. Event threading within news topics. In Pro- at the workshop). ceedings of the 13th Conference on Information and Knowledge Management (CIKM04), pages 446–453. References ACM Press, Nov. 2004. [1] Topic Detection and Tracking Workshops, Washing- ton, D.C., 2004. NIST. [11] Ron Papka. On-Line New Event Detection, Cluster- ing, and Tracking. Ph.d. thesis, University of Mas- [2] First Workshop on Computing News Storylines sachusetts - Amherst, Sept. 1999. (CNewS 2015), Beijing, PRC, July 2015. ACL. [12] Jakub Piskorski, Hristo Tanev, Martin Atkinson, and [3] James Allan, Jaime Carbonell, George Doddingtom, Erik van der Gout. Cluster-centric approach to Jonathan Yamron, and Yiming Yang. Topic de- news event extraction. In 2008 Conference on New tection and tracking pilot study final report. In Trends in Multimedia and Network Information Sys- DARPA Broadcast News Transcription & Under- tems, pages 276–290, 2008. standing Workshop, Feb. 1998. [4] Samet Atdag and Vincent Labatut. A comparison of [13] Thomson Reuters. Open Calais NamedTM Entity Tag- named entity recognition tools applied to biograph- ging Engine. http://www.opencalais.com, 2016. ical texts. In 2nd International Conference on Sys- [14] Piek Vossen, Tommaso Caselli, and Yiota Kont- tems and Computer Science (ICSCS13), pages 228– zopoulou. Storylines for structuring massive streams 233. IEEE, Aug. 2013. of news. In Proceedings of the First Workshop on [5] Joel Azzopardi and Christopher Staff. Incremental Comparing News Storylines, pages 40–49. ACL and clustering of news reports. Algorithms, 5:364–378, Asian Federation of NLP, July 2015. 2012.