Semi-Supervised Events Clustering in News Retrieval
                    Jack G. Conrad                                        Michael Bender
                   Thomson Reuters                                       Thomson Reuters
          Corporate Research & Development                      Thomson Reuters Global Resources
           Saint Paul, Minnesota 55123 USA                          Baar, Zug 6340 Switzerland
          jack.g.conrad@thomsonreuters.com                      michael.bender@thomsonreuters.com

                                                                1     Introduction

                       Abstract                                 1.1   Motivations

                                                                Thomson Reuters has been exploring alternative mod-
    The presentation of news articles to meet                   els for organizing and rendering articles found in its
    research needs has traditionally been a                     news repository. Whether the users are editors, finan-
    document-centric process. Yet users often                   cial analysts, lawyers or other professional researchers,
    want to monitor developing news stories based               a more effective means of examining a set of event-
    on an event, rather than by examining an                    related news articles beyond that of a ranked list of
    exhaustive list of retrieved documents. In                  documents was expressly sought. The presentation of
    this work, we illustrate a news retrieval sys-              news articles based on events aligns well with contem-
    tem, eventNews, and an underlying algorithm                 porary research use cases, such as those arising in the
    which is event-centric. Through this system,                finance and risk sectors, where there is a salient need
    news articles are clustered around a single                 for more effectively organized news content through
    news event or an event and its sub-events. The              the lens of events. Other news organizations such as
    algorithm presented can leverage the creation               Google have experimented with news clustering, but
    of new Reuters stories and their compact la-                in the absence of the concrete use cases of Thomson
    bels as seed documents for the clustering pro-              Reuters’ professional users.
    cess. The system is configured to generate                     This project uses semi-supervised clustering capa-
    top-level clusters for news events based on an              bilities in order to group news documents based upon
    editorially supplied topical label, known as a              shared news events. Germinal Reuters stories with ed-
    ‘slugline,’ and to generate sub-topic-focused               itorially assigned labels (a.k.a. ‘sluglines’) are used as
    clusters based on the algorithm. The system                 seed documents for event identification and organiza-
    uses an agglomerative clustering algorithm to               tion. This task addresses the fundamental aim of the
    gather and structure documents into distinct                project.
    result sets. Decisions on whether to merge re-
    lated documents or clusters are made accord-
    ing to the similarity of evidence derived from
    two distinct sources, one, relying on a digital
    signature based on the unstructured text in
    the document, the other based on the presence
    of named entity tags that have been assigned
    to the document by a named entity tagger, in
    this case Thomson Reuters’ Calais engine.


Copyright c 2016 for the individual papers by the paper’s au-
thors. Copying permitted for private and academic purposes.
This volume is published and copyrighted by its editors.
In: M. Martinez, U. Kruschwitz, G. Kazai, D. Corney, F. Hopf-
gartner, R. Campos and D. Albakour (eds.): Proceedings of the
NewsIR’16 Workshop at ECIR, Padua, Italy, 20-March-2016,              Figure 1: News Events Clustering Process
published at http://ceur-ws.org
1.2   Objectives                                             gorithms necessary to cluster real-time news articles
                                                             [5]. But they have focused largely on the math behind
The main objective of this project is to develop an
                                                             the clustering rather than the use case and practition-
event-centric news paradigm that solves the challenge
                                                             ers benefitting from it.
of event validation and event story clustering at scale.
                                                                 Some of the earliest work in this area was pursued
This goal is in response to feedback received from con-
                                                             under DARPA and NIST funding and resulted in re-
sumers on news in their products. In addition to or-
                                                             ports written by various forums created to advance the
ganizing news results around events rather than docu-
                                                             state of the art in event detection [3, 1].
ments, another goal of this study is to provide a mech-
                                                                 There have also been research group work and dis-
anism for clustering third-party (non-Reuters) news
                                                             sertations on the subject of topic detection and track-
documents together with corresponding Reuters arti-
                                                             ing resulting from the above research [12, 11]. Sub-
cles around common news events. This is aided by
                                                             sequent work has attempted to capture some of the
leveraging metadata tags that exist in Reuters news
                                                             structure of events and their dependencies in a news
articles about the same topical event. Since these tags
                                                             topic by creating a model of events, a.k.a. ‘event
distinguish Reuters news documents from third-party
                                                             threading’ [10]. Yet more recently there have been
content, it is possible to consider using them as the
                                                             actual forums under large umbrella organizations like
basis for grouping news articles together. The ini-
                                                             ACL focusing on automatically computing news sto-
tial plan for this project was developed in conjunc-
                                                             ries (and their titles) [2, 14].
tion with R&D’s partner, the news asset owner and
subject matter expert (SME), to use the initial or               There is also another field of research that addresses
top-level story labels known as primary sluglines (e.g.,     event extraction in the ACE tradition1 that is rel-
VOLKSWAGEN-EMISSION-FRAUD/ ) as an orga-                     evant to the context of our current work, e.g., [9].
nizing principle for top-level clusters, and an algo-        What is distinct about our present project, however,
rithmic means for creating lower-level clusters which        is the use of SME-defined seed stories and labels in a
can incorporate second tier story labels known as sec-       semi-supervised manner and the subsequent clustering
ondary sluglines (e.g., VOLKSWAGEN-EMISSION-                 stages at scale for real world news streams.
FRAUD/COMPENSATION).                                             Worth noting is that one of the building blocks of
                                                             the current work is represented by an initial form of
1.3   Workflow Illustration                                  ‘local’ clustering that involves the identification and
In Figure 1, we see an example involving the “General        grouping of exact and fuzzy duplicate documents [8].
Motors Recall” for faulty ignition switches. Through         This takes place in the stage immediately preceding
regular editorial practices, journalists write and tag       the final, aggregated clustering step.
event-related stories. The first story with the first
                                                             3    Data Resources
“GM Recall” tag serves as the seed story for initiating
the cluster. As Reuters writes and tags more stories         The news repository under examination in this effort
about the GM Recall, the set of tags and text defin-         is known as NewsRoom. It is a Thomson Reuters news
ing the GM Recall event expands. As it expands, so           aggregation platform. It consists of approximately 15-
too does the algorithm’s grasp of the event, helping         30 million documents per year from 12,000 indepen-
it to better identify cluster candidates, in particular,     dent news sources which consist of national and lo-
within third-party news. Both the editorially gener-         cal newspapers, periodic journals, radio program tran-
ated slugline responsible for the birth of the cluster       scriptions, etc. From 2012 to 2015, NewsRoom con-
and the algorithmic identification and population of         sisted of approximately 80 million news articles. These
subsequent sub-clusters are depicted in the figure.          were the target of our investigation for this project
                                                             (Table 1).2
2     Previous Work                                              In order to test our news workflow and the cluster-
Previous work published on the topic of news events          ing algorithms that support it, we focus on chunks of
structuring has been largely academic in nature, for         data representing approximately three months of doc-
example, as in Borglund [6]. This thesis includes three      uments at a time.
contributions: a survey of known clustering methods,             Having investigated baseline news clusters in earlier
an evaluation of human versus human results when             research efforts (i.e., baseline algorithm, its granular-
grouping news articles in an event-centric manner, and       ity, speed and complexity) we have subsequently pur-
lastly an evaluation of an incremental clustering algo-      sued improvements and efficiencies to help us approach
rithm to see if it is possible to consider a reduced input       1 http://www.itl.nist.gov/iad/mig/tests/ace/
size and still get a sufficient result.                          2 Thomson Reuters has long made comparably large
   In addition, there have been journal articles that        news collections available for external research: http:
have explored the computational complexity of the al-        //trec.nist.gov/data/reuters/reuters.html
                                                            uments in the repository, or some user-defined sub-
     Table 1: NewsRoom Integrated Data Sources
                                                            set of them. Since the repository contains substantial
    Year       Sources       Document Count
                                                            numbers of Reuters and non-Reuters financial doc-
    2012 Reuters / Diverse         14.6M                    uments, for example, some stories are largely non-
    2013          ”                20.3M                    textual, e.g., containing tabular information only; very
    2014          ”                27.8M                    short, e.g., stubs for stories in progress stories; or
    2015          ”                20.0M                    meta-data snippets for topics that were not substan-
    Total         ”                82.7M                    tiated. These types of documents would be consid-
                                                            ered non-recommendable and thus are not retrieved
our objectives more effectively.
                                                            for subsequent processing. In general, over half of the
4     Methods                                               documents in the repository would be classified as rec-
                                                            ommendable for this use case. The NewsRoom envi-
Given our substantial data resources and our goal to        ronment comes with a recommendation classifier. Ad-
build a flexible experimental retrieval environment, we     ditional details beyond those provided above would be
have established three stages for processing and clus-      beyond the scope of our current focus.
tering a large set of news documents around news               The extraction process results in all recommend-
events (Figure 2). These stages include: (1) document       able documents being loaded from the repository to
extraction (Reuters and non-Reuters articles) from our      an Apache Derby JDBC relational database. The
news repository; (2) local clustering based on duplicate    tabular data structures that store the documents and
document detection of identical and fuzzy duplicates        subsequent clusters contain basic information such as
[7]; and (3) aggregate clustering performed over the        doc id, dataset name, doc date, title, article source,
result set from stage 2. We have determined empiri-         source url (if applicable), body, body length, together
cally that the local clustering stage works highly effec-   with tens of additional features that can be used to dis-
tively [8]. It is the aggregate clustering stage that has   criminate and used by various classifiers, e.g., primary
spawned ongoing research, evaluation and refinement.        news code, short sentence count, ticker count, quan-
This stage consists of the application of hierarchical      tity of numbers, quantity all-caps, quantity of press
agglomerative clustering, where different types of clus-    releases, etc. These additional features are available
ter centroid representations were examined. Although        for subsequent downstream processing such as classi-
we provide descriptions of each of the three processing     fication, routing or clustering.
stages below, it is the third of these stages that is the
principal focus of our latest efforts and this research     4.2   Local Clustering Stage
report.
                                                            The next process, local clustering, is designed to
                                                            rapidly and efficiently identify initial clusters based on
                                                            documents that satisfy criteria for identical or fuzzy
                                                            duplicates. Documents are compared using two types
                                                            of digital signatures that harness the most discrimi-
                                                            nating terms, one, smaller and more compact lever-
                                                            aging O(10) terms, is used to identify identical du-
                                                            plicates; another, more expansive, leveraging O(100)
                                                            terms, is used to identify fuzzy duplicates. The pro-
                                                            cess being executed uses techniques reported on in [8].
                                                            For this application, a rolling window of n days is used,
                                                            where (n < 10). Documents falling within this window
                                                            are compared. Heuristics relying on features such as
Figure 2: News Events Clustering Functional Stages          doc length, are also invoked to reduce the number of
                                                            comparisons required. For example, when a document
4.1   Document Extraction Stage                             exceeds the length of another by 20% or more, though
                                                            they may satisfy a containment relationship, accord-
The document extraction process can be customized
                                                            ing to our definition, they would not be considered
to facilitate experimentation such as that undertaken
                                                            ‘duplicates.’
for this study. NewsRoom represents a news repos-
itory of both Reuters and non-Reuters sources cov-
                                                            4.3   Aggregate Clustering Stage
ering roughly 12,000 news sources. Given a date
range, e.g., [20141001T0000000Z 20141231T235959Z],          During the third, aggregate clustering stage, the clus-
one can extract all of the ‘recommendable’ news doc-        ters are initiated via seminal Reuters articles contain-
ing slugline tags. These tags are distinct from head-      clusters. It consists of two data structures, both repe-
lines, as shown in Figure 3. The articles with sluglines   sented in vector form. The first is a term-based vec-
may be singletons or they may exist in one of the lo-      tor. It is used to determine the degree of overlap be-
cal clusters formed in preceding stage. Both of these      tween two cluster centroids, constituted by two central
‘objects’ qualify to serve as a cluster ‘seed.’            ‘documents’ (e.g., longest, most recent, true centroid,
                                                           etc.). The second is a tag-based vector, representing a
                                                           set of Calais tags present in the cluster’s documents.
                                                           The similarity measures used in each of these cases
                                                           is thresholded, with the threshold determined empiri-
                                                           cally. In the case of the term vectors for the unstruc-
   Figure 3: Reuters Article - Slugline Illustration
                                                           tured text, the thresholds are set high, although not as
Two main challenges confronted when implementing           high as those for duplication detection used in stage 2.
this hierarchical, agglomerative clustering stage were,    In the case of the set of Calais tags for the structured
first, finding the best set of features and metrics to     text, a weighted sum is used, whereby various combi-
decide whether a pair of singletons or local clusters      nations of named entities can be assembled to satisfy
justify merging into larger clusters while still remain-   the threshold for merging.
ing sufficiently cohesive, and, second, identifying the
optimal sequence for comparing these clusters when           Table 2: Experimental Processing (4 QTR 2014)
considering merging (Figure 4).                             Stage Name                      Type      Count
                                                              1.     Document Extract    documents 3.63M
                                                              2.     Local Clustering     clusters    2.10M
                                                              3.     Agglom. Clustering clusters      1.67M

                                                           5     Evaluation
                                                           Given the objectives of this study with respect to
                                                           retrieval performance and organizational structure,
                                                           evaluation is an essential piece of the validation
                                                           process. After having conducted a number of trials
 Figure 4: Approaches to Source to Target Merging          to establish various thresholds (document or cluster
                                                           similarity, named entity similarity, etc.), we conducted
    Based upon observations made by subject matter         a trial which focused on a number of news events
experts who created exemplar news clusters to sup-         chosen by subject matter experts (SMEs) from the
port the project, we determined that there were two,       final quarter of 2014. We focused on the set of
often independent, means by which documents could          high-level news events shown below.
be identified as belonging to the same news event. One
involves the unstructured text of an article; the other        1. Halliburton Buying Baker Hughes (Nov. 13, 15)
involves the structured text, in our case, documents           2. Defense Secretary Hagel Resigns (Nov. 24, 25)
that have been tagged by the Calais named-entity tag-          3. Air Asia Crash (Dec. 28, 31)
ging engine [13, 4]. Given that articles involving news        4. Pope Urges Tolerance in Turkey (Nov. 28)
events can be found to be similar based on either of           5. Lufthansa Braces for Next Strike (Dec. 3)
these two feature spaces, our approach to aggregate            6. Iran Rouhani Says Will Try to Clinch Nuclear
(stage 3) clustering is robust: a decision to merge two           Deal in Talks (Dec. 15)
of these documents or local clusters can be based on           7. Alstom Nearing $700M Bribery Settlement (Dec.
the similarity between the unstructured text of two ob-           16)
jects, the tagged named entities that have been iden-
tified by Calais (listed below), or both.                     For each of the events identified, result sets were
                                                           created and stored in worksheets (Table 2 presents
  • People – person name entities                          dataset details). The result sets consisted of numer-
  • Reuters Instrument Codes (RICs) – for companies        ous clusters on the subject of the event (often involving
  • Reuters Classification System (RCS) – for topics       named entities such as Halliburton, Hagel, the Pope,
    & industries                                           Rouhani, Alstom, etc.), some of which are on the topic
  • Topics – domain independent topical phrases            of the news event, some of which address the entity in
  • Smart Terms – topical taxonomy terms                   other contexts. For those that were on the subject of
  Operationally, the hybrid feature set described          the event, the clusters represent sub-topical (second-
above is used to decide whether or not to merge two        level) clusters (see VW example in Section 1.2). Re-
garding the result worksheets, in addition to doc ids,         couple of the outliers found in the list of events, i.e.,
they included local cluster and batch cluster ids, date        nos. 2 and 7. In the case of the latter, there was
and time stamp, document title, document length and            greater variety in the news sources and articles report-
URL link to the complete news article (if available).          ing on the statements coming from the Iranian leader,
The worksheets were presented to two evaluators, both          and as a result, the algorithm may not have captured
subject matter experts from the news domain.3                  the overarching similarity among the documents. In
   Two metrics were used to evaluate these experi-             addition, there was a greater variety of persons men-
ments. First, the assessors scored each cluster for co-        tioned in these articles who were responding to Presi-
herence and accuracy, making sure that all of the doc-         dent Rouhani.
uments that belong to a specific cluster were present,            Regarding the queuing strategy and its impact on
and that all of the documents that didn’t belong were          agglomerative clustering and merging (Figure 4), we
not present. The cluster database was queried broadly,         conducted a series of experiments that involved differ-
e.g., ‘Defense Secretary Hagel’, in order to permit the        ent strategies, including least-recently-used and most-
assessors to have access to clusters both about and not        recently-used. Other strategies tended to have a signif-
about the event in question, again, in order to inspect        icant impact on computational complexity insofar as
those documents that belong in the relevant clusters           it was necessary to perform real-time tracking of dy-
and those that do not. For this task, they used a five-        namic cluster characteristics. Although the spectrum
point Likert scale, A (very good) thru F (very weak),          of considerations involved in those experiments may
codified as 5-to-1.4 Secondly, the assessors determined        be beyond the scope of the current reporting space,
a ‘cluster edit distance’ for each cluster solution, indi-     we found that the most-recently-used was as effective
cating which sub-clusters they would merge and which           a queuing strategy as the majority of others investi-
they would split, if any, to achieve an optimal solution.      gated.
Each merge or split step would be the cluster equiv-              There is clearly room for improved performance and
alent of an ‘edit’ in the standard character-based edit        additional evaluation. One way of addressing some of
distance measure. The results of this assessment task          the disparities revealed above is by tuning the joint
are presented in the Table 3.                                  thresholds for document signature and named entities
   In general, we see that with few exceptions, the ma-        tagged. Alternatively, one could have the thresholds
jority of clusters returned for our queries were about         learned and optimized depending on features associ-
the underlying event(s) (Table 3, column 4). In ad-            ated with the documents (e.g., range of idfs in the
dition, the coherence/accuracy scores for the clusters         signatures, number and type of entities in the docu-
reviewed were in the 4.0 or ‘B’ range, some higher,            ment). Moreover, one could use a variable weighted
some lower. When the same entities, but out-of-                sum of the similarity scores, depending on the contri-
event clusters are included (column 3), their scores are       bution of the named entities and distinguishing terms
slighty higher, still in the 4.0 or ‘B’ range.5 In terms       present in the articles being compared.
of the cluster edit distances measured, for the seven          6   Conclusions
news events represented in the table, the mean num-
ber of ‘splits’ required for each cluster set was λ=1.15       The news events clustering efforts summarized in this
(σ=1.2) while the mean number of merges was λ=4.7              report and depicted in Figure 1 represent a combi-
(σ=4.3).                                                       nation of semi-supervised clustering techniques and
   Clearly the larger numbers appearing in the con-            human-generated, labeled data. They aim to deliver
text of merges have been influenced significantly by a         an effective solution by leveraging Reuters’ labels and
                                                               validating the scope of events at scale. The ultimate
  3 The   first SME assessed the quality of both types of      goal of the study is to determine to what extent com-
clusters, those about the event and those not; the second      bined human-computer resources can produce event-
SME assessed the quality of the event clusters only.           based clusters that are considerably more useful – i.e.,
   4 The five grades used in the American educational sys-
                                                               more effective – than exhaustive lists of unstructured
tem are A-B-C-D-F, which range from exceptional (A) to         documents. In addition, third-party content can be
failure (F). E is not used.                                    gathered and organized around existing clustered con-
   5 Although in aggregate, the mean of the grades as-
                                                               tent based upon Reuters’ own editorially labeled and
signed the clusters by the two SMEs were comparable,
when we calculated the weighted Kappa score for inter-
                                                               classified news events. The variety of challenges con-
reviewer agreement, we found that they were not as uni-        fronted – using Reuters’ metadata, getting the granu-
form, as the scores generally fell into the bottom quartile.   larity right, and scaling the solution – all depend on
The reviewers assigned identical grades in only about a        the right mix within this integration. By tracking the
third of the cases. In the majority of the other cases, they   steps outlined above, we anticipate having a more ro-
were one and sometimes two grades apart.                       bust working model available for evaluation in the near
                              Table 3: Graded Assessments of News Events Clusters
 No.       Event Title                          No.           No.        Mean Avg              Mean Avg Score
                                               Clusters     Clusters      Score for           for Event Clusters
                                                            on Event     All Clusters        SME #1      SME #2
    1.     Halliburton Buying Baker Hughes         6            5            4.33              4.20        4.00
    2.     Defense Secretary Hagel Resigns         24          17            4.02              3.94        2.94
    3.     Air Asia Crash                          14           7            3.93              3.64        3.50
    4.     Pope Urges Tolerance in Turkey          7            6            4.29              4.17        4.33
    5.     Lufthansa Braces for Next Strike        5            2            4.00              3.00        4.50
    6.     Iran Rouhani Tries to Secure            59          47            3.99              3.86        3.93
           Nuclear Deal
    7.     Alstom Nearing $700M Bribery             5           3              3.80             3.50            4.50
           Settlement
    T.     Total                                                           Avg = 4.05       Avg = 3.73      Avg = 3.95

future. Anticipated amendments or extensions of the          [6] Jon Borglund. Event-centric clustering of news arti-
model are addressed below.                                       cles. Masters thesis, University of Uppsala, Sweden,
                                                                 Oct. 2013.
7        Future Work
                                                             [7] Jack G. Conrad, Joanne C. Claussen, and Jie Lin. In-
In future work, we will extend our evaluations by com-           formation retrieval systems with duplicate document
paring our results with exemplar clusters identified by          detection and presentation functions. U.S. Patent
our SMEs, both in terms of granularity and in terms              #7,809,695, Oct. 2010.
of completeness, at the top, topical cluster level and       [8] Jack G. Conrad, Xi S. Guo, and Cindy P. Schriber.
lower, sub-topical level of resulting clusters. This form        Online duplicate document detection: Signature reli-
of assessment addresses overall cluster precision. We            ability in a dynamic retrieval environment. In Pro-
will also need to conduct tests that approach evaluat-           ceedings of the 12th Conference on Information and
ing recall, i.e., of all the possible news events in the         Knowledge Management (CIKM03), pages 243–252.
data set or sample, how many do we capture and rep-              ACM Press, Nov. 2003.
resent at top and lower levels of the shallow hierarchy?     [9] Qi Li, Heng Ji, and Liang Huang. Joint event extrac-
                                                                 tion via structured prediction with global features. In
8        Acknowledgments                                         Proceedings of the 51st Annual Meeting of the ACL,
The authors thank Sarah Edmonds at TRGR for her                  pages 73–82. Association for Computational Linguis-
diligent work assessing result sets. We are also grateful        tics, Aug. 2013.
to Brian Romer with Reuters Data Innovation Lab for         [10] Ramesh Nallipati, Ao Feng, Fuchun Peng, and James
his innovative work on the UI and demo (to be shown              Allan. Event threading within news topics. In Pro-
at the workshop).                                                ceedings of the 13th Conference on Information and
                                                                 Knowledge Management (CIKM04), pages 446–453.
References                                                       ACM Press, Nov. 2004.
 [1] Topic Detection and Tracking Workshops, Washing-
     ton, D.C., 2004. NIST.                                 [11] Ron Papka. On-Line New Event Detection, Cluster-
                                                                 ing, and Tracking. Ph.d. thesis, University of Mas-
 [2] First Workshop on Computing News Storylines
                                                                 sachusetts - Amherst, Sept. 1999.
     (CNewS 2015), Beijing, PRC, July 2015. ACL.
                                                            [12] Jakub Piskorski, Hristo Tanev, Martin Atkinson, and
 [3] James Allan, Jaime Carbonell, George Doddingtom,
                                                                 Erik van der Gout. Cluster-centric approach to
     Jonathan Yamron, and Yiming Yang. Topic de-
                                                                 news event extraction. In 2008 Conference on New
     tection and tracking pilot study final report. In
                                                                 Trends in Multimedia and Network Information Sys-
     DARPA Broadcast News Transcription & Under-
                                                                 tems, pages 276–290, 2008.
     standing Workshop, Feb. 1998.
 [4] Samet Atdag and Vincent Labatut. A comparison of       [13] Thomson Reuters. Open Calais NamedTM Entity Tag-
     named entity recognition tools applied to biograph-         ging Engine. http://www.opencalais.com, 2016.
     ical texts. In 2nd International Conference on Sys-    [14] Piek Vossen, Tommaso Caselli, and Yiota Kont-
     tems and Computer Science (ICSCS13), pages 228–             zopoulou. Storylines for structuring massive streams
     233. IEEE, Aug. 2013.                                       of news. In Proceedings of the First Workshop on
 [5] Joel Azzopardi and Christopher Staff. Incremental           Comparing News Storylines, pages 40–49. ACL and
     clustering of news reports. Algorithms, 5:364–378,          Asian Federation of NLP, July 2015.
     2012.