<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semi-Supervised Events Clustering in News Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jack G. Conrad</string-name>
          <email>jack.g.conrad@thomsonreuters.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Bender</string-name>
          <email>michael.bender@thomsonreuters.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Thomson Reuters, Corporate Research &amp; Development</institution>
          ,
          <addr-line>Saint Paul, Minnesota 55123</addr-line>
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Thomson Reuters, Thomson Reuters Global Resources</institution>
          ,
          <addr-line>Baar, Zug 6340</addr-line>
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <abstract>
        <p />
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Abstract</title>
      <p>The presentation of news articles to meet
research needs has traditionally been a
document-centric process. Yet users often
want to monitor developing news stories by
event rather than by examining an
exhaustive list of retrieved documents. In
this work, we illustrate a news retrieval
system, eventNews, and an underlying algorithm
that is event-centric. Through this system,
news articles are clustered around a single
news event or an event and its sub-events. The
algorithm can leverage newly created
Reuters stories and their compact
labels as seed documents for the clustering
process. The system is configured to generate
top-level clusters for news events based on an
editorially supplied topical label, known as a
'slugline,' and to generate sub-topic-focused
clusters based on the algorithm. The system
uses an agglomerative clustering algorithm to
gather and structure documents into distinct
result sets. Decisions on whether to merge
related documents or clusters are made
according to the similarity of evidence derived from
two distinct sources: one relies on a digital
signature based on the unstructured text in
the document; the other is based on the presence
of named entity tags that have been assigned
to the document by a named entity tagger, in
this case Thomson Reuters' Calais engine.</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <sec id="sec-2-1">
        <title>Motivations</title>
        <p>Thomson Reuters has been exploring alternative
models for organizing and rendering articles found in its
news repository. Whether the users are editors,
financial analysts, lawyers or other professional researchers,
a more effective means of examining a set of
event-related news articles beyond that of a ranked list of
documents was expressly sought. The presentation of
news articles based on events aligns well with
contemporary research use cases, such as those arising in the
finance and risk sectors, where there is a salient need
for more effectively organized news content through
the lens of events. Other news organizations such as
Google have experimented with news clustering, but
in the absence of the concrete use cases of Thomson
Reuters' professional users.</p>
        <p>This project uses semi-supervised clustering
capabilities in order to group news documents based upon
shared news events. Germinal Reuters stories with
editorially assigned labels (a.k.a. 'sluglines') are used as
seed documents for event identification and
organization. This task addresses the fundamental aim of the
project.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Objectives</title>
        <p>The main objective of this project is to develop an
event-centric news paradigm that solves the challenge
of event validation and event story clustering at scale.
This goal responds to feedback received from
consumers on news in their products. In addition to
organizing news results around events rather than
documents, another goal of this study is to provide a
mechanism for clustering third-party (non-Reuters) news
documents together with corresponding Reuters
articles around common news events. This is aided by
leveraging metadata tags that exist in Reuters news
articles about the same topical event. Since these tags
distinguish Reuters news documents from third-party
content, it is possible to consider using them as the
basis for grouping news articles together. The
initial plan for this project was developed in
conjunction with R&amp;D's partner, the news asset owner and
subject matter expert (SME). It uses the initial or
top-level story labels known as primary sluglines (e.g.,
VOLKSWAGEN-EMISSION-FRAUD/) as an
organizing principle for top-level clusters, and an
algorithmic means for creating lower-level clusters, which
can incorporate second-tier story labels known as
secondary sluglines (e.g.,
VOLKSWAGEN-EMISSION-FRAUD/COMPENSATION).</p>
      </sec>
      <sec id="sec-2-3">
        <title>Workflow Illustration</title>
        <p>In Figure 1, we see an example involving the "General
Motors Recall" for faulty ignition switches. Through
regular editorial practices, journalists write and tag
event-related stories. The first story with the first
"GM Recall" tag serves as the seed story for initiating
the cluster. As Reuters writes and tags more stories
about the GM Recall, the set of tags and text defining
the GM Recall event expands. As it expands, so
too does the algorithm's grasp of the event, helping
it to better identify cluster candidates, in particular
within third-party news. Both the editorially
generated slugline responsible for the birth of the cluster
and the algorithmic identification and population of
subsequent sub-clusters are depicted in the figure.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Previous Work</title>
      <p>
        Previous work published on the topic of news events
structuring has been largely academic in nature, for
example, as in Borglund [
        <xref ref-type="bibr" rid="ref7">6</xref>
        ]. This thesis includes three
contributions: a survey of known clustering methods,
an evaluation of human versus human results when
grouping news articles in an event-centric manner, and
lastly an evaluation of an incremental clustering
algorithm to see if it is possible to consider a reduced input
size and still get a sufficient result.
      </p>
      <p>
        In addition, there have been journal articles that
have explored the computational complexity of the
algorithms necessary to cluster real-time news articles
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. But they have focused largely on the math behind
the clustering rather than the use case and
practitioners benefiting from it.
      </p>
      <p>
        Some of the earliest work in this area was pursued
under DARPA and NIST funding and resulted in
reports written by various forums created to advance the
state of the art in event detection [
        <xref ref-type="bibr" rid="ref1 ref3">3, 1</xref>
        ].
      </p>
      <p>
        There has also been research group work and
dissertations on the subject of topic detection and
tracking resulting from the above research [
        <xref ref-type="bibr" rid="ref12 ref13">12, 11</xref>
        ].
Subsequent work has attempted to capture some of the
structure of events and their dependencies in a news
topic by creating a model of events, a.k.a. `event
threading' [
        <xref ref-type="bibr" rid="ref11">10</xref>
        ]. Yet more recently there have been
actual forums under large umbrella organizations like
ACL focusing on automatically computing news
stories (and their titles) [
        <xref ref-type="bibr" rid="ref15 ref2">2, 14</xref>
        ].
      </p>
      <p>
        There is also another field of research that addresses
event extraction in the ACE tradition<sup>1</sup> that is
relevant to the context of our current work, e.g., [
        <xref ref-type="bibr" rid="ref10">9</xref>
        ].
What is distinct about our present project, however,
is the use of SME-defined seed stories and labels in a
semi-supervised manner and the subsequent clustering
stages at scale for real-world news streams.
      </p>
      <p>
        Worth noting is that one of the building blocks of
the current work is represented by an initial form of
'local' clustering that involves the identification and
grouping of exact and fuzzy duplicate documents [
        <xref ref-type="bibr" rid="ref9">8</xref>
        ].
This takes place in the stage immediately preceding
the final, aggregated clustering step.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Data Resources</title>
      <p>The news repository under examination in this effort
is known as NewsRoom, a Thomson Reuters news
aggregation platform. It consists of approximately
15 to 30 million documents per year from 12,000
independent news sources, which include national and
local newspapers, periodic journals, radio program
transcriptions, etc. From 2012 to 2015, NewsRoom
comprised approximately 80 million news articles. These
were the target of our investigation for this project
(Table 1).<sup>2</sup></p>
      <p>In order to test our news workflow and the
clustering algorithms that support it, we focus on chunks of
data representing approximately three months of
documents at a time.</p>
      <p>
        Having investigated baseline news clusters in earlier
research efforts (i.e., the baseline algorithm, its
granularity, speed and complexity), we have subsequently
pursued improvements and efficiencies to help us approach
      </p>
      <p><sup>1</sup> http://www.itl.nist.gov/iad/mig/tests/ace/</p>
      <p><sup>2</sup> Thomson Reuters has long made comparably large
news collections available for external research:
http://trec.nist.gov/data/reuters/reuters.html</p>
      <p>
        Given our substantial data resources and our goal to
build a flexible experimental retrieval environment, we
have established three stages for processing and
clustering a large set of news documents around news
events (Figure 2). These stages include: (1) document
extraction (Reuters and non-Reuters articles) from our
news repository; (2) local clustering based on
detection of identical and fuzzy duplicate documents
[
        <xref ref-type="bibr" rid="ref8">7</xref>
        ]; and (3) aggregate clustering performed over the
result set from stage 2. We have determined
empirically that the local clustering stage works highly
effectively [
        <xref ref-type="bibr" rid="ref9">8</xref>
        ]. It is the aggregate clustering stage that has
spawned ongoing research, evaluation and refinement.
This stage consists of the application of hierarchical
agglomerative clustering, where different types of
cluster centroid representations were examined. Although
we provide descriptions of each of the three processing
stages below, it is the third of these stages that is the
principal focus of our latest efforts and this research
report.
      </p>
      <p>
        The document extraction process can be customized
to facilitate experimentation such as that undertaken
for this study. NewsRoom is a news
repository of both Reuters and non-Reuters content
covering roughly 12,000 news sources. Given a date
range, e.g., [20141001T000000Z, 20141231T235959Z],
one can extract all of the 'recommendable' news
documents in the repository, or some user-defined
subset of them. Because the repository contains substantial
numbers of Reuters and non-Reuters financial
documents, some stories are largely
non-textual, e.g., containing tabular information only; very
short, e.g., stubs for stories in progress; or
metadata snippets for topics that were not
substantiated. These types of documents are
considered non-recommendable and thus are not retrieved
for subsequent processing. In general, over half of the
documents in the repository would be classified as
recommendable for this use case. The NewsRoom
environment comes with a recommendation classifier;
further details are
beyond the scope of our current focus.
      </p>
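      <p>The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the NewsRoom implementation: the <monospace>NewsDoc</monospace> record, the <monospace>recommendable</monospace> flag (assumed to have been set upstream by the recommendation classifier) and the function names are all hypothetical.</p>
      <preformat>
```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Compact UTC timestamp layout, e.g. 20141001T000000Z (assumed format).
STAMP = "%Y%m%dT%H%M%SZ"


@dataclass
class NewsDoc:
    doc_id: str
    doc_date: datetime
    body: str
    recommendable: bool  # hypothetical flag, assumed set by an upstream classifier


def parse_stamp(text: str) -> datetime:
    """Parse a compact UTC timestamp into an aware datetime."""
    return datetime.strptime(text, STAMP).replace(tzinfo=timezone.utc)


def extract(docs, start: str, end: str):
    """Keep recommendable documents whose date falls inside [start, end]."""
    first, last = parse_stamp(start), parse_stamp(end)
    return [d for d in docs if d.recommendable and last >= d.doc_date >= first]


docs = [
    NewsDoc("a", parse_stamp("20141015T120000Z"), "full story text", True),
    NewsDoc("b", parse_stamp("20150101T000000Z"), "outside the range", True),
    NewsDoc("c", parse_stamp("20141101T000000Z"), "table only", False),
]
selected = extract(docs, "20141001T000000Z", "20141231T235959Z")
```
      </preformat>
      <p>In this toy run only document "a" is selected: "b" falls outside the date range and "c" is flagged as non-recommendable.</p>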
      <p>The extraction process results in all
recommendable documents being loaded from the repository into
an Apache Derby JDBC relational database. The
tabular data structures that store the documents and
subsequent clusters contain basic information such as
doc id, dataset name, doc date, title, article source,
source url (if applicable), body, and body length, together
with tens of additional features that various classifiers
can use to discriminate among documents, e.g., primary
news code, short sentence count, ticker count,
quantity of numbers, quantity all-caps, quantity of press
releases, etc. These additional features are available
for subsequent downstream processing such as
classification, routing or clustering.</p>
      <sec id="sec-4-1">
        <title>Local Clustering Stage</title>
        <p>
          The next process, local clustering, is designed to
rapidly and efficiently identify initial clusters based on
documents that satisfy criteria for identical or fuzzy
duplicates. Documents are compared using two types
of digital signatures that harness the most
discriminating terms: one, smaller and more compact,
leveraging O(10) terms, is used to identify identical
duplicates; the other, more expansive, leveraging O(100)
terms, is used to identify fuzzy duplicates. The
process uses techniques reported in [
          <xref ref-type="bibr" rid="ref9">8</xref>
          ].
For this application, a rolling window of n days is used,
where n &lt; 10. Documents falling within this window
are compared. Heuristics relying on features such as
document length are also invoked to reduce the number of
comparisons required. For example, when a document
exceeds the length of another by 20% or more, the two
would not be considered 'duplicates' under our
definition, even though they may satisfy a containment
relationship.
        </p>
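        <p>The pre-filtering heuristics above (rolling window, length ratio) can be illustrated with a short sketch. The window size of 7 days, the Jaccard overlap measure and its 0.9 threshold are assumptions chosen for illustration; the text only states that n is below 10 and that a 20% length excess rules out a duplicate, and the actual signature comparison is the one reported in the cited work.</p>
        <preformat>
```python
from dataclasses import dataclass
from datetime import date

WINDOW_DAYS = 7          # assumed value; the text only says n is below 10
MAX_EXCESS_PERCENT = 20  # a document 20% or more longer is not a duplicate


@dataclass(frozen=True)
class Doc:
    doc_id: str
    doc_date: date
    body_length: int
    signature: frozenset  # hypothetical: the most discriminating terms


def is_candidate_pair(a: Doc, b: Doc) -> bool:
    """Cheap filters applied before any signature comparison."""
    if abs((a.doc_date - b.doc_date).days) >= WINDOW_DAYS:
        return False  # outside the rolling window
    shorter, longer = sorted((a.body_length, b.body_length))
    # Integer arithmetic avoids rounding surprises at exactly 20%.
    too_long = (longer - shorter) * 100 >= shorter * MAX_EXCESS_PERCENT
    return not too_long


def fuzzy_duplicate(a: Doc, b: Doc, threshold: float = 0.9) -> bool:
    """Jaccard overlap of signatures (an assumed stand-in similarity)."""
    if not is_candidate_pair(a, b):
        return False
    union = a.signature.union(b.signature)
    if not union:
        return False
    return len(a.signature.intersection(b.signature)) / len(union) >= threshold


sig = frozenset({"gm", "recall", "ignition"})
a = Doc("a", date(2014, 10, 1), 100, sig)
b = Doc("b", date(2014, 10, 3), 110, sig)
c = Doc("c", date(2014, 10, 2), 120, sig)
d = Doc("d", date(2014, 10, 20), 100, sig)
```
        </preformat>
        <p>Here "a" and "b" pass both filters and share a signature, whereas "c" is excluded by the length rule and "d" by the window, so neither is ever compared at the signature level.</p>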
      </sec>
      <sec id="sec-4-2">
        <title>Aggregate Clustering Stage</title>
        <p>During the third, aggregate clustering stage, the
clusters are initiated via seminal Reuters articles
containing slugline tags. These tags are distinct from
headlines, as shown in Figure 3. The articles with sluglines
may be singletons or they may exist in one of the
local clusters formed in the preceding stage. Both of these
'objects' qualify to serve as a cluster 'seed.'</p>
        <p>Two main challenges confronted when implementing
this hierarchical, agglomerative clustering stage were,
first, finding the best set of features and metrics to
decide whether a pair of singletons or local clusters
justifies merging into a larger cluster while still
remaining sufficiently cohesive, and, second, identifying the
optimal sequence for comparing these clusters when
considering merging (Figure 4).</p>
        <p>
          Based upon observations made by subject matter
experts who created exemplar news clusters to
support the project, we determined that there were two,
often independent, means by which documents could
be identified as belonging to the same news event. One
involves the unstructured text of an article; the other
involves the structured text, in our case, documents
that have been tagged by the Calais named-entity
tagging engine [
          <xref ref-type="bibr" rid="ref14 ref4 ref6">13, 4</xref>
          ]. Given that articles involving news
events can be found to be similar based on either of
these two feature spaces, our approach to aggregate
(stage 3) clustering is robust: a decision to merge two
of these documents or local clusters can be based on
the similarity between the unstructured text of two
objects, the tagged named entities that have been
identified by Calais (listed below), or both.
        </p>
        <sec id="sec-4-2-1">
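          <list list-type="bullet">
            <list-item><p>People: person name entities</p></list-item>
            <list-item><p>Reuters Instrument Codes (RICs): for companies</p></list-item>
            <list-item><p>Reuters Classification System (RCS): for topics &amp; industries</p></list-item>
            <list-item><p>Topics: domain-independent topical phrases</p></list-item>
            <list-item><p>Smart Terms: topical taxonomy terms</p></list-item>
          </list>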
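          <p>The either-or merge decision described above (text similarity, tag similarity, or both) might be sketched as follows. Every number here is an illustrative assumption: the thresholds, the per-type tag weights and the choice of cosine similarity for the text side are not taken from the system, whose thresholds were determined empirically.</p>
          <preformat>
```python
import math
from collections import Counter

# Illustrative values only; not the tuned thresholds or weights.
TEXT_THRESHOLD = 0.6
TAG_THRESHOLD = 1.0
TAG_WEIGHTS = {"people": 0.5, "ric": 0.6, "rcs": 0.3, "topic": 0.3, "smart_term": 0.3}


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two term-count vectors (assumed text measure)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def tag_score(tags_a: dict, tags_b: dict) -> float:
    """Weighted sum, over tag types, of the number of shared tags."""
    return sum(
        weight * len(tags_a.get(kind, set()).intersection(tags_b.get(kind, set())))
        for kind, weight in TAG_WEIGHTS.items()
    )


def should_merge(text_a, tags_a, text_b, tags_b) -> bool:
    """Merge when either feature space, or both, clears its threshold."""
    return (
        cosine(text_a, text_b) >= TEXT_THRESHOLD
        or tag_score(tags_a, tags_b) >= TAG_THRESHOLD
    )


text_a = Counter("gm recalls cars over faulty ignition switch".split())
text_b = Counter("gm ignition switch recall widens".split())
tags_a = {"ric": {"GM.N"}, "people": {"Mary Barra"}}
tags_b = {"ric": {"GM.N"}, "people": {"Mary Barra"}, "topic": {"recall"}}
merge = should_merge(text_a, tags_a, text_b, tags_b)
```
          </preformat>
          <p>In this example the two texts overlap only moderately, but the shared company and person tags together clear the assumed tag threshold, so the pair merges. This mirrors the idea that different combinations of named entities can jointly satisfy the merge criterion.</p>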
          <p>Operationally, the hybrid feature set described
above is used to decide whether or not to merge two
clusters. It consists of two data structures, both
represented in vector form. The first is a term-based
vector. It is used to determine the degree of overlap
between two cluster centroids, each represented by a central
'document' (e.g., longest, most recent, true centroid,
etc.). The second is a tag-based vector, representing the
set of Calais tags present in the cluster's documents.
The similarity measures used in each of these cases
are thresholded, with the thresholds determined
empirically. In the case of the term vectors for the
unstructured text, the thresholds are set high, although not as
high as those used for duplicate detection in stage 2.
In the case of the set of Calais tags for the structured
text, a weighted sum is used, whereby various
combinations of named entities can be assembled to satisfy
the threshold for merging.</p>
          <p>Given the objectives of this study with respect to
retrieval performance and organizational structure,
evaluation is an essential piece of the validation
process. After having conducted a number of trials
to establish various thresholds (document or cluster
similarity, named entity similarity, etc.), we conducted
a trial which focused on a number of news events
chosen by subject matter experts (SMEs) from the
final quarter of 2014. We focused on the set of
high-level news events shown below.</p>
          <p>For each of the events identified, result sets were
created and stored in worksheets (Table 2 presents
dataset details). The result sets consisted of
numerous clusters on the subject of the event (often involving
named entities such as Halliburton, Hagel, the Pope,
Rouhani, Alstom, etc.), some of which are on the topic
of the news event, some of which address the entity in
other contexts. For those that were on the subject of
the event, the clusters represent sub-topical
(second-level) clusters (see VW example in Section 1.2).
Regarding the result worksheets, in addition to doc ids,
they included local cluster and batch cluster ids, date
and time stamp, document title, document length and
URL link to the complete news article (if available).
The worksheets were presented to two evaluators, both
subject matter experts from the news domain.<sup>3</sup></p>
          <p>Two metrics were used to evaluate these
experiments. First, the assessors scored each cluster for
coherence and accuracy, making sure that all of the
documents that belong to a specific cluster were present,
and that all of the documents that didn't belong were
not present. The cluster database was queried broadly,
e.g., 'Defense Secretary Hagel', in order to give the
assessors access to clusters both about and not
about the event in question, again, in order to inspect
those documents that belong in the relevant clusters
and those that do not. For this task, they used a
five-point Likert scale, A (very good) through F (very weak),
codified as 5-to-1.<sup>4</sup> Second, the assessors determined
a 'cluster edit distance' for each cluster solution,
indicating which sub-clusters they would merge and which
they would split, if any, to achieve an optimal solution.
Each merge or split step would be the cluster
equivalent of an 'edit' in the standard character-based edit
distance measure. The results of this assessment task
are presented in Table 3.</p>
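          <p>As a small worked illustration of the two metrics, the sketch below maps letter grades to the 5-to-1 scale and counts merge and split steps as edits. The function names and the sample grades are hypothetical; they are not the assessors' data.</p>
          <preformat>
```python
from statistics import mean

# Letter grades codified as 5-to-1; E is not used.
GRADE_POINTS = {"A": 5, "B": 4, "C": 3, "D": 2, "F": 1}


def coherence_score(grades):
    """Mean numeric grade over the clusters an assessor reviewed."""
    return mean(GRADE_POINTS[g] for g in grades)


def cluster_edit_distance(merges: int, splits: int) -> int:
    """Each merge or split needed to reach the ideal solution is one edit."""
    return merges + splits


sample_grades = ["A", "B", "B", "C"]   # made-up grades for four clusters
score = coherence_score(sample_grades)
distance = cluster_edit_distance(merges=3, splits=1)
```
          </preformat>
          <p>With these made-up inputs the mean grade is 4 (a 'B') and the cluster solution is four edits away from the assessor's preferred arrangement.</p>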
          <p>In general, we see that, with few exceptions, the
majority of clusters returned for our queries were about
the underlying event(s) (Table 3, column 4). In
addition, the coherence/accuracy scores for the clusters
reviewed were in the 4.0 or 'B' range, some higher,
some lower. When clusters for the same entities but
outside the event are included (column 3), the scores are
slightly higher, still in the 4.0 or 'B' range.<sup>5</sup> In terms
of the cluster edit distances measured, for the seven
news events represented in the table, the mean
number of 'splits' required for each cluster set was μ = 1.15
(σ = 1.2), while the mean number of merges was μ = 4.7
(σ = 4.3).</p>
          <p>Clearly the larger numbers appearing in the
context of merges have been influenced significantly by a
couple of the outliers found in the list of events, i.e.,
nos. 2 and 7. In the case of the latter, there was
greater variety in the news sources and articles
reporting on the statements coming from the Iranian leader,
and as a result, the algorithm may not have captured
the overarching similarity among the documents. In
addition, there was a greater variety of persons
mentioned in these articles who were responding to
President Rouhani.</p>
          <p><sup>3</sup> The first SME assessed the quality of both types of
clusters, those about the event and those not; the second
SME assessed the quality of the event clusters only.</p>
          <p><sup>4</sup> The five grades used in the American educational
system are A-B-C-D-F, which range from exceptional (A) to
failure (F). E is not used.</p>
          <p><sup>5</sup> Although in aggregate the means of the grades
assigned to the clusters by the two SMEs were comparable,
when we calculated the weighted Kappa score for
inter-reviewer agreement, we found that they were not as
uniform, as the scores generally fell into the bottom quartile.
The reviewers assigned identical grades in only about a
third of the cases. In the majority of the other cases, they
were one and sometimes two grades apart.</p>
          <p>Regarding the queuing strategy and its impact on
agglomerative clustering and merging (Figure 4), we
conducted a series of experiments that involved
different strategies, including least-recently-used and
most-recently-used. Other strategies tended to have a
significant impact on computational complexity insofar as
it was necessary to perform real-time tracking of
dynamic cluster characteristics. Although the spectrum
of considerations involved in those experiments may
be beyond the scope of the current reporting space,
we found that most-recently-used was as effective
a queuing strategy as the majority of others
investigated.</p>
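          <p>One possible reading of a most-recently-used queuing strategy is sketched below. It is an assumption about how such a queue could order comparisons, not a description of the system in Figure 4: each incoming item is compared against existing clusters starting with the one touched most recently, and a cluster that absorbs an item moves to the front. The <monospace>similar</monospace> predicate is a hypothetical placeholder for the merge test.</p>
          <preformat>
```python
from collections import deque


def cluster_stream(items, similar):
    """Greedy clustering that scans clusters in most-recently-used order."""
    queue = deque()  # front of the deque = most recently used cluster
    for item in items:
        hit = None
        for index, cluster in enumerate(queue):
            if similar(item, cluster):
                hit = index
                break
        if hit is None:
            queue.appendleft([item])      # start a new cluster at the front
        else:
            cluster = queue[hit]
            del queue[hit]                # remove by position, not by value
            cluster.append(item)
            queue.appendleft(cluster)     # mark it as most recently used
    return list(queue)


def same_prefix(item, cluster):
    """Toy similarity: items match when their first letters agree."""
    return item[0] == cluster[0][0]


result = cluster_stream(["gm1", "vw1", "gm2", "hagel1", "vw2"], same_prefix)
```
          </preformat>
          <p>The appeal of this ordering is that it needs no tracking of dynamic cluster statistics: only the queue position changes when a cluster is touched, which is consistent with the low overhead noted above.</p>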
          <p>There is clearly room for improved performance and
additional evaluation. One way of addressing some of
the disparities revealed above is by tuning the joint
thresholds for document signature and named entities
tagged. Alternatively, one could have the thresholds
learned and optimized depending on features
associated with the documents (e.g., range of idfs in the
signatures, number and type of entities in the
document). Moreover, one could use a variable weighted
sum of the similarity scores, depending on the
contribution of the named entities and distinguishing terms
present in the articles being compared.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>The news events clustering efforts summarized in this
report and depicted in Figure 1 represent a
combination of semi-supervised clustering techniques and
human-generated, labeled data. They aim to deliver
an effective solution by leveraging Reuters' labels and
validating the scope of events at scale. The ultimate
goal of the study is to determine to what extent
combined human-computer resources can produce
event-based clusters that are considerably more useful (i.e.,
more effective) than exhaustive lists of unstructured
documents. In addition, third-party content can be
gathered and organized around existing clustered
content based upon Reuters' own editorially labeled and
classified news events. The challenges
confronted (using Reuters' metadata, getting the
granularity right, and scaling the solution) all depend on
the right mix within this integration. By tracking the
steps outlined above, we anticipate having a more
robust working model available for evaluation in the near
future. Anticipated amendments or extensions of the
model are addressed below.</p>
      <sec id="sec-5-1">
        <title>News events selected for evaluation (table residue; other columns not recoverable)</title>
        <list list-type="order">
          <list-item><p>Halliburton Buying Baker Hughes</p></list-item>
          <list-item><p>Defense Secretary Hagel Resigns</p></list-item>
          <list-item><p>Air Asia Crash</p></list-item>
          <list-item><p>Pope Urges Tolerance in Turkey</p></list-item>
          <list-item><p>Lufthansa Braces for Next Strike</p></list-item>
          <list-item><p>Iran Rouhani Tries to Secure Nuclear Deal</p></list-item>
          <list-item><p>Alstom Nearing $700M Bribery Settlement</p></list-item>
        </list>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Future Work</title>
      <p>In future work, we will extend our evaluations by
comparing our results with exemplar clusters identified by
our SMEs, both in terms of granularity and in terms
of completeness, at the top, topical cluster level and
lower, sub-topical level of resulting clusters. This form
of assessment addresses overall cluster precision. We
will also need to conduct tests that approach
evaluating recall, i.e., of all the possible news events in the
data set or sample, how many do we capture and
represent at top and lower levels of the shallow hierarchy?</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The authors thank Sarah Edmonds at TRGR for her
diligent work assessing result sets. We are also grateful
to Brian Romer with Reuters Data Innovation Lab for
his innovative work on the UI and demo (to be shown
at the workshop).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Topic</given-names>
            <surname>Detection</surname>
          </string-name>
          and Tracking Workshops, Washington, D.C.,
          <year>2004</year>
          . NIST.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>First</given-names>
            <surname>Workshop</surname>
          </string-name>
          on Computing News Storylines (CNewS
          <year>2015</year>
          ), Beijing, PRC,
          <year>July 2015</year>
          . ACL.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>James</given-names>
            <surname>Allan</surname>
          </string-name>
          , Jaime Carbonell, George Doddington, Jonathan Yamron, and
          <string-name>
            <given-names>Yiming</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>Topic detection and tracking pilot study final report</article-title>
          .
          <source>In DARPA Broadcast News Transcription &amp; Understanding Workshop</source>
          , Feb.
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Samet</given-names>
            <surname>Atdag</surname>
          </string-name>
          and
          <string-name>
            <given-names>Vincent</given-names>
            <surname>Labatut</surname>
          </string-name>
          .
          <article-title>A comparison of named entity recognition tools applied to biographical texts</article-title>
          .
          <source>In 2nd International Conference on Systems and Computer Science (ICSCS13)</source>
          , pages
          <fpage>228</fpage>
          –
          <fpage>233</fpage>
          . IEEE, Aug.
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Joel</given-names>
            <surname>Azzopardi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Staff</surname>
          </string-name>
          .
          <article-title>Incremental clustering of news reports</article-title>
          .
          <source>Algorithms</source>
          ,
          <volume>5</volume>
          :
          <fpage>364</fpage>
          –
          <fpage>378</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>4.50 Avg = 4.05 Avg = 3.73 Avg = 3</source>
          .
          <fpage>95</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Jon</given-names>
            <surname>Borglund</surname>
          </string-name>
          .
          <article-title>Event-centric clustering of news articles</article-title>
          .
          <source>Master's thesis</source>
          , University of Uppsala, Sweden, Oct.
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Jack</surname>
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Conrad</surname>
          </string-name>
          ,
          <string-name>
            <surname>Joanne C. Claussen</surname>
            , and
            <given-names>Jie</given-names>
          </string-name>
          <string-name>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>Information retrieval systems with duplicate document detection and presentation functions</article-title>
          . U.S. Patent #
          <volume>7</volume>
          ,
          <issue>809</issue>
          ,695, Oct.
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Jack</surname>
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Conrad</surname>
            , Xi S. Guo, and
            <given-names>Cindy P.</given-names>
          </string-name>
          <string-name>
            <surname>Schriber</surname>
          </string-name>
          .
          <article-title>Online duplicate document detection: Signature reliability in a dynamic retrieval environment</article-title>
          .
          <source>In Proceedings of the 12th Conference on Information and Knowledge Management (CIKM03)</source>
          , pages
          <fpage>243</fpage>
          –
          <fpage>252</fpage>
          . ACM Press,
          <year>Nov</year>
          .
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Qi</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Heng</given-names>
            <surname>Ji</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Liang</given-names>
            <surname>Huang</surname>
          </string-name>
          .
          <article-title>Joint event extraction via structured prediction with global features</article-title>
          .
          <source>In Proceedings of the 51st Annual Meeting of the ACL</source>
          , pages
          <volume>73</volume>
          –
          <fpage>82</fpage>
          . Association for Computational Linguistics, Aug.
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Ramesh</surname>
            <given-names>Nallapati</given-names>
          </string-name>
          , Ao Feng, Fuchun Peng, and
          <string-name>
            <given-names>James</given-names>
            <surname>Allan</surname>
          </string-name>
          .
          <article-title>Event threading within news topics</article-title>
          .
          <source>In Proceedings of the 13th Conference on Information and Knowledge Management (CIKM04)</source>
          , pages
          <fpage>446</fpage>
          –
          <fpage>453</fpage>
          . ACM Press,
          <year>Nov</year>
          .
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Ron</given-names>
            <surname>Papka</surname>
          </string-name>
          .
          <article-title>On-Line New Event Detection, Clustering, and Tracking</article-title>
          .
          <source>Ph.D. thesis</source>
          , University of Massachusetts - Amherst, Sept.
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Jakub</surname>
            <given-names>Piskorski</given-names>
          </string-name>
          , Hristo Tanev,
          <string-name>
            <given-names>Martin</given-names>
            <surname>Atkinson</surname>
          </string-name>
          , and Erik van der Goot.
          <article-title>Cluster-centric approach to news event extraction</article-title>
          .
          <source>In 2008 Conference on New Trends in Multimedia and Network Information Systems</source>
          , pages
          <fpage>276</fpage>
          –
          <fpage>290</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Thomson</given-names>
            <surname>Reuters. Open Calais NamedTM Entity Tagging</surname>
          </string-name>
          <article-title>Engine</article-title>
          . http://www.opencalais.com,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Piek</surname>
            <given-names>Vossen</given-names>
          </string-name>
          , Tommaso Caselli, and
          <string-name>
            <given-names>Yiota</given-names>
            <surname>Kontzopoulou</surname>
          </string-name>
          .
          <article-title>Storylines for structuring massive streams of news</article-title>
          .
          <source>In Proceedings of the First Workshop on Computing News Storylines</source>
          , pages
          <volume>40</volume>
          –
          <fpage>49</fpage>
          . ACL and
          <string-name>
            <surname>Asian</surname>
            <given-names>Federation of NLP</given-names>
          </string-name>
          ,
          <year>July 2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>