<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Qlusty: Quick and Dirty Generation of Event Videos from Written Media Coverage</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alberto Barrón-Cedeño</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Da San Martino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yifan Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ahmed Ali</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fahim Dalvi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Qatar Computing Research Institute HBKU Research Complex</institution>
          ,
          <addr-line>Doha</addr-line>
          ,
          <country country="QA">Qatar</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>Qlusty automatically generates videos describing the coverage of the same event by different news outlets. Through four modules it identifies events, de-duplicates notes, ranks according to coverage, and queries for images to generate an overview video. In this manuscript we present our preliminary models, including quantitative evaluations of the former two and a qualitative analysis of the latter two. The results show the potential for achieving our main aim: contributing to breaking the information bubble, so common in the current news landscape.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Event reporting in digital media spans from the re-use of contents from news agencies to the direct coverage and shaping of a story. The point of view, aspects, and storytelling of the same and related events can be diverse from medium to medium, depending on the editorial line (e.g., left vs right), target audience (e.g., quality vs tabloid), house style, or mere interest in an event. Qlusty aims at presenting consumers with a short video overview of the facts with contrasting coverage of the same news event by different news outlets, the overall aim being to break the information bubble.</p>
      <p>Our video-production architecture consists of four modules: event identification, de-duplication, coverage diversification, and image gathering. Such modules can be translated into IR and NLP problems: document clustering, near-duplicate identification, ranking, and query generation. We present a quantitative analysis of the clustering and de-duplication modules, taking advantage of the METER corpus for text re-use analysis. The clustering strategy we use, DBSCAN, outperforms k-means even though, in the former, no information about the number of clusters is known in advance: F1 values in the range of 0.71 vs. 0.60. A qualitative analysis, carried out on the News Corpus G and SignalMedia 1M corpora, shows the potential of our diversification and query generation modules in the generation of attractive videos.</p>
    </sec>
    <sec id="sec-2">
      <title>News Corpora</title>
      <p>We use three corpora to tune and test our models both
quantitatively and qualitatively.</p>
      <p>METER Corpus [CGP02]. It includes documents covering events as published by one news agency and nine newspapers from the British press. This characteristic allows for the tuning of models for event identification. Each newspaper document can be wholly-, partially-, or non-derived from a news agency report. Therefore METER is useful to test de-duplication models. It is relatively small, 1.7k documents, but it is manually annotated by expert journalists. Twenty-five percent of the newspaper notes are wholly-derived from an agency wire. Either derived or not, in general the notes are modified to stick to editorial focus, style, and readability standards.</p>
      <p>News Corpus G [Gas17]. It was originally intended for the development of news recommendation models. G does not contain full articles, only titles. The article's content can be downloaded from the provided URL, pointing to the original publisher. We stick to using only the titles to assess the robustness of our models when dealing with very short texts. G is significantly larger: 423k documents covering 7,231 events. Such events are as provided by Google News and we do not consider them as ground truth.</p>
      <p>SignalMedia 1M Corpus [CAMM16]. This corpus is significantly more diverse. Besides including documents from major news agencies and papers, 1M contains material from magazines and blog entries, among others. We discard blog entries and focus on items identified as news. This dataset is particularly challenging because it is only lightly curated; it may contain noisy text (e.g., with HTML tags) and even content-less entries.1 Due to the regular querying of articles, verbatim duplicates also exist in the collection (i.e., the same article may appear a number of times with a different unique id). As stressed in [CAMM16], this real-life dataset prevents the over-estimation of performance usually obtained on clean data. 1M does not include any event-related information.</p>
    </sec>
    <sec id="sec-3">
      <title>Architecture and Models</title>
      <p>Our architecture consists of four modules plus the video-generation stage, which are described next.</p>
      <sec id="sec-3-1">
        <title>Clustering for Event Identification</title>
        <p>The input to this module is a batch of news articles from a fixed time period. The output is the articles organised within a non-specified number of events. Traditionally, for this task the input data is treated as a continuous stream of documents. Hierarchical [SCK+06] and partitional clustering [AS12, AY10] are popular approaches. Still, we use DBSCAN [EKSX96]. The main reasons are that, at this stage, we are not interested in news streams but in temporal batches and, perhaps more importantly, that DBSCAN does not require information related to the expected number of events. As a result, no knowledge is necessary about the distribution of the input documents.</p>
        <p>DBSCAN does require setting two hyper-parameters. The first one is the maximum distance under which two elements can be considered as part of the same cluster. The second one is the minimum number of elements in a cluster. Items can belong to no neighbourhood at all and be considered as noisy entries. We fix the minimum size of a cluster to 2 news articles, thus considering singletons as noise. As for the maximum distance, we use the METER corpus to set it empirically. The experiments are described in Section 4.1.</p>
        <p>We opt for doc2vec embeddings [LM14] for document representation, pre-trained on articles from the Associated Press [LB16]. The pair-wise distances are computed as 1 minus the cosine similarity. The use of doc2vec for representing documents is appealing due to its semantic properties.</p>
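<p>The clustering step above can be sketched as follows: a minimal DBSCAN-style routine (simplified, without the core-point check of the full algorithm) over cosine distances, with eps = 0.55 and a minimum cluster size of 2 as in Sections 3.1 and 4.1. The toy two-dimensional vectors stand in for the pre-trained doc2vec representations.</p>

```python
# Minimal DBSCAN-style sketch over cosine distances. Articles that end up in
# no neighbourhood of at least `min_size` members keep the noise label -1.
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def dbscan(vectors, eps=0.55, min_size=2):
    labels = [-1] * len(vectors)  # -1 = noise (singletons)
    cluster = 0
    for i in range(len(vectors)):
        if labels[i] != -1:
            continue
        neigh = [j for j in range(len(vectors))
                 if cosine_distance(vectors[i], vectors[j]) <= eps]
        if len(neigh) < min_size:
            continue  # i stays noise
        for j in neigh:
            labels[j] = cluster
        frontier = list(neigh)
        while frontier:  # expand the cluster (simplified: no core-point check)
            k = frontier.pop()
            for j in range(len(vectors)):
                if labels[j] == -1 and cosine_distance(vectors[k], vectors[j]) <= eps:
                    labels[j] = cluster
                    frontier.append(j)
        cluster += 1
    return labels

docs = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05],   # event A
        [0.0, 1.0], [0.1, 0.9], [0.05, 1.0],   # event B
        [-1.0, -1.0]]                          # unrelated article -> noise
labels = dbscan(docs)
```

<p>With these vectors the two tight groups become two events and the outlier is labelled as noise.</p>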
        <p>1For instance, the content of entry f4edd16d-df59-41f9-ae01-d4dee076b0d5 is "Your access to this site has been temporarily blocked. This block will be automatically removed shortly".</p>
      </sec>
      <sec id="sec-3-2">
        <title>Near-Duplicate Detection for De-Duplication</title>
        <p>The input of this module is the articles belonging to a single event, as identified by the clustering module. The output is such articles after discarding near-duplicates. We opt for standard text re-use identification approaches based on word n-gram comparison [LBM04]. We represent the texts as bags of word n-grams after standard pre-processing: casefolding, tokenisation, and stopword removal. Tokens shorter than 2 characters are discarded as well. We use the Jaccard coefficient [Jac01] to compute the similarities.</p>
        <p>The value of n, as well as the threshold to consider two documents near-duplicates, is set empirically, once again on the METER corpus. The experiments are discussed in Section 4.2.</p>
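<p>The near-duplicate comparison can be sketched as follows, with a tiny illustrative stopword list standing in for a full one; the values n = 1 and threshold 0.25 are those selected in Section 4.2.</p>

```python
# Bag-of-word-n-grams Jaccard similarity after casefolding, tokenisation,
# stopword removal, and dropping tokens shorter than 2 characters.
import re

STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to"}  # illustrative only

def ngrams(text, n=1):
    tokens = [t for t in re.findall(r"\w+", text.lower())
              if len(t) >= 2 and t not in STOPWORDS]
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(text_a, text_b, n=1):
    a, b = ngrams(text_a, n), ngrams(text_b, n)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def is_near_duplicate(text_a, text_b, n=1, threshold=0.25):
    # Pairs at or above the tuned threshold are discarded as near-duplicates.
    return jaccard(text_a, text_b, n) >= threshold
```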
      </sec>
      <sec id="sec-3-4">
        <title>Ranking for Diversification</title>
        <p>The input of this module is the de-duplicated articles from a specific event, as filtered by the de-duplication module. The output is a ranked list of the documents. One of the premises of our system is breaking a user's bubble. We aim at presenting a news event including points of view as diverse as possible. The idea is that those articles which are most dissimilar to the rest covering the event are those which contain the most diverse contents.</p>
        <p>In a k-means-like model, finding such dissimilarity would be as straightforward as computing the similarity of each article against the centroid. Nevertheless, no centroid exists in a DBSCAN-generated cluster. Therefore, our ranking function computes one minus the average similarity between an article and the rest of the articles in the cluster: score(d) = 1 − (1/|c|) Σ_{d′ ∈ c, d′ ≠ d} sim(d, d′) (1) where d and d′ are documents in cluster c and |c| represents the size of c. Once again, we use cosine similarity on doc2vec representations. The articles will be presented to the user according to this ranking, from top to bottom. We subtract from 1 because we want the most different articles to appear first. The opening article is an exception: it is the last one according to the scoring function (i.e., the most similar to the rest of the cluster members). The reason is that we consider this article the best one to open the video and give a good overview of the event. This module requires no tuning. Section 4.3 shows a qualitative analysis.</p>
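<p>The scoring and ordering described above can be sketched as follows; plain lists of floats stand in for doc2vec vectors, and rank_for_video is a hypothetical helper name.</p>

```python
# One minus the average cosine similarity between an article and the other
# members of its cluster; the most dissimilar articles rank first, except
# that the most central article opens the video.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def diversity_score(d, cluster):
    others = [x for x in cluster if x is not d]
    return 1.0 - sum(cosine(d, x) for x in others) / len(cluster)

def rank_for_video(cluster):
    """Most dissimilar articles first; the most central one opens the video."""
    ranked = sorted(cluster, key=lambda d: diversity_score(d, cluster), reverse=True)
    opener = ranked.pop()  # lowest score = most similar to the rest
    return [opener] + ranked
```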
      </sec>
      <sec id="sec-3-5">
        <title>Query Generation for Image Gathering</title>
        <p>Finally, we query a search engine to gather illustrations for each of the articles. The input to this module is the ranked list of texts from the diversification module and the output is one query per article.</p>
        <p>We explore three alternatives to generate the query. Model q1 uses the news title. Models q2 and q3 follow a common mechanism. Firstly, all sub-chunks for all texts are extracted and tf-idf-ranked. For each document in the list, the chunk with the highest score is selected as the query. Once a chunk has been used, it is discarded from the list of candidates to avoid grabbing duplicate images. For model q2 we use word 2-grams, whereas for q3 we use named entities (NEs). Regardless of the contents of the first article in the ranking, its query consists of the top NE.</p>
        <p>The so-generated chunks are queried to a search engine, one at a time, and the top-5 pictures are grabbed for integration in the video. In this version we use Google's search engine.</p>
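<p>The q2 strategy can be sketched as follows; the tf-idf weighting is a simple count-times-idf stand-in rather than the exact weighting used in the system, and generate_queries is a hypothetical helper name.</p>

```python
# Sketch of strategy q2: rank each article's word 2-grams by a simple
# tf x idf score computed over the cluster, then assign the highest-scoring
# 2-gram not used by a previous article as that article's image query.
import math
from collections import Counter

def bigrams(text):
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

def generate_queries(articles):
    docs = [Counter(bigrams(a)) for a in articles]
    df = Counter(g for d in docs for g in set(d))  # document frequency
    n = len(articles)
    used = set()
    queries = []
    for d in docs:
        scored = sorted(((tf * math.log(n / df[g]), g)
                         for g, tf in d.items() if g not in used), reverse=True)
        query = scored[0][1] if scored else ""
        used.add(query)  # avoid grabbing duplicate images
        queries.append(query)
    return queries
```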
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments and Results</title>
      <sec id="sec-4-1">
        <title>Clustering Tuning</title>
        <p>Our first experiment intends to tune our event identification model (cf. Section 3.1). Our objective is identifying the best DBSCAN neighbourhood maximum distance (eps) for a random number of events and their associated articles. We are interested in two factors: high quality and stability for different document volumes.</p>
        <p>First we formalise the problem and describe the performance measures. Let D be a collection of documents covering a set of events E. We refer to the number of events in E as |E|. For each d ∈ D, let e(d) be the set of documents belonging to the same event as d. Analogously, let c(d) be the set of documents that the model assigns to the same cluster as d. For any E′ ⊆ E, let D|E′ be the subset of D whose documents' events are in E′. We use BCubed-F1 as the clustering performance measure [AGAV09]. We define δ(s1, s2) = 1 if the sets s1 and s2 are identical, 0 otherwise. Let TP(d) = Σ_{d′} δ(e(d), e(d′)) · δ(c(d), c(d′)) (2) be a function counting the number of documents belonging to the same event as d which have been put together in the same cluster by DBSCAN. BCubed-F1 is the harmonic mean between BCubed precision P and BCubed recall R:</p>
        <p>P = (1/|D|) Σ_{d ∈ D} TP(d)/|c(d)|,  R = (1/|D|) Σ_{d ∈ D} TP(d)/|e(d)| (3)</p>
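<p>Equations (2) and (3) can be computed as follows over parallel lists of gold event labels and predicted cluster labels.</p>

```python
# BCubed precision, recall, and F1 as in Eqs. (2)-(3): TP(d) counts the
# documents sharing both d's gold event and d's predicted cluster.
def bcubed_f1(gold, pred):
    """gold[i] / pred[i]: event / cluster label of document i."""
    n = len(gold)
    p_sum = r_sum = 0.0
    for i in range(n):
        same_event = {j for j in range(n) if gold[j] == gold[i]}
        same_cluster = {j for j in range(n) if pred[j] == pred[i]}
        tp = len(same_event & same_cluster)
        p_sum += tp / len(same_cluster)
        r_sum += tp / len(same_event)
    precision, recall = p_sum / n, r_sum / n
    return 2 * precision * recall / (precision + recall)
```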
        <p>We estimate the parameter eps as follows. For 10 ≤ i ≤ |E|, we randomly select a subset E′ of i events and run our clustering algorithm on D|E′. We perform 10 random repetitions to assess the stability of the outcome on increasing numbers of gold clusters. Figure 1 shows, for each eps value, the BCubed-F1 measure averaged over the 10 runs, together with the standard deviation. We include the performance of k-means to give perspective to the results. In principle, k-means has the advantage of including the expected number of clusters as a parameter, and we always assign the right number.</p>
        <p>[Figure 1: averaged BCubed F-measure (y-axis, 0 to 1) per eps value.]</p>
        <p>Values of eps ≥ 0.55 yield the best results when dealing with small numbers of clusters, but performance drops drastically when facing larger numbers of events. Lower values yield a relatively stable performance, regardless of the number of events in the dataset. An analysis focused on the BCubed precision and recall values (not reported) indicates that the drop observed for eps ≥ 0.55 is pulled by precision; the clusters tend to be larger than they should, including noisier entries. As a compromise between stability and purity, we select eps = 0.55.</p>
      </sec>
      <sec id="sec-4-1-dedup">
        <title>De-duplication Tuning</title>
        <p>Our second experiment intends to tune the model for near-duplicate identification (cf. Section 3.2). The purpose is tuning two parameters: the value of n (the word n-gram level) and the similarity threshold above which documents are considered near-duplicates, and hence one can be discarded from the final output. In the sibling task of text re-use detection, setting n to {2, 3} [BCR09] and even 5 [KBK09] is considered standard. As we are interested in discarding whole documents to reduce redundancy to a minimum, we explore low values to allow for a more flexible comparison: n = {1, 2}.</p>
        <p>Once again we use the METER corpus and its text re-use annotation. We adopt two settings. In the simple setting, we consider a news agency-newspaper pair of documents as positive iff the latter is labelled as wholly-derived and both cover the same event. In the complex setting we consider an additional triangular relationship: a newspaper-newspaper pair is considered as positive iff both are labelled as wholly-derived from the same news agency article. We restrict our similarity comparison to all those articles published on the same day, resulting in 38k and 48k comparisons in the simple and complex settings, respectively. As a consequence of this volume of comparisons, a high imbalance exists in the dataset: most pairs are negative instances. We evaluate this experiment on the basis of the F1-measure for binary classification: F1 = 2tp / (2tp + fp + fn), where tp, fp, and fn stand for the number of true positives, false positives, and false negatives. Table 1 shows the results. Firstly, a more flexible comparison based on word 1-grams results in the best performance (this may imply documents which are not complete duplicates are discarded; we prefer this over including very similar notes). In both simple and complex settings the best F1 is obtained with a threshold of 0.25 and we select this threshold. This supports the concept of co-derivative and reflects that the threshold is valid for both news agency-newspaper and newspaper-newspaper comparisons.</p>
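<p>The evaluation above can be sketched as follows; f1_at_threshold is a hypothetical helper that scores near-duplicate decisions at a given similarity threshold.</p>

```python
# F1 = 2tp / (2tp + fp + fn) for threshold-based near-duplicate decisions.
def f1_at_threshold(similarities, labels, threshold):
    """similarities: Jaccard score per pair; labels: True for gold near-duplicates."""
    tp = sum(1 for s, y in zip(similarities, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(similarities, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(similarities, labels) if s < threshold and y)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
```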
      </sec>
      <sec id="sec-4-2">
        <title>Articles Ranking</title>
        <p>Now we make a qualitative analysis. Table 2 shows the titles of the articles of three events ranked on the basis of our diversification model (cf. Section 3.3).</p>
        <p>Instance A tells the story of Libyan rebels and their impact on oil. The top article does summarise the event, referring to a rebel attack on naval forces. As expected, the topic of article 2 is not as close: it is about the plans to sink a ship transporting illegal oil, currently besieged by the Libyan Navy. Whereas the third article still refers to oil, rebel attacks, and even to the chances of a conflict, the latter two refer to the dismissal of the Libyan PM by the parliament. That is, we are indeed looking at a story from different angles.</p>
        <p>Something similar occurs with instances B and C. Instance B is about the listing of a mansion. After an introductory first article, further details appear, such as price or location. Instance C tells the story of the death of a former girlfriend of actor Jim Carrey. It is worth noting article 2, which is about a different event. Our event detection module got confused because this article is about the girlfriend of an actor. Whether this is relevant for a user is arguable.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Query Generation</title>
        <p>Table 2 also shows the queries as generated by the three variations of our generator, q1, q2, and q3, plus a fourth variation: q4 = q2 + q3 (cf. Section 3.4). The NE-based q3 seems far from perfect when dealing with the titles of instances A and B. The cause is that the camel-casing is confusing the NER. The simple n-gram-based approach seems to produce sensible queries. When having the full article at hand, the NE-based model works slightly better.</p>
        <p>Figure 2 shows the photograms of videos generated with these four kinds of queries for Table 2, instance B. Each sub-figure refers to one video and each row to one news article, which can include up to five images. The whole titles from strategy q1 provide a good visual overview of the event: the listed house and its owners. Still, due to content overlap, some images appear more than once: coordinates {1,4; 2,2}, {1,5; 5,3; 7,2}, and {5,1; 7,1}. The chunk-level strategies result in less repetition. Strategy q3, based on NEs, is more varied: focusing on football player Tom Brady for the first two titles, moving towards the main event, the listing of a house for sale in Los Angeles, and finally the second person involved, top model Gisele Bündchen. Something similar occurs with q2's 2-grams: non-duplicated photograms centred on the couple and the listed house. Still, q2 has a problem: "The Brady report" is an Arizona radio show and the resulting photograms refer to it. Even with this mistake in mind, it seems like q2 provides a good balance between relevance and diversity. Combining NEs and 2-grams into q4 reduces variation (photograms {5,3; 7,3} and {5,5; 7,1} are the same).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Final Remarks and Ongoing Work</title>
      <p>We presented our first efforts on breaking the news bubble. We integrated a system for the automatic generation of videos consisting of four modules: event identification, de-duplication, diversification, and image gathering. The outcome comes in the form of short illustrated videos aiming at providing a user with different points of view in the coverage of the same event.</p>
      <p>Departing from this architecture, we aim at using more sophisticated text representation and event identification technology. We are particularly interested in storyline generation [MSA+15, VCK15].</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[AGAV09] Enrique Amigó, Julio Gonzalo, Javier Artiles, and Felisa Verdejo. <article-title>A comparison of extrinsic clustering evaluation metrics based on formal constraints</article-title>. <source>Information Retrieval</source>, <volume>12</volume>(<issue>4</issue>):<fpage>1</fpage>–<lpage>32</lpage>, <year>2009</year>.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[AS12] Joel Azzopardi and Christopher Staff. <article-title>Incremental Clustering of News Reports</article-title>. <source>Algorithms</source>, <volume>5</volume>(<issue>4</issue>):<fpage>364</fpage>–<lpage>378</lpage>, <year>2012</year>.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[AY10] Charu C. Aggarwal and Philip S. Yu. <article-title>On Clustering Massive Text and Categorical Data Streams</article-title>. <source>Knowledge and Information Systems</source>, pages <fpage>171</fpage>–<lpage>196</lpage>, <year>2010</year>.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[BCR09] Alberto Barrón-Cedeño and Paolo Rosso. <article-title>On Automatic Plagiarism Detection based on n-grams Comparison</article-title>. In <source>Advances in Information Retrieval. Proceedings of the 31st European Conference on IR Research</source>, LNCS (<volume>5478</volume>):<fpage>696</fpage>–<lpage>700</lpage>, <year>2009</year>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[CAMM16] David Corney, Dyaa Albakour, Miguel Martinez, and Samir Moussa. <article-title>What do a Million News Articles Look Like?</article-title> In <source>NewsIR 2016: Recent Trends in News Information Retrieval</source>, pages <fpage>42</fpage>–<lpage>47</lpage>, Padua, Italy, <year>2016</year>.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[CGP02] Paul Clough, Robert Gaizauskas, and Scott Piao. <article-title>Building and Annotating a Corpus for the Study of Journalistic Text Reuse</article-title>. In <source>Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002)</source>, volume V, pages <fpage>1678</fpage>–<lpage>1691</lpage>, Las Palmas, Spain, <year>2002</year>. ELRA.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[EKSX96] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. <article-title>A density-based algorithm for discovering clusters in large spatial databases with noise</article-title>. In <source>Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96)</source>, pages <fpage>226</fpage>–<lpage>231</lpage>. AAAI Press, <year>1996</year>.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[Gas17] Fabio Gasparetti. <article-title>Modeling User Interests from Web Browsing Activities</article-title>. <source>Data Mining and Knowledge Discovery</source>, <volume>31</volume>(<issue>2</issue>):<fpage>502</fpage>–<lpage>547</lpage>, <year>2017</year>.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[Jac01] Paul Jaccard. <article-title>Étude comparative de la distribution florale dans une portion des Alpes et des Jura</article-title>. <source>Bulletin de la Société Vaudoise des Sciences Naturelles</source>, <volume>37</volume>:<fpage>547</fpage>–<lpage>579</lpage>, <year>1901</year>.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[KBK09] Jan Kasprzak, Michal Brandejs, and Miroslav Kripac. <article-title>Finding Plagiarism by Evaluating Document Similarities</article-title>. Volume <volume>502</volume>, pages <fpage>24</fpage>–<lpage>28</lpage>, San Sebastian, Spain, <year>2009</year>. CEUR-WS.org.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[LB16] Jey Han Lau and Timothy Baldwin. <article-title>An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation</article-title>. In <source>Proceedings of the 1st Workshop on Representation Learning for NLP</source>, pages <fpage>78</fpage>–<lpage>86</lpage>, <year>2016</year>.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[LBM04] Caroline Lyon, Ruth Barret, and James Malcolm. <article-title>A Theoretical Basis to the Automated Detection of Copying Between Texts, and its Practical Implementation in the Ferret Plagiarism and Collusion Detector</article-title>. In <source>Plagiarism: Prevention, Practice and Policies Conference</source>, Newcastle upon Tyne, UK, <year>2004</year>.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[LM14] Quoc Le and Tomas Mikolov. <article-title>Distributed Representations of Sentences and Documents</article-title>. In <source>Proceedings of the 31st International Conference on Machine Learning</source>, Beijing, China, <year>2014</year>.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[MSA+15] Anne-Lyse Minard, Manuela Speranza, Eneko Agirre, Itziar Aldabe, Marieke van Erp, Bernardo Magnini, German Rigau, and Ruben Urizar. <article-title>SemEval-2015 Task 4: TimeLine: Cross-Document Event Ordering</article-title>. In <source>Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)</source>, pages <fpage>778</fpage>–<lpage>786</lpage>. ACL, <year>2015</year>.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[SCK+06] Nachiketa Sahoo, Jamie Callan, Ramayya Krishnan, George Duncan, and Rema Padman. <article-title>Incremental hierarchical clustering of text documents</article-title>. In <source>Proceedings of the 15th ACM International Conference on Information and Knowledge Management, CIKM '06</source>, pages <fpage>357</fpage>–<lpage>366</lpage>, New York, NY, <year>2006</year>. ACM.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[VCK15] Piek Vossen, Tommaso Caselli, and Yiota Kontzopoulou. <article-title>Storylines for structuring massive streams of news</article-title>. In <source>Proceedings of the First Workshop on Computing News Storylines</source>, pages <fpage>40</fpage>–<lpage>49</lpage>, Beijing, China, <year>2015</year>. ACL.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>