<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Cross-Reading News</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shahbaz Syed</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tim Gollub</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcel Gohsen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikolay Kolyada</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benno Stein</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthias Hagen</string-name>
          <email>matthias.hagen@informatik.uni-halle.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>In: D. Albakour</institution>
          ,
          <addr-line>D. Corney, J. Gonzalo, M. Martinez, B. Poblete</addr-line>
          ,
          <institution>A. Vlachos (eds.): Proceedings of the NewsIR'18 Workshop at ECIR</institution>
          ,
          <addr-line>Grenoble, France, 26-March-2018, published at</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Martin-Luther-Universität Halle-Wittenberg</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>1</volume>
      <fpage>42</fpage>
      <lpage>47</lpage>
      <abstract>
        <p>Journalists often need to perform multiple actions, using different tools, to create content for publication: searching the web, curating the result list, choosing relevant entities for the article, and writing. We aim to improve this pipeline with Cross-Reading News, a modular and extensible web application that helps journalists research and draft articles for publication. It combines information retrieval, natural language processing, and deep learning to provide a smart and focused workspace that leverages the local collections of news articles maintained by media companies. Specifically, users can search for information in multiple local collections of news articles, gather named entities, extract keyqueries, find semantically related content, and obtain title suggestions through summarization.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Journalists at media organizations are among the key sources of new content on the web in the form of articles. However, producing new content often comprises searching, reusing, and curating information from different sources. A journalist usually has a vague idea of what he or she wants to write about and often consults an expert (here, a person specialized in the taxonomy of the local collection of articles) to obtain relevant documents. This information need is often hard to formulate as a
concrete search query, making the process laborious to iterate. Also, journalists covering specific types of news, such as sports or politics, are required to write articles in a relatively short amount of time compared to those covering, say, editorials or literature. In such cases, a tool that provides faster access to relevant data can significantly speed up the workflow. Specifically, the collections of published articles maintained by most media companies today are an excellent resource that can be leveraged to build intelligent tools by applying natural language processing and information retrieval.</p>
      <p>Although querying the web is also a possibility for satisfying an information need, it is often overwhelming: the journalist is bombarded with excessive results, which can be both distracting and time-consuming to refine further. Using an existing collection provides a more focused search environment and also eases the process of finding similar content already published by fellow journalists. This significantly simplifies author citations and makes it easier to find duplicate or similar articles to be reused.</p>
      <p>In this paper, we present our application, which integrates multiple features: a local collection of published articles that has been preprocessed to recognize named entities and semantically similar paragraphs, a search engine for querying this collection, an automatic summarization model that provides title suggestions for a piece of text (which can also be used as keyphrases), and finally, a text editor to draft articles.</p>
    </sec>
    <sec id="sec-2">
      <title>Corpus and preprocessing</title>
      <p>We use the Signal Media corpus [CAMM16] as our local data source for the search engine. Before indexing the documents, we perform the following preprocessing steps: First, if a document is longer than 400 words, we split the text into paragraphs using NLTK (https://www.nltk.org).</p>
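      <p>The exact splitting procedure is not spelled out above; the following is a minimal sketch of one plausible realization, which uses NLTK's sentence tokenizer to group sentences into paragraph-sized chunks once the 400-word threshold is exceeded (the chunk size of 120 words is an illustrative assumption):</p>
      <preformat>
# Sketch: split long documents into paragraph-sized chunks with NLTK.
# The 400-word threshold is from the paper; the 120-word chunk size
# is an illustrative assumption.
import nltk

nltk.download("punkt", quiet=True)  # sentence tokenizer data

def split_into_paragraphs(text, threshold=400, chunk_words=120):
    if len(text.split()) > threshold:
        paragraphs, current, count = [], [], 0
        for sentence in nltk.sent_tokenize(text):
            current.append(sentence)
            count += len(sentence.split())
            if count >= chunk_words:  # close the current chunk
                paragraphs.append(" ".join(current))
                current, count = [], 0
        if current:
            paragraphs.append(" ".join(current))
        return paragraphs
    return [text]  # short documents are kept as a single paragraph
      </preformat>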
      <p>Second, we perform named entity recognition using Stanford NER [MSB+14] to annotate entities for each paragraph. Specifically, we extract PERSON, ORGANIZATION, LOCATION, FACILITY, and GPE (Geo-Political Entity) entities and add this information to our document store based on RocksDB.</p>
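      <p>A minimal sketch of this step is shown below. For brevity it uses NLTK's built-in ACE-style chunker, which emits exactly these five labels, as a stand-in for the Stanford NER setup, and the python-rocksdb bindings for the document store; the key scheme is an illustrative assumption:</p>
      <preformat>
# Sketch: annotate entities per paragraph and persist them in RocksDB.
# nltk.ne_chunk is a stand-in here for Stanford NER; it emits the same
# ACE-style labels (PERSON, ORGANIZATION, LOCATION, FACILITY, GPE).
# Requires the NLTK data packages punkt, averaged_perceptron_tagger,
# maxent_ne_chunker, and words.
import json
import nltk
import rocksdb  # python-rocksdb bindings

WANTED = {"PERSON", "ORGANIZATION", "LOCATION", "FACILITY", "GPE"}

def extract_entities(paragraph):
    tokens = nltk.word_tokenize(paragraph)
    tree = nltk.ne_chunk(nltk.pos_tag(tokens))
    # Subtrees carry a label; plain (word, tag) tuples do not.
    return [(" ".join(tok for tok, _ in node.leaves()), node.label())
            for node in tree
            if hasattr(node, "label") and node.label() in WANTED]

db = rocksdb.DB("entities.db", rocksdb.Options(create_if_missing=True))

def store_paragraph(paragraph_id, paragraph):
    entities = extract_entities(paragraph)
    db.put(paragraph_id.encode(), json.dumps(entities).encode())
      </preformat>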
      <p>Finally, in order to compute paragraph similarity, we create a 300-dimensional vector for each paragraph from an embedding model trained on the whole corpus using fastText [BGJM17]. A paragraph vector is simply the average of the constituent word vectors. As the number of paragraphs is large (1.5 million), we perform the similarity search as part of our preprocessing in order to find similar paragraphs based on cosine similarity. We use Faiss [JDJ17] to speed up the similarity search in this large vector space. We empirically chose to store up to 3 similar paragraphs for each paragraph (if available).</p>
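      <p>The following sketch illustrates this computation, assuming a trained fastText model and a list of paragraphs held in memory; L2-normalizing the vectors turns Faiss's inner-product search into a cosine-similarity search:</p>
      <preformat>
# Sketch: average word vectors into paragraph vectors, then find the
# top-3 cosine-similar paragraphs for each paragraph with Faiss.
# "news.bin" and the paragraphs list are illustrative assumptions.
import numpy as np
import fasttext
import faiss

model = fasttext.load_model("news.bin")  # trained on the whole corpus

def paragraph_vector(paragraph):
    # Assumes non-empty paragraphs; average of constituent word vectors.
    return np.mean([model.get_word_vector(w) for w in paragraph.split()],
                   axis=0)

vectors = np.stack([paragraph_vector(p) for p in paragraphs]).astype("float32")
faiss.normalize_L2(vectors)  # cosine similarity == inner product
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

# k=4: the nearest hit is the paragraph itself, leaving 3 true neighbors.
scores, neighbors = index.search(vectors, 4)
similar = {i: [j for j in row if j != i][:3]
           for i, row in enumerate(neighbors)}
      </preformat>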
    </sec>
    <sec id="sec-3">
      <title>Gather relevant entities</title>
      <p>The first stage of writing an article is primarily collecting information about a specific subject or an event. To facilitate looking up documents from a local collection, we built a search engine that uses three standard retrieval models: tf, tf-idf, and BM25. As an initialization step, the user can choose the specific retrieval model to be used by the application.</p>
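      <p>As an illustration, the BM25 retrieval step could look as follows, using the rank_bm25 package (one convenient implementation, not necessarily the one used in the application; documents is assumed to hold the indexed texts):</p>
      <preformat>
# Sketch: index the collection with BM25 and retrieve the top-20
# documents for a query.
from rank_bm25 import BM25Okapi

tokenized = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized)

def top_documents(query, k=20):
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(documents)),
                    key=lambda i: scores[i], reverse=True)
    return ranked[:k]
      </preformat>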
      <p>Intuitively, as users of search engines, we go through some of the retrieved documents and identify the entities that might be of interest to our subject. Taking this as a motivation, we designed our first stage to return a list of named entities related to a query instead of the full documents themselves. By providing such a list (all entities from the top 20 documents returned by the selected retrieval model), we help users easily decide which of these entities to include in their article. Selected entities can be organized into sections, which already provides a structure that serves as a guiding map in the subsequent stages of the application. At any time, the user can navigate back to this step in order to add more entities to their sections.</p>
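      <p>Building on the hypothetical helpers from the previous sketches (top_documents and the RocksDB entity store), the entity list for a query could be assembled as follows:</p>
      <preformat>
# Sketch: gather the deduplicated entity list shown to the user,
# aggregated over the top-20 retrieved documents.
import json

def entities_for_query(query, k=20):
    seen, entities = set(), []
    for doc_id in top_documents(query, k):
        stored = db.get(str(doc_id).encode())
        for name, label in json.loads(stored or b"[]"):
            if name not in seen:  # deduplicate across documents
                seen.add(name)
                entities.append((name, label))
    return entities
      </preformat>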
    </sec>
    <sec id="sec-4">
      <title>Gather related text</title>
      <p>Using the entity list from the previous stage, the user can now start searching for information with better focus. It is possible to quickly perform a document search for any section with the click of a button. To facilitate this, we combine all the entities in a given section to form a search query. It is also possible to perform a free search with any other query. Results from this search can be filtered in order to select content from specific publishers, and each paragraph in the displayed result list can be quickly added to the draft, as shown in Figure 1. This speeds up the process of collecting and reusing existing content.</p>
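      <p>A small sketch of the section-based search, with illustrative field names:</p>
      <preformat>
# Sketch: turn a section's entities into a query and filter results
# by publisher. The "source" field name is an illustrative assumption.
def section_query(section_entities):
    return " ".join(name for name, _label in section_entities)

def filter_by_publisher(results, publishers):
    return [r for r in results if r["source"] in publishers]
      </preformat>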
      <p>Furthermore, for each main paragraph we show the top k similar paragraphs (if available) according to the cosine similarity scores computed for the paragraph vectors as part of our preprocessing (we chose k=3). This helps the user to quickly find related content, compared to reading a full document to then reuse a portion of it.</p>
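      <p>Looking up the precomputed neighbors is then a single key-value read; the key scheme below is an illustrative assumption:</p>
      <preformat>
# Sketch: fetch the precomputed similar paragraphs for display,
# assuming preprocessing stored them under a "&quot;id&quot;:similar" key.
import json

def similar_paragraphs(paragraph_id, k=3):
    stored = db.get(f"{paragraph_id}:similar".encode())
    return json.loads(stored)[:k] if stored else []
      </preformat>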
    </sec>
    <sec id="sec-5">
      <title>Generate keyphrases</title>
      <p>Keyphrases for unlabeled documents are an effective resource for clustering, topic search, and summarization [FPW+99]. Motivated by the idea of using keyqueries as document descriptors [GHMS13], we use deep learning to automatically generate keyphrases for both existing paragraphs and any new content written by the user (Figure 2). We believe that a keyphrase can serve both as a title suggestion for the text and, transformed into a keyquery, as a means to find additional semantically related documents.</p>
      <p>We cast the task of generating a keyphrase for a text as an abstractive summarization task using neural generative models. This allows the model not only to gain a semantic understanding of the text, but also to generate novel words for the title that convey important information selected from the semantic space. A sample of generated summaries is presented in Table 1. While a similar approach for generating keyphrases for scientific articles was adopted by [MZH+17], we only use the title of each document as the target.</p>
      <p>We use a paragraph as the source text to be summarized and the title of the article to which it belongs as the target summary to train our model. Our training set consists of 1.29 million pairs, and the validation set has 68,000 pairs. We use a bidirectional RNN encoder with global dot attention [LPM15] in the decoder, stochastic gradient descent as the optimizer, and a dropout value of 0.3 as suggested by [GG16].</p>
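      <p>A compact PyTorch sketch of this architecture is given below: a bidirectional GRU encoder, a decoder with global dot attention [LPM15], SGD as the optimizer, and dropout of 0.3. The embedding and hidden sizes and the zero-initialized decoder state are illustrative simplifications:</p>
      <preformat>
# Sketch: bidirectional RNN (GRU) encoder with a global dot-attention
# decoder for title generation. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class TitleGenerator(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden=256, dropout=0.3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Bidirectional GRU encoder; its outputs are 2*hidden wide.
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
        # Decoder sized to match the concatenated encoder directions;
        # for brevity this sketch starts it from a zero state.
        self.decoder = nn.GRU(emb_dim, 2 * hidden, batch_first=True)
        self.dropout = nn.Dropout(dropout)  # 0.3, as suggested by [GG16]
        self.out = nn.Linear(4 * hidden, vocab_size)

    def forward(self, src, tgt):
        enc_out, _ = self.encoder(self.dropout(self.embed(src)))
        dec_out, _ = self.decoder(self.dropout(self.embed(tgt)))
        # Global dot attention: score each decoder state against all
        # encoder states and mix the encoder states accordingly.
        scores = torch.bmm(dec_out, enc_out.transpose(1, 2))
        context = torch.bmm(torch.softmax(scores, dim=-1), enc_out)
        # Predict the next title token from decoder state plus context.
        return self.out(torch.cat([dec_out, context], dim=-1))

model = TitleGenerator(vocab_size=50000)
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)  # plain SGD
      </preformat>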
    </sec>
    <sec id="sec-6">
      <title>Conclusion and Outlook</title>
      <p>We plan to evaluate the usefulness of our tool in a user study in the near future. As a next step, we aim to use Elasticsearch to index and search additional document collections. It is also possible to enhance the experience with additional deep learning features, such as automatic POS tagging using sequence models, locating important parts of the text using attention networks, question answering, text comprehension for visual inspection, and dynamic interaction with very long documents. We propose that by leveraging large document collections and deep learning, we can greatly improve the collaboration between content creators and consumers in today's age of information.
</p>
      <table-wrap id="table-1">
        <label>Table 1</label>
        <caption>
          <p>A sample of generated summaries: G is the gold title of the source article, P the title predicted by our model.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>G (gold title)</th>
              <th>P (predicted title)</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Kzoo apartment fire prompts large emergency response</td>
              <td>Crews respond to fire at Kalamazoo apartment complex</td>
            </tr>
            <tr>
              <td>Man accused of rape refused bail</td>
              <td>Bundaberg man accused of sexually abusing women</td>
            </tr>
            <tr>
              <td>Third Ukrainian policeman dies from injuries after clashes in Kiev, more than 140 people in hospital</td>
              <td>Ukrainian policeman dies in Kiev battle</td>
            </tr>
            <tr>
              <td>Duquesne beats Bucknell 26-7</td>
              <td>Duquesne Rolls Past Bucknell</td>
            </tr>
            <tr>
              <td>Taylor ton helps England keep series alive</td>
              <td>Taylor century helps England beat Australia</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>Effective approaches to attention-based neural machine translation</article-title>
          .
          <source>EMNLP</source>
          <year>2015</year>
          , Lisbon, Portugal,
          <source>September 17-21</source>
          ,
          <year>2015</year>
          , pages
          <fpage>1412</fpage>
          -
          <lpage>1421</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>