Cross-Reading News

Shahbaz Syed¹, Tim Gollub¹, Marcel Gohsen¹, Nikolay Kolyada¹, Benno Stein¹, and Matthias Hagen²

¹ Bauhaus-Universität Weimar
<first>.<last>@uni-weimar.de

² Martin-Luther-Universität Halle-Wittenberg
matthias.hagen@informatik.uni-halle.de
Abstract

Journalists often need to perform multiple actions, using different tools, in order to create content for publication. This involves searching the web, curating the result list, choosing relevant entities for the article, and writing. We aim to improve this pipeline through Cross-Reading News, a modular and extensible web application that helps journalists easily research and draft articles for publication. It combines information retrieval, natural language processing, and deep learning to provide a smart and focused workspace by leveraging the local collections of news articles maintained by media companies. Specifically, users can search for information in multiple local collections of news articles, gather named entities, extract keyqueries, find semantically related content, and obtain title suggestions through summarization.

Copyright © 2018 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: D. Albakour, D. Corney, J. Gonzalo, M. Martinez, B. Poblete, A. Vlachos (eds.): Proceedings of the NewsIR'18 Workshop at ECIR, Grenoble, France, 26 March 2018, published at http://ceur-ws.org

1 Introduction

Journalists from various media organizations are among the key sources of new content on the web in the form of articles. However, generating new information often comprises searching, reusing, and curating information from different sources. A journalist usually has a vague idea of what they want to write about and often consults an expert (here, a person specialized in the taxonomy of the local collection of articles) to obtain relevant documents. This information need is often hard to formulate as a concrete search query, which makes the process laborious to iterate. Moreover, journalists covering certain types of news, such as sports or politics, have to write articles in a relatively short amount of time compared to those covering, say, editorials or literature. In such cases, a tool that provides faster access to relevant data can significantly speed up the workflow. In particular, the collection of published articles maintained by most media companies today is an excellent resource that can be leveraged to build intelligent tools through natural language processing and information retrieval.

Although querying the web is also a way to satisfy an information need, it is often overwhelming: the journalist is bombarded with excessive results that are both distracting and time-consuming to refine further. An existing in-house collection provides a more focused search environment and also eases the process of finding similar content already published by fellow journalists. This significantly simplifies author citations and makes it easier to find duplicate or similar articles to be reused.

In this paper, we present an application that integrates several components: a local collection of published articles that has been preprocessed to recognize named entities and semantically similar paragraphs, a search engine for querying this collection, an automatic summarization model that provides title suggestions for a piece of text (which can also be used as keyphrases), and a text editor to draft articles.

2 Corpus and preprocessing

We use the Signal Media corpus [CAMM16] as the local data source for the search engine. Before indexing the documents, we perform the following preprocessing steps. First, if a document is longer than 400 words, we split its text into paragraphs using NLTK.¹

¹ https://www.nltk.org
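A minimal sketch of this splitting step is shown below. The 400-word threshold is as stated above; grouping NLTK sentence tokens into paragraphs of roughly 100 words is an illustrative assumption, not the system's exact heuristic.

```python
# Illustrative sketch of the document-splitting step. The 400-word threshold
# comes from the text above; grouping NLTK sentence tokens into ~100-word
# paragraphs is an assumption made for this example.
from nltk.tokenize import sent_tokenize, word_tokenize

MAX_DOC_WORDS = 400    # documents longer than this are split
PARA_WORDS = 100       # assumed target paragraph length

def split_into_paragraphs(text: str) -> list[str]:
    if len(word_tokenize(text)) <= MAX_DOC_WORDS:
        return [text]                      # short documents stay whole
    paragraphs, current, length = [], [], 0
    for sentence in sent_tokenize(text):
        n = len(word_tokenize(sentence))
        if current and length + n > PARA_WORDS:
            paragraphs.append(" ".join(current))
            current, length = [], 0
        current.append(sentence)
        length += n
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```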
Second, we perform named entity recognition using Stanford NER [MSB+14] to annotate entities for each paragraph. Specifically, we extract PERSON, ORGANIZATION, LOCATION, FACILITY, and GPE (geo-political entity) entities and add this information to our document store based on RocksDB.²

² http://rocksdb.org
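The annotation step can be sketched as follows. As a self-contained stand-in for the Stanford NER setup, the sketch uses NLTK's built-in ne_chunk tagger, whose label set matches the five entity types listed above.

```python
# Sketch of per-paragraph entity extraction. The system uses Stanford NER;
# NLTK's ne_chunk serves here as a self-contained stand-in whose label set
# matches the five entity types listed above.
from nltk import ne_chunk, pos_tag, word_tokenize

WANTED = {"PERSON", "ORGANIZATION", "LOCATION", "FACILITY", "GPE"}

def extract_entities(paragraph: str) -> dict[str, list[str]]:
    entities = {label: [] for label in WANTED}
    for node in ne_chunk(pos_tag(word_tokenize(paragraph))):
        if hasattr(node, "label") and node.label() in WANTED:
            # A chunk's leaves are (token, POS) pairs; join the tokens
            entities[node.label()].append(
                " ".join(token for token, _ in node.leaves()))
    return entities
```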
Finally, in order to compute paragraph similarity, we create a 300-dimensional vector for each paragraph from an embedding model trained on the whole corpus using fastText [BGJM17]. A paragraph vector is simply the average of its constituent word vectors. As the number of paragraphs is large (1.5 million), we perform the similarity search as part of our preprocessing in order to find similar paragraphs based on cosine similarity. We use Faiss [JDJ17] to speed up the similarity search in this large vector space. We empirically chose to store up to 3 similar paragraphs for each paragraph (if available).
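A compact sketch of this preprocessing follows. The 300 dimensions, the averaging, and the choice of 3 neighbours are as described above; the model file name is illustrative.

```python
# Sketch of the paragraph-similarity preprocessing with fastText and Faiss.
# The model file name is illustrative; dimensions and k follow the text above.
import faiss
import fasttext
import numpy as np

model = fasttext.load_model("news_embeddings.bin")  # assumed model file

def paragraph_vector(paragraph: str) -> np.ndarray:
    # Average of the constituent word vectors
    return np.mean([model.get_word_vector(w) for w in paragraph.split()],
                   axis=0)

def nearest_paragraphs(paragraphs: list[str], k: int = 3):
    vectors = np.stack(
        [paragraph_vector(p) for p in paragraphs]).astype("float32")
    faiss.normalize_L2(vectors)       # on unit vectors, inner product = cosine
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    # k + 1 because each paragraph's nearest neighbour is itself
    scores, ids = index.search(vectors, k + 1)
    return {i: list(zip(ids[i, 1:].tolist(), scores[i, 1:].tolist()))
            for i in range(len(paragraphs))}
```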

3 Gather relevant entities

The first stage of writing an article is primarily collecting information about a specific subject or event. To facilitate looking up documents from a local collection, we built a search engine that offers three standard retrieval models: tf, tf·idf, and BM25. As an initialization step, the user chooses the retrieval model to be used by the application.

Intuitively, as users of search engines, we go through some of the retrieved documents and identify the entities that might be of interest to our subject. Taking this as a motivation, we designed our first stage to return a list of named entities related to a query instead of the full documents themselves. By providing such a list (all entities from the top 20 documents returned by the selected retrieval model), we help users easily decide which of these entities to include in their article. Selected entities can be organized into sections, which already provides a structure that serves as a guiding map in the subsequent stages of the application. At any time, the user can navigate back to this step in order to add more entities to their sections.
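As a rough sketch of this stage: the rank_bm25 package stands in for the system's BM25 option (the actual retrieval implementation is not shown here), and the per-document entity sets are assumed to come from the preprocessing in Section 2.

```python
# Sketch of the first stage: retrieve the top 20 documents for a query and
# return the union of their named entities. rank_bm25 is a stand-in for the
# system's BM25 retrieval model; doc_entities holds precomputed entity sets.
from rank_bm25 import BM25Okapi

def gather_entities(query: str, docs: list[str],
                    doc_entities: list[set[str]],
                    top_k: int = 20) -> set[str]:
    bm25 = BM25Okapi([doc.lower().split() for doc in docs])
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(docs)), key=scores.__getitem__, reverse=True)
    # All entities from the top-ranked documents, as presented to the user
    return set().union(*(doc_entities[i] for i in ranked[:top_k]))
```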
                                                               keyqueries as document descriptions [GHMS13], we
4 Gather related text

Using the entity list from the previous stage, the user can now search for information with better focus. A document search for any section can be performed with the click of a button: to facilitate this, we combine all the entities in a given section to form a search query. It is also possible to perform a free search with any other query. Results from this search can be filtered in order to select content from specific publishers, and each paragraph in the displayed result list can be quickly added to the draft, as shown in Figure 1. This speeds up the process of collecting and reusing existing content.

Figure 1: Search and gather related text

Furthermore, for each main paragraph we show the top k similar paragraphs (if available) according to the cosine similarity scores computed for the paragraph vectors as part of our preprocessing (we chose k = 3). This helps the user to quickly find related content without having to read a full document, and then reuse a portion of it.
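A minimal sketch of this section-based search is given below; all names are illustrative, with `search` standing for any retrieval model from Section 3 and `neighbours` for the similarity table from Section 2.

```python
# Sketch of section-based search: a section's entities form one query, and
# every hit is returned together with its precomputed similar paragraphs.
# `search` and `neighbours` are placeholders for the components sketched above.
def search_section(section_entities: list[str], search, paragraphs,
                   neighbours, k: int = 3):
    query = " ".join(section_entities)   # entities combined into one query
    hits = search(query)                 # ids of matching paragraphs
    return [(paragraphs[i],
             [paragraphs[j] for j, _ in neighbours[i][:k]])
            for i in hits]
```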
5 Generate keyphrases

Keyphrases for unlabeled documents are an effective source for clustering, topic search, and summarization [FPW+99]. Motivated by the idea of using keyqueries as document descriptions [GHMS13], we use deep learning to automatically generate keyphrases both for existing paragraphs and for any new content written by the user (Figure 2). We believe that a keyphrase can serve as a title suggestion for the text and can also be transformed into a keyquery in order to find additional semantically related documents.

We cast the generation of a keyphrase for a text as an abstractive summarization task using neural generative models. This allows the model not only to gain a semantic understanding of the text, but also to generate novel words for the title, conveying important information selected from the semantic space. A sample of generated summaries is presented in Table 1. While a similar approach for generating keyphrases for scientific articles was adopted by [MZH+17], we only use the title of each document as the target.
Table 1: Gold (G) vs. predicted (P) summaries

G: Kzoo apartment fire prompts large emergency response
P: Crews respond to fire at Kalamazoo apartment complex

G: Man accused of rape refused bail
P: Bundaberg man accused of sexually abusing women

G: Third Ukrainian policeman dies from injuries after clashes in Kiev, more than 140 people in hospital
P: Ukrainian policeman dies in Kiev battle

G: Duquesne beats Bucknell 26-7
P: Duquesne Rolls Past Bucknell

G: Taylor ton helps England keep series alive
P: Taylor century helps England beat Australia

Figure 2: Title generation using summarization

We use a paragraph as the source text to be summarized, and the title of the article to which it belongs as the target summary, to train our model. The training set consists of 1.29 million pairs, and the validation set of 68,000 pairs. We use a bidirectional RNN encoder with global dot attention [LPM15] in the decoder, stochastic gradient descent as the optimizer, and a dropout value of 0.3, as suggested by [GG16].
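A minimal PyTorch sketch of this architecture is shown below. The bidirectional encoder, global dot attention, SGD, and the dropout value of 0.3 follow the description above; the GRU cell, the layer sizes, and the zero-initialized decoder state are illustrative assumptions.

```python
# Minimal PyTorch sketch of the title-generation model: bidirectional RNN
# encoder, Luong-style global dot attention in the decoder [LPM15], SGD,
# dropout 0.3 [GG16]. Cell type and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class TitleGenerator(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden // 2, bidirectional=True,
                              batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden, batch_first=True)
        self.dropout = nn.Dropout(0.3)
        self.combine = nn.Linear(2 * hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        # Encoder states (B, S, H); for brevity the decoder starts from a
        # zero state instead of the encoder's final state.
        enc, _ = self.encoder(self.dropout(self.embed(src)))
        dec, _ = self.decoder(self.dropout(self.embed(tgt)))
        # Global dot attention: score each decoder state against all
        # encoder states, then build a per-step context vector.
        weights = torch.softmax(dec @ enc.transpose(1, 2), dim=-1)  # (B, T, S)
        context = weights @ enc                                     # (B, T, H)
        fused = torch.tanh(self.combine(torch.cat([context, dec], dim=-1)))
        return self.out(fused)          # vocabulary logits per target step

# model = TitleGenerator(vocab_size=50_000)
# optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
```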
6 Conclusion and Outlook

We plan to evaluate the usefulness of our tool in a user study in the near future. As a next step, we aim to use Elasticsearch³ to index and search additional document collections. The experience could further be enhanced with additional deep learning features, such as automatic POS tagging using sequence models, locating important parts of the text using attention networks, question answering, and text comprehension for visual inspection of and dynamic interaction with very long documents. We propose that by leveraging big data and deep learning, we can greatly improve the collaboration between content creators and consumers in today's age of information.

³ https://www.elastic.co/products/elasticsearch

References

[BGJM17] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. TACL, 5:135–146, 2017.

[CAMM16] David Corney, Dyaa Albakour, Miguel Martinez, and Samir Moussa. What do a million news articles look like? In Proceedings of the NewsIR'16 Workshop at ECIR 2016, Padua, Italy, March 20, 2016, pages 42–47, 2016.

[FPW+99] Eibe Frank, Gordon W. Paynter, Ian H. Witten, Carl Gutwin, and Craig G. Nevill-Manning. Domain-specific keyphrase extraction. In IJCAI '99, pages 668–673, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.

[GG16] Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In NIPS 2016, December 5–10, Barcelona, Spain, pages 1019–1027, 2016.

[GHMS13] Tim Gollub, Matthias Hagen, Maximilian Michel, and Benno Stein. From keywords to keyqueries: Content descriptors for the web. In SIGIR '13, pages 981–984, New York, NY, USA, 2013. ACM.

[JDJ17] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. CoRR, abs/1702.08734, 2017.

[LPM15] Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In EMNLP 2015, Lisbon, Portugal, September 17–21, 2015, pages 1412–1421, 2015.

[MSB+14] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In ACL System Demonstrations, pages 55–60, 2014.

[MZH+17] Rui Meng, Sanqiang Zhao, Shuguang Han, Daqing He, Peter Brusilovsky, and Yu Chi. Deep keyphrase generation. In ACL 2017, Vancouver, Canada, July 30 – August 4, Volume 1: Long Papers, pages 582–592, 2017.