=Paper=
{{Paper
|id=Vol-2079/paper6
|storemode=property
|title=Cross-Reading News
|pdfUrl=https://ceur-ws.org/Vol-2079/paper6.pdf
|volume=Vol-2079
|authors=Shahbaz Syed,Tim Gollub,Marcel Gohsen,Nikolay Kolyada,Benno Stein,Matthias Hagen
|dblpUrl=https://dblp.org/rec/conf/ecir/SyedGGKSH18
}}
==Cross-Reading News==
Cross-Reading News

Shahbaz Syed¹, Tim Gollub¹, Marcel Gohsen¹, Nikolay Kolyada¹, Benno Stein¹, Matthias Hagen²

¹ Bauhaus-Universität Weimar, @uni-weimar.de
² Martin-Luther-Universität Halle-Wittenberg, matthias.hagen@informatik.uni-halle.de

===Abstract===
Journalists often need to perform multiple actions, using different tools, in order to create content for publication. This involves searching the web, curating the result list, choosing relevant entities for the article, and writing. We aim to improve this pipeline with Cross-Reading News, a modular and extendable web application that helps journalists easily research and draft articles for publication. It combines information retrieval, natural language processing, and deep learning to provide a smart and focused workspace by leveraging the local collections of news articles maintained by media companies. Specifically, users can search for information in multiple local collections of news articles, gather named entities, extract keyqueries, find semantically related content, and obtain title suggestions through summarization.

===1 Introduction===
Journalists from various media organizations can be regarded as one of the key sources of new content on the web in the form of articles. However, generating new information often comprises searching, reusing, and curating information from different sources. A journalist usually has a vague idea of what he or she wants to write about and often consults an expert (here, a person specialized in the taxonomy of the local collection of articles) to obtain relevant documents. This information need is often hard to formulate as a concrete search query, making the process laborious to iterate. Also, journalists covering specific types of news, such as sports or politics, are required to write articles in a relatively short amount of time compared to those covering, say, editorials or literature. In such cases, a tool that provides faster access to relevant data can significantly speed up the workflow. Specifically, a collection of published articles, as maintained by most media companies today, is an excellent resource that can be leveraged to build intelligent tools by applying natural language processing and information retrieval. Although querying the web is also a possibility for satisfying an information need, it is often overwhelming, as the journalist is bombarded with excessive results that are both distracting and time-consuming to refine further. Using an existing collection can provide a more focused search environment and also ease the process of finding similar content already published by fellow journalists. This significantly simplifies author citations and makes it easier to find duplicate or similar articles to be reused.

In this paper, we present our application, which integrates multiple features: a local collection of published articles that have been preprocessed to recognize named entities and semantically similar paragraphs, a search engine for querying this collection, an automatic summarization model that provides title suggestions for a piece of text (which can also be used as keyphrases), and finally, a text editor to draft articles.

===2 Corpus and preprocessing===
We use the Signal Media corpus [CAMM16] as the local data source for our search engine. Before indexing the documents, we perform the following preprocessing steps: First, if a document is longer than 400 words, we split the text into paragraphs using NLTK (https://www.nltk.org). Second, we perform named entity recognition using Stanford NER [MSB+14] to annotate entities for each paragraph. Specifically, we extract PERSON, ORGANIZATION, LOCATION, FACILITY, and GPE (Geo-Political Entity) entities and add this information to our document store based on RocksDB (http://rocksdb.org).
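As an illustration of these first two steps, the sketch below splits long documents into roughly paragraph-sized chunks, tags them with Stanford NER, and stores the annotations in RocksDB. It is a minimal sketch under our own assumptions, not the paper's exact pipeline: the chunking heuristic, the NER model file, and the RocksDB key layout are illustrative choices (NLTK has no dedicated paragraph splitter, and the available entity labels depend on the NER model used).

```python
# Illustrative preprocessing sketch (chunking heuristic, NER model file, and
# RocksDB key layout are assumptions, not the paper's exact choices).
import json
import rocksdb                                  # python-rocksdb bindings
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tag import StanfordNERTagger

MAX_WORDS = 400
KEEP_LABELS = {"PERSON", "ORGANIZATION", "LOCATION", "FACILITY", "GPE"}

tagger = StanfordNERTagger("english.all.3class.distsim.crf.ser.gz",
                           "stanford-ner.jar")  # placeholder paths
db = rocksdb.DB("news-store.db", rocksdb.Options(create_if_missing=True))

def split_into_paragraphs(text, max_words=MAX_WORDS):
    """Greedily group sentences into chunks of at most max_words words."""
    chunks, current, length = [], [], 0
    for sent in sent_tokenize(text):
        n = len(word_tokenize(sent))
        if current and length + n > max_words:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(sent)
        length += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def preprocess(doc_id, text):
    """Split a long document, tag entities, and store one record per paragraph."""
    paragraphs = split_into_paragraphs(text) if len(text.split()) > MAX_WORDS else [text]
    for i, para in enumerate(paragraphs):
        tokens = word_tokenize(para)
        entities = [tok for tok, label in tagger.tag(tokens) if label in KEEP_LABELS]
        record = {"text": para, "entities": entities}
        db.put(f"{doc_id}:{i}".encode(), json.dumps(record).encode())
```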
Finally, in order to compute paragraph similarity, we create a 300-dimensional vector for each paragraph from an embedding model trained on the whole corpus using fastText [BGJM17]. A paragraph vector is simply the average of its constituent word vectors. As the number of paragraphs is large (1.5 million), we perform the similarity search as part of our preprocessing in order to find similar paragraphs based on cosine similarity, and we use Faiss [JDJ17] to speed up the search in this large vector space. We empirically chose to store only up to 3 similar paragraphs for each paragraph (if available).
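The paragraph-similarity step can be sketched as follows. This is an illustrative reconstruction under our own assumptions (corpus file name, skip-gram settings, and the flat inner-product index are ours); the paper only states that averaged fastText vectors and Faiss-based cosine similarity were used.

```python
# Sketch: averaged fastText paragraph vectors + Faiss cosine-similarity search.
# Assumptions: corpus file name, skip-gram settings, and the flat index type.
import numpy as np
import fasttext
import faiss

DIM, TOP_K = 300, 3

# Train word embeddings on the whole corpus (one preprocessed text per line).
model = fasttext.train_unsupervised("signal_corpus.txt", model="skipgram", dim=DIM)

def paragraph_vector(paragraph):
    """Average the fastText vectors of the paragraph's tokens."""
    words = paragraph.split()
    vecs = [model.get_word_vector(w) for w in words] or [np.zeros(DIM, dtype="float32")]
    return np.mean(vecs, axis=0)

def top_similar(paragraphs):
    """Return the TOP_K most similar paragraphs (cosine similarity) for each one."""
    matrix = np.vstack([paragraph_vector(p) for p in paragraphs]).astype("float32")
    faiss.normalize_L2(matrix)       # cosine similarity == inner product on unit vectors
    index = faiss.IndexFlatIP(DIM)
    index.add(matrix)
    # Search k+1 neighbours because the nearest neighbour of a paragraph is itself.
    _, neighbours = index.search(matrix, TOP_K + 1)
    return {i: [j for j in row if j != i][:TOP_K] for i, row in enumerate(neighbours)}
```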
===3 Gather relevant entities===
The first stage of writing an article is primarily collecting information about a specific subject or event. To facilitate looking up documents from a local collection, we built a search engine that uses three standard retrieval models: tf, tf·idf, and BM25. As an initialization step, the user can choose which retrieval model the application should use.

Intuitively, as users of search engines, we go through some of the retrieved documents and identify the entities that might be of interest to our subject. Taking this as a motivation, we designed our first stage to return a list of named entities related to a query instead of the full documents themselves. By providing such a list (all entities from the top 20 documents returned by the selected retrieval model), we help users to easily decide which of these entities to include in their article. Selected entities can be organized into sections, which already provides a structure that serves as a guiding map in the subsequent stages of the application. At any time, the user can navigate back to this step in order to add more entities to their sections.

Figure 1: Search and gather related text

===4 Gather related text===
Using the entity list from the previous stage, the user can now start searching for information with better focus. It is possible to quickly perform a document search for any section with the click of a button. To facilitate this, we combine all the entities in a given section to form a search query. It is also possible to perform a free search with any other query. Results from this search can be filtered in order to select content from specific publishers, and each paragraph in the displayed result list can be quickly added to the draft, as shown in Figure 1. This speeds up the process of collecting and reusing existing content.

Furthermore, for each main paragraph we show the top k similar paragraphs (if available) according to the cosine similarity scores computed for the paragraph vectors as part of our preprocessing (we chose k=3). This helps the user quickly find related content without having to read a full document in order to reuse a portion of it.
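To make this retrieval workflow concrete, the following sketch scores documents with BM25, collects the precomputed entities of the top 20 results, and joins the entities of a user-defined section into a new query. The rank_bm25 package is our stand-in (the paper does not name its retrieval implementation), and all function and variable names are illustrative.

```python
# Sketch of the entity-gathering (Section 3) and section-query (Section 4) steps.
# rank_bm25 stands in for the paper's unnamed retrieval implementation.
from rank_bm25 import BM25Okapi

TOP_N_DOCS = 20

def build_index(documents):
    """documents: list of dicts with 'text' and precomputed 'entities' (Section 2)."""
    tokenized = [doc["text"].lower().split() for doc in documents]
    return BM25Okapi(tokenized)

def gather_entities(query, bm25, documents):
    """Return the named entities found in the top 20 documents for the query."""
    scores = bm25.get_scores(query.lower().split())
    top_ids = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:TOP_N_DOCS]
    entities = []
    for i in top_ids:
        entities.extend(documents[i]["entities"])
    return sorted(set(entities))

def section_query(section_entities):
    """Combine all entities of a section into a single search query."""
    return " ".join(section_entities)

# Usage (hypothetical section contents): entities the user dropped into a
# "background" section become one query, which in turn yields more entities.
# bm25 = build_index(documents)
# query = section_query(["Kalamazoo", "fire department", "apartment complex"])
# related_entities = gather_entities(query, bm25, documents)
```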
===5 Generate keyphrases===
Keyphrases for unlabeled documents are an effective source for clustering, topic search, and summarization [FPW+99]. Motivated by the idea of using keyqueries as document descriptions [GHMS13], we use deep learning to automatically generate keyphrases for both existing paragraphs and any new content written by the user (Figure 2). We believe that a keyphrase can serve as a title suggestion for the text and can also be transformed into a keyquery in order to find additional semantically related documents.

We cast the task of generating a keyphrase for a text as an abstractive summarization task using neural generative models. This allows the model not only to gain a semantic understanding of the text, but also to generate novel words for the title, conveying important information selected from the semantic space. A sample of generated summaries is presented in Table 1. While a similar approach for generating keyphrases for scientific articles was adopted by [MZH+17], we only use the title of each document as the target.

Figure 2: Title generation using summarization

We use a paragraph as the source text to be summarized, and the title of the article to which it belongs as the target summary to train our model. Our training set consists of 1.29 million pairs, and the validation set has 68,000 pairs. We use a bidirectional RNN encoder with global dot attention [LPM15] in the decoder, stochastic gradient descent as the optimizer, and a dropout value of 0.3 as suggested by [GG16].

Table 1: Gold (G) vs. predicted (P) summaries
* G: Kzoo apartment fire prompts large emergency response / P: Crews respond to fire at Kalamazoo apartment complex
* G: Man accused of rape refused bail / P: Bundaberg man accused of sexually abusing women
* G: Third Ukrainian policeman dies from injuries after clashes in Kiev, more than 140 people in hospital / P: Ukrainian policeman dies in Kiev battle
* G: Duquesne beats Bucknell 26-7 / P: Duquesne Rolls Past Bucknell
* G: Taylor ton helps England keep series alive / P: Taylor century helps England beat Australia

===6 Conclusion and Outlook===
We plan to evaluate the usefulness of our tool with a user study in the near future. As a next step, we aim to use Elasticsearch (https://www.elastic.co/products/elasticsearch) to index and search additional document collections. It is also possible to enhance the experience with further deep-learning-based features, such as automatic POS tagging using sequence models, locating important parts of the text using attention networks, question answering, text comprehension for visual inspection, and dynamic interaction with very long documents. We propose that by leveraging big data and deep learning, we can greatly improve the collaboration between content creators and consumers in today's age of information.

===References===
* [BGJM17] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. TACL, 5:135–146, 2017.
* [CAMM16] David Corney, Dyaa Albakour, Miguel Martinez, and Samir Moussa. What do a million news articles look like? ECIR 2016, Padua, Italy, March 20, 2016, pages 42–47, 2016.
* [FPW+99] Eibe Frank, Gordon W. Paynter, Ian H. Witten, Carl Gutwin, and Craig G. Nevill-Manning. Domain-specific keyphrase extraction. IJCAI '99, pages 668–673, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
* [GG16] Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. NIPS 2016, December 5–10, Barcelona, Spain, pages 1019–1027, 2016.
* [GHMS13] Tim Gollub, Matthias Hagen, Maximilian Michel, and Benno Stein. From keywords to keyqueries: Content descriptors for the web. SIGIR '13, pages 981–984, New York, NY, USA, 2013. ACM.
* [JDJ17] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. CoRR, abs/1702.08734, 2017.
* [LPM15] Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. EMNLP 2015, Lisbon, Portugal, September 17–21, 2015, pages 1412–1421, 2015.
* [MSB+14] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. ACL System Demonstrations, pages 55–60, 2014.
* [MZH+17] Rui Meng, Sanqiang Zhao, Shuguang Han, Daqing He, Peter Brusilovsky, and Yu Chi. Deep keyphrase generation. ACL 2017, Vancouver, Canada, July 30 – August 4, Volume 1: Long Papers, pages 582–592, 2017.