Boolean Queries for News Monitoring: Suggesting New Query Terms to Expert Users

Suzan Verberne, Radboud University, Nijmegen, the Netherlands, s.verberne@cs.ru.nl
Thymen Wabeke, TNO, The Hague, the Netherlands, thymen.wabeke@tno.nl
Rianne Kaptein, TNO, The Hague, the Netherlands, rianne.kaptein@tno.nl

Abstract

In this paper, we evaluate query suggestion for Boolean queries in a news monitoring system. Users of this system receive news articles that match their running query on a daily basis. Because the news for a topic continuously changes, the queries need regular updating. We first investigated the users' working process through interviews and then evaluated multiple query suggestion methods based on pseudo-relevance feedback. The best performing method generates at least one relevant term among 5 suggestions for 25% of the searches. We found that expert users of news retrieval software are critical in their selection of query terms. Nevertheless, they judged the demo application as clear and potentially useful in their work.

Copyright (c) 2016 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: M. Martinez, U. Kruschwitz, G. Kazai, D. Corney, F. Hopfgartner, R. Campos and D. Albakour (eds.): Proceedings of the NewsIR'16 Workshop at ECIR, Padua, Italy, 20 March 2016, published at http://ceur-ws.org

1 Introduction

LexisNexis Publisher (http://www.lexisnexis.com/bis-user-information/publisher/) is an online tool for news monitoring. Hundreds of organizations in Europe and the US use the tool to collect news articles relevant to their work. An organization typically monitors multiple topics. For monitoring the news for a user-defined topic, LexisNexis Publisher takes a Boolean query as input, together with a selection of news sources and a date range. Two example queries can be found in Table 1.

Table 1: Examples of Boolean queries

Topic: Products in the News
Query: "Output Campaign Manager" or "TransPromo" or "Output Wrap Envelope" or "Adsert" or "OptiMail" or "ePriority" or "output Address Direct" or "PredictionPro" or "offmydesk"

Topic: Diversity
Query: Diversity /2 inclusion OR "equal employment" or discrimination or harassment or race or gender or religion or "national origin" or disability

Interviews with users of LexisNexis Publisher indicate that noise in the set of retrieved documents is not very problematic because the user has the option to disregard irrelevant documents in the selection, thereby controlling precision. Recall is more difficult to control because the user does not know the documents that were not found. For the user, it is important that no relevant news stories are missed. Therefore, the query needs to be extended when there are changes to the topic. This can happen when new terminology becomes relevant for the topic (e.g. 'wolf' for the topic 'biodiversity'), when there is a new stakeholder (e.g. the name of the new minister of economic affairs for the topic 'industry and ICT'), or when new geographical names are relevant to the topic (e.g. 'Lesbos' for the topic 'refugees'). The goal of the current work is to support users of news monitoring applications by providing them with suggestions for new query terms in order to retrieve more relevant news articles.

Our intuition is that documents that are relevant but not retrieved for the current query have similarities with the documents that are retrieved for the current query. Therefore, our approach to query suggestion is to generate candidate query terms from the set of retrieved documents.

In this paper, we present the results of a user study in which we evaluate our methodology for query term suggestion with 9 expert users of LexisNexis Publisher. We first conducted interviews with the users to collect their wishes and needs. Then we developed a demo application for news retrieval with query term suggestion functionality. We used this application to evaluate our approach and compare 12 different methods for query term suggestion.
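To make the query format of Table 1 concrete: a Boolean query acts as a filter over documents, combining single terms, quoted phrases and proximity constraints (such as /2, two terms within two word positions of each other) with OR and NOT. The following minimal sketch shows how such a filter could be evaluated against a news abstract. It is an illustration only, with a simplified tokenizer and proximity operator, and is not LexisNexis Publisher's actual implementation.

```python
import re

def tokenize(text):
    """Lowercase a news abstract and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def contains_phrase(tokens, term):
    """True if the (possibly multi-word) term occurs as a contiguous token sequence."""
    words = term.lower().split()
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))

def within_distance(tokens, term1, term2, k):
    """Simplified '/k' proximity operator: both terms occur at most k positions apart."""
    pos1 = [i for i, tok in enumerate(tokens) if tok == term1.lower()]
    pos2 = [i for i, tok in enumerate(tokens) if tok == term2.lower()]
    return any(abs(i - j) <= k for i in pos1 for j in pos2)

# Hypothetical abstract, matched against a simplified version of the
# 'Diversity' query from Table 1.
tokens = tokenize("The board discussed diversity and inclusion and equal employment.")
or_terms = ["equal employment", "discrimination", "harassment", "race",
            "gender", "religion", "national origin", "disability"]
matches = (within_distance(tokens, "diversity", "inclusion", 2)
           or any(contains_phrase(tokens, t) for t in or_terms))
print(matches)  # True: 'diversity' and 'inclusion' occur within two positions
```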
2 Related work

The task of spotting novel terms in a news stream is related to research on topic detection and tracking (TDT), which has its roots in the 1990s [2, 1]. TDT aims to automatically detect new topics or events in temporally ordered news streams, and to find new stories on already known topics. The functionality of LexisNexis Publisher is related to news tracking in TDT: the topic is given (in the form of a query) and the tool is expected to find relevant new stories in the news stream [14]. More recent work on TDT is directed at topic tracking in microblog data (Twitter) [10, 5]. Microblog data, like news data, is temporally ordered data that continuously changes.

Our approach to query suggestion – generating candidate query terms from the set of retrieved documents – is related to pseudo-relevance feedback [3], a method for query expansion that assumes that the top-k retrieved documents are relevant, extracting terms from those documents and adding them to the query. Pseudo-relevance feedback has been applied to microblog retrieval, expanding the user query with related terms from retrieved posts to improve recall [6, 8]. When selecting terms from Twitter and news data, it is important to take into account that the language use around a topic continuously evolves. One option is to give a higher score to terms that are temporally closer to query time [6]. Our approach to query term suggestion is related to this idea: we aim to find the terms that are prominent in the most recent news articles on a topic.

There are two key differences between pseudo-relevance feedback and our approach. First, instead of adding terms blindly, we provide the user with suggestions for query adaptation. Second, we deal with Boolean queries, which implies that we do not have a relevance ranking of documents to extract terms from. This means that the premise of 'pseudo-relevance' may be weak for the set of retrieved documents.
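To make this contrast concrete, the sketch below shows textbook pseudo-relevance feedback in its simplest form, with plain term frequency standing in for the more refined term scoring of [3]; the comments mark the two assumptions that do not hold in our setting.

```python
from collections import Counter

def prf_expand(query_terms, ranked_docs, k=10, n_expansion=5):
    """Classic pseudo-relevance feedback: assume the top-k *ranked* documents
    are relevant, score candidate terms by frequency in those documents, and
    append the best-scoring terms to the query automatically."""
    counts = Counter()
    for doc in ranked_docs[:k]:  # requires a relevance ranking
        counts.update(t for t in doc.lower().split() if t not in query_terms)
    return query_terms + [t for t, _ in counts.most_common(n_expansion)]

# In our setting both assumptions fail: Boolean retrieval yields an unranked
# result set (there is no meaningful top-k), and terms must be presented to
# the user as suggestions instead of being added blindly.
```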
3 Interviews with expert users

We conducted interviews with three experienced users of LexisNexis Publisher to get to know their way of working, their priorities and their wishes for query assistance. The following paragraphs summarize the insights obtained during these interviews.

Way of working. Queries are not changed frequently; most attention is paid to the initial query. Formulating this query takes several hours up to a whole day. Query constructions with Boolean operators are often re-used, for example to exclude specific sources or newspaper sections. If a query gives too much noise, exclusions are added (using the 'NOT' operator). If a query gives too few results, new terms are added (with the 'OR' operator). Changes that are made at a later stage are often changes in person and place names. Some customers have difficulties formulating good Boolean queries; these customers make use of information specialists at LexisNexis to formulate their queries.

Priorities. The experts we interviewed use LexisNexis Publisher to create newsletters for their organization. Typically, they review all the retrieved articles before deciding which are included in the newsletter. This selection is based on redundancy and relevance; in case of overlapping news articles, the longest story from the most reliable source is selected. This is done manually, as it allows users to control the precision of the news articles included in the newsletter. The users indicate that for this reason, it is especially important that no relevant documents are missed by the search. Noise in the result set is not so much of an issue; if half of the retrieved articles are relevant, the users are satisfied.

Wishes for query assistance. Users indicate that assistance in query formulation could be helpful, not only when adapting existing queries, but especially when formulating new queries. The users mention assistance in the form of: (a) suggestions of new query terms; (b) suggestions for deleting query terms that give too much noise; (c) suggestions for deleting query terms that give very few results. Of these three tasks, we concentrated on the first: suggesting potential new query terms. One requirement posed by the users is that the user keeps full control over the query: terms should not be added blindly, but presented as suggestions.

4 Methodology

Our approach to query suggestion is to generate candidate query terms from the set of retrieved documents. (A query term may consist of multiple words.) The central methodology needed for generating terms from a document collection is term scoring: each candidate term from the document collection is assigned a score that allows for selecting the best – most descriptive – terms. The term scoring methods that we use are defined below.

Problem definition. We have a text collection D (the 'foreground collection') consisting of one or more documents. Our goal is to generate a list of terms T with for each t ∈ T a score that indicates how descriptive t is for D. Each t is a sequence of n non-stopwords; we use n = {1, 2, 3} in our experiments.

In most term scoring methods, descriptiveness is determined by comparing the relative frequency of t in the foreground collection D to the relative frequency of t in a background collection. For a given Boolean query, we retrieve the result set R_recent, which is the set of articles published in the last 30 days, and the result set R_older, which is the set of articles published 60 to 30 days ago.
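As a generic illustration of this problem definition (not yet one of the four specific algorithms defined below), the following sketch generates candidate n-gram terms and scores them by contrasting relative frequencies in a foreground and a background collection. The stopword list and the add-one smoothing are illustrative assumptions.

```python
from collections import Counter

STOPWORDS = {"de", "het", "een", "en", "van", "in", "op"}  # toy Dutch stopword list

def candidate_terms(tokens, max_n=3):
    """Candidate terms: sequences of n = 1..max_n consecutive non-stopwords
    (a simplification of the problem definition above)."""
    content = [t for t in tokens if t not in STOPWORDS]
    for n in range(1, max_n + 1):
        for i in range(len(content) - n + 1):
            yield " ".join(content[i:i + n])

def descriptiveness(fg_tokens, bg_counts, bg_size):
    """Score each candidate t by the ratio of its relative frequency in the
    foreground collection D to its (add-one smoothed) relative frequency in
    the background collection; higher means more descriptive of D."""
    fg_counts = Counter(candidate_terms(fg_tokens))
    fg_size = sum(fg_counts.values())
    return {t: (c / fg_size) / ((bg_counts.get(t, 0) + 1) / bg_size)
            for t, c in fg_counts.items()}
```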
[Figure 1: Schematic view of how the term lists are generated. Term list T1 contrasts the news for the topic of the last 30 days with a generic corpus of Dutch newspapers; T2 contrasts the news of the last 30 days with the news of 30-60 days ago; T3 contrasts the news of 30-60 days ago with the generic corpus. The query suggester returns one of three term lists to the user: A = T1; B = T2; C = {t : t ∈ T1 ∧ t ∉ T3}.]

Methods for generating descriptive terms. We compare three methods for generating the most relevant query terms (see Figure 1 for a schematic overview):

A. Return the top-k terms from T1, generated using R_recent as the foreground collection and a generic news corpus as background collection (we used the newspaper section of the Dutch SoNaR corpus [9], 50 million words in total, available at http://tst-centrale.org/producten/corpora/sonar-corpus/6-85);

B. Return the top-k terms from T2, generated using R_recent as foreground collection and R_older as background collection;

C. First generate T3, using R_older as foreground collection and the generic news corpus as background collection. Then return the top-k terms from the set {t : t ∈ T1 ∧ t ∉ T3} (all terms from T1 that are not in T3).

Term scoring algorithms. We implemented four term scoring algorithms from the literature and compare them on the task of generating potential query terms from the set of retrieved documents:

• Parsimonious Language Models (PLM) [4], designed for creating document models in Information Retrieval. In PLM, the term frequency of each t in D is weighted with the frequency of t in the background collection using an expectation-maximization algorithm;

• Kullback-Leibler divergence for informativeness and phraseness (KLIP) [12]. Informativeness is determined by comparing the relative frequency of t in D to its relative frequency in the background collection. Phraseness is determined by comparing the frequency of t as a whole to the frequencies of the unigrams that the n-gram t is composed of. The informativeness and phraseness of t are summed to obtain a relevance score for t;

• Frequency profiling (FP) [11], designed for contrasting two separate corpora. This method uses a log-likelihood function based on the expected and observed frequencies of a term in both corpora (the foreground and background collections; a sketch follows at the end of this section);

• Co-occurrence-based χ² (CB) [7], which determines the relevance of t in the foreground collection by the distribution of co-occurrences of t with frequent terms in the collection itself. The rationale of this method is that no background corpus is needed, because the set of most frequent terms from the foreground collection serves as background corpus.

For one query and the corresponding retrieved documents, we generate twelve lists of potential query terms: three approaches (A-C) combined with four term scoring algorithms.
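Of the four algorithms, frequency profiling has the most compact definition. The sketch below implements its log-likelihood statistic following [11], together with the set difference that method C applies to the resulting term lists; the corpus-count interfaces are simplified assumptions, not our production implementation.

```python
import math

def fp_loglikelihood(a, b, c, d):
    """Frequency-profiling log-likelihood [11] for a term occurring a times
    in the foreground corpus (c tokens in total) and b times in the
    background corpus (d tokens in total)."""
    e1 = c * (a + b) / (c + d)  # expected foreground frequency
    e2 = d * (a + b) / (c + d)  # expected background frequency
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def top_k_terms(fg_counts, fg_size, bg_counts, bg_size, k=5):
    """Rank all foreground terms by FP score and return the top k; the other
    three algorithms would plug into this same selection step."""
    score = lambda t: fp_loglikelihood(fg_counts[t], bg_counts.get(t, 0),
                                       fg_size, bg_size)
    return sorted(fg_counts, key=score, reverse=True)[:k]

def method_c(t1, t3, k=5):
    """Method C: top-k terms that are descriptive of the recent news (T1)
    but were not already descriptive of the older news (T3)."""
    return [t for t in t1 if t not in set(t3)][:k]
```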
5 Experiment and results

We collected feedback from expert users of LexisNexis Publisher to determine the best method for generating term suggestions. For this purpose, we developed an external demo application for news retrieval from the LexisNexis collection that includes query term suggestion functionality. Note that the query term suggestion functionality was not integrated in the existing LexisNexis search interface, but implemented as a standalone web application. Figure 2 shows a screenshot of the demo application (a video demonstrating the application can be viewed at https://youtu.be/4yIYpvHVugQ). The user interface is in Dutch. In the top part of the screen ('Zoekopdracht bewerken' - 'Edit search'), the user sees the current query and the results ('Resultaten') retrieved for that query. In total, 1110 results were retrieved for this query. In the bottom part of the screen ('Query aanpassen' - 'Adapt query'), the user sees a list of term suggestions. This example illustrates the final functionality, in which only the 5 suggestions by the best performing method are shown. In the experimental setting, the user saw a pool of 10-25 terms from different methods.

[Figure 2: A screenshot illustrating the functionality of the demo application for query term suggestion.]

5.1 Evaluation design

The query suggestion software was evaluated by 9 individual users of LexisNexis Publisher. A 2-hour evaluation session was organized for each participant. The interviews described in Section 3 revealed that queries change more frequently when they are novel. Therefore, each participant was asked to perform two different tasks with the assistance of our demo application during the evaluation session. In the first task, the participant is asked to update a query that is already being used by their company. In the second task, the participant designs a new query for a topic for which they received a short topic description.

The initial (existing or new) Boolean query is issued in LexisNexis Publisher through its API, searching in Dutch newspapers of the last 60 days (the maximum date range allowed by the API). The titles and abstracts of the matching news articles are shown in a result list (in chronological order) and a list of query term suggestions is presented. The participant reviews the set of retrieved documents and improves the query by adding and/or removing terms, optionally using a term from the suggestions. Subsequently, the updated query is issued and the query can be improved again. In both tasks, the participant was asked to review and update the query up to a maximum of five iterations. After the complete evaluation session, the participants filled in a post-experiment questionnaire, in which they could provide additional comments.

5.2 Data

The participants issued 83 searches in total. The Boolean queries are long: 45 terms on average. Terms can be single words or phrases (multi-word terms), and they are combined with Boolean operators. We used the LexisNexis Publisher API to retrieve documents (news articles) published in the last 60 days. On average, 1,031 documents were retrieved per query (ranked by date), with an average length of 63 words. The short document length is caused by the API allowing us to extract only the summary of a news article, not the full text. This means that the size of the subcollection from which potential new query terms are extracted for a query is on average 1,031 * 63 = 64,953 words.

We created a pool of terms from the 12 (3 approaches * 4 term scoring algorithms) term lists per topic. We assume that in a real application, the query suggestion software would show five candidate terms to the user, and we want to be able to evaluate these 5 suggestions for each method. Therefore, the top 5 terms from each term list were added to the pool. The maximum number of terms in a pool is 60 (12 * 5), but in reality there is quite some overlap: the number of terms per pool is between 10 and 25. For each query, the participants were presented with this pool of 10-25 terms. The terms were ranked by the number of top-5 lists they appear in: the terms that were extracted by most methods were ranked at the top of the pool.
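Assuming the twelve ranked term lists are available, the pooling procedure described above amounts to the following sketch; the function name is ours, not part of the evaluated system.

```python
from collections import Counter

def build_pool(term_lists, top=5):
    """Pool the top-5 terms of each of the 12 method lists and order the
    pooled terms by the number of top-5 lists they appear in, so that terms
    extracted by the most methods are shown first."""
    membership = Counter()
    for ranked_terms in term_lists:        # one ranked list per method
        for t in set(ranked_terms[:top]):  # its top-5, deduplicated
            membership[t] += 1
    return [t for t, _ in membership.most_common()]
```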
5.3 Experimental results

The selection of query terms and the relevance judgments for the suggested terms in the pool allow us to evaluate and compare the methods. For each method, we have judgments for the 5 highest scoring terms. We count how often one of these terms was selected by a participant, and how often at least one of these terms received a relevance rating of at least 4. The results are in Table 2 and Table 3. The best performing methods are method A with either FP or KLIP as term scoring algorithm, and method C with KLIP. With these methods, participants selected a term from the top-5 suggestions for 13% of the searches, and judged at least one term from the top-5 suggestions as relevant (relevance score >= 4) for 25% of the searches. The average rating given to the terms in the pool was low: 1.36 on a 5-point scale.

Table 2: Results per method in terms of 'selected-success-rate': the percentage of searches for which participants added a term from the top-5 to the query.

                             CB     FP     KLIP   PLM
A = T1                       10%    13%    11%    11%
B = T2                       10%     7%     6%     6%
C = {t : t ∈ T1 ∧ t ∉ T3}    10%     0%    11%    11%

Table 3: Results per method in terms of 'relevant-success-rate': the percentage of searches for which participants judged at least one term from the top-5 as relevant (relevance score >= 4).

                             CB     FP     KLIP   PLM
A = T1                       14%    24%    25%    20%
B = T2                       14%    11%    13%     5%
C = {t : t ∈ T1 ∧ t ∉ T3}    14%    11%    25%    20%

Further analysis of the results showed that the term suggestions were noisy because the sets of retrieved documents are noisy. The Boolean queries return a large set of documents (more than a thousand on average for the last 60 days), without any relevance ranking. The interviews with the users indicated that this is not a problem for the users themselves (they filter the news items for the newsletter), but it turns out to be a problem for the extraction of relevant terms. In other words, the premise of 'pseudo-relevance' does not hold for Boolean retrieval, and this hurts the quality of query term suggestion based on retrieved documents.
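For reference, the two measures reported in Tables 2 and 3 can be computed per method as in the sketch below; the per-search record layout is a hypothetical reconstruction of the logged session data.

```python
def success_rates(searches, method, threshold=4):
    """Compute the two measures behind Tables 2 and 3 for one method:
    selected-success-rate - fraction of searches where the participant added
                            a term from the method's top-5 to the query;
    relevant-success-rate - fraction of searches where at least one top-5
                            term was rated >= threshold (5-point scale)."""
    selected = relevant = 0
    for s in searches:
        top5 = s["suggestions"][method]  # the method's 5 suggested terms
        if any(t in s["added_terms"] for t in top5):
            selected += 1
        if any(s["ratings"].get(t, 0) >= threshold for t in top5):
            relevant += 1
    n = len(searches)
    return selected / n, relevant / n
```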
5.4 Qualitative feedback

In the post-experiment questionnaire, participants indicated that the demo application was clear and intuitive (median score of 4 on a 5-point scale for the statement 'the web application is clear'). Half of the participants would be interested in using the tool. However, they felt that the quality of the terms should be improved for the application to be really useful. Suggestions provided by the users included:

• Do not suggest terms that are already covered by wildcards in the query. We implemented this improvement in the final version of the demo application.

• Terms that occur in important parts of the text should be considered more relevant. In fact, this was already taken into account, because the API only allowed us to access the abstracts of the documents.

• Multi-word terms should not be suggested. This comment appeared to be in contrast with the users' term selections: of the 15 terms selected by the users, the majority (12) are multi-word terms.

• Add suggestions for the use of Boolean operators. This was beyond the scope of the current project, which focused on term suggestion.

6 Conclusions

The results of our user experiment show that with the best performing method, participants selected a term from the top-5 suggestion list for 13% of the searches, and judged at least one term as relevant for 25% of the searches. Inspection of the results and the post-task questionnaire revealed that the term suggestions are noisy, mainly because the set of retrieved documents for the Boolean query is noisy. We expect that the use of relevance ranking instead of Boolean retrieval, combined with post-filtering of noisy terms, will give better user satisfaction.

The relevance judgments for the suggested terms are low compared to another application area for term extraction that we addressed in previous work with the same methodology, namely author profiling [13]. This can partly be explained by the noise in the set of retrieved documents (irrelevant documents lead to irrelevant terms), but may also be caused by expert users of news retrieval software being critical in their selection of query terms. This shows that it is valuable to evaluate query suggestion technology with real users.

References

[1] Allan, J.: Topic Detection and Tracking: Event-based Information Organization. Volume 12. Springer Science & Business Media (2002)

[2] Allan, J., Papka, R., Lavrenko, V.: On-line new event detection and tracking. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM (1998) 37-45

[3] Cao, G., Nie, J.Y., Gao, J., Robertson, S.: Selecting good expansion terms for pseudo-relevance feedback. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM (2008) 243-250

[4] Hiemstra, D., Robertson, S., Zaragoza, H.: Parsimonious language models for information retrieval. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM (2004) 178-185

[5] Lin, J., Snow, R., Morgan, W.: Smoothing techniques for adaptive online language models: topic tracking in tweet streams. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM (2011) 422-429

[6] Massoudi, K., Tsagkias, M., de Rijke, M., Weerkamp, W.: Incorporating query expansion and quality indicators in searching microblog posts. In: Advances in Information Retrieval, Springer (2011) 362-367

[7] Matsuo, Y., Ishizuka, M.: Keyword extraction from a single document using word co-occurrence statistical information. International Journal on Artificial Intelligence Tools 13(01) (2004) 157-169

[8] Metzler, D., Cai, C., Hovy, E.: Structured event retrieval over microblog archives. In: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics (2012) 646-655

[9] Oostdijk, N., Reynaert, M., Monachesi, P., Van Noord, G., Ordelman, R., Schuurman, I., Vandeghinste, V.: From D-Coi to SoNaR: a reference corpus for Dutch. In: LREC (2008)

[10] Phuvipadawat, S., Murata, T.: Breaking news detection and tracking in Twitter. In: Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on. Volume 3, IEEE (2010) 120-123

[11] Rayson, P., Garside, R.: Comparing corpora using frequency profiling. In: Proceedings of the Workshop on Comparing Corpora, Association for Computational Linguistics (2000) 1-6

[12] Tomokiyo, T., Hurst, M.: A language model approach to keyphrase extraction. In: Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, Volume 18, Association for Computational Linguistics (2003) 33-40

[13] Verberne, S., Sappelli, M., Kraaij, W.: Term extraction for user profiling: evaluation by the user. In: UMAP Workshops (2013)

[14] Yamron, J., Carp, I., Gillick, L., Lowe, S., Van Mulbregt, P.: Topic tracking in a news stream. In: Proceedings of the DARPA Broadcast News Workshop (1999) 133-136