Boolean Queries for News Monitoring: Suggesting New Query Terms to Expert Users

Suzan Verberne, Radboud University, Nijmegen, the Netherlands, s.verberne@cs.ru.nl
Thymen Wabeke, TNO, The Hague, the Netherlands, thymen.wabeke@tno.nl
Rianne Kaptein, TNO, The Hague, the Netherlands, rianne.kaptein@tno.nl

Abstract

In this paper, we evaluate query suggestion for Boolean queries in a news monitoring system. Users of this system receive news articles that match their running query on a daily basis. Because the news for a topic continuously changes, the queries need regular updating. We first investigated the users' working process through interviews and then evaluated multiple query suggestion methods based on pseudo-relevance feedback. The best performing method generates at least one relevant term among 5 suggestions for 25% of the searches. We found that expert users of news retrieval software are critical in their selection of query terms. Nevertheless, they judged the demo application as clear and potentially useful in their work.

Copyright (c) 2016 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: M. Martinez, U. Kruschwitz, G. Kazai, D. Corney, F. Hopfgartner, R. Campos and D. Albakour (eds.): Proceedings of the NewsIR'16 Workshop at ECIR, Padua, Italy, 20 March 2016, published at http://ceur-ws.org

1 Introduction

LexisNexis Publisher (http://www.lexisnexis.com/bis-user-information/publisher/) is an online tool for news monitoring. Hundreds of organizations in Europe and the US use the tool to collect news articles relevant to their work. An organization typically monitors multiple topics. For monitoring the news for a user-defined topic, LexisNexis Publisher takes a Boolean query as input, together with a selection of news sources and a date range. Two example queries can be found in Table 1.

Table 1: Examples of Boolean queries

Topic: Products in the News
Query: "Output Campaign Manager" or "TransPromo" or "Output Wrap Envelope" or "Adsert" or "OptiMail" or "ePriority" or "output Address Direct" or "PredictionPro" or "offmydesk"

Topic: Diversity
Query: Diversity /2 inclusion OR "equal employment" or discrimination or harassment or race or gender or religion or "national origin" or disability

Interviews with users of LexisNexis Publisher indicate that noise in the set of retrieved documents is not very problematic because the user has the option to disregard irrelevant documents in the selection, thereby controlling precision. Recall is more difficult to control because the user does not know the documents that were not found. For the user, it is important that no relevant news stories are missed. Therefore, the query needs to be extended when there are changes to the topic. This can happen when new terminology becomes relevant for the topic (e.g. 'wolf' for the topic 'biodiversity'), when there is a new stakeholder (e.g. the name of the new minister of economic affairs for the topic 'industry and ICT'), or when new geographical names are relevant to the topic (e.g. 'Lesbos' for the topic 'refugees'). The goal of the current work is to support users of news monitoring applications by providing them with suggestions for new query terms in order to retrieve more relevant news articles.

Our intuition is that documents that are relevant but not retrieved for the current query have similarities with the documents that are retrieved for the current query. Therefore, our approach to query suggestion is to generate candidate query terms from the set of retrieved documents.

In this paper, we present the results of a user study in which we evaluate our methodology for query term suggestion with 9 expert users of LexisNexis Publisher. We first conducted interviews with the users to collect their wishes and needs. Then we developed a demo application for news retrieval with query term suggestion functionality. We used this application to evaluate our approach and compare 12 different methods for query term suggestion.
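To make the query format of Table 1 concrete: a Boolean query acts as a filter over documents, combining single terms, quoted phrases and proximity constraints (such as /2, two terms within two word positions of each other) with OR and NOT. The following minimal sketch shows how such a filter could be evaluated against a news abstract. It is an illustration only, with a simplified tokenizer and proximity operator, and is not LexisNexis Publisher's actual implementation.

```python
import re

def tokenize(text):
    """Lowercase a news abstract and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def contains_phrase(tokens, term):
    """True if the (possibly multi-word) term occurs as a contiguous token sequence."""
    words = term.lower().split()
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))

def within_distance(tokens, term1, term2, k):
    """Simplified '/k' proximity operator: both terms occur at most k positions apart."""
    pos1 = [i for i, tok in enumerate(tokens) if tok == term1.lower()]
    pos2 = [i for i, tok in enumerate(tokens) if tok == term2.lower()]
    return any(abs(i - j) <= k for i in pos1 for j in pos2)

# Hypothetical abstract, matched against a simplified version of the
# 'Diversity' query from Table 1.
tokens = tokenize("The board discussed diversity and inclusion and equal employment.")
or_terms = ["equal employment", "discrimination", "harassment", "race",
            "gender", "religion", "national origin", "disability"]
matches = (within_distance(tokens, "diversity", "inclusion", 2)
           or any(contains_phrase(tokens, t) for t in or_terms))
print(matches)  # True: 'diversity' and 'inclusion' occur within two positions
```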
2 Related work

The task of spotting novel terms in a news stream is related to research on topic detection and tracking (TDT), which has its roots in the 1990s [2, 1]. TDT aims to automatically detect new topics or events in temporally ordered news streams, and to find new stories on already known topics. The functionality of LexisNexis Publisher is related to news tracking in TDT: the topic is given (in the form of a query) and the tool is expected to find relevant new stories in the news stream [14]. More recent work on TDT is directed at topic tracking in microblog data (Twitter) [10, 5]. Microblog data, like news data, is temporally ordered data that continuously changes.

Our approach to query suggestion – generating candidate query terms from the set of retrieved documents – is related to pseudo-relevance feedback [3], a method for query expansion that assumes that the top-k retrieved documents are relevant, extracting terms from those documents and adding them to the query. Pseudo-relevance feedback has been applied to microblog retrieval, expanding the user query with related terms from retrieved posts to improve recall [6, 8]. When selecting terms from Twitter and news data, it is important to take into account that the language use around a topic continuously evolves. One option is to give a higher score to terms that are temporally closer to query time [6]. Our approach to query term suggestion is related to this idea: we aim to find the terms that are prominent in the most recent news articles on a topic.

There are two key differences between pseudo-relevance feedback and our approach. First, instead of adding terms blindly, we provide the user with suggestions for query adaptation. Second, we deal with Boolean queries, which implies that we do not have a relevance ranking of documents to extract terms from. This means that the premise of 'pseudo-relevance' may be weak for the set of retrieved documents.
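To make this contrast concrete, the sketch below shows textbook pseudo-relevance feedback in its simplest form, with plain term frequency standing in for the more refined term scoring of [3]; the comments mark the two assumptions that do not hold in our setting.

```python
from collections import Counter

def prf_expand(query_terms, ranked_docs, k=10, n_expansion=5):
    """Classic pseudo-relevance feedback: assume the top-k *ranked* documents
    are relevant, score candidate terms by frequency in those documents, and
    append the best-scoring terms to the query automatically."""
    counts = Counter()
    for doc in ranked_docs[:k]:  # requires a relevance ranking
        counts.update(t for t in doc.lower().split() if t not in query_terms)
    return query_terms + [t for t, _ in counts.most_common(n_expansion)]

# In our setting both assumptions fail: Boolean retrieval yields an unranked
# result set (there is no meaningful top-k), and terms must be presented to
# the user as suggestions instead of being added blindly.
```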
3 Interviews with expert users

We conducted interviews with three experienced users of LexisNexis Publisher to get to know their way of working, their priorities and their wishes for query assistance. The following paragraphs summarize the insights obtained during these interviews.

Way of working. Queries are not changed frequently; most attention is paid to the initial query. Formulating this query takes several hours up to a whole day. Query constructions with Boolean operators are often re-used, for example to exclude specific sources or newspaper sections. If a query gives too much noise, exclusions are added (using the 'NOT' operator). If a query gives too few results, new terms are added (with the 'OR' operator). Changes that are made at a later stage are often changes in person and place names. Some customers have difficulties formulating good Boolean queries; these customers make use of information specialists at LexisNexis to formulate their queries.

Priorities. The experts we interviewed use LexisNexis Publisher to create newsletters for their organization. Typically, they review all the retrieved articles before deciding which are included in the newsletter. This selection is based on redundancy and relevance; in case of overlapping news articles, the longest story from the most reliable source is selected. This is done manually, as it allows users to control the precision of the news articles included in the newsletter. The users indicate that for this reason, it is especially important that no relevant documents are missed by the search. Noise in the result set is not so much of an issue; if half of the retrieved articles are relevant, the users are satisfied.

Wishes for query assistance. Users indicate that assistance in query formulation could be helpful, not only when adapting existing queries, but especially when formulating new queries. The users mention assistance in the form of: (a) suggestions of new query terms; (b) suggestions for deleting query terms that give too much noise; (c) suggestions for deleting query terms that give very few results. Of these three tasks, we concentrated on the first: suggesting potential new query terms. One requirement posed by the users is that the user keeps full control over the query: terms should not be added blindly, but presented as suggestions.

4 Methodology

Our approach to query suggestion is to generate candidate query terms from the set of retrieved documents. (A query term may consist of multiple words.) The central methodology needed for generating terms from a document collection is term scoring: each candidate term from the document collection is assigned a score that allows for selecting the best – most descriptive – terms. The term scoring methods that we use are defined below.

Problem definition. We have a text collection D (the 'foreground collection') consisting of one or more documents. Our goal is to generate a list of terms T with for each t ∈ T a score that indicates how descriptive t is for D. Each t is a sequence of n non-stopwords; we use n = {1, 2, 3} in our experiments.

In most term scoring methods, descriptiveness is determined by comparing the relative frequency of t in the foreground collection D to the relative frequency of t in a background collection. For a given Boolean query, we retrieve the result set R_recent, which is the set of articles published in the last 30 days, and the result set R_older, which is the set of articles published 60 to 30 days ago.
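As a generic illustration of this problem definition (not yet one of the four specific algorithms defined below), the following sketch generates candidate n-gram terms and scores them by contrasting relative frequencies in a foreground and a background collection. The stopword list and the add-one smoothing are illustrative assumptions.

```python
from collections import Counter

STOPWORDS = {"de", "het", "een", "en", "van", "in", "op"}  # toy Dutch stopword list

def candidate_terms(tokens, max_n=3):
    """Candidate terms: sequences of n = 1..max_n consecutive non-stopwords
    (a simplification of the problem definition above)."""
    content = [t for t in tokens if t not in STOPWORDS]
    for n in range(1, max_n + 1):
        for i in range(len(content) - n + 1):
            yield " ".join(content[i:i + n])

def descriptiveness(fg_tokens, bg_counts, bg_size):
    """Score each candidate t by the ratio of its relative frequency in the
    foreground collection D to its (add-one smoothed) relative frequency in
    the background collection; higher means more descriptive of D."""
    fg_counts = Counter(candidate_terms(fg_tokens))
    fg_size = sum(fg_counts.values())
    return {t: (c / fg_size) / ((bg_counts.get(t, 0) + 1) / bg_size)
            for t, c in fg_counts.items()}
```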
[Figure 1: Schematic view of how the term lists are generated. Term list T1 contrasts the news for the topic of the last 30 days with a generic corpus of Dutch newspapers; T2 contrasts the news of the last 30 days with the news of 30-60 days ago; T3 contrasts the news of 30-60 days ago with the generic corpus. The query suggester returns one of three term lists to the user: A = T1; B = T2; C = {t : t ∈ T1 ∧ t ∉ T3}.]

Methods for generating descriptive terms. We compare three methods for generating the most relevant query terms (see Figure 1 for a schematic overview):

A. Return the top-k terms from T1, generated using R_recent as the foreground collection and a generic news corpus as background collection (we used the newspaper section of the Dutch SoNaR corpus [9], 50 million words in total, available at http://tst-centrale.org/producten/corpora/sonar-corpus/6-85);

B. Return the top-k terms from T2, generated using R_recent as foreground collection and R_older as background collection;

C. First generate T3, using R_older as foreground collection and the generic news corpus as background collection. Then return the top-k terms from the set {t : t ∈ T1 ∧ t ∉ T3} (all terms from T1 that are not in T3).

Term scoring algorithms. We implemented four term scoring algorithms from the literature and compare them on the task of generating potential query terms from the set of retrieved documents:

• Parsimonious Language Models (PLM) [4], designed for creating document models in Information Retrieval. In PLM, the term frequency of each t in D is weighted with the frequency of t in the background collection using an expectation-maximization algorithm;

• Kullback-Leibler divergence for informativeness and phraseness (KLIP) [12]. Informativeness is determined by comparing the relative frequency of t in D to its relative frequency in the background collection. Phraseness is determined by comparing the frequency of t as a whole to the frequencies of the unigrams that the n-gram t is composed of. The informativeness and phraseness of t are summed to obtain a relevance score for t;

• Frequency profiling (FP) [11], designed for contrasting two separate corpora. This method uses a log-likelihood function based on the expected and observed frequencies of a term in both corpora (the foreground and background collections; a sketch follows at the end of this section);

• Co-occurrence-based χ² (CB) [7], which determines the relevance of t in the foreground collection by the distribution of co-occurrences of t with frequent terms in the collection itself. The rationale of this method is that no background corpus is needed, because the set of most frequent terms from the foreground collection serves as background corpus.

For one query and the corresponding retrieved documents, we generate twelve lists of potential query terms: three approaches (A-C) combined with four term scoring algorithms.
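Of the four algorithms, frequency profiling has the most compact definition. The sketch below implements its log-likelihood statistic following [11], together with the set difference that method C applies to the resulting term lists; the corpus-count interfaces are simplified assumptions, not our production implementation.

```python
import math

def fp_loglikelihood(a, b, c, d):
    """Frequency-profiling log-likelihood [11] for a term occurring a times
    in the foreground corpus (c tokens in total) and b times in the
    background corpus (d tokens in total)."""
    e1 = c * (a + b) / (c + d)  # expected foreground frequency
    e2 = d * (a + b) / (c + d)  # expected background frequency
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def top_k_terms(fg_counts, fg_size, bg_counts, bg_size, k=5):
    """Rank all foreground terms by FP score and return the top k; the other
    three algorithms would plug into this same selection step."""
    score = lambda t: fp_loglikelihood(fg_counts[t], bg_counts.get(t, 0),
                                       fg_size, bg_size)
    return sorted(fg_counts, key=score, reverse=True)[:k]

def method_c(t1, t3, k=5):
    """Method C: top-k terms that are descriptive of the recent news (T1)
    but were not already descriptive of the older news (T3)."""
    return [t for t in t1 if t not in set(t3)][:k]
```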
5 Experiment and results

We collected feedback from expert users of LexisNexis Publisher to determine the best method for generating term suggestions. For this purpose, we developed an external demo application for news retrieval from the LexisNexis collection that includes query term suggestion functionality. Note that the query term suggestion functionality was not integrated in the existing LexisNexis search interface, but implemented as a standalone web application. Figure 2 shows a screenshot of the demo application (a video demonstrating the application can be viewed at https://youtu.be/4yIYpvHVugQ). The user interface is in Dutch. In the top part of the screen ('Zoekopdracht bewerken' - 'Edit search'), the user sees the current query and the results ('Resultaten') retrieved for that query. In total, 1110 results were retrieved for this query. In the bottom part of the screen ('Query aanpassen' - 'Adapt query'), the user sees a list of term suggestions. This example illustrates the final functionality, in which only the 5 suggestions by the best performing method are shown. In the experimental setting, the user saw a pool of 10-25 terms from different methods.

[Figure 2: A screenshot illustrating the functionality of the demo application for query term suggestion.]

5.1 Evaluation design

The query suggestion software was evaluated by 9 individual users of LexisNexis Publisher. A 2-hour evaluation session was organized for each participant. The interviews described in Section 3 revealed that queries change more frequently when they are novel. Therefore, each participant was asked to perform two different tasks with the assistance of our demo application during the evaluation session. In the first task, the participant is asked to update a query that is already being used by their company. In the second task, the participant designs a new query for a topic for which they received a short topic description.

The initial (existing or new) Boolean query is issued in LexisNexis Publisher through its API, searching in Dutch newspapers of the last 60 days (the maximum date range allowed by the API). The titles and abstracts of the matching news articles are shown in a result list (in chronological order) and a list of query term suggestions is presented. The participant reviews the set of retrieved documents and improves the query by adding and/or removing terms, optionally using a term from the suggestions. Subsequently, the updated query is issued and the query can be improved again. In both tasks, the participant was asked to review and update the query up to a maximum of five iterations. After the complete evaluation session, the participants filled in a post-experiment questionnaire, in which they could provide additional comments.

5.2 Data

The participants issued 83 searches in total. The Boolean queries are long: 45 terms on average. Terms can be single words or phrases (multi-word terms), and they are combined with Boolean operators. We used the LexisNexis Publisher API to retrieve documents (news articles) published in the last 60 days. On average, 1,031 documents were retrieved per query (ranked by date), with an average length of 63 words. The short document length is caused by the API allowing us to extract only the summary of a news article, not the full text. This means that the size of the subcollection from which potential new query terms are extracted for a query is on average 1,031 * 63 = 64,953 words.

We created a pool of terms from the 12 (3 approaches * 4 term scoring algorithms) term lists per topic. We assume that in a real application, the query suggestion software would show five candidate terms to the user, and we want to be able to evaluate these 5 suggestions for each method. Therefore, the top 5 terms from each term list were added to the pool. The maximum number of terms in a pool is 60 (12 * 5), but in reality there is quite some overlap: the number of terms per pool is between 10 and 25. For each query, the participants were presented with this pool of 10-25 terms. The terms were ranked by the number of top-5 lists they appear in: the terms that were extracted by most methods were ranked at the top of the pool.
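Assuming the twelve ranked term lists are available, the pooling procedure described above amounts to the following sketch; the function name is ours, not part of the evaluated system.

```python
from collections import Counter

def build_pool(term_lists, top=5):
    """Pool the top-5 terms of each of the 12 method lists and order the
    pooled terms by the number of top-5 lists they appear in, so that terms
    extracted by the most methods are shown first."""
    membership = Counter()
    for ranked_terms in term_lists:        # one ranked list per method
        for t in set(ranked_terms[:top]):  # its top-5, deduplicated
            membership[t] += 1
    return [t for t, _ in membership.most_common()]
```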
5.3 Experimental results

The selection of query terms and the relevance judgments for the suggested terms in the pool allow us to evaluate and compare the methods. For each method, we have judgments for the 5 highest scoring terms. We count how often one of these terms was selected by a participant, and how often at least one of these terms received a relevance rating of at least 4. The results are in Table 2 and Table 3. The best performing methods are method A with either FP or KLIP as term scoring algorithm, and method C with KLIP. With these methods, participants selected a term from the top-5 suggestions for 13% of the searches, and judged at least one term from the top-5 suggestions as relevant (relevance score >= 4) for 25% of the searches. The average rating given to the terms in the pool was low: 1.36 on a 5-point scale.

Table 2: Results per method in terms of 'selected-success-rate': the percentage of searches for which participants added a term from the top-5 to the query.

                             CB     FP     KLIP   PLM
A = T1                       10%    13%    11%    11%
B = T2                       10%     7%     6%     6%
C = {t : t ∈ T1 ∧ t ∉ T3}    10%     0%    11%    11%

Table 3: Results per method in terms of 'relevant-success-rate': the percentage of searches for which participants judged at least one term from the top-5 as relevant (relevance score >= 4).

                             CB     FP     KLIP   PLM
A = T1                       14%    24%    25%    20%
B = T2                       14%    11%    13%     5%
C = {t : t ∈ T1 ∧ t ∉ T3}    14%    11%    25%    20%

Further analysis of the results showed that the term suggestions were noisy because the sets of retrieved documents are noisy. The Boolean queries return a large set of documents (more than a thousand on average for the last 60 days), without any relevance ranking. The interviews with the users indicated that this is not a problem for the users themselves (they filter the news items for the newsletter), but it turns out to be a problem for the extraction of relevant terms. In other words, the premise of 'pseudo-relevance' does not hold for Boolean retrieval, and this hurts the quality of query term suggestion based on retrieved documents.
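For reference, the two measures reported in Tables 2 and 3 can be computed per method as in the sketch below; the per-search record layout is a hypothetical reconstruction of the logged session data.

```python
def success_rates(searches, method, threshold=4):
    """Compute the two measures behind Tables 2 and 3 for one method:
    selected-success-rate - fraction of searches where the participant added
                            a term from the method's top-5 to the query;
    relevant-success-rate - fraction of searches where at least one top-5
                            term was rated >= threshold (5-point scale)."""
    selected = relevant = 0
    for s in searches:
        top5 = s["suggestions"][method]  # the method's 5 suggested terms
        if any(t in s["added_terms"] for t in top5):
            selected += 1
        if any(s["ratings"].get(t, 0) >= threshold for t in top5):
            relevant += 1
    n = len(searches)
    return selected / n, relevant / n
```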
5.4 Qualitative feedback

In the post-experiment questionnaire, participants indicated that the demo application was clear and intuitive (median score of 4 on a 5-point scale for the statement 'the web application is clear'). Half of the participants would be interested in using the tool. However, they felt that the quality of the terms should be improved for the application to be really useful. Suggestions provided by the users included:

• Do not suggest terms that are already covered by wildcards in the query. We implemented this improvement in the final version of the demo application.

• Terms that occur in important parts of the text should be considered more relevant. In fact, this was already taken into account, because the API only allowed us to access the abstracts of the documents.

• Multi-word terms should not be suggested. This comment appeared to be in contrast with the users' term selections: of the 15 terms selected by the users, the majority (12) are multi-word terms.

• Add suggestions for the use of Boolean operators. This was beyond the scope of the current project, which focused on term suggestion.

6 Conclusions

The results of our user experiment show that with the best performing method, participants selected a term from the top-5 suggestion list for 13% of the searches, and judged at least one term as relevant for 25% of the searches. Inspection of the results and the post-task questionnaire revealed that the term suggestions are noisy, mainly because the set of retrieved documents for the Boolean query is noisy. We expect that the use of relevance ranking instead of Boolean retrieval, combined with post-filtering of noisy terms, will give better user satisfaction.

The relevance judgments for the suggested terms are low compared to another application area for term extraction that we addressed in previous work with the same methodology, namely author profiling [13]. This can partly be explained by the noise in the set of retrieved documents (irrelevant documents lead to irrelevant terms), but may also be caused by expert users of news retrieval software being critical in their selection of query terms. This shows that it is valuable to evaluate query suggestion technology with real users.

References

[1] Allan, J.: Topic Detection and Tracking: Event-based Information Organization. Volume 12. Springer Science & Business Media (2002)

[2] Allan, J., Papka, R., Lavrenko, V.: On-line new event detection and tracking. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM (1998) 37-45

[3] Cao, G., Nie, J.Y., Gao, J., Robertson, S.: Selecting good expansion terms for pseudo-relevance feedback. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM (2008) 243-250

[4] Hiemstra, D., Robertson, S., Zaragoza, H.: Parsimonious language models for information retrieval. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM (2004) 178-185

[5] Lin, J., Snow, R., Morgan, W.: Smoothing techniques for adaptive online language models: topic tracking in tweet streams. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM (2011) 422-429

[6] Massoudi, K., Tsagkias, M., de Rijke, M., Weerkamp, W.: Incorporating query expansion and quality indicators in searching microblog posts. In: Advances in Information Retrieval, Springer (2011) 362-367

[7] Matsuo, Y., Ishizuka, M.: Keyword extraction from a single document using word co-occurrence statistical information. International Journal on Artificial Intelligence Tools 13(01) (2004) 157-169

[8] Metzler, D., Cai, C., Hovy, E.: Structured event retrieval over microblog archives. In: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics (2012) 646-655

[9] Oostdijk, N., Reynaert, M., Monachesi, P., Van Noord, G., Ordelman, R., Schuurman, I., Vandeghinste, V.: From D-Coi to SoNaR: a reference corpus for Dutch. In: LREC (2008)

[10] Phuvipadawat, S., Murata, T.: Breaking news detection and tracking in Twitter. In: Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on. Volume 3, IEEE (2010) 120-123

[11] Rayson, P., Garside, R.: Comparing corpora using frequency profiling. In: Proceedings of the Workshop on Comparing Corpora, Association for Computational Linguistics (2000) 1-6

[12] Tomokiyo, T., Hurst, M.: A language model approach to keyphrase extraction. In: Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, Volume 18, Association for Computational Linguistics (2003) 33-40

[13] Verberne, S., Sappelli, M., Kraaij, W.: Term extraction for user profiling: evaluation by the user. In: UMAP Workshops (2013)

[14] Yamron, J., Carp, I., Gillick, L., Lowe, S., Van Mulbregt, P.: Topic tracking in a news stream. In: Proceedings of the DARPA Broadcast News Workshop (1999) 133-136