CEUR Workshop Proceedings Vol-880 (VLDS 2011). https://ceur-ws.org/Vol-880/VLDS-p62-Louis.pdf
Use of Query Similarity for Improving Presentation of News Verticals

Annie Louis∗             University of Pennsylvania, Philadelphia, PA 19104, USA   lannie@seas.upenn.edu
Eric Crestan∗            Microsoft, 81669 Munich, Germany                          ericres@microsoft.com
Youssef Billawala        Yahoo! Labs, Sunnyvale, CA 94089, USA                     billawal@yahoo-inc.com
Rao Shen                 Yahoo! Labs, Sunnyvale, CA 94089, USA                     raoshen@yahoo-inc.com
Fernando Diaz            Yahoo! Labs, Sunnyvale, CA 94089, USA                     fdiaz@yahoo-inc.com
Jean-François Crespo     Yahoo! Labs, Sunnyvale, CA 94089, USA                     jfcrespo@yahoo-inc.com

ABSTRACT
Users often issue web queries related to current news events. For such queries, it is useful to predict the news intent automatically and highlight the news documents on the search result page. An example query would be "election results" issued during the time of elections. These highlighted displays are called news verticals. Prior work has proposed several features for predicting whether a query has news intent. However, most approaches treat each query individually, so on a given day, very similar queries can be assigned opposite predictions. In our work, we explore how a system can utilize query similarity information to improve the quality of news verticals along two dimensions: prediction and presentation. We show, via a study of actual search traffic, that the accuracy of classifying queries into newsworthy and not newsworthy categories can be improved using query similarity. Further, we present a method to identify a canonical variant for a newsworthy query such that issuing the canonical query retrieves better results from the news backend to show in the display. Use of the canonical query also has the advantage of creating a consistent presentation of results for query variants related to the same news event.

∗Work conducted at Yahoo! Labs.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. This article was presented at the workshop Very Large Data Search (VLDS) 2011. Copyright 2011.

[Figure 1: A news display above search results]

1. INTRODUCTION
Consider the query "Michelle Obama visits Spain" issued to web search engines during the time of the First Lady's visit to Spain. If this query is issued on a news website, the results would contain all news articles about the event. On the other hand, a user may also issue such a query to a generic web search engine. If the search engine can predict that the query is related to currently newsworthy events, the latest news can be retrieved for the query and explicitly highlighted on the result page. Such news displays, also called news verticals, now appear on the result pages of major search engines such as Google, Yahoo! and Bing. An example news vertical on the search result page of a generic search engine is shown in Figure 1. Prior work has introduced several features for news intent prediction. These features characterize the bursty nature of a query during a particular time period [8, 16, 9]. In this work, we show how to improve the presentation of news displays by making use of query similarity information.

We consider two properties related to news displays which can further benefit from information about similar queries.

Triggering accuracy: When two queries q and q′ are related to the same news event, they would have the same news intent. For example, the queries "Michelle Obama Spain visit" and "Spain vacation Michelle Obama" refer to the same event and both are newsworthy during that time. If a system can identify q and q′ as similar, it can make more accurate predictions. Besides accuracy, it is desirable that a consistent system trigger a news display for both queries regardless of the actual lexical realization.

Retrieval of results for the display: Another use of query similarity arises when q and q′ are both predicted to be newsworthy. The lexical form of q could retrieve very relevant content from the news backend while q′ might be a specific query with few matches. Knowledge that they are similar can help us choose, from q, q′ and other similar queries, a variant which will retrieve the most relevant results for the event that they represent. This setup could also lead to uniformity in the results shown for different variants belonging to the same event. So query similarity can help in presentation quality and consistency.

The work on news intent prediction that is closest to ours is by Diaz and Arguello [8, 9], who seek to improve predictions by incorporating user feedback. In their method, user clicks on a query's news display are tracked and used as feedback not only for that query but also for similar ones. They compute similarity between queries based on the similarity of their search results. In this work, we show how to adjust the predictions themselves using similarity information. Their method also requires the results of the search, which are expensive to compute. The selection of a canonical variant and the issue of presentation consistency have not been explored so far.

In this paper we augment a news intent predictor with query similarity information.

(a) We show that a simple lexical match method for finding similar queries is highly accurate and useful for improving news intent predictions. We show that the improvement is pronounced even in a model which uses a rich set of features for the baseline predictions.

(b) We validate, through a user study on actual search traffic, that the predictions improve when we add query similarity information.

(c) We propose a method to identify a canonical variant for similar queries. We show that we can significantly outperform the baseline approaches on this task and achieve a better display of news results for different query variants.

Our study is insightful for both the quality of the news display and the search user experience. Inconsistent triggering lowers accuracy. Recent work [4, 22] recognizes that long and rare queries are also important components of the user experience but do not always return (informative) results, so several approaches have made use of query similarity to address the needs of tail queries. Our work continues along these lines for a different search task. From a user experience perspective, particularly with explicitly highlighted displays, it would be jarring if the choice to show a display and the presentation style were markedly different for closely related queries. It would be particularly difficult for users to find a certain article again using a different but equivalent query (referred to as the task of refinding [6, 7]). We seek to add accuracy and quality to the displays with the aid of query similarity.

2. RELATED WORK
Prior work on news intent prediction has introduced a number of features [8, 16]. Three major resources are utilized: query logs, the news index and blogs. When a query appears with greater frequency in the documents belonging to a time period compared to the recent past, it can be called newsworthy. Work by Diaz and Arguello [8, 9] extends beyond features and incorporates user feedback to adjust predictions. They obtain a prior news intent prediction based on features and then correct it as user feedback is obtained.

In most prior approaches, the vector of features for a query does not depend on other queries issued the same day. Even when two queries are closely related, they could have varying frequencies in query logs and news documents, so their predictions could be quite different. We want to incorporate query similarity to create more accurate predictions. In Diaz and Arguello's work, user feedback from one query is shared with those which are similar, so that similar queries can also take advantage of the feedback information. In this work, we seek to produce more accurate baseline predictions using query similarity. Our approach to similarity-aware predictions is similar to techniques introduced to handle tail queries for advertisement display [5, 4, 22] and query suggestions [1, 20, 23, 17, 21]. There is also recent work which creates specialized ranking models for a query by making use of similar queries in the training data [11, 18].

Our approach to canonical query selection is reminiscent of query modification methods used to obtain better matches with documents or advertisements. There, when queries do not match the bid phrases or documents, some substitution [15], deletion [14], or other modification [12] is necessary. Our situation is different in that we desire a concise query which can be issued to get the most relevant results from the news index. So we need to judge the candidate queries based both on their match to the original query and on the news documents they retrieve from the news index. The idea is similar to work by Baeza-Yates et al. [1], who ranked query suggestions not only by similarity but also by the attractiveness or quality of the suggested query.

3. NEWS INTENT PREDICTOR
We consider a simple news predictor and then show how we add query similarity information to it. We predict newsworthy queries in an offline process. To identify newsworthy queries for day d, we use the query logs from day d − 1. The queries from the immediately previous day are likely to be issued again (in fact, one-third are repeated on average). Each query from day d − 1 is associated with a binary prediction (newsworthy or not) and the newsworthy queries are entered into a whitelist. On day d, if an issued query is in the whitelist, we trigger a news display. The list is refreshed daily. So the goal of the predictor is to identify queries that will continue to have news intent on the next day.

This setup is obviously incapable of handling new queries that were not seen the previous day. However, for our goal of evaluating the use of query similarity, it is simple and appropriate, and it also enables us to deploy the system for a user study. Although the system is simple, the offline setup enables us to use a rich set of features which would be prohibitively costly to compute at query issue time. So we are evaluating the power of query similarity over an already strong feature set. We therefore focus on this baseline model for our experiments; further approaches such as adding user feedback and handling new queries will improve the capabilities of the model we use.

3.1 Data
The gold standard news intent judgements for a query were obtained empirically, as was done in prior work. We logged query statistics in early 2008 from a commercial web search engine. A small portion of search users were shown news displays for every query they issued, provided there was any matching content in the news backend.
 2008 long beach grand prix      long beach grand prix (0.57), f1 grand prix (0.03), long beach (0.02)
 2008 nfl draft order            2008 nfl draft (0.30), 2008 draft (0.19), nfl draft (0.12), nfl draft 2008 (0.06)
 bad beef recall                 beef recall (0.15), recalled beef (0.02), massive beef recall (0.01), california beef recall (0.01)
 carly smithson controversy      carly smithson (0.10), carly smithson american idol (0.01), carly hennessy smithson (0.01)

               Table 1: Lexically similar variants for a query (similarity value within parentheses)
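The variants in Table 1 are matched with the stemmed-bigram measure described in Section 4: stemmed bigrams and gap-1 skip bigrams, weighted by frequency times idf, compared with cosine similarity. A minimal sketch of that computation (a simplified illustration, not the paper's implementation: `stem` defaults to the identity function and the per-day idf table defaults to uniform weights, both assumptions made for brevity):

```python
import math
from collections import Counter

def bigram_vector(query, idf, stem=lambda w: w):
    """Vector of stemmed bigrams plus gap-1 skip bigrams, each weighted by
    its frequency in the query times an idf weight (1.0 when unknown)."""
    words = [stem(w) for w in query.lower().split()]
    grams = list(zip(words, words[1:]))    # adjacent bigrams
    grams += list(zip(words, words[2:]))   # skip bigrams: ignore one intervening word
    return {g: c * idf.get(g, 1.0) for g, c in Counter(grams).items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

def similar(q1, q2, idf=None, threshold=0.01):
    """Two queries count as similar when cosine similarity meets the cutoff."""
    idf = idf or {}
    return cosine(bigram_vector(q1, idf), bigram_vector(q2, idf)) >= threshold
```

With an empty idf table, `similar("2008 nfl draft order", "nfl draft 2008")` holds because the two queries share the bigram "nfl draft", while unrelated queries share no bigrams and score 0.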


On average, 800,000 unique queries were issued per day by these users and 30,000 of them had matches in the news index. For these 30,000 queries, displays were shown without any attempt to verify their news intent. The display consisted of the titles of the three most relevant news documents. The click-through rate (CTR) on the display for each query was logged and we consider these values as the gold standard. A click on any of the three titles within the display is counted as a click on the display. The 30,000 queries per day collected over a two-week period form our data set. We train a model to predict the CTR of a query and then apply a threshold to obtain a binary (newsworthy or not) prediction.

3.2 Prediction model
As already outlined, our method predicts newsworthy queries for day d from the query traffic logged for day d − 1. So the target for training is the next-day CTR of a query. Only queries observed on two consecutive days can be used for training. We divide the data described above into training (around 100,000 unique queries) and test (60,000) sets.

Overall, we use around 70 features in our model. Most of them are inspired by prior work by König et al. [16] and Diaz [8] and use query logs and the news index. Some of the distinguishing log-based features are: the ratio between the frequency of the query on this day and during the previous week/days, the number of times the query was issued on a news search engine, the output of a navigational intent predictor, and the observed click rate on search results from news domains versus other webpages. The following features are among the most discriminative ones based on the news index: matches with titles and abstracts of news documents, the maximum score of the news documents retrieved, matches in different categories such as business, health and sports, and the number of relevant documents in the backend compared to the number observed the previous week.

A Gradient Boosted Decision Trees (GBDT) [10] approach is used to predict the real-valued next-day CTR. Then we apply a cutoff on the predicted value to obtain a binary decision (newsworthy or not). To decide on the cutoff, we consider the sets of queries above different points on the predicted value. For each of these sets, we compute the average of their gold standard next-day CTR. The set whose average is 20% is taken and the corresponding threshold is chosen as the cutoff. This choice guarantees that, on average, if this threshold is applied on the predicted value, the queries chosen as newsworthy will have an average next-day CTR of 20% (a reasonable CTR expectation for news events).

Next we explain how we add query similarity information to the predictions.

4. COMPUTING SIMILAR QUERIES
We use a simple metric that measures the match between bigrams in the two queries. Such a lexical measure is inadequate for query variants such as "Jennifer Lopez babies" and "Marc Antony twins". So we also tried other methods which offer more flexibility for matching queries. These methods included the similarity in the URLs, titles and abstracts of the top results [2, 19, 23, 8], tracking user sessions to find rephrased queries [23, 3], and a dictionary-based identification of similar named entities (e.g. Jennifer Lopez, Mariah Carey). We found that these looser notions (and their combinations) introduce several false positives for query similarity, while lexically similar queries (with bigram match) had very high precision. For our task, particularly canonicalization, high precision is necessary, so we choose lexical match as our measure.

The words in the query are stemmed and each bigram of stems is a feature. We also include skip bigrams with gap 1, i.e. we form bigrams by ignoring at most one intervening word. The feature value is the frequency of the bigram in the query multiplied by the inverse document frequency (idf) computed on all the queries seen that day (recall that our processing is offline and all queries of day d − 1 are available). The vectors of two queries are compared using cosine similarity. A cutoff is applied on the similarity value and we consider query pairs above the threshold as similar. A threshold value of 0.01 was picked by manually examining query pairs from different threshold ranges. We also limit the number of similar queries to at most 10. Some example matches are shown in Table 1.

5. PREDICTION RESCORING
Now, we add query similarity to the news intent predictions using a post-processing approach.

5.1 Algorithm
Let us call the predicted CTR for a query q score(q). Next, we gather the most similar queries to q, kNN(q), using lexical similarity and applying the threshold described earlier. Now, we want to modify score(q) such that if the majority of neighbours in kNN(q) are newsworthy, q is likely to be marked newsworthy too, and vice versa.

Some ways to modify score(q) are: assign the weighted average of the scores of the queries in kNN(q), assign the score of the most similar query in kNN(q), etc. During development testing, we found that the predicted scores from the baseline model were more accurate for high-frequency queries. So setting score(q) to that of its most frequent variant in kNN(q) (the one issued the most times) provided the best CTR adjustments, moving queries reliably across the newsworthy/not newsworthy boundary. (We evaluate this aspect using the true category of queries where the system proposed a change of category.) We also found that queries predicted with high confidence in the newsworthy category were quite accurate, and demoting scores led to a number of false negatives. So we only focus on adding queries to the whitelist.

Our rescoring method can be summarized as follows. Let T be the threshold on score(q) to mark newsworthy queries.
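In code, this rescoring rule can be sketched as follows (a simplified illustration, not the paper's implementation: `score`, `knn` and `frequency` stand for the predicted-CTR table, the similar-query lookup and the query-frequency table, and are hypothetical names):

```python
def rescore(query, score, knn, frequency, T):
    """Keep score(q) if q has no similar queries or is already above the
    newsworthy threshold T; otherwise copy the score of the most
    frequently issued variant among its neighbours."""
    neighbours = knn(query)
    if not neighbours or score[query] > T:
        return score[query]
    most_frequent = max(neighbours, key=lambda q: frequency.get(q, 0))
    return score[most_frequent]
```

In deployment, only changes that lift a query above T would be kept, since demoting scores produced false negatives.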
modScore(q) = score(q),                                       if |kNN(q)| = 0 or score(q) > T
            = score( argmax_{q' in kNN(q)} frequency(q') ),   otherwise                        (1)

5.2 Evaluation
We employed both human judgements and system deployment to evaluate our approach. The human evaluation was done on a sample of our data. The deployment was done for a fraction of the users of a commercial search engine and serves as a large-scale study in the target setting. Both confirm the usefulness of our approach.

5.2.1 Human evaluation
We ran our rescoring method on one day of full web traffic in 2010 containing over 20 million unique queries. The default whitelist had 33,000 queries. When the scores were modified, the size increased to 57,000. We selected a random sample of 250 queries from those added to the whitelist and obtained annotations of their newsworthiness. To also confirm our choice not to remove queries from the whitelist, we created a list of queries that would have been demoted had we kept that option. A random sample of 250 queries from this list was also included for annotation.

The queries were presented to annotators on the next day. We did not use the same click data as for development because it is difficult for people to judge the newsworthiness of events from the past. We asked them to rate the news intent of queries on a 4-level scale: a) very newsworthy, b) newsworthy, c) somewhat newsworthy, d) not newsworthy. The distribution of judgements is shown in Table 2.

                        Added (%)       Removed (%)
Very newsworthy        101 (40.4%)       78 (31.2%)
Newsworthy              48 (19.2%)       55 (22.0%)
Somewhat newsworthy     23 (9.2%)        35 (14.0%)
Not newsworthy          78 (31.2%)       82 (32.8%)
Total                  250              250

Table 2: Human newsworthiness judgements

From the results, we see that 60% of the queries added to the whitelist are clearly newsworthy. An additional 10% of the added queries are borderline. So a number of queries missed by the default predictor are added to the whitelist. We also see from the annotations for the removed queries that removing queries from the whitelist seems not to be a good option: 30% of those removed are actually rated 'very newsworthy'. However, it should be remembered that these judgements are only on a small sample. We also had 50 queries annotated simultaneously by 3 judges. The pairwise annotator agreement (Kappa) is only 0.29 to 0.36. Collapsing the levels into two categories ('very newsworthy' and 'newsworthy' into one, the other two into a second category) increases the agreement only to the 0.44 to 0.60 range. So it is rather difficult for human annotators to agree on the news level of events even for the most recent time frame. We therefore use this evaluation to validate our design and rely on the system deployment to test whether there are significant improvements in the user experience.

5.2.2 User study
We studied two small fractions of search users. There were 37 million unique queries issued by each set of users on a given day. During a two-week period, news verticals were shown to one set of users as per the baseline predictions; for the other, the rescored predictions were used. It is important to note that both systems were deployed simultaneously during the two weeks, so users would be querying for the same events in both tracks. The CTR and coverage (proportion of queries that were shown displays) for the two branches are shown in Table 3.

Measure      baseline    rescored
CTR             10.43       10.67
Coverage         2.32        2.39

Table 3: Results from user study

After rescoring, there was a 2.36% improvement in CTR compared to the baseline approach. At the same time, the system was also able to show displays for 3.13% more queries than the baseline. Approximately the same number of users interacted with the baseline and the new system during the two-week period. The click rate of each user was used as a data point and a Wilcoxon test was used to compare these values between the two systems. Both CTR and coverage differ significantly, with p-values below 0.0001.

These results show that after rescoring we were able to show news displays for more queries while at the same time improving the average user engagement on the displays.

6. QUERY CANONICALIZATION
Making accurate predictions is only a first step towards display quality. The results retrieved for newsworthy queries should also be informative. Here again we can make use of information from similar queries. So we propose an approach to query canonicalization. For each newsworthy query, we identify a canonical variant which retrieves the most relevant results from the backend. This canonical query might be the original query itself or a semantically equivalent variant. In this way, quality results can be shown regardless of the query variant actually issued. Further, this approach ensures that similar queries have some uniformity and consistency in which results get highlighted for a news event.

6.1 Data and setup
This experiment is based on the data described in Section 3.1. We try to find a canonical variant for each query in the whitelist. Queries not predicted as newsworthy will not trigger news displays and need not be considered. For each query in the whitelist, we obtain a set of candidates for canonicalization. We limit these candidates to other queries in the whitelist which are highly similar (using the same threshold on lexical similarity as before). The original query is always added as a candidate.

The task is to pick the candidate that will have the highest success as a canonical variant. The whitelist queries are compiled to be used on the next day, so we define the most successful query as the one with the highest next-day CTR (only queries overlapping between consecutive days were used). However, we should avoid choosing very specific and rare queries as canonical. For example, a query such as "Obama Texas debate opposition response" is rather specific and could have a CTR value of 1, but might have been
   Original query                   Lexical variants                                                Canonical query
   1. jennifer lopez marc anthony   marc anthony, jennifer lopez, jennifer lopez baby,              jennifer lopez twins
                                    jennifer lopez babies, jennifer lopez twins
   2. presidential candidates       republican presidential candidates,                             2008 presidential candidates
                                    2008 presidential candidates
   3. shooting down satellite       satellite shoot down, US shoots down satellite,                 satellite shoot down
                                    spy satellite shot down
   4. nfl draft combine             nfl draft combine, nfl combine, nfl draft, 2008 nfl draft,      nfl combine
                                    nfl draft 2008, nfl scouting combine, nfl mock draft

                                    Table 4: Example original and canonical queries


issued only once. We do not wish to canonicalize queries
such as “Obama debate” and “Obama Clinton texas debate”
to such a specific query. So we filter out candidates that
were issued fewer than two times that day. (The original
query is always retained as a candidate regardless of its
frequency.) Then we also include the number of clicks with
the CTR value while ranking the candidates: {next day CTR *
log(next day clicks + 1)}. The top variant in this ranking is
considered canonical.
   Some examples from our data are shown in Table 4. We
can see that the canonical variant chosen in this way is a
crisp characterization of the event. Since we use lexical
similarity, we can expect the candidate set to be of high
precision. The candidates have at least one bigram match
with the original query, so these matching keywords will
still appear in the search results. We divide the data into
a training set of 2876 examples and a test set of 1659. The
breakdown of the number of candidates for different queries
in our test set is as follows:
(a) 2 candidates: 800
(b) 3 to 5 candidates: 636
(c) 6 and above: 223
   Our definition of canonical queries holds for a period of
one day only. For the whitelist queries, we identify and store
the mappings between original and canonical queries. On the
next day, when a query from this list is issued, the canonical
variant can be used for retrieving news results in place of
the original one. The list is refreshed at the end of the day.
It is also worth noting that we wish to present canonical
results for the news display only, so that the relevant news
updates are provided for all query variants. The regular
search results would still provide more fine-grained
distinctions (if any) as per the original query.

6.2    Canonicalization methods

   We test several baselines for this task: a) a random query
from the candidate set, b) the original query itself (no
canonicalization), c) the query with maximum frequency that
day, d) the query with maximum score from the GBDT model
(the model predicts the next day CTR, and high-CTR queries
could be considered more representative), e) the maximum
length query, and f) the minimum length query.
   We also propose a ranking approach to canonicalization.
The target value {next day CTR * log(next day clicks + 1)}
is used to train a ranker to predict the ordering of queries
within a candidate set. For this purpose, we employ SVMRank
[13], whose training approach seeks to minimize the number
of pairwise wrong orderings over the training set. The
regularization parameter was chosen using cross validation
on the training set. The output of the ranker is a
real-valued score for each candidate, which can be used to
rank the candidates. The highest ranked query is marked as
the canonical one. Our ranking approach combines a number of
features with the baseline metrics:
   Simple metrics. Frequency of the query, offline model
score, and length of the candidate.
   Relationship to original query. Is it the original query?
Is it a substring/superstring of the original query? Is it a
subsequence/supersequence of the original query?
   Content-based. These features are based on the relevance
of the news documents that are present for a candidate query
in the index. In addition, the readability of the titles
from these news documents is also an important factor for
the quality of the news display. We retrieve the top 3
titles for the query from the news index and compute the
following features: the lengths of the titles, whether the
query is a subsequence/substring of one of the titles
(similarly for the first title), and whether any title
string is spelled fully in capital letters.
   Informativeness of the query. The words used in queries
could make some queries more informative than others. So we
add features based on query word frequency and word types.
We compute the frequency of each query word and bigram over
all the candidates in a set. Then for each query, we
indicate whether it contains the top, second or third most
frequent unigrams and bigrams of that cluster. Numbers or
years also get mentioned in some queries, for example,
“grammy awards 2008”. To check the usefulness of such
information, we add the features: query contains a
number/year, and query’s prefix/suffix is a number/year. We
also indicate the presence of prepositions.

                       All    2 cand.   3 to 5   above 5
   Random              0.43     0.59      0.30     0.10
   Self                0.43     0.59      0.32     0.17
   Max. frequency      0.53     0.58      0.53     0.34
   Max. offline score  0.44     0.59      0.33     0.19
   Max. length         0.34     0.49      0.23     0.13
   Min. length         0.42     0.56      0.32     0.16
   Ranking approach    0.56     0.60      0.55     0.44

   Table 5: Accuracies for canonical query selection
   from different candidate set sizes

6.3    Evaluation

   The accuracies for predicting the canonical query are
shown in Table 5 for the different baselines and the ranking
approach. We also provide a breakdown of the results by the
size of the candidate sets (columns 3-5): choosing the
canonical variant from a larger set of options is bound to
be a harder task.
   Among the baselines, using the maximum frequency query
produces the best results. It has an accuracy of 53% on all
test examples, 10% higher than random choice. The maxi-
mum frequency heuristic is also consistently better
performing than the other baselines for larger sets. When a
set contains only two candidates (the original query and one
other variant), random choice, the original query, and the
top offline-score query are equally effective, giving 60%
accuracy. The accuracy of these baselines degrades much more
as the candidate set size increases.
   The ranking approach outperforms the baselines for
candidate sets of all sizes. There is a 3% improvement when
all the test examples are considered. On closer look, one
can see that the benefit of the ranking approach is more
pronounced when there are more candidates. It gives a 10%
improvement over the maximum frequency baseline for sets of
size above 5 queries. In fact, the very popular queries are
likely to have several variants and therefore a large
candidate set. So the ranking approach would be the
preferred setup for canonicalization in this case.

7.   CONCLUSION

   In this work, we have shown that our system, which used
query similarity, achieved better user engagement and also
increased the recall of newsworthy queries. There are
several issues that we can address in future work. One is
purging non-newsworthy queries from the whitelist. Since
similarity-only rescoring was not accurate enough to do
this, we have not addressed this aspect. Perhaps a
combination of query similarity to add queries and user
feedback to quickly learn how to demote queries would be a
good approach. We will explore these ideas in future work.
   In our approach, we have considered not only the
prediction accuracy but also the presentation of the
display. We have introduced the notion of a canonical query
for news intent. As a simple baseline for the canonical
query, the maximum frequency query is a good choice, and the
selection is further improved using query-match and content
features. But we have only worked with CTR data in this
paper. We plan to conduct a user study to strengthen the
result and help us understand whether canonical results are
preferred by users.

8.   REFERENCES

 [1] R. Baeza-Yates, C. Hurtado, and M. Mendoza. Query
     recommendation using query logs in search engines. In
     Current Trends in Database Technology - EDBT 2004
     Workshops, pages 395–397, 2005.
 [2] D. Beeferman and A. Berger. Agglomerative clustering
     of a search engine query log. In Proc. of KDD, pages
     407–416, 2000.
 [3] I. Bordino, C. Castillo, D. Donato, and A. Gionis.
     Query similarity by projecting the query-flow graph.
     In Proc. of SIGIR, pages 515–522, 2010.
 [4] A. Broder, P. Ciccolo, E. Gabrilovich, V. Josifovski,
     D. Metzler, L. Riedel, and J. Yuan. Online expansion
     of rare queries for sponsored search. In Proc. of
     WWW, pages 511–520, 2009.
 [5] A. Z. Broder, M. Fontoura, E. Gabrilovich, A. Joshi,
     V. Josifovski, and T. Zhang. Robust classification of
     rare queries using web knowledge. In Proc. of SIGIR,
     pages 231–238, 2007.
 [6] H. Bruce, W. Jones, and S. Dumais. Information
     behaviour that keeps found things found. Information
     Research, 10(1):10–1, 2004.
 [7] R. Capra III and M. Pérez-Quiñones. Using web
     search engines to find and refind information.
     Computer, 38(10):36–42, 2005.
 [8] F. Diaz. Integration of news content into web results.
     In Proc. of WSDM, pages 182–191, 2009.
 [9] F. Diaz and J. Arguello. Adaptation of offline vertical
     selection predictions in the presence of user feedback.
     In Proc. of SIGIR, pages 323–330, 2009.
[10] J. Friedman. Greedy function approximation: a
     gradient boosting machine. Annals of Statistics,
     29(5):1189–1232, 2001.
[11] X. Geng, T.-Y. Liu, T. Qin, A. Arnold, H. Li, and
     H.-Y. Shum. Query dependent ranking using k-nearest
     neighbor. In Proc. of SIGIR, pages 115–122, 2008.
[12] J. Guo, G. Xu, H. Li, and X. Cheng. A unified and
     discriminative model for query refinement. In Proc. of
     SIGIR, pages 379–386, 2008.
[13] T. Joachims. Training linear SVMs in linear time. In
     Proc. of KDD, pages 217–226, 2006.
[14] R. Jones and D. C. Fain. Query word deletion
     prediction. In Proc. of SIGIR, pages 435–436, 2003.
[15] R. Jones, B. Rey, O. Madani, and W. Greiner.
     Generating query substitutions. In Proc. of WWW,
     pages 387–396, 2006.
[16] A. C. König, M. Gamon, and Q. Wu. Click-through
     prediction for news queries. In Proc. of SIGIR, pages
     347–354, 2009.
[17] Q. Mei, D. Zhou, and K. Church. Query suggestion
     using hitting time. In Proc. of CIKM, pages 469–478,
     2008.
[18] J. Peng, C. Macdonald, and I. Ounis. Learning to
     select a ranking function. In Proc. of ECIR, pages
     114–126, 2010.
[19] M. Sahami and T. Heilman. A web-based kernel
     function for measuring the similarity of short text
     snippets. In Proc. of WWW, pages 377–386, 2006.
[20] M. Sahami and T. D. Heilman. A web-based kernel
     function for measuring the similarity of short text
     snippets. In Proc. of WWW, pages 377–386, 2006.
[21] Y. Song and L. He. Optimal rare query suggestion
     with implicit user feedback. In Proc. of WWW, pages
     901–910, 2010.
[22] I. Szpektor, A. Gionis, and Y. Maarek. Improving
     recommendation for long-tail queries via templates. In
     Proc. of WWW, pages 47–56, 2011.
[23] Z. Zhang and O. Nasraoui. Mining search engine
     query logs for query recommendation. In Proc. of
     WWW, pages 1039–1040, 2006.
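
As a rough illustration (not part of the original system), the
candidate filtering and scoring heuristic described in Section 6
can be sketched in a few lines. The function name and the example
CTR/click figures below are hypothetical; the deployed system
additionally re-ranks candidates with SVMRank over the full
feature set rather than this score alone.

```python
import math

def pick_canonical(original_query, candidates):
    """Pick a canonical variant for a whitelist query.

    candidates: list of (query, next_day_ctr, next_day_clicks, day_frequency).
    Filter out variants issued fewer than two times that day (the original
    query is always retained), then rank by the target value described in
    the paper: next day CTR * log(next day clicks + 1).
    """
    kept = [c for c in candidates
            if c[0] == original_query or c[3] >= 2]
    # The log term damps very specific, rare queries whose CTR can be
    # close to 1 on only a handful of clicks.
    def score(c):
        _, ctr, clicks, _ = c
        return ctr * math.log(clicks + 1)
    return max(kept, key=score)[0]

# Hypothetical numbers for the "nfl draft combine" cluster of Table 4.
variants = [
    ("nfl draft combine", 0.20, 400, 12),
    ("nfl combine",       0.35, 900, 30),
    ("nfl mock draft",    1.00, 2,   1),   # rare variant, filtered out
]
print(pick_canonical("nfl draft combine", variants))  # -> nfl combine
```

The click-weighted score, rather than raw CTR, is what keeps the
overly specific variant from winning despite its CTR of 1.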