<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Use of Query Similarity for Improving Presentation of News Verticals</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Annie Louis</string-name>
          <email>lannie@seas.upenn.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rao Shen</string-name>
          <email>raoshen@yahoo-inc.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eric Crestan *</string-name>
          <email>ericres@microsoft.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Youssef Billawala</string-name>
          <email>billawal@yahoo-inc.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fernando Diaz</string-name>
          <email>fdiaz@yahoo-inc.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jean-François Crespo</string-name>
          <email>jfcrespo@yahoo-inc.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Microsoft</institution>
          ,
          <addr-line>81669 Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Pennsylvania</institution>
          ,
          <addr-line>Philadelphia, PA 19104</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Yahoo! Labs</institution>
          ,
          <addr-line>Sunnyvale, CA 94089</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Users often issue web queries related to current news events. For such queries, it is useful to predict the news intent automatically and highlight the news documents on the search result page. An example query would be "election results" issued during the time of elections. These highlighted displays are called news verticals. Prior work has proposed several features for predicting whether a query has news intent. However, most approaches treat each query individually, so on a given day, very similar queries can be assigned opposite predictions. In our work, we explore how a system can utilize query similarity information to improve the quality of news verticals along two dimensions: prediction and presentation. We show via a study of actual search traffic that the accuracy of classifying queries into newsworthy and not newsworthy categories can be improved using query similarity. Further, we present a method to identify a canonical variant for a newsworthy query, such that using the canonical query retrieves better results from the news backend to show in the display. Use of the canonical query also has the advantage of creating a consistent presentation of results for query variants related to the same news event.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>Consider the query "Michelle Obama visits Spain" issued
to web search engines during the time of the First Lady's
visit to Spain. If this query is issued on a news website,
the results would contain all news articles about the event.
On the other hand, a user may also issue such a query to
a generic web search engine. If the search engine can
predict the query as related to currently newsworthy events,
latest news can be retrieved for the query and explicitly
highlighted on the result page.</p>
      <p>Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. This article was presented at the workshop
Very Large Data Search (VLDS) 2011. Copyright 2011.</p>
      <p>
        Such news displays, also
called news verticals, now appear on the result pages of
major search engines such as Google, Yahoo! and Bing. An
example news vertical on the search result page of a generic
search engine is shown in Figure 1. Prior work has
introduced several features for news intent prediction. These
features characterize the bursty nature of a query during a
particular time period [
        <xref ref-type="bibr" rid="ref16 ref8 ref9">8, 16, 9</xref>
        ]. In this work, we show how
to improve the presentation of news displays by making use
of query similarity information.
      </p>
      <p>We consider two properties related to news displays which
can further benefit from information about similar queries.</p>
      <p>Triggering accuracy: When two queries q and q′ are
related to the same news event, they would have the same
news intent. For example, the queries "Michelle Obama
Spain visit" and "Spain vacation Michelle Obama" refer to
the same event, and both are newsworthy during that time.
If a system can identify q and q′ as similar, it can make more
accurate predictions. Besides accuracy, it is desirable that
a consistent system trigger a news display for both
queries regardless of the actual lexical realization.</p>
      <p>Retrieval of results for the display: Another use of
query similarity is in the case when q and q′ are predicted
to be newsworthy. The lexical form of q could retrieve very
relevant content from the news backend, while q′ might be
a specific query with few matches. Knowledge that they
are similar can help us choose, from q, q′ and other similar
queries, a variant which would retrieve the most relevant
results for the event that they represent. This setup could
also lead to uniformity in the results shown for different
variants belonging to the same event. So query similarity
can help in presentation quality and consistency.</p>
      <p>
        The work on news intent prediction that is closest to ours
is by Diaz and Arguello [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ] who seek to improve
predictions by incorporating user feedback. In their method, user
clicks on a query's news display are tracked and used as
feedback not only for that query but also for similar ones. They
compute similarity between queries based on the similarity
in their search results. In this work, we show how to
adjust the predictions themselves using similarity information.
Their method also requires the results of the search which
is expensive to compute. The selection of a canonical
variant and the issue of presentation consistency have not been
explored so far.
      </p>
      <p>In this paper we augment a news intent predictor with
query similarity information.
(a) We show that a simple lexical match method to get
similar queries is highly accurate and useful for improving
news intent predictions. We show that the improvement
is pronounced even in a model which uses a rich set of
features for the baseline predictions.
(b) We validate through a user study on actual search
traffic that the predictions improve when we add query
similarity information.
(c) We propose a method to identify a canonical variant
for similar queries. We show that we can significantly
outperform the baseline approaches for this task and
achieve a better display of news results for different
query variants.</p>
      <p>
        Our study is insightful for both the quality of news
display and the search user experience. Inconsistent triggering
lowers accuracy. Recent work [
        <xref ref-type="bibr" rid="ref22 ref4">4, 22</xref>
        ] recognizes that long
and rare queries are also important components of the user
experience but do not always return (informative) results. So
approaches have made use of query similarity to address
the needs of tail queries. Our work continues along these
lines for a different search task. From a user-experience
perspective, particularly with explicitly highlighted displays, it
would be jarring if the choice to show a display and the
presentation style were markedly different for closely related
queries. It would be particularly difficult for users to find a
certain article again using a different but equivalent query
(referred to as the task of re-finding [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]). We seek to add
accuracy and quality to the displays with the aid of query
similarity.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORK</title>
      <p>
        Prior work on news intent prediction has introduced a
number of features [
        <xref ref-type="bibr" rid="ref16 ref8">8, 16</xref>
        ]. Three major resources are utilized: query logs, the
news index, and blogs. When a query appears with greater
frequency in the documents belonging to a time period
compared to the recent past, it can be called newsworthy.
Work by Diaz and Arguello [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ] extends beyond features
and incorporates user feedback as well to adjust predictions.
They obtain a prior news intent prediction based on features
and then correct it as user feedback is obtained.
      </p>
      <p>
        In most prior approaches, the vector of features for a query
is not dependent on other queries issued the same day. Even
when two queries are closely related, they could have
varying frequencies in query logs and news documents. So their
predictions could be quite different. We want to
incorporate query similarity to create more accurate predictions.
In Diaz and Arguello's work, they share user feedback from
one query to those which are similar, so that similar queries
can also take advantage of the feedback information. In this
work, we seek to produce more accurate baseline predictions
using query similarity. Our approach to do similarity-aware
predictions is similar to techniques introduced to handle tail
queries for advertisement display [
        <xref ref-type="bibr" rid="ref22 ref4 ref5">5, 4, 22</xref>
        ] and query
suggestions [
        <xref ref-type="bibr" rid="ref1 ref17 ref20 ref21 ref23">1, 20, 23, 17, 21</xref>
        ]. There is also recent work that
creates specialized ranking models for a query by making use
of similar queries in the training data [
        <xref ref-type="bibr" rid="ref11 ref18">11, 18</xref>
        ].
      </p>
      <p>
        Our approach to canonical query selection is reminiscent of
query modification methods to obtain better matches with
documents or advertisements. Here, when queries do not
match the bid phrases or documents, some substitution [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ],
deletion [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and other modification [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is necessary. Our
situation is different in that we desire a concise query which
can be issued to get the most relevant results from the news
index. So we need to judge the candidate queries based
on both their match to the original query and the news
documents they retrieve from the news index. The idea is
similar to work by Baeza-Yates et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] who ranked query
suggestions not only by similarity but also considering the
attractiveness or quality of the suggested query.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. NEWS INTENT PREDICTOR</title>
      <p>We consider a simple news predictor and then show how
we add query similarity information to it. We predict
newsworthy queries in an offline process. To identify newsworthy
queries for day d, we use the query logs from day d−1. The
queries from the immediately previous day are likely to be
issued again. (In fact, one-third are repeated on average.)
Each query from day d−1 is associated with a binary
prediction (newsworthy or not), and the newsworthy queries
are entered into a whitelist. On day d, if an issued query is in
the whitelist, we trigger a news display. The list is refreshed
daily. So the goal of the predictor is to identify queries that
will continue to have news intent on the next day.</p>
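The day-over-day whitelist workflow described above can be sketched as follows (a minimal sketch; the query names, scores, and threshold below are illustrative, not from the paper's data):

```python
from typing import Dict, Set

def build_whitelist(predicted_ctr: Dict[str, float], threshold: float) -> Set[str]:
    """Offline step at the end of day d-1: keep queries whose predicted
    next-day CTR clears the newsworthiness threshold."""
    return {q for q, ctr in predicted_ctr.items() if ctr > threshold}

def trigger_news_display(query: str, whitelist: Set[str]) -> bool:
    """Online step on day d: a plain membership test decides triggering."""
    return query in whitelist

# Illustrative day d-1 predictions (not real data).
scores = {"election results": 0.42, "pizza recipe": 0.01, "nfl draft order": 0.30}
whitelist = build_whitelist(scores, threshold=0.20)
```

The whitelist is rebuilt from scratch each day, which is what makes the offline computation of rich features affordable.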
      <p>This setup is obviously incapable of handling new queries
that were not seen the previous day. However, for our goal
of evaluating the use of query similarity, it is simple and
appropriate, and also enables us to deploy it for a user
study. Although the system is simple, the offline setup
enables us to use a rich set of features which would be
prohibitively costly to compute at query issue time, so we
are evaluating the power of query similarity over an already
strong feature set. We focus on this baseline model for our
experiments; further approaches such as adding user
feedback and handling new queries will improve the capabilities
of the model we use.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Data</title>
      <p>The gold-standard news intent judgements for a query were
obtained empirically, as was done in prior work. We logged
query statistics in early 2008 from a commercial web search
engine. A small portion of search users were shown news
displays for every query they issued, provided there was any
matching content in the news backend. On average, 800,000
unique queries were issued per day by these users and 30,000
of them had matches in the news index. For these 30,000
queries, displays were shown without any attempt to verify
their news intent. The display consisted of the titles of the
three most relevant news documents. The click-through rate
(CTR) on the display for each query was logged, and we
consider these values as the gold standard. A click on any of
the three titles within the display is considered a click on
the display. The 30,000 queries per day collected over a two-week
period form our data set. We train a model to predict
the CTR of a query and then apply a threshold to obtain a
binary (newsworthy or not) prediction.</p>
      <p>(Table 1, example similar-query matches with similarity values in
parentheses: "2008 long beach grand prix": long beach grand prix (0.57),
f1 grand prix (0.03), long beach (0.02); "2008 nfl draft order": 2008 nfl
draft (0.30), 2008 draft (0.19), nfl draft (0.12), nfl draft 2008 (0.06);
"bad beef recall": beef recall (0.15), recalled beef (0.02), massive beef
recall (0.01), california beef recall (0.01); "carly smithson controversy":
carly smithson (0.10), carly smithson american idol (0.01), carly
hennessy smithson (0.01).)</p>
    </sec>
    <sec id="sec-5">
      <title>3.2 Prediction model</title>
      <p>As already outlined, our method predicts newsworthy queries
for day d from the query traffic logged for day d−1. So the
target for training is the next-day CTR of a query. Only
those queries observed on two consecutive days can be used
for training. We divide the data described above into
training (around 100,000 unique queries) and test (60,000) sets.</p>
      <p>
        Overall, we use around 70 features in our model. Most
of them are inspired by prior work by König et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]
and Diaz [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and use query logs and news index. Some of
the distinguishing log-based features are: ratio between
frequency of the query on this day and that during the
previous week/days, number of times the query was issued on
a news search engine, the output from a navigational
intent predictor, the observed click rate on search results from
news domain versus other webpages. The following
features are among the most discriminative ones based on the
news index: matches with titles and abstracts of news
documents, the maximum score of the news documents retrieved,
matches in different categories such as business, health and
sports, the number of relevant documents in the backend
compared to the number observed the previous week.
      </p>
      <p>
        A Gradient Boosted Decision Trees (GBDT) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] training
approach is used to predict the real-valued next day CTR.
Then we apply a cutoff on the predicted value to obtain
a binary decision (newsworthy or not). To decide on the
cutoff, we consider the sets of queries above different points
on the predicted value. For each of these sets, we compute
the average value of their gold-standard next-day CTR. The
set that has an average value of 20% is taken, and the
corresponding threshold value is chosen as the cutoff. This choice
guarantees that, on average, if this threshold is applied on the
predicted value, the queries chosen as newsworthy will have
an average next-day CTR value of 20% (a reasonable CTR
expectation for news events).
      </p>
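The cutoff selection can be sketched as a scan over candidate thresholds: sort queries by predicted score and track the running average of their gold next-day CTR for the set above each point (the data here is invented for illustration):

```python
def choose_cutoff(predicted, gold_ctr, target=0.20):
    """Scan queries in decreasing order of predicted CTR, keeping a running
    average of the gold next-day CTR of the set above each point; return the
    lowest (most inclusive) score whose set still averages >= target."""
    best = None
    total, n = 0.0, 0
    for q in sorted(predicted, key=predicted.get, reverse=True):
        total += gold_ctr[q]
        n += 1
        if total / n >= target:
            best = predicted[q]
    return best

# Illustrative predicted scores and gold next-day CTRs (not real data).
pred = {"a": 0.9, "b": 0.7, "c": 0.5, "d": 0.1, "e": 0.05}
gold = {"a": 0.5, "b": 0.3, "c": 0.1, "d": 0.0, "e": 0.0}
cutoff = choose_cutoff(pred, gold)
```

Here the top four queries still average 22.5% gold CTR, so the cutoff lands at 0.1 and query "e" is excluded.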
      <p>Next we explain how we add query similarity information
to the predictions.</p>
    </sec>
    <sec id="sec-6">
      <title>4. COMPUTING SIMILAR QUERIES</title>
      <p>
        We use a simple metric that measures the match between
bigrams in the two queries. Such a lexical measure is
inadequate for query variants such as
"Jennifer Lopez babies" and "Marc Anthony twins". So we also
tried other methods which offer more flexibility for
matching queries. These methods included similarity in the
URLs, titles and abstracts of the top results [
        <xref ref-type="bibr" rid="ref19 ref2 ref23 ref8">2, 19, 23,
8</xref>
        ], tracking user sessions to find rephrased queries [
        <xref ref-type="bibr" rid="ref23 ref3">23, 3</xref>
        ]
and a dictionary-based identification of similar named
entities (e.g., Jennifer Lopez, Mariah Carey). We found that
these looser notions (and their combinations) introduce
several false positives for query similarity, while lexically similar
queries (with bigram match) had very high precision. For
our task, particularly canonicalization, high precision is
necessary, and so we choose lexical match as our measure.
      </p>
      <p>The words in the query are stemmed and each bigram of
stems is a feature. We also include skip bigrams with gap
1, i.e., we form bigrams by ignoring at most one
intervening word. The feature value is the frequency of the bigram
in the query multiplied by the inverse document frequency
(idf) computed on all the queries seen that day (recall that
our processing is offline and all queries on day d−1 are
available). The vectors from two queries are compared
using cosine similarity. A cutoff is applied on the similarity
value, and we consider those query pairs above the threshold
as similar. A threshold value of 0.01 was picked by manually
examining query pairs from different threshold ranges.
We also limit the number of similar queries to at most 10.
Some example matches are shown in Table 1.</p>
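A minimal sketch of this similarity computation: stemmed bigrams plus gap-1 skip bigrams, idf-weighted, compared with cosine similarity. The `stem` function is a crude stand-in since the paper does not specify a stemmer, and the idf table is left empty here for brevity (every bigram then gets weight 1):

```python
import math
import re
from collections import Counter

def stem(word):
    # Crude suffix-stripping stand-in for a real stemmer.
    return re.sub(r"(ing|ed|es|s)$", "", word)

def bigram_vector(query):
    """Stemmed bigrams plus skip bigrams with gap 1
    (i.e. ignoring at most one intervening word)."""
    words = [stem(w) for w in query.lower().split()]
    grams = Counter()
    for i in range(len(words) - 1):
        grams[(words[i], words[i + 1])] += 1      # adjacent bigram
        if i + 2 < len(words):
            grams[(words[i], words[i + 2])] += 1  # skip bigram, gap 1
    return grams

def cosine(u, v, idf):
    """idf-weighted cosine similarity between two bigram vectors."""
    w = lambda vec, g: vec[g] * idf.get(g, 1.0)
    dot = sum(w(u, g) * w(v, g) for g in u if g in v)
    nu = math.sqrt(sum(w(u, g) ** 2 for g in u))
    nv = math.sqrt(sum(w(v, g) ** 2 for g in v))
    return dot / (nu * nv) if nu and nv else 0.0

sim = cosine(bigram_vector("michelle obama spain visit"),
             bigram_vector("spain visit michelle obama"), idf={})
```

A pair such as the one above shares two bigrams and clears the 0.01 threshold comfortably.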
    </sec>
    <sec id="sec-7">
      <title>5. PREDICTION RESCORING</title>
      <p>Now, we add query similarity to the news intent
predictions using a post-processing approach.</p>
    </sec>
    <sec id="sec-8">
        <title>5.1 Algorithm</title>
      <p>Let us call the predicted CTR for a query q as score(q).
Next, we gather the most similar queries to q, kNN(q), using
lexical similarity and applying the threshold we described
earlier. Now, we want to modify score(q) such that if the
majority of the neighbours in kNN(q) are newsworthy, q is
likely to be newsworthy as well, and vice versa.</p>
      <p>Some ways to modify score(q) are: assign the weighted
average of the scores of the queries in kNN(q), or assign the score
of the most similar query from kNN(q), etc. During
development testing, we found that the predicted scores from
the baseline model were more accurate for high frequency
queries. So setting score(q) to the score of its variant
in kNN(q) with maximum frequency (the one issued the most
times) provided the best CTR adjustments, which moved
queries reliably across the boundary for newsworthy/ not
newsworthy decision. (We evaluate this aspect using the
true category of queries where the system proposed to make
a change in category.) We also found that the queries
predicted with high confidence in the newsworthy category were
quite accurate, and demoting scores led to a number of false
negatives. So we only focus on adding queries to the whitelist.</p>
      <p>Our rescoring method can be summarized as follows. Let
T be the threshold on score(q) that marks newsworthy queries:</p>
      <p>modScore(q) = score(q), if |kNN(q)| = 0 or score(q) &gt; T;
otherwise modScore(q) = score(arg max_{q′ ∈ kNN(q)} frequency(q′)).</p>
    </sec>
    <sec id="sec-9">
      <title>5.2 Evaluation</title>
      <p>We employed both human judgements and system
deployment to evaluate our approach. The human evaluation
was done on a sample of our data. The deployment was
done for a fraction of the users of a commercial search engine
and serves as a large-scale study in the target setting. Both
cases confirm the usefulness of our approach.</p>
      <sec id="sec-9-1">
        <title>5.2.1 Human evaluation</title>
        <p>We ran our rescoring method on one day of full web
traffic in 2010 containing over 20 million unique queries. The
default whitelist had 33,000 queries. When the scores were
modified, the size increased to 57,000. We selected a
random sample of 250 queries from those added to the whitelist
and obtained annotations for their newsworthiness. To also
confirm our choice not to remove queries from the whitelist,
we created a list of queries that would have been demoted had
we kept that option. A random sample of 250 queries from
this list was also included for annotation.</p>
        <p>The queries were presented to annotators on the next
day. We did not use the same click data as for development
because it is difficult for people to judge the newsworthiness
of past events. We asked the annotators to rate the news
intent of queries on a 4-level scale: a) very newsworthy, b)
newsworthy, c) somewhat newsworthy, d) not newsworthy.
The distribution of the judgements is shown in Table 2.</p>
        <p>From the results, we see that 60% of the queries added to
the whitelist are clearly newsworthy. An additional 10%
of the added queries are on the borderline. So a number
of queries missed by the default predictor are added to the
whitelist. We also see from the annotations for the removed
queries that removing queries from the whitelist is not
a good option: 30% of those removed are actually
rated `very newsworthy'. However, it should be remembered
that these judgements are only on a small sample. We also
had 50 queries simultaneously annotated by 3 judges. The
pairwise annotator agreement (Kappa) is only 0.29 to 0.36.
Collapsing the levels into two categories (`very
newsworthy' and `newsworthy' into one, and the other two into a
second category) increases the agreement to only the 0.44 to
0.60 range. So it is rather difficult for human annotators to
agree on the news level of events even for the most recent
time frame. We therefore use this evaluation to validate our
design and focus on system deployment to test whether there are
significant improvements in the user experience.</p>
      </sec>
      <sec id="sec-9-2">
        <title>5.2.2 User study</title>
        <p>We studied two small fractions of search users. There
were 37 million unique queries issued by each set of users on
a given day. During a two week period, news verticals were
shown for one set of users as per baseline predictions. For the
other, the rescored predictions were used. It is important to
note that both systems were employed simultaneously
during the two weeks, so users would be querying for the same
events in both tracks. The CTR and coverage (proportion
of queries that were shown displays) for the two branches
are shown in Table 3.</p>
        <p>After rescoring, there was a 2.36% improvement in CTR
compared to the baseline approach. At the same time, the
system was also able to show displays for 3.13% more queries
than the baseline. Approximately the same number of users
interacted with the baseline and the new system during the
two-week period. The click rate of each user was used as a
data point, and a Wilcoxon test was used to compare these
values between the two systems. Both CTR and coverage
are significantly different, with p-values lower than 0.0001.</p>
        <p>These results show that after rescoring, we were able to
show news displays for more queries while at the same time
improving the average user engagement on the displays.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>6. QUERY CANONICALIZATION</title>
      <p>Making accurate predictions is only a first step towards
display quality. The results retrieved for newsworthy queries
should also be informative. Here again we can make use of
information from similar queries. So we propose an approach
to query canonicalization. For each newsworthy query, we
identify a canonical variant which retrieves the most relevant
results from the backend. This canonical query might be the
original query itself or a semantically equivalent variant. In
this way, quality results can be shown regardless of the query
variant actually issued. Further, this approach ensures
that similar queries have some uniformity and consistency
in what results get highlighted for that news event.</p>
    </sec>
    <sec id="sec-11">
      <title>6.1 Data and setup</title>
      <p>This experiment is based on the data described in Section
3.1. We try to find a canonical variant for each query in the
whitelist. Those queries not predicted as newsworthy will
not trigger news displays and need not be considered. For
each query in the whitelist, we obtain a set of candidates for
canonicalization. We limit these candidates to other queries
in the whitelist which are highly similar (using the same
threshold on lexical similarity as we have done so far). The
original query is always added as a candidate.</p>
      <p>The task is to pick out the candidate that will have
highest success as a canonical variant. The whitelist queries
are compiled to be used on the next day. So we define
the most successful query as the one with highest next day
CTR (only queries overlapping between consecutive days were
used). However, we should avoid choosing very specific and
rare queries as canonical. For example, a query such as
"Obama Texas debate opposition response" is rather
specific and could have a CTR value of 1, but might have been
issued only once. We do not wish to canonicalize queries
such as "Obama debate" and "Obama Clinton texas
debate" to such a specific query. So we filter out candidates
that were issued fewer than two times that day. (The
original query is always retained as a candidate regardless of
its frequency.) Then we also include the number of clicks
with the CTR value while ranking the candidates: {next
day CTR * log(next day clicks + 1)}. The top variant in this
ranking is considered canonical.</p>
      <sec id="sec-11-4">
        <title>Example lexical variants (Table 4)</title>
        <p>Table 4 lists whitelist queries with their lexical variants:
"jennifer lopez marc anthony" (marc anthony; jennifer lopez;
jennifer lopez baby; jennifer lopez babies; jennifer lopez twins);
"presidential candidates" (republican presidential candidates;
2008 presidential candidates); "shooting down satellite"
(satellite shoot down; US shoots down satellite; spy satellite
shot down); "nfl draft combine" (nfl draft combine; nfl combine;
nfl draft; 2008 nfl draft; nfl draft 2008; nfl scouting combine;
nfl mock draft).</p>
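The candidate filtering and ranking heuristic just described can be sketched as follows (`canonical_variant` is a hypothetical helper, and all numbers are invented for illustration):

```python
import math

def canonical_variant(original, variants, freq_today, ctr, clicks):
    """Keep the original query plus variants issued at least twice today,
    then rank candidates by next-day CTR * log(next-day clicks + 1)."""
    candidates = [original] + [v for v in variants if freq_today.get(v, 0) >= 2]
    gain = lambda q: ctr.get(q, 0.0) * math.log(clicks.get(q, 0) + 1)
    return max(candidates, key=gain)

best = canonical_variant(
    "2008 nfl draft order",
    ["2008 nfl draft", "nfl draft", "obscure draft query"],
    freq_today={"2008 nfl draft": 40, "nfl draft": 90, "obscure draft query": 1},
    ctr={"2008 nfl draft order": 0.10, "2008 nfl draft": 0.30,
         "nfl draft": 0.12, "obscure draft query": 1.0},
    clicks={"2008 nfl draft order": 5, "2008 nfl draft": 120,
            "nfl draft": 80, "obscure draft query": 1},
)
```

Note how the log-click factor keeps a rare variant with a perfect CTR of 1.0 from being selected over a variant with sustained click volume.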
        <p>Some examples from our data are shown in Table 4. We
can see that the canonical variant chosen in this way is a
crisp characterization of the event. Since we use lexical
similarity, we can expect the candidate set to be of high
precision. The candidates have at least one bigram match
with the original query, so these matching keywords will
still appear in the search results. We divide the data into
a training set of 2876 examples and a test set of 1659. The
breakdown of the number of candidates for different queries
in our test set is below.
(a) 2 candidates: 800
(b) 3 to 5 candidates: 636
(c) 6 and above: 223</p>
        <p>Our definition of canonical queries holds for a period of
one day only. For the whitelist queries, we identify and store
the mappings of original to canonical queries. On the next day,
when a query from this list is issued, the canonical variant
can be used for retrieving news results in place of the original
one. The list is refreshed at the end of the day. It is also
worth noting that we wish to present canonical results for
the news display only, so that the relevant news updates are
provided for all query variants. The regular search results
would still provide more fine-grained distinctions (if any) as
per the original query.</p>
      </sec>
    </sec>
    <sec id="sec-12">
      <title>6.2 Canonicalization methods</title>
      <p>We test several baselines for this task: a) a random query
from the candidate set, b) the original query itself (no
canonicalization), c) the query with maximum frequency that day,
d) the query with maximum score from the GBDT model
(the model predicts the next day CTR and high CTR queries
could be considered as more representative ones), e)
maximum and f) minimum length queries.</p>
      <p>
        We also propose a ranking approach to canonicalization.
        The target value {next day CTR * log(next day clicks + 1)}
is used to train a ranker to predict the ordering of queries
within a candidate set. For this purpose, we employ
SVMRank [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], whose training approach seeks to minimize the
number of pairwise wrong orderings over the training set.
The regularization parameter was chosen using cross
validation on the training set. The output of the ranker is a
real-valued score for each candidate which can be used to rank
them. The highest ranked query is marked as the canonical one.
      </p>
      <sec id="sec-12-1">
        <title>Ranking features</title>
        <p>Our ranking approach combines a number of features
with the baseline metrics.</p>
        <p>Simple metrics. Frequency of the query, offline model
score, length of the candidate.</p>
        <p>Relationship to original query. Is the original query?,
is substring/superstring of original query?, is subsequence/
supersequence of original query?</p>
        <p>Content-based. These features are based on the
relevance of the news documents that are present for a candidate
query in the index. In addition, the readability of the titles
from these news documents is also an important factor for
quality of the news display. We retrieve the top 3 titles for
the query from the news index and compute the following
features: length of the titles, query is subsequence/substring
of one of the titles (similarly for the first title), and whether there is
a title string that is spelled fully in capital letters.</p>
        <p>Informativeness of the query. The words used in queries
can make some queries more informative than
others, so we add features based on query word frequency
and word types. We compute the frequency of each query
word and bigram over all the candidates in a set. Then, for
each query, we indicate whether it contains the first, second,
or third most frequent unigram or bigram of that cluster.
Numbers or years are also mentioned in some queries, for
example, "grammy awards 2008". To check the usefulness of
such information, we add the features: query contains a
number/year, query's prefix/suffix is a number/year. We also
indicate the presence of prepositions.</p>
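The informativeness features can be sketched as follows (unigrams only; the bigram features are analogous). The preposition list and the exact year pattern are assumptions:

```python
import re
from collections import Counter

# Illustrative subset; the paper does not list the prepositions it uses.
PREPOSITIONS = {"of", "in", "on", "for", "at", "to", "with", "by"}

def informativeness_features(query: str, candidate_set: list) -> dict:
    """Frequency- and word-type-based features for one candidate query."""
    # Unigram counts over all candidates in the set (the "cluster").
    counts = Counter(w for q in candidate_set for w in q.split())
    top3 = [w for w, _ in counts.most_common(3)]
    tokens = query.split()
    return {
        "has_rank1_unigram": len(top3) > 0 and top3[0] in tokens,
        "has_rank2_unigram": len(top3) > 1 and top3[1] in tokens,
        "has_rank3_unigram": len(top3) > 2 and top3[2] in tokens,
        "contains_number": any(t.isdigit() for t in tokens),
        "contains_year": any(re.fullmatch(r"(19|20)\d\d", t) for t in tokens),
        "prefix_is_number": bool(tokens) and tokens[0].isdigit(),
        "suffix_is_number": bool(tokens) and tokens[-1].isdigit(),
        "has_preposition": any(t in PREPOSITIONS for t in tokens),
    }

cluster = ["grammy awards", "grammy awards 2008", "winners of grammy awards"]
feats = informativeness_features("grammy awards 2008", cluster)
```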
      </sec>
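The SVMRank objective can be illustrated by its classic pairwise reduction: learn a linear scorer whose score differences respect the target ordering. The sketch below uses an invented two-feature candidate set and plain gradient descent on the pairwise hinge loss, omitting SVMRank's regularization term and convex solver:

```python
# Pairwise reduction behind SVMRank-style training: for every pair of
# candidates with target_i > target_j, penalize the scorer unless
# w . x_i exceeds w . x_j by a margin (pairwise hinge loss).
def train_pairwise_ranker(examples, epochs=200, lr=0.1):
    """examples: list of (feature_vector, target) pairs; returns weights w."""
    dim = len(examples[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for xi, ti in examples:
            for xj, tj in examples:
                if ti <= tj:
                    continue
                diff = [a - b for a, b in zip(xi, xj)]
                margin = sum(wk * dk for wk, dk in zip(w, diff))
                if margin < 1.0:  # violated pair: push the two scores apart
                    w = [wk + lr * dk for wk, dk in zip(w, diff)]
    return w

def score(w, x):
    return sum(wk * xk for wk, xk in zip(w, x))

# Invented candidates: features = [scaled frequency, scaled length];
# targets play the role of next-day CTR * log(next-day clicks + 1).
examples = [
    ([0.9, 0.3], 2.5),
    ([0.5, 0.6], 1.0),
    ([0.1, 0.9], 0.2),
]
w = train_pairwise_ranker(examples)
scores = [score(w, x) for x, _ in examples]
canonical = max(range(len(examples)), key=scores.__getitem__)
```

The highest-scoring candidate is then marked as the canonical query, matching the selection rule described above.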
    </sec>
    <sec id="sec-13">
      <title>Evaluation</title>
      <p>The accuracy of predicting the canonical query is shown
in Table 5 for the different baselines and the ranking approach.
We also provide a breakdown of the results by
the size of the candidate sets (columns 3-5): choosing the
canonical variant from a larger set of options is bound to be
a harder task.</p>
      <p>Among the baselines, using the maximum-frequency query
produces the best results. It has an accuracy of 53% on all
test examples, 10% higher than random choice. The
maximum-frequency heuristic also consistently outperforms
the other baselines on larger sets. When a set
contains only two candidates (the original query and one other
variant), random choice, the original query, and the top
offline-score query are equally effective, giving 60% accuracy.
The accuracy of these baselines degrades much more as the
candidate set size increases.</p>
      <p>The ranking approach outperforms the baselines for
candidate sets of all sizes. There is a 3% improvement when all
the test examples are considered. On closer inspection, one can see
that the benefit of the ranking approach is more pronounced
when there are more candidates: it gives a 10%
improvement over the maximum-frequency baseline for sets of more
than 5 queries. In fact, very popular queries are likely
to have several variants and therefore a large candidate set,
so the ranking approach would be the preferred setup for
canonicalization in this case.</p>
    </sec>
    <sec id="sec-14">
      <title>CONCLUSION</title>
      <p>In this work, we have shown that our system, which used
query similarity, achieved better user engagement and also
increased the recall of newsworthy queries. There are several
issues that we can address in future work. One is purging
non-newsworthy queries from the whitelist. Since
similarity-only rescoring was not accurate enough to do this, we have not
addressed this aspect. Perhaps a combination of query
similarity to add queries and user feedback to quickly learn how
to demote queries would be a good approach. We will
explore these ideas in future work.</p>
      <p>In our approach, we have considered not only
prediction accuracy but also the presentation of the display. We
have introduced the notion of a canonical query for news
intent. As a simple baseline for selecting the canonical query, the
maximum-frequency query is a good choice, and the selection is further
improved using query-match and content features. However, we
have only worked with CTR data in this paper. We plan to
conduct a user study to strengthen the results and help us
understand whether canonical results are preferred by users.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hurtado</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Mendoza</surname>
          </string-name>
          .
          <article-title>Query recommendation using query logs in search engines</article-title>
          .
          <source>In Current Trends in Database Technology - EDBT 2004 Workshops</source>
          , pages
          <volume>395</volume>
          –
          <fpage>397</fpage>
          .
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Beeferman</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Berger</surname>
          </string-name>
          .
          <article-title>Agglomerative clustering of a search engine query log</article-title>
          .
          <source>In Proc. of KDD</source>
          , pages
          <volume>407</volume>
          –
          <fpage>416</fpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I.</given-names>
            <surname>Bordino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Castillo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Donato</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Gionis</surname>
          </string-name>
          .
          <article-title>Query similarity by projecting the query-flow graph</article-title>
          .
          <source>In Proc. of SIGIR</source>
          , pages
          <volume>515</volume>
          –
          <fpage>522</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Broder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ciccolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gabrilovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Josifovski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Riedel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Yuan</surname>
          </string-name>
          .
          <article-title>Online expansion of rare queries for sponsored search</article-title>
          .
          <source>In Proc. of WWW</source>
          , pages
          <volume>511</volume>
          –
          <fpage>520</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A. Z.</given-names>
            <surname>Broder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fontoura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gabrilovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Josifovski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>Robust classification of rare queries using web knowledge</article-title>
          .
          <source>In Proc. of SIGIR</source>
          , pages
          <volume>231</volume>
          –
          <fpage>238</fpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bruce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Jones</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Dumais</surname>
          </string-name>
          .
          <article-title>Information behaviour that keeps found things found</article-title>
          .
          <source>Information Research</source>
          ,
          <volume>10</volume>
          (
          <issue>1</issue>
          ):
          <volume>10</volume>
          –
          <fpage>1</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Capra III</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Pérez-Quiñones</surname>
          </string-name>
          .
          <article-title>Using web search engines to find and refind information</article-title>
          .
          <source>Computer</source>
          ,
          <volume>38</volume>
          (
          <issue>10</issue>
          ):
          <volume>36</volume>
          –
          <fpage>42</fpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Diaz</surname>
          </string-name>
          .
          <article-title>Integration of news content into web results</article-title>
          .
          <source>In Proc. of WSDM</source>
          , pages
          <volume>182</volume>
          –
          <fpage>191</fpage>
          . ACM,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Diaz</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Arguello</surname>
          </string-name>
          .
          <article-title>Adaptation of offline vertical selection predictions in the presence of user feedback</article-title>
          .
          <source>In Proc. of SIGIR</source>
          , pages
          <volume>323</volume>
          –
          <fpage>330</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Friedman</surname>
          </string-name>
          .
          <article-title>Greedy function approximation: a gradient boosting machine</article-title>
          .
          <source>Annals of Statistics</source>
          ,
          <volume>29</volume>
          (
          <issue>5</issue>
          ):
          <volume>1189</volume>
          –
          <fpage>1232</fpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>X.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Arnold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.-Y.</given-names>
            <surname>Shum</surname>
          </string-name>
          .
          <article-title>Query dependent ranking using k-nearest neighbor</article-title>
          .
          <source>In Proc. of SIGIR</source>
          , pages
          <volume>115</volume>
          –
          <fpage>122</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Cheng</surname>
          </string-name>
          .
          <article-title>A unified and discriminative model for query refinement</article-title>
          .
          <source>In Proc. of SIGIR</source>
          , pages
          <volume>379</volume>
          –
          <fpage>386</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          .
          <article-title>Training linear svms in linear time</article-title>
          .
          <source>In Proc. of KDD</source>
          , pages
          <volume>217</volume>
          –
          <fpage>226</fpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Jones</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Fain</surname>
          </string-name>
          .
          <article-title>Query word deletion prediction</article-title>
          .
          <source>In Proc. of SIGIR</source>
          , pages
          <volume>435</volume>
          –
          <fpage>436</fpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Madani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Greiner</surname>
          </string-name>
          .
          <article-title>Generating query substitutions</article-title>
          .
          <source>In Proc. of WWW</source>
          , pages
          <volume>387</volume>
          –
          <fpage>396</fpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A. C.</given-names>
            <surname>König</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gamon</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <article-title>Click-through prediction for news queries</article-title>
          .
          <source>In Proc. of SIGIR</source>
          , pages
          <volume>347</volume>
          –
          <fpage>354</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Mei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Church</surname>
          </string-name>
          .
          <article-title>Query suggestion using hitting time</article-title>
          .
          <source>In Proc. of CIKM</source>
          , pages
          <volume>469</volume>
          –
          <fpage>478</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          .
          <article-title>Learning to select a ranking function</article-title>
          .
          <source>In Proc. of ECIR</source>
          , pages
          <volume>114</volume>
          –
          <fpage>126</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahami</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Heilman</surname>
          </string-name>
          .
          <article-title>A web-based kernel function for measuring the similarity of short text snippets</article-title>
          .
          <source>In Proc. of WWW</source>
          , pages
          <volume>377</volume>
          –
          <fpage>386</fpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahami</surname>
          </string-name>
          and
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Heilman</surname>
          </string-name>
          .
          <article-title>A web-based kernel function for measuring the similarity of short text snippets</article-title>
          .
          <source>In Proc. of WWW</source>
          , pages
          <volume>377</volume>
          –
          <fpage>386</fpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>He</surname>
          </string-name>
          .
          <article-title>Optimal rare query suggestion with implicit user feedback</article-title>
          .
          <source>In Proc. of WWW</source>
          , pages
          <volume>901</volume>
          –
          <fpage>910</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>I.</given-names>
            <surname>Szpektor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gionis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Maarek</surname>
          </string-name>
          .
          <article-title>Improving recommendation for long-tail queries via templates</article-title>
          .
          <source>In Proc. of WWW</source>
          , pages
          <volume>47</volume>
          –
          <fpage>56</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          and
          <string-name>
            <given-names>O.</given-names>
            <surname>Nasraoui</surname>
          </string-name>
          .
          <article-title>Mining search engine query logs for query recommendation</article-title>
          .
          <source>In Proc. of WWW</source>
          , pages
          <volume>1039</volume>
          –
          <fpage>1040</fpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>