<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Discovery and Promotion of Subtopic Level High Quality Domains for Programming Queries in Web Search</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Arpita Das</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Saurabh Shrivastava</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Prateek Agrawal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sandeep Sahoo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manoj Chinnakotla</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Microsoft</institution>
          ,
          <addr-line>India</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <abstract>
        <p>With the advancement of technology in the modern era, a significant portion of the web, referred to as the developer segment, serves the programming-related information needs of users. User satisfaction in this segment depends not only on the relevance of the retrieved pages but also on the domains these pages belong to. We aim to discover sub-topic level associations between domains and queries. We propose a supervised deep neural network based approach that uses the click-through data of a commercial web search engine to discover and promote the domains which provide high-quality, expert-level content for a query intent. Experiments show that our domain-specific ranker performs significantly better, both qualitatively and quantitatively, than a standard web ranking baseline on real-world coding query sets. This paper further demonstrates how associating domains with query intents results in the formation of overlapping domain clusters, where the domains in each cluster represent a topical space of query intent(s).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Information systems → Page and site ranking</p>
      <p>Keywords: domain preference, web search, user behavior</p>
    </sec>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>With the increase in the number of technologies and coding
infrastructures, developers are becoming more and more dependent on
the web. A coding query may have various intents, ranging from
learning the basics of a programming language to debugging a code
snippet. The most relevant search result for a coding query
depends on how well the result satisfies the query intent. For example,
if the query is about a particular function in a programming
language, the developer will prefer a short description of the function
and an example code snippet; however, if the query is about an
error code, the developer is probably looking for ways to debug it. Promoting
the domain that serves the correct intent can drastically improve the
search engine result page (SERP).</p>
      <p>Proceedings of the first International Workshop on LEARning Next gEneration Rankers (LEARNER), October 2017, Amsterdam, The Netherlands. Copyright © 2017 for this paper by its authors. Copying permitted for private and academic purposes.</p>
      <p>The entire web can be grouped into intersecting clusters of
domains, where every cluster represents a latent topic space satisfying
some query intent(s). Given a new query, we map the query to
the nearest topic cluster and promote the domains associated with
that cluster. For example, the query “how to format date in c#”
belongs to the clusters centered around coarse topics like “c#”, “time” and
“changing date format”, which have domains like “docs.microsoft.com”,
“c-sharpcorner.com” and “dotnetperls.com” associated with them.</p>
      <p>We extracted coding queries from the click logs of the commercial
search engine Microsoft Bing for three years (2014-2016).
Over this period we observed the click trends of these queries
with respect to 45453 coding domains. The distribution of clicks
across domains is not uniform, as shown in Table
1. Domains like “stackoverflow.com” and “msdn.microsoft.com”
clearly dominate the click share. One might argue that, since clicks
model user satisfaction, promoting the most clicked domains of
the past year would improve the SERP. Interestingly, this is not
the case: user satisfaction ultimately depends
on the relevance of the result to the query, so
on the domain front too it only makes sense to promote the
domain that satisfies the user intent. For the query “connecting
database in azure”, from an authority perspective one can assume that
a developer will prefer documents from “msdn.microsoft.com” or
“docs.microsoft.com”, but a third domain, “dzone.com”,
contains specific information about databases and their
connections that exactly matches the query intent. A slight promotion
of this third domain will improve user satisfaction. Clicks
capture the high-level picture of domain preference, whereas we
discover and promote the domains that have a high sub-topic level
association with the query intent.</p>
      <p>
        Retrieving intent-specific domains is still unexplored in the
research world. However, work has been done on detecting the authority,
trustworthiness, etc. of domains. Traditionally, researchers have used
link-structure-based approaches and supervised approaches to
predict the trustworthiness of a domain. Link-based approaches such as
PageRank, HITS, SALSA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] use the structure present in hypertext
      </p>
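      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Domain</th><th>Clicks</th></tr>
          </thead>
          <tbody>
            <tr><td>stackoverflow.com</td><td></td></tr>
            <tr><td>msdn.microsoft.com</td><td></td></tr>
            <tr><td>w3schools.com</td><td></td></tr>
            <tr><td>social.msdn.microsoft.com</td><td></td></tr>
            <tr><td>technet.microsoft.com</td><td></td></tr>
            <tr><td>social.technet.microsoft.com</td><td></td></tr>
            <tr><td>microsoft.com</td><td></td></tr>
            <tr><td>codeproject.com</td><td></td></tr>
            <tr><td>answers.microsoft.com</td><td></td></tr>
            <tr><td>docs.oracle.com</td><td></td></tr>
          </tbody>
        </table>
      </table-wrap>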
      <p>Detecting query-intent-specific domains for the developer segment
of the web is an unexplored research problem. However,
considerable work has been done on the analogous
problems of eliminating spam websites and determining domain authority,
trustworthiness, bias, etc. on the web.</p>
      <p>
        Previous work on web spam removal or establishing reliability
focused mostly on unsupervised techniques for detecting link
spam (which creates tightly knit communities of links to affect
link-based ranking algorithms) and content spam (which maliciously manipulates
the content of web pages). Researchers worked on the automatic
detection of suspicious signals in the link dependencies [
        <xref ref-type="bibr" rid="ref1 ref11 ref2 ref20 ref24 ref8">1, 2, 8, 11, 20, 24</xref>
        ]
and the content of web pages [
        <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
        ]. Castillo et al. combined
link-based and content-based features and used the topology of the web
graph, exploiting the link dependencies among web pages,
to detect spam pages [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Interconnections of spam farms have also been
exploited to combat spam pages [
        <xref ref-type="bibr" rid="ref12 ref22 ref23 ref3">3, 12, 22, 23</xref>
        ].
      </p>
      <p>
        Supervised approaches have also been used to establish the authority
of a web page. In the health domain, search results can directly impact
decisions related to people’s health, so it is imperative for
search engines to provide reliable information. Chinnakotla et al.,
Gaudinat et al. and Sondhi et al. employed supervised machine learning
techniques to learn the notion of trustworthiness of web pages in
the health domain [
        <xref ref-type="bibr" rid="ref10 ref21 ref7 ref9">7, 9, 10, 21</xref>
        ]. Hassan et al. also modeled web search
satisfaction of users [
        <xref ref-type="bibr" rid="ref13 ref14 ref15">13–15</xref>
        ].
      </p>
      <p>
        Ieong et al. introduced domain bias which shows a user’s
propensity to believe that a page is more relevant just because it comes from
a particular domain [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. They demonstrated the importance of
domain preferences in web search even after factoring out position
bias and relevance. This impact of the domain bias [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] motivated
us to promote documents from domains satisfying the exact query
intent.
      </p>
      <p>We aim to learn a signal that promotes the domains which satisfy
the query intent. We use a convolutional neural network model
to learn non-linear relationships between a domain and a query
intent. Put another way, the neural network segments
the queries into a set of fine-grained topics and associates the most
likely domains with each topic space. Each topic space can
be considered a representation of a set of overlapping query
intent(s).</p>
      <p>
        We extracted coding queries and their clicked URLs from the
Bing click logs. For feature extraction, we used character trigram
based word hashing [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. We attach the delimiter “#” to both ends of a word (say
“pen” -&gt; “#pen#”) and extract its letter trigrams (#pe, pen, en#).
We obtained 52339 unique letter trigrams for the entire dataset of
query-clicked domain pairs. We convert each word in the query
and the domain to a vector of size 52339 that records the
number of occurrences of each letter trigram in the word. This
representation handles out-of-vocabulary words and words
with spelling errors.
      </p>
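      <p>As a rough illustration of the word hashing step described above, the following Python sketch extracts letter trigrams and builds a count vector. The function names and the toy vocabulary are assumptions made for this example; a full trigram vocabulary would be built in the same way from all query-clicked domain pairs.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def letter_trigrams(word, delimiter="#"):
    """Return the letter trigrams of a word after adding boundary delimiters."""
    padded = f"{delimiter}{word}{delimiter}"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def build_vocabulary(words):
    """Map every distinct trigram seen in `words` to a vector index."""
    trigrams = sorted({t for w in words for t in letter_trigrams(w)})
    return {t: i for i, t in enumerate(trigrams)}


def hash_word(word, vocabulary):
    """Count-based trigram vector; trigrams outside the vocabulary are skipped."""
    vector = [0] * len(vocabulary)
    for trigram, count in Counter(letter_trigrams(word)).items():
        index = vocabulary.get(trigram)
        if index is not None:
            vector[index] = count
    return vector


# Toy vocabulary (illustrative only, not the vocabulary used in the paper).
vocabulary = build_vocabulary(["pen", "open"])
pen_vector = hash_word("pen", vocabulary)
```
]]></preformat>
      <p>In this sketch, an unseen or misspelled word such as “pens” still activates the trigrams it shares with “pen”, which is the property that makes the representation tolerant of out-of-vocabulary words.</p>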
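      <p>We build a convolutional neural network with three levels of
alternating convolution, max pooling and rectified linear (ReLU)
layers and a fully connected layer at the top. The network gives
a non-linear projection of the query and domain vectors into their
corresponding semantic spaces. Let <italic>x</italic> be the word-hashed input
term vector, <italic>y</italic> the output vector and <italic>h</italic> the number of hidden
layers used. Let <italic>H<sub>j</sub></italic> denote the <italic>j</italic>th intermediate layer, whose
weight matrix is <italic>W<sub>j</sub></italic> and bias term is <italic>b<sub>j</sub></italic>, where <italic>j</italic> = {1, 2, ..., <italic>h</italic>}. Then
        <disp-formula>
          <label>(1)</label>
          <tex-math>H_j = f(W_j H_{j-1} + b_j)</tex-math>
        </disp-formula>
where <italic>j</italic> = {2, 3, ..., <italic>h</italic>} and <italic>H</italic><sub>1</sub> = <italic>W</italic><sub>1</sub><italic>x</italic>, and
        <disp-formula>
          <label>(2)</label>
          <tex-math>y = f(W_h H_{h-1} + b_h)</tex-math>
        </disp-formula>
where we use tanh as the activation function <italic>f</italic>. The
relevance <italic>R</italic>(<italic>d</italic>, <italic>q</italic>) of a domain <italic>d</italic> for a particular query <italic>q</italic> is calculated
using:
        <disp-formula>
          <label>(3)</label>
          <tex-math>R(d, q) = \frac{y_d^{\top} y_q}{\lVert y_d \rVert \, \lVert y_q \rVert}</tex-math>
        </disp-formula>
where <italic>y<sub>d</sub></italic> and <italic>y<sub>q</sub></italic> are the output vectors of the domain and the query.</p>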
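      <p>The layer recurrence and the cosine relevance can be sketched in a few lines of plain Python. This is a simplified, fully connected illustration only: it omits the convolution and pooling layers, and the function names and the way weights are passed in are assumptions made for the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def project(x, first_weights, layers):
    """Compute H_1 = W_1 x, then H_j = tanh(W_j H_{j-1} + b_j); return H_h.

    `layers` holds the (W_j, b_j) pairs for j = 2..h (an assumed layout).
    """
    hidden = matvec(first_weights, x)
    for weights, bias in layers:
        pre_activation = matvec(weights, hidden)
        hidden = [math.tanh(z + b) for z, b in zip(pre_activation, bias)]
    return hidden


def relevance(y_d, y_q):
    """Cosine similarity R(d, q) between projected domain and query vectors."""
    dot = sum(a * b for a, b in zip(y_d, y_q))
    norm = math.sqrt(sum(a * a for a in y_d)) * math.sqrt(sum(b * b for b in y_q))
    return dot / norm if norm else 0.0
```
]]></preformat>
      <p>Returning 0.0 for an all-zero vector is a convention chosen for this sketch so that an empty input does not raise a division error.</p>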
      <p>We use the supervision of the click logs to create positive and
negative samples for our training data. We treat queries paired with their
clicked domains as positive samples (<italic>d</italic><sup>+</sup>), and queries paired with
domains from the SERP that were not clicked for the query,
together with some randomly selected domains, as negative samples (<italic>d</italic><sup>−</sup>). We
train our network to maximize the conditional
likelihood of the clicked domain given the query or, equivalently, to minimize
the loss function in Equation 4:
        <disp-formula>
          <label>(4)</label>
          <tex-math>L(\theta) = -\log \prod_{(q, d^{+})} P(d^{+} \mid q)</tex-math>
        </disp-formula>
where <italic>θ</italic> denotes the set of parameters of our network and
<italic>P</italic>(<italic>d</italic><sup>+</sup> | <italic>q</italic>) is the posterior probability of the clicked domain given the
query.</p>
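      <p>The text does not spell out how the posterior of the clicked domain is computed. A common choice in click-trained models of this kind is a softmax over the relevance scores of the clicked domain and the sampled negatives, scaled by a smoothing factor. The sketch below adopts that form purely as an assumption; the names <monospace>posterior_clicked</monospace>, <monospace>loss</monospace> and <monospace>gamma</monospace> are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def posterior_clicked(r_pos, r_negs, gamma=10.0):
    """Assumed softmax form of P(d+ | q) over the clicked and negative domains.

    r_pos is R(d+, q); r_negs are the relevance scores of the negatives.
    gamma is a hypothetical smoothing factor, not a value from the paper.
    """
    scores = [gamma * r_pos] + [gamma * r for r in r_negs]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    return exps[0] / sum(exps)


def loss(batch, gamma=10.0):
    """-log of the product of posteriors, computed as a sum of -log terms.

    `batch` is a list of (r_pos, r_negs) pairs, one per (q, d+) sample.
    """
    return -sum(math.log(posterior_clicked(r_pos, r_negs, gamma))
                for r_pos, r_negs in batch)
```
]]></preformat>
      <p>Summing negative log-probabilities is algebraically the same as taking the negative log of the product and avoids underflow when the batch is large.</p>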
      <p>One might ask why, if the signal is learnt from the click logs of
a search engine, the search engine itself does not already reflect
the desired behavior. We argue that the SERP of a search engine
does not depend only on the click signal; it takes other features into
account too. Also, our model does not associate a domain with a
particular query; it associates the domain with a topical space that
represents query intent(s), and that topical space is learnt from a
large collection of coding queries. For example, “docs.oracle.com”
is not associated with the query “read a file in java” but with the
topics “java”, “files”, etc., so when a new query “write a file in java”
arrives, “docs.oracle.com” will still be promoted.</p>
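      <p>We combine our intent-specific domain score with the relevance
score of the Bing web ranker to promote pages that are both relevant and
authoritative. We take the top 50 results from the initial retrieval
and re-rank them using a scoring function designed to combine
relevance and authority (Equation 5). Let the initial ranker assign
scores {<italic>s</italic><sub>1</sub>, <italic>s</italic><sub>2</sub>, ..., <italic>s</italic><sub>50</sub>} to the top 50 URLs {<italic>u</italic><sub>1</sub>, <italic>u</italic><sub>2</sub>, ..., <italic>u</italic><sub>50</sub>} retrieved
for a query <italic>q</italic>. Let {<italic>d</italic><sub>1</sub>, <italic>d</italic><sub>2</sub>, ..., <italic>d</italic><sub>50</sub>} be the corresponding domains
extracted from these URLs. The new scoring function is defined as:
        <disp-formula>
          <label>(5)</label>
          <tex-math>\mathrm{Score}(q, u_i, d_i) = s_i + \lambda \cdot R(d_i, q)</tex-math>
        </disp-formula>
where <italic>λ</italic> is the factor with which we boost the domain
signal. We intentionally kept its value small to prevent irrelevant
pages from preferred domains from being promoted.</p>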
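      <p>A minimal sketch of this re-ranking step is given below. The helper names, the way the domain is extracted from a URL, and the boost value of 0.1 are assumptions made for illustration; the text only states that the boost is kept small.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from urllib.parse import urlparse


def extract_domain(url):
    """Return the lower-cased host of a URL without a leading 'www.'."""
    host = urlparse(url if "://" in url else "//" + url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def rerank(results, domain_relevance, boost=0.1):
    """Re-rank (url, s_i) pairs by s_i + boost * R(d_i, q).

    `domain_relevance` maps a domain to R(d, q) for the current query.
    `boost` is a hypothetical value for the domain boost factor.
    """
    rescored = [
        (url, score + boost * domain_relevance(extract_domain(url)))
        for url, score in results
    ]
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Made-up scores for two results of a single query.
initial = [("https://a.com/x", 1.00), ("https://www.b.com/y", 0.95)]
toy_relevance = {"a.com": 0.1, "b.com": 0.9}
reranked = rerank(initial, lambda d: toy_relevance.get(d, 0.0))
```
]]></preformat>
      <p>With these made-up numbers the second result overtakes the first because its domain relevance is much higher, while a boost of zero would leave the initial order unchanged. In practice the function would be applied to the top 50 results only.</p>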
    </sec>
    <sec id="sec-3">
      <title>EXPERIMENTS AND ANALYSIS</title>
      <p>In this section, we first describe the dataset and evaluation metric
used in our experiments. We then present some observations
that we can draw from the results.</p>
      <p>Dataset Details. We collected three years of Bing click logs
and extracted queries with coding intent from them. We obtained 103
million unique query-clicked domain pairs for training the neural
network. We preprocess every query by lower-casing it and
removing stop words; we preserve special characters,
as they are important in the coding domain. We preprocess
domains by lower-casing them and removing prefixes such as ‘http’, ‘https’,
‘www’ and ‘ftp’, if present. We run our re-ranking function on a
set of 20,000 new coding queries from the 2017 logs. From this set, we randomly
sample 400 queries for which our ranking logic
introduces changes in the top 10 results and use them as the
test set. We evaluate the performance of the scoring function using
our domain signal on these test queries against the current Bing
ranking baseline.</p>
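      <p>The preprocessing described above could look roughly like the following. The stop-word list and the exact prefix pattern are assumptions for this sketch, since the text does not list them.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re

# Hypothetical stop-word list; the list actually used is not given in the text.
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "how", "is"}

# Optional scheme (http, https, ftp) followed by an optional 'www.' prefix.
_PREFIX = re.compile(r"^(?:(?:https?|ftp)://)?(?:www\.)?")


def preprocess_query(query):
    """Lower-case the query and drop stop words, keeping special characters."""
    tokens = query.lower().split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def preprocess_domain(domain):
    """Lower-case the domain and strip scheme and 'www.' prefixes."""
    return _PREFIX.sub("", domain.lower(), count=1)
```
]]></preformat>
      <p>Splitting on whitespace rather than on punctuation keeps tokens such as “c#” intact, which matches the stated goal of preserving special characters.</p>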
      <p>
        Evaluation Metric. As pointed out by [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], standard IR metrics
such as NDCG are not suitable for evaluating a domain-based signal.
We also wanted a whole-page comparison of the baseline
and the treatment, so we chose the evaluation metric “Surplus”
proposed by [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Following a similar setting, we show the top 10
results of the baseline and of the treatment to a human judge in two
separate tabs of a single window. The judge rates the pair on a
seven-point scale: Left Much Better, Left Better, Left Slightly Better,
Neutral, Right Slightly Better, Right Better and Right Much Better.
We obtained three judgments per query for all 400 queries in
the test set to reduce human judgment errors. Surplus for <italic>n</italic> queries
is defined as:
      </p>
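      <p>
        <disp-formula>
          <label>(6)</label>
          <tex-math>\mathrm{Surplus} = \frac{n_W - n_L}{n_W + n_L + n_T} \times 100</tex-math>
        </disp-formula>
where the technique scores <italic>n<sub>W</sub></italic> wins, <italic>n<sub>L</sub></italic> losses and <italic>n<sub>T</sub></italic> ties.</p>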
      <p>The final metrics used for measurement are Surplus<sub>strong</sub>, where
strong wins/losses are used, and Surplus<sub>weak</sub>, where weak wins/losses
are used. A good surplus on a large query set implies that the
technique performs well with respect to the baseline.</p>
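      <p>The Surplus computation can be sketched as follows. The mapping of the seven-point scale to signed integers and the thresholds for “strong” and “weak” wins are assumptions of this example (here, a rating magnitude of at least 2 counts as strong and at least 1 as weak); the text does not define them precisely.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def surplus(n_wins, n_losses, n_ties):
    """Surplus = (n_W - n_L) / (n_W + n_L + n_T) * 100."""
    total = n_wins + n_losses + n_ties
    if total == 0:
        raise ValueError("no judged queries")
    return 100.0 * (n_wins - n_losses) / total


def surplus_from_ratings(ratings, min_strength=1):
    """Tally wins, losses and ties from signed per-query ratings.

    Ratings run from -3 (baseline much better) to +3 (treatment much better).
    min_strength is an assumed threshold: 1 for weak, 2 for strong surplus.
    """
    wins = sum(1 for r in ratings if r >= min_strength)
    losses = sum(1 for r in ratings if r <= -min_strength)
    ties = len(ratings) - wins - losses
    return surplus(wins, losses, ties)


# Made-up ratings for five queries, for illustration only.
toy_ratings = [3, 1, 1, 0, -2]
weak = surplus_from_ratings(toy_ratings, min_strength=1)
strong = surplus_from_ratings(toy_ratings, min_strength=2)
```
]]></preformat>
      <p>Ratings below the threshold are counted as ties, so the denominator always equals the number of judged queries.</p>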
      <sec id="sec-3-3">
        <title>Results and Analysis</title>
        <p>The results of our technique with respect
to the baseline are shown in Table 2. Our technique shows significant
gains in weak and strong surplus over the baseline web ranker.
Table 3 illustrates the qualitative analysis of our technique. For
the query “page break in html”, we promote “w3schools.com”
(which caters to the query intent in the topical space of “web page
structuring in html”) over domains like “cybertext.com”, “lvsys.com”,
etc. For the second query, “excel vba protect sheet”, apart from
promoting “msdn.microsoft.com” over “support.office.com”, we also
promote “mrexcel.com” (which has specialized content on Excel) over
“analysistabs.com”.</p>
        <table-wrap>
          <label>Table 2</label>
          <table>
            <thead>
              <tr><th>Query set</th><th>Number of Queries</th><th>Surplus<sub>strong</sub></th><th>Surplus<sub>weak</sub></th></tr>
            </thead>
            <tbody>
              <tr><td>Test set</td><td>400</td><td></td><td></td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>In the process of associating domains with query intents, we
found that our model inherently clusters domains whose content lies
in a similar topic space. We show two such clusters in Figure 1. When
searching for domains similar to “stackoverflow.com”, we observe
that other forums and question-answering platforms such as
“social.msdn.microsoft.com”, “forums.asp.net”, “answers.microsoft.com”,
“superuser.com”, etc. come up as the closest ones. Similarly, when
we searched for domains similar to “w3schools.com”, domains such
as “developer.mozilla.org”, “tizag.com”, “webdesign.about.com”, etc.
were retrieved. Interestingly, all of these domains can be associated
with a common topic space catering to queries about designing web
pages.</p>
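        <p>One simple way to run such a similarity search is to rank all other domains by the cosine similarity of their projected vectors. The sketch below assumes that a dictionary of domain vectors is already available; the vectors shown are made up and do not come from the trained model.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_domains(target, domain_vectors, k=3):
    """Return the k domains closest to `target` by cosine similarity."""
    anchor = domain_vectors[target]
    scored = [
        (cosine(anchor, vector), name)
        for name, vector in domain_vectors.items()
        if name != target
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Made-up two-dimensional vectors, for illustration only.
toy_vectors = {
    "forum-a.example": [1.0, 0.0],
    "forum-b.example": [0.9, 0.1],
    "tutorial.example": [0.0, 1.0],
}
closest = nearest_domains("forum-a.example", toy_vectors, k=1)
```
]]></preformat>
        <p>A brute-force scan like this is adequate for tens of thousands of domains; an approximate nearest-neighbour index would be a natural replacement at larger scale.</p>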
        <p>Another interesting observation is how a
slight modification of the query can change the affinity of the domains
containing relevant results. In Table 4, we demonstrate this
along two verticals. The left side shows how a small change in
query intent, with the same target coding language, changes the top
retrieved domain, whereas the right side shows how a change
in target coding language, with the same developer intent, changes the
top retrieved domain.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>CONCLUSIONS</title>
      <p>In this paper, we proposed a novel deep learning based supervised
technique to promote intent-specific domains in the developer
segment using Bing click logs. The evaluation metric “Surplus”
indicates that our method performs better than the baseline web
ranking algorithm. The experiments also show that
our model segments the queries into a set of topic-based clusters
and associates domains with each cluster. The topicality of a cluster
represents a coarse level of the query intent that the
developer is looking for.</p>
      <p>The proposed approach is reusable and scalable.
So far we have worked in the developer segment, but this work can
be extended to any domain. As part of future work, we plan to learn
a domain signal for the entire web. Currently, we assume that the
SERP contains relevant pages and that a slight re-ranking of pages based
on domain will satisfy users. In the future, we plan to learn a signal
that combines query-title relevance and intent-specific
domain preference and use it to re-rank web results with more
impact.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Brian</given-names>
            <surname>Amento</surname>
          </string-name>
          , Loren Terveen, and
          <string-name>
            <given-names>Will</given-names>
            <surname>Hill</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Does “Authority” Mean Quality? Predicting Expert Quality Ratings of Web Documents</article-title>
          .
          <source>In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '00).</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Luca</given-names>
            <surname>Becchetti</surname>
          </string-name>
          , Carlos Castillo, Debora Donato, Stefano Leonardi, and
          <string-name>
            <surname>Ricardo A Baeza-Yates</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Link-Based Characterization and Detection of Web Spam</article-title>
          .
          <source>In AIRWeb. 1-8.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>András</surname>
            <given-names>A Benczúr</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Károly</given-names>
            <surname>Csalogány</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Tamás</given-names>
            <surname>Sarlós</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Link-based similarity search to fight web spam</article-title>
          .
          <source>In AIRWeb</source>
          . Citeseer.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Monica</given-names>
            <surname>Bianchini</surname>
          </string-name>
          , Marco Gori, and
          <string-name>
            <given-names>Franco</given-names>
            <surname>Scarselli</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>PageRank and Web Communities</article-title>
          .
          <source>In Web Intelligence</source>
          .
          <fpage>365</fpage>
          -
          <lpage>371</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Sergey</given-names>
            <surname>Brin</surname>
          </string-name>
          and
          <string-name>
            <given-names>Lawrence</given-names>
            <surname>Page</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Reprint of: The anatomy of a large-scale hypertextual web search engine</article-title>
          .
          <source>Computer networks 56</source>
          ,
          <issue>18</issue>
          (
          <year>2012</year>
          ),
          <fpage>3825</fpage>
          -
          <lpage>3833</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Carlos</given-names>
            <surname>Castillo</surname>
          </string-name>
          , Debora Donato, Aristides Gionis, Vanessa Murdock, and
          <string-name>
            <given-names>Fabrizio</given-names>
            <surname>Silvestri</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Know your neighbors: Web spam detection using the web topology</article-title>
          .
          <source>In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. ACM</source>
          ,
          <volume>423</volume>
          -
          <fpage>430</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Manoj</surname>
            <given-names>K Chinnakotla</given-names>
          </string-name>
          ,
          Rupesh K Mehta,
          and
          <string-name>
            <given-names>Vipul</given-names>
            <surname>Agrawal</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Unsupervised Detection and Promotion of Authoritative Domains for Medical Queries in Web Search</article-title>
          .
          <source>In 11th International Conference on Natural Language Processing</source>
          .
          <volume>388</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>André</given-names>
            <surname>Luiz da Costa Carvalho</surname>
          </string-name>
          ,
          <string-name>
            <surname>Paul-Alexandru</surname>
            <given-names>Chirita</given-names>
          </string-name>
          , Edleno Silva De Moura, Pável Calado, and
          <string-name>
            <given-names>Wolfgang</given-names>
            <surname>Nejdl</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Site level noise removal for search engines</article-title>
          .
          <source>In Proceedings of the 15th international conference on World Wide Web. ACM</source>
          ,
          <volume>73</volume>
          -
          <fpage>82</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Arnaud</given-names>
            <surname>Gaudinat</surname>
          </string-name>
          , Natalia Grabar, and
          <string-name>
            <given-names>Célia</given-names>
            <surname>Boyer</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Automatic retrieval of web pages with standards of ethics and trustworthiness within a medical portal: What a page name tells us</article-title>
          .
          <source>Artificial Intelligence in Medicine</source>
          (
          <year>2007</year>
          ),
          <fpage>185</fpage>
          -
          <lpage>189</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Arnaud</surname>
            <given-names>Gaudinat</given-names>
          </string-name>
          , Natalia Grabar,
          <string-name>
            <given-names>Célia</given-names>
            <surname>Boyer</surname>
          </string-name>
          , et al.
          <year>2007</year>
          .
          <article-title>Machine learning approach for automatic quality criteria detection of health web pages</article-title>
          .
          <source>In Medinfo 2007: Proceedings of the 12th World Congress on Health (Medical) Informatics; Building Sustainable Health Systems. IOS Press</source>
          ,
          <volume>705</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Zoltán</given-names>
            <surname>Gyöngyi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Hector</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Link spam alliances</article-title>
          .
          <source>In Proceedings of the 31st international conference on Very large data bases. VLDB Endowment</source>
          ,
          <volume>517</volume>
          -
          <fpage>528</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Zoltán</given-names>
            <surname>Gyöngyi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Hector</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Link spam alliances</article-title>
          .
          <source>In Proceedings of the 31st international conference on Very large data bases. VLDB Endowment</source>
          ,
          <volume>517</volume>
          -
          <fpage>528</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Ahmed</given-names>
            <surname>Hassan</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>A Semi-supervised Approach to Modeling Web Search Satisfaction</article-title>
          .
          <source>In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '12).</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Ahmed</surname>
            <given-names>Hassan</given-names>
          </string-name>
          , Rosie Jones, and Kristina Lisa Klinkner.
          <year>2010</year>
          .
          <article-title>Beyond DCG: User Behavior As a Predictor of a Successful Search</article-title>
          .
          <source>In Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM '10).</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Ahmed</surname>
            <given-names>Hassan</given-names>
          </string-name>
          , Xiaolin Shi,
          <string-name>
            <given-names>Nick</given-names>
            <surname>Craswell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Bill</given-names>
            <surname>Ramsey</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Beyond clicks: query reformulation as a predictor of search satisfaction</article-title>
          .
          <source>In Proceedings of the 22nd ACM International Conference on Information &amp; Knowledge Management (CIKM '13).</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
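          <string-name>
            <given-names>Samuel</given-names>
            <surname>Ieong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Nina</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Eldar</given-names>
            <surname>Sadikov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Li</given-names>
            <surname>Zhang</surname>
          </string-name>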
          .
          <year>2012</year>
          .
          <article-title>Domain Bias in Web Search</article-title>
          .
          <source>In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (WSDM '12).</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Paul</given-names>
            <surname>McNamee</surname>
          </string-name>
          and
          <string-name>
            <given-names>James</given-names>
            <surname>Mayfield</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Character n-gram tokenization for European language text retrieval</article-title>
          .
          <source>Information Retrieval</source>
          <volume>7</volume>
          ,
          <issue>1</issue>
          (
          <year>2004</year>
          ),
          <fpage>73</fpage>
          -
          <lpage>97</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Gilad</given-names>
            <surname>Mishne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>David</given-names>
            <surname>Carmel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ronny</given-names>
            <surname>Lempel</surname>
          </string-name>
          , et al.
          <year>2005</year>
          .
          <article-title>Blocking Blog Spam with Language Model Disagreement</article-title>
          .
          <source>In AIRWeb</source>
          , Vol.
          <volume>5</volume>
          .
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Alexandros</given-names>
            <surname>Ntoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Marc</given-names>
            <surname>Najork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mark</given-names>
            <surname>Manasse</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Dennis</given-names>
            <surname>Fetterly</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Detecting spam web pages through content analysis</article-title>
          .
          <source>In Proceedings of the 15th international conference on World Wide Web. ACM</source>
          ,
          <fpage>83</fpage>
          -
          <lpage>92</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Guoyang</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bin</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
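          <string-name>
            <given-names>Tie-Yan</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Guang</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Shiji</given-names>
            <surname>Song</surname>
          </string-name>
          , and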
          <string-name>
            <given-names>Hang</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Detecting link spam using temporal information</article-title>
          .
          <source>In Sixth International Conference on Data Mining (ICDM '06)</source>
          . IEEE,
          <fpage>1049</fpage>
          -
          <lpage>1053</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Parikshit</given-names>
            <surname>Sondhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>VG Vinod</given-names>
            <surname>Vydiswaran</surname>
          </string-name>
          , and
          <string-name>
            <given-names>ChengXiang</given-names>
            <surname>Zhai</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Reliability Prediction of Webpages in the Medical Domain</article-title>
          .
          <source>In ECIR</source>
          , Vol.
          <volume>12</volume>
          . Springer,
          <fpage>219</fpage>
          -
          <lpage>231</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Baoning</given-names>
            <surname>Wu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Brian D</given-names>
            <surname>Davison</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Identifying link farm spam pages</article-title>
          .
          <source>In Special interest tracks and posters of the 14th international conference on World Wide Web. ACM</source>
          ,
          <fpage>820</fpage>
          -
          <lpage>829</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Baoning</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Vinay</given-names>
            <surname>Goel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Brian D</given-names>
            <surname>Davison</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Propagating Trust and Distrust to Demote Web Spam</article-title>
          .
          <source>MTW</source>
          <volume>190</volume>
          (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
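          <string-name>
            <given-names>Hui</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ashish</given-names>
            <surname>Goel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ramesh</given-names>
            <surname>Govindan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kahn</given-names>
            <surname>Mason</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Van Roy</surname>
          </string-name>
          .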
          <year>2004</year>
          .
          <article-title>Making eigenvector-based reputation systems robust to collusion</article-title>
          .
          <source>In WAW</source>
          , Vol.
          <volume>3243</volume>
          . Springer,
          <fpage>92</fpage>
          -
          <lpage>104</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>