Discovery and Promotion of Subtopic Level High Quality
             Domains for Programming Queries in Web Search
                      Arpita Das                                        Saurabh Shrivastava                           Prateek Agrawal
                  Microsoft, India                                          Microsoft, India                           Microsoft, India
               arpda@microsoft.com                                       sauras@microsoft.com                      pragraw@microsoft.com

                                               Sandeep Sahoo                               Manoj Chinnakotla
                                               Microsoft, India                               Microsoft, India
                                           sasaho@microsoft.com                            manojc@microsoft.com
ABSTRACT
With the advancement of technology in the modern era, a significant portion of the web, referred
to as the developer segment, serves to satisfy the programming related information needs of users.
User satisfaction in this segment depends not only on the relevance of the retrieved pages but also
on the domains that these pages belong to. We aim to discover sub-topic level associations between
domains and queries. We propose a supervised deep neural network based approach that uses the
click-through data of a commercial web search engine to discover and promote the domains which
provide high quality and expert level content for a query intent. Experiments show that our domain
specific ranker performs significantly well, both qualitatively and quantitatively, on real-world
coding query sets when compared with a standard web ranking baseline. This paper further
demonstrates how associating domains with query intents results in the formation of overlapping
domain clusters where the domains in each cluster represent a topical space of query intent(s).

CCS CONCEPTS
• Information systems → Page and site ranking;

KEYWORDS
domain preference, web search, user behavior

ACM Reference Format:
Arpita Das, Saurabh Shrivastava, Prateek Agrawal, Sandeep Sahoo, and Manoj Chinnakotla. 2017.
Discovery and Promotion of Subtopic Level High Quality Domains for Programming Queries in Web
Search. In Proceedings of the first International Workshop on LEARning Next gEneration Rankers,
Amsterdam, October 2017 (LEARNER 2017).

LEARNER, October 2017, Amsterdam, The Netherlands
Copyright © for this paper by its authors. Copying permitted for private and academic purposes.

1   INTRODUCTION
With the increase in the number of technologies and coding infrastructures, developers are becoming
more and more dependent on the web. A coding query may have various intents, ranging from learning
the basics of a programming language to debugging a code snippet. The most relevant search result
for a coding query depends on how well the result satisfies the query intent. For example, if the
query is about a particular function in a programming language, the developer will prefer a short
description of the function and an example code snippet; if the query is about an error code,
however, the developer is probably looking for ways to debug it. Promoting the domain serving the
correct intent will drastically improve the search engine result page (SERP).
   The entire web can be grouped into intersecting clusters of domains where every cluster
represents a latent topic space satisfying some query intent(s). Given a new query, we map the
query to the nearest topic cluster and promote the domains associated with that cluster. For
example, the query “how to format date in c#” belongs to the clusters centered around coarse topics
like “c#”, “time” and “changing date format”, which have domains like “docs.microsoft.com”,
“c-sharpcorner.com” and “dotnetperls.com” associated with them.
   We extracted coding queries from the click logs of the commercial search engine Microsoft Bing
for the past three years (2014-2016). Over this period we observed the trend of clicks for the
queries with respect to 45453 coding domains. The distribution of the clicks gathered by different
domains is not uniform, as shown in Table 1. Domains like “stackoverflow.com” and
“msdn.microsoft.com” clearly dominate the click shares. One might argue that since clicks model
user satisfaction, promoting the most clicked domains of the past year might improve the SERP.
Interestingly, this is not the case, because ultimately the satisfaction of the user depends upon
the relevance of the result with respect to the query; on the domain front too, therefore, it only
makes sense to promote the domain that satisfies the user intent. For the query “connecting
database in azure”, from an authority perspective one can assume that a developer will prefer
documents from “msdn.microsoft.com” or “docs.microsoft.com”, but a third domain named “dzone.com”
exists which contains specific information about databases and their connections and exactly
matches the query intent. Slight promotion of the third domain will result in the satisfaction of
the user. Clicks capture the high level scenario of domain preference, but we discover and promote
the domains which have a high sub-topic level association with the query intent.
   Retrieving intent specific domains is still unexplored in the research world. However, work has
been done to detect the authority, trustworthiness, etc. of domains. Traditionally, researchers
have used link structure based approaches and supervised approaches to predict the trustworthiness
of a domain. Link based approaches such as PageRank, HITS and SALSA [5] use the structure present in
the hypertext
ICTIR, Oct2017, Netherlands                                                                                                          Das et al.


               Domain                          Clicks     Domain                       Clicks
               stackoverflow.com               42.01%     ozgrid.com                   0.10%
               msdn.microsoft.com              15.23%     powershell.com               0.10%
               w3schools.com                    4.29%     pandas.pydata.org            0.09%
               social.msdn.microsoft.com        3.31%     community.spiceworks.com     0.09%
               technet.microsoft.com            2.84%     sourceforge.net              0.09%
               social.technet.microsoft.com     1.71%     getbootstrap.com             0.09%
               microsoft.com                    1.54%     mkyong.com                   0.09%
               codeproject.com                  1.41%     vbforums.com                 0.08%
               answers.microsoft.com            1.24%     webdesign.about.com          0.08%
               docs.oracle.com                  1.22%     blog.udemy.com               0.08%

Table 1: Distribution of clicks among the top-100 domains specific to coding queries. The left half shows the share of the top-10
domains while the right half shows the bottom-10 (91-100) domains.
of the web pages to identify page quality. PageRank is a well known algorithm that uses link
information to assign global importance scores to all pages on the web. Bianchini et al. pointed
out the vulnerabilities of link based algorithms to spamming [4]. Since it is possible to
artificially boost an authority score by forming an association of highly interlinked content,
content farm websites manage to get high PageRank scores. In contrast to the link based approaches,
supervised approaches are robust to manipulation of the hyperlink structure, but they are heavily
dependent on gold-standard labeled data. Obtaining large-scale human judged query-domain pairs is
extremely challenging in terms of cost and efficiency. Click logs are assumed to be a substitute
for human judged data, as clicks capture human behavior and feedback to queries. Chinnakotla et al.
and Sondhi et al. used click data from the web to learn supervised models that establish
reliability in the health segment [7, 21]. Our paper focuses on learning a signal that is a
composition of reliability, authority, etc. and serves the exact coding intent of the user, using
supervision from Bing clickthrough data.
   In this paper, we propose a novel deep learning based method to maximize the conditional
likelihood of a clicked domain for a given query intent. We train a three layered deep
convolutional neural network to project queries and domains into their corresponding semantic
spaces. The domains with minimum semantic distance from the query are slightly promoted in the
SERP. We assume that the title of each search result in the SERP is semantically relevant to the
query. We make this assumption because promoting a relevant domain does not make sense if the
document from that domain is irrelevant. For example, if the user query is “how to lowercase in
javascript”, domains like “w3schools.com”, “stackoverflow.com” and “developer.mozilla.org” should
be promoted; however, if a document with the title “how to uppercase in javascript” from
“w3schools.com” is promoted, the relevance of the search result is hampered. The key contributions
of the paper are: 1) We learn the deep correlations between domains and query intents for the
developer segment of the web. 2) We perform experiments to show how the affinity for a domain
changes with a slight change in the intent of the query. 3) We highlight how domains in the
developer segment can be clustered based on the query intents. 4) We perform qualitative and
quantitative analysis of our ranker, which incorporates the domain signal, using a large scale
coding query test set and compare it with a standard web ranking baseline.

2   RELATED WORK
Detecting query-intent specific domains for the developer segment of the web is an unexplored
research problem. However, several works have addressed the analogous research problems of
eliminating spam websites and determining domain authority, trustworthiness, bias, etc. on the web.
   Previous work on web spam removal or establishing reliability focused mostly on unsupervised
techniques for the detection of link spam (which creates tightly knit communities of links to
affect link-based ranking algorithms) and content spam (which maliciously spams the content of web
pages). Researchers worked on automatic detection of suspicious signals in the link dependencies
[1, 2, 8, 11, 20, 24] and in the content of web pages [18, 19]. Castillo et al. combined link-based
and content-based features and used the topology of the web graph, exploiting the link dependencies
among web pages, to detect spam pages [6]. Interconnections of spam farms are also exploited to
combat spam pages [3, 12, 22, 23].
   Establishing the authority of a web page was tried using supervised approaches too. In the
health domain, search results can directly impact decisions related to people’s health, so it is
imperative for search engines to provide reliable information. Chinnakotla et al., Gaudinat et al.
and Sondhi et al. employed supervised machine learning techniques to learn the notion of
trustworthiness of web pages in the health domain [7, 9, 10, 21]. Also, Hassan et al. modeled web
search satisfaction of users [13–15].
   Ieong et al. introduced domain bias, a user’s propensity to believe that a page is more relevant
just because it comes from a particular domain [16]. They demonstrated the importance of domain
preferences in web search even after factoring out position bias and relevance. This impact of
domain bias [16] motivated us to promote documents from domains satisfying the exact query intent.

3   LEARNING INTENT SPECIFIC DOMAINS
We aim to learn a signal that promotes the domains which satisfy the query intent. We use a
convolutional neural network model to learn non-linear relationships between a domain and a query
intent. Put another way, the neural network segments the queries into a set of fine grained topics
and associates the most likely domains with each topic space. Each topic space can
be considered as a representation of a set of overlapping query intent(s).
   We extracted coding queries and their clicked URLs from the Bing click logs. For feature
extraction, we used character trigram based word hashing [17]. We attach the delimiter “#” to a
word (say “pen” -> “#pen#”) and extract its letter trigrams (#pe, pen, en#). We obtained 52339
unique letter trigrams for the entire dataset of query-clicked domain pairs. We convert each word
in the query and the domain to a vector of size 52339 and record the number of occurrences of each
letter trigram in the word. This representation takes care of out-of-vocabulary words and words
with spelling errors.
   We build a convolutional neural network with three levels of alternating convolution, max
pooling and rectified linear (ReLU) layers and a fully connected layer at the top. The network
gives a non-linear projection of the query and domain vectors into their corresponding semantic
spaces. Let $x$ be the word hashed input term vector, $y$ the output vector and $h$ the number of
hidden layers used. Let $H_j$ represent the $j$-th intermediate layer, whose weight matrix is $W_j$
and bias term is $b_j$, where $j = \{1, 2, \ldots, h\}$.

   $$H_j = f(W_j H_{j-1} + b_j) \qquad (1)$$

where $j = \{2, 3, \ldots, h\}$ and $H_1 = W_1 x$, and

   $$y = f(W_h H_{h-1} + b_h) \qquad (2)$$

where we use tanh as the activation function $f$. The relevance $R(d, q)$ of a domain $d$ for a
particular query $q$ is calculated using:

   $$R(d, q) = \frac{y_d^{T} y_q}{\|y_d\| \, \|y_q\|} \qquad (3)$$

   We use the supervision of the click logs to create positive and negative samples for our
training data. We treat queries paired with their clicked domains as the positive samples ($d^{+}$),
and queries paired with a combination of domains from the SERP that were not clicked for the query
and some randomly selected domains as the negative samples ($d^{-}$). We train our network with the
objective of maximizing the conditional likelihood of the clicked domain given the queries or,
equivalently, minimizing the loss function in Equation 4.

   $$L(\Lambda) = -\log \prod_{(q, d^{+})} P(d^{+} \mid q) \qquad (4)$$

where $\Lambda$ denotes the set of parameters of our network and $P(d^{+} \mid q)$ is the posterior
probability of the clicked domain given the query.
   One might ask why, if the signal is learnt from the click logs of a search engine, the search
engine itself does not already reflect the desired behavior. We argue that the SERP of a search
engine does not depend only on the click signal; it takes other features into account too. Also,
our model does not associate a domain with a particular query; it associates a domain with a
topical space that represents query intent(s), and that topical space is learnt from a large
collection of coding queries. For example, “docs.oracle.com” is not associated with the query “read
a file in java” but with the topics “java”, “files”, etc., so when a new query “write a file in
java” arrives, “docs.oracle.com” will still be promoted.
   We combine our intent specific domain score with the relevance score of the Bing web ranker to
promote pages that are both relevant and authoritative. We take the top 50 results from the initial
retrieval and re-rank them using a scoring function designed to associate relevance and authority
(Equation 5). Let the initial ranker assign scores $\{s_1, s_2, \ldots, s_{50}\}$ to the top 50
URLs $\{u_1, u_2, \ldots, u_{50}\}$ retrieved for a query $q$. Let $\{d_1, d_2, \ldots, d_{50}\}$
be the corresponding domains extracted from these URLs. The new scoring function is defined as:

   $$\mathrm{Score}(q, u_i, d_i) = s_i + \alpha \cdot R(d_i, q) \qquad (5)$$

where $\alpha$ is the factor with which we boost the domain signal. We intentionally kept its value
small to prevent irrelevant pages from preferred domains from being promoted.

4   EXPERIMENTS AND ANALYSIS
In this section, we first describe the dataset and evaluation metric used in our experiments. We
then present some interesting analysis that we can infer from the results.
   Dataset Details. We collected the past three years of Bing click logs and extracted queries of
coding intent from them. We obtained 103 million unique query-clicked domain pairs for training the
neural network. We preprocess every query by lower-casing it and removing stop words; we preserve
the special characters, as they are important in the coding domain. To preprocess domains, we
lower-case them and remove prefixes like ‘http’, ‘https’, ‘www’, ‘ftp’, etc. if present. We run our
re-ranking function on a set of 20,000 new coding queries from the logs of 2017. From this set we
randomly sample 400 queries for which our ranking logic introduces changes in the top 10 results
and use them as the test set. We evaluate the performance of the scoring function using our domain
signal on these test queries against the current Bing ranking baseline.
   Evaluation Metric. As pointed out by [7], standard IR metrics such as NDCG are not suitable for
evaluating a domain based signal. We also wanted to obtain a whole page comparison of the baseline
and the treatment; therefore we chose the evaluation metric “Surplus” proposed by [7]. Following a
similar setting, we show the top 10 results of the baseline and of the treatment to a human judge
in two separate tabs in a single window. The judge gives a rating on a seven-point scale: Left Much
Better, Left Better, Left Slightly Better, Neutral, Right Slightly Better, Right Better and Right
Much Better. We obtained three judgments per query for all the 400 queries in the test set to abate
human judgment errors. Surplus for n queries is defined as:

   $$\mathrm{Surplus} = \frac{n_W - n_L}{n_W + n_L + n_T} \times 100 \qquad (6)$$

where the technique scores $n_W$ wins, $n_L$ losses and $n_T$ ties.
   The final metrics used for measurement are $\mathrm{Surplus}_{strong}$, where strong wins and
losses are used, and $\mathrm{Surplus}_{weak}$, where weak wins and losses are used. A good surplus
on a large query set implies that the technique is performing well with respect to the baseline.

               Query set     Number of Queries     Surplus_strong     Surplus_weak
               Test set      400                   1.486              9.807

Table 2: The table compares the performance of our re-ranking technique with Baseline web ranker on the test set. Results
marked in boldface indicate that the surplus was found to be statistically significant over the baseline at 95% confidence level
(p < 0.0001). W/L/T denote the number of Wins, Losses and Ties observed.

               Query: Page break html                        Query: excel vba protect sheet
               Baseline          1. cybertext.com            Baseline          1. support.office.com
                                 2. lvsys.com                                  2. msdn.microsoft.com
                                 3. w3schools.com                              3. analysistabs.com
               Our Technique     1. w3schools.com            Our Technique     1. msdn.microsoft.com
               (strong win)      2. stackoverflow.com        (weak win)        2. support.office.com
                                 3. msdn.microsoft.com                         3. mrexcel.com

Table 3: This table compares the top 3 domains shown by baseline and our technique.

               Query                                 Top Host               Query                 Top Host
               c# string                             msdn.microsoft.com     oop in python         docs.python.org
               c# string out of memory exception     stackoverflow.com      oop in javascript     developer.mozilla.org
               c# string tutorial                    tutorialspoint.com     oop in c++            tutorialspoint.com

Table 4: This table shows how a slight change in query intent changes the affinity for most relevant domain.

   Results and Analysis. The result of our technique with respect to the baseline is shown in
Table 2. Our technique shows significant gains in weak and strong surplus over the baseline web
ranker. Table 3 illustrates the qualitative analysis of our technique. For the query “page break in
html” we promote “w3schools.com” (which caters to the query intent in the topical space of “web
page structuring in html”) over domains like “cybertext.com”, “lvsys.com”, etc. For the second
query, “excel vba protect sheet”, apart from promoting “msdn.microsoft.com” over
“support.office.com”, we also promote “mrexcel.com” (which has specialized content on Excel) over
“analysistabs.com”.
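   The re-ranking step behind such promotions can be illustrated with a short sketch of the
additive scoring function (Equation 5). It is not the production ranker: the helper names
`extract_domain` and `rerank`, the name `alpha` for the boost factor, and every score and
relevance value below are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlsplit


def extract_domain(url: str) -> str:
    """Lower-case a URL and drop its scheme and a leading 'www.' prefix."""
    url = url.strip().lower()
    # urlsplit only recognises the host when a scheme or '//' is present.
    host = urlsplit(url if "//" in url else "//" + url).netloc
    return host[4:] if host.startswith("www.") else host


def rerank(results, relevance, alpha=0.1):
    """Re-score (url, base_score) pairs as base_score + alpha * R(domain, query).

    `relevance` maps a domain to its (assumed, precomputed) relevance R(d, q)
    for the current query; unknown domains get no boost.
    """
    rescored = []
    for url, base_score in results:
        domain = extract_domain(url)
        rescored.append((url, base_score + alpha * relevance.get(domain, 0.0)))
    # sorted() is stable, so ties keep their baseline order.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Hypothetical baseline scores and domain relevances for one query.
baseline = [
    ("http://cybertext.com/", 0.62),
    ("http://lvsys.com/", 0.60),
    ("https://www.w3schools.com/", 0.59),
]
relevance = {"cybertext.com": 0.2, "lvsys.com": 0.3, "w3schools.com": 0.9}

for url, score in rerank(baseline, relevance, alpha=0.1):
    print(f"{score:.3f}  {url}")
```

   With these made-up numbers a small boost is enough to move the third result to the top, while
`alpha=0` leaves the baseline order unchanged, which is the reason for keeping the boost factor
small.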
   In the process of associating domains with query intents, we
found that our model inherently clusters domains whose content lies
in a similar topic space. We show two such clusters in Figure 1. While
searching for domains similar to “stackoverflow.com”, we observe
that other forums and question-answering platforms such as
“social.msdn.microsoft.com”, “forums.asp.net”, “answers.microsoft.com”,
“superuser.com”, etc. come up as the closest ones. Similarly, when
we searched for domains similar to “w3schools.com”, domains such
as “developer.mozilla.org”, “tizag.com”, “webdesign.about.com”, etc.
were retrieved. Interestingly, all of these domains can be associated
with a common topic space catering to queries around designing web
pages.
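   A lookup of this kind can be sketched as a nearest-neighbour search under cosine similarity
(the same measure as Equation 3). The snippet is a minimal illustration only: the function names
and the three-dimensional vectors are invented for the example and are not the embeddings learnt
by the network.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_domains(target, embeddings, k=3):
    """Return the k domains whose vectors are closest to `target` by cosine similarity."""
    scored = [
        (cosine(embeddings[target], vec), name)
        for name, vec in embeddings.items()
        if name != target
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Toy, hand-made vectors: the first axis loosely stands for "Q&A forum",
# the second for "web design tutorial". They are assumptions, not model output.
toy_embeddings = {
    "stackoverflow.com": [0.9, 0.1, 0.0],
    "superuser.com": [0.8, 0.2, 0.1],
    "forums.asp.net": [0.7, 0.3, 0.0],
    "w3schools.com": [0.1, 0.9, 0.1],
    "developer.mozilla.org": [0.2, 0.8, 0.2],
    "tizag.com": [0.0, 0.9, 0.2],
}

print(nearest_domains("stackoverflow.com", toy_embeddings, k=2))
print(nearest_domains("w3schools.com", toy_embeddings, k=2))
```

   In this toy space the forum-like domains end up next to “stackoverflow.com” and the tutorial
sites next to “w3schools.com”, mirroring the two clusters described above.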
   Another interesting observation that we came across is how a
slight modification in the query can change the affinity of domains
containing relevant results. In Table 4, we demonstrate this
along two verticals. The left side portrays how a small change in
query intent, with the same target coding language, changes the top
retrieved domain, whereas the right side depicts how a change
in target coding language, with the same developer intent, changes the
top retrieved domain.
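   The queries in Table 4 differ by only a token or two, so their word-hashed inputs overlap
heavily. The following minimal sketch of letter-trigram word hashing (Section 3) makes that overlap
visible. It is an approximation under stated assumptions: the function names are hypothetical, a
sparse `Counter` stands in for the fixed-size trigram vector, and summing the counts of all words
in a query is a simplification made here for brevity.

```python
from collections import Counter


def letter_trigrams(word):
    """Wrap a word in '#' delimiters and return its letter trigrams in order."""
    marked = f"#{word}#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]


def hash_query(text):
    """Count letter trigrams over all whitespace-separated words of a lower-cased query."""
    counts = Counter()
    for word in text.lower().split():
        counts.update(letter_trigrams(word))
    return counts


print(letter_trigrams("pen"))

short = hash_query("c# string")
longer = hash_query("c# string tutorial")
shared = set(short) & set(longer)
print(f"{len(shared)} of {len(longer)} distinct trigrams are shared")
```

   Every trigram of “c# string” also occurs in “c# string tutorial”; only the trigrams of the
extra word separate the two inputs.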

Figure 1: Examples of domain based clusters. Each cluster captures topicality of underlying query intent(s).

5   CONCLUSIONS
In this paper, we proposed a novel deep learning based supervised technique to promote intent
specific domains in the developer segment using Bing click logs. The evaluation metric “Surplus”
shows that our method performs better than the baseline web ranking algorithm. The experiments also
show that our model segments the queries into a set of topic based clusters and associates domains
with each cluster. The topicality of a cluster is a representation of some coarse level of query
intent which the developer is looking for.
   The proposed approach is re-usable and scalable in nature. Currently we have worked in the
developer segment, but this work can be extended to any domain. As part of future work, we plan to
learn a domain signal for the entire web. Currently, we assume that the SERP contains relevant
pages and that a slight re-ranking of pages based on domain will satisfy the users. In future, we
plan to learn a signal which is a composition of query-title relevance and intent-specific domain
preference and use it to re-rank results on the web with more impact.


REFERENCES
 [1] Brian Amento, Loren Terveen, and Will Hill. 2000. Does “Authority” Mean Quality? Predicting
     Expert Quality Ratings of Web Documents. In Proceedings of the 23rd Annual International ACM
     SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’00).
 [2] Luca Becchetti, Carlos Castillo, Debora Donato, Stefano Leonardi, and Ricardo A Baeza-Yates.
     2006. Link-Based Characterization and Detection of Web Spam. In AIRWeb. 1–8.
 [3] András A Benczúr, Károly Csalogány, and Tamás Sarlós. 2006. Link-based similarity search to
     fight web spam. In AIRWeb. Citeseer.
 [4] Monica Bianchini, Marco Gori, and Franco Scarselli. 2003. PageRank and Web Communities. In
     Web Intelligence. 365–371.
 [5] Sergey Brin and Lawrence Page. 2012. Reprint of: The anatomy of a large-scale hypertextual web
     search engine. Computer networks 56, 18 (2012), 3825–3833.
 [6] Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock, and Fabrizio Silvestri.
     2007. Know your neighbors: Web spam detection using the web topology. In Proceedings of the
     30th annual international ACM SIGIR conference on Research and development in information
     retrieval. ACM, 423–430.
 [7] Manoj K Chinnakotla, Rupesh K Mehta, and Vipul Agrawal. 2014. Unsupervised Detection and
     Promotion of Authoritative Domains for Medical Queries in Web Search. In 11th International
     Conference on Natural Language Processing. 388.
 [8] André Luiz da Costa Carvalho, Paul-Alexandru Chirita, Edleno Silva De Moura, Pável Calado, and
     Wolfgang Nejdl. 2006. Site level noise removal for search engines. In Proceedings of the 15th
     international conference on World Wide Web. ACM, 73–82.
 [9] Arnaud Gaudinat, Natalia Grabar, and Célia Boyer. 2007. Automatic retrieval of web pages with
     standards of ethics and trustworthiness within a medical portal: What a page name tells us.
     Artificial Intelligence in Medicine (2007), 185–189.
[10] Arnaud Gaudinat, Natalia Grabar, Célia Boyer, et al. 2007. Machine learning approach for
     automatic quality criteria detection of health web pages. In Medinfo 2007: Proceedings of the
     12th World Congress on Health (Medical) Informatics; Building Sustainable Health Systems. IOS
     Press, 705.
[11] Zoltán Gyöngyi and Hector Garcia-Molina. 2005. Link spam alliances. In Proceedings of the 31st
     international conference on Very large data bases. VLDB Endowment, 517–528.
[12] Zoltán Gyöngyi and Hector Garcia-Molina. 2005. Link spam alliances. In Proceedings of the 31st
     international conference on Very large data bases. VLDB Endowment, 517–528.
[13] Ahmed Hassan. 2012. A Semi-supervised Approach to Modeling Web Search Satisfaction. In
     Proceedings of the 35th International ACM SIGIR Conference on Research and Development in
     Information Retrieval (SIGIR ’12).
[14] Ahmed Hassan, Rosie Jones, and Kristina Lisa Klinkner. 2010. Beyond DCG: User Behavior As a
     Predictor of a Successful Search. In Proceedings of the Third ACM International Conference on
     Web Search and Data Mining (WSDM ’10).
[15] Ahmed Hassan, Xiaolin Shi, Nick Craswell, and Bill Ramsey. 2013. Beyond clicks: query
     reformulation as a predictor of search satisfaction. In Proceedings of the 22nd ACM
     international conference on Conference on information & knowledge management (CIKM ’13).
[16] Samuel Ieong, Nina Mishra, Eldar Sadikov, and Li Zhang. 2012. Domain Bias in Web Search. In
     Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (WSDM ’12).
[17] Paul Mcnamee and James Mayfield. 2004. Character n-gram tokenization for European language
     text retrieval. Information retrieval 7, 1 (2004), 73–97.
[18] Gilad Mishne, David Carmel, Ronny Lempel, et al. 2005. Blocking Blog Spam with Language Model
     Disagreement. In AIRWeb, Vol. 5. 1–6.
[19] Alexandros Ntoulas, Marc Najork, Mark Manasse, and Dennis Fetterly. 2006. Detecting spam web
     pages through content analysis. In Proceedings of the 15th international conference on World
     Wide Web. ACM, 83–92.
[20] Guoyang Shen, Bin Gao, Tie-Yan Liu, Guang Feng, Shiji Song, and Hang Li. 2006. Detecting link
     spam using temporal information. In Data Mining, 2006. ICDM’06. Sixth International Conference
     on. IEEE, 1049–1053.
[21] Parikshit Sondhi, VG Vinod Vydiswaran, and ChengXiang Zhai. 2012. Reliability Prediction of
     Webpages in the Medical Domain. In ECIR, Vol. 12. Springer, 219–231.
[22] Baoning Wu and Brian D Davison. 2005. Identifying link farm spam pages. In Special interest
     tracks and posters of the 14th international conference on World Wide Web. ACM, 820–829.
[23] Baoning Wu, Vinay Goel, and Brian D Davison. 2006. Propagating Trust and Distrust to Demote
     Web Spam. MTW 190 (2006).
[24] Hui Zhang, Ashish Goel, Ramesh Govindan, Kahn Mason, and Benjamin Van Roy. 2004. Making
     eigenvector-based reputation systems robust to collusion. In WAW, Vol. 3243. Springer, 92–104.