Discovery and Promotion of Subtopic Level High Quality Domains for Programming Queries in Web Search

Arpita Das, Microsoft, India
Saurabh Shrivastava, Microsoft, India
Prateek Agrawal, Microsoft, India
Sandeep Sahoo, Microsoft, India
Manoj Chinnakotla, Microsoft, India

2017

With the advancement of technology in the modern era, a significant portion of the web, referred to as the developer segment, serves the programming-related information needs of users. User satisfaction in this segment depends not only on the relevance of the retrieved pages but also on the domains these pages belong to. We aim to discover sub-topic level associations between domains and queries. We propose a supervised deep neural network based approach that uses the click-through data of a commercial web search engine to discover and promote the domains that provide high-quality, expert-level content for a query intent. Experiments show that our domain-specific ranker performs significantly better than a standard web ranking baseline, both qualitatively and quantitatively, on real-world coding query sets. This paper further demonstrates how associating domains with query intents results in overlapping domain clusters, where the domains in each cluster represent a topical space of query intent(s).

CCS CONCEPTS

• Information systems → Page and site ranking

KEYWORDS

domain preference, web search, user behavior

Proceedings of the First International Workshop on LEARning Next gEneration Rankers (LEARNER 2017), October 2017, Amsterdam, The Netherlands.

INTRODUCTION

With the increase in the number of technologies and coding infrastructures, developers are becoming increasingly dependent on the web. A coding query may have various intents, ranging from learning the basics of a programming language to debugging a code snippet. The most relevant search result for a coding query depends on how well the result satisfies the query intent. For example, if the query is about a particular function in a programming language, the developer will prefer a short description of the function and an example code snippet; if the query is about an error code, the developer is probably looking for ways to debug it. Promoting the domain that serves the correct intent can substantially improve the search engine result page (SERP).

The entire web can be grouped into intersecting clusters of domains, where every cluster represents a latent topic space satisfying some query intent(s). Given a new query, we map the query to the nearest topic cluster and promote the domains associated with that cluster. For example, the query “how to format date in c#” belongs to the clusters centered around coarse topics like “c#”, “time” and “changing date format”, which have domains like “docs.microsoft.com”, “c-sharpcorner.com” and “dotnetperls.com” associated with them.

We extracted coding queries from the click logs of the commercial search engine Microsoft Bing for the past three years (2014-2016). Over this period we observed the click trends for these queries across 45,453 coding domains. The distribution of clicks across domains is not uniform, as shown in Table 1: domains like “stackoverflow.com” and “msdn.microsoft.com” clearly dominate the click share. One might argue that, since clicks model user satisfaction, promoting the most clicked domains of the past year might improve the SERP. Interestingly, this is not the case: the user's satisfaction ultimately depends on the relevance of the result to the query, so on the domain front, too, it only makes sense to promote the domain that satisfies the user intent. For the query “connecting database in azure”, from an authority perspective one can assume that a developer will prefer documents from “msdn.microsoft.com” or “docs.microsoft.com”, but a third domain, “dzone.com”, contains specific information about databases and their connections that exactly matches the query intent. A slight promotion of this third domain will improve user satisfaction. Clicks capture the high-level picture of domain preference, whereas we discover and promote the domains that have a high sub-topic level association with the query intent.

Retrieving intent-specific domains is still unexplored in research. However, work has been done on detecting the authority, trustworthiness, etc. of domains. Traditionally, researchers have used link-structure based approaches and supervised approaches to predict the trustworthiness of a domain. Link-based approaches such as PageRank, HITS and SALSA [5] use the structure present in hypertext.

Table 1
Domain | Clicks
stackoverflow.com |
msdn.microsoft.com |
w3schools.com |
social.msdn.microsoft.com |
technet.microsoft.com |
social.technet.microsoft.com |
microsoft.com |
codeproject.com |
answers.microsoft.com |
docs.oracle.com |

Detecting query-intent specific domains for the developer segment of the web is an unexplored research problem. However, considerable work has been done on the analogous problems of eliminating spam websites and determining domain authority, trustworthiness, bias, etc. on the web.

Previous work on web spam removal and on establishing reliability focused mostly on unsupervised techniques for detecting link spam (which creates tightly knit communities of links to affect link-based ranking algorithms) and content spam (which maliciously spams the content of web pages). Researchers worked on automatic detection of suspicious signals in the link dependencies [1, 2, 8, 11, 20, 24] and in the content of web pages [18, 19]. Castillo et al. combined link-based and content-based features and used the topology of the web graph, exploiting the link dependencies among web pages, to detect spam pages [6]. Interconnections among spam farms have also been exploited to combat spam pages [3, 12, 22, 23].

Supervised approaches have also been used to establish the authority of a web page. In the health domain, search results can directly impact decisions related to people's health, so it is imperative for search engines to provide reliable information. Chinnakotla et al., Gaudinat et al. and Sondhi et al. employed machine learning techniques to learn the notion of trustworthiness of web pages in the health domain [7, 9, 10, 21]. Hassan et al. modeled the web search satisfaction of users [13-15].

Ieong et al. introduced the notion of domain bias: a user's propensity to believe that a page is more relevant just because it comes from a particular domain [16]. They demonstrated the importance of domain preferences in web search even after factoring out position bias and relevance. The impact of domain bias [16] motivated us to promote documents from domains that satisfy the exact query intent.

We aim to learn a signal that promotes the domains which satisfy the query intent. We use a convolutional neural network model to learn non-linear relationships between a domain and a query intent. Put another way, the neural network segments the queries into a set of fine-grained topics and associates the most likely domains with each topic space. Each topic space can be considered a representation of a set of overlapping query intent(s).

We extracted coding queries and their clicked URLs from the Bing click logs. For feature extraction, we used character-trigram based word hashing [17]. We attach the delimiter “#” to a word (say “pen” -> “#pen#”) and extract its letter trigrams (#pe, pen, en#). We obtained 52,339 unique letter trigrams for the entire dataset of query-clicked domain pairs. We convert each word in the query and the domain to a vector of size 52,339 that records the number of occurrences of each letter trigram in the word. This representation handles out-of-vocabulary words and words with spelling errors.
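The word hashing step can be illustrated with a minimal Python sketch. The helper names (`letter_trigrams`, `build_vocab`, `hash_word`) and the two-string toy corpus are assumptions made for illustration; the sketch stores counts sparsely rather than as a dense vector of size 52,339.

```python
from collections import Counter


def letter_trigrams(word):
    """Return the letter trigrams of a word wrapped in '#' delimiters."""
    marked = f"#{word}#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]


def build_vocab(texts):
    """Assign a fixed index to every letter trigram seen in the texts."""
    trigrams = sorted(
        {t for text in texts for w in text.split() for t in letter_trigrams(w)}
    )
    return {t: i for i, t in enumerate(trigrams)}


def hash_word(word, vocab):
    """Sparse count vector (trigram index -> count); unseen trigrams are skipped."""
    counts = Counter(letter_trigrams(word))
    return {vocab[t]: c for t, c in counts.items() if t in vocab}


# Toy corpus standing in for the query-clicked domain pairs.
vocab = build_vocab(["format date in c#", "docs.microsoft.com"])
print(letter_trigrams("pen"))
print(hash_word("date", vocab))
```

Because a word is represented only through its trigrams, a misspelled or unseen word still activates the trigrams it shares with known words.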

We build a convolutional neural network with three levels of alternating convolution, max pooling and rectified linear (ReLU) layers, and a fully connected layer at the top. The network gives a non-linear projection of the query and domain vectors into their corresponding semantic spaces. Let $x$ be the word-hashed input term vector, $y$ the output vector and $h$ the number of hidden layers. Let $H_j$ denote the $j$-th intermediate layer, with weight matrix $W_j$ and bias term $b_j$, where $j \in \{1, 2, \ldots, h\}$. Then

$$H_1 = W_1 x, \qquad H_j = f(W_j H_{j-1} + b_j), \quad j \in \{2, 3, \ldots, h\} \tag{1}$$

$$y = f(W_h H_{h-1} + b_h) \tag{2}$$

where we use tanh as the activation function $f$. The relevance $R(d, q)$ of a domain $d$ for a particular query $q$ is calculated as the cosine similarity of their output vectors $y_d$ and $y_q$:

$$R(d, q) = \frac{y_d^{T} y_q}{\lVert y_d \rVert \, \lVert y_q \rVert} \tag{3}$$
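The projection and the cosine relevance can be sketched in plain Python as follows. This covers only the fully connected recurrence with tanh described by the equations, not the convolution and max pooling layers, and the 2x2 weights are made-up toy values rather than trained parameters.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def project(x, weights, biases):
    """H_1 = W_1 x, then H_j = tanh(W_j H_{j-1} + b_j) for the remaining layers."""
    h = matvec(weights[0], x)
    for W, b in zip(weights[1:], biases[1:]):
        h = [math.tanh(z + bj) for z, bj in zip(matvec(W, h), b)]
    return h


def relevance(y_d, y_q):
    """Cosine similarity R(d, q) between domain and query output vectors."""
    dot = sum(a * b for a, b in zip(y_d, y_q))
    norm = math.sqrt(sum(a * a for a in y_d)) * math.sqrt(sum(b * b for b in y_q))
    return dot / norm if norm else 0.0


# Toy two-layer network; the first bias entry is unused because H_1 has no bias.
weights = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, -0.5], [0.25, 0.75]]]
biases = [[0.0, 0.0], [0.0, 0.1]]
y_q = project([1.0, 2.0], weights, biases)
y_d = project([2.0, 1.0], weights, biases)
print(relevance(y_d, y_q))
```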

We use the click logs as supervision to create positive and negative samples for our training data. We treat queries paired with their clicked domains as positive samples ($d^+$). As negative samples ($d^-$), we pair queries with domains from the SERP that were not clicked for the query, together with some randomly selected domains. We train our network to maximize the conditional likelihood of the clicked domain given the query or, equivalently, to minimize the loss function in Equation 4:

$$L(\Lambda) = -\log \prod_{(q, d^+)} P(d^+ \mid q) \tag{4}$$

where $\Lambda$ denotes the set of parameters of our network and $P(d^+ \mid q)$ is the posterior probability of the clicked domain given the query.
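A minimal sketch of this objective is given below. The text does not spell out how the posterior is computed, so the sketch assumes a softmax over the relevance scores of the clicked domain and the sampled negative domains, with a hypothetical smoothing factor `gamma`; both are assumptions, not details confirmed by the paper.

```python
import math


def posterior_clicked(r_pos, r_negs, gamma=10.0):
    """Assumed P(d+ | q): softmax over one clicked and several unclicked domains."""
    scores = [gamma * r for r in [r_pos] + list(r_negs)]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return exps[0] / sum(exps)


def loss(batch, gamma=10.0):
    """Negative log-likelihood: -log of the product equals -sum of the logs."""
    return -sum(
        math.log(posterior_clicked(r_pos, r_negs, gamma)) for r_pos, r_negs in batch
    )


# Each item: (relevance of the clicked domain, relevances of negative domains).
good = [(0.9, [0.1, 0.2])]
bad = [(0.1, [0.9, 0.2])]
print(loss(good), loss(bad))
```

The loss is small when the clicked domain scores higher than the negatives and grows when an unclicked domain outranks it.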

One might ask why, if the signal is learnt from the click logs of a search engine, the search engine itself does not already reflect the desired behavior. We argue that the SERP of a search engine does not depend on the click signal alone; it takes other features into account too. Also, our model does not associate a domain with a particular query. It associates a domain with a topical space that represents query intent(s), and that topical space is learnt from a large collection of coding queries. For example, “docs.oracle.com” is not associated with the query “read a file in java” but with the topics “java”, “files”, etc., so when a new query “write a file in java” arrives, “docs.oracle.com” will still be promoted.

We combine our intent-specific domain score with the relevance score of Bing's web ranker to promote pages that are both relevant and authoritative. We take the top 50 results from the initial retrieval and re-rank them using a scoring function designed to combine relevance and authority (Equation 5). Let the initial ranker assign scores $\{s_1, s_2, \ldots, s_{50}\}$ to the top 50 URLs $\{u_1, u_2, \ldots, u_{50}\}$ retrieved for a query $q$, and let $\{d_1, d_2, \ldots, d_{50}\}$ be the corresponding domains extracted from these URLs. The new scoring function is defined as:

$$\mathrm{score}(q, u_i, d_i) = s_i + \lambda \cdot R(d_i, q) \tag{5}$$

where $\lambda$ is the factor by which we boost the domain signal. We intentionally keep its value small to prevent irrelevant pages from preferred domains from being promoted.
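The re-ranking step can be sketched as below. The function names, the `boost` value of 0.1, the URLs and the toy relevance table are all illustrative assumptions; in the paper the domain relevance comes from the trained network and the base scores from the Bing ranker.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Extract the lower-cased host of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def rerank(results, domain_relevance, boost=0.1):
    """Re-score the top 50 (url, base_score) pairs as base_score + boost * R(d, q)."""
    top = results[:50]
    rescored = [
        (url, score + boost * domain_relevance(domain_of(url))) for url, score in top
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Toy example: made-up base scores and domain relevances for one query.
results = [
    ("https://example.org/page-break", 0.80),
    ("https://www.w3schools.com/howto/page-break", 0.78),
]
toy_relevance = {"w3schools.com": 0.9, "example.org": 0.1}
ranked = rerank(results, lambda d: toy_relevance.get(d, 0.0), boost=0.1)
print([url for url, _ in ranked])
```

With a small boost, a domain with high intent relevance overtakes a close competitor, while a page with a much lower base score stays below it.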

EXPERIMENTS AND ANALYSIS

In this section, we first describe the dataset and the evaluation metric used in our experiments. We then present some interesting analysis that can be inferred from the results.

Dataset Details. We collected the past three years of Bing click logs and extracted queries with coding intent from them. We obtained 103 million unique query-clicked domain pairs for training the neural network. We preprocess every query by lower-casing it and removing stop words; we preserve special characters, as they are important in the coding domain. We preprocess domains by lower-casing them and removing prefixes like ‘http’, ‘https’, ‘www’, ‘ftp’, etc. if present. We run our re-ranking function on a set of 20,000 new coding queries from the logs of 2017. From this set, we randomly sample 400 queries for which our ranking logic introduces changes in the top 10 results and use them as the test set. We evaluate the performance of the scoring function using our domain signal on these test queries against the current Bing ranking baseline.
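The preprocessing described above can be sketched as follows. The stop-word list is a tiny illustrative stand-in (the actual list used is not given), and the helper names are assumptions.

```python
import re

# Illustrative stop words only; the list used in the experiments is not specified.
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "how", "is"}


def preprocess_query(query):
    """Lower-case, drop stop words, and keep special characters such as '#'."""
    return " ".join(t for t in query.lower().split() if t not in STOP_WORDS)


def preprocess_domain(domain):
    """Lower-case and strip scheme ('http', 'https', 'ftp') and 'www.' prefixes."""
    domain = domain.lower()
    domain = re.sub(r"^(https?|ftp)://", "", domain)
    return re.sub(r"^www\.", "", domain)


print(preprocess_query("How to format date in C#"))
print(preprocess_domain("HTTPS://www.Docs.Microsoft.com"))
```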

Evaluation Metric. As pointed out by [7], standard IR metrics such as NDCG are not suitable for evaluating a domain-based signal. We also wanted a whole-page comparison of the baseline and the treatment, so we chose the evaluation metric “Surplus” proposed by [7]. Following a similar setting, we show the top 10 baseline and treatment results to a human judge in two separate tabs of a single window. The judge rates them on a seven-point scale: Left Much Better, Left Better, Left Slightly Better, Neutral, Right Slightly Better, Right Better and Right Much Better. We obtained three judgments per query for all 400 queries in the test set to reduce human judgment errors. Surplus for n queries is defined as:

$$\mathrm{Surplus} = \frac{n_W - n_L}{n_W + n_L + n_T} \times 100 \tag{6}$$

where the technique scores $n_W$ wins, $n_L$ losses and $n_T$ ties.

The final metrics used for measurement are $\mathrm{Surplus}_{strong}$, where strong wins/losses are used, and $\mathrm{Surplus}_{weak}$, where weak wins/losses are used. A good surplus on a large query set implies that the technique performs well with respect to the baseline.
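As a worked example, Surplus can be computed from win, loss and tie counts as in the sketch below. The counts are made up for illustration and are not results from the paper.

```python
def surplus(n_wins, n_losses, n_ties):
    """Surplus = (wins - losses) / (wins + losses + ties) * 100."""
    total = n_wins + n_losses + n_ties
    if total == 0:
        raise ValueError("no judged queries")
    return 100 * (n_wins - n_losses) / total


# Hypothetical counts for a 400-query test set.
print(surplus(n_wins=200, n_losses=120, n_ties=80))
```

A positive value means the treatment wins more often than it loses; a value of zero means wins and losses balance out.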

Results and Analysis. The results of our technique with respect to the baseline are shown in Table 2. Our technique shows significant gains in weak and strong surplus over the baseline web ranker. Table 3 illustrates the qualitative analysis of our technique. For the query “page break in html” we promote “w3schools.com” (which caters to the query intent in the topical space of “web page structuring in html”) over domains like “cybertext.com”, “lvsys.com”, etc. For the second query, “excel vba protect sheet”, apart from promoting “msdn.microsoft.com” over “support.office.com”, we also promote “mrexcel.com” (which has specialized content on Excel) over “analysistabs.com”.

Table 2
Query set | Number of Queries | Surplus_strong | Surplus_weak
Test set | 400 | |

In the process of associating domains with query intents, we found that our model inherently clusters domains whose content lies in a similar topic space. We show two such clusters in Figure 1. When searching for domains similar to “stackoverflow.com”, we observe that other forums and question-answering platforms such as “social.msdn.microsoft.com”, “forums.asp.net”, “answers.microsoft.com”, “superuser.com”, etc. come up as the closest ones. Similarly, when searching for domains similar to “w3schools.com”, domains such as “developer.mozilla.org”, “tizag.com”, “webdesign.about.com”, etc. were retrieved. Interestingly, all of these domains can be associated with a common topic space catering to queries about designing web pages.
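One plausible way to run such a similar-domain lookup is a nearest-neighbour search over the learnt domain vectors. The sketch below assumes cosine similarity and uses made-up two-dimensional embeddings; neither the similarity measure nor the vectors are taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_domains(target, embeddings, k=3):
    """Return the k domains whose vectors are closest to the target's vector."""
    others = [d for d in embeddings if d != target]
    others.sort(key=lambda d: cosine(embeddings[target], embeddings[d]), reverse=True)
    return others[:k]


# Made-up embeddings: one axis loosely "Q&A forum", the other "web design".
embeddings = {
    "stackoverflow.com": [1.0, 0.1],
    "superuser.com": [0.9, 0.2],
    "w3schools.com": [0.1, 1.0],
    "developer.mozilla.org": [0.2, 0.9],
}
print(nearest_domains("stackoverflow.com", embeddings, k=1))
print(nearest_domains("w3schools.com", embeddings, k=1))
```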

Another interesting observation is how a slight modification of the query can change the affinity of the domains containing relevant results. In Table 4, we demonstrate this along two verticals. The left side shows how a small change in query intent, with the same target coding language, changes the top retrieved domain, whereas the right side shows how a change in target coding language, with the same developer intent, changes the top retrieved domain.

CONCLUSIONS

In this paper, we proposed a novel deep learning based supervised technique to promote intent-specific domains in the developer segment using Bing click logs. Evaluation with the “Surplus” metric shows that our method performs better than the baseline web ranking algorithm. Our experiments indicate that the model segments the queries into a set of topic-based clusters and associates domains with each cluster. The topicality of a cluster represents a coarse level of the query intent that the developer is looking for.

The proposed approach is reusable and scalable. We have so far worked in the developer segment, but this work can be extended to any domain. As part of future work, we plan to learn a domain signal for the entire web. Currently, we assume that the SERP contains relevant pages and that a slight re-ranking of pages based on domain will satisfy the users. In the future, we plan to learn a signal that combines query-title relevance and intent-specific domain preference, and to use it to re-rank web results with more impact.

REFERENCES

[1] Brian Amento, Loren Terveen, and Will Hill. 2000. Does “Authority” Mean Quality? Predicting Expert Quality Ratings of Web Documents. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '00).

[2] Luca Becchetti, Carlos Castillo, Debora Donato, Stefano Leonardi, and Ricardo A. Baeza-Yates. 2006. Link-Based Characterization and Detection of Web Spam. In AIRWeb. 1-8.

[3] András A. Benczúr, Károly Csalogány, and Tamás Sarlós. 2006. Link-based similarity search to fight web spam. In AIRWeb. Citeseer.

[4] Monica Bianchini, Marco Gori, and Franco Scarselli. 2003. PageRank and Web Communities. In Web Intelligence. 365-371.

[5] Sergey Brin and Lawrence Page. 2012. Reprint of: The anatomy of a large-scale hypertextual web search engine. Computer Networks 56, 18 (2012), 3825-3833.

[6] Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock, and Fabrizio Silvestri. 2007. Know your neighbors: Web spam detection using the web topology. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 423-430.

[7] Manoj K. Chinnakotla, Rupesh K. Mehta, and Vipul Agrawal. 2014. Unsupervised Detection and Promotion of Authoritative Domains for Medical Queries in Web Search. In 11th International Conference on Natural Language Processing. 388.

[8] André Luiz da Costa Carvalho, Paul-Alexandru Chirita, Edleno Silva De Moura, Pável Calado, and Wolfgang Nejdl. 2006. Site level noise removal for search engines. In Proceedings of the 15th International Conference on World Wide Web. ACM, 73-82.

[9] Arnaud Gaudinat, Natalia Grabar, and Célia Boyer. 2007. Automatic retrieval of web pages with standards of ethics and trustworthiness within a medical portal: What a page name tells us. Artificial Intelligence in Medicine (2007), 185-189.

[10] Arnaud Gaudinat, Natalia Grabar, Célia Boyer, et al. 2007. Machine learning approach for automatic quality criteria detection of health web pages. In Medinfo 2007: Proceedings of the 12th World Congress on Health (Medical) Informatics; Building Sustainable Health Systems. IOS Press, 705.

[11] Zoltán Gyöngyi and Hector Garcia-Molina. 2005. Link spam alliances. In Proceedings of the 31st International Conference on Very Large Data Bases. VLDB Endowment, 517-528.

[12] Zoltán Gyöngyi and Hector Garcia-Molina. 2005. Link spam alliances. In Proceedings of the 31st International Conference on Very Large Data Bases. VLDB Endowment, 517-528.

[13] Ahmed Hassan. 2012. A Semi-supervised Approach to Modeling Web Search Satisfaction. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '12).

[14] Ahmed Hassan, Rosie Jones, and Kristina Lisa Klinkner. 2010. Beyond DCG: User Behavior As a Predictor of a Successful Search. In Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM '10).

[15] Ahmed Hassan, Xiaolin Shi, Nick Craswell, and Bill Ramsey. 2013. Beyond clicks: query reformulation as a predictor of search satisfaction. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (CIKM '13).

[16] Samuel Ieong, Nina Mishra, Eldar Sadikov, and Li Zhang. 2012. Domain Bias in Web Search. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (WSDM '12).

[17] Paul McNamee and James Mayfield. 2004. Character n-gram tokenization for European language text retrieval. Information Retrieval 7, 1 (2004), 73-97.

[18] Gilad Mishne, David Carmel, Ronny Lempel, et al. 2005. Blocking Blog Spam with Language Model Disagreement. In AIRWeb, Vol. 5. 1-6.

[19] Alexandros Ntoulas, Marc Najork, Mark Manasse, and Dennis Fetterly. 2006. Detecting spam web pages through content analysis. In Proceedings of the 15th International Conference on World Wide Web. ACM, 83-92.

[20] Guoyang Shen, Bin Gao, Tie-Yan Liu, Guang Feng, Shiji Song, and Hang Li. 2006. Detecting link spam using temporal information. In Sixth International Conference on Data Mining (ICDM '06). IEEE, 1049-1053.

[21] Parikshit Sondhi, V. G. Vinod Vydiswaran, and ChengXiang Zhai. 2012. Reliability Prediction of Webpages in the Medical Domain. In ECIR, Vol. 12. Springer, 219-231.

[22] Baoning Wu and Brian D. Davison. 2005. Identifying link farm spam pages. In Special Interest Tracks and Posters of the 14th International Conference on World Wide Web. ACM, 820-829.

[23] Baoning Wu, Vinay Goel, and Brian D. Davison. 2006. Propagating Trust and Distrust to Demote Web Spam. MTW 190 (2006).

[24] Hui Zhang, Ashish Goel, Ramesh Govindan, Kahn Mason, and Benjamin Van Roy. 2004. Making eigenvector-based reputation systems robust to collusion. In WAW, Vol. 3243. Springer, 92-104.