Discovery and Promotion of Subtopic Level High Quality Domains for Programming Queries in Web Search

Arpita Das, Microsoft, India, arpda@microosft.com
Saurabh Shrivastava, Microsoft, India, sauras@microsoft.com
Prateek Agrawal, Microsoft, India, pragraw@microsoft.com
Sandeep Sahoo, Microsoft, India, sasaho@microsoft.com
Manoj Chinnakotla, Microsoft, India, manojc@microsoft.com

ABSTRACT
With the advancement of technology in the modern era, a significant portion of the web, referred to as the developer segment, serves the programming-related information needs of users. User satisfaction in this segment depends not only on the relevance of the retrieved pages but also on the domains that these pages belong to. We aim to discover sub-topic level associations between domains and queries. We propose a supervised deep neural network based approach that uses the click-through data of a commercial web search engine to discover and promote the domains which provide high quality, expert level content for a query intent. Experiments show that our domain specific ranker performs significantly better than a standard web ranking baseline, both qualitatively and quantitatively, on real-world coding query sets. This paper further demonstrates how associating domains with query intents results in the formation of overlapping domain clusters, where the domains in each cluster represent a topical space of query intent(s).

CCS CONCEPTS
• Information systems → Page and site ranking;

KEYWORDS
domain preference, web search, user behavior

ACM Reference Format:
Arpita Das, Saurabh Shrivastava, Prateek Agrawal, Sandeep Sahoo, and Manoj Chinnakotla. 2017. Discovery and Promotion of Subtopic Level High Quality Domains for Programming Queries in Web Search. In Proceedings of the First International Workshop on LEARning Next gEneration Rankers, Amsterdam, October (LEARNER).

LEARNER, October, Amsterdam, The Netherlands. Copyright © for this paper by its authors. Copying permitted for private and academic purposes.
ICTIR, Oct 2017, Netherlands. Das et al.

1 INTRODUCTION
With the increase in the number of technologies and coding infrastructures, developers are becoming more and more dependent on the web. A coding query may have various intents, ranging from learning the basics of a programming language to debugging a code snippet. The most relevant search result for a coding query depends on how well the result satisfies the query intent. For example, if the query is about a particular function in a programming language, the developer will prefer a short description of the function and an example code snippet; if the query is about an error code, the developer is probably looking for ways to debug it. Promoting the domain that serves the correct intent will drastically improve the search engine result page (SERP).

The entire web can be grouped into intersecting clusters of domains, where every cluster represents a latent topic space satisfying some query intent(s). Given a new query, we map the query to the nearest topic cluster and promote the domains associated with that cluster. For example, the query “how to format date in c#” belongs to the clusters centered around coarse topics like “c#”, “time” and “changing date format”, which have domains like “docs.microsoft.com”, “c-sharpcorner.com” and “dotnetperls.com” associated with them.

We extracted coding queries from the click logs of the commercial search engine Microsoft Bing for the past three years (2014-2016). Over this period we observed the trend of clicks for the queries with respect to 45453 coding domains. The distribution of the clicks gathered by different domains is not uniform, as shown in Table 1. Domains like “stackoverflow.com” and “msdn.microsoft.com” clearly dominate the click shares. One might argue that since clicks model user satisfaction, promoting the most clicked domains of the past year might improve the SERP. Interestingly, this is not the case: the satisfaction of the user ultimately depends on the relevance of the result to the query, so on the domain front too it only makes sense to promote the domain that satisfies the user intent. For the query “connecting database in azure”, from an authority perspective one can assume that a developer will prefer documents from “msdn.microsoft.com” or “docs.microsoft.com”, but a third domain, “dzone.com”, contains specific information about databases and their connections which exactly matches the query intent. Slight promotion of the third domain will result in the satisfaction of the user. Clicks capture the high level scenario of domain preference, but we discover and promote the domains which have a high sub-topic level association with the query intent.

Retrieving intent specific domains is still unexplored in the research world. However, work has been done on detecting the authority, trustworthiness, etc. of domains. Traditionally, researchers have used link structure based approaches and supervised approaches to predict the trustworthiness of a domain. Link based approaches such as PageRank, HITS and SALSA [5] use the structure present in the hypertext
of the web pages to identify page quality. PageRank is a well known algorithm that uses link information to assign global importance scores to all pages on the web. Bianchini et al. pointed out the vulnerability of link based algorithms to spamming [4]. Since it is possible to artificially boost an authority score by forming an association of highly interlinked content, content farm websites manage to obtain high PageRank scores. In contrast to the link based approaches, supervised approaches are robust to manipulation of the hyperlink structure, but they depend heavily on gold-standard labeled data. Obtaining large-scale human judged query-domain pairs is extremely challenging in terms of cost and efficiency. Click logs are assumed to be a substitute for human judged data, as clicks capture human behavior and feedback to queries. Chinnakotla et al. and Sondhi et al. used click data from the web to learn supervised models that establish reliability in the health segment [7, 21]. Our paper focuses on learning a signal that is a composition of reliability, authority, etc. and serves the exact coding intent of the user, using supervision from Bing click-through data.

Domain                         Clicks    Domain                      Clicks
stackoverflow.com              42.01%    ozgrid.com                  0.10%
msdn.microsoft.com             15.23%    powershell.com              0.10%
w3schools.com                   4.29%    pandas.pydata.org           0.09%
social.msdn.microsoft.com       3.31%    community.spiceworks.com    0.09%
technet.microsoft.com           2.84%    sourceforge.net             0.09%
social.technet.microsoft.com    1.71%    getbootstrap.com            0.09%
microsoft.com                   1.54%    mkyong.com                  0.09%
codeproject.com                 1.41%    vbforums.com                0.08%
answers.microsoft.com           1.24%    webdesign.about.com         0.08%
docs.oracle.com                 1.22%    blog.udemy.com              0.08%

Table 1: Distribution of clicks among the top-100 domains specific to coding queries. The left half shows the share of the top-10 domains while the right half shows the bottom-10 (91-100) domains.

In this paper, we propose a novel deep learning based method to maximize the conditional likelihood of a clicked domain for a given query intent. We train a three layered deep convolutional neural network to project queries and domains into their corresponding semantic spaces. The domains with minimum semantic distance from the query are slightly promoted in the SERP. We assume that the titles of the search results in the SERP are semantically relevant to the query. We make this assumption because promoting a relevant domain does not make sense if the document from that domain is irrelevant. For example, if the user query is “how to lowercase in javascript”, domains like “w3schools.com”, “stackoverflow.com” and “developer.mozilla.org” should be promoted; however, if a document titled “how to uppercase in javascript” from “w3schools.com” is promoted, the relevance of the search result is hampered. The key contributions of the paper are:
1) We learn the deep correlations between domains and query intents for the developer segment of the web.
2) We perform experiments to show how the affinity for a domain changes with a slight change in the intent of the query.
3) We highlight how domains in the developer segment can be clustered based on query intents.
4) We perform qualitative and quantitative analysis of our ranker, which incorporates the domain signal, using a large scale coding query test set and compare it with a standard web ranking baseline.

2 RELATED WORK
Detecting query-intent specific domains for the developer segment of the web is an unexplored research problem. However, several works have addressed the analogous problems of eliminating spam websites and determining domain authority, trustworthiness, bias, etc. on the web.

Previous work on web spam removal or establishing reliability focused mostly on unsupervised techniques for the detection of link spam (which creates a tightly knit community of links to affect link-based ranking algorithms) and content spam (which maliciously spams the content of web pages). Researchers worked on automatic detection of suspicious signals in the link dependencies [1, 2, 8, 11, 20, 24] and in the content of web pages [18, 19]. Castillo et al. combined link-based and content-based features and used the topology of the web graph, exploiting the link dependencies among web pages, to detect spam pages [6]. Interconnections of spam farms have also been exploited to combat spam pages [3, 12, 22, 23].

Establishing the authority of a web page has been attempted with supervised approaches too. In the health domain, search results can directly impact decisions related to people’s health, so it is imperative for search engines to provide reliable information. Chinnakotla et al., Gaudinat et al. and Sondhi et al. employed supervised machine learning techniques to learn the notion of trustworthiness of web pages in the health domain [7, 9, 10, 21]. Hassan et al. also modeled the web search satisfaction of users [13–15].

Ieong et al. introduced domain bias, a user’s propensity to believe that a page is more relevant just because it comes from a particular domain [16]. They demonstrated the importance of domain preferences in web search even after factoring out position bias and relevance. This impact of domain bias [16] motivated us to promote documents from domains satisfying the exact query intent.

3 LEARNING INTENT SPECIFIC DOMAINS
We aim to learn a signal that promotes the domains which satisfy the query intent. We use a convolutional neural network model to learn non-linear relationships between a domain and a query intent. Put another way, the neural network segments the queries into a set of fine grained topics and associates the most likely domains with each topic space.
Each topic space can be considered a representation of a set of overlapping query intent(s).

We extracted coding queries and their clicked URLs from the Bing click logs. For feature extraction, we used character trigram based word hashing [17]. We attach the delimiter “#” to a word (say “pen” -> “#pen#”) and extract its letter trigrams (#pe, pen, en#). We obtained 52339 unique letter trigrams for the entire dataset of query-clicked domain pairs. We convert each word in the query and the domain to a vector of size 52339 that records the number of occurrences of each letter trigram in the word. This representation takes care of out-of-vocabulary words and words with spelling errors.

We build a convolutional neural network with three levels of alternating convolution, max pooling and rectified linear (ReLU) layers and a fully connected layer at the top. The network gives a non-linear projection of the query and domain vectors into their corresponding semantic spaces. Let $x$ be the word hashed input term vector, $y$ the output vector and $h$ the number of hidden layers used. Let $H_j$ represent the $j$-th intermediate layer, whose weight matrix is $W_j$ and bias term is $b_j$, where $j = \{1, 2, \ldots, h\}$.

$$H_j = f(W_j H_{j-1} + b_j) \qquad (1)$$

where $j = \{2, 3, \ldots, h\}$ and $H_1 = W_1 x$.

$$y = f(W_h H_{h-1} + b_h) \qquad (2)$$

where we use tanh as the activation function $f$. The relevance $R(d, q)$ of a domain $d$ for a particular query $q$ is calculated as the cosine similarity of their output vectors $y_d$ and $y_q$:

$$R(d, q) = \frac{y_d^{T} y_q}{\lVert y_d \rVert \, \lVert y_q \rVert} \qquad (3)$$

We use the supervision of the click logs to create positive and negative samples for our training data. We treat queries paired with their clicked domains as positive samples ($d^{+}$). As negative samples ($d^{-}$) we pair queries with domains from the SERP that were not clicked for the query, together with some randomly selected domains. We train our network with the objective of maximizing the conditional likelihood of the clicked domain given the query, or equivalently minimizing the loss function in Equation 4.

$$L(\Lambda) = -\log \prod_{(q, d^{+})} P(d^{+} \mid q) \qquad (4)$$

where $\Lambda$ denotes the set of parameters of our network and $P(d^{+} \mid q)$ is the posterior probability of the clicked domain given the query.

One might ask why, if the signal is learnt from the click logs of a search engine, the search engine itself does not already reflect the desired behavior. We argue that the SERP of a search engine does not depend only on the click signal; it takes other features into account too. Also, our model does not associate a domain with a particular query. It associates a domain with a topical space that represents query intent(s), and that topical space is learnt from a large collection of coding queries. For example, “docs.oracle.com” is not associated with the query “read a file in java” but with the topics “java”, “files”, etc., so when a new query “write a file in java” arrives, “docs.oracle.com” will still be promoted.

We combine our intent specific domain score with the relevance score of the Bing web ranker to promote pages that are both relevant and authoritative. We take the top 50 results from the initial retrieval and re-rank them using a scoring function designed to combine relevance and authority (Equation 5). Let the initial ranker assign scores $\{s_1, s_2, \ldots, s_{50}\}$ to the top 50 URLs $\{u_1, u_2, \ldots, u_{50}\}$ retrieved for a query $q$. Let $\{d_1, d_2, \ldots, d_{50}\}$ be the corresponding domains extracted from these URLs. The new scoring function is defined as:

$$\mathrm{Score}(q, u_i, d_i) = s_i + \lambda \cdot R(d_i, q) \qquad (5)$$

where $\lambda$ is the factor with which we boost the domain signal. We intentionally kept its value small to prevent irrelevant pages from preferred domains from being promoted.

4 EXPERIMENTS AND ANALYSIS
In this section, we first describe the dataset and evaluation metric used in our experiments. We also present some interesting analysis that we can infer from the results.

Dataset Details. We collected the past three years of Bing click logs and extracted queries of coding intent from them. We obtained 103 million unique query-clicked domain pairs for training the neural network. We preprocess every query by lower-casing it and removing stop words; we preserve special characters, as they are important in the coding domain. We preprocess domains by lower-casing them and removing prefixes like ‘http’, ‘https’, ‘www’, ‘ftp’, etc. if present. We run our re-ranking function on a set of 20,000 new coding queries from the logs of 2017. From this set we randomly sample 400 queries for which our ranking logic introduces changes in the top 10 results, and use them as the test set. We evaluate the performance of the scoring function with our domain signal on these test queries against the current Bing ranking baseline.

Evaluation Metric. As pointed out by [7], standard IR metrics such as NDCG are not suitable for evaluating a domain based signal. We also wanted a whole page comparison of the baseline and the treatment, so we chose the evaluation metric “Surplus” proposed by [7]. Following a similar setting, we show the top 10 results of the baseline and the treatment to a human judge in two separate tabs in a single window. The judge rates the pair on a seven-point scale: Left Much Better, Left Better, Left Slightly Better, Neutral, Right Slightly Better, Right Better and Right Much Better. We obtained three judgments per query for all 400 queries in the test set to reduce human judgment errors. Surplus for $n$ queries is defined as:

$$\mathrm{Surplus} = \frac{n_W - n_L}{n_W + n_L + n_T} \times 100 \qquad (6)$$

where the technique scores $n_W$ wins, $n_L$ losses and $n_T$ ties. The final metrics used for measurement are $\mathrm{Surplus}_{strong}$, where strong wins/losses are used, and $\mathrm{Surplus}_{weak}$, where weak wins/losses are used. A good surplus on a large query set implies that the technique is performing well with respect to the baseline.

Results and Analysis. The result of our technique with respect to the baseline is shown in Table 2. Our technique shows significant gains in weak and strong surplus over the baseline web ranker.

Query set    Number of Queries    Surplus_strong    Surplus_weak
Test set     400                  1.486             9.807

Table 2: The table compares the performance of our re-ranking technique with the baseline web ranker on the test set. Results marked in boldface indicate that the surplus was found to be statistically significant over the baseline at 95% confidence level (< 0.0001). W/L/T denote the number of Wins, Losses and Ties observed.

                              Query: Page break html    Query: excel vba protect sheet
Baseline                      1. cybertext.com          1. support.office.com
                              2. lvsys.com              2. msdn.microsoft.com
                              3. w3schools.com          3. analysistabs.com
Our Technique                 1. w3schools.com          1. msdn.microsoft.com
(strong win / weak win)       2. stackoverflow.com      2. support.office.com
                              3. msdn.microsoft.com     3. mrexcel.com

Table 3: This table compares the top 3 domains shown by the baseline and our technique. Our technique scores a strong win on the first query and a weak win on the second.

Query                                  Top Host              Query                Top Host
c# string                              msdn.microsoft.com    oop in python        docs.python.org
c# string out of memory exception      stackoverflow.com     oop in javascript    developer.mozilla.org
c# string tutorial                     tutorialspoint.com    oop in c++           tutorialspoint.com

Table 4: This table shows how a slight change in query intent changes the affinity for the most relevant domain.

Table 3 illustrates the qualitative analysis of our technique. For the query “page break in html” we promote “w3schools.com” (which caters to the query intent in the topical space of “web page structuring in html”) over domains like “cybertext.com”, “lvsys.com”, etc. For the second query, “excel vba protect sheet”, apart from promoting “msdn.microsoft.com” over “support.office.com”, we also promote “mrexcel.com” (which has specialized content on Excel) over “analysistabs.com”.
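To make the re-ranking behind these examples concrete, the following is a minimal Python sketch of the scoring in Equation 5, using the cosine relevance of Equation 3, together with the Surplus metric of Equation 6. It is an illustration under stated assumptions, not the production ranker: the function names (`domain_of`, `cosine`, `rerank`, `surplus`) and the default boost of 0.05 are hypothetical, since the paper only says the boost factor is kept small, and the vectors are assumed to be supplied by the trained network.

```python
import math
from urllib.parse import urlparse


def domain_of(url):
    """Lower-case the URL host and drop a leading 'www.'.

    Assumes the URL carries a scheme (http, https, ftp); urlparse then
    separates the scheme from the host for us.
    """
    host = urlparse(url.lower()).netloc
    return host[4:] if host.startswith("www.") else host


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (Equation 3)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank(results, query_vec, domain_vecs, boost=0.05):
    """Re-score (url, base_score) pairs as s_i + boost * R(d_i, q) (Equation 5).

    domain_vecs maps a domain name to its semantic vector; domains
    without a vector receive no boost. The boost value is an assumption.
    """
    rescored = []
    for url, base_score in results:
        vec = domain_vecs.get(domain_of(url))
        relevance = cosine(vec, query_vec) if vec is not None else 0.0
        rescored.append((url, base_score + boost * relevance))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


def surplus(wins, losses, ties):
    """Surplus = (n_W - n_L) / (n_W + n_L + n_T) * 100 (Equation 6)."""
    total = wins + losses + ties
    return 100.0 * (wins - losses) / total if total else 0.0
```

With toy inputs such as `rerank([("https://www.cybertext.com/a", 0.90), ("https://www.w3schools.com/b", 0.88)], [1.0, 0.0], {"cybertext.com": [0.0, 1.0], "w3schools.com": [1.0, 0.0]})`, the second URL moves to the top because its domain vector is aligned with the query vector, while a small boost keeps a clearly lower-scored page from overtaking a much more relevant one.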
In the process of associating domains with query intents, we found that our model inherently clusters domains whose content lies in a similar topic space. We show two such clusters in Figure 1. When searching for domains similar to “stackoverflow.com”, we observe that other forums and question-answering platforms such as “social.msdn.microsoft.com”, “forums.asp.net”, “answers.microsoft.com”, “superuser.com”, etc. come up as the closest ones. Similarly, when we searched for domains similar to “w3schools.com”, domains such as “developer.mozilla.org”, “tizag.com”, “webdesign.about.com”, etc. were retrieved. Interestingly, all of these domains can be associated with a common topic space catering to queries around designing web pages.

Figure 1: Examples of domain based clusters. Each cluster captures topicality of underlying query intent(s).

Another interesting observation is how a slight modification of the query can change the affinity of domains containing relevant results. In Table 4, we demonstrate this along two verticals. The left side shows how a small change in query intent, with the same target coding language, changes the top retrieved domain. The right side shows how a change in target coding language, with the same developer intent, changes the top retrieved domain.

5 CONCLUSIONS
In this paper, we proposed a novel deep learning based supervised technique to promote intent specific domains in the developer segment using Bing click logs. The evaluation metric “Surplus” shows that our method performs better than the baseline web ranking algorithm. The experiments show that our model segments the queries into a set of topic based clusters and associates domains with each cluster. The topicality of a cluster is a representation of some coarse level of query intent which the developer is looking for.

The proposed approach is re-usable and scalable in nature. Currently we have worked in the developer segment, but this work can be extended to any domain. As part of future work, we plan to learn a domain signal for the entire web. Currently, we assume that the SERP contains relevant pages and that slight re-ranking of pages based on domain will satisfy the users. In future, we plan to learn a signal which is a composition of query-title relevance and intent-specific domain preference and use it to re-rank web results with more impact.

REFERENCES
[1] Brian Amento, Loren Terveen, and Will Hill. 2000. Does “Authority” Mean Quality? Predicting Expert Quality Ratings of Web Documents. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’00).
[2] Luca Becchetti, Carlos Castillo, Debora Donato, Stefano Leonardi, and Ricardo A Baeza-Yates. 2006. Link-Based Characterization and Detection of Web Spam. In AIRWeb. 1–8.
[3] András A Benczúr, Károly Csalogány, and Tamás Sarlós. 2006. Link-based similarity search to fight web spam. In AIRWeb. Citeseer.
[4] Monica Bianchini, Marco Gori, and Franco Scarselli. 2003. PageRank and Web Communities. In Web Intelligence. 365–371.
[5] Sergey Brin and Lawrence Page. 2012. Reprint of: The anatomy of a large-scale hypertextual web search engine. Computer networks 56, 18 (2012), 3825–3833.
[6] Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock, and Fabrizio Silvestri. 2007. Know your neighbors: Web spam detection using the web topology. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 423–430.
[7] Manoj K Chinnakotla, Rupesh K Mehta, and Vipul Agrawal. 2014. Unsupervised Detection and Promotion of Authoritative Domains for Medical Queries in Web Search. In 11th International Conference on Natural Language Processing. 388.
[8] André Luiz da Costa Carvalho, Paul-Alexandru Chirita, Edleno Silva De Moura, Pável Calado, and Wolfgang Nejdl. 2006. Site level noise removal for search engines. In Proceedings of the 15th international conference on World Wide Web. ACM, 73–82.
[9] Arnaud Gaudinat, Natalia Grabar, and Célia Boyer. 2007. Automatic retrieval of web pages with standards of ethics and trustworthiness within a medical portal: What a page name tells us. Artificial Intelligence in Medicine (2007), 185–189.
[10] Arnaud Gaudinat, Natalia Grabar, Célia Boyer, et al. 2007. Machine learning approach for automatic quality criteria detection of health web pages. In Medinfo 2007: Proceedings of the 12th World Congress on Health (Medical) Informatics; Building Sustainable Health Systems. IOS Press, 705.
[11] Zoltán Gyöngyi and Hector Garcia-Molina. 2005. Link spam alliances. In Proceedings of the 31st international conference on Very large data bases. VLDB Endowment, 517–528.
[12] Zoltán Gyöngyi and Hector Garcia-Molina. 2005. Link spam alliances. In Proceedings of the 31st international conference on Very large data bases. VLDB Endowment, 517–528.
[13] Ahmed Hassan. 2012. A Semi-supervised Approach to Modeling Web Search Satisfaction. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’12).
[14] Ahmed Hassan, Rosie Jones, and Kristina Lisa Klinkner. 2010. Beyond DCG: User Behavior As a Predictor of a Successful Search. In Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM ’10).
[15] Ahmed Hassan, Xiaolin Shi, Nick Craswell, and Bill Ramsey. 2013. Beyond clicks: query reformulation as a predictor of search satisfaction. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (CIKM ’13).
[16] Samuel Ieong, Nina Mishra, Eldar Sadikov, and Li Zhang. 2012. Domain Bias in Web Search. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (WSDM ’12).
[17] Paul Mcnamee and James Mayfield. 2004. Character n-gram tokenization for European language text retrieval. Information retrieval 7, 1 (2004), 73–97.
[18] Gilad Mishne, David Carmel, Ronny Lempel, et al. 2005. Blocking Blog Spam with Language Model Disagreement. In AIRWeb, Vol. 5. 1–6.
[19] Alexandros Ntoulas, Marc Najork, Mark Manasse, and Dennis Fetterly. 2006. Detecting spam web pages through content analysis. In Proceedings of the 15th international conference on World Wide Web. ACM, 83–92.
[20] Guoyang Shen, Bin Gao, Tie-Yan Liu, Guang Feng, Shiji Song, and Hang Li. 2006. Detecting link spam using temporal information. In Data Mining, 2006. ICDM’06. Sixth International Conference on. IEEE, 1049–1053.
[21] Parikshit Sondhi, VG Vinod Vydiswaran, and ChengXiang Zhai. 2012. Reliability Prediction of Webpages in the Medical Domain. In ECIR, Vol. 12. Springer, 219–231.
[22] Baoning Wu and Brian D Davison. 2005. Identifying link farm spam pages. In Special interest tracks and posters of the 14th international conference on World Wide Web. ACM, 820–829.
[23] Baoning Wu, Vinay Goel, and Brian D Davison. 2006. Propagating Trust and Distrust to Demote Web Spam. MTW 190 (2006).
[24] Hui Zhang, Ashish Goel, Ramesh Govindan, Kahn Mason, and Benjamin Van Roy. 2004. Making eigenvector-based reputation systems robust to collusion. In WAW, Vol. 3243. Springer, 92–104.