Introduction

Using Web Graph Structure for Person Name Disambiguation

Elena Smirnova

Konstantin Avrachenkov

Brigitte Trousse

brigitte.trousseg@inria.fr 0 0 AXIS research team 1 MAESTRO research team, INRIA Sophia Antipolis - Mediterranee , 2004 route des Lucioles, 06902 Sophia Antipolis Cedex , France

In the third edition of WePS campaign we have undertaken the person name disambiguation problem referred to as a clustering task. Our aim was to make use of intrinsic link relationships among Web pages for name resolution in Web search results. To date, link structure has not been used for this purpose. However, Web graph can be a rich source of information about latent semantic similarity between pages. In our approach we hypothesize that pages referring to one person should be linked through the Web graph structure, namely through topically related pages. Our clustering algorithm consists of two stages. In the rst stage, we nd topically related pages for each search result page using graph-based random walk method. Next, we cluster Web search result pages with common related pages. In the second stage, Web pages are further clustered using content-based clustering algorithm. The results of evaluation have showed that this algorithm can deliver competitive performance.

Person name disambiguation Web graph Related pages

Introduction

of information about latent semantic similarity between pages. For instance, an assumption that a group of similar pages is likely to be closely linked is generally taken by numerous works in Web community discovery. Analogously, in our approach we hypothesize that pages referring to one person should be linked through the Web graph structure, namely through topically related pages.

We perform a two-stage clustering algorithm. In the rst stage, we nd related pages for each Web search result page using graph-based random walk method. Next, we cluster Web search result pages with common related pages. The resulted clustering is therefore built using only Web link information. In the second stage, Web pages are further clustered using content-based clustering algorithm. More particularly, we build a term pro le with frequency score for all pages including related. Then we re-weight terms in each search result page pro le according to its related pages pro les. Finally, we apply Hierarchical Agglomerative Clustering algorithm to the set of un-clustered Web page pro les.

From experiments we found that Web structure based clustering itself can quite successfully disambiguate Web search results. Further improvements can be achieved by considering the content of the pages. The results of evaluation have showed that our algorithm can deliver competitive performance.

The rest of the paper is organized as follows. In the next section we describe an idea of related pages and our approach for nding related pages. In Section 3 we present the detailed description of the algorithm. The runs that we carried out are described in Section 4, while the results of evaluation of those runs are detailed in Section 5. We conclude in Section 6. 2 2.1

Related pages Motivation

In the following we explain our motivation behind the use of Web graph structure to disambiguate the referents. We start by de ning a related page as the one that addresses the same topic or one of the topics mentioned in a person page - page that contain a person name. For example, given personal Web page of a scientist, related pages would include co-authors pages, conference and project pages where scientist has participated, department pages that scientist belong to. Ideally, if Web links would re ect semantic relationship between pages, topically related pages of a person page could be found in its graph-based neighborhood. Moreover, we would observe person pages referring to one person interconnected through their related pages. Indeed, an assumption that similar pages are likely to be closely linked is generally taken by numerous works in Web community discovery [ 6, 11 ].

Kleinberg in [ 8 ] has given an illustrative example that ambiguous senses of the query can be separated on a query-focused subgraph. Speci cally, several densely linked parts of the graph can be uncovered using non-principal eigenvectors of AAT , where A is a subgraph adjacency matrix. Author suggested building query-focused subgraph using semantically intrinsic forward and backward links of Web search result pages. In our context, we can name the pages pointed to by those links as semantically related. Therefore, we can form a hypothesis that for ambiguous name query linked parts of the semantic subgraph form clusters corresponding to di erent individuals. 2.2

Implementation

The major problem of applying HITS [ 8 ] algorithm and other community discovery methods to WePS dataset consists in the lack of information about Web graph structure. In particular, full information about Web page backward links is not available without crawling the main part of the Web graph.

Personalized PageRank (PPR) [ 7 ] can be used to detect related pages of target page [ 12 ]. In this work [ 12 ], the personalization vector is a unit vector with all elements equal to zero and the entry corresponding to the target page equal to one. Theoretical and experimental results showed that, quite opportunely, Monte-Carlo method is a fast way to approximate top-k set of pages with the largest value of Personalized PageRank in a local manner, i.e., using only page forward links [ 4 ]. The lazy nature of Monte-Carlo iteration can be seen as a considerable advantage in terms of storage and time resources against methods requiring knowledge of Web graph structure. Moreover, Monte-Carlo method is highly parallelizable which reduces computational time on a cluster of computers. Since Personalized PageRank computes related pages of target page using only local forward-link information, globally related backward-link pages are usually missing. Therefore, generally we cannot expect overlap in related pages sets of two pages referring to one person - that would require global structure of the graph. Nevertheless, we found useful to examine content of related pages.

Alternatively to Personalized PageRank, we also consider related pages o ered by Google service3. Although the algorithm is proprietary, the main idea expressed in [ 10 ] was formulated as nding pages frequently co-cited with a target page. Relying on this explanation, the computation of related pages involves both backward and forward links of a page. Therefore, we consider Google related pages to be based on global Web graph information. 3 3.1

System Description Overview

An overview of our approach to name resolution problem is presented in Figure 1. In the rst stage, we cluster person pages appeared in search results based on Web structure. Thereto, we determine related pages of each person page and then cluster person pages that share some of related pages in one cluster. We consider this clustering as Web structure based since it is formed based on link relationship. As the entire link structure of the Web is unknown to us, some global topically related pages are missing during Monte-Carlo random walk process. Due to this introduced sparseness, we perform the second stage clustering where the rest of the person pages that did not show any link preferences are clustered based on the content. More particularly, we build a term pro le with frequency score for all pages including related. We re-weight terms in each person page pro le according to its related pages pro les. In this process weights of terms that appear in the related page pro le are increased. Finally, we apply Hierarchical Agglomerative Clustering (HAC) algorithm to the set of un-clustered Web pages pro les. 3.2

Web Structure based Clustering

Related pages. In the rst implementation we compute related pages of person page using Personalized PageRank. While we assumed that the presence of the link between pages implies their semantic relationship, there are links that exist purely for navigational purpose. To avoid negative e ect of these links we perform random walk of Monte-Carlo computation on links to pages with host name di erent from current. By host name we mean the rst level in the URL string associated with the link. We found this heuristic useful since links within one domain typically serve a navigational function rather than indicate semantic similarity. In addition, we 3 Google search `related:' operator. http://www.google.com/intl/en/help/operators.html#related lower the probability to follow the link that points to a page with high in-degree as inverse proportional to natural logarithm of in-degree. For example, main pages of large portals like Wikipedia or IMDB are universally popular and so, have a high number of incoming links that however do not necessarily carry semantic relationship. We note that at this point we employed a type of global information - the number of incoming links - that we found indispensable to avoid pathologies caused by the high value of global PageRank attributed to large portals. We requested the number of backward links for a given page from Google search engine4.

We estimated top K set of related pages for each person page. In experiments we used two values of K = f8; 16g and hence, two settings of Personalized PageRank computation. For K = 8 we set the number of iterations equal to 2000 and damping factor c equal to 0.2. In the second setting for K = 16 we doubled the number of iterations to 4000 and increased damping factor to 0.3.

An example of top 8 related pages list for personal Web page of a scientist is given in Table 1. For illustrative purposes related pages returned by Google service for the same person page are given in Table 2. Clearly, Web pages computed by Personalized PageRank refer to di erent topics related to the person but not necessarily contain the query-name in the content. These pages are homepages of current and previous workplaces, Web pages of co-authors and scienti c activities undertaken by the person. Quite di erently, a related pages set provided by Google contains pages of the scientist on large portals such as LinkedIn, DBLP and Videolectures. It is unlikely that these pages could be interconnected by short forward-link path and thus, it explains their absence in Personalized PageRank list.

Graph clustering. In the following step two person Web pages are merged in one cluster if they share some related pages. Since the whole link structure of the Web is unknown to us, related pages set is limited to pages reachable by forward links from a person page. We therefore address to the content of the pages in the next stage. 3.3

Content based Clustering

Page pro le. Preprocessing of Web pages include the following steps. First we convert Web pages into plain text using Apache HTML parser5. In one implementation we extract the full text of the page, while in the other we keep only the content of META tags. Next, we apply clean-up procedure. To distinguish the main content from navigational text, advertisements, etc. we use a simple heuristic that meaningful text in the page consists of at least 10 consecutive terms [ 9 ]. We consider as a term a sequence of letters and numbers of length more than one. Terms are stemmed using Porter's stemmer [ 13 ] and removed if presented in standard English stopword list.

We apply preprocessing step to all pages including person pages and pages related to them. Next, for each of these page we build a vector of terms with corresponding frequency score (tf ) in the page. After that, we use a re-weighting scheme as follows. The term t score at person page p, tf (p; t), is updated at each related page r in the following way:

tf 0(p; t) = tf (p; t) + tf (p; t) tf (r; t); where r is in the related pages set of person page p and person page p is the one from search results. This step resembles voting process. Terms that appear in related pages get promoted and thus, random term scores found in the person page are lowered. At the end, vector is normalized and top 30 most frequent terms are taken as a person page pro le. 4 Google search `link:' operator. http://www.google.com/intl/en/help/operators.html#link 5 http://htmlparser.sourceforge.net HAC clustering. Finally, we apply HAC algorithm on the basis of clustering from the rst stage to the rest of the Web page pro les. Speci cally, average-linkage HAC with cosine measure of similarity was used. Following previous work [ 5 ], the similarity threshold for HAC algorithm was xed to 0.1.

Algorithm 1 Person Name Disambiguation using Web Graph Structure

Input: Web search results 1. Web structure based clustering. for all page t ∈ search results do

compute related pages of t end for C0 ← cluster search result pages with shared related pages 2. Content based clustering. for all page t ∈ search results do build term profile of t for all page r ∈ related(t) do build term profile of r update term profile of t end for end for C ← cluster search result pages using HAC based on C0 return C During evaluation period of WePS campaign we experimented with two ways to compute related pages, the number of related pages and the type of content extracted from pages. We have chosen to combine less number of related pages with the full content of the page and, the other way, larger number of related pages with less extracted content. Experimentally we found that with larger lists pages might be quite loosely related to the person page and might promote irrelevant terms in the person page pro le. Therefore, we limited analyzed content of pages to a few words in meta tag description, title and snippet.

We submitted the following runs to clustering task. The run name in brackets is the name in o cial evaluation results.

PPR-HAC (AXIS 4). Top 8 related pages were computed using Personalized PageRank, the full content of the Web page was used in HAC step. G-HAC (AXIS 3). Top 8 related pages were provided by Google service, the full content of the Web page was used in HAC step.

G-HACext (AXIS 2). Top 16 related pages were provided by Google service, the content of meta tag description, title and snippet was used in HAC step.

PPR-HACext (AXIS 1). Top 16 related pages were computed using Personalized PageRank, the content of meta tag description, title and snippet was used in HAC step.

We also carried out four baseline runs:

HAC. The baseline method where HAC algorithm was applied to the content of Web pages. No link structure based clustering was performed.

PPR. The baseline method where clustering was based on top 8 related pages computed using Personalized PageRank. No content-based clustering was performed after.

G. The baseline method where clustering was based on top 8 related pages provided by Google service. No content-based clustering was performed after.

PPR-G. The baseline method where clustering was based on merged set of top 8 related pages computed using Personalized PageRank and provided by Google service. No content-based clustering was performed after.

Two other baselines were provided by organizers: One-in-one. The baseline method where every Web page is assigned to a di erent cluster. All-in-one. The baseline method where all Web pages are assigned to a single cluster. Results

Table 3 shows the results achieved by our methods. Concerning the baseline runs, we note that performance of link-based clustering methods is quite notable. Taking Web structure as the only input information, it is possible to deliver the same or superior performance as when processing the content. Remarkably, the combination of related pages computed using Personalized PageRank and obtained from Google service achieved the highest score among link-based baselines. We see that link-based baselines are e ective in terms of precision, while content-based baseline showed higher recall values. High precision value of link-based baselines indicates that sharing related pages between two pages is a strong evidence to their semantic similarity. In this case, an improvement of recall value is attributed to the problem of nding strong Web graph paths between two pages. We also found link-based methods bene cal in clustering ash or image made pages with little text where content-based algorithm experiences di cultes.

The signi cant improvement over the baseline methods (16-35% F-0.5) has been achieved by combination of link-based and content-based clusterings. However, we have not found any signi cant di erence among submitted combinations (as indicated by two-tailed paired t-test). All submitted runs are characterized by higher BCubed precision value compared to recall. As we noted above, the more balanced result could be obtained by discovering more link relatioships among pages in the Web graph. Results indicate that all submitted runs improved the corresponding baselines. This observation is con rmed by two-tailed paired t-test at signi cance level < 0:001 for all submitted runs against corresponding link-based and content-based baselines.

The o cial performance ranking over WePS-3 participants showed that our algorithm took the second place (in F-0.5 measure) among 8 competitors with in total 27 submitted runs. In addition to achieved performance, we note that our algorithm can be e ciently implemented due to parallel nature of Monte-Carlo computation.

Conclusion

We have described our approach to person name disambiguation task at WePS-3 evaluation campaign. Our main idea was to make use of patterns of Web structure to disambiguate Web search results. From our experiments we concluded that it is possible to quite successfully resolve a person name using Web structure as the only input information. Web structure based clustering showed to be e ective in terms of precision. Signi cant improvement of recall value (+60%) was achieved by combining link-based and content-based clusterings. However, we did not found any signi cant di erence in behaviour between di erent combinations. Overall we concluded that our algorithm can deliver competitive performance in comparison with other systems participated in WePS-3 campaign. In addition to that, our algorithm can be implemented e ciently and is suitable for use within a large-scale Web service. 7

[1] Artiles , J. , Gonzalo , J. , Sekine , S.: The semeval-2007 weps evaluation: Establishing a benchmark for the web people search task . Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007) ( 2007 )

[2] Artiles , J. , Gonzalo , J. , Sekine , S.: Weps 2 evaluation campaign: overview of the web people search clustering task . In: 2nd Web People Search Evaluation Workshop (WePS 2009 ), 18th WWW Conference ( 2009 )

[3] Artiles , J. , Gonzalo , J. , Verdejo , F. : A testbed for people searching strategies in the www . In: SIGIR'05 ( 2005 )

[4] Avrachenkov , K. , Litvak , N. , Nemirovsky , D. , Osipova , N.: Monte carlo methods in pagerank computation: When one iteration is su cient . SIAM Journal on Numerical Analysis 45 ( 2 ), 890 { 904 ( 2007 )

[5] Balog , K. , He , J. , Hofmann , K. , Jijkoun , V. , Monz , C. , Tsagkias , M. , Weerkamp , W., de Rijke, M. : The university of amsterdam at weps2. In: 2nd Web People Search Evaluation Workshop (WePS 2009 ), 18th WWW Conference ( 2009 )

[6] Flake , G. , Lawrence , S. , Giles , L.: E cient identi cation of web communities . In: KDD '00: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining . pp. 150 { 160 . ACM , New York, NY, USA ( 2000 )

[7] Haveliwala , T. : Topic-sensitive pagerank . Proceedings of the 11th WWW Conference pp. 517 { 526 ( 2002 )

[8] Kleinberg , J. : Authoritative sources in a hyperlinked environment . J. ACM 46 ( 5 ), 604 { 632 ( 1999 )

[9] Kohlschutter, C. , Fankhauser , P. , Nejdl , W. : Boilerplate detection using shallow text features . In: WSDM '10: Proceedings of the third ACM international conference on Web search and data mining . pp. 441 { 450 . ACM , New York, NY, USA ( 2010 )

[10] Law , K.L. , Harik , G.R. : Techniques for nding related hyperlinked documents using link-based analysis ( December 2009 ), assignee: Google Inc.

[11] Leskovec , J. , Lang , K.J. , Dasgupta , A. , Mahoney , M.W. : Statistical properties of community structure in large social and information networks . In: WWW '08: Proceeding of the 17th international conference on World Wide Web . pp. 695 { 704 . ACM , New York, NY, USA ( 2008 )

[12] Ollivier , Y. , Senellart , P. : Finding related pages using green measures: an illustration with wikipedia . In: AAAI'07: Proceedings of the 22nd national conference on Arti cial intelligence . pp. 1427 { 1433 . AAAI Press ( 2007 )

[13] Porter , M.F. : An algorithm for su x stripping pp . 313 { 316 ( 1997 )

[14] Wan , X. , Gao , J. , Li , M. , Ding , B. : Person resolution in person search results: Webhawk . In: CIKM' 05 ( 2005 )