<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Web Graph Structure for Person Name Disambiguation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elena Smirnova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Konstantin Avrachenkov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brigitte Trousse</string-name>
          <email>brigitte.trousse@inria.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AXIS research team</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>MAESTRO research team, INRIA Sophia Antipolis - Mediterranee</institution>
          ,
          <addr-line>2004 route des Lucioles, 06902 Sophia Antipolis Cedex</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In the third edition of the WePS campaign we addressed the person name disambiguation problem, formulated as a clustering task. Our aim was to use the intrinsic link relationships among Web pages for name resolution in Web search results. To date, link structure has not been used for this purpose. However, the Web graph can be a rich source of information about latent semantic similarity between pages. In our approach we hypothesize that pages referring to one person should be linked through the Web graph structure, namely through topically related pages. Our clustering algorithm consists of two stages. In the first stage, we find topically related pages for each search result page using a graph-based random walk method, and then cluster the search result pages that have related pages in common. In the second stage, Web pages are further clustered using a content-based clustering algorithm. The evaluation results show that this algorithm can deliver competitive performance.</p>
      </abstract>
      <kwd-group>
        <kwd>Person name disambiguation</kwd>
        <kwd>Web graph</kwd>
        <kwd>Related pages</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The Web graph can be a rich source of information about latent semantic similarity between pages. For instance, the assumption that
a group of similar pages is likely to be closely linked is made by numerous works in
Web community discovery. Analogously, in our approach we hypothesize that pages referring to
one person should be linked through the Web graph structure, namely through topically related
pages.</p>
      <p>We perform a two-stage clustering algorithm. In the first stage, we find related pages for each
Web search result page using a graph-based random walk method. Next, we cluster the Web search
result pages that have related pages in common. The resulting clustering is therefore built using only
Web link information. In the second stage, Web pages are further clustered using a content-based
clustering algorithm. More specifically, we build a term profile with frequency scores for all pages,
including the related pages. Then we re-weight the terms in each search result page profile according to
the profiles of its related pages. Finally, we apply the Hierarchical Agglomerative Clustering algorithm to the
set of Web page profiles left un-clustered by the first stage.</p>
      <p>From our experiments we found that Web structure based clustering alone can disambiguate
Web search results quite successfully. Further improvements can be achieved by considering the
content of the pages. The evaluation results showed that our algorithm can deliver
competitive performance.</p>
      <p>The rest of the paper is organized as follows. In the next section we describe the idea of related
pages and our approach to finding them. In Section 3 we present a detailed description
of the algorithm. The runs that we carried out are described in Section 4, while the results of
evaluating those runs are detailed in Section 5. We conclude in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>Related pages</title>
      <sec id="sec-2-1">
        <title>Motivation</title>
        <p>
          In the following we explain our motivation for using the Web graph structure to disambiguate
the referents. We start by defining a related page as one that addresses the same topic as, or
one of the topics mentioned in, a person page, that is, a page that contains a person name. For example,
given the personal Web page of a scientist, related pages would include co-authors' pages, pages of conferences
and projects in which the scientist has participated, and pages of departments that the scientist belongs to.
Ideally, if Web links reflected semantic relationships between pages, the topically related pages of a
person page could be found in its graph-based neighborhood. Moreover, we would observe person
pages referring to one person interconnected through their related pages. Indeed, the assumption
that similar pages are likely to be closely linked is made by numerous works in Web
community discovery [
          <xref ref-type="bibr" rid="ref11 ref6">6, 11</xref>
          ].
        </p>
        <p>
          Kleinberg [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] gave an illustrative example showing that the ambiguous senses of a query can be
separated on a query-focused subgraph. Specifically, several densely linked parts of the graph
can be uncovered using the non-principal eigenvectors of AA<sup>T</sup>, where A is the adjacency
matrix of the subgraph. The author suggested building the query-focused subgraph from the semantically intrinsic forward
and backward links of Web search result pages. In our context, we can call the pages pointed
to by those links semantically related. We can therefore hypothesize that, for an
ambiguous name query, the linked parts of the semantic subgraph form clusters corresponding to different
individuals.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Implementation</title>
        <p>
          The major problem in applying the HITS algorithm [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and other community discovery methods to
the WePS dataset is the lack of information about the Web graph structure. In particular, full
information about the backward links of a Web page is not available without crawling a major part of
the Web graph.
        </p>
        <p>
          Personalized PageRank (PPR) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] can be used to detect the related pages of a target page [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. In
that work [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], the personalization vector has all elements equal to zero except the
entry corresponding to the target page, which is equal to one. Theoretical and experimental results showed
that the Monte-Carlo method is a fast way to approximate the top-k set of pages
with the largest Personalized PageRank values in a local manner, i.e., using only the forward
links of pages [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The lazy nature of the Monte-Carlo iteration is a considerable advantage in
terms of storage and time over methods that require knowledge of the whole Web graph structure.
Moreover, the Monte-Carlo method is highly parallelizable, which reduces computation time on a
cluster of computers. Since Personalized PageRank computes the related pages of a target page using
only local forward-link information, globally related pages reachable only through backward links are usually missing.
Therefore, we generally cannot expect the related page sets of two pages referring to
one person to overlap, as that would require the global structure of the graph. Nevertheless, we found it useful to
examine the content of related pages.
        </p>
        <p>
          As an alternative to Personalized PageRank, we also consider the related pages offered by the Google
service<sup>3</sup>. Although the algorithm is proprietary, the main idea expressed in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] is formulated as
finding pages frequently co-cited with a target page. According to this explanation, the computation
of related pages involves both the backward and the forward links of a page. Therefore, we consider
Google related pages to be based on global Web graph information.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>System Description</title>
      <sec id="sec-3-1">
        <title>Overview</title>
        <p>An overview of our approach to the name resolution problem is presented in Figure 1. In the first
stage, we cluster the person pages that appear in the search results based on Web structure. To this end, we
determine the related pages of each person page and then put person pages that share some
related pages into one cluster. We consider this clustering Web structure based since it is formed
from link relationships. As the entire link structure of the Web is unknown to us, some
globally topically related pages are missed by the Monte-Carlo random walk process. Because of the
resulting sparseness, we perform a second clustering stage in which the remaining person pages,
those that did not show any link preferences, are clustered based on their content. More specifically, we
build a term profile with frequency scores for all pages, including the related pages. We re-weight the terms in
each person page profile according to the profiles of its related pages. In this process the weights of terms
that appear in a related page profile are increased. Finally, we apply the Hierarchical Agglomerative
Clustering (HAC) algorithm to the set of un-clustered Web page profiles.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Web Structure based Clustering</title>
        <p>Related pages. In the first implementation we compute the related pages of a person page using
Personalized PageRank. While we assume that the presence of a link between pages implies
a semantic relationship, some links exist purely for navigational purposes. To avoid
the negative effect of these links, the Monte-Carlo random walk follows only links to
pages whose host name differs from that of the current page. By host name we mean the first level in the URL
string associated with the link. We found this heuristic useful since links within one domain
typically serve a navigational function rather than indicate semantic similarity. In addition, we
lower the probability of following a link that points to a page with a high in-degree, making it inversely
proportional to the natural logarithm of the in-degree. For example, the main pages of large portals like
Wikipedia or IMDB are universally popular and so have a high number of incoming links, which
do not necessarily carry a semantic relationship. We note that at this point we employ
a type of global information, the number of incoming links, which we found indispensable for
avoiding pathologies caused by the high global PageRank of large portals. We
requested the number of backward links of a given page from the Google search engine<sup>4</sup>.
<fn><label>3</label><p>Google search <monospace>related:</monospace> operator. <uri>http://www.google.com/intl/en/help/operators.html#related</uri></p></fn></p>
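        <p>The two link heuristics described above can be sketched as follows. This is a minimal illustration rather than the exact implementation used in the runs: the function name link_weights, the comparison of full host names as returned by urlparse, and the clamping of small in-degrees to e (so that a weight never exceeds 1 and the logarithm is never zero) are assumptions made for the example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
from urllib.parse import urlparse


def link_weights(current_url, links, in_degree):
    """Return {url: weight} for the links worth following from current_url.

    in_degree maps a URL to its number of incoming links (hypothetical input,
    standing in for the backward-link counts requested from a search engine).
    """
    current_host = urlparse(current_url).hostname
    weights = {}
    for url in links:
        if urlparse(url).hostname == current_host:
            continue  # same host: treated as navigational and never followed
        degree = in_degree.get(url, 0)
        # Weight inversely proportional to ln(in-degree); degrees below e are
        # clamped to e (assumption) so the weight is at most 1.
        weights[url] = 1.0 / math.log(max(degree, math.e))
    return weights
```
]]></preformat>
        <p>A random walk step would then draw the next page with probability proportional to these weights, so that a portal page with 1000 incoming links is followed roughly seven times less often than a page with only a few.</p>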
        <p>We estimated the top-K set of related pages for each person page. In the experiments we used two
values of K, 8 and 16, and hence two settings of the Personalized PageRank computation. For K = 8
we set the number of iterations to 2000 and the damping factor c to 0.2. In the second
setting, for K = 16, we doubled the number of iterations to 4000 and increased the damping factor to
0.3.</p>
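        <p>A Monte-Carlo estimate of the top-K related pages can be sketched as below. The sketch rests on stated assumptions: one iteration is taken to be one random walk started at the target page, the damping factor c is taken to be the probability of following a link at each step (the walk stops otherwise), and links are chosen uniformly. The function name monte_carlo_related and the toy graph are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random
from collections import Counter


def monte_carlo_related(out_links, target, k=8, n_walks=2000, c=0.2, seed=0):
    """Approximate the top-k Personalized PageRank pages of `target`.

    out_links maps a page to the list of pages it links to (forward links only).
    c is assumed to be the probability of following a link at each step, so a
    walk stops with probability 1 - c.
    """
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(n_walks):
        page = target                      # every walk restarts at the target page
        while rng.random() < c:
            links = out_links.get(page, [])
            if not links:                  # dangling page: this walk ends here
                break
            page = rng.choice(links)
            if page != target:
                visits[page] += 1          # count every visit along the path
    return [page for page, _ in visits.most_common(k)]


toy_graph = {"a": ["b", "c"], "b": ["c"], "c": []}
top = monte_carlo_related(toy_graph, "a", k=2, n_walks=2000, c=0.5)
```
]]></preformat>
        <p>Only the forward links of pages actually visited are needed, which is what makes the computation local; independent walks can also be distributed across machines and their visit counts summed.</p>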
        <p>An example list of the top 8 related pages for the personal Web page of a scientist is given in Table
1. For illustration, the related pages returned by the Google service for the same person page
are given in Table 2. Clearly, the Web pages computed by Personalized PageRank refer to different
topics related to the person but do not necessarily contain the query name in their content. These
pages are homepages of current and previous workplaces, Web pages of co-authors, and pages of scientific
activities undertaken by the person. In contrast, the related pages set provided by Google
contains pages of the scientist on large portals such as LinkedIn, DBLP and Videolectures. It is
unlikely that these pages are interconnected by a short forward-link path, which explains
their absence from the Personalized PageRank list.</p>
        <p>Graph clustering. In the following step, two person Web pages are merged into one cluster if they
share some related pages. Since the whole link structure of the Web is unknown to us, the related
pages set is limited to pages reachable by forward links from a person page. We therefore turn
to the content of the pages in the next stage.</p>
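        <p>The merging step can be illustrated with a small union-find sketch. It assumes that sharing at least one related page is enough to merge two person pages and that merging is transitive, so that clusters are the connected components of the "shares a related page" relation; the text above does not fix either choice. The function name cluster_by_shared_related is hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def cluster_by_shared_related(related):
    """related maps each person page to the set of its related pages.

    Returns a list of clusters (sets of person pages). Pages with no shared
    related page end up in singleton clusters.
    """
    parent = {page: page for page in related}

    def find(page):
        # Follow parent pointers to the cluster representative (with path halving).
        while parent[page] != page:
            parent[page] = parent[parent[page]]
            page = parent[page]
        return page

    first_owner = {}  # related page -> first person page seen with it
    for page, rel in related.items():
        for r in rel:
            if r in first_owner:
                parent[find(page)] = find(first_owner[r])  # merge the two clusters
            else:
                first_owner[r] = page

    clusters = {}
    for page in related:
        clusters.setdefault(find(page), set()).add(page)
    return list(clusters.values())
```
]]></preformat>
        <p>Singleton clusters returned by this step correspond to the person pages that showed no link preferences and are passed on to the content-based stage.</p>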
      </sec>
      <sec id="sec-3-3">
        <title>Content based Clustering</title>
        <p>
          Page profile. Preprocessing of Web pages includes the following steps. First, we convert Web
pages into plain text using the Apache HTML parser<sup>5</sup>. In one implementation we extract the full
text of the page, while in the other we keep only the content of META tags. Next, we apply
a clean-up procedure. To distinguish the main content from navigational text, advertisements, etc.,
we use a simple heuristic: meaningful text in a page consists of at least 10 consecutive
terms [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. We consider a term to be a sequence of letters and digits of length greater than one.
Terms are stemmed using Porter's stemmer [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and removed if they appear in a standard English
stopword list.
        </p>
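        <p>The clean-up and term extraction steps might look as follows in a simplified form. Several points are assumptions of the sketch: a run of consecutive terms is taken to be one line of the extracted plain text, the stopword list is a tiny illustrative subset, and stemming is left out to keep the example free of external libraries. The function name term_frequencies is hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import re
from collections import Counter

# Tiny illustrative stopword list; a standard English list would be used in practice.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on", "with", "at"}

# A term is a sequence of letters and digits of length greater than one.
TERM = re.compile(r"[A-Za-z0-9]{2,}")


def term_frequencies(plain_text, min_run=10):
    """Count term frequencies, keeping only lines with at least min_run terms."""
    tf = Counter()
    for line in plain_text.splitlines():
        terms = TERM.findall(line.lower())
        if len(terms) < min_run:
            continue  # short runs are treated as navigation or advertisements
        tf.update(t for t in terms if t not in STOPWORDS)
    return tf
```
]]></preformat>
        <p>A Porter stemmer would be applied to each term before the stopword filter in a fuller version of this step.</p>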
        <p>We apply the preprocessing step to all pages, including person pages and the pages related to them.
Next, for each of these pages we build a vector of terms with their frequency scores (tf)
in the page. After that, we apply the following re-weighting scheme. The score of term t at person page
p, tf(p, t), is updated for each related page r as follows:</p>
        <p>
          <disp-formula><tex-math><![CDATA[\mathit{tf}'(p,t) = \mathit{tf}(p,t) + \mathit{tf}(p,t)\cdot\mathit{tf}(r,t),]]></tex-math></disp-formula>
where r is in the related pages set of person page p, and p is a page from the search
results. This step resembles a voting process: terms that appear in related pages are promoted, and
thus the scores of incidental terms found only in the person page are lowered relative to them. At the end, the vector is normalized
and the 30 top-scoring terms are taken as the person page profile.
<fn><label>4</label><p>Google search <monospace>link:</monospace> operator. <uri>http://www.google.com/intl/en/help/operators.html#link</uri></p></fn>
<fn><label>5</label><p><uri>http://htmlparser.sourceforge.net</uri></p></fn>
        </p>
        <p>
          HAC clustering. Finally, starting from the clustering obtained in the first
stage, we apply the HAC algorithm to the remaining Web page profiles. Specifically, we used average-linkage HAC with
cosine similarity. Following previous work [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], the similarity threshold for the HAC algorithm
was fixed to 0.1.
        </p>
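        <p>The re-weighting and clustering steps of this stage can be sketched together as below. The sketch makes several assumptions that the description leaves open: the update is applied sequentially, once per related page, to the already updated score; normalization is by the Euclidean norm; and HAC is run only over the profiles it is given, without seeding from the first-stage clusters. The function names reweight, cosine and hac are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math


def reweight(person_tf, related_tfs, top_n=30):
    """Apply tf'(p,t) = tf(p,t) + tf(p,t) * tf(r,t) once per related page r,
    then normalize the vector and keep the top_n highest-scoring terms."""
    scores = dict(person_tf)
    for r_tf in related_tfs:
        for t in scores:
            scores[t] += scores[t] * r_tf.get(t, 0)
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return {t: v / norm for t, v in top}


def cosine(a, b):
    """Cosine similarity of two sparse term-weight dictionaries."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def hac(profiles, threshold=0.1):
    """Average-linkage agglomerative clustering of {page: profile}."""
    clusters = [[page] for page in profiles]

    def avg_sim(c1, c2):
        total = sum(cosine(profiles[x], profiles[y]) for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, i, j = max(
            (avg_sim(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break  # no pair of clusters is similar enough to merge
        clusters[i].extend(clusters.pop(j))
    return clusters
```
]]></preformat>
        <p>For instance, with person page frequencies graph = 2, walk = 1, cookie = 1 and two related pages containing graph three times and graph and walk once each, the unnormalized scores become 16, 2 and 1, so the term that the related pages do not support falls to the bottom of the profile.</p>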
        <sec id="sec-3-3-1">
          <title>Algorithm 1 Person Name Disambiguation using Web Graph Structure</title>
          <preformat>Input: Web search results
1. Web structure based clustering.
for all page t ∈ search results do
    compute related pages of t
end for
C0 ← cluster search result pages with shared related pages
2. Content based clustering.
for all page t ∈ search results do
    build term profile of t
    for all page r ∈ related(t) do
        build term profile of r
        update term profile of t
    end for
end for
C ← cluster search result pages using HAC based on C0
return C</preformat>
          <p>During the evaluation period of the WePS campaign we experimented with two ways of computing related
pages, with the number of related pages, and with the type of content extracted from pages. We chose
to combine a smaller number of related pages with the full content of the page and, conversely,
a larger number of related pages with less extracted content. Experimentally we found that with
larger lists, pages might be only loosely related to the person page and might promote irrelevant
terms in the person page profile. Therefore, in that setting we limited the analyzed content of pages to a few words
from the meta tag description, title and snippet.</p>
          <p>We submitted the following runs to the clustering task. The name in brackets is the run name in the
official evaluation results.</p>
          <p>PPR-HAC (AXIS 4). Top 8 related pages were computed using Personalized PageRank, the
full content of the Web page was used in HAC step.</p>
          <p>G-HAC (AXIS 3). Top 8 related pages were provided by Google service, the full content of the
Web page was used in HAC step.</p>
          <p>G-HACext (AXIS 2). Top 16 related pages were provided by Google service, the content of
meta tag description, title and snippet was used in HAC step.</p>
          <p>PPR-HACext (AXIS 1). Top 16 related pages were computed using Personalized PageRank,
the content of meta tag description, title and snippet was used in HAC step.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>We also carried out four baseline runs:</title>
          <p>HAC. The baseline method where HAC algorithm was applied to the content of Web pages. No
link structure based clustering was performed.</p>
          <p>PPR. The baseline method where clustering was based on top 8 related pages computed using
Personalized PageRank. No content-based clustering was performed after.</p>
          <p>G. The baseline method where clustering was based on top 8 related pages provided by Google
service. No content-based clustering was performed after.</p>
          <p>PPR-G. The baseline method where clustering was based on merged set of top 8 related pages
computed using Personalized PageRank and provided by Google service. No content-based
clustering was performed after.</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>Two other baselines were provided by the organizers:</title>
          <p>One-in-one. The baseline method where every Web page is assigned to a different cluster.</p>
          <p>All-in-one. The baseline method where all Web pages are assigned to a single cluster.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>Table 3 shows the results achieved by our methods. Concerning the baseline runs, we note that
the performance of the link-based clustering methods is notable. Taking Web structure as the only
input information, it is possible to match or exceed the performance obtained by processing
the content. Remarkably, the combination of related pages computed using Personalized
PageRank and obtained from the Google service achieved the highest score among the link-based baselines.
We see that the link-based baselines are effective in terms of precision, while the content-based baseline
shows higher recall values. The high precision of the link-based baselines indicates that sharing
related pages is strong evidence of the semantic similarity of two pages. In this case,
improving recall amounts to the problem of finding strong Web graph paths
between two pages. We also found link-based methods beneficial in clustering Flash-based or image-based
pages with little text, where the content-based algorithm experiences difficulties.</p>
      <p>A significant improvement over the baseline methods (16-35% F-0.5) was achieved by
combining link-based and content-based clustering. However, we did not find any
significant difference among the submitted combinations (as indicated by a two-tailed paired t-test). All
submitted runs are characterized by higher BCubed precision than recall. As we
noted above, a more balanced result could be obtained by discovering more link relationships
among pages in the Web graph. The results indicate that all submitted runs improved on the
corresponding baselines. This observation is confirmed by a two-tailed paired t-test at significance level
&lt; 0.001 for all submitted runs against the corresponding link-based and content-based baselines.</p>
      <p>The official performance ranking of WePS-3 participants showed that our algorithm took
second place (in the F-0.5 measure) among 8 competitors with 27 submitted runs in total. In
addition to the achieved performance, we note that our algorithm can be implemented efficiently owing
to the parallel nature of the Monte-Carlo computation.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We have described our approach to the person name disambiguation task at the WePS-3 evaluation
campaign. Our main idea was to use patterns of Web structure to disambiguate Web
search results. From our experiments we concluded that it is possible to resolve
a person name quite successfully using Web structure as the only input information. Web structure based clustering
proved effective in terms of precision. A significant improvement in recall (+60%) was
achieved by combining link-based and content-based clustering. However, we did not find any
significant difference in behaviour between the different combinations. Overall, we conclude that our
algorithm can deliver competitive performance in comparison with the other systems that participated
in the WePS-3 campaign. In addition, our algorithm can be implemented efficiently and is
suitable for use within a large-scale Web service.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Artiles</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sekine</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>The SemEval-2007 WePS evaluation: Establishing a benchmark for the web people search task</article-title>
          .
          <source>Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Artiles</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sekine</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>WePS 2 evaluation campaign: overview of the web people search clustering task</article-title>
          .
          <source>In: 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Artiles</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verdejo</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A testbed for people searching strategies in the WWW</article-title>
          .
          <source>In: SIGIR'05</source>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Avrachenkov</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Litvak</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nemirovsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Osipova</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Monte Carlo methods in PageRank computation: When one iteration is sufficient</article-title>
          .
          <source>SIAM Journal on Numerical Analysis</source>
          <volume>45</volume>
          (
          <issue>2</issue>
          ),
          <fpage>890</fpage>
          –
          <lpage>904</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Balog</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hofmann</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jijkoun</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monz</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsagkias</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weerkamp</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Rijke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>The University of Amsterdam at WePS2</article-title>
          .
          <source>In: 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Flake</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lawrence</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giles</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Efficient identification of web communities</article-title>
          .
          <source>In: KDD '00: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          . pp.
          <fpage>150</fpage>
          –
          <lpage>160</lpage>
          .
          <publisher-name>ACM</publisher-name>
          ,
          <publisher-loc>New York, NY, USA</publisher-loc>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Haveliwala</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Topic-sensitive PageRank</article-title>
          .
          <source>Proceedings of the 11th WWW Conference</source>
          , pp.
          <fpage>517</fpage>
          –
          <lpage>526</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Kleinberg</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Authoritative sources in a hyperlinked environment</article-title>
          .
          <source>J. ACM</source>
          <volume>46</volume>
          (
          <issue>5</issue>
          ),
          <fpage>604</fpage>
          –
          <lpage>632</lpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] Kohlschutter,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Fankhauser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Nejdl</surname>
          </string-name>
          ,
          <string-name>
            <surname>W.</surname>
          </string-name>
          :
          <article-title>Boilerplate detection using shallow text features</article-title>
          .
          <source>In: WSDM '10: Proceedings of the third ACM international conference on Web search and data mining</source>
          . pp.
          <volume>441</volume>
          {
          <fpage>450</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York, NY, USA (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Law</surname>
            ,
            <given-names>K.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harik</surname>
            ,
            <given-names>G.R.</given-names>
          </string-name>
          :
          <article-title>Techniques for finding related hyperlinked documents using link-based analysis</article-title>
          (
          <year>December 2009</year>
          ), assignee: Google Inc.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Leskovec</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lang</surname>
            ,
            <given-names>K.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dasgupta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mahoney</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          :
          <article-title>Statistical properties of community structure in large social and information networks</article-title>
          .
          <source>In: WWW '08: Proceeding of the 17th international conference on World Wide Web</source>
          . pp.
          <fpage>695</fpage>
          –
          <lpage>704</lpage>
          .
          <publisher-name>ACM</publisher-name>
          ,
          <publisher-loc>New York, NY, USA</publisher-loc>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Ollivier</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Senellart</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Finding related pages using Green measures: an illustration with Wikipedia</article-title>
          .
          <source>In: AAAI'07: Proceedings of the 22nd national conference on Artificial intelligence</source>
          . pp.
          <fpage>1427</fpage>
          –
          <lpage>1433</lpage>
          .
          <publisher-name>AAAI Press</publisher-name>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Porter</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          :
          <article-title>An algorithm for suffix stripping</article-title>
          , pp.
          <fpage>313</fpage>
          –
          <lpage>316</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Wan</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Person resolution in person search results: WebHawk</article-title>
          .
          <source>In: CIKM'05</source>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>