<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Web Person Name Disambiguation by Relevance Weighting of Extended Feature Sets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chong Long</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lei Shi</string-name>
          <email>lshig@yahoo-inc.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Yahoo! Global R&amp;D</institution>
          ,
          <addr-line>Beijing</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes our approach to the Person Name Disambiguation clustering task in the Third Web People Search Evaluation Campaign (WePS-3). The method focuses on two aspects: extended feature sets and feature relevance weighting. Bag-of-words and named entities are the most commonly used features in existing web entity disambiguation algorithms; we further extend this basic feature set with Wikipedia concepts. Two feature weighting models are then employed: one measures a feature’s relevance to the target person name (or “query name”), and the other its relevance to the text content. A similarity score is calculated from the feature weights for clustering documents that refer to the same person. Experiments show that the system based on our approach generated the best results among all the WePS-3 submissions.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Person name disambiguation has long been an important problem in natural language
processing and text mining. Because identical person names (or surface names) on different
web pages frequently refer to distinct people, resolving the referents of person names in
web content is essential for many applications. For instance,</p>
      <p>
        (1) In web search, 15-21% of queries contain person names (11-17% of the
queries consist of a person name with additional terms, and 4% are
identified simply as person names) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. If we can retrieve documents that match
the user’s intended person rather than merely the surface name, the relevance of search results for
people-related queries can be substantially improved. (2) Many online social network
applications rely on person names as one of the major identities of their users. Resolving
person name ambiguity is hence crucial for many online SNS services. (3) Along
with word sense ambiguity, entity name ambiguity has been a major impediment for
many natural language processing tasks, such as text classification and clustering.
      </p>
      <p>
        WePS (http://nlp.uned.es/weps/) is a public evaluation campaign for web entity
disambiguation, providing annotated datasets for training and testing [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In 2010, we
(Yahoo! Software R&amp;D Beijing) participated in the Person Name Disambiguation Task of the
third workshop, WePS-3. In this task, 300 person names (or query names) are
provided along with the top 200 documents retrieved from a search engine for each
person name. The goal is to cluster the documents by the identity of the person,
such that documents in which the name refers to the same person are grouped into the same
cluster.
      </p>
      <p>
        Our method for the task focuses on two aspects: the feature set and feature
weighting. Bag-of-words and named entities are the most commonly used features in many
existing web entity disambiguation algorithms. Although they have been reported to be
effective [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], we further extend the basic feature set with Wikipedia concepts. Wikipedia
offers a large repository of a wide range of concepts. Compared with conventional named
entities consisting of persons, locations and organizations, Wikipedia concepts have the
advantage of being well organized, clean and accurate. For instance, “Support Vector
Machine” is better treated as a single coherent conceptual unit rather than three
individual words, since it is an entry in Wikipedia. Concepts like “Support Vector Machine”
also cannot be recognized by current NER tools because they are not
persons, locations or organizations. Since Wikipedia entries are edited by humans, they
are very accurate compared with entities automatically recognized by NER tools.
Previous attempts to leverage Wikipedia for entity disambiguation concentrated on using
Wikipedia entries as referents for resolution rather than as features: they tried to map a
surface name in the text to a Wikipedia entry. However, due to Wikipedia’s limited
coverage of people, the majority of person names are absent from Wikipedia
(only famous people are covered), so this method does not apply to most people
on the web.
      </p>
      <p>To assign weights to the features that indicate their contribution to resolving the
person name’s identity, we employ two weighting models. Most existing methods
use TFIDF as feature weights. Though simple, TFIDF may not represent well a
feature’s relevance to either the query name or the content of the text. Some researchers
use information extraction methods to extract all the related entities. For example, if the
sentence “George Bush is the former president of the U.S.” fits a pattern, “the former
president” will be extracted as the profession of George Bush. However, pattern-based
methods normally suffer from low recall because it is difficult to enumerate all the highly accurate
patterns between elements. Our method views a feature’s contribution to person name
disambiguation from two different perspectives: first, the feature should be relevant
to the query name; second, the feature should represent the content of the text.
Accordingly, we employ two weighting models to measure feature relevance in
these two regards.</p>
      <p>In this paper, we first introduce related work in Section 2; then we describe
complementing the conventional bag-of-words and named entity based features with
Wikipedia concepts in Section 3.1. In Section 3.2 two feature weighting models
are introduced: one measures feature relevance to the query name, and
the other relevance to the text content. Section 3.3 and Section 3.4 present our
similarity measures and our clustering algorithm, respectively. The
experimental results on the WePS datasets are shown in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <p>
        Web person name disambiguation has also been viewed as a cross-document co-reference
problem in much previous work. Bagga et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] employed co-occurring word
vectors to calculate similarity between entity names. Niu et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] extended Bagga’s
method through information extraction. Mann and Yarowsky [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] proposed a
clustering method based on extracted biographic data. However, Niu’s and Mann’s methods
were only evaluated on manually generated test data and focused mainly on person name
disambiguation. Wan et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] made the assumption that a query for a person
usually omits the middle name and implemented a person name disambiguation system
called “WebHawk”. Our approach is able to deal with more general situations.
      </p>
      <p>
        Bekkerman and McCallum [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] focused on social networks to find documents that
refer to a particular person through two methods: one based on link structure and
the other on agglomerative/conglomerative double clustering. Bunescu and Pasca [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
and Cucerzan [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] used Wikipedia knowledge to disambiguate named entities.
However, unlike our approach, they tried to “map” the surface names in the text to
Wikipedia entries. Due to the limited coverage of Wikipedia entries on
people, this method cannot resolve the majority of people, who are not
famous enough to be included in Wikipedia.
      </p>
      <p>
        Recently Yoshida et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] proposed a two-stage clustering algorithm and further
used the bootstrapping algorithm in the second stage [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Their method relies heavily
on named entity extraction. In Section 4 we will show that our approach, which
incorporates Wikipedia concepts, outperforms those based on entities identified by conventional
named entity recognition modules.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 Methodology</title>
      <p>In this section we present our proposed web person name disambiguation approach,
which consists of four main steps. The overview of our approach will be provided first,
followed by detailed steps.</p>
      <p>1. First, Wikipedia concepts are extracted as features of a web page, together with
other conventional features such as bag-of-words and named entities. The web page is
converted into a feature vector based on the three types of features extracted from the
text.</p>
      <p>2. Then the weight of each feature in the feature vector is estimated by two
weighting models: one is the feature’s relevance to the query name, and the other is the
relevance to the text content. Each feature in a vector is measured by its TFIDF score and
weights under two models.</p>
      <p>3. After that the similarity score between two different pages containing the same
query name is calculated through their feature vectors based on two similarity measures:
cosine similarity and overlap similarity.</p>
      <p>4. Finally, web pages referring to the same entity are clustered according to the
pairwise similarity scores calculated in the previous step.</p>
      <sec id="sec-3-1">
        <title>3.1 Wikipedia Concept Extraction</title>
        <p>As mentioned in Section 2, much of the existing work takes named entities as important
features. We additionally include Wikipedia concepts (also called “Wikipedia elements”)
extracted from the text in our feature set. We first extract all the manually edited entries
from Wikipedia and build a Wikipedia concept dictionary. Given a web page (with HTML
tags removed), a finite state automaton (FSA) is used to extract string sequences in the
text that match the Wikipedia concepts in the dictionary. To avoid overlaps,
we use maximum matching. For instance, both “People’s Republic of China” and
“China” are Wikipedia concepts. Since “People’s Republic of China” contains the string
“China”, only the maximum match “People’s Republic of China” is extracted as a
Wikipedia concept feature. These features, together with the bag-of-words and named
entities of person, location and organization names recognized with the Stanford NER
tool (http://nlp.stanford.edu/ner/index.shtml), form a feature vector that represents the
content of the web page.</p>
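The maximum-matching step can be sketched as follows. This is a simplified stand-in for the paper's FSA matcher, assuming whitespace tokenization and an in-memory set of dictionary entries (the `concept_dict` argument and helper names are illustrative, not from the paper):

```python
def extract_concepts(text, concept_dict):
    """Greedy maximum matching: scan tokens left to right and take the
    longest span that matches a dictionary entry, so "People's Republic
    of China" wins over the embedded "China"."""
    tokens = text.split()
    max_len = max((len(c.split()) for c in concept_dict), default=1)
    found, i = [], 0
    while i < len(tokens):
        match_len = 0
        # try the longest window first
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in concept_dict:
                found.append(candidate)
                match_len = n
                break
        i += match_len if match_len else 1
    return found

# toy dictionary; the real one holds all Wikipedia entry titles
concepts = {"People's Republic of China", "China", "Support Vector Machine"}
print(extract_concepts("He moved to the People's Republic of China in 1990", concepts))
```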
        <p>Therefore, our extended feature set has three types of features in all: Wikipedia
concepts, bag-of-words and named entities.</p>
        <p>Compared with bag-of-words and named entities, using Wikipedia concepts offers
the following merits:</p>
        <p>1. Wikipedia is a large, well-organized dictionary of named entities. For example,
“Support Vector Machines” is treated as three different words under the bag-of-words
model. With Wikipedia, however, this term is recognized as a single concept,
since Wikipedia has a manually edited entry for it.</p>
        <p>
          2. Wikipedia’s redirect pages can help find other alternative names for an entity [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
For example, the redirect pages of “United States” correspond to acronyms (U.S.A.,
U.S., USA, US), Spanish translations (Los Estados Unidos, Estados Unidos),
misspellings (Untied States) or synonyms (Yankee land).
        </p>
        <p>3. Wikipedia’s disambiguation pages can guide the system in disambiguating a
number of entities. For example, the disambiguation page for the name “Michael Jordan”
lists 8 associated entities (people). If the name “Michael Jordan” appears in a web page
and is closely related to one of these 8 people, this can help the system make a decision.</p>
        <p>Our experiments on the WePS dataset in Section 4 show that our system with
Wikipedia features outperforms the ones with only bag-of-words and named entity
features.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Feature Weighting Model</title>
        <p>After the web page is converted into a feature vector, every feature in the vector is
assigned a weight measuring its importance in recognizing the identity of the query
name. Each word, named entity and Wikipedia concept is called a “unit”, denoted as u
in this paper. Units are distinguished from each other by their corresponding feature
weights. At the beginning, each unit is assigned a TFIDF score</p>
        <p>TFIDF(u) = tf(u) · (−log df(u))     (1)
where tf(u) is u’s term frequency on the web page, and df(u) is u’s document
frequency on a large corpus. We use the Yahoo! search engine to collect the statistics of
df(u). Then we propose two feature weighting models, the query relevance model and
the content relevance model, to assign each unit a proper weight.</p>
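The TFIDF score above is straightforward to compute once tf(u) and df(u) are available; a minimal sketch, assuming df(u) is the fraction of corpus documents containing u and a natural logarithm (both assumptions on my part):

```python
import math

def tfidf(tf, df):
    """TFIDF(u) = tf(u) * (-log df(u)), where tf(u) is the unit's
    frequency on the page and df(u) its document frequency, taken here
    as a fraction in (0, 1] so that rarer units score higher."""
    return tf * -math.log(df)

# a unit occurring 3 times on the page, present in 1% of corpus documents
print(tfidf(3, 0.01))
```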
        <p>Query Relevance Weighting Model Query relevance weighting measures how
relevant a feature is to the query name. Intuitively, relevant concepts of the query name
can better represent its identity. In our method, we base our weighting model on the
assumption that words or concepts that appear close to the query name in the text are
more relevant than distant ones. The distance d(u) is measured by the minimum number
of sentences between those containing the query q and those containing u; d(u) = 0 if u and q
co-occur in the same sentence. All units u with 0 ≤ d(u) ≤ d_max are considered. We get
d_max = 11 from the training sets. Three polynomial functions are used: f1(u), f2(u)
and f3(u). If d(u) &gt; d_max, f1(u) = f2(u) = f3(u) = 0; if 0 ≤ d(u) ≤ d_max, they
are computed as Equations 2 to 4:

f1(u) = 1 − d(u)/d_max     (2)
f2(u) = 1 − (d(u)/d_max)^2     (3)
f3(u) = (1 − d(u)/d_max)^2     (4)</p>
        <p>Here we give an example. The following passage comes from a Wikipedia article
about Michael Jordan (http://en.wikipedia.org/wiki/Michael Jordan). The passage has
ten sentences, numbered from one to ten.</p>
        <p>1. In the 1990 - 91 season, Jordan won his second MVP award after averaging 31.5
ppg on 53.9% shooting, 6.0 rpg, and 5.5 apg for the regular season.</p>
        <p>2. The Bulls finished in first place in their division for the first time in 16 years and
set a franchise record with 61 wins in the regular season.</p>
        <p>3. With Scottie Pippen developing into an All-Star, the Bulls elevated their play.
4. The Bulls defeated the New York Knicks and the Philadelphia 76ers in the
opening two rounds of the playoffs.</p>
        <p>5. They advanced to the Eastern Conference Finals where their rival, the Detroit
Pistons, awaited them.</p>
        <p>6. However, this time the Bulls beat the Pistons in a surprising sweep.
7. In an unusual ending to the fourth and final game, Isiah Thomas led his team off
the court before the final minute had concluded.</p>
        <p>8. Most of the Pistons went directly to their locker room instead of shaking hands
with the Bulls.</p>
        <p>9. The Bulls compiled an outstanding 15 - 2 record during the playoffs, and
advanced to the NBA Finals for the first time in franchise history, where they beat the Los
Angeles Lakers four games to one.</p>
        <p>10. Perhaps the best known moment of the series came in Game 2 when, attempting
a dunk, Jordan avoided a potential Sam Perkins block by switching the ball from his
right hand to his left in mid-air to lay the shot in.</p>
        <p>The query name is “Michael Jordan”, or “Jordan”. Each sentence’s distance weight
is shown in Table 3.2. Two sentences in the passage above contain the query name:
No.1 and No.10; therefore, their units’ d(u) is 0. The units in the 2nd and the 9th
sentences get d(u) = 1 because sentences No.1 and No.10 are adjacent to them,
respectively. With this method we can compute the other sentences’ distances to the query name.
The distance weights under the three weighting functions are listed in the last three rows,
respectively.</p>
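The sentence-distance computation and the three weighting functions can be sketched as follows (helper names are illustrative; simple substring matching stands in for proper name matching):

```python
def sentence_distances(sentences, query):
    """d for each sentence: minimum number of sentences between it and
    the nearest sentence containing the query name (0 if it contains it)."""
    hits = [i for i, s in enumerate(sentences) if query in s]
    return [min(abs(i - h) for h in hits) for i in range(len(sentences))]

# the three weighting functions, with f1 = f2 = f3 = 0 beyond d_max
def f1(d, dmax=11): return 1 - d / dmax if d <= dmax else 0.0
def f2(d, dmax=11): return 1 - (d / dmax) ** 2 if d <= dmax else 0.0
def f3(d, dmax=11): return (1 - d / dmax) ** 2 if d <= dmax else 0.0

sents = ["Jordan won his second MVP award.",
         "The Bulls finished first in their division.",
         "Scottie Pippen developed into an All-Star."]
print(sentence_distances(sents, "Jordan"))  # [0, 1, 2]
```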
        <p>
          The Gradient Boosted Decision Tree (GBDT) [
          <xref ref-type="bibr" rid="ref7 ref8">8, 7</xref>
          ] is used for content relevance weighting
with the above features. To train the machine-learned relevance model, 1.3 million
popular web pages are collected as the content weighting training data to learn the
features and compute the frequencies. In addition, 400,000 query-url pairs are collected
for manual annotation. We can get a web page from each url, and a query can be viewed
as one of its features (or units). Annotators judge the relevance of a query to a web page
on a 5-point scale (Perfect, Excellent, Good, Fair, Bad). The 5-point scales are then
scored from 1.0 (Perfect) to 0 (Bad), respectively.
        </p>
        <p>Each element in the training data is written as (x_i, g_i), i = 1, …, N (N is the size of the training
data), and we need to fit a function h such that g_i ≈ h(x_i), i = 1, …, N. The loss function
between g and h is

∑_{i=1}^{N} |g_i − h(x_i)|^2     (5)</p>
        <p>
          We apply gradient descent in functional space to minimize the discrepancy [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
The GBDT regression algorithm is given as Algorithm 1.
        </p>
        <p>There are two parameters, M and η. They are estimated with cross-validation on the
content weighting training set.</p>
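Algorithm 1 is referenced but not reproduced in this text; the following is a minimal squared-loss gradient-boosting sketch that fits each stage to the current residuals g_i − h(x_i) and adds it scaled by the shrinkage η. One-dimensional regression stumps stand in for the decision trees of GBDT, and the toy data is illustrative:

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on one feature (min squared error)."""
    best_err, best = float("inf"), None
    for thr in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if err < best_err:
            best_err, best = err, (thr, lm, rm)
    return best

def gbdt_fit(xs, gs, M=200, eta=0.1):
    """Squared-loss gradient boosting: each stage fits a stump to the
    current residuals g_i - h(x_i) and is added scaled by eta."""
    base = sum(gs) / len(gs)
    stumps, pred = [], [base] * len(xs)
    for _ in range(M):
        stump = fit_stump(xs, [g - p for g, p in zip(gs, pred)])
        if stump is None:
            break
        thr, lm, rm = stump
        stumps.append(stump)
        pred = [p + eta * (lm if x <= thr else rm) for p, x in zip(pred, xs)]

    def h(x):
        return base + sum(eta * (lm if x <= thr else rm)
                          for thr, lm, rm in stumps)
    return h

# toy 1-D regression of annotator relevance scores
xs = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
gs = [0.0, 0.0, 0.25, 0.75, 1.0, 1.0]
h = gbdt_fit(xs, gs)
```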
        <p>Here we have the relevance score 0 ≤ r(u) ≤ 1 of a unit u (including the query
name q). The higher the value of r(u), the more relevant the unit u is to the text content.</p>
        <p>Since r(u) is the relevance of u to the text content, we can use it to estimate vr(u),
which is the relevance of the unit u to the query name q:

vr(u) = 1 if u = q;
vr(u) = 0 if r(q) &lt; θ or r(u) &lt; θ;
vr(u) = r(q) · r(u) if r(q) ≥ θ and r(u) ≥ θ,     (6)

where θ is a parameter; we get θ = 0.5 from the training set.</p>
        <p>We now have two weighting models, vp(u) and vr(u), both used to
estimate the relevance of u to the query name q. We set the weight v(u) to the maximum
of vp(u) and vr(u):</p>
        <p>v(u) = max{vp(u), vr(u)}</p>
        <p>For each unit u, we can obtain its relevance score v(u). All unit scores of a web page
form a vector V. We can compute the similarity between a pair of web pages through their
vectors V.</p>
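Combining the two models can be sketched as below, reading the piecewise definition of vr(u) as zero when either relevance score falls below θ (that reading, and the names and toy scores, are my assumptions):

```python
def vr(u, q, r, theta=0.5):
    """Content-derived query relevance: 1 for the query name itself,
    0 when either relevance score falls below theta, else r(q) * r(u)."""
    if u == q:
        return 1.0
    if r[q] < theta or r[u] < theta:
        return 0.0
    return r[q] * r[u]

def unit_weight(u, q, r, vp):
    """Final weight v(u) = max{vp(u), vr(u)}, with vp(u) the
    distance-based query relevance weight."""
    return max(vp.get(u, 0.0), vr(u, q, r))

# toy relevance scores r(.) and distance-based weights vp(.)
r = {"Michael Jordan": 0.9, "Bulls": 0.8, "weather": 0.2}
vp = {"Bulls": 0.6}
print(unit_weight("Bulls", "Michael Jordan", r, vp))
```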
      </sec>
      <sec id="sec-3-3">
        <title>3.3 Similarity Measures</title>
        <p>Let V and V′ be the feature vectors of two web pages containing the same entity name.
Two types of similarity measures are used: (1) cosine similarity

Sim_cos(V, V′) = (V · V′) / (|V| |V′|)

and (2) overlap similarity

Sim_overlap(V, V′) = ∑_{u ∈ V ∩ V′} (v(u) + v′(u)) / (∑_{w ∈ V} v(w) + ∑_{w′ ∈ V′} v′(w′)),

where each u is one of the common units shared by V and V′; w and w′ range over all units in
V and V′, respectively; and v(u), v′(u), v(w) and v′(w′) are the unit scores
computed under one of the two weighting models.</p>
        <p>The performance of these two similarity measures is compared in our
experiment section. The results show that the disambiguation result based on overlap
similarity is better than that based on cosine similarity.</p>
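Both similarity measures can be sketched over sparse weight vectors represented as dicts (the toy vectors are illustrative):

```python
import math

def cosine_sim(v1, v2):
    """Sim_cos(V, V') = V . V' / (|V||V'|) over sparse weight vectors."""
    dot = sum(v1[u] * v2[u] for u in set(v1) & set(v2))
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def overlap_sim(v1, v2):
    """Weighted overlap: summed weights of shared units over the
    summed weights of all units in both vectors."""
    shared = sum(v1[u] + v2[u] for u in set(v1) & set(v2))
    total = sum(v1.values()) + sum(v2.values())
    return shared / total if total else 0.0

a = {"Bulls": 0.7, "MVP": 0.5, "Chicago": 0.3}
b = {"Bulls": 0.6, "Berkeley": 0.4}
print(round(overlap_sim(a, b), 2))  # 0.52
```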
      </sec>
      <sec id="sec-3-4">
        <title>3.4 Clustering</title>
        <p>
          We employed the Hierarchical Agglomerative Clustering [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] algorithm to cluster
documents with the same person name. Suppose C_i and C_j are two clusters. If there are two
web pages V ∈ C_i and V′ ∈ C_j that satisfy

Sim(V, V′) &gt; γ,

then C_i and C_j are merged into one cluster, where Sim(V, V′) can be computed with
either of the two similarity measures. γ = 0.25 is tuned on the training sets.
        </p>
        <p>Algorithm 2: pseudo-code of the clustering algorithm
1: C = {{1}, {2}, ..., {n}} (n is the number of web pages)
2: m ← n (m is the number of clusters)
3: while m &gt; 1 do
4:   (C_i, C_j) ← argmax_{C_i, C_j ∈ C} Sim(C_i, C_j),
     where Sim(C_i, C_j) = max_{x ∈ C_i, y ∈ C_j} Sim(V_x, V_y)
5:   if Sim(C_i, C_j) ≤ γ then go to 10
6:   C_i ← C_i ∪ C_j
7:   C ← C \ C_j
8:   m ← m − 1
9: end while
10: Output the clustering results</p>
      </sec>
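The clustering step can be sketched as a single-linkage agglomerative loop over pairwise page similarities, merging only while the best cross-cluster similarity exceeds γ (the toy similarity function and pages are illustrative, not the paper's feature vectors):

```python
def hac_cluster(pages, sim, gamma=0.25):
    """Hierarchical agglomerative clustering: repeatedly merge the pair
    of clusters whose best cross-pair page similarity exceeds gamma."""
    clusters = [[p] for p in range(len(pages))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: max similarity over cross-cluster pairs
                s = max(sim(pages[x], pages[y])
                        for x in clusters[i] for y in clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        if best <= gamma:
            break
        i, j = pair
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

# toy similarity: pages as weight dicts, weighted overlap on shared keys
def toy_sim(p1, p2):
    shared = set(p1) & set(p2)
    total = sum(p1.values()) + sum(p2.values())
    return sum(p1[u] + p2[u] for u in shared) / total if total else 0.0

pages = [{"Bulls": 1.0}, {"Bulls": 0.9, "MVP": 0.1}, {"Berkeley": 1.0}]
print(hac_cluster(pages, toy_sim))  # [[0, 1], [2]]
```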
    </sec>
    <sec id="sec-4">
      <title>4 Experiments</title>
      <sec id="sec-4-1">
        <title>4.1 Datasets</title>
        <p>In this section, we evaluate our approach on the WePS datasets. The WePS-1 and
WePS-2 datasets are used as the training and test data for evaluation first. Our
system’s performance in the WePS-3 campaign is also presented.</p>
        <p>
          There are 76 query names in total in WePS-1. They are randomly selected from the US
Census, ambiguous person names in the English Wikipedia, and the program committee listing
of a Computer Science conference [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. For each query name, at most the top 100 web pages
returned by the Yahoo! search engine are collected for disambiguation, giving 6445
pages in total. In WePS-2, 30 query names are selected; each query name has
at most 150 pages from the top search results, for 3444 web pages in total. In
WePS-3 there are 300 query names, with the top 200 web pages returned by Yahoo! for
each query name, yielding 57355 evaluation pages in total. The WePS program
committee asked annotators to manually label a document clustering for each query name.
The system’s performance is measured by comparing the clustering generated by the
algorithm with the human-labeled gold-standard test data. Two evaluation metrics are used
in WePS: the Purity F-score and the B-Cubed F-score [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
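The B-Cubed F-score can be sketched as follows, assuming each page belongs to exactly one system cluster and one gold cluster: per-item precision is |C(e) ∩ L(e)|/|C(e)| and recall is |C(e) ∩ L(e)|/|L(e)|, averaged over items and combined by the harmonic mean (the toy clusterings are illustrative):

```python
def bcubed_f(system, gold):
    """B-Cubed F: average per-item precision and recall over all items,
    then take the harmonic mean of the two averages."""
    sys_of = {e: frozenset(c) for c in system for e in c}
    gold_of = {e: frozenset(c) for c in gold for e in c}
    items = list(sys_of)
    p = sum(len(sys_of[e] & gold_of[e]) / len(sys_of[e]) for e in items) / len(items)
    r = sum(len(sys_of[e] & gold_of[e]) / len(gold_of[e]) for e in items) / len(items)
    return 2 * p * r / (p + r) if p + r else 0.0

# toy system and gold clusterings over pages 1..5
system = [{1, 2}, {3, 4, 5}]
gold = [{1, 2, 3}, {4, 5}]
print(bcubed_f(system, gold))
```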
        <p>The WePS-1 datasets are used as the training data in our experiments to learn the
similarity metrics and tune the parameters. The WePS-2 datasets are used as the test
data. We submitted our system’s outputs on the WePS-3 datasets to the campaign, without
any modification of the algorithm.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Experimental Results</title>
        <p>In this sub-section, we evaluate our approach in three respects. First, the
experimental results of the system under different feature sets and similarity measures
are provided. Then our two proposed relevance weighting models are evaluated.
Finally we show our system’s performance on the WePS-3 datasets.</p>
        <p>Features and Similarity Measures We introduced in Section 3.1 the
three types of features (or units): bag-of-words, named entities and Wikipedia concepts.
Two similarity measures are described in Section 3.3. Evaluation results for
these two aspects are first presented, with TFIDF as the weight on each feature. The
results with different feature sets and similarity measures are shown in Table 2.</p>
        <p>F_Cubed and F_Purity refer to the B-Cubed F-score and Purity F-score, respectively.
“Cosine” and “Overlap” mean cosine similarity and overlap similarity. “Bag-of-words &amp; Named
Entity” means both bag-of-words and named entity features are used. “All” means all
three types of features are used. From this table we can draw four observations:
1. If we only use bag-of-words features, the results are not satisfactory, while
combining them with named entity and Wikipedia concept features gives better results;
2. Using all three types of features does not yield much better results than two types,
because there is much overlap between named entities and Wikipedia concepts
(a Wikipedia concept can also be viewed as a named entity);</p>
        <p>3. The system based on overlap similarity outperforms the one based on cosine
similarity;</p>
        <p>4. The system achieves the best results with overlap similarity and with
bag-of-words and Wikipedia features (F_Cubed = 0.74 and F_Purity = 0.81). Therefore, these
are used in the experiments in the following sub-sections.</p>
        <p>Evaluation of the Weighting Models We evaluate our two proposed relevance
weighting models in this sub-section.</p>
        <p>First, we evaluate the query relevance weighting model. There are three weighting
functions: f1, f2 and f3 (see Equations 2 to 4). The experimental results using
these three functions, as well as without any weighting function, are shown on the
left-hand side of Figure 1. “1”, “2” and “3” are the query relevance weighting functions f1, f2
and f3, respectively; “0” means no weighting function (only TFIDF weights). We can
see from this figure that the performance of the system is substantially improved
by the query relevance weighting functions. The system performs almost equally well
under the three functions.</p>
        <p>We compare the system’s performance under different weighting models. On the
right-hand side of Figure 1, “No Models” means we only use the TFIDF weight. “Query”</p>
        <sec id="sec-4-2-2">
          <title>Relevance weighting models</title>
          <p>means the query relevance weighting model is used (with weighting function f2).
“Content” stands for the content relevance weighting model. “Both” means both
weighting models. The figure shows a marked improvement in performance when both
weighting models are used.</p>
          <p>Results on the WePS-3 datasets In the WePS-3 campaign, we evaluated our system on the
WePS-3 test datasets. Table 3 shows our results. “Best” and “median” are the best and
the median F_Cubed scores among all submissions, respectively. We submitted three
groups of results, named “YHBJ-1”, “YHBJ-2” and “YHBJ-3”. YHBJ-1 and
YHBJ-2 are both based on the extended feature sets including bag-of-words and Wikipedia
concepts, with the query relevance and content relevance weighting models. YHBJ-1 sets
γ = 0.3 as the clustering threshold and YHBJ-2 sets γ = 0.25 (see Section 3.4).
YHBJ-3 does not use the content relevance weighting model, so its performance is lower than the
other two submissions. From the table we can see that YHBJ-2 is the best result among
all WePS-3 submissions.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusion and Future Work</title>
      <p>Our approach to web person name disambiguation extends bag-of-words
features with Wikipedia concepts. To measure the feature weights used in calculating
document clustering similarity, we employ two weighting models that take into account
feature relevance to the query name and to the text content. Experimental results on WePS-3
Task 1 confirm the effectiveness of our method, which outperforms all other competing
algorithms.</p>
      <p>In the future, we can make the following improvements to this method:
1. There are no more than 200 top-ranked web pages for each query name in the
WePS datasets, but the large number of remaining search results contains a great deal
of information which could help produce better clusters. We plan to build a model that uses
the information returned by search engines as much as possible;</p>
      <p>2. Currently, our distance weighting functions are applied to entities and Wikipedia
concepts. In the future we plan to leverage semantic information in our weighting
functions;</p>
      <p>3. Machine learning models can be used to calculate similarity scores in order to obtain
more accurate estimates.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Javier</given-names>
            <surname>Artiles</surname>
          </string-name>
          , Julio Gonzalo, and
          <string-name>
            <given-names>Satoshi</given-names>
            <surname>Sekine</surname>
          </string-name>
          .
          <article-title>The semeval-2007 weps evaluation: Establishing a benchmark for the web people search task</article-title>
          .
          <source>In the Fourth International Workshop on Semantic Evaluations (SemEval-2007)</source>
          . ACL,
          <year>June 2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Javier</given-names>
            <surname>Artiles</surname>
          </string-name>
          , Julio Gonzalo, and
          <string-name>
            <given-names>Satoshi</given-names>
            <surname>Sekine</surname>
          </string-name>
          .
          <article-title>Weps 2 evaluation campaign: Overview of the web people search clustering task</article-title>
          .
          <source>In WWW</source>
          ,
          <year>April 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Ron</given-names>
            <surname>Bekkerman</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>McCallum</surname>
          </string-name>
          .
          <article-title>Disambiguating web appearances of people in a social network</article-title>
          .
          <source>In WWW</source>
          , pages
          <fpage>463</fpage>
          -
          <lpage>470</lpage>
          , May
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Razvan</given-names>
            <surname>Bunescu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Marius</given-names>
            <surname>Pasca</surname>
          </string-name>
          .
          <article-title>Using encyclopedic knowledge for named entity disambiguation</article-title>
          .
          <source>In EACL</source>
          , pages
          <fpage>17</fpage>
          -
          <lpage>24</lpage>
          ,
          <year>April 2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Silviu</given-names>
            <surname>Cucerzan</surname>
          </string-name>
          .
          <article-title>Large-scale named entity disambiguation based on wikipedia data</article-title>
          .
          <source>In EMNLP</source>
          , pages
          <fpage>708</fpage>
          -
          <lpage>716</lpage>
          ,
          <year>June 2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Ergin</given-names>
            <surname>Elmacioglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yee Fan</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Su</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Min-Yen</given-names>
            <surname>Kan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Dongwon</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>Psnus: Web people name disambiguation by simple clustering with rich features</article-title>
          .
          <source>In the Fourth International Workshop on Semantic Evaluations (SemEval-2007)</source>
          . ACL,
          <year>June 2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Jerome</given-names>
            <surname>Friedman</surname>
          </string-name>
          .
          <article-title>Greedy Function Approximation: A Gradient Boosting Machine</article-title>
          .
          <source>Annals of Statistics</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Jerome</given-names>
            <surname>Friedman</surname>
          </string-name>
          .
          <article-title>Stochastic Gradient Boosting</article-title>
          .
          <source>Stanford University</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Trevor</given-names>
            <surname>Hastie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Robert</given-names>
            <surname>Tibshirani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jerome</given-names>
            <surname>Friedman</surname>
          </string-name>
          .
          <source>The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd Edition)</source>
          . Springer, New York,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Gideon S.</given-names>
            <surname>Mann</surname>
          </string-name>
          and
          <string-name>
            <given-names>David</given-names>
            <surname>Yarowsky</surname>
          </string-name>
          .
          <article-title>Unsupervised personal name disambiguation</article-title>
          .
          <source>In HLT-NAACL</source>
          , pages
          <fpage>33</fpage>
          -
          <lpage>40</lpage>
          , May
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Cheng</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Wei</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Rohini K.</given-names>
            <surname>Srihari</surname>
          </string-name>
          .
          <article-title>Weakly supervised learning for cross-document person name disambiguation supported by information extraction</article-title>
          .
          <source>In ACL</source>
          , pages
          <fpage>598</fpage>
          -
          <lpage>605</lpage>
          ,
          <year>July 2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Amit</given-names>
            <surname>Bagga</surname>
          </string-name>
          and
          <string-name>
            <given-names>Breck</given-names>
            <surname>Baldwin</surname>
          </string-name>
          .
          <article-title>Entity-based cross-document coreferencing using the vector space model</article-title>
          .
          <source>In COLING-ACL</source>
          , pages
          <fpage>79</fpage>
          -
          <lpage>85</lpage>
          ,
          <year>August 1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Deepa</given-names>
            <surname>Paranjpe</surname>
          </string-name>
          .
          <article-title>Learning document aboutness from implicit user feedback and document structure</article-title>
          .
          <source>In CIKM</source>
          ,
          <year>November 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Amanda</given-names>
            <surname>Spink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bernard J.</given-names>
            <surname>Jansen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jan</given-names>
            <surname>Pedersen</surname>
          </string-name>
          .
          <article-title>Searching for people on web search engines</article-title>
          .
          <source>Journal of Documentation</source>
          ,
          <volume>60</volume>
          :
          <fpage>266</fpage>
          -
          <lpage>278</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>Xiaojun</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jianfeng</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mu</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Binggong</given-names>
            <surname>Ding</surname>
          </string-name>
          .
          <article-title>Person resolution in person search results: Webhawk</article-title>
          .
          <source>In CIKM</source>
          , pages
          <fpage>163</fpage>
          -
          <lpage>170</lpage>
          ,
          <year>October 2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>Minoru</given-names>
            <surname>Yoshida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Masaki</given-names>
            <surname>Ikeda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Shingo</given-names>
            <surname>Ono</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Issei</given-names>
            <surname>Sato</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Hiroshi</given-names>
            <surname>Nakagawa</surname>
          </string-name>
          .
          <article-title>Person name disambiguation on the web by two-stage clustering</article-title>
          .
          <source>In WWW</source>
          ,
          <year>April 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>Minoru</given-names>
            <surname>Yoshida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Masaki</given-names>
            <surname>Ikeda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Shingo</given-names>
            <surname>Ono</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Issei</given-names>
            <surname>Sato</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Hiroshi</given-names>
            <surname>Nakagawa</surname>
          </string-name>
          .
          <article-title>Person name disambiguation by bootstrapping</article-title>
          .
          <source>In SIGIR</source>
          ,
          <year>July 2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>