WePS-3 Evaluation Campaign:
Overview of the Web People Search Clustering
and Attribute Extraction Tasks
Javier Artiles1, Andrew Borthwick2, Julio Gonzalo1, Satoshi Sekine3, and Enrique Amigó1
1 NLP Group of UNED University, Madrid, Spain (http://nlp.uned.es)
2 Intelius, Inc., Bellevue, WA, USA (http://search.intelius.com)
3 CS Dept., New York University, New York, USA (http://www.nyu.edu)
Abstract. The third WePS (Web People Search) Evaluation campaign took place in 2009-2010 and attracted the participation of 13 research groups from Europe, Asia and North America. Given the top web search results for a person name, two tasks were addressed: a clustering task, which consists of grouping together web pages referring to the same person, and an extraction task, which consists of extracting salient attributes for each of the persons sharing the same name. Continuing the path of previous campaigns, this third evaluation aimed at merging both problems into one single task, where the system must return both the documents and the attributes for each of the different people sharing a given name. This is not a trivial step from the point of view of evaluation: a system may correctly extract attribute profiles from different URLs but then incorrectly merge profiles. This campaign also featured a larger testbed and the participation of a state-of-the-art commercial WePS system in the attribute extraction task. This paper presents the definition, resources, evaluation methodology and results for the clustering and attribute extraction tasks.
Keywords: Web Search, Web People Search, Text Clustering, Attribute Extraction, Meta-search Engines, Evaluation
1 Introduction
The Web People Search task has been defined in WePS campaigns as the problem
of organizing web search results for a given person name. The most frequently
used web search engines return a ranked list of URLs which typically refer to
various people sharing the same name. Ideally, the user would obtain groups
of documents that refer to the same individual, possibly with a list of person
attributes that help the user choose the cluster that represents the person she
is looking for.
From a practical point of view, the people search task is highly relevant: between 11 and 17% of web queries include a person name, 4% of web queries consist of just a person name, and person names are highly ambiguous (according to the US Census Bureau, only 90,000 different names are shared by more than 100,000,000 people). An indirect proof of the relevance of the problem is the fact that, since 2005, a number of web startups have been created precisely to address it, Spock.com (now Intelius) and Zoominfo.com being the best known.
From a research point of view, the task is challenging (the number of clus-
ters is not known a priori; the degree of ambiguity does not seem to follow a
normal distribution; and web pages are noisy sources from which attributes and
other indexes are difficult to extract) and has connections with Natural Lan-
guage Processing and Information Retrieval tasks (Text Clustering, Information
Extraction, Word Sense Discrimination) in the context of the WWW as data
source.
WePS-1 [3] was run as a SemEval-2007 task, receiving submissions from 16 teams (making it one of the largest tasks in SemEval), and WePS-2 [4] was run as a workshop of the WWW 2009 conference, with the participation of 19 research teams. In the first campaign we addressed only the name co-reference problem, defining the task as clustering of web search results for a given person name. In the second campaign we refined the evaluation metrics [2, 1] and added an attribute extraction task [15] for web documents returned by the search engine for a given person name.
For this third campaign we aimed at merging both problems into one single
task, where the system must return both the documents and the attributes for
each of the different people sharing a given name. This is not a trivial step from
the point of view of evaluation: a system may correctly extract attribute profiles
from different URLs but then incorrectly merge profiles.4
WePS-1 and WePS-2 focused on consolidating a research community around
the problem and an optimal evaluation methodology. In WePS-3 the focus was on involving industrial stakeholders in the evaluation campaign, as providers of input to the task design phase and also as providers of realistic-scale datasets. To reach this goal, we incorporated as co-coordinator Andrew Borthwick, principal scientist at Intelius, Inc., one of the main Web People Search services, which provides advanced people attribute extraction and profile matching from web pages.
This paper presents an overview of the WePS-3 clustering and attribute ex-
traction tasks. The task definition is provided in Section 2, the WePS-3 testbed
is described in Section 3, the methodology to produce our gold standard is explained in Section 4, and Section 5 includes the evaluation metrics and the campaign design. We also provide an overview of the participating systems and the results of the evaluation in Section 6. Finally, we end with some concluding remarks in Section 7.
4 WePS-3 also included a new task focused on the ambiguity of organization names. Name ambiguity for organizations is a highly relevant problem faced by Online Reputation Management systems. For a full description of this task and the results of the evaluation please refer to [8].
2 Task Definition
Given a set of web search results obtained using a person name as query, the
proposed tasks are to cluster these search results according to the different people
sharing the name and to extract certain biographical attributes for each person
(i.e., for each cluster of documents). Groups were allowed to perform only the
clustering task, or both tasks together.
Compared to previous WePS campaigns, the clustering task is defined in the
same way, but the testbed is larger and more diverse (see Section 3). There is also a closer relation between the clustering and attribute extraction tasks. The
WePS-3 Attribute Extraction task is different from WePS-2 in that systems are
requested to relate each attribute to a person (cluster of documents) instead of
just listing the attributes obtained from each document. This is the reason why
participants in the AE task are required to participate in the Clustering task
too. Systems are expected to output one attribute of each type in each cluster of
documents (i.e. only one affiliation, only one occupation, etc. for each person).
All attributes listed in Table 1 were included in the attribute extraction task.5
Attribute class | Example of attribute value
Date of birth   | 4 February 1888
Birth place     | Brookline, Massachusetts
Other name      | JFK
Occupation      | Politician
Affiliation     | University of California, Los Angeles
Award           | Pulitzer Prize
School          | Stanford University
Major           | Mathematics
Degree          | Ph.D.
Mentor          | Tony Visconti
Nationality     | American
Relatives       | Jacqueline Bouvier
Phone           | +1 (111) 111-1111
FAX             | (111) 111-1111
Email           | xxx@yyy.com
Web site        | http://nlp.cs.nyu.edu
Table 1. Definition of 16 attributes of Person at WePS-2
5 Please refer to the WePS-3 Attribute Extraction Task Guidelines on the WePS website (http://nlp.uned.es/weps) for a detailed definition of each attribute.
Compared to the WePS-2 Attribute Extraction task there were two main modifications: (i) the WePS-2 training data had an attribute "education", which was separated into the three attributes "school", "degree" and "major" in the test data; WePS-3 uses school/degree/major as independent attributes. (ii) The annotated data in WePS-2 included "work" and "location", but these were not used in the WePS-2 evaluation and were not considered in WePS-3.
3 Data sets
3.1 Clustering Training Dataset
Participants used the WePS-1 and WePS-2 public clustering testbeds to develop their systems. These datasets consist of the top web search results for a number of different ambiguous person names, and contain human assessments of the correct way to group these documents according to the different people mentioned with the same name (see [3, 4] for a detailed explanation of the corpus creation and annotation guidelines). The output format remained as in WePS-2 (a "clustering" root element and "entity" elements for each cluster of documents), except for a slight change in the XML format: now the "doc" elements associated to a person are enclosed in a "documents" element, as in Figure 1.
Fig. 1. Sample output from the clustering task.
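As a rough illustration of this format, the sketch below uses Python's standard xml.etree.ElementTree to build an output with the elements just described ("clustering", "entity", "documents", "doc"); the "name", "id" and "rank" attributes and the sample values are assumptions made for the example, not part of the official guidelines.

# Illustrative sketch of the clustering output format described above. The
# element names ("clustering", "entity", "documents", "doc") follow the text;
# the "name", "id" and "rank" attributes are assumptions made for this example.
import xml.etree.ElementTree as ET

def build_clustering_output(person_name, clusters):
    """clusters: one list of document ranks per detected person."""
    root = ET.Element("clustering", {"name": person_name})
    for i, doc_ranks in enumerate(clusters):
        entity = ET.SubElement(root, "entity", {"id": str(i)})
        documents = ET.SubElement(entity, "documents")
        for rank in doc_ranks:
            ET.SubElement(documents, "doc", {"rank": str(rank)})
    return ET.tostring(root, encoding="unicode")

print(build_clustering_output("Tiffany Hopkins", [[1, 3, 7], [2, 4]]))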
3.2 Attribute Extraction Training Dataset
A training dataset was provided for the Attribute Extraction Task based on the
WePS-2 clustering and attribute extraction test datasets (see [15]). Given the
clustering gold standard and the attributes extracted for documents in the WePS-2 corpus, we generated a view of the extracted attributes grouped by cluster instead of by document. This provided the participants with the kind of output expected from their systems in WePS-3.
Both the clustering and attribute extraction outputs were provided in the same XML file (see Figure 2). In this file each cluster of documents is specified by the element entity, which contains the list of grouped documents and the list of extracted attributes. For each attribute it is required to indicate the type of the attribute (date of birth, occupation, etc.), the source document from which it was extracted (given by its position in the document ranking) and the value.
Fig. 2. Sample output from the combined clustering and attribute extraction tasks.
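The combined format can be sketched in the same way: each entity additionally carries an attributes list whose entries record the attribute type, the source document (its ranking position) and the value. The element and attribute names below are assumptions for illustration; the sample values follow the example in Figure 2.

# Illustrative sketch of the combined clustering + attribute extraction output.
# The text only specifies that each entity lists its documents and its attributes
# with a type, a source document (ranking position) and a value; the element and
# attribute names below are assumptions, the sample values follow Figure 2.
import xml.etree.ElementTree as ET

def build_combined_output(person_name, entities):
    root = ET.Element("clustering", {"name": person_name})
    for ent in entities:
        entity = ET.SubElement(root, "entity")
        documents = ET.SubElement(entity, "documents")
        for rank in ent["docs"]:
            ET.SubElement(documents, "doc", {"rank": str(rank)})
        attributes = ET.SubElement(entity, "attributes")
        for attr_type, source_rank, value in ent["attributes"]:
            attr = ET.SubElement(attributes, "attribute",
                                 {"type": attr_type, "source": str(source_rank)})
            attr.text = value
    return ET.tostring(root, encoding="unicode")

person = {"docs": [2, 5],
          "attributes": [("date_of_birth", 2, "4th August 1979"),
                         ("occupation", 5, "Painter")]}
print(build_combined_output("Tiffany Hopkins", [person]))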
3.3 Test Dataset
In WePS-3 we decided to substantially increase the amount of test data, both in the number of documents and in the number of person names. The same dataset was used for both the clustering and the attribute extraction tasks. A total of 300 person names were used, compared to the 30 names used in WePS-2. As in WePS-2, we obtained names randomly from the US Census, Wikipedia and computer science conference program committees. In addition, we included names for which at least one person has one of the following occupations: attorney, corporate executive or realtor. Fifty names were extracted from each of these sources to make a total of 300 names.
In order to obtain person names for which at least one person in the result set has a particular occupation, we designed a simple procedure. Given a small set of keywords related to the occupation we are interested in (e.g. "real estate" or "housing" for realtor), we launch a query to a web search engine and randomly show documents to an annotator until she finds one that refers to a person with the intended occupation. Then we formulate a search query with that person's name. If the reference document is present in the top 200 search results, we add this name and these documents to our dataset.
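A rough sketch of this selection procedure is given below; the search_top_200, annotator_confirms_occupation and extract_person_name callables are hypothetical placeholders for the web search API and the manual steps described above.

# Sketch of the occupation-based name selection procedure described above.
# search_top_200(), annotator_confirms_occupation() and extract_person_name()
# are hypothetical placeholders for the search API and the manual steps.
import random

def select_name_for_occupation(keywords, search_top_200,
                               annotator_confirms_occupation, extract_person_name):
    candidate_docs = search_top_200(" ".join(keywords))   # e.g. "real estate housing"
    random.shuffle(candidate_docs)
    for reference_doc in candidate_docs:
        # Show documents to the annotator until one mentions a person
        # with the intended occupation.
        if not annotator_confirms_occupation(reference_doc):
            continue
        # Query the search engine with that person's name and keep the name
        # only if the reference document appears in the top 200 results.
        name = extract_person_name(reference_doc)
        results = search_top_200(name)
        if reference_doc in results:
            return name, reference_doc, results
    return None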
For each name, the top 200 web search results from the Yahoo! API6 were downloaded and archived with their corresponding search metadata (search snippet, title, URL and position in the results ranking).
6 http://developer.yahoo.com/search/
4 Assessments
Systems are requested to make clusters as accurate as possible over the whole set
of documents. However, given the annotation load required to manually cluster
this amount of information, the evaluation was performed only on two people per
person name. This allowed us to simplify the annotation task from grouping a large set of documents into an unknown number of person clusters to a classification task where only two people are considered when examining each document in the results. Even for this simplified annotation task a large amount of human resources and time is required. To alleviate this problem we used the services of Mechanical Turk, distributing the task among many non-expert workers around the world (see Section 4.1).
Before handing the test data to the annotators, we had to select the two people ("person a" and "person b") that would be considered for each person name. In each case, we chose a document in the search results as a reference to classify other documents about that particular person. In general "person a" is related to the source from which the name was selected (e.g. a Wikipedia person when the source is Wikipedia, a realtor when the source is realtor names, etc.), while "person b" can be any other person in the search results.
To select a "person a" reference document, we randomly iterate through the search results until one of the following conditions is satisfied: (i) for Wikipedia names we select one of the Wikipedia articles within the search results for that name; (ii) for computer researchers we select a page that mentions the researcher;7 (iii) occupation-related names already have a reference document that we obtained as described in Section 3.3.
"Person b" can be anybody mentioned with the ambiguous name in the results who does not share the distinctive feature of the first person (not being a Wikipedia entity, or not having the attorney, executive or realtor occupation). This second person is also selected by randomly iterating through the results until the conditions are satisfied by a certain web page. In the case of the Census names the only requirement for the two selected people is that they be different, but no constraints are set regarding the characteristics of the person. Finally, for researcher names we found that most of them monopolize the search results and hence we did not extract a second person for these names. Still, a name disambiguation system has to be able to recognize that most of these documents belong to one individual, and so we kept these names in the dataset.
Once we have the reference documents for all the names in the collection we can proceed to the annotation process. Each worker receives a set of ten search results for a person name and two reference documents (one describing "person a" and the other describing "person b"). The task for the annotator is to classify each of the ten documents as referring to either one of the two selected people or to a generic "someone else".
In Table 2 we show the average number of pages assigned to each person for each name source. The main result to highlight in this table is that people in the conference source tend to monopolize the search results, followed by people that appear in Wikipedia articles. Note that the average number of pages is well below the total number of pages in the test dataset. The reason is that the table only considers pages for which at least three annotators out of five agreed on the assignment.
7 Note that computer scientist names were obtained from a list of conference program committee members, so we already know the researcher's identity.
Avg. number of pages classified as:
name source | person a | person b | someone else | total
attorney    | 5.34  | 5.66 | 44.64  | 55.64
realtor     | 6.36  | 4.96 | 119.56 | 130.88
executive   | 7.48  | 4.20 | 58.12  | 69.80
census      | 4.58  | 3.00 | 19.64  | 27.22
conference  | 28.94 | -    | 26.12  | 55.06
wikipedia   | 9.32  | 2.82 | 23.38  | 35.52
Table 2. Average number of pages classified as each person, by source of the name
For the evaluation of the Attribute Extraction task we did not rely on a previously generated gold standard. Instead, we pooled the output of the participating systems and submitted it for annotation on Mechanical Turk. We only evaluated the extraction of attributes for the same people annotated in the clustering task. For this reason, we only added to the pool attributes whose source is one of the documents annotated as mentioning "person a" or "person b" according to the clustering gold standard. The annotators were given one web page and a set of up to ten <attribute, value> pairs and were asked to decide whether each attribute fell into one of the following categories:
– Correct (this is a correct attribute that describes the person in the page).
– Incorrect for any reason other than being too long or too short. For instance: the type of attribute is incorrect (e.g. gardener is incorrectly identified as a date of birth); the attribute is not attached to this person (e.g. the attribute describes some other person described on the page); or the attribute was simply not found in the text describing the subject person.
– Correct, but too long or too short. The attribute is correct, but has one of the following problems:
  • Too short. The attribute is incomplete (e.g. "director" when it should say "director of marketing").
  • Too long. The attribute contains a correct value but includes irrelevant information (e.g. "CEO in 1982" when it should say "CEO").
– Impossible to tell because the web page is unreadable.
– The web page is readable, but the specified person is not on this page.
4.1 Mechanical Turk Methodology
For the annotation of the test data we used Mechanical Turk.8 Mechanical Turk (MTurk) offers a web portal to post tasks known as HITs (Human Intelligence Tasks). Workers (known as "Turkers") can choose among the available tasks and complete them for a small fee per task. The use of MTurk for NLP tasks has been studied before [16, 6] and has been found to be effective, but this evaluation forced us to focus both on the problem of detecting "spam" annotations by Turkers and on the "employee relations" issue of how best to motivate and interact with Turkers.
8 https://www.mturk.com/mturk/welcome
Spam annotations occur when a Turker attempts to rapidly do a large num-
ber of HITs without making a serious attempt at doing quality annotations or
by simply writing an automated script to do the HITs. In the following para-
graphs, we describe efforts we undertook to discourage spam annotations and to
encourage the highest quality workers.
The primary guard that MTurk provides to encourage high-quality annotations is the Turker's "HIT approval rate" (HAR): the number of the Turker's HITs that have been approved divided by the total number of approved and rejected HITs. Amazon's instructions on the web site recommend posting HITs that require Turkers to have a minimum HAR of 95%. For our initial annotations, we adopted this recommendation. This succeeded initially,
but we found some batches that showed obvious signs of spamming. Table 3
shows average inter-annotator agreement obtained in different portions of the
corpus according to the source of the name. For each of the annotated web pages,
agreement is measured as the percentage of annotators (five in each case) that
selected the most voted annotation. The table also shows the number of HITs
generated in each case (each HIT contains 10 pages related to an ambiguous
name) and the average number of seconds spent by the annotators working on
each HIT. Specifically, as can be seen in Table 3, the census batch had 14/4,325 HITs done by workers with an average inter-annotator agreement of 50% or higher, while the realtor batch, which had the highest rate of inter-annotator agreement and which took the longest time per HIT, had 1,386/4,695 such HITs.
source     | avg. agreement | avg. time per HIT (s) | # HITs
attorney   | 0.51 | 158.16 | 4120
census     | 0.48 |  54.46 | 4195
conference | 0.50 |  76.87 | 4265
executives | 0.55 | 163.04 | 4240
realtor    | 0.68 | 314.45 | 4695
wikipedia  | 0.48 | 131.33 | 4325
Table 3. Annotation statistics for the clustering task
Concerned that some of our batches might have suffered from spam, we instituted a number of changes in our MTurk methodology for the attribute annotation task.
1. We raised the minimum HAR to 97%.
2. We added an additional requirement that the Turker must have had at least
500 approved HITs before doing our HITs.
3. On each HIT, we added at least one attribute which we knew to be very likely false (an attribute drawn from a response for a different person, so it would only be true in the very unlikely case that the two people happened to share the same attribute by chance).
4. We monitored the accuracy of the Turkers on each batch of HITs (a sketch of this check is given after the list). Turkers who marked too many of the "trick" attributes as correct had their work "rejected" (not paid for and resubmitted to other Turkers to be redone) and were "blocked". Blocked workers are barred from ever performing work on the account that blocked them.
5. In addition, we instituted a bonus program whereby we paid cash bonuses to those workers who got the best score on the "trick" attributes. The workers with the best score received a 100% bonus. At the discretion of the manager of the Turk project (Dr. Borthwick), Turkers with a score close to the best were sometimes given 50% bonuses.
6. Finally, we established a dialog with the Turkers as described below.
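The decoy check described in items 3-5 can be summarised with the following sketch; the rejection threshold and the bonus rule shown here are illustrative values, not the exact settings used in the campaign.

# Sketch of the decoy ("trick") attribute check from items 3-5 above. Each HIT
# contains at least one attribute known to be almost certainly false; workers who
# accept too many decoys are rejected, the best scorers receive a bonus. The 0.5
# threshold and the bonus rule are illustrative, not the campaign's exact settings.
def score_workers(assignments, decoy_ids, reject_threshold=0.5):
    """assignments: {worker_id: {attribute_id: judged_correct (bool)}}"""
    fooled_rate = {}
    for worker, judgements in assignments.items():
        decoys = [a for a in judgements if a in decoy_ids]
        if decoys:
            # Fraction of decoy attributes the worker (wrongly) marked as correct.
            fooled_rate[worker] = sum(judgements[a] for a in decoys) / len(decoys)
    rejected = {w for w, f in fooled_rate.items() if f > reject_threshold}
    best = min(fooled_rate.values(), default=None)
    bonus = {w for w, f in fooled_rate.items() if f == best and w not in rejected}
    return rejected, bonus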
We established a dialog with the Turkers in two ways. First of all, we took
care to notify the Turkers that we were monitoring them and would reward good
workers with bonuses and punish evil-doers with rejection of their HITs and by
blocking them from doing future work. We put this notice on every HIT and we
also alluded to it on the tag line of the HIT, where we put "Bonus available!"
after the HIT description. Note that the rejection of an entire batch of HITs can
have severe consequences for a Turker as it can push his/her HAR below the
95% threshold required to get work.
The more interesting initiative, though, was the establishment of a two-way
dialog with the Turkers by posting on the Turker Nation bulletin board.9
We understand Turker Nation to be the most popular venue for this kind of
discussion [10]. As per the convention on Turker Nation, we used a single thread
as a point of discussion for all of the attribute extraction HITs. We posted notices
there every time a new batch of HITs was posted to Mechanical Turk and we
listed the IDs of those workers who received bonuses so as to communicate to
Turkers that we were following through on our bonus commitment. Furthermore,
we used this board to field queries about how best to judge the HITs. In all,
we made 39 posts to this thread and fielded 47 questions and comments from
Turkers. Anecdotally, we believe that this dialog had a strong positive effect
when taken in conjunction with our other initiatives. We could see from the
questions we got that at least some Turkers were taking this task very seriously.
One Turker, for instance, went to the trouble of collecting all of the Q and
A on the whole thread into one consolidated FAQ. He also commented: "Dr. Borthwick is by far the best requester I've ever worked with. Interesting HITs, fair pay + bonuses and good communication. Not sure what else I could ask for." Anecdotally, we noticed a strong correlation between workers who did a lot of HITs with high accuracy and workers who were frequent posters on Turker Nation.
9 http://turkers.proboards.com/index.cgi
Finally, a word on the financial model we used for this project. As can be seen from Table 4, we devoted about 20% of our budget to worker bonuses. We also strove to fulfill the philosophy of "equal pay for equal work" by dividing the HITs into batches according to how many attributes workers had to judge (ranging from 2 to 10, although we omit the "10 attribute" row from the table), decreasing the pay as the number of attributes to score decreased. Lastly, we strove to keep pay between $3 and $4 per hour, based on the advice of Amazon salesmen that this was the maximum hourly rate that yielded significant benefit on MTurk.
# attrs. to score | # HITs | pay per HIT ($) | pay for work ($) | bonus ($) | Amazon fee ($) | total cost ($) | effective hourly pay ($/h)
9     | 123  | 0.22 | 135.30 |  27.72 | 16.30  | 179.32  | 2.57
8     | 188  | 0.21 | 197.40 |  77.17 | 27.45  | 302.02  | 3.26
7     | 264  | 0.20 | 264.00 |  69.30 | 33.33  | 366.63  | 3.19
6     | 330  | 0.18 | 297.00 | 114.93 | 41.19  | 453.12  | 2.55
5     | 423  | 0.17 | 359.55 |  84.15 | 44.37  | 488.07  | 3.69
4     | 563  | 0.15 | 422.25 |  36.45 | 45.87  | 504.57  | 4.86
3     | 600  | 0.12 | 360.00 | 129.60 | 48.96  | 538.56  | 5.33
2     | 704  | 0.07 | 246.40 |  59.89 | 30.62  | 336.91  | 3.45
Total | 3195 |      | 2281.9 | 599.21 | 288.111| 3169.221|
Table 4. Payment statistics for the attribute extraction annotation HITs
5 Evaluation Methodology
For the evaluation of the clustering task we used the B-Cubed metrics [5]. These metrics were introduced in our task in WePS-2 and have been shown to be the only ones, among the different families of clustering metrics, that satisfy the intuitive formal constraints for this problem [2].
B-Cubed metrics independently compute the precision and recall associated with each item in the distribution. The precision of an item is the proportion of items in its cluster that belong to its category. Analogously, the recall of an item is the proportion of items from its category that appear in its cluster.
In WePS-2 an extended version of B-Cubed [2] was used to handle the problem of evaluating overlapping clusterings (a clustering task where an element can belong to more than one cluster; in our case, when a document mentions multiple people with the same ambiguous name). Due to the choices made in the design of the WePS-3 testbed, we excluded the possibility of an overlapping clustering (a document can only belong to one of the reference people or to someone else) and hence we used the original version of the metric.10
The harmonic mean (F-measure with α = 0.5) of B-Cubed Precision and B-Cubed Recall was used for the ranking of systems. For each query we evaluated the clustering of documents mentioning two different people.11
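As an informal illustration (not the official WePS scorer), B-Cubed precision and recall over a non-overlapping clustering, together with the α-weighted F combination used for ranking, can be computed as follows.

# Informal illustration (not the official WePS scorer) of B-Cubed precision,
# B-Cubed recall, and the alpha-weighted F combination used for ranking.
# `system` and `gold` map each evaluated item to a cluster id / category id.
def bcubed(system, gold):
    items = list(gold)
    precision = recall = 0.0
    for e in items:
        same_cluster = [o for o in items if system[o] == system[e]]
        same_category = [o for o in items if gold[o] == gold[e]]
        correct = sum(1 for o in same_cluster if gold[o] == gold[e])
        precision += correct / len(same_cluster)
        recall += correct / len(same_category)
    n = len(items)
    return precision / n, recall / n

def f_measure(p, r, alpha=0.5):
    # alpha = 0.5 gives the balanced harmonic mean of precision and recall.
    if p == 0.0 or r == 0.0:
        return 0.0
    return 1.0 / (alpha / p + (1.0 - alpha) / r)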
In the clustering annotation, a document with three or more votes (as explained in Section 4, each document was annotated by five Mechanical Turk workers) for person A, person B or "someone else" was considered a positive document for the corresponding class. The system output was evaluated by averaging the B-Cubed Precision and Recall values of each element classified as person A or person B, computed over the set of elements classified as person A, person B or "someone else". Note that B-Cubed allows us to evaluate the system's clustering solutions even though we do not have a full clustering assessment for each person name. The reason is that B-Cubed evaluates at the element level and, unlike the Purity/Inverse Purity metrics, does not require choosing a representative class for each cluster in the output.
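The following sketch illustrates how gold labels can be derived from the Mechanical Turk votes under this three-out-of-five rule and which elements enter the evaluation; function and label names are illustrative, not taken from the guidelines.

# Sketch of the three-out-of-five majority rule and of the elements that enter
# the evaluation; function and label names are illustrative.
from collections import Counter

def gold_labels(votes_per_doc, min_votes=3):
    """votes_per_doc: {doc_id: ["person_a", "person_b", "someone_else", ...]}"""
    labels = {}
    for doc, votes in votes_per_doc.items():
        if not votes:
            continue
        label, count = Counter(votes).most_common(1)[0]
        if count >= min_votes:          # keep only documents with enough agreement
            labels[doc] = label
    return labels

def evaluated_elements(labels):
    # Precision/recall are averaged over elements labelled person_a or person_b,
    # computed over everything labelled person_a, person_b or someone_else.
    return {d: l for d, l in labels.items() if l in ("person_a", "person_b")}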
For the attribute extraction task, participating systems were evaluated on the attributes they attached to the most representative cluster for each of the people annotated in the clustering gold standard. The cluster with the best recall of attributes for a person in the system output was considered its representative.12 For instance, if we are evaluating "person A" of the name "Tiffany Hopkins", first we rank the clusters in the system output by their attribute recall with respect to "person A", and then we evaluate precision and recall of the attributes in the best ranked cluster. The rationale for using attribute recall as the selection criterion is the following: a user confronted with the system output is likely to choose the cluster that exposes the most attributes identifying the person.
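An informal sketch of this selection and scoring step (not the official scorer): the representative cluster is the one whose attributes yield the best recall against the person's pooled gold attributes, and precision and recall are then computed on that cluster's attributes.

# Informal sketch of the AE evaluation step described above (not the official
# scorer): choose the system cluster with the best attribute recall for the
# person, then score precision and recall on that cluster's attributes.
def evaluate_person(system_clusters, gold_attributes):
    """system_clusters: list of sets of (attribute_type, value) pairs, one per cluster.
    gold_attributes: set of pooled, human-validated pairs for the person."""
    def recall(attrs):
        if not gold_attributes:
            return 0.0
        return len(attrs & gold_attributes) / len(gold_attributes)

    representative = max(system_clusters, key=recall, default=set())
    rec = recall(representative)
    prec = (len(representative & gold_attributes) / len(representative)
            if representative else 0.0)
    return prec, rec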
Since the attribute extraction evaluation was based on pooling the system outputs, recall is not guaranteed to be fully representative: there might be correct attribute values that were not detected by any system and therefore never annotated.
6 Participation and Evaluation Results
The WePS-3 organization was contacted by 34 teams expressing their interest in
the clustering task. Out of these, 8 teams submitted a total of 27 different runs.
Two baseline systems were included in the evaluation: “all-in-one” which places
all documents in a single cluster, and “one-in-one” which places each document
in a separate cluster.
10 Note that results for extended B-Cubed and regular B-Cubed are identical on a non-overlapping clustering.
11 With the exception of the 50 names from computer science conferences, for which only documents about one person were considered.
12 We also considered the cluster F-measure as a criterion for choosing the representative cluster. We found that this method often missed the cluster with the most relevant attributes, resulting in extremely low evaluation results.
Many systems (YHBJ, AXIS, TALP, WOLVES) [12, 7, 9, 11] include Hierar-
chical Agglomerative Clustering (HAC) as part of their system pipeline. DAEDALUS [14] intentionally departs from the use of HAC and experiments with the k-medoids clustering method. In TALP [9] three clustering methods (Lingo, HAC, and 2-step HAC) were compared using basic features extracted from the web pages.
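As a generic illustration of this family of approaches, and not the pipeline of any particular participant, the sketch below clusters pages by agglomerative clustering over TF-IDF vectors, using a cosine-distance threshold as the stopping criterion; it assumes scikit-learn and SciPy are available.

# Generic HAC illustration (not any participant's actual system): TF-IDF document
# vectors, average-link agglomerative clustering, and a cosine-distance threshold
# as the cluster stopping criterion. Assumes scikit-learn and SciPy.
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_pages(page_texts, distance_threshold=0.8):
    vectors = TfidfVectorizer(stop_words="english").fit_transform(page_texts)
    distances = pdist(vectors.toarray(), metric="cosine")
    dendrogram = linkage(distances, method="average")
    # Cut the dendrogram at the given cosine-distance threshold; the threshold
    # implicitly decides how many clusters (people) are returned.
    return fcluster(dendrogram, t=distance_threshold, criterion="distance")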
WOLVES [11] trained a pairwise model to predict the likelihood that two
documents refer to the same person. A variety of document features were used
(words, named entities, Wikipedia topics, person attributes) along with different
pairwise features that measure the similarity between documents (cosine, over-
lap, Jaccard index, etc). Then a clustering algorithm used these predictions to
group the documents. The clustering methods used include HAC and Markov
Clustering.
YHBJ [12] concentrates on the document representation and feature weight-
ing. It uses Wikipedia entries to extend a feature set based on bag-of-words and
named entities. The assignment of weights to the different features goes beyond
the widely used TF-IDF weighting, considering the relevance of each feature to the name query and how representative it is of the main text of the page.
AXIS [7] analyzed patterns of Web graph structure as part of a two-stage
clustering algorithm that also incorporates content-based features. The detection
of related web pages is used to overcome the lack of information about Web graph
structure.
RGAI [13] represented every document as a vector of extracted person attribute values and proceeded to apply a clustering algorithm (their experiments include bottom-up clustering and the X-means algorithm).
Table 5 presents the results of the 8 participants and the 2 baseline clustering systems. B-Cubed Precision, Recall and F-measure values are macro-averaged over person names.13 In the cases where a team submitted multiple runs, we have chosen the run with the best score as the team representative in the ranking. The table of results shows that:
– The best scoring system obtains balanced results in both precision and recall, while the rest of the participants have scores biased towards one or the other metric. Note that the macro-averaged F-measure scores are lower than the F-measure that would be obtained directly from the macro-averaged Precision and Recall values. This indicates that, even when Precision or Recall reaches a high average value, it is usually at the cost of a low score in the other metric. The Unanimous Improvement Ratio results14 confirmed that only the top two systems in the ranking achieve robust improvements (independent of the weighting of Precision and Recall). According to UIR, YHBJ 2 shows a robust improvement over RGAI AE 1, BYU and TALP 5, and AXIS 2 shows a robust improvement over BYU.
– As in the previous WePS campaigns, the correct selection of a cluster stopping criterion is a key factor in the performance of systems. The imbalance between Precision and Recall highlighted in the previous point shows how this affects the performance of the clustering systems in WePS.
– Unlike in previous WePS campaigns, almost all the systems obtained scores above the baselines. It is likely that the one-in-one baseline obtains lower scores because we are only considering two people for each name and these two people are generally well represented in the dataset. This procedure excludes many people with only one document on the Web, a situation which usually rewards the one-in-one approach.
13 Note that these tables reflect the scores obtained taking into account only two people for each person name. For this reason they should not be directly compared to previous WePS campaigns.
14 The Unanimous Improvement Ratio (UIR) checks, for each system pair, to what extent the improvement is robust across potential metric weighting schemes (see [1]). This measure was also employed in the WePS-2 campaign [4].
Table 6 presents the results for the Attribute Extraction task. In this task Intelius (http://www.intelius.com) provided a baseline system that was evaluated along with the participants. Both RGAI [13] and WOLVES [11] relied on rule-based approaches, tailoring different heuristics to each attribute type.
Macro-averaged scores
rank | run                 | F-measure (α = 0.5) | B-Cubed Pre. | B-Cubed Rec.
1    | YHBJ 2 unofficial   | 0.55 | 0.61 | 0.60
2    | AXIS 2              | 0.50 | 0.69 | 0.46
3    | TALP 5              | 0.44 | 0.40 | 0.66
4    | RGAI AE 1           | 0.40 | 0.38 | 0.61
5    | WOLVES 1            | 0.40 | 0.31 | 0.80
6    | DAEDALUS 3          | 0.39 | 0.29 | 0.84
7    | BYU                 | 0.38 | 0.52 | 0.39
     | one-in-one baseline | 0.35 | 1.00 | 0.23
8    | HITSGS              | 0.35 | 0.26 | 0.81
     | all-in-one baseline | 0.32 | 0.22 | 1.00
Table 5. Clustering results: official team ranking

Macro-averaged scores
run                    | F-measure (α = 0.5) | Pre. | Rec.
RGAI AE 3              | 0.18 | 0.22 | 0.24
RGAI AE 1              | 0.15 | 0.18 | 0.19
Intelius AE unofficial | 0.13 | 0.16 | 0.17
RGAI AE 2              | 0.12 | 0.16 | 0.15
RGAI AE 4              | 0.12 | 0.15 | 0.16
RGAI AE 5              | 0.12 | 0.15 | 0.15
BYU                    | 0.10 | 0.11 | 0.14
WOLVES AE 1            | 0.10 | 0.18 | 0.09
WOLVES AE 2            | 0.06 | 0.08 | 0.07
Table 6. Attribute Extraction results: official team ranking
7 Conclusions
The WePS-3 campaign has continued the research effort on people search by offering a larger testbed, integrating the clustering and attribute extraction tasks and including the participation of experts from industry. The evaluation featured the use of Mechanical Turk to obtain a large amount of annotated data, and in this process we have learnt about the opportunities and dangers of such a powerful tool. Participant teams in the campaign have further expanded the variety of approaches to the people search problem by including external
sources of knowledge (Wikipedia), applying new clustering methods and new feature weighting schemes to the task.
8 Acknowledgments
This work has been partially supported by the Spanish Ministry of Science and
Innovation within the project QEAVis-Catiex (TIN2007-67581-C02-01).
We would like to thank Intelius for sharing their expertise in the field of people search and for their collaboration on the task design and the gold standard creation. Their economic support made such a large-scale annotation effort possible.
References
1. E. Amigó, J. Gonzalo, and J. Artiles. Combining evaluation metrics via the unanimous improvement ratio and its application in the WePS clustering task. In 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference, 2009.
2. E. Amigó, J. Gonzalo, J. Artiles, and F. Verdejo. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval, 2008.
3. J. Artiles, J. Gonzalo, and S. Sekine. The SemEval-2007 WePS evaluation: Establishing a benchmark for the Web People Search task. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007). ACL, 2007.
4. J. Artiles, J. Gonzalo, and S. Sekine. WePS 2 evaluation campaign: Overview of the web people search clustering task. In 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference, 2009.
5. A. Bagga and B. Baldwin. Entity-based cross-document coreferencing using the vector space model. In Proceedings of the 17th International Conference on Computational Linguistics. ACL, 1998.
6. C. Callison-Burch. Fast, cheap, and creative: Evaluating translation quality using Amazon's Mechanical Turk. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 286–295. Association for Computational Linguistics, 2009.
7. E. Smirnova, K. Avrachenkov, and B. Trousse. Using web graph structure for person name disambiguation. In Third Web People Search Evaluation Forum (WePS-3), CLEF 2010, 2010.
8. E. Amigó, J. Artiles, J. Gonzalo, D. Spina, and B. Liu. WePS-3 evaluation campaign: Overview of the on-line reputation management task. In Third Web People Search Evaluation Forum (WePS-3), CLEF 2010, 2010.
9. D. Ferrés and H. Rodríguez. TALP at WePS-3 2010. In Third Web People Search Evaluation Forum (WePS-3), CLEF 2010, 2010.
10. J. Hoskins. Personal communication (Amazon sales representative), 2010.
11. I. Dornescu, C. Orasan, and T. Lesnikova. Cross-document coreference for WePS. In Third Web People Search Evaluation Forum (WePS-3), CLEF 2010, 2010.
12. C. Long and L. Shi. Web person name disambiguation by relevance weighting of extended feature sets. In Third Web People Search Evaluation Forum (WePS-3), CLEF 2010, 2010.
13. I. T. Nagy and R. Farkas. Person attribute extraction from the textual parts of web pages. In Third Web People Search Evaluation Forum (WePS-3), CLEF 2010, 2010.
14. S. Lana-Serrano, J. Villena-Román, and J.-C. González-Cristóbal. DAEDALUS at WePS-3 2010: k-medoids clustering using a cost function minimization. In Third Web People Search Evaluation Forum (WePS-3), CLEF 2010, 2010.
15. S. Sekine and J. Artiles. WePS2 attribute extraction task. In 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference, 2009.
16. R. Snow, B. O'Connor, D. Jurafsky, and A. Ng. Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 254–263. Association for Computational Linguistics, 2008.