DAEDALUS at WebPS-3 2010:
          k-Medoids Clustering using a Cost Function
                       Minimization

    Sara Lana-Serrano1,3, Julio Villena-Román2,3, José Carlos González-Cristóbal1,3
                             1
                          Universidad Politécnica de Madrid
                                 2
                          Universidad Carlos III de Madrid
                   3
                     DAEDALUS - Data, Decisions and Language, S.A.
                slana@diatel.upm.es, jvillena@it.uc3m.es,
                        josecarlos.gonzalez@upm.es


       Abstract. This paper describes the participation of DAEDALUS team at the
       WebPS-3 Task 1, regarding Web People Search. The focus of our research is to
       evaluate and compare the computational requirements and results achieved by
       different solutions based on the minimization of cost functions applied to
       clustering algorithms. Our clustering technique is based on an implementation
       of k-Medoids algorithm, run over a sparse term-document matrix built with the
       terms of the pages that are associated to each of the person names. We define an
       empty-cluster that holds all the individuals that are not part of any other cluster.
       Based on the results obtained, we can conclude that although clustering
       techniques play a very relevant role in the resolution of the problem of name
       homonymy in a set of web pages, there is a previous challenge still to solve:
       how to determine which contents are relevant for describing the person in that
       webpage, thus which are not part of the other navigational information
       contained in the webpage.

       Keywords: Web People Search, Word Sense Disambiguation, Cross document
       coreference resolution, text clustering, k-medoids.


1      Introduction

The Web People Search (WePS) [1] [2], one of the tracks in CLEF 2010, is divided
into two tasks: Task 1 is related to Web People Search (WPS) and focuses on person
name ambiguity and person attribute extraction on Web pages, and Task 2 is related
to Online Reputation Management (ORM) for organizations and focuses on the
problem of ambiguity for organization names and the relevance of Web data for
reputation management purposes. In this paper, we will focus on Task 1, which
addresses the problem of name homonymy. The basic goal of this task is to cluster a
set of web pages, which are the result of a Web search for a person name, in as many
groups as entities sharing that name.
2      Sara Lana-Serrano, Julio Villena-Román, José Carlos González-Cristóbal


   Our research group is led by and named after DAEDALUS, a small private
company in the field of Information and Telecommunication Technologies and a
leading provider of language-based solutions in Spain, and research groups of two
universities, Universidad Politécnica de Madrid and Universidad Carlos III de
Madrid. We have taken part in CLEF since 2003 in many different tracks and tasks, as
part of the MIRACLE team till last year. This paper describes our participation at the
WebPS Web People Search (Task 1).
   Traditionally, the solutions to solve the problem of name disambiguation have been
typically based on techniques such as Hierarchical Agglomerative Clustering (HAC)
or Incremental Vector Space clustering algorithm, or some of their variants. The idea
behind the set of experiments that we have carried out this year is to evaluate the
computational cost and the precision values that are achieved when another type of
solution based on the minimization of cost function is applied. Specifically, our
research has focused on the clustering algorithm based on k-Medoids [3].


2      System Description

Our system is modular and built from of a set of small components that are easily
combined in different configurations and executed sequentially to build the final
result set. A common baseline algorithm was used in all experiments to process the
data collection, following these steps:
1. Text Extraction: Ad-hoc scripts are run on web pages to extract the relevant
   information.
2. Tokenization: This process extracts basic textual components. Some basic entities
   are also detected, such as numbers, initials, abbreviations, and years. So far,
   compounds, proper nouns, acronyms or other types of entity are not specifically
   considered. The outcomes of this process are single words, multi-words and years
   in numbers.
3. Conversion to lowercase: All document terms are normalized by changing all
   letters to lowercase.
4. Filtering: All words recognized as stopwords are filtered out. Stopwords in the
   target languages were initially obtained from the University of Neuchatel’s
   resources page [4] and afterwards extended using our own developed resources.
5. Stemming: This process is applied to each one of the words to be indexed or used
   for retrieval. Standard Porter stemmers [5] for each considered language have been
   used.
6. Sparse matrix generation: For each corpus associated to each person name, a
   sparse term-document matrix is built. This matrix stores the set of terms that are
   contained in each corpus, considering all documents corresponding to the same
   person name as a whole.
7. Clustering: This process generates the final results by applying a clustering
   algorithm over the information stored in the sparse matrix. A set of configuration
   parameters, depending on the algorithm itself, allow to define different experiment
   settings.
                                                    DAEDALUS at WebPS-3 2010          3


3      Experiments and Results

As mentioned before, the objective of our experiments was to make an exhaustive
evaluation and comparison of the computational costs and results achieved when
solutions based on the minimization of cost functions are used as a basic clustering
algorithm. For our experiments, we have specifically focused on an implementation of
k-Medoids clustering algorithm [3], due to time constraints and lack of resources,
although we plan to run the same experiments with other clustering algorithms such
as Fuzzy C-Means.
   In our settings, we use k as the result of the minimization of the cost function (see
Equation 1).

   ܿ‫ ݎ݁ݐݏݑ݈ܥݎ݁ݐ݊ܫܿ ݎ݁ݐݏݑ݈ܥܽݎݐ݊ܫ‬#ܷ݊ܽ‫ݎ݁ݐݏݑ݈ܥݕݎ‬
                   +                   ∗                + ߙ ∗ ݇ ∗ ܴ ∗ ln ܴ           (1)
           ݇                 ݇                  ݇
where:
         k: number of clusters (2 – R/8).
         cIntraCluster: summation of the intra-cluster distances.
         cInterCluster: summation of the inter-cluster distances.
         #UnaryCluster: number of clusters with only one element (the
medoid).
         R: size of the simple space once removed the vectors without
significant components.
         α: dispersion coefficient (0.1 - 0.9).

   This algorithm is continuously run over the sparse term-document matrix that was
built with the terms of the pages that are associated to each person name. The
maximum number of epochs has been set to 10. The algorithm was designed allowing
no cluster overlapping, i.e., the membership function returns just one value for each
individual, or, in other words, each individual belongs to one cluster maximum.
   Different experiments have been defined, using different combinations of the
textual information present in different components of the web pages with different
values of the dispersion coefficient in the cost function to minimize in Equation 1.
After studying some strategies, we finally decided to define a cluster (empty-cluster)
that groups all those individuals that have no significant term after the linguistic
processing. Finally we submitted 4 experiments to be evaluated, listed in Table 1.

                        Table 1. Description of the experiment set.

                                                                   Dispersion
       Run Identifier          Component in page
                                                                              α)
                                                                 coefficient (α
       daedalus_1          Body                                        0.3
       daedalus_2          Metadata + Title                            0.3
       daedalus_3          Metadata + Title + Body                     0.3
       daedalus_4          Metadata + Title + Body                     0.6
4        Sara Lana-Serrano, Julio Villena-Román, José Carlos González-Cristóbal


   Table 2 summarizes the main features of those experiments. The Pages column
shows the average number of pages associated to each person name; k is the average
number of cluster per person name; UnaryC is the average number of clusters that
have only just one member (thus the medoid); EmptyCPop is the average number of
individuals belonging to the empty-cluster; Population is the average value of the
population considering only those clusters whose population is greater than 1; and
finally maxPop is the average value of the population associated to the top populated
cluster for each person name.
   All submitted experiments show a cluster that tends to accumulate a high
percentage of individuals (87% on average) and also a high proportion of clusters
having just one element (61,83% on average).

                              Table 2. Evaluation of run features.

Run Identifier       Pages      k        UnaryC      EmptyCPop       Population   maxPop
daedalus_1           191.19     13.59     8.48            20.51         40.65     144.60
daedalus_2           191.19     12.62      7.34           21.42         39.34     147.00
daedalus_3           191.19     14.79     10.18            2.02         47.25     168.81
daedalus_4           191.19     14.82     10.20            2.02         45.72     169.07

   Table 3 shows the results of applying the Unanimous Improvement Ratio (UIR) of
A (rows) over B (columns), assuming as scores F-measure for α equal to 0.5. As
shown in the table, the daedalus_3 experiment exhibits a higher robustness to
variations of the α parameter than the rest of the submitted experiments.

                       Table 3. Unanimous Improvement Ratio (UIR).

Run Id     all_in_one daedalus_1 daedalus_2 daedalus_3 daedalus_4 one_in_one
all_in_one     0.00      -0.14      -0.12      -0.24      -0.23      -0.01
daedalus_1     0.14       0.00       0.03      -0.10      -0.16      -0.03
daedalus_2     0.12      -0.03       0.00      -0.12      -0.14      -0.03
daedalus_3     0.24       0.10       0.12       0.00       0.01      -0.02
daedalus_4     0.23       0.16       0.14      -0.01       0.00      -0.02
one_in_one     0.01       0.03       0.03       0.02       0.02       0.00

   Table 4 shows the evaluation results using B-Cubed measures. As inferred from
that metric, the achieved results present, in general, a behaviour closer to
all_in_one_baseline than to one_in_one_baseline. This result is consistent and agrees
with the figures shown in Table 2. This may be explained that, although some
clusters have been identified, the main set of individuals are assigned to a medoid and
the rest of the individuals (a few of them) are distributed along clusters whose
population is only one or two individuals.
                                                   DAEDALUS at WebPS-3 2010            5


                  Table 4. Results of experiments using B-Cubed Metrics.

Run Identifier     Avg. BCubed Precision          Avg. BCubed Recall       Avg. F-measure
all_in_one                 0,22                          1,00                    0,32
daedalus_1                 0,30                          0,71                    0,37
daedalus_2                 0,30                          0,72                    0,38
daedalus_3                 0,29                          0,84                    0,39
daedalus_4                 0,28                          0,85                    0,38
one_in_one                 1,00                          0,23                    0,35


4      Conclusions and Future Work

Based on the results obtained, we can conclude that although clustering techniques
play a very relevant role in the resolution of the problem of name homonymy in a set
of web pages, there is a previous challenge still to solve before studying whether it is
better to generate one cluster with all the individuals that have nothing in common
with the others, rather than creating one cluster for each of them.
    This challenge has a greater impact in the success of the experiments, and it is
related to the extraction of relevant information: given a web page whose structure is
unknown, it is essential to determine which contents in this page are useful for the
resolution of the problem (i.e., are relevant for describing the person to whom the
page is assumed to refer) and which contents should be filtered out (such as
navigational information, titles, headers and footers, disclaimers, etc.). And moreover,
regarding the first ones, it is necessary to determine if those contents refer exactly to
the expected person or else is referring to other different entities or concepts (such as
lists of similar authors, for instance). In our humble opinion, these should be the
challenges on which we should focus in future editions of WebPS.


Acknowledgements

This work has been partially supported by the Spanish Center for Industry
Technological Development (CDTI, Ministry of Industry, Tourism and Trade),
through the CONTENIDOS A LA CARTA Project, INGENIO 2010 Programme,
AVANZA I+D 2008. Other partners in the project are Agencia EFE, Germinus XXI,
11870.com and Universidad Politécnica de Madrid.


References

1. Overview of the WebPS 3 task at ImageCLEF 2010. Working Notes of CLEF
   2010. Padova. Italy. 2010.
6      Sara Lana-Serrano, Julio Villena-Román, José Carlos González-Cristóbal


2. Artiles, J.; Gonzalo, J. and Sekine, S. WePS 2 Evaluation Campaign: overview of
   the Web People Search Clustering Task. 2nd Web People Search Evaluation
   Workshop (WePS 2009), 18th WWW Conference. 2009.
3. Park. Hae-sang; Lee. Jong-seok; Jun. Chi-hyuck. A K-means-like Algorithm for K-
   medoids Clustering and Its Performance. Proceedings of the 36th CIE Conference
   on Computers & Industrial Engineering. pp.1222-1231. Taipei. Taiwan. Jun. 20-23
   (2006).
4. University of Neuchatel. IR Multilingual Resources at UniNE.
   http://members.unine.ch/jacques.savoy/clef/index.html
5. Porter.       M.       Snowball       stemmers     and      resources     page.
    http://www.snowball.tartarus.org