UNED PanMorCresp Team at M-WePNaD

               Pablo Panero, Manuel Moreno (1), Tomás Crespo (2),
               Jorge Carrillo-de-Albornoz (3), and Enrique Amigó (3)

               (1) Junta de Andalucía, mmorenomaldonado@gmail.com
               (2) Salenda Software Factory, t.crespo.g@outlook.com
               (3) NLP&IR Group, UNED, {jcalbornoz,enrique}@lsi.uned.es


      Abstract. This paper describes the participation of the PanMorCresp
      team in the Multilingual Web Person Name Disambiguation (M-WePNaD)
      task of IberEval 2017. The submitted solutions are variants of the
      traditional hierarchical agglomerative clustering (HAC) algorithm. The
      four approaches were defined and implemented independently by three
      Master's students on top of the same vocabulary generation software.
      The purpose is to analyse to what extent the HAC design and
      implementation affect the effectiveness of clustering. A simplistic
      approach based on HAC that considers only word occurrence achieves
      relatively good results compared with the rest of the systems presented
      in the campaign.


1   Introduction

This work presents a person name disambiguation system submitted to the
Multilingual Web Person Name Disambiguation task of IberEval 2017. In this task,
participants receive a set of training web pages and an associated gold standard
that groups these web pages according to the different individuals they refer to.
The goal is to group a new set of web pages, the test set, for which no
information about the correct grouping is provided.
    People frequently search the Web for information about other people, using a
person name as the query. Because different people share the same name, the
results returned by a search engine can contain web pages related to several
persons, not only the individual being searched for. This makes the task
interesting, especially given the multilingual nature of the Web. Despite this,
previous campaigns dedicated to this task focused only on corpora with web pages
in a single language, for instance the WePS campaigns [4], [3] and [2] in
English, and a Chinese campaign [5]. The objective of the M-WePNaD task is to
provide an opportunity to develop person name disambiguation systems, with the
additional challenge that the results for a query, as well as the pages about
each individual, can be written in multiple languages.
    In this work, using the same vector generation software, three Master's
students independently implemented hierarchical agglomerative clustering (HAC)
methods, taking different decisions about the vocabulary size, linkage and stop
criterion. The purpose is to check to what extent the implementation of
Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017)


           the HAC algorithm and the related decisions can affect the effectiveness of
           the approaches.
               The rest of the paper is organised as follows: our proposed methods to
           disambiguate person names are described in Section 2. Section 3 presents the
           results obtained by our proposals, and their analysis and discussion can be
           found in Section 4. Finally, Section 5 presents the conclusions.


           2     Methods

           In a first step, we transform each document into a vector of values that is
           used as input for the hierarchical agglomerative clustering algorithm. To this
           end, each document is divided into tokens by simply splitting the text on
           blank characters. Each token is then lowercased in order to avoid ambiguity
           and to decrease the number of words in the dictionary used for the vector
           representation. The vocabulary is generated independently for each person
           name (query). Finally, all words with frequency one in the corpus for the
           corresponding query are removed. Due to computational constraints, for each
           person name only the n most frequent words in the resulting dictionary are
           selected. The experiments include variants for several values of n. In all
           runs, we use the presence of words as projection function.
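
The vectorisation steps described above can be sketched as follows. This is a minimal illustration rather than the software used by the team: the function names, the toy documents and the alphabetical tie-breaking between equally frequent words are assumptions, and "frequency" is read here as the total number of occurrences in the documents of one query.

```python
from collections import Counter

def build_vocabulary(documents, n):
    """Return the n most frequent lowercase tokens, dropping tokens seen only once."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())  # split on whitespace only
    frequent = [(tok, c) for tok, c in counts.items() if c > 1]
    # Descending count, then alphabetical order (assumed tie-breaking rule).
    frequent.sort(key=lambda item: (-item[1], item[0]))
    return [tok for tok, _ in frequent[:n]]

def presence_vector(text, vocabulary):
    """Binary vector: 1 if the vocabulary word occurs in the document, else 0."""
    tokens = set(text.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]

# Hypothetical documents retrieved for a single person name (query).
docs = [
    "John Smith is a painter",
    "The painter John Smith",
    "John Smith plays football",
]
vocab = build_vocabulary(docs, n=3)
vectors = [presence_vector(d, vocab) for d in docs]
```

In the submitted runs n is 4000 or 7500; the value 3 is used here only to keep the toy example readable.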
               The four approaches were defined and implemented independently by
           three Master's students on top of the same vocabulary generation software.
           The approaches evaluated in the competition are:

             – PanMorCresp Team - run 1: The vocabulary contains the 4000 most
               frequent terms. The feature projection is word occurrence. The HAC
               algorithm uses complete linkage (the maximum distance between items
               from both clusters) and cosine as the similarity criterion. As stop
               criterion, it uses a similarity threshold that is adapted to each
               clustering case: the average similarity between documents divided by a
               parameter n (unrelated to the vocabulary size n above). Several values
               of n were checked over the training corpus, and we finally set n to two.
             – PanMorCresp Team - run 2: The vocabulary generation criterion is the
               same as in the previous approach. In this case, the linkage is the
               average similarity between documents in both clusters. The similarity
               threshold was tuned over the training corpus.
             – PanMorCresp Team - run 3: Like the previous approaches, it uses
               cosine similarity. For this run, we removed stopwords and punctuation
               marks. The vocabulary contains the 7500 most frequent terms in the
               collection. It uses single linkage and the stop criterion is a
               similarity threshold (0.65), which was tuned over the training corpus.
             – PanMorCresp Team - run 4: This approach is analogous to the previous
               one, but uses 9 clusters as stop criterion.
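
The four configurations above differ only in linkage and stop criterion, which the following sketch makes explicit. It is a naive pure-Python illustration under stated assumptions, not any of the students' implementations: the function names, the toy vectors and the tie handling are hypothetical, and the adaptive threshold follows our reading of run 1 (average pairwise similarity divided by n).

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_similarity(c1, c2, sim, linkage):
    """Similarity between two clusters given as lists of document indices."""
    values = [sim[i][j] for i in c1 for j in c2]
    if linkage == "single":      # most similar pair of documents
        return max(values)
    if linkage == "complete":    # least similar pair (maximum distance)
        return min(values)
    return sum(values) / len(values)  # "average" linkage

def hac(vectors, linkage, threshold=None, n_clusters=None):
    """Merge the most similar pair of clusters until a stop criterion is met."""
    sim = [[cosine(u, v) for v in vectors] for u in vectors]
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        if n_clusters is not None and len(clusters) <= n_clusters:
            break  # stop criterion: fixed number of clusters (run 4 style)
        best, a, b = max(
            (cluster_similarity(clusters[x], clusters[y], sim, linkage), x, y)
            for x, y in combinations(range(len(clusters)), 2)
        )
        if threshold is not None and best < threshold:
            break  # stop criterion: similarity threshold (runs 1 to 3 style)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # a < b, so index a is still valid
    return clusters

def adaptive_threshold(vectors, n=2):
    """Average pairwise cosine similarity divided by n (our reading of run 1)."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs) / n

# Hypothetical presence vectors for four documents.
toy = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
by_threshold = hac(toy, "single", threshold=0.65)
by_count = hac(toy, "single", n_clusters=3)
by_adaptive = hac(toy, "complete", threshold=adaptive_threshold(toy))
```

On this toy input the threshold of 0.65 yields two clusters, whereas asking for three clusters stops one merge earlier, which shows how the stop criterion alone changes the output for identical vectors and linkage.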




           3     Results

           We submitted four runs; the results are shown in Tables 1 and 2. We report
           the following metrics: Reliability (R), Sensitivity (S) and their harmonic mean
           F0.5(R, S) [1]. The final evaluation score is the average of F0.5(R, S)
           over all the person names evaluated.
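
As a small worked example of the score described above, the sketch below computes the weighted harmonic mean of R and S and averages it over person names. The per-name values are hypothetical and serve only to show that the average of per-name F values generally differs from the F of the averaged R and S, so the F column of a results table need not equal the harmonic mean of its R and S columns (assuming those columns are also per-name averages).

```python
def f_measure(r, s, alpha=0.5):
    """Weighted harmonic mean of R and S; alpha=0.5 gives the plain harmonic mean."""
    if r == 0 or s == 0:
        return 0.0
    return 1.0 / (alpha / r + (1 - alpha) / s)

# Hypothetical per-name (R, S) pairs, for illustration only.
per_name = {"name_a": (1.0, 0.2), "name_b": (0.6, 0.8)}

# Score as described in the text: average of the per-name F values.
macro_f = sum(f_measure(r, s) for r, s in per_name.values()) / len(per_name)

# F computed from the averaged R and S, shown for comparison.
mean_r = sum(r for r, _ in per_name.values()) / len(per_name)
mean_s = sum(s for _, s in per_name.values()) / len(per_name)
f_of_means = f_measure(mean_r, mean_s)
```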
               Table 1 shows the results achieved by our methods when only related web
           pages are considered in the evaluation, and Table 2 shows the results when all
           web pages are considered. In addition, both tables show the results of the two
           baselines provided by the organizers: One-in-one, where every web page is
           assigned to a different cluster, and All-in-one, where all web pages are
           assigned to a single cluster.
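
The two baselines can be sketched as below. To show why they sit at opposite extremes, the sketch scores them with BCubed precision and recall, under the assumption that for non-overlapping clusterings these behave like Reliability and Sensitivity; the official evaluation software may differ, and the gold standard shown is hypothetical.

```python
def one_in_one(doc_ids):
    """Baseline: every web page in its own cluster."""
    return {d: i for i, d in enumerate(doc_ids)}

def all_in_one(doc_ids):
    """Baseline: all web pages in a single cluster."""
    return {d: 0 for d in doc_ids}

def bcubed(system, gold):
    """BCubed precision and recall for non-overlapping clusterings."""
    docs = list(gold)
    precision = recall = 0.0
    for d in docs:
        same_sys = {e for e in docs if system[e] == system[d]}
        same_gold = {e for e in docs if gold[e] == gold[d]}
        correct = len(same_sys & same_gold)
        precision += correct / len(same_sys)
        recall += correct / len(same_gold)
    return precision / len(docs), recall / len(docs)

# Hypothetical gold standard: three pages about person A, one about person B.
gold = {"d1": "A", "d2": "A", "d3": "A", "d4": "B"}
p_one, r_one = bcubed(one_in_one(gold), gold)
p_all, r_all = bcubed(all_in_one(gold), gold)
```

One-in-one never groups pages of different people, so its precision is perfect while its recall is low; All-in-one shows the opposite pattern, which matches the shape of the baseline rows in the tables.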


           Table 1. Results for the clustering task considering only related web pages. The run
           name is the name in official evaluation results.

                               Run                           R      S      F0.5(R, S)
                               ALL-IN-ONE                    0.47   0.99   0.54
                               PanMorCresp Team - run 1      0.80   0.51   0.43
                               PanMorCresp Team - run 2      0.50   0.65   0.41
                               PanMorCresp Team - run 3      0.53   0.82   0.47
                               PanMorCresp Team - run 4      0.53   0.87   0.57
                               ONE-IN-ONE                    1.0    0.32   0.42


           Table 2. Results for the clustering task considering all web pages. The run name is
           the name in official evaluation results.

                               Run                           R      S      F0.5(R, S)
                               ALL-IN-ONE                    0.47   1.0    0.56
                               PanMor Team - run 1           0.79   0.46   0.40
                               PanMor Team - run 2           0.49   0.62   0.43
                               PanMor Team - run 3           0.53   0.81   0.5
                               PanMor Team - run 4           0.52   0.86   0.58
                               ONE-IN-ONE                    1.0    0.25   0.36


           4     Discussion

           The results suggest that increasing the vocabulary (from 4000 to 7500 words)
           substantially increases the effectiveness of the algorithm, as does using
           single linkage instead of other approaches such as average or complete
           linkage (runs 3 and 4 vs. runs 1 and 2). However, notice that the first run




           (using complete linkage and a threshold derived from the average similarity
           as stop criterion) achieves a high Reliability (precision) value. In fact,
           there is a strong trade-off (Reliability vs. Sensitivity) between the first
           run and the rest. Therefore, the results cannot be compared objectively: they
           depend to a great extent on the relative weight of Reliability and
           Sensitivity in the F measure.
               On the other hand, run 4 substantially outperforms run 3: F rises from
           0.47 to 0.57 when considering only related documents and from 0.5 to 0.58
           when considering all documents. This improvement is mainly due to an increase
           in Sensitivity (recall). That is, using 9 clusters as stop criterion captures
           more relationships than using a similarity threshold, without penalising
           precision (Reliability).


           5     Conclusions

           We have presented in this paper the evaluation of four different runs in the
           Web Person Name Disambiguation task of IberEval 2017. The most remarkable
           result is that a simplistic method (word occurrences, HAC, single linkage
           and a number of clusters as stop criterion) achieves an effectiveness that
           is (relatively) comparable with the best approach presented in the campaign.
           In fact, other more sophisticated approaches produce lower evaluation results.
               The approaches based on HAC were designed and implemented independently
           by different students. This experiment also suggests that the effectiveness
           of the HAC algorithm is highly sensitive to the decisions about linkage,
           vocabulary size and stop criterion, as well as to the relative weight of the
           complementary evaluation metrics (Reliability and Sensitivity).


           References

            1. Enrique Amigó & Julio Gonzalo & Felisa Verdejo. A General Evaluation Measure
               for Document Organization Tasks. In Proceedings of the 36th International ACM
               SIGIR Conference on Research and Development in Information Retrieval (SIGIR
               2013), pp. 643-652. (2013)
            2. Javier Artiles & Andrew Borthwick & Julio Gonzalo & Satoshi Sekine & Enrique
               Amigó. WePS-3 Evaluation Campaign: Overview of the Web People Search
               Clustering and Attribute Extraction Tasks. In Third Web People Search
               Evaluation Forum (WePS-3), CLEF 2010. (2010)
            3. Javier Artiles & Julio Gonzalo & Satoshi Sekine. WePS 2 Evaluation Campaign:
               Overview of the Web People Search Clustering Task. In 2nd Web People Search
               Evaluation Workshop (WePS 2009), 18th WWW Conference. (2009)
            4. Javier Artiles & Julio Gonzalo & Satoshi Sekine. The SemEval-2007 WePS
               Evaluation: Establishing a Benchmark for the Web People Search Task. In
               Proceedings of the Fourth International Workshop on Semantic Evaluations
               (SemEval-2007), pp. 64-69, Prague, Czech Republic. Association for
               Computational Linguistics. (2007)




            5. Ying Chen & Peng Jin & Wenjie Li & Chu-Ren Huang. The Chinese Persons Name
               Disambiguation Evaluation: Exploration of Personal Name Disambiguation in
               Chinese News. In CIPS-SIGHAN Joint Conference on Chinese Language Processing,
               pp. 346-352. (2010)
            6. Soto Montalvo & Raquel Martínez & Leonardo Campillos & Agustín D. Delgado
               & Víctor Fresno & Felisa Verdejo. MC4WePS: a multilingual corpus for web people
               search disambiguation, Language Resources and Evaluation. (2016)

