LSI UNED at M-WePNaD: Embeddings for Person Name Disambiguation

Andres Duque, Lourdes Araujo, and Juan Martinez-Romo
Dpto. Lenguajes y Sistemas Informáticos
Universidad Nacional de Educación a Distancia (UNED), Madrid 28040, Spain
aduque@lsi.uned.es, lurdes@lsi.uned.es, juaner@lsi.uned.es

Abstract. In this paper we describe the participation of the LSI UNED team in the Multilingual Web Person Name Disambiguation (M-WePNaD) task of the IberEval 2017 competition. Our proposal is based on the use of word embeddings for representing the documents related to individuals sharing the same person name. This may improve the clustering process that aims to eventually separate those documents according to the individual they refer to. This is one of the first approaches to the use of embedding-based techniques for this kind of task. Our preliminary experiments show that a system using a representation based on word embeddings obtains promising results on the addressed task, outperforming the proposed baselines in all the tested configurations.

Keywords: Person Name Disambiguation, word embeddings, document representation, clustering

1 Introduction

Person Name Disambiguation (PND or PNaD) is the task of disambiguating proper names belonging to different people that can be found on the Web through a search engine. That is, given a specific person name (e.g., John Smith) and the results offered by a search engine when that name is used as a query, the final aim of the task is to cluster the webpages returned by the engine. All the webpages contained in a particular cluster should ideally refer to the same person. It is a well-known area of research in both the Natural Language Processing (NLP) and Information Retrieval (IR) communities, and its difficulty comes from the high ambiguity of person names, considered as named entities, as well as from the heterogeneous results that can be offered by the search engine (professional pages, blogs, social media links and many more) [8].

First works on PND presented a set of unsupervised clustering algorithms for testing how automatically extracted relevant features (proper nouns and most relevant words) and biographical information (birth place and year, occupation) could be used for improving unsupervised clustering [14]. However, PND tasks were widely popularized thanks to the WePS (Web People Search) campaigns (WePS-1, WePS-2 and WePS-3) [3, 4, 1], which presented a standardized framework for evaluating systems addressing this kind of task.

The PND task can be divided into two different parts: first, each of the documents (web pages) retrieved by a search engine when looking for a particular name has to be transformed into a structured representation. This structure can subsequently be used by a clustering algorithm in what is considered to be the second part of the process. For the first subtask (representing the documents that refer to different people), a key aspect shared by most of the systems performing PND is the extraction of feature sets able to characterize the documents. These features eventually act as the discriminators that determine which cluster should contain which document after the clustering process.
Amongst these features, bag-of-words and named entities are used by almost all the systems presenting state-of-the-art results [7, 18]. Those basic features are then usually enriched with others such as Wikipedia concepts [13] or hostnames and page URLs [10].

Regarding the clustering algorithms, Hierarchical Agglomerative Clustering (HAC) seems to offer the best results in this task, and many of the best-performing systems in the campaigns make use of that algorithm or variants of it [10, 12]. However, we can find other works in the literature achieving state-of-the-art results through the use of novel clustering algorithms, such as the one proposed in [9], based on adaptive thresholds, which circumvents the problem of depending on training data for determining the thresholds used by HAC.

The main objective in the development of our system is to introduce the use of word embeddings in a specific PND task, more particularly in the first subtask, related to the representation of the documents used for performing the clustering process. Word embeddings were introduced in [5] as a distributed representation of words in which the dimensionality of a particular vocabulary is reduced to a much smaller, fixed size. This way, each word in the vocabulary is associated with a point in a vector space. Although there are some works in the literature that make use of embeddings for document representation in named entity disambiguation [11, 6], we have not found studies regarding their use in the specific Person Name Disambiguation task. Hence, our aim is to propose a system that uses these embeddings to eventually generate a vector which represents the whole document. This vector is then used by the clustering algorithm to separate the documents belonging to different people bearing the same name.

The rest of the paper is organized as follows: Section 2 introduces the task and the characteristics of the corpus used in it. The description of the system is detailed in Section 3, while the results obtained are presented in Section 4. Finally, Section 5 offers some conclusions and future lines of work.

2 The Task: Multilingual Web Person Name Disambiguation (M-WePNaD)

As stated before, the Person Name Disambiguation task consists of clustering the different webpages offered by a search engine when a particular person name is introduced as a query, in order to distinguish the different real individuals associated with that name. The specific task presented within the context of the IberEval 2017 competition differs from common PND tasks in that it is assessed in a multilingual setting. That is, the documents offered by the search engine, and considered for performing the final clustering, can be written in different languages.

2.1 Resources

The main resource used for this task is the multilingual corpus MC4WePS, developed by the organizers of the task [16]. The corpus, which aims to become a reference resource for this task, was built by extracting information from two search engines (Yahoo and Google) for 100 different person names, under two main criteria: ambiguity and multilingualism. The first criterion aims to obtain non-ambiguous, ambiguous and highly ambiguous names (related to 1, to between 2 and 9, and to more than 9 different people, respectively).
The multilingualism criterion is satisfied by offering both monolingual and multilingual pages (that is, pages written in only one language or in more than one), and also by considering cases in which there exist both monolingual and multilingual pages referring to the same individual.

The corpus is split into two different parts: training and testing. Participants in the task were given the training part of the corpus, composed of the results related to 65 person names, together with the Gold Standard containing the different clusters for each person name and the correspondence between each single document and the cluster it belongs to, amongst those associated with a particular person name. The testing part of the corpus, containing the results for the remaining 35 person names, was released to the participants at the end of the training phase, and in this case no Gold Standard was provided. Participants had to run their systems and assign, to each document belonging to a person name, the identifier of the cluster to which it belonged.

For each of the web pages retrieved by the search engine when looking for a specific name, the HTML document was obtained and transformed into plain text using Apache Tika (https://tika.apache.org/). Hence, the corpus contains, for each result, the HTML document, the associated text document, and an XML file containing metadata such as the URL of the search result, ISO 639-1 codes for the languages the page is written in, the download date and the name of the annotator.

2.2 Evaluation

The evaluation metrics used for this task take into account the possibility of presenting overlapping clusters, that is, a document can belong to two or more different clusters at the same time. The metrics are Reliability (R), Sensitivity (S), and their harmonic mean F0.5 [2]. Given that the Gold Standard takes into account web pages which are not considered to be related to any individual bearing the ambiguous name, the evaluation is performed under two different settings: results considering only related web pages and results considering all web pages. Two different baselines are offered for comparison, both in the training phase and in the results of the testing phase:

– ALL-IN-ONE: All the documents related to the same person name are gathered in a single cluster.
– ONE-IN-ONE: A different cluster is proposed for each single document related to the same person name.

3 System Description

The main objective of the system proposed for this task is to explore the use of word embeddings for eventually generating a vector representation of the documents from the corpus. Hence, we are able to apply a clustering algorithm over those vectors to determine the different clusters for each person name proposed in the task.

3.1 Preprocessing

An initial preprocessing step is needed to prepare the documents retrieved by the search engine. As stated before, the plain text from each document is provided within the corpus, and that is the main source of information that we use in our process. We only consider named entities within the documents for representing them; that is, we consider that named entities are the most representative elements in the documents for this particular task.
Hence, from the text documents, we remove the stopwords and extract the named entities that can be found, through the use of the Stanford Named Entity Recognizer (https://nlp.stanford.edu/software/CRF-NER.shtml). This way, we transform our text documents into a "bag of named entities", which is used for building the vectors representing the documents. Although we can find text written in different languages in the documents, we run the Stanford NER as if the document were always written in English (standard configuration). In addition to this, as we explain later on, we also maintain a version of each text document in which we retain all the words except the stopwords, without performing named entity recognition, in order to run an experiment considering all the words in the documents, and not only named entities.

3.2 Document vectors

The first step for building the vectors which represent the documents is to transform each word in the preprocessed documents (bags of named entities) into a specific word vector. For this purpose, we use pre-trained word embeddings, more particularly a collection of word vectors generated from Wikipedia concepts, publicly available for research purposes (https://github.com/ehsansherkat/ConVec). This collection presents around 1.7 million vectors representing Wikipedia concepts, and more than 1.5 million regular words [17]. The word vectors, with 300 dimensions, are built following the Skip-gram model used in Word2Vec [15], which has been shown to offer better overall results than the CBOW model presented in the same work. Although we considered the possibility of building our own word vectors from the provided corpus, the preliminary experiments that we conducted did not offer promising results. This may be due to the reduced size of the corpus, which leads to a poor vector representation of the words.

Using the pre-trained collection, we transform each word in our preprocessed document into a word vector, and then we calculate the average vector of all the vectors related to the words in the document. This way, we generate a document vector of 300 dimensions which represents each particular document in the corpus.

3.3 Clustering algorithm

Once we have transformed each document related to a particular person into a vector, we can apply a clustering algorithm over the vectors of each specific person name, in order to separate the documents related to the different individuals sharing that name. We performed a set of preliminary experiments testing different clustering algorithms directly over the document vectors, but the results obtained were not good enough (when compared with the baselines on the training dataset). Because of that, we focused on developing our own algorithm, based on the characteristics of the training corpus provided by the organizers.

We observed that, for most of the person names in the corpus, there usually existed a big cluster, related to one individual, gathering most of the documents of that person name, and then many small clusters, each of them related to a different individual also sharing that name. That is, the corpus seems to be somewhat biased towards person names for which most of the results offered by the search engine refer to the same individual (the most "important" one), with only a reduced percentage of results related to other people.
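Before detailing the clustering algorithm, the following is a minimal illustrative sketch of the document-vector construction described in Section 3.2, on top of which that algorithm operates. It is not the actual code of our system: it assumes that the named entities of a document have already been extracted, and that the pre-trained collection is available in word2vec text format loadable through gensim's KeyedVectors; the file name is purely illustrative.

    import numpy as np
    from gensim.models import KeyedVectors

    # Pre-trained 300-dimensional vectors built from Wikipedia concepts [17];
    # "wiki_vectors.txt" is an illustrative file name, not the real one.
    vectors = KeyedVectors.load_word2vec_format("wiki_vectors.txt", binary=False)

    def document_vector(bag_of_entities, dim=300):
        # Average the vectors of the tokens that appear in the pre-trained
        # collection; tokens without a vector are simply ignored.
        found = [vectors[token] for token in bag_of_entities if token in vectors]
        if not found:
            return np.zeros(dim)
        return np.mean(found, axis=0)

For the configuration that uses all the words in the documents (Run 5, described in Section 4), the same averaging would simply be applied over all the non-stopword words of the document instead of over the bag of named entities.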
Following the intuition about the corpus bias described above, we have developed our clustering algorithm, adapted to the characteristics of the corpus. The first step is to select those documents related to the "important" individual for a person name. For this purpose, we need to calculate the similarity between each document and the rest of the documents related to the same person name. Thanks to the conversion from text documents to vectors, we can easily perform this calculation using cosine similarity. The similarity weight associated with each document is the average of the similarities between that document and the rest of the documents related to the same person name:

w_i = \frac{\sum_{k=1, k \neq i}^{n} \cos(d_i, d_k)}{n - 1},    (1)

where d_i is the vector representing document i, d_k is the vector representing document k, and \cos(d_i, d_k) is the cosine similarity between d_i and d_k. The total number of documents related to a particular person name is n.

Once we have this similarity weight w_i for document i, we can perform a first pruning step, in which we consider that all the documents with a similarity weight above a specific threshold γ should be gathered in the same initial cluster. For this cluster to represent the "important" individual for that person name, we select a high threshold; given the bias observed in the corpus, most of the documents will still be assigned to that cluster. After that, we generate a different cluster for each of the remaining documents related to that person name, following the intuitions explained before. The final output of the system for each person name is a list containing the document identifiers, followed by the identifier of the cluster to which each document has been assigned.

Through the experiments performed on the training dataset provided by the organizers, we observed that the best value for the threshold is γ = 0.75, that is, considering all the documents with a similarity weight w_i ≥ 0.75 to belong to the same big cluster; then, there is a cluster of size 1 for each of the remaining documents. However, for our experiments with the test dataset, and considering that we are allowed to propose up to 5 runs of our system, we generate different runs by slightly varying the threshold around this value of γ = 0.75, as we explain in the results section.

4 Results

As stated before, the evaluation is conducted using the measures of Reliability and Sensitivity, and their harmonic mean F0.5. Tables 1 and 2 show the results obtained by the different runs of our system for the testing dataset of the M-WePNaD task. Run 1 corresponds to a threshold of γ = 0.70, Run 2 to γ = 0.75, Run 3 to γ = 0.80 and Run 4 to γ = 0.85. The last run, Run 5, is a slightly different configuration of the system in which we consider all the words in the documents (except stopwords) to be representative of them, and not only named entities. The rest of the process remains the same (extraction of word vectors, construction of document vectors, and clustering algorithm). The threshold selected for this last run is γ = 0.75.

As we can observe, both evaluation settings agree on the best run of our system, Run 3. That is, for the testing dataset, a threshold of γ = 0.80 achieves the best results, although the differences with the other runs are small.
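To make the run configurations concrete, the following is a minimal sketch of the threshold-based grouping of Section 3.3, written with NumPy and scikit-learn purely for illustration (the function and variable names are not part of the submitted system). Given the 300-dimensional document vectors of one person name, it computes the average cosine similarity w_i of each document with respect to the rest, assigns every document with w_i ≥ γ to the single large cluster, and gives each remaining document its own singleton cluster.

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    def cluster_person_name(doc_vectors, gamma=0.75):
        # doc_vectors: array of shape (n, 300), one row per document
        # related to the same person name.
        n = doc_vectors.shape[0]
        if n == 1:
            return np.zeros(1, dtype=int)
        sims = cosine_similarity(doc_vectors)      # n x n cosine similarities
        np.fill_diagonal(sims, 0.0)                # exclude cos(d_i, d_i)
        weights = sims.sum(axis=1) / (n - 1)       # w_i as in Eq. (1)
        labels = np.empty(n, dtype=int)
        next_singleton = 1
        for i, w in enumerate(weights):
            if w >= gamma:
                labels[i] = 0                      # big cluster: "important" individual
            else:
                labels[i] = next_singleton         # one cluster per remaining document
                next_singleton += 1
        return labels

Under this sketch, Runs 1 to 4 would correspond to calling the procedure with γ = 0.70, 0.75, 0.80 and 0.85, respectively.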
Table 1. Results considering only related web pages

System                     R      S      F0.5
Run 3                      0.59   0.85   0.61
Run 4                      0.74   0.71   0.61
Run 2                      0.52   0.93   0.58
Run 5                      0.52   0.92   0.57
Run 1                      0.49   0.97   0.56
Baseline - ALL-IN-ONE      0.47   0.99   0.54
Baseline - ONE-IN-ONE      1.00   0.32   0.42

Table 2. Results considering all web pages

System                     R      S      F0.5
Run 3                      0.59   0.81   0.60
Run 2                      0.52   0.92   0.59
Run 5                      0.52   0.90   0.59
Run 1                      0.49   0.97   0.58
Run 4                      0.74   0.66   0.58
Baseline - ALL-IN-ONE      0.47   1.00   0.56
Baseline - ONE-IN-ONE      1.00   0.25   0.36

We can also observe that using all the words in the documents (Run 5) always performs below the run that only considers named entities with the same threshold (Run 2, with γ = 0.75). This indicates that adding all the possible words in the documents introduces more noise than valuable information for the final disambiguation. In general, for the runs that make use of named entities, we can see that as we increase the threshold (that is, as the "important" cluster contains fewer documents), the Reliability increases while the Sensitivity decreases. That is, with small values of γ the results are closer to the "ALL-IN-ONE" baseline (all the documents in the same cluster), and as we increase γ we get closer to the "ONE-IN-ONE" baseline. However, those baselines are always outperformed by all the runs of our system, which indicates that the strategy adopted for performing the clustering is valid for this particular task.

The differences between the two tables are quite small. In general, considering all web pages can be seen as a slightly harder task, since it requires the systems to identify the unrelated results. However, the number of unrelated web pages in the corpus is small and hence the results are quite similar between the two settings. We can observe that the order of the runs of our system is almost the same in both cases, except for Run 4 (the highest value of γ for the configuration that only takes named entities into account), which performs worse when considering all web pages.

5 Conclusions and Future Work

In this paper we have described our participation in the Multilingual Web Person Name Disambiguation (M-WePNaD) task of the IberEval 2017 competition. The main contribution of our work is the application of embedding techniques for representing text documents related to different individuals. We have shown how word vectors extracted from pre-trained collections offer interesting possibilities for creating document vectors, that is, vectors representing whole documents, which can then be used for performing the final clustering process in the task.

Results presented in Section 4 indicate the appropriateness of our proposal, in that it outperforms the baselines proposed by the organizers of the task. It is important to remark that our system uses word embeddings in a very preliminary way, that is, we do not perform any other preprocessing apart from the extraction of named entities, and we have even tested the system without extracting them, only removing stopwords. This indicates that additional processing of the texts with other techniques (analysis of languages, type of web page, URL, etc.) will probably offer interesting improvements over the proposed technique.
In addition, the ad-hoc creation of word vectors whose contexts are closer to the task might improve the results presented in this work, which were obtained with pre-trained embeddings. Finally, the exploration of different clustering techniques is an important factor for the improvement of the system.

Acknowledgments

This work has been partially supported by the Spanish Ministry of Science and Innovation within the projects EXTRECM (TIN2013-46616-C2-2-R) and PROSA-MED (TIN2016-77820-C3-2-R), as well as by the Universidad Nacional de Educación a Distancia (UNED) through the FPI-UNED 2013 grant.

References

1. Amigó, E., Artiles, J., Gonzalo, J., Spina, D., Liu, B., Corujo, A.: WePS-3 evaluation campaign: Overview of the online reputation management task. In: CLEF 2010 (Notebook Papers/LABs/Workshops) (2010)
2. Amigó, E., Gonzalo, J., Verdejo, F.: A general evaluation measure for document organization tasks. In: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 643–652. SIGIR '13, ACM, New York, NY, USA (2013), http://doi.acm.org/10.1145/2484028.2484081
3. Artiles, J., Gonzalo, J., Sekine, S.: The SemEval-2007 WePS evaluation: Establishing a benchmark for the web people search task. In: Proceedings of the 4th International Workshop on Semantic Evaluations. pp. 64–69. Association for Computational Linguistics (2007)
4. Artiles, J., Gonzalo, J., Sekine, S.: WePS 2 evaluation campaign: overview of the web people search clustering task. In: 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference. vol. 9 (2009)
5. Bengio, Y., Ducharme, R., Vincent, P., Jauvin, C.: A neural probabilistic language model. Journal of Machine Learning Research 3(Feb), 1137–1155 (2003)
6. Cai, R., Wang, H., Zhang, J.: Learning entity representation for named entity disambiguation. In: Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data, pp. 267–278. Springer (2015)
7. Chen, Y., Martin, J.: CU-COMSEM: Exploring rich features for unsupervised web personal name disambiguation. In: Proceedings of the 4th International Workshop on Semantic Evaluations. pp. 125–128. SemEval '07, Association for Computational Linguistics, Stroudsburg, PA, USA (2007), http://dl.acm.org/citation.cfm?id=1621474.1621498
8. Delgado, A.D., Martínez, R., Fresno, V., Montalvo, S.: A data driven approach for person name disambiguation in web search results. In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. pp. 301–310. Dublin City University and Association for Computational Linguistics, Dublin, Ireland (August 2014), http://www.aclweb.org/anthology/C14-1030
9. Delgado, A.D., Martínez, R., Montalvo, S., Fresno, V.: Person name disambiguation in the web using adaptive threshold clustering. Journal of the Association for Information Science and Technology, http://dx.doi.org/10.1002/asi.23810
10. Elmacioglu, E., Tan, Y.F., Yan, S., Kan, M.Y., Lee, D.: PSNUS: Web people name disambiguation by simple clustering with rich features. In: Proceedings of the 4th International Workshop on Semantic Evaluations. pp. 268–271. Association for Computational Linguistics (2007)
11. Fang, W., Zhang, J., Wang, D., Chen, Z., Li, M.: Entity disambiguation by knowledge and text jointly embedding. CoNLL 2016, p. 260 (2016)
12. Liu, Z., Lu, Q., Xu, J.: High performance clustering for web person name disambiguation using topic capturing. In: Proceedings of The First International Workshop on Entity-Oriented Search (EOS). pp. 1–6. ACM, New York, NY, USA (2011), http://research.microsoft.com/en-us/um/beijing/events/eos2011/9.pdf
13. Long, C., Shi, L.: Web person name disambiguation by relevance weighting of extended feature sets. In: CLEF 2010 LABs and Workshops, Notebook Papers, 22-23 September 2010, Padua, Italy (2010), http://ceur-ws.org/Vol-1176/CLEF2010wn-WePS-LongEt2010.pdf
14. Mann, G.S., Yarowsky, D.: Unsupervised personal name disambiguation. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4. pp. 33–40. Association for Computational Linguistics (2003)
15. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems. pp. 3111–3119 (2013)
16. Montalvo, S., Martínez, R., Campillos, L., Delgado, A.D., Fresno, V., Verdejo, F.: MC4WePS: a multilingual corpus for web people search disambiguation. Language Resources and Evaluation, pp. 1–28
17. Sherkat, E., Milios, E.: Vector embedding of Wikipedia concepts and entities. arXiv preprint arXiv:1702.03470 (2017)
18. Yoshida, M., Ikeda, M., Ono, S., Sato, I., Nakagawa, H.: Person name disambiguation by bootstrapping. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 10–17. SIGIR '10, ACM, New York, NY, USA (2010), http://doi.acm.org/10.1145/1835449.1835454