ATMC team at M-WePNaD task Agustı́n D. Delgado Universidad Nacional de Educacin a Distancia (UNED) Juan del Rosal, 16, 28040 - Madrid agustin.delgado@lsi.uned.es Abstract. This paper presents our participation in the task Multilin- gual Web Person Name Disambiguation (M-WePNaD) at IBEREVAL 2017 workshop. Given a ranking of search results written in different languages retrieved by a search engine when looking for a person name, the goal of the task is to group the web pages according to the indi- vidual they refer to. We have grouped the search results by means of a clustering algorithm which does not need any kind of prior information. On the other hand, we deal with multilingualism by two different ways. The first one just use a machine translation tool. The second one is a method to compare search results written in different languages which is based on giving a special role to those features written the same way in several languages. Both approaches get similar results, but the second one is more efficient because it avoids additional preprocessing caused by the translation of the search results. 1 Introduction The disambiguation of person names on the Web has been addressed in last years due to two main reasons: (i) Person names are a kind of named entities (NEs) specially ambiguous, so that their disambiguation has been studied in several scenarios like Cross-Document Coreference Resolution [3, 11], Entity Linking and wikification [12, 13] or author name disambiguation [10, 16]; and (ii) The search scenario on the Web presents several challenges: web pages do not talk about an specific topic; search results could not have in common an specific structure as happens with news, scientific papers or references; and the proposed methods must be efficient due to users expect quick responses to their queries. Person name disambiguation on the Web has been addressed as a clustering problem composed by two phases. The goal of the first phase is to represent the search results by means of suitable features to identify and distinguish different individual with the same name. On the other hand, the second phase is to apply a clustering algorithm to group the search results according to the individual they refer to. In particular, the best systems of the state-of-the-art represent the search results with a rich selection of features from different nature and groups the web pages by means of the Hierarchical Agglomerative Clustering (HAC) algorithm after learning a similarity threshold by means of training data [2]. However, some authors have pointed out that the results obtained by HAC are Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017) very sensitive with respect to little variations of the similarity threshold, so this methodology is not robust. On the other hand, this problem has been addressed assuming that all the search results are written in the same language. However, the search engines are able to retrieve web pages written in several languages and there are increasingly web pages written in different languages due to the popularization of the Inter- net in non-English speaking countries [18]. There are few proposals that take into account the presence of multilingualism in this problem. For instance, in [15] is presented a method based on extracting biographical information of the individuals, like birth dates and places. For this purpose, the authors propose to learn several patterns of each biographical fact in several languages by means of training data. However, this approach needs enough training data for each biographical fact in each language, which requires a huge human effort. On the other hand, in [14] the authors claim that Latent Dirichlet Allocation (LDA) is able to deal with the problem for any language. These authors used a collec- tion that contains news written in English, Spanish, Bulgarian and Romanian to check the suitability of their approach in several languages. Nevertheless, the web pages associated to each entity are written in the same language, so the disambiguation process is not multilingual. In this paper, we present several methods to deal with multilingualism in person name disambiguation on the Web. To this end, we have used a data set called MC4WePS1 [17] provided by the M-WePNaD organizers. First, we detail our approaches in Section 2. Next, we present the results in Section 3 and we discuss them in Section 4. Finally, Section 5 presents some conclusions and future lines of work. 2 Methods This section presents four methods to solve the M-WePNaD task, which could be divided into two kinds: those that just take into account the original con- tent of the web pages; and those that employ a machine translation tool. First, Subsection 2.1 describes the clustering algorithm used to group the web pages according to the individual they refer to. Right after, we detail the translation process and the preprocessing of the search results in Subsections 2.2 and 2.3 re- spectively. Finally, Subsection 2.4 presents several approaches to compare search results written in different languages. 2.1 Clustering Algorithm We have used the algorithm Adaptive Threshold Clustering (ATC) [7, 8] to group the search results. ATC is composed by three phases and its grouping strategy is the following: the goal of the two first phases is to obtain initial cohesive clusters with a high value of precision, while the third phase merges them in order to improve the recall score. The phases of ATC are briefly described as follows: 1 http://nlp.uned.es/web-nlp/resources 129 Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017) – Phase 1 (grouping by links): the search results are grouped if they are linked or they share some link under the assumption that they refer the same individual in that case. Therefore, each web page is represented by its URL and its links. Note that the M-WePNaD organizers provide the URL of all the search results in their metadata files. – Phase 2 (UPND algorithm): the search results are grouped by means of the algorithm UPND [5]. In this phase, the web pages are represented by means of their capitalized 3-grams, which are described as a sequence of three consecutive words with their first letter written in uppercase. This kind of features has shown suitable to distinguish between different individuals [6]. – Phase 3 (fusion of clusters): merge of the most similar clusters generated in the previous phases. The clusters are represented as bag of words by means of their centroids. However, some features are filtered of the centroids according to the following properties: (i) they have a low document frequency within the cluster; and (ii) they appear in most of the clusters. In this phase the search results are represented by means of their 1-grams due to these features allow to represent as much search results as possible unlike the capitalized 3-grams. We have used the configuration of ATC described in [8]: these authors show that the results are not affected with respect to the function used to weight the capitalized 3-grams, so they are weighted with the binary function because it is the most simple one. However, in the case of the 1-grams, the TF-IDF function gets better results. On the other hand, ATC compares search results Wi and Wj by means of their cosine similarity sim(Wi , Wj ) and a mathematical function γ(Wi , Wj ) called adaptive threshold which returns a similarity threshold which depends on the search results characteristics and their number of shared features. The web pages are merged if sim(Wi , Wj ) > γ(Wi , Wj ). This way, ATC is able to estimate the number of clusters, so it does not need any prior information to group the search results unlike HAC or k-means algorithms. On the other hand, the presence of web pages from social media platforms could lead to obtain worse performance [4]. Thus, we have applied an heuristic to treat social media platforms [7], which do not allow comparisons between web pages from the same social platform. In addition, this heuristic is extended to web pages of people search engines in the phase 1 because these web pages usually contain links to profiles of different social platforms of people with the same name, so they could lead to merge incorrectly web pages when they are represented by their links. 2.2 Translation Process Some experiments are based on the use of a machine translation tool. This Subsection describes the translation process conducted for these runs for each person name. First, we have to select the anchor language to translate the web pages to. As the computational cost must be light in a web search scenario, we have decided 130 Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017) to translate as few web pages as possible. Thus, we identify the anchor language as the most frequent language of the search results contained in the ranking. Although the M-WePNaD organizers provide the language of each search result annotated by experts, we have used a language identification tool available in the Internet2 that uses a Naive Bayes classifier which looks to sequences of characters within the text to detect the language. We have evaluated the performance of the language detector taking into account the language annotations from the experts obtaining 96.17% accuracy. On the other hand, we have used the translation service provided by the Russian technology company Yandex 3 . This tool is able to translate documents from 94 different languages, including those identified by the language detec- tor. This tool uses statistical techniques to translate the documents by means of several dictionaries and modeling each language with web pages written in several languages, for instance, taking the version of the web site of companies in different languages and comparing them. This translator could not perform correctly due to mistakes made by the language detector. For instance, a Spanish web page that has been detected as Catalan is not entirely translated because the translator does not find Catalan words in the text with the exception of the shared vocabulary between the two languages. 2.3 Preprocessing The preprocessing of the search results when using the machine translation tool for each person name is the following: we obtain the plain text of the search results using the parsers provided by the library TiKa Apache 4 . This tool is also able to obtain the links of the search results used in the phase 1 of ATC. Right after we identify the language of each search result by means of the language detection tool. Next, we translate to the anchor language those search results written in other languages. Then, we split the plain texts into sentences and we delete the stop words of the anchor language because all the search results are written in the same language after the translation step. In addition, we delete the person name due to it is the query so we assume that all the search results contain them. Those experiments which do not use the machine translation tool conduct the same preprocessing with the exception of two differences: (i) we do not translate any search result; and (ii) we delete the stop words of the language identified by the language detector. After the preprocessing phase, we extract the textual features of each sentence used by ATC: capitalized 3-grams and 1-grams. Finally, we remove those features which only appear in one search result of the ranking. 2 https://code.google.com/p/language-detection/ 3 https://www.yandex.com/ 4 https://tika.apache.org/ 131 Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017) 2.4 Approaches We have conducted several experiments based on ATC which take different rep- resentations of the search results or apply different policies when comparing them: – Run 1 (ATC): the search results are represented by means of their original textual features without using any translation resource. – Run 2 (ATC+TRAD): we translate to the anchor languages those search results written in other languages. – Run 3 (ATC+CENT TRAD): we translate separately to the anchor language only the 1-grams contained in the centroids used to represent the clusters in the last phase of the algorithm. – Run 4 (ATMC): we compare the search results taking into account those fea- tures written the same way in different languages without using any trans- lation resource. We have called this run Adaptive Threshold for Multilingual Clustering (ATMC) [9]. Below we detail this method. Runs 2 and 3 allow us to study the suitability of applying a machine transla- tion tool to compare the search results. In particular, Run 3 allows us to study the suitability of translating some words separately with respect to translate the whole document as Run 2 does. On the other hand, Runs 1 and 4 do not use any translation resource to make the disambiguation process lighter in terms of cost because it avoids an additional phase dedicated to translate the web pages. This is desirable in problems related to searching on the Web due to users want quick responses to their queries. Run 1 just applies the ATC algorithm [8] using the features extracted of the original content the web pages, while Run 4 com- pares the documents written in different languages providing more importance to those features which are written the same way in both languages. ATMC compares search results written in different languages giving a special role to those of their features orthographically identical in several languages. This usually happens with NEs as organizations or person names. However, it also happens with other kind of information which is not usually detected as NEs, for instance, titles of films, books, TV shows, papers, and so on, which could be useful to identify an individual. Let be W = {W1 , W2 , . . . , WN } the set of search results returned by a search engine when looking for a person name. We denote as Fi to the set of features of the search SN result Wi that is written in the language li . On the other hand, L(W) = i=1 {li } denotes the set of languages of the search results contained in W. We can tag each feature f ∈ F with the set of languages of the search results where it appears computing L(f ) = {li ∈ L(W)|f ∈ Fi } ⊆ L(W). Then, any feature f ∈ Fi must hold that li ∈ L(f ). Given two features f, f 0 ∈ F, they are comparable features if L(f ) ∩ L(f 0 ) 6= ∅. The set of comparable features of Wi and Wj are defined as follows: Fi,j = {fi ∈ Fi |lj ∈ L(fi )} ⊆ Fi (1) 132 Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017) Fj,i = {fj ∈ Fj |li ∈ L(fj )} ⊆ Fj (2) This definition can be easily generalized for clusters of search results just taking into account the set of languages of the web pages contained in each cluster. In addition, note that li = lj implies that Fi = Fi,j and Fj = Fj,i , so this guarantees that comparing web pages taking Fi,j and Fj,i has no effect in the monolingual scenario. On the other hand, li 6= lj implies that we do not compare the search results taking into account features that we already know are not shared by both web pages by means of the language detection, so it is more probable that they can be grouped than using all their features. This means that if we only use comparable features to compare the search results then we give more benefit to those comparisons between web pages written in different languages. In order to avoid this effect as much as possible, we propose to balance the comparison of the search results taking into account all their features and their comparable features by means of the following formulas: simM L (Wi , Wj ) = αi,j · sim(Fi , Fj ) + (1 − αi,j ) · sim(Fi,j , Fj,i ) (3) γM L (Wi , Wj ) = αi,j · γ(Fi , Fj ) + (1 − αi,j ) · γ(Fi,j , Fj,i ) (4) |Fi,j |+|Fj,i | where αi,j = |F i |+|Fj | is the proportion of comparable features with respect to all the features of the compare search results. A high value of αi,j means that most of features are comparable, so the similarity and the adaptive threshold values would be similar to the ones used by ATC assumming a monolingual scenario. On the other hand, if αi,j has a low value, then few features are com- parable so they are more weighted when comparing search results written in different languages. 3 Results Tables 1 and 2 show the results obtained by the proposed approaches with the test collection and the baselines ALL IN ONE and ONE IN ONE provided by the M-WePNaD organizers. In particular, Table 1 shows the results obtained when taking into account only the search results related to some individual according to the annotators, while Table 2 shows the results obtained when considering all the search results. On the other hand, the baseline ALL IN ONE just returns one cluster which contains all the search results for each person name, while ONE IN ONE returns each search results as a singleton cluster for each person name. The evaluation metrics used are reliability (R) and sensibility (S) [1], and their F-measure (F0.5 ), which weights equally both metrics. 4 Discussion The baseline ALL IN ONE improves the results of ONE IN ONE which means that most individuals in the collection have associated several web pages. The 133 Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017) Run R S F0.5 ALL IN ONE 0.47 0.99 0.54 ONE IN ONE 1.00 0.32 0.42 Run 1 (ATC) 0.79 0.83 0.79 Run 2 (ATC+TRAD) 0.82 0.79 0.80 Run 3 (ATC+CENT TRAD) 0.80 0.84 0.81 Run 4 (ATMC) 0.79 0.85 0.81 Table 1. Results obtained by the proposed approaches with the test collection of the M-WePNaD task taking into account related search results. Run R S F0.5 ALL IN ONE 0.47 1.00 0.56 ONE IN ONE 1.00 0.25 0.36 Run 1 (ATC) 0.78 0.73 0.74 Run 2 (ATC+TRAD) 0.82 0.69 0.73 Run 3 (ATC+CENT TRAD) 0.79 0.74 0.75 Run 4 (ATMC) 0.78 0.75 0.75 Table 2. Results obtained by the proposed approaches with the test collection of the M-WePNaD task taking into account all the search results. tables also show that the proposed approaches improve the results of both base- lines. However, the results between the four approaches are close. This could be explained because of two reasons: several person names in the collection are monolingual (9 of 35) and we translate as less web pages as possible, which modifies the representation of few web pages. In particular, the results of ATC are slightly worse than the ones obtained by the experiments that use the ma- chine translation tool (ATC+TRAD and ATC+CENT TRAD) and ATMC. On the one hand, this means that the translation process has a positive im- pact. In particular, ATC+CENT TRAD is more suitable because it translate a lower amount of text. This experiment obtains a lower reliability score with respect to ATC+TRAD but it gets better results of sensibility. This is explained because when the words are translated separately (ATC+CENT TRAD) the translator always return the same output, but when we translate the whole texts (ATC+TRAD), the translation of each word could be different depend- ing on the context, so the documents share more vocabulary in the case of ATC+CENT TRAD which leads to a higher number of groupings. Finally, ATMC slightly improves ATC+ORIGINAL although both approaches use the original features. This means that the proposed method to compare web pages written in different languages is suitable. In particular, ATC and ATMC get the same reliability score, but ATMC obtains higher sensibility, which means that ATMC is able to group correctly a higher number of search results without loss of pre- cision. In addition, ATC+TRAD and ATC+CENT TRAD do not improve the results of ATMC although they use a machine translation tool. Then, ATMC 134 Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017) is a better choice because it does not need an additional preprocessing step for translating the search results which necessarily increase the processing time of the disambiguation process. Note that this is desirable in a scenario involving searching on the Web due to users expect response as soon as possible. Regarding the results of both tables, the baseline ALL IN ONE is the only experiment that improves its results when considering all the web pages, includ- ing the not related search results. These web pages have been identified by the annotators according to several criteria, for instance, they do not mention any individual with the person name given as query, or they refer to other categories of NEs which are not person names, as happens with John Fitzgerald Kennedy International Airport instead of the former president of the United States. These not related search results are grouped by the annotators in the same cluster for each person name although they could refer to different people, so this situation benefits ALL IN ONE but has a negative impact to ONE IN ONE. On the other hand, the proposed approaches do not identify and group the not related search results, so they get worse results when considering all the web pages. 5 Conclusions This paper has described our participation in the M-WePNaD task at IBEREVAL 2017 workshop, which goal is to address person name disambiguation on the Web in a multilingual scenario. We have proposed four approaches to address the multilingualism in the problem. Two of them are based on the use of a ma- chine translation tool while the other ones do not use any translation resource in order to avoid additional preprocessing steps. On the one hand, the use of a translator improves slightly the results obtained using the original features. On the other hand, we have seen that the comparable features are useful to com- pare web pages written in different languages without the need of translation resources. As future work, we want to explore how to enrich the representation by means of comparable features. For instance, this representation could be ex- tended identifying features written similarly in different languages in addition to those written orthographically the same. Those features could be detected by means of NEs alignment techniques and cognate identification methods. In addition, a future line of work is to detect not related search results due to they have a negative impact in our methods when we consider the whole ranking of web pages. This kind of search results could be identified by means of checking if they mention the person name given as query, and those mentions are not other categories of NEs than person names. Acknowledgment This work has been part-funded by the Spanish Ministry of Science and Inno- vation (MAMTRA-MED Project, TIN2016-77820-C3-2-R and MED-RECORD Project, TIN2013-46616-C2-2-R). 135 Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017) References 1. Enrique Amigó & Julio Gonzalo & Felisa Verdejo: A General Evaluation Measure for Document Organization Tasks Proceedings of the 36th International ACM SI- GIR Conference on Research and Development in Information Retrieval, pp. 643– 652, 2013. http://doi.acm.org/10.1145/2484028.2484081. 2. Javier Artiles: Web People Search. PhD Thesis, E.T.S. Ingenierı́a In- formática, UNED, 2009. http://e-spacio.uned.es/fez/eserv/tesisuned:IngInf- Jartiles/Documento.pdf. 3. Amit Bagga & Breck Baldwin: Entity-based Cross-document Coreferencing Using the Vector Space Model. Proceedings of the 17th International Conference on Com- putational Linguistics - vol. 1, pp. 79–85, 1998. University of Amsterdam (2015). http://dx.doi.org/10.3115/980451.980859. 4. Richard Berendsen: Finding People, Papers, and Posts: Vertical Search Al- gorithms and Evaluation. PhD Thesis. University of Amsterdam (2015). http://doi.acm.org/10.1145/2484028.2484081. 5. Agustı́n D. Delgado & Raquel Martı́nez & Vı́ctor Fresno & Soto Montalvo: A Data Driven Approach for Person Name Disambiguation in Web Search Results. Proceedings of the 25th International Conference on Computational Linguistics, pp. 301–310, 2014. http://aclweb.org/anthology/C/C14/C14-1030.pdf. 6. Agustı́n D. Delgado & Raquel Martı́nez & Soto Montalvo & Vı́ctor Fresno: An Unsupervised Algorithm for Person Name Disambigua- tion in the Web. Procesamiento del Lenguaje Natural, 53:51–58, 2014. http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/5042. 7. Agustı́n D. Delgado & Raquel Martı́nez & Soto Montalvo & Vı́ctor Fresno: Tratamiento de redes sociales en desambiguacin de nombres de per- sona en la web. Procesamiento del Lenguaje Natural, 57:117-124, 2016. http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/5344. 8. Agustı́n D. Delgado & Raquel Martı́nez & Soto Montalvo & Vı́ctor Fresno: Person Name Disambiguation in the Web Using Adaptive Threshold Cluster- ing. Journal of the Association for Information Science and Technology, 2017. https://doi.org/10.1002/asi.23810. 9. Agustı́n D. Delgado: Desambiguación de nombres de persona en la Web en un contexto multilingüe. PhD Thesis, E.T.S. Ingenierı́a Informática, UNED, 2017. 10. Johanna Geiß & Michael Gertz: With a Little Help from My Neighbors: Per- son Name Linking Using the Wikipedia Social Network. Proceedings of the 25th International Conference Companion on World Wide Web, pp. 985–990, 2016. http://dx.doi.org/10.1145/2872518.2891109. 11. Chung Heong Gooi & James Allan: Cross-Document Coreference on a Large Scale Corpus. Human Language Technology Conference of the North Ameri- can Chapter of the Association for Computational Linguistics, pp. 9–16, 2004. http://aclweb.org/anthology/N/N04/N04-1002.pdf 12. Toni Grütze & Gjergji Kasneci & Zhe Zuo & Felix Naumann: Bootstrapping Wikipedia to answer ambiguous person name queries. Workshops Proceedings of the 30th International Conference on Data Engineering Workshops, pp. 56.61. 2014. http://dx.doi.org/10.1109/ICDEW.2014.6818303. 13. Zhengyan He & Houfeng Wang & Sujian Li: The Task 2 of CIPS-SIGHAN 2012 Named Entity Recognition and Disambiguation in Chinese Bakeoff. Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing, pp. 108–114. 2012. http://www.aclweb.org/anthology/W12-6321. 136 Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017) 14. Zornitsa Kozareva & Sujith Ravi: Unsupervised Name Ambiguity Resolution Using a Generative Model. Proceedings of the First Workshop on Unsupervised Learning in NLP, pp. 105–112, 2011. http://dl.acm.org/citation.cfm?id=2140458.2140471. 15. Gideon S. Mann & David Yarowsky: Unsupervised Personal Name Disambiguation. Proceedings of the Seventh Conference on Natural Language Learning at HLT- NAACL 2003 - Volume 4, pp. 33–40. http://dx.doi.org/10.3115/1119176.1119181. 16. Fakhri Momeni & Philipp Mayr: Using Co-authorship Networks for Au- thor Name Disambiguation. Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries (JCDL 2016), pp. 261–262. http://doi.acm.org/10.1145/2910896.2925461. 17. Soto Montalvo & Raquel Martı́nez & Leonardo Campillos & Agustı́n D. Del- gado & Vı́ctor Fresno & Felisa Verdejo: MC4WePS: a multilingual corpus for web people search disambiguation Language Resources and Evaluation (2016). http://dx.doi.org/10.1007/s10579-016-9365-4. 18. Daniel Pimienta & Daniel Prado & Álvaro Blanco: Twelve years of mea- suring linguistic diversity in the Internet: balance and perspectives. UN- ESCO publications for the World Summit on the Information Society (2009). http://unesdoc.unesco.org/images/0018/001870/187016e.pdf. 137