=Paper=
{{Paper
|id=Vol-2277/paper34
|storemode=property
|title=
Analysis and Visualization Algorithm for Cross-Language Author Names Disambiguation
|pdfUrl=https://ceur-ws.org/Vol-2277/paper34.pdf
|volume=Vol-2277
|authors=Zinaida Apanovich,Vladimir Isachenko
|dblpUrl=https://dblp.org/rec/conf/rcdl/ApanovichI18
}}
== Analysis and Visualization Algorithm for Cross-Language Author Names Disambiguation ==
© Zinaida Apanovich, © Vladimir Isachenko

A.P. Ershov Institute of Informatics Systems, Novosibirsk State University, Novosibirsk, Russia

apanovich@iis.nsk.su, vv.isachenko@gmail.com

Abstract. A new algorithm for the cross-language disambiguation of author names is presented. The algorithm uses the matching of Russian and English paper and journal titles. An interactive visualization tool simplifies the analysis and modification of the results obtained.

Keywords: cross-language disambiguation, clustering, interactive visualization.

Proceedings of the XX International Conference “Data Analytics and Management in Data Intensive Domains” (DAMDID/RCDL’2018), Moscow, Russia, October 9-12, 2018

=== 1 Introduction ===

Recent years have been marked by a rapid spread of large-scale knowledge bases, such as Google Knowledge Vault, Deep Dive, Microsoft Academic Graph, etc., which extract facts from texts and integrate data from multiple sources automatically, a process that is potentially error-prone. Most errors are related to the differences in data source schemas and to identity resolution errors.

Name ambiguity in the context of bibliographic citation records is a difficult problem affecting the quality of content in digital libraries. It has been a subject of intensive research [4, 5, 10, 12]. An important aspect of this problem is multilingualism. Multilingual resources such as DBpedia, VIAF, WorldCat, etc., are becoming increasingly common.

Although English is the main language for research and the Internet, a great number of research publications belong to non-English authors and are translated from various foreign languages, which makes the task of integrating multiple data sources even more difficult. Naturally, this poses the problem of the cross-language disambiguation of named entities, and, in particular, the cross-language disambiguation of the authors of scientific publications. Also, algorithms aimed at cross-language identity resolution have recently gained importance in the field of the Semantic Web [7, 8, 15].

Our previous research demonstrated that Russian names allowing several transliterations represent a challenge. Experiments with several multilingual datasets have shown that Russian names admitting several transliterations are often treated as homonyms, and several persons with identical name variations are treated as synonyms [1, 2]. This is especially problematic when errors occur in the resources used to calculate scientific ratings. For example, the family name of a researcher of the A.P. Ershov Institute of Informatics Systems of the Siberian Branch of the Russian Academy of Sciences (IIS SB RAS), Валерий Александрович Непомнящий, can be transliterated as Nepomnyashchii, Nepomniaschy, Nepomnyashchiy, etc. Because of these name variations, his papers in the Scopus database are assigned to four people with distinct Scopus identifiers. Moreover, some of his papers are merged with the papers by Владимир Непомнящий from Moscow and so assigned to yet another “virtual” person.

Recognizing the great importance of this issue, large bibliographic data sources started such projects as ORCID (http://orcid.org/). It provides persistent digital identifiers (Open Researcher and Contributor ID) that distinguish every researcher from any other. Also, ORCID supports automated linkages between a researcher and his or her professional activities, such as publications. Nevertheless, this project has not coped with the problem entirely, and further investigations are needed.

An algorithm for the cross-language identity resolution using the SBRAS Open Archive is presented in [2]. The algorithm relies heavily on the information about Siberian researchers and their affiliations and for this reason has a very limited application. Another problem is the absence of a trustworthy data source [4, 9], since our experiments have shown erroneous authorship attributions in all important international data sources such as DBLP, Scopus, etc. A possible solution to this problem would be using a national database such as eLIBRARY.ru (https://elibrary.ru) as a ground truth source.

The authors have committed themselves to answering the following question: To what extent can a data source such as eLIBRARY.ru be used to refine the quality of the identity resolution of English data sources? To this end, a way of establishing correspondence between the Russian-named and English-named entities has to be developed. The transliteration-based matching of personal names was already described in [2]; our new algorithm, however, has an additional matching step, enabling us to create groups of confirmed papers for an individual researcher. Another issue is establishing the correspondence between the titles of original Russian papers and their English translations as well as between journal titles in Russian and their English translations. Due to this extended matching step, the new clustering algorithm for disambiguation of authors has proven to be more efficient than the previous one. Finally, an interactive visualization algorithm provides comprehensible disambiguation results and enables their modification. As for the visualization issue, only a few works directly relate to our program [6, 11, 13, 14]. None of them is related to the cross-language disambiguation issue.

The paper is organized as follows: first, the datasets and metadata essential for our algorithm are presented. After that, the matching and clustering algorithm and implementation details are described. Finally, we demonstrate an interactive visualization, which facilitates the comprehension of the disambiguation results and allows users to improve them.

=== 2 Datasets and their metadata ===

Data sparsity is the decisive factor leading to author disambiguation errors in all the existing data sources. Since metadata provide important evidence for name disambiguation tasks, the lack of key metadata can result in a poor disambiguation outcome. With publishers providing more metadata with more frequent updating, recent citations have higher metadata availability and the lists of the available metadata are growing continuously.

We have chosen the SpringerLink digital library (https://link.springer.com/) as an English-language bibliographic data source. The main reason is its permanently expanding set of metadata. SpringerLink is currently one of the largest digital libraries with over 10 million documents in various research fields including computer science, mathematics, life sciences, materials, philosophy, psychology, etc. It provides detailed metadata about its publications, such as the paper title, list of authors, ISSN, authors’ affiliations, publication date, venue (journal or conference title), keywords, subject abstract, references, full texts in PDF format, etc. One of the recent innovations is the “translated from” label for the papers written in foreign languages. This additional data makes possible the […]

eLIBRARY.ru stores data in the field of science, technology, medicine and education on more than 28 million publications, more than 500,000 researchers and over 3,000 registered organizations. The A.P. Ershov Institute of Informatics Systems of the Siberian Branch of the Russian Academy of Sciences is an organization registered at eLIBRARY.ru; it regularly inputs and updates information concerning its employees’ publications. The sets of metadata provided by eLIBRARY.ru are similar to those of SpringerLink, but access to these metadata is restricted. To be more specific, the list of publications of an author is freely available, but detailed metadata on his or her papers is not free. Therefore, our disambiguation algorithm is based on the freely available data at eLIBRARY.ru. Another essential difference between these two data sources is that the language of SpringerLink is English, and that of eLIBRARY.ru is Russian, even when it stores data on the English publications of Russian researchers. The main problem, hence, is how to match entities described in different languages.

=== 3 The algorithm description ===

The main steps of our algorithm are as follows.

1. Given a full Russian name, all possible forms and English transliterations are generated.
2. English forms of the name are used for the keyword search of publications in the SpringerLink digital library.
3. An extended set of potential homonyms of the person, specified by the full Russian name, is used to extract groups of publications from eLIBRARY.ru.
4. All the publications extracted from SpringerLink are matched against the eLIBRARY.ru groups of publications.
5. The papers unmatched at the previous step are further analyzed and clustered.
6. Interactive visualization makes it possible to analyze and refine the clustering result.

The general scheme of our algorithm is shown in Fig. 1. Next, we describe each step in more detail.

Figure 1 The general scheme of our algorithm

==== 3.1 Extended transliteration ====

eLIBRARY.ru identifies the researchers by their normalized name, affiliation and location of the employing organization. Since eLIBRARY.ru is a Russian-language data source, all the three attributes are written in Cyrillic. The format of the normalized name is […]

However, several English name variations can correspond to a normalized Russian name. It can be < First Name Middle Name Last Name>, […], etc. All these forms should be first generated in Russian and then transliterated into English. Again, every Russian name can be transliterated in many ways. For example, the Russian family name Ершов can be spelt as Ershov, Yershov, Jerszow, and the first name Андрей can be written as Andrei, Andrey, Andrew. Therefore, in order to identify in an English knowledge base all the possible synonyms of a person from eLIBRARY.ru, our program generates the most complete list of English spellings for each Russian name. This procedure is applied in a character-by-character manner.

Given a full normalized Russian name, the program generates a set E_strings of all possible English transliterations and form variations, as explained earlier [2]. This step should allow extracting the most complete set of synonyms for a given person.

==== 3.2 Extraction of papers from SpringerLink ====

Each generated string s ∈ E_strings is used for keyword search in SpringerLink. Publications having one of the keywords as the author are retained. Sets of metadata such as the title, list of authors, list of authors’ affiliations, publication date, venue (title of journal or conference proceedings), keywords, pdf_url are extracted from SpringerLink. If an extracted publication is a translation of a Russian original paper, SpringerLink provides a special label “translated from.” For example, the paper by A. P. Ershov Design characteristics of a multilanguage programming system has the label “Translated from Kibernetika, No. 4, pp. 11–27, July–August, 1975.” Also, such attributes as DOI (https://doi.org/10.1007/BF01070432) and bibliographic data (Cybernetics July 1975, Volume 11, Issue 4, pp. 526–541) are provided. Moreover, the SpringerLink database gives the ISSN of the translated version. This information can be used for cross-language matching of the original paper in Russian and its translation in English.

==== 3.3 Extraction of paper lists from eLIBRARY.ru ====

eLIBRARY.ru specifies persons by their full Russian name in the […] format, affiliation and location of the employing organization. eLIBRARY.ru lists of publications are used to create confirmed groups. The simplest solution would be to extract from eLIBRARY.ru the publication list of a person specified by his or her full name. The problem is, however, that a person under consideration can have several homonyms and “partial” homonyms, when a short form of the person’s name coincides with the short form of another person’s name. For example, a full Russian name “Andrei Petrovich Ershov” has a short form “A. P. Ershov.” However, Alexander Petrovich Ershov from the Lavrentiev Institute of Hydrodynamics and Alexei Petrovich Ershov from Moscow State University have the same short form of their names and used to be erroneously identified as synonyms. To prevent this kind of error, our algorithm creates groups of confirmed eLIBRARY.ru papers for each potential homonym of a given author. The groups of confirmed papers of SpringerLink are created by matching the papers from SpringerLink and eLIBRARY.ru.

==== 3.4 Matching the papers from SpringerLink and eLIBRARY.ru ====

The authors of the articles extracted from SpringerLink can be both homonyms and synonyms. The disambiguation algorithm should process the list of articles and determine which of their authors are synonyms and which of them are homonyms. In other words, the list of articles should be clustered into the subsets S_1, S_2, …, S_n such that each subset of articles is authored by a single person and all his or her name variations are synonyms. The subset S_1 should contain the articles authored by the person under consideration.

To this end, the list of publications S extracted from SpringerLink is matched against the lists of publications E extracted from eLIBRARY.ru. Note that the papers of eLIBRARY.ru are already clustered into the groups E_1, E_2, …, E_m corresponding to individual authors. Therefore, if a paper s_i ∈ S is recognized as identical to a paper e_j belonging to a group E_m from eLIBRARY.ru, it is assigned to a group S_m.

A paper s_i ∈ S is considered to be identical to a paper e_j ∈ E in the following cases:

Case 1 Title(s_i) = Title(e_j) AND Authors(s_i) = Authors(e_j)

The unique identifier of a paper is its DOI; regrettably, only 56% of our sample papers have a DOI specified in eLIBRARY.ru. The title cannot identify a paper uniquely as some authors can have several publications with the same title. Nevertheless, the exact match of titles and author names can be considered as evidence that the papers were authored by the same person. However, some paper titles differ in SpringerLink and eLIBRARY.ru due to scanning errors. For example, the paper titled SCHEMATOLOGY IN A MULTI-LANGUAGE OPTIMIZER in eLIBRARY.ru has the title Schematology in a MJ I/T I-language OPT imizer in SpringerLink. In the absence of an exact match of the paper titles, both titles are stemmed by the Porter stemmer and their overlap score is calculated. If this score exceeds a threshold value, the titles are considered coinciding. The discovered matching is written in a special file for further user control.

Case 2 Cross-language identification of paper and journal titles.

Many Russian journals are first published in Russian and then translated into English. A typical example is the Программирование journal, which is published in English as Programming and Computer Software. About 40% of the older eLIBRARY.ru entries have only a Russian description and do not have an English counterpart. These publications, however, are very important for making confirmed paper groups as large as possible.

There are several problems involved in this situation. First, it is impossible to compare papers by title when the title of an original paper is in Russian and the title of a translated paper is in English. Besides, the original and the translated papers have disjoint sets of attributes such as venue, ISSN, publication data, page numbers, etc. Although SpringerLink provides information about journal titles in the Latin alphabet only, every translated paper in the database mentions its Russian original. For example, the paper by A. P. Ershov Design characteristics of a multilanguage programming system has the label “Translated from Kibernetika, No. 4, pp. 11–27, July–August, 1975” in SpringerLink. Moreover, the SpringerLink database provides the ISSN of the translated version. This information suffices to find the Russian version of the paper if it is available in eLIBRARY.ru. The corresponding English-language article is marked as matched and the pair of papers is saved for further processing.

The average share of papers assigned to the confirmed groups during the matching step was about 69%, while the number of erroneously attributed publications was close to zero. The main reason why the system cannot assign some papers to their author is data sparsity. To extend the set of the identified authors of papers, a clustering algorithm was applied to the unmatched papers.

==== 3.5 Clustering unmatched papers ====

The two important aspects of our algorithm are the cross-language generation of the confirmed groups and the similarity evaluation between unmatched papers. The papers of SpringerLink which were not grouped at the previous step are now grouped together if their similarity exceeds a specified threshold. (The threshold and all the attribute score values can be adjusted during the interactive visualization step.) The schema of the algorithm is as follows.

Let A = ∪(A_gi) be a set of papers obtained after the matching step of SpringerLink and eLIBRARY.ru papers, where g_i is a group number. For the group of unmatched papers, g_i = −1. Then the following algorithm is applied:

 For each paper s ∈ A
   For each paper t ∈ A
     d := similarity_score(s, t)
     If (d > threshold)
       If (Group(s) = −1 and Group(t) = −1)
         NewGroup(s, t)
       Else
         MergeGroups(s, t)

When merging two groups, the algorithm checks whether both groups belong to the set of the confirmed groups. If this happens, the merging does not occur, since the confirmed groups correspond to the articles by distinct authors.

==== 3.6 Papers similarity scores calculation ====

To calculate the assignment likelihood of an ambiguous author A to a confirmed group G, we consider similarity between p_A and p_i ∈ p_G. Given the attributes collected by the SpringerLinkExtractor, all the attributes are pairwise compared, which results in a number of scores that are summed in the final step.

Titles of papers similarity. If an exact match of the titles of the papers A and B is found, the title_similarity_score is set to 1.0. Otherwise, the titles of the papers A and B are stemmed, and the title_similarity_score is set to the overlap ratio of their word lists.

Co-authors similarity. The co-author_similarity_score uses the Jaccard index to evaluate the overlap ratio of the co-author lists of the two papers.

Subjects and keywords similarity. The subject_similarity_score and keyword_similarity_score are calculated in the same way as the co-author_similarity_score.

Date similarity. The date_similarity_score is set to 0.1 if the timestamp difference of the papers A and B is less than five years. If the timestamp difference of the papers A and B is more than twenty-five years, it is set to −0.1.

Venue similarity. The publication_venue_score (i.e., conference/journal title) is set to 0.1 if there is an exact match between the venue titles.

Text similarity. The text_similarity_score is evaluated by TF-IDF and the cosine similarity measure.

The final assignment likelihood is calculated as the sum of all the above scores.

=== 4 Interactive visualization for understanding and editing the matching and clustering results ===

Several interrelated visualizations seek to simplify the understanding and editing of the matching and clustering results. A global view of the obtained groups of publications is shown in Fig. 2 as a pie chart. Each segment of the pie chart corresponds to a separate group of publications attributed to a single author. The size of a segment in the pie chart is proportional to the number of documents assigned to this group. A short textual description of a chosen group of documents appears in the right panel after a mouse click on a segment of the pie chart. A set of checkboxes in the center of the global view enables the interactive adjustment of the clustering results. Users can change the list of parameters taken into account by the clustering cost function as well as the group similarity threshold value. When the “Recalculate groups” button is pressed, the system automatically recalculates the clustering results, which makes the clustering algorithm interactively adjustable through the visualization.

Figure 2 A global view of the matching and clustering results

The “Change groups” button displays another window which allows interactive modification of the clustering results by dragging a paper from one group to another.

The “Save results” button allows saving the updated clustering results.

The “Show details” button provides access to the visualization of individual group parameters. For example, a group of papers can be represented as an adjacency matrix A shown in Fig. 3. Each entry a_ij ∈ A is shown as a colored circle with its radius proportional to the similarity value between the papers p_i and p_j.

If a paper is assigned to the group by the procedure of matching with eLIBRARY.ru, the corresponding diagonal circle is green; otherwise it is blue. For example, a group of papers assigned by the matching and clustering procedure to the employee of the IIS SB RAS Kas’yanov V.N. is shown in Fig. 3. It is easy to see that all but one paper by Kas’yanov V.N. were found in eLIBRARY.ru, and many of them have descriptions in Russian only. When an entry a_ij is chosen by a mouse click, it is highlighted in red, and the description of the corresponding document pair appears, as well as a detailed explanation of the coefficient obtained.

Figure 3 Paper similarity adjacency matrix

The "Co-authorship" button opens another window representing co-authors of scientific publications in the form of a matrix (see Fig. 4).

Figure 4 Co-authorship table for a group of papers

One more window can be opened by the "By year" button. This view is shown in Fig. 5. It represents the distribution of papers by year.

Figure 5 Distribution of papers by year

These two views allow for a visual search of the so-called "group outliers" that do not really belong to the same author. By choosing a paper of interest with a mouse click and pressing the "Remove from the group" button, the user can change the paper allocation. The clustering algorithm will either automatically move this paper to another group, or create a new group containing this paper.

=== Conclusion ===

The newly developed matching procedure provides the algorithm presented in this paper with the ability not only to cluster the papers correctly, but also to determine the exact identity of authors, including the name and location of the affiliating organization.

The program implementing the algorithm has been tested on a dataset of 100 persons employed by the IIS SB RAS at various time periods. Also, this dataset contains Academician A.P. Ershov, whose papers have been input into eLIBRARY.ru by IIS SB RAS. The total number of papers found in SpringerLink for all Russian names in this dataset was equal to 3,175. All the results obtained by the program were verified manually.

For each person listed in the test dataset the following values were calculated:
• total number of papers found in SpringerLink for each Russian full name listed in the test dataset;
• number of articles actually authored by a researcher specified in the test dataset;
• number of papers that have been correctly recognized by the matching algorithm;
• number of papers that have been correctly recognized by the matching + clustering algorithm.

These experiments have shown that 69.4 percent of papers have been correctly recognized by the matching algorithm; 86.6 percent is the share of papers that have been correctly recognized by the clustering algorithm; and 95 percent of papers have been correctly recognized by the matching + clustering algorithm.

=== References ===

[1] Apanovich Z.V., Marchuk A.G.: Experiments on using the LOD cloud datasets to enrich the content of a scientific knowledge base. In: KESW 2013, CCIS 394, pp. 1-14. Springer Verlag, Berlin Heidelberg (2013).

[2] Apanovich Z., Marchuk A.: Experiments on Russian-English identity resolution. In: Proceedings of the ICADL-2015 Conference, Seoul, South Korea, LNCS 9469, pp. 12-21. Springer International Publishing Switzerland (2015).

[3] D'Angelo C.A., Giuffrida C., Abramo G.: A heuristic approach to author name disambiguation in bibliometrics databases for large-scale research assessments. In: Journal of the American Society for Information Science and Technology 62(2), pp. 257-269 (2011).

[4] Ferreira A. A., Gonçalves M. A., Laender A. H. F.: Disambiguating Author Names in Large Bibliographic Repositories. In: Internat. Conf. on Digital Libraries, New Delhi, India (2013).

[5] Hickey T. B., Toves J. A.: Managing Ambiguity in VIAF. In: D-Lib Magazine 20 (July/August) (2014). doi: 10.1045/july2014-hickey. http://www.dlib.org/dlib/july14/hickey/07hickey.html.

[6] Kang H., Getoor L., Shneiderman B., Bilgic M., Licamele L.: Interactive entity resolution in relational data: A visual analytic tool and its evaluation. In: IEEE Transactions on Visualization and Computer Graphics, 14(5), pp. 999–1014 (2008).

[7] Lawrie D., Mayfield J., McNamee P., Oard D. W.: Creating and curating a Cross-language Person-entity linking collection. (2012).

[8] Lawrie D., Mayfield J., McNamee P., Oard D. W.: Cross-Language Person-Entity Linking from Twenty Languages (2015).

[9] Reijnhoudt L., Costas R., Noyons E., Boerner K., Scharnhorst A.: "Seed+ expand": A validated methodology for creating high quality publication oeuvres of individual researchers. In: Proceedings of ISSI 2013 Vienna, arXiv:1301.5177 (2013).

[10] Schulz Chr., Mazloumian A., Petersen A. M., Penner O., Helbing D.: Exploiting citation networks for large-scale author name disambiguation. In: EPJ Data Science, 3 (11), pp. 1-14 (2014).

[11] Shen Q., Wu T., Yang H., Wu Y., Qu H., Cui W.: NameClarifier: A Visual Analytics System for Author Name Disambiguation. In: IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 1, pp. 141-150 (2017).

[12] Song Y., Huang J., Councill I.G., Jia Li C., Giles L.: Efficient Topic-based Unsupervised Name Disambiguation. In: Proc. of the 7th ACM/IEEE-CS Joint Conf. on Digital Libraries, pp. 342–351 (2007).

[13] Stoffel F., Jentner W., Behrisch M., Fuchs J., Keim D.: Interactive Ambiguity Resolution of Named Entities in Fictional Literature. In: Computer Graphics Forum, v.36 n.3, pp. 189-200 (2017).

[14] Strotmann A., Zhao D., Bubela T.: Author name disambiguation for collaboration network analysis and visualization. In: Proceedings of the American Society for Information Science and Technology, 46(1), pp. 1–20 (2009).

[15] Sun Z., Hu W., Li C.: Cross-Lingual Entity Alignment via Joint Attribute-Preserving Embedding. In: d'Amato C. et al. (eds) ISWC 2017, Part I, LNCS 10587, pp. 628–644 (2017). DOI: 10.1007/978-3-319-68288-4_37
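The character-by-character transliteration described in Section 3.1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the table TRANSLIT, its deliberately incomplete contents, and the functions transliterations and name_forms are hypothetical names introduced here, and only three of the possible name forms are generated.

```python
from itertools import product

# Hypothetical, deliberately incomplete table (an assumption of this sketch):
# each Cyrillic letter maps to one or more candidate Latin renderings.
TRANSLIT = {
    "а": ["a"], "в": ["v", "w"], "д": ["d"], "е": ["e", "ye", "je"],
    "и": ["i"], "й": ["i", "y", "j"], "н": ["n"], "о": ["o"],
    "п": ["p"], "р": ["r"], "т": ["t"], "ч": ["ch"], "ш": ["sh", "sz"],
}

def transliterations(name):
    """Return every Latin spelling obtained by choosing, for each
    character, one of its candidate renderings (unknown characters
    are passed through unchanged)."""
    options = [TRANSLIT.get(ch, [ch]) for ch in name.lower()]
    return {"".join(parts).capitalize() for parts in product(*options)}

def name_forms(last, first, middle):
    """Combine the transliterations into a few illustrative name forms."""
    forms = set()
    for l, f, m in product(transliterations(last),
                           transliterations(first),
                           transliterations(middle)):
        forms.add(f"{f} {m} {l}")          # First Middle Last
        forms.add(f"{f[0]}. {m[0]}. {l}")  # F. M. Last
        forms.add(f"{l} {f[0]}. {m[0]}.")  # Last F. M.
    return forms

variants = transliterations("Ершов")
print(sorted(variants))
```

With this table the family name Ершов yields, among others, the spellings Ershov, Yershov and Jerszow mentioned in Section 3.1; the number of variants grows as the product of the per-letter alternatives, which is why the resulting strings are used only as search keys.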
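The pairwise scores of Section 3.6 can be sketched in Python as follows. Several points are assumptions of this sketch rather than statements of the paper: crude_stem is a simplistic stand-in for the Porter stemmer, the "overlap ratio" is taken to be the Jaccard index, the exact title match ignores letter case, papers are plain dictionaries with hypothetical keys, and the TF-IDF text score is omitted.

```python
import re

def crude_stem(word):
    """Simplistic stand-in for the Porter stemmer (assumption):
    strips a few common English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def title_tokens(title):
    """Lower-case the title, split it into words and stem each word."""
    return {crude_stem(w) for w in re.findall(r"[a-z0-9]+", title.lower())}

def jaccard(a, b):
    """Jaccard index of two collections; 0.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def title_similarity(t1, t2):
    """1.0 for an exact (case-insensitive) match, else stemmed overlap."""
    if t1.strip().lower() == t2.strip().lower():
        return 1.0
    return jaccard(title_tokens(t1), title_tokens(t2))

def similarity_score(p, q):
    """Sum of the attribute scores; the text (TF-IDF) score is left out."""
    score = title_similarity(p["title"], q["title"])
    score += jaccard(p["coauthors"], q["coauthors"])
    score += jaccard(p["keywords"], q["keywords"])
    gap = abs(p["year"] - q["year"])
    if gap < 5:
        score += 0.1
    elif gap > 25:
        score -= 0.1
    if p["venue"] == q["venue"]:
        score += 0.1
    return score

p = {"title": "Graph drawing methods", "coauthors": {"Ivanov"},
     "keywords": set(), "year": 2001, "venue": "Venue A"}
q = {"title": "Methods of graph drawing", "coauthors": {"Ivanov", "Petrov"},
     "keywords": set(), "year": 2003, "venue": "Venue B"}
print(similarity_score(p, q))
```

The sample records p and q are invented for illustration only. Because the individual scores are simply added, the date and venue terms act as small corrections of ±0.1 on top of the set-overlap terms, each of which lies between 0 and 1.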
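The clustering loop of Section 3.5 admits the following Python sketch. It is one possible reading of the pseudocode, not the authors' code: the function name cluster, the list-based group labels and the rule that a merged group keeps the label of the confirmed group are assumptions, and each unordered pair of papers is visited once rather than twice.

```python
def cluster(papers, groups, confirmed, score, threshold):
    """Group papers whose pairwise score exceeds the threshold.

    papers    -- list of paper records
    groups    -- list of group ids, -1 for unmatched papers
    confirmed -- set of ids of confirmed groups (never merged together)
    score     -- function (paper, paper) -> float
    """
    groups = list(groups)
    next_id = max([*groups, *confirmed], default=-1) + 1
    for i in range(len(papers)):
        for j in range(i + 1, len(papers)):
            if score(papers[i], papers[j]) <= threshold:
                continue
            gi, gj = groups[i], groups[j]
            if gi == -1 and gj == -1:
                # Both papers are unmatched: open a new group.
                groups[i] = groups[j] = next_id
                next_id += 1
            elif gi == -1:
                groups[i] = gj
            elif gj == -1:
                groups[j] = gi
            elif gi != gj and not (gi in confirmed and gj in confirmed):
                # Merge, keeping the confirmed group's id if there is one.
                keep, drop = (gi, gj) if gi in confirmed else (gj, gi)
                groups = [keep if g == drop else g for g in groups]
    return groups

# Toy data: "papers" are numbers, similar when they differ by at most 1.
toy_score = lambda a, b: 1.0 if abs(a - b) <= 1 else 0.0
result = cluster([0, 1, 10, 11, 20], [0, -1, 1, -1, -1], {0, 1}, toy_score, 0.5)
print(result)
```

In the toy run the unmatched papers 1 and 11 join the confirmed groups 0 and 1 respectively, while paper 20 stays unmatched; two confirmed groups are never merged, in line with the remark that they correspond to distinct authors.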