=Paper= {{Paper |id=Vol-2277/paper34 |storemode=property |title= Analysis and Visualization Algorithm for Cross-Language Author Names Disambiguation |pdfUrl=https://ceur-ws.org/Vol-2277/paper34.pdf |volume=Vol-2277 |authors=Zinaida Apanovich,Vladimir Isachenko |dblpUrl=https://dblp.org/rec/conf/rcdl/ApanovichI18 }} == Analysis and Visualization Algorithm for Cross-Language Author Names Disambiguation == https://ceur-ws.org/Vol-2277/paper34.pdf
   Analysis and Visualization Algorithm for Cross-Language
                Author Names Disambiguation
               © Zinaida Apanovich                              © Vladimir Isachenko
              A.P. Ershov Institute of Informatics Systems, Novosibirsk State University,
                                          Novosibirsk, Russia
              apanovich@iis.nsk.su                            vv.isachenko@gmail.com
            Abstract. A new algorithm for the cross-language disambiguation of author names is presented. The
     algorithm uses the matching of Russian and English papers and journal titles. An interactive visualization tool
     simplifies the analysis and modification of the results obtained.
            Keywords: cross-language disambiguation, clustering, interactive visualization.

Proceedings of the XX International Conference “Data Analytics and Management in Data Intensive
Domains” (DAMDID/RCDL’2018), Moscow, Russia, October 9-12, 2018

1 Introduction

Recent years have been marked by a rapid spread of large-scale knowledge bases, such as Google
Knowledge Vault, Deep Dive, Microsoft Academic Graph, etc., which extract facts from texts and
integrate data from multiple sources automatically, a process that is potentially error-prone.
Most errors are related to differences in data source schemas and to identity resolution errors.
    Name ambiguity in the context of bibliographic citation records is a difficult problem
affecting the quality of content in digital libraries. It has been a subject of intensive
research [4, 5, 10, 12]. An important aspect of this problem is multilingualism. Multilingual
resources such as DBPedia, VIAF, WorldCat, etc., are becoming increasingly common.
    Although English is the main language for research and the Internet, a great number of
research publications are written by non-English-speaking authors and are translated from other
languages, which makes the task of integrating multiple data sources even more difficult.
Naturally, this poses the problem of the cross-language disambiguation of named entities, and,
in particular, the cross-language disambiguation of the authors of scientific publications.
Also, algorithms aimed at cross-language identity resolution have recently gained importance in
the field of the Semantic Web [7, 8, 15].
    Our previous research demonstrated that Russian names allowing several transliterations
represent a challenge. Experiments with several multilingual datasets have shown that Russian
names admitting several transliterations are often treated as homonyms, and several persons
with identical name variations are treated as synonyms [1, 2]. This is especially harmful when
errors occur in the resources used to calculate scientific ratings. For example, the family
name of a researcher of the A.P. Ershov Institute of Informatics Systems of the Siberian Branch
of the Russian Academy of Sciences (IIS SB RAS), Валерий Александрович Непомнящий, can be
transliterated as Nepomnyashchii, Nepomniaschy, Nepomnyashchiy, etc. Because of these name
variations, his papers in the Scopus database are assigned to four people with distinct Scopus
identifiers. Moreover, some of his papers are merged with the papers by Владимир Непомнящий
from Moscow and so assigned to yet another “virtual” person.
    Recognizing the great importance of this issue, large bibliographic data sources started
such projects as ORCID (http://orcid.org/). It provides persistent digital identifiers (Open
Researcher and Contributor ID) that distinguish every researcher from any other. Also, ORCID
supports automated linkages between a researcher and his or her professional activities, such
as publications. Nevertheless, this project has not coped with the problem entirely, and
further investigations are needed.
    An algorithm for cross-language identity resolution using the SB RAS Open Archive is
presented in [2]. The algorithm relies heavily on the information about Siberian researchers
and their affiliations and for this reason has a very limited application. Another problem is
the absence of a trustworthy data source [4, 9], since our experiments have shown erroneous
authorship attributions in all important international data sources such as DBLP, Scopus, etc.
A possible solution to this problem would be using a national database such as eLIBRARY.ru
(https://elibrary.ru) as a ground truth source.
    The authors have committed themselves to answering the following question: To what extent
can a data source such as eLIBRARY.ru be used to refine the quality of the identity resolution
of English data sources? To this end, a way of establishing correspondence between the
Russian-named and English-named entities has to be developed. The transliteration-based
matching of personal names was already described in [2]; our new algorithm, however, has an
additional matching step, enabling us to create groups of confirmed papers for an individual
researcher. Another issue is establishing the correspondence between the titles of original
Russian papers and their English translations as well as between journal titles in Russian and
their English translations. Due to this extended matching step, the new clustering algorithm
for the disambiguation of authors has proven to be more efficient than the previous one.
Finally, an
interactive visualization algorithm provides comprehensible disambiguation results and enables
their modification. As for the visualization issue, only a few works directly relate to our
program [6, 11, 13, 14]. None of them is related to the cross-language disambiguation issue.
    The paper is organized as follows: first, the datasets and metadata essential for our
algorithm are presented. After that, the matching and clustering algorithm and implementation
details are described. Finally, we demonstrate an interactive visualization, which facilitates
the comprehension of the disambiguation results and allows users to improve them.

2 Datasets and their metadata

Data sparsity is the decisive factor leading to author disambiguation errors in all the
existing data sources. Since metadata provide important evidence for name disambiguation tasks,
the lack of key metadata can result in a poor disambiguation outcome. With publishers providing
more metadata and updating them more frequently, recent citations have higher metadata
availability and the lists of the available metadata are growing continuously.
    We have chosen the SpringerLink digital library (https://link.springer.com/) as an
English-language bibliographic data source. The main reason for this is its permanently
expanding set of metadata. SpringerLink is currently one of the largest digital libraries with
over 10 million documents in various research fields including computer science, mathematics,
life sciences, materials, philosophy, psychology, etc. It provides detailed metadata about its
publications, such as the paper title, list of authors, ISSN, authors’ affiliations,
publication date, venue (journal or conference title), keywords, subject, abstract, references,
full texts in pdf format, etc. One of the recent innovations is the “translated from” label for
the papers written in foreign languages. This additional data makes possible the cross-language
matching of a translated paper with its Russian original.
    eLIBRARY.ru stores data in the field of science, technology, medicine and education on more
than 28 million publications, more than 500,000 researchers and over 3,000 registered
organizations. The A.P. Ershov Institute of Informatics Systems of the Siberian Branch of the
Russian Academy of Sciences is an organization registered at eLIBRARY.ru; it regularly inputs
and updates information concerning its employees’ publications. The sets of metadata provided
by eLIBRARY.ru are similar to those of SpringerLink, but access to these metadata is
restricted. To be more specific, the list of publications of an author is freely available, but
detailed metadata on his/her papers is not free. Therefore, our disambiguation algorithm is
based on the freely available data at eLIBRARY.ru. Another essential difference between these
two data sources is that the language of SpringerLink is English, and that of eLIBRARY.ru is
Russian, even when it stores data on the English publications of Russian researchers. The main
problem, hence, is how to match entities described in different languages.

3 The algorithm description

The main steps of our algorithm are as follows.
  1. Given a full Russian name, all possible forms and English transliterations are generated.
  2. English forms of the name are used for the keyword search of publications in the
     SpringerLink digital library.
  3. An extended set of potential homonyms of the person, specified by the full Russian name,
     is used to extract groups of publications from eLIBRARY.ru.
  4. All the publications extracted from SpringerLink are matched against the eLIBRARY.ru
     groups of publications.
  5. The papers unmatched at the previous step are further analyzed and clustered.
  6. Interactive visualization makes it possible to analyze and refine the clustering result.

The general scheme of our algorithm is shown in Fig. 1. Next, we describe each step in more
detail.

3.1 Extended transliteration
eLIBRARY.ru identifies the researchers by their normalized name, affiliation and location of
the employing organization. Since eLIBRARY.ru is a Russian-language data source, all three
attributes are written in Cyrillic. The format of the normalized name is .
    However, several English name variations can correspond to a normalized Russian name. It
can be <First Name Middle Name Last Name>, , , etc. All these forms should be first generated
in Russian and then transliterated into English. Again, every Russian name can be transliterated
in many ways. For example, the Russian family name Ершов can be spelt as Ershov, Yershov,
Jerszow, and the first name Андрей can be written as Andrei, Andrey, Andrew. Therefore, in
order to identify in an English knowledge base all the possible synonyms of a person from
eLIBRARY.ru, our program generates the most complete list of English spellings for each Russian
name. This procedure is applied character by character.
    Given a full normalized Russian name, the program generates a set E_strings of all possible
English transliterations and form variations, as explained earlier [2]. This step should allow
extracting the most complete set of synonyms for a given person.
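The character-by-character generation of Section 3.1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the table `TRANSLIT`, the word-initial table `INITIAL`, and the functions `transliterations` and `name_forms` are hypothetical names introduced here, the letter table is deliberately incomplete, and only two name forms are produced (the full form and the short form with initials, such as "A. P. Ershov").

```python
from itertools import product

# Hypothetical, incomplete table: each Cyrillic letter maps to one or more
# Latin renderings. Only the letters needed for the example are listed.
TRANSLIT = {
    "а": ["a"], "в": ["v", "w"], "д": ["d"], "е": ["e"], "и": ["i"],
    "й": ["i", "y", "j"], "н": ["n"], "о": ["o"], "п": ["p"],
    "р": ["r"], "т": ["t"], "ч": ["ch"], "ш": ["sh", "sz"],
}
# Assumed word-initial overrides: an initial "Е" may be written E, Ye or Je.
INITIAL = {"е": ["e", "ye", "je"]}


def transliterations(word: str) -> set[str]:
    """Return every spelling obtained by choosing, letter by letter,
    one Latin rendering for each Cyrillic letter of `word`.
    Raises KeyError for letters missing from the table."""
    options = []
    for pos, ch in enumerate(word.lower()):
        table = INITIAL if pos == 0 and ch in INITIAL else TRANSLIT
        options.append(table[ch])
    return {"".join(parts).capitalize() for parts in product(*options)}


def name_forms(last: str, first: str, middle: str) -> set[str]:
    """Build a set of English search strings (an E_strings analogue)
    from all spelling combinations and two assumed name forms."""
    forms = set()
    for l, f, m in product(transliterations(last),
                           transliterations(first),
                           transliterations(middle)):
        forms.add(f"{f} {m} {l}")            # full form
        forms.add(f"{f[0]}. {m[0]}. {l}")    # short form with initials
    return forms


e_strings = name_forms("Ершов", "Андрей", "Петрович")
```

Because every letter contributes its renderings independently, the number of spellings grows multiplicatively with the name length, which is why the resulting strings are best treated as search keys to be filtered by the later matching steps.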
Figure 1 The general scheme of our algorithm

3.2 Extraction of papers from SpringerLink
Each generated string s ∈ E_strings is used for keyword search in SpringerLink. Publications
having one of the keywords as the author are retained. Sets of metadata such as the title, list
of authors, list of authors’ affiliations, publication date, venue (title of journal or
conference proceedings), keywords, pdf_url are extracted from SpringerLink. If an extracted
publication is a translation of a Russian original paper, SpringerLink provides a special label
“translated from.” For example, the paper by A. P. Ershov, Design characteristics of a
multilanguage programming system, has the label “Translated from Kibernetika, No. 4, pp. 11–27,
July–August, 1975.” Also, such attributes as DOI (https://doi.org/10.1007/BF01070432) and
bibliographic data (Cybernetics July 1975, Volume 11, Issue 4, pp. 526–541) are provided.
Moreover, the SpringerLink database gives the ISSN of the translated version. This information
can be used for cross-language matching of the original paper in Russian and its translation in
English.

3.3 Extraction of paper lists from eLIBRARY.ru
eLIBRARY.ru specifies persons by their full Russian name in  format, affiliation and location
of the employing organization. eLIBRARY.ru lists of publications are used to create confirmed
groups. The simplest solution would be to extract from eLIBRARY.ru the publication list of a
person specified by his or her full name. The problem is, however, that a person under
consideration can have several homonyms and “partial” homonyms, when a short form of the
person’s name coincides with the short form of another person’s name. For example, the full
Russian name “Andrei Petrovich Ershov” has the short form “A. P. Ershov.” However, Alexander
Petrovich Ershov from the Lavrentiev Institute of Hydrodynamics and Alexei Petrovich Ershov
from Moscow State University have the same short form of their names and used to be
erroneously identified as synonyms. To prevent this kind of error, our algorithm creates groups
of confirmed eLIBRARY.ru papers for each potential homonym of a given author. The groups of
confirmed papers of SpringerLink are created by matching the papers from SpringerLink and
eLIBRARY.ru.

3.4 Matching the papers from SpringerLink and eLIBRARY.ru
The authors of the articles extracted from SpringerLink can be both homonyms and synonyms. The
disambiguation algorithm should process the list of articles and determine which of their
authors are synonyms and which of them are homonyms. In other words, the list of articles
should be clustered into the subsets S_1, S_2, …, S_n such that each subset of articles is
authored by a single person and all his or her name variations are synonyms. The subset S_1
should contain the articles authored by the person under consideration.
    To this end, the list of publications S extracted from SpringerLink is matched against the
lists of publications E extracted from eLIBRARY.ru. Note that the papers of eLIBRARY.ru are
already clustered into the groups E_1, E_2, …, E_m corresponding to individual authors.
Therefore, if a paper s_i ∈ S is recognized as identical to a paper e_j belonging to a group
E_k from eLIBRARY.ru, it is assigned to the group S_k.
    A paper s_i ∈ S is considered to be identical to a paper e_j ∈ E in the following cases:

Case 1 Title(s_i) = Title(e_j) AND Authors(s_i) = Authors(e_j)
    The unique identifier of a paper is its DOI; regrettably, only 56% of our sample papers
have a DOI specified in eLIBRARY.ru. The title cannot identify a paper uniquely, as some
authors can have several publications with the same title. Nevertheless, the exact match of
titles and author names can be considered as evidence that the papers were authored by the same
person. However, some paper titles differ in SpringerLink and eLIBRARY.ru due to scanning
errors. For example, the paper titled SCHEMATOLOGY IN A MULTI-LANGUAGE OPTIMIZER in eLIBRARY.ru
has the title Schematology in a MJ I/T I-language OPT imizer
in SpringerLink. In the absence of an exact match of the paper titles, both titles are stemmed
by the Porter stemmer and their overlap score is calculated. If this score exceeds a threshold
value, the titles are considered coinciding. The discovered matching is written to a separate
file for later review by the user.

Case 2 Cross-language identification of paper and journal titles.
    Many Russian journals are first published in Russian and then translated into English. A
typical example is the Программирование journal, which is published in English as Programming
and Computer Software. About 40% of the older eLIBRARY.ru entries have only a Russian
description and no English counterpart. These publications, however, are very important for
making confirmed paper groups as large as possible.
    There are several problems involved in this situation. First, it is impossible to compare
papers by title when the title of an original paper is in Russian and the title of a translated
paper is in English. Besides, the original and the translated papers have disjoint sets of
attributes such as venue, ISSN, publication date, page numbers, etc.
    Although SpringerLink provides information about journal titles in the Latin alphabet only,
every translated paper in the database mentions its Russian original. For example, the paper by
A. P. Ershov, Design characteristics of a multilanguage programming system, has the label
“Translated from Kibernetika, No. 4, pp. 11–27, July–August, 1975” in SpringerLink. Moreover,
the SpringerLink database provides the ISSN of the translated version. This information
suffices to find the Russian version of the paper if it is available in eLIBRARY.ru. The
corresponding English-language article is marked as matched and the pair of papers is saved for
further processing.
    On average, about 69% of the papers were assigned to the confirmed groups during the
matching step, while the number of erroneously attributed publications was close to zero. The
main reason why the system cannot assign some papers to their author is data sparsity. To
extend the set of the identified authors of papers, a clustering algorithm was applied to the
unmatched papers.

3.5 Clustering unmatched papers
Two important aspects of our algorithm are the cross-language generation of the confirmed
groups and the similarity evaluation between unmatched papers. The papers of SpringerLink which
were not grouped at the previous step are now grouped together if their similarity exceeds a
specified threshold. (The threshold and all the attribute score values can be adjusted during
the interactive visualization step.) The schema of the algorithm is as follows.
    Let A = ∪(A_gi) be the set of papers obtained after the matching step of SpringerLink and
eLIBRARY.ru papers, where g_i is a group number. For the group of unmatched papers, g_i = -1.
    Then the following algorithm is applied:

    For each paper s ∈ A
        For each paper t ∈ A
            d := similarity_score(s, t)
            If (d > threshold)
                If (Group(s) = -1 and Group(t) = -1)
                    NewGroup(s, t)
                Else
                    MergeGroups(s, t)

    When merging two groups, the algorithm checks whether both groups belong to the set of the
confirmed groups. If they do, the merge is not performed, since the confirmed groups correspond
to the articles by distinct authors.

3.6 Papers similarity scores calculation
To calculate the assignment likelihood of an ambiguous author A to a confirmed group G, we
consider the similarity between the paper p_A and each paper p_i ∈ p_G. Given the attributes
collected by the SpringerLinkExtractor, all the attributes are pairwise compared, which results
in a number of scores that are summed in the final step.
    Title similarity. If an exact match of the titles of the papers A and B is found, the
title_similarity_score is set to 1.0. Otherwise, the titles of the papers A and B are stemmed,
and the title_similarity_score is set to the overlap ratio of their word lists.
    Co-authors similarity. The co-author_similarity_score uses the Jaccard index to evaluate
the overlap ratio of the co-author lists of the two papers.
    Subjects and keywords similarity. The subject_similarity_score and
keyword_similarity_score are calculated in the same way as the co-author_similarity_score.
    Date similarity. The date_similarity_score is set to 0.1 if the timestamp difference of the
papers A and B is less than five years. If the timestamp difference of the papers A and B is
more than twenty-five years, it is set to -0.1.
    Venue similarity. The publication_venue_score (i.e., conference/journal title) is set to
0.1 if there is an exact match between the venue titles.
    Text similarity. The text_similarity_score is evaluated by the TF-IDF and cosine similarity
measure.
    The final assignment likelihood is calculated as the sum of all the above scores.

4 Interactive visualization for understanding and editing the matching and clustering results

Several interrelated visualizations seek to simplify the understanding and editing of the
matching and clustering results. A global view of the obtained groups of publications is shown
in Fig. 2 as a pie chart. Each segment of the pie chart corresponds to a separate group of
publications attributed to a single author. The size of a segment in the pie chart is
proportional to the number
of documents assigned to this group. A short textual description of a chosen group of documents
appears in the right panel after a mouse click on a segment of the pie chart. A set of
checkboxes in the center of the global view enables the interactive adjustment of the
clustering results. Users can change the list of parameters taken into account by the
clustering cost function as well as the group similarity threshold value. When the “Recalculate
groups” button is pressed, the system automatically recalculates the clustering results, which
makes the clustering algorithm interactively adjustable through the visualization.

Figure 2 A global view of the matching and clustering results

    The “Change groups” button displays another window which allows interactive modification of
the clustering results by dragging a paper from one group to another.
    The “Save results” button allows saving the updated clustering results.
    The “Show details” button provides access to the visualization of individual group
parameters. For example, a group of papers can be represented as an adjacency matrix A shown in
Fig. 3. Each entry a_ij ∈ A is shown as a colored circle with its radius proportional to the
similarity value between the papers p_i and p_j.
    If a paper is assigned to the group by the procedure of matching with eLIBRARY.ru, the
corresponding diagonal circle is green; otherwise it is blue. For example, a group of papers
assigned by the matching and clustering procedure to the employee of the IIS SB RAS Kas’yanov
V.N. is shown in Fig. 3. It is easy to see that all but one paper by Kas’yanov V.N. were found
in eLIBRARY.ru, and many of them have descriptions in Russian only. When an entry a_ij is
chosen by a mouse click, it is highlighted in red, and the description of the corresponding
document pair appears, as well as a detailed explanation of the coefficient obtained.
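The coefficient explained for a pair of papers is the sum of the attribute scores of Section 3.6. A minimal Python sketch of such a breakdown is given below. Several points are assumptions rather than facts from this paper: the paper records are plain dictionaries with hypothetical keys, `stem` is a crude suffix stripper standing in for the Porter stemmer, the title overlap ratio is taken to be the Jaccard index, the date score is 0 for differences between five and twenty-five years, a non-matching venue scores 0, and the subject and TF-IDF text scores are omitted for brevity.

```python
import re


def stem(word: str) -> str:
    """Crude suffix stripper; a stand-in for the Porter stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def jaccard(a, b) -> float:
    """Jaccard index of two collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def title_score(t1: str, t2: str) -> float:
    """1.0 for an exact match, otherwise overlap of stemmed word sets."""
    if t1 == t2:
        return 1.0
    words1 = {stem(w) for w in re.findall(r"\w+", t1.lower())}
    words2 = {stem(w) for w in re.findall(r"\w+", t2.lower())}
    return jaccard(words1, words2)  # assumption: overlap ratio = Jaccard


def date_score(year1: int, year2: int) -> float:
    """0.1 below five years apart, -0.1 above twenty-five, else 0.0."""
    diff = abs(year1 - year2)
    if diff < 5:
        return 0.1
    if diff > 25:
        return -0.1
    return 0.0  # assumption: the paper does not define this range


def similarity(p: dict, q: dict) -> dict:
    """Return the per-attribute scores and their sum under 'total'."""
    parts = {
        "title": title_score(p["title"], q["title"]),
        "coauthors": jaccard(p["authors"], q["authors"]),
        "keywords": jaccard(p["keywords"], q["keywords"]),
        "date": date_score(p["year"], q["year"]),
        "venue": 0.1 if p["venue"] == q["venue"] else 0.0,
    }
    parts["total"] = sum(parts.values())
    return parts
```

Returning the individual scores next to the total mirrors what the detail view needs: the total determines the circle radius, while the parts provide the explanation shown to the user.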




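Recalculating the groups amounts to rerunning the grouping loop of Section 3.5 with the current threshold and scores. The Python sketch below shows one way to do this; it is an illustration under stated assumptions, not the authors' code. The function `cluster`, its parameters and the constant `UNMATCHED` are hypothetical names, the similarity is assumed to be symmetric so that each unordered pair of distinct papers is visited once, and when a confirmed group is merged with an unconfirmed one the confirmed group number is kept.

```python
from itertools import combinations

UNMATCHED = -1  # group number of papers not assigned by the matching step


def cluster(papers, group, confirmed, score, threshold):
    """Group unmatched papers by pairwise similarity.

    papers    -- list of paper identifiers
    group     -- dict: paper id -> group number (UNMATCHED if none)
    confirmed -- set of group numbers created by the matching step
    score     -- function (s, t) -> similarity value
    Returns a new dict with the updated group numbers.
    """
    group = dict(group)  # do not modify the caller's assignment
    next_id = max(list(group.values()) + [0]) + 1
    for s, t in combinations(papers, 2):
        if score(s, t) <= threshold:
            continue
        gs, gt = group[s], group[t]
        if gs == UNMATCHED and gt == UNMATCHED:
            # NewGroup(s, t): both papers start a fresh group
            group[s] = group[t] = next_id
            next_id += 1
        elif gs == gt:
            continue  # already in the same group
        elif gs in confirmed and gt in confirmed:
            continue  # confirmed groups belong to distinct authors
        elif gs == UNMATCHED:
            group[s] = gt  # an unmatched paper joins the other group
        elif gt == UNMATCHED:
            group[t] = gs
        else:
            # MergeGroups(s, t): keep the confirmed number if there is one
            keep, drop = (gs, gt) if gs in confirmed else (gt, gs)
            for paper, g in group.items():
                if g == drop:
                    group[paper] = keep
    return group
```

Passing the similarity as a function keeps the loop independent of which attribute scores the user has enabled, so a recalculation only needs a new `score` and `threshold`.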
Figure 3 Paper similarity adjacency matrix

    The "Co-authorship" button opens another window representing co-authors of scientific
publications in the form of a matrix (see Fig. 4).

Figure 4 Co-authorship table for a group of papers

    One more window can be opened by the "By year" button. This view is shown in Figure 5. It
represents the distribution of papers by year.

Figure 5 Distribution of papers by year

    These two views allow for a visual search of the so-called "group outliers" that do not
really belong to the same author. By choosing a paper of interest with a mouse click and
pressing the "Remove from the group" button, the user can change the paper allocation. The
clustering algorithm will either automatically move this paper to another group, or create a
new group containing this paper.

Conclusion
The newly developed matching procedure provides the algorithm presented in this paper with the
ability not only to cluster the papers correctly, but also to determine the exact identity of
authors, including the name and location of the affiliating organization.
    The program implementing the algorithm has been tested on a dataset of 100 persons employed
by the IIS SB RAS at various time periods. Also, this dataset contains Academician A.P. Ershov,
whose papers have been input into eLIBRARY.ru by IIS SB RAS. The total number of papers found
in SpringerLink for all Russian names in this dataset was equal to 3,175. All the results
obtained by the program were verified manually. For each person listed in the test dataset the
following values were calculated:
• total number of papers found in SpringerLink for each Russian full name listed in the test
  dataset;
• number of articles actually authored by a researcher specified in the test dataset;
• number of papers that have been correctly recognized by the matching algorithm;
• number of papers that have been correctly recognized by the matching + clustering algorithm.
These experiments have shown that 69.4 percent of papers have been correctly recognized by the
matching algorithm, 86.6 percent by the clustering algorithm, and 95 percent by the matching +
clustering algorithm.

References

   [1]  Apanovich Z.V., Marchuk A.G.: Experiments on using the LOD cloud datasets to enrich the
        content of a scientific knowledge base. In: KESW 2013, CCIS 394, pp. 1-14. Springer
        Verlag, Berlin Heidelberg (2013)
   [2]  Apanovich Z., Marchuk A.: Experiments on Russian-English identity resolution. In:
        Proceedings of the ICADL-2015 Conference, Seoul, South Korea, LNCS 9469, pp. 12-21.
        Springer International Publishing Switzerland (2015)
   [3]  D'Angelo C.A., Giuffrida C., Abramo G.: A heuristic approach to author name
        disambiguation in bibliometrics databases for large-scale research assessments. In:
        Journal of the American Society for Information Science and Technology 62(2),
        pp. 257-269 (2011)
   [4]  Ferreira A. A., Gonçalves M. A., Laender A. H. F.: Disambiguating Author Names in Large
        Bibliographic Repositories. In: Internat. Conf. on Digital Libraries, New Delhi, India
        (2013)
   [5]  Hickey T. B., Toves J. A.: Managing Ambiguity in VIAF. In: D-Lib Magazine 20
        (July/August) (2014). doi: 10.1045/july2014-hickey.
        http://www.dlib.org/dlib/july14/hickey/07hickey.html
   [6]  Kang H., Getoor L., Shneiderman B., Bilgic M., Licamele L.: Interactive entity
        resolution in relational data: A visual analytic tool and its evaluation. In: IEEE
        Transactions on Visualization and Computer Graphics, 14(5), pp. 999–1014 (2008)
   [7]  Lawrie D., Mayfield J., McNamee P., Oard D. W.: Creating and curating a Cross-language
        Person-entity linking collection. (2012)
   [8]  Lawrie D., Mayfield J., McNamee P., Oard D. W.: Cross-Language Person-Entity Linking
        from Twenty Languages (2015)
   [9]  Reijnhoudt L., Costas R., Noyons E., Boerner K., Scharnhorst A.: "Seed+ expand": A
        validated methodology for creating high quality publication oeuvres of individual
        researchers. In: Proceedings of ISSI 2013 Vienna, arXiv:1301.5177 (2013)
   [10] Schulz Chr., Mazloumian A., Petersen A. M., Penner O., Helbing D.: Exploiting citation
        networks for large-scale author name disambiguation. In: EPJ Data Science, 3 (11),
        pp. 1-14 (2014)
   [11] Shen Q., Wu T., Yang H., Wu Y., Qu H., Cui W.: NameClarifier: A Visual Analytics System
        for Author Name Disambiguation. In: IEEE Transactions on Visualization and Computer
        Graphics, vol. 23, no. 1, pp. 141-150 (2017)
   [12] Song Y., Huang J., Councill I.G., Li J., Giles C. L.: Efficient Topic-based
        Unsupervised Name Disambiguation. In: Proc. of the 7th ACM/IEEE-CS Joint Conf. on
        Digital Libraries, pp. 342–351 (2007)
   [13] Stoffel F., Jentner W., Behrisch M., Fuchs J., Keim D.: Interactive Ambiguity
        Resolution of Named Entities in Fictional Literature. In: Computer Graphics Forum,
        v. 36, n. 3, pp. 189-200 (2017)
   [14] Strotmann A., Zhao D., Bubela T.: Author name disambiguation for collaboration network
        analysis and visualization. In: Proceedings of the American Society for Information
        Science and Technology, 46(1), pp. 1–20 (2009)
   [15] Sun Z., Hu W., Li C.: Cross-Lingual Entity Alignment via Joint Attribute-Preserving
        Embedding. In: d'Amato C. et al. (eds) ISWC 2017, Part I, LNCS 10587, pp. 628–644
        (2017). DOI: 10.1007/978-3-319-68288-4_37