=Paper= {{Paper |id=Vol-3066/spaper1 |storemode=property |title=Russian and International Data Sources: Integration of Data on Russian Research Organizations (short paper) |pdfUrl=https://ceur-ws.org/Vol-3066/spaper1.pdf |volume=Vol-3066 |authors=Zinaida Apanovich |dblpUrl=https://dblp.org/rec/conf/ssi/Apanovich21 }} ==Russian and International Data Sources: Integration of Data on Russian Research Organizations (short paper)== https://ceur-ws.org/Vol-3066/spaper1.pdf
Russian and International Data Sources: Integration of Data
on Russian Research Organizations
Zinaida V. Apanovich
A.P. Ershov Institute of Informatics Systems, Siberian Branch, Russian Academy of Sciences, Lavrentieva pr., 6,
Novosibirsk, 630000


                 Abstract
                 This paper considers international and Russian-language data sources providing information
                 about Russian research-related organizations. Information about research organizations is an
                 important attribute that makes it possible to identify the authors of scientific publications,
                 to analyze the geographical distribution of publications, and to assess how geographic factors
                 affect the citation of publications. However, information about national research organizations,
                 Russian ones in particular, is often incomplete or distorted in international databases. Data
                 sources such as GRID, the Russian and English chapters of Wikipedia, Wikidata and eLIBRARY.ru
                 are considered. It is demonstrated that Russian-language data sources contain more information
                 about Russian research-related organizations than most international data sources, but this
                 information is not available in English-language data sources. To solve this problem, a method
                 for integrating information from multilingual data sources has been developed. Experiments on
                 the comparison and integration of information about Russian research organizations in
                 international and Russian data sources are outlined. An experimental version of a database
                 comprising 3143 scientific organizations has been created. The work is an intermediate step
                 towards the creation of an open and extensible knowledge graph.

                 Keywords
                 Knowledge graph, multilingual knowledge graphs, identity resolution, research-related
                 organizations, correctness

1. Introduction
    Information on research organizations is an important attribute that makes it possible to identify the
authors of scientific publications, to analyze the geographical distribution of publications, and to assess
how geographic factors affect the citation of publications [3]. Regrettably, information about national
research organizations, Russian ones in particular, is often incomplete or distorted in international
databases.
    One of the largest international open databases of scientific organizations is GRID [4] (Global Re-
search Identifier Database, https://www.grid.ac/). GRID is a free and openly accessible global database
of research-related organizations, cataloging research-related organizations and providing each of them
with a unique and persistent identifier. Its data is downloadable as Excel, JSON or RDF (ttl format)
files. This database contains information on more than 102,390 research-related organizations from 220
countries. The GRID data are integrated into the SN SciGraph knowledge graph developed by Springer
(https://www.springernature.com/gp/researchers/scigraph).
    The information on the organizations presented in GRID includes their postal address (100%),
geographic coordinates (longitude and latitude) (99%), and the URL (90%). Each organization is provided
with such attributes as geonames_city_id, geonames_country_id, geonames_country_code, etc. Thanks
to the use of GRID data, all information about publications stored in the SN SciGraph is geo-referenced.

SSI-2021: Scientific Services & Internet, September 20–23, 2021, Moscow (online)
EMAIL: apanovich_09@mail.ru (Z.V. Apanovich)
ORCID: 0000-0002-5767-284X (Z.V. Apanovich)
© 2020 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

    Also, GRID maintains links to global bibliographic resources such as ROR (Research Organization
Registry, https://ror.org), Crossref (https://www.crossref.org/), and ISNI (International Standard Name
Identifier, https://isni.oclc.org/). GRID currently contains data on 2066 Russian research organizations.
    However, the information pertaining to Russian research organizations is incomplete and contains
obvious inaccuracies. For example, GRID has a page dedicated to the Siberian Branch of the Russian
Academy of Sciences (SB RAS, https://www.grid.ac/institutes/grid.415877.8). This page indicates that
“Институт космофизических исследований и аэрономии им. Ю.Г. Шафера Сибирского отделе-
ния Российской академии наук” is the Russian appellation for “SB RAS.” In actual fact, this is the
Russian equivalent of the “Shafer Institute of Cosmophysical Research and Aeronomy”
(https://www.grid.ac/institutes/grid.435157.1).
    Also, along with several institutes formally related to the SB RAS, some educational organizations
of different subordination are listed as the subsidiary organizations (“child institutes”) of the SB RAS.
For example, the East-Siberian Institute of the Ministry of Internal Affairs of the Russian Federation,
(https://www.grid.ac/institutes/grid.445063.0), Siberian Law Institute of Russian Federal Drug Control
Service (https://www.grid.ac/institutes/grid.445537.4), etc. are mentioned by GRID as “child institutes”
of the SB RAS.
    Since GRID provides links to global bibliographic resources such as ROR, Crossref, and ISNI, it is
interesting to find out whether the information about Russian organizations presented on these interna-
tional platforms differs from that in GRID. Regrettably, our experiments have shown that ROR copies
information about Russian research-related organizations, either correct or erroneous, from GRID. For
example, the outdated information on the website of the no longer existing Novosibirsk Humanitarian
Institute presented in GRID (https://grid.ac/institutes/grid.445355.6) is also found in ROR
(https://ror.org/00nnwpb90).
    A more important example of a copied error is the link between the GRID page dedicated to the
Siberian Branch of the Russian Academy of Sciences (https://www.grid.ac/institutes/grid.415877.8)
and the ID of this organization in ROR, https://ror.org/02frkq021, which gives two “equivalent” names
of this organization: “SB RAS” and “Institute of Space Research and Aeronomy Named after Yu.G.
Shafer of the Siberian Branch of the Russian Academy of Sciences”.
    Both GRID and ROR present incomplete data; for example, neither of them contains information about
the A.P. Ershov Institute of Informatics Systems SB RAS or about many other Russian scientific
organizations.
    There is a striking discrepancy between the number of Russian organizations in GRID and the number
listed in eLIBRARY.ru [5], the largest database of Russian research-related organizations.
    These examples suggest that Russian-language data sources contain more complete and correct in-
formation on Russian organizations than their English-language counterparts. The largest Russian-lan-
guage data sources on Russian organizations are eLIBRARY.ru and Russian Wikipedia. It is reasonable
to compare the organizations shown in GRID and in Russian Wikipedia.
    These examples signal the need for the integration of the information contained in international and
in Russian data resources. To solve this problem, a method for integrating information from multilin-
gual data sources has been developed.

2. GRID and Russian data sources: data comparison

2.1.    GRID and Wikipedia
    The GRID database maintains links to the pages of Russian organizations in the English-language
Wikipedia. Although these links look natural to an English speaking user, it would be even more natural
to search for the information on Russian organizations in the Russian-language Wikipedia. At the time
of our experiments, GRID contained 2019 pages of Russian organizations, only 412 of which had links
to the pages in the English-language Wikipedia. Among these 412 pages, 398 pages were related by
interlanguage links to the pages of the Russian-language Wikipedia. Predictably, the Russian-language
Wikipedia contains more information about the Russian organizations presented in GRID. For example,
the GRID page devoted to the Federal Agency for Scientific Organizations (FASO,
https://www.grid.ac/institutes/grid.484124.f) states that FASO was established in 2013; the address
given under “Institute Links” (https://fano.gov.ru/en/) cannot be reached, and no link to the English
Wikipedia is provided. However, there is a relevant page in the Russian Wikipedia
(https://ru.wikipedia.org/wiki/Федеральное_агентство_научных_организаций) stating that this
organization was abolished on May 15, 2018. The same information is duplicated in Wikidata (Federal
Agency for Scientific Organizations, Q16711297), but GRID does not link to this page.
    To test our hypothesis, we extracted a list of Russian research organizations from GRID, together
with such attributes as the English name of an organization, its Russian name, acronyms, aliases, the link
to the organization’s website, the link to the organization’s page in the English Wikipedia, city and
country. Of all the Russian research organizations presented in GRID, only 412 had a link to a page in
the English version of Wikipedia, and 398 of these had a link to a Russian-language Wikipedia page.
    Then, we used the data extracted from GRID to search for the appropriate organizations in the Rus-
sian version of Wikipedia by means of Wikipedia_API.
    The attributes used included the URL of the organization's website, its English-language name and
Russian-language names, etc. Even if not highly efficient, this search produced another 674 pages found
in the Russian-language Wikipedia. Among them, 353 Russian pages were linked by cross-language
links to the English-language Wikipedia.
    In total, 835 matchings between the Russian Wikipedia and GRID pages were found. Thus, this
experiment has shown that though the Russian-language version of Wikipedia stores much more infor-
mation about Russian scientific organizations than the English-language version, this information re-
mains inaccessible to the English-language databases.
    The main problem was that explicit links are few, and the search by the names of organizations
is complicated because different databases contain different names for the same organizations. We plan
to improve the existing imperfect matching algorithm.

2.2.    GRID and eLIBRARY.ru
    eLIBRARY.ru (Q4037789) is the leading electronic library of scientific periodicals in Russia. It
stores data on science, technology, medicine and education and includes information on over 34 million
publications, more than 1 million researchers, and over 12,000 organizations. The contrast with the
number of Russian organizations stored in GRID (2066) is striking. What is the reason for such a big
difference? It is easy to see that only a part of the organizations listed in eLIBRARY.ru are
research-related. The list contains all federal ministries and bodies subordinated to these ministries,
regional administrations, commercial organizations, banks, hospitals, individual entrepreneurs, etc. For
example, it is possible to find House-Building Plant No. 7, a company having neither publications nor
references; the only information on this organization is its postal and legal address. In total, about one
third of the organizations stored by eLIBRARY.ru (4505 organizations) are not related to publication
activity. Just like GRID, eLIBRARY.ru contains many descriptions of organizations that no longer exist.
    Each organization in eLIBRARY.ru is described using such attributes as the full name of the
organization in Russian and in English, the Russian and English acronyms, country, region, the Russian
and English names of the city in which the organization is located, postal address in Russian and in
English, legal address, parent organization, type of organization, fax, email and website URL.
Each organization has a unique identifier. For example, the identifier of the Ershov Institute of
Informatics Systems is 593, and that of the SB RAS is 2378. There are no links to global bibliographic
data sources in eLIBRARY.ru. In order to compare the data on the organizations listed in eLIBRARY.ru
and GRID, a program was written. It discovered only 709 matchings between the eLIBRARY.ru and
GRID pages describing Russian research-related organizations. The main reason for the small number
of matchings is the difference in the spelling of organization names in the two data sources.
    Currently, there is a data source that tries to integrate information about organizations from all lan-
guage chapters of Wikipedia. Moreover, it collects all the identifiers assigned to the organizations by
global organizations. This data source is Wikidata (wikidata.org).



3. eLIBRARY.ru and Wikidata
    An example of a highly promising international data source is Wikidata.org. Wikidata emerged in
2014 [6] as a structured data source for fact management in various language versions of Wikipedia.
Wikidata’s developers plan to make it the central data management platform for Wikipedia, integrating
data from all Wikipedia language "chapters".
    To integrate data, each entity is assigned an identifier independent of a specific language version,
and all statements concerning this entity found in all language versions of Wikipedia are combined.
Like GRID, Wikidata supports links to global bibliographic resources by specifying the identifiers of
organizations in these data sources. In particular, the following identifiers are supported in
wikidata.org: Virtual International Authority File ID (VIAF ID, property P214), Library of Congress
authority ID (property P244), GRID.ac global research identifier database ID (property P2427), ROR
Research Organization Registry ID (property P6782), Russian organization number (OGRN, property
P7011), ISNI International Standard Name Identifier ID (property P213), eLIBRARY.ru organization ID
(property P2463), and Crossref funder ID (property P3153).
    Wikidata also contains the short names of an organization in its native language and in English,
information about its type and geographical location (country, region), and the dates of inception and
closure. For example, the SB RAS page (https://www.wikidata.org/wiki/Q3032414) provides alternative
names of this institution in 14 languages. However, for unknown reasons, its official name (P1448) is
shown in Belarusian, and the list of its subsidiaries (P355) contains the same mistakes as the list
of child institutes shown on the corresponding GRID page.
    Another example is the A.P. Ershov Institute of Informatics Systems (Q4201722), which is not listed
as a subsidiary of the SB RAS. Moreover, the Wikidata page indicates that the institute was named after
Alexandra Petrovna Ershova (Q60830445), a Russian theater teacher, rather than Academician Andrey
Ershov (Q1961494), a Russian computer scientist. In addition, the image shown on this Wikidata page
evidently has nothing to do with Academician Andrey Ershov, whereas the photo on the corresponding
Russian-language Wikipedia page is correct.
    Numerous facts of this kind point to the need to establish correspondence between data, compare
and verify the difference between Russian and international data sources.
    Despite the existence of a special property describing the identifier of an institution in
eLIBRARY.ru, few Russian institutions represented in Wikidata.org have eLIBRARY.ru identifiers. Thus,
a SPARQL query looking for scientific organizations (wd:Q16519632) that are located in Russia and have
an eLIBRARY.ru identifier specified in wikidata.org returned a list of eighty-six institutions, mainly
educational. Novosibirsk State University, for example, has the eLIBRARY.ru identifier 214. An example
of a query retrieving scientific organizations with an eLIBRARY.ru identifier from Wikidata is shown
in Figure 1.




Figure 1: SPARQL query retrieving scientific organizations with the identifier of eLIBRARY.ru in
wikidata.org
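
The query of Figure 1 is not reproduced in this text. A query of the kind described might look like the following minimal sketch. It assumes the standard Wikidata properties P31 (instance of), P279 (subclass of) and P17 (country), the item Q159 (Russia), and the public endpoint https://query.wikidata.org/sparql; the function names build_query and run_query are hypothetical, and the author's actual query may differ, for example in whether subclasses are traversed.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(class_qid="Q16519632", country_qid="Q159", id_property="P2463"):
    """Build a SPARQL query for organizations of a class, in a country, with an ID property."""
    return f"""
SELECT ?org ?orgLabel ?elibraryId WHERE {{
  ?org wdt:P31/wdt:P279* wd:{class_qid} ;
       wdt:P17 wd:{country_qid} ;
       wdt:{id_property} ?elibraryId .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "ru,en". }}
}}
""".strip()


def run_query(query):
    """Send the query to the Wikidata Query Service and return the result bindings."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(url, headers={"User-Agent": "org-matching-sketch/0.1"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]
```

Calling run_query(build_query()) requires network access; the number of rows returned depends on the current state of Wikidata and need not equal the figure reported above.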

    Table 1 shows the number of entities of the classes Organization and Scientific Organization that
have identifiers in various global bibliographic data sources. For example, only four of the 274
institutions subordinate to the Russian Academy of Sciences (wd:Q4201890) have an eLIBRARY.ru
identifier, which means that the task of comparing data in the above-mentioned sources is very relevant.
Below we describe our approach to solving this problem.




Table 1.
The number of entities Organizations and Scientific Organizations having identifiers in various global
bibliographic data sources

   Data source                     Scientific organizations         Organizations
   (Wikidata property)             (Q1651963, 21473 in total)       (Q43229, 34911 in total)
                                   having the property              having the property
   GRID (P2427)                    989                              1535
   eLIBRARY.ru (P2463)             86                               103
   OGRN (P7011)                    863                              1186
   VIAF (P214)                     769                              2094
   Library of Congress (P244)      648                              1538
   ROR (P6782)                     982                              1527
   ISNI (P213)                     714                              717

    Note that, as the data stored in Wikidata are incomplete, it is currently impossible to obtain
complete and reliable information using SPARQL queries alone. For example, a SPARQL query searching
for all Russian scientific organizations returns only the organizations whose Wikidata pages explicitly
state that they are located in Russia. If the requirement of Russian affiliation is made optional
(OPTIONAL), the resulting data set includes some organizations without any information about their
country. Therefore, the problem of matching and comparing entities described in different data sources
needs to be solved programmatically.
    The input of the data integration program is the table Organizations. Each row of this table
corresponds to an institution listed in eLIBRARY.ru. The columns of the table Organizations correspond to
such attributes of eLIBRARY.ru as the full name of an organization in Russian, its name in English,
the Russian abbreviation of the name, the English abbreviation of the name, country, region, the Russian
name of the city where the organization is located, the English name of the city, the postal address of
the organization in Russian and in English, its legal address, its parent organization, the type of the
organization, its fax, and its official website.
    The integration algorithm produces an extended table showing whether the description of an
organization was found in Wikidata. In case of a positive result, the corresponding row of the table is
supplemented with information extracted from Wikidata. In particular, the Wikidata_name of the
organization, Wikidata_identifier, Wikidata_alias, Wikidata_year of foundation, and international
identifiers, such as VIAF_Id, eLIBRARY_Id, GRID_Id, are added. The possible negative results are
“organization not found” and “there is not enough data to identify the organization”.
    The integration algorithm is structured as follows:
        1. Pre-processing the names of organizations (conversion to lower case, deletion of words
            such as "ЗАО" (CJSC, closed joint-stock company), "ООО" (private limited liability
            company), "ОАО" (OJSC, open joint-stock company), "им", "имени" (named after), etc., cyclic
            replacement of some words using a dictionary of synonyms). For example, the words “RF”,
            “Russia” and “the Russian Federation” in the name of an institution should be considered
            as synonyms. Also, the words “mayor's office”, “administration” and “government” are
            often used interchangeably in the names of organizations.
        2. Searching for entities through the Wikidata API by one of the names of an organization
            specified in the table (4 variants of the name, transformed names, URL). If the search is
            successful, a JSON file is returned with brief information about the element found. These
            data make it possible to extract additional information about the entity: its name,
            identifier, entity type, and all the names available. Wikidata lists all available names
            of an entity in the “Also known as” field.
        3. Checking whether an entity found in Wikidata is equivalent to an institution from
            eLIBRARY.ru. To do this, the matching coefficient between all the variations of the full and
            transformed names of the entity is calculated, the URLs of the organizations are compared,
            the type of the entity is determined, and the country where it is based and its location
            inside the country are checked.
        4. Supplementing the original table with information from Wikidata.
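
Steps 1 and 2 can be sketched as follows. This is a minimal illustration under several assumptions: the synonym table is a small hypothetical sample (the actual dictionary is not given in the paper), synonyms are mapped to one canonical form instead of being substituted cyclically, and the search uses the wbsearchentities action of the Wikidata API, which the paper does not name explicitly. The function names normalize_name and build_search_url are hypothetical.

```python
import re
import urllib.parse

# Legal-form and "named after" tokens mentioned in step 1.
STOP_WORDS = {"зао", "ооо", "оао", "им", "имени"}

# Illustrative synonym table, longest phrases first (assumption: the real dictionary is larger).
SYNONYMS = [
    ("the russian federation", "russia"),
    ("russian federation", "russia"),
    ("rf", "russia"),
    ("mayor's office", "administration"),
    ("government", "administration"),
]


def normalize_name(name: str) -> str:
    """Lower-case a name, strip punctuation, unify synonyms and drop stop words."""
    text = name.lower()
    text = re.sub(r'["«»“”.,()]', " ", text)
    for phrase, canonical in SYNONYMS:
        text = re.sub(rf"\b{re.escape(phrase)}\b", canonical, text)
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


def build_search_url(name: str, language: str = "ru", limit: int = 5) -> str:
    """Build a Wikidata API URL that searches for items by label or alias."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return "https://www.wikidata.org/w/api.php?" + urllib.parse.urlencode(params)
```

With this sample table, "Government of the Russian Federation" and "Administration of RF" are reduced to the same normalized string, so either variant can be used as the search key.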

Step 3 of the integration algorithm is the most difficult and consists of the following stages.

   3.1 Comparing the names of organizations. When the two lists of names of an organization extracted
   from eLIBRARY.ru and Wikidata are compared, the string similarity of these names is estimated.
   First, each pair of strings is tested for complete textual coincidence. In case of a negative result, each
   string is divided into separate words, which are converted to the nominative form using a
   morphological analysis module. For the resulting strings, the matching coefficient is calculated using
   the following formula:

   \[ \text{matching coefficient} = 2 \cdot \frac{\text{number of matching words}}{\text{total number of words}} \]

   The best matching coefficient is calculated over all the name variants. If for a pair of names this
   coefficient is greater than 0.68, the result is considered positive, and the rest of the attributes are
   compared.
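
The coefficient can be illustrated with a short sketch. It assumes that "total number of words" means the sum of the word counts of the two names, so that identical names score 1, and it omits the morphological normalization step; the function names are hypothetical.

```python
from collections import Counter

THRESHOLD = 0.68  # threshold quoted in the text


def matching_coefficient(name_a: str, name_b: str) -> float:
    """2 * (number of matching words) / (total number of words in both names)."""
    words_a = Counter(name_a.lower().split())
    words_b = Counter(name_b.lower().split())
    total = sum(words_a.values()) + sum(words_b.values())
    if total == 0:
        return 0.0
    matching = sum((words_a & words_b).values())  # multiset intersection
    return 2 * matching / total


def best_coefficient(names_a, names_b) -> float:
    """Best coefficient over all pairs of name variants from the two sources."""
    return max(
        (matching_coefficient(a, b) for a in names_a for b in names_b),
        default=0.0,
    )
```

Under this reading, two four-word names sharing three words score 2 · 3 / 8 = 0.75 and pass the threshold, while two three-word names sharing two words score 2 · 2 / 6 ≈ 0.67 and do not.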
   3.2 Comparing the URLs of organizations. When comparing the URLs, we noticed that the form of a
   URL may differ from one database to another. For example, the URL of the Administration of the
   Arkhangelsk Region in eLIBRARY.ru is http://www.dvinaland.ru/, and in Wikidata it is
   https://dvinaland.ru/. A literal comparison would therefore be negative because of the difference
   between the “http” and “https” prefixes and the presence of “www” in one of the addresses.
   To avoid such situations, these elements are removed before the URLs are compared.
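
A minimal sketch of such a normalization is given below. The lower-casing and the removal of a trailing slash are additional assumptions, and the function names are hypothetical.

```python
import re


def normalize_url(url: str) -> str:
    """Strip the scheme, a leading 'www.' and a trailing slash from a URL."""
    text = url.strip().lower()
    text = re.sub(r"^https?://", "", text)
    text = re.sub(r"^www\.", "", text)
    return text.rstrip("/")


def urls_match(url_a: str, url_b: str) -> bool:
    """Compare two website URLs after normalization."""
    return normalize_url(url_a) == normalize_url(url_b)
```

With the example from the text, both http://www.dvinaland.ru/ and https://dvinaland.ru/ are reduced to dvinaland.ru.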
   3.3 Checking the type of the object found. It is necessary to make sure that the object found is
   indeed an organization. Using a SPARQL query, we generated a CSV file containing the names of all
   subclasses of the Organization class in the Wikidata ontology. The type of the entity found is compared
   with the elements of this file. If no element matches, the entity is rejected.
   3.4 Checking the location of the organization. If a country other than Russia is specified as the
   country where the institution is based, the search stops and the database records that the organization
   is not located in Russia. If no country is indicated on the Wikidata page, information about the
   location of the headquarters is extracted. As a rule, Wikidata indicates the city and the
   administrative-territorial unit. Note that Wikidata may indicate, for example, the Moscow region
   instead of Moscow, which can lead to an incorrect comparison result. To solve this problem, we used
   information about the hierarchy of geographic objects. We created a JSON file that maps all Russian
   cities to their regions. The program extracts the name of the city from Wikidata; if it does not match
   the location specified in eLIBRARY.ru, the name of the city is replaced with the name of the
   region in which the city is located. Subsequently, the geographic locations are compared again. If the
   locations still do not match, the candidate is rejected and the program goes on to the next
   institution.
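
The location check of stage 3.4 can be sketched as follows. The city-to-region mapping stands in for the JSON file mentioned above and is passed in as a dictionary; its contents in the test data, the argument names and the function name locations_match are illustrative assumptions.

```python
def locations_match(wikidata_place, elibrary_city, elibrary_region, city_to_region):
    """Compare locations directly, then again after lifting cities to their regions."""

    def norm(value):
        return value.strip().lower()

    # Direct comparison: Wikidata may name either the city or the region.
    if norm(wikidata_place) in (norm(elibrary_city), norm(elibrary_region)):
        return True

    # Replace each city with the region that contains it and compare again.
    wikidata_region = city_to_region.get(wikidata_place, wikidata_place)
    target_region = city_to_region.get(elibrary_city, elibrary_region)
    return norm(wikidata_region) == norm(target_region)
```

For example, with a mapping in which Dubna belongs to Moscow Oblast, a Wikidata location of "Dubna" matches an eLIBRARY.ru record that gives only the region "Moscow Oblast", whereas "Novosibirsk" does not.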
   Thus, at the moment, an organization is considered to be correctly identified in two cases:
        1. The website URLs match and at least one pair of name variants matches.
        2. The entity is an organization, the names of the organization in the two sources coincide,
            the website information is incomplete, and the organization is located in Russia.
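
These two cases can be written as a small decision function. The sketch assumes that "incomplete" website information means the URL is missing in at least one source, represented here by None; the function name and argument names are hypothetical.

```python
from typing import Optional


def is_correct_match(
    urls_match: Optional[bool],
    names_match: bool,
    is_organization: bool,
    in_russia: bool,
) -> bool:
    """Apply the two acceptance cases; urls_match is None when a URL is missing."""
    # Case 1: website URLs match and some pair of name variants matches.
    if urls_match is True and names_match:
        return True
    # Case 2: no usable URL, but type, names and country all agree.
    if urls_match is None and is_organization and names_match and in_russia:
        return True
    return False
```

Under this reading, a candidate whose URLs are both present but different is rejected even if the names coincide.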
   For all organizations recognized by the algorithm as identical, information is combined based on an
extension of the schema.org ontology. Currently, correspondence has been established between 3143
organizations in Wikidata and eLIBRARY.ru. The resulting experimental data source will be further
expanded and integrated with other data sources, such as GRID.

4. Conclusion
   Experiments with English-language and Russian-language data sources have shown that
Russian-language information sources contain more information about Russian scientific organizations.
Regrettably, this information remains largely inaccessible to English-language data sources. To
solve this problem, a method for integrating information from multilingual data sources has been
developed. An experimental version of a database comprising 3143 scientific organizations has been
created. It is planned to turn this database into an open and extensible knowledge graph. The author
also believes that in order to maintain the completeness and correctness of information about scientific
entities, each scientific organization should maintain its own page on international platforms, which
would indicate all the identifiers of the organization.

5. Acknowledgements
   The author thanks Yurieva I.O. and Chvyrova O.S. for participation in the implementation of various
versions of the integration program.

6. References
[1] Z. Apanovich, Matching of authors and publications in multilingual bibliographic knowledge
    bases, in: CEUR Workshop Proceedings. SSI 2019, Proceedings of the 21st Conference on Scientific
    Services and Internet, 2020, pp. 26–37.
[2] A. Hajra, V. Radevski, K. Tochtermann, Author Profile Enrichment for Cross-linking Digital
    Libraries. Research and Advanced Technology for Digital Libraries, Springer International
    Publishing. Lecture Notes in Computer Science 9316 (2015) 124–136.
    https://doi.org/10.1007/978-3-319-24592-8_10.
[3] A. Mannocci, F. Osborne, E. Motta, Geographical trends in academic conferences: An analysis of
    authors’ affiliations. Data Science 2 (1) (2019) 181–203. https://doi.org/10.3233/DS-190015.
[4] Global Research Identifier Database. URL: https://www.grid.ac/.
[5] Scientific online library eLIBRARY.ru. URL: https://www.elibrary.ru/.
[6] A. Ismayilov, D. Kontokostas, S. Auer, J. Lehmann, S. Hellmann, Wikidata through the Eyes of
    DBpedia. URL: http://www.semantic-web-journal.net/system/files/swj1462.pdf.



