<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>(Semi)automated disambiguation of scholarly repositories⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Miriam Baglioni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Mannocci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gina Pavone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michele De Bonis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Manghi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CNR-ISTI - National Research Council, Institute of Information Science and Technologies “Alessandro Faedo”</institution>
          ,
          <addr-line>56124 Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>OpenAIRE AMKE</institution>
          ,
          <addr-line>Athens</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The full exploitation of scholarly repositories is pivotal in modern Open Science, and scholarly repository registries are kingpins in enabling researchers and research infrastructures to list and search for suitable repositories. However, since multiple registries exist, repository managers are keen on registering the repositories they manage multiple times to maximise their traction and visibility across different research communities, disciplines, and applications. These multiple registrations ultimately lead to information fragmentation and redundancy on the one hand and, on the other, force registries' users to juggle multiple registries, profiles and identifiers describing the same repository. Such problems are known to registries, which claim equivalence between repository profiles whenever possible by cross-referencing their identifiers across different registries. However, as we will see, this “claim set” is far from complete and, therefore, many replicas slip under the radar, possibly creating problems downstream. In this work, we combine such claims to create duplicate sets and extend them with the results of an automated clustering algorithm run over repository metadata descriptions. Then we manually validate our results to produce an “as accurate as possible” de-duplicated dataset of scholarly repositories.</p>
      </abstract>
      <kwd-group>
        <kwd>Scholarly Registries</kwd>
        <kwd>Scholarly Repositories</kwd>
        <kwd>De-duplication</kwd>
        <kwd>Open Science</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Scholarly repositories are essential for Open Science practice, as they enable access to research
products and grant their long-term preservation. Besides, they play a crucial role in improving
the visibility, discoverability and reuse of research products [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5">1, 2, 3, 4, 5</xref>
        ].
      </p>
      <p>Given the ever-increasing number of repositories over the years, there is a growing need for
scholarly repository registries providing repositories with an identity to enable non-ambiguous
reference, support provenance tracking and impact assessment. Moreover, registries hold
authoritative descriptive profiles of repositories intended to foster their discoverability.</p>
      <p>19th IRCDL (The Conference on Information and Research science Connecting to Digital and Library science), February
23–24, 2023, Bari, Italy.
⋆ All authors contributed equally.</p>
      <p>
        To this end, specialised scholarly repository registries have been set up [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref6">6, 1, 2, 3</xref>
        ] in order to
store a broad range of information about registered repositories, such as the type of content,
discipline and subjects, access rights, and licensing information of the resources they store.
      </p>
      <p>As a plurality of registries does exist and serves different scientific domains, communities and
use cases, repository managers are incentivised to register their repository in more than one
registry to boost their online presence and traction at the expense of information maintainability
and up-to-dateness. Such a variety of scholarly repository registries provides a full-spectrum
overview across scientific disciplines and research applications and becomes a rich asset for
scholarly registries’ users downstream, such as researchers, scholarly service providers, and
Open Science infrastructures listing and aggregating their content.</p>
      <p>However, the availability of multiple registries, repository identifiers and profiles poses
non-trivial questions and challenges regarding their authoritativeness, disambiguation, and
coverage. In particular, such fragmentation quickly becomes a drawback in terms of information
redundancy and scattering, as multiple registrations produce different identifiers for the same
repository, arbitrarily used across the scholarly communication track record, and may lead to
potential information inconsistencies across registries.</p>
      <p>Unsurprisingly, registry managers are aware of such drawbacks and claim equivalence of
repository profiles whenever possible by cross-referencing their identifiers and PIDs across
different registries. As we will see, however, this “claim set” is far from complete. Nevertheless,
this is the only “ground knowledge” at our disposal; hence, we consider it valid without further
challenge.</p>
      <p>In this work, we conflate such claims to infer the duplicate sets of different profiles about
the same repository across the registries. Then, we further extend such duplicate sets by
integrating them with the clusters obtained by an automated clustering algorithm run over
repository profiles. Finally, we deliver the dataset of repository duplicates, manually validating
all the cases where the de-duplication contributed to integrating the claims
provided by the registries, and we draw some conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        The many scholarly communication services and academic search systems spun up over the last
decade have grown in parallel, each with a different and often siloed mindset, to target a broad and
diverse range of use cases and application contexts [
        <xref ref-type="bibr" rid="ref10 ref11 ref7 ref8 ref9">7, 8, 9, 10, 11</xref>
        ]. Consequently, despite
modelling very similar aspects of the academic domain (often the same ones), the respective
data models ended up being quite distant in order to capture the different peculiarities at
hand. Scholarly registries are no exception, as they have often been developed to target diverse
research communities, academic disciplines and research applications, as can be easily seen
by inspecting their data models and schemas, referenced in Section 3.
      </p>
      <p>Comparing and interconnecting their content is therefore paramount to
ensure their consistency and pave the way towards full interoperability and information exchange
across scholarly registries.</p>
      <p>To the best of our knowledge, no prior study analysed and compared the content of publicly
available scholarly registries and highlighted their inherent ambiguity. Therefore, in this work,
we address the disambiguation of scholarly repositories across scholarly repository registries,
which are at the centre of the present study.</p>
      <p>To date, the only tangible step in this direction can be found in the claims that scholarly
registries provide, in some cases, to cross-reference registrations of the same repository in other
registries. As we will see further into our analysis, these efforts are, however, not enough, and
the “claim set” provided by the registries is far from ideal, as many subsequent registrations of
the same research repositories have so far gone unnoticed.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Data and methods</title>
      <p>
        For this study, we selected four prominent scholarly repository registries, namely
FAIRsharing<sup>1</sup> [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], re3data<sup>2</sup> [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], OpenDOAR<sup>3</sup>, and ROAR<sup>4</sup>, whose details are summarised in Table 1 and
in the following.
      </p>
      <p>
        <bold>FAIRsharing.</bold> Hosted at the University of Oxford in the UK and launched in 2011,
FAIRsharing [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is a web-based, searchable portal of three interlinked registries, containing both
in-house and crowdsourced, manually curated descriptions of standards, databases (including
repositories and knowledge bases) and data policies, which are persistently identifiable via
DOIs. FAIRsharing maps the landscape of these three resources, monitoring their relationships,
development, evolution and integration, such as the implementation and use of standards in
databases or their adoption in data policies by funders, journals and other organisations.
FAIRsharing is also an endorsed output of the RDA FAIRsharing WG<sup>5</sup>, and its management follows
a community-driven approach in which the internal curators are supported by the maintainers of
the resources themselves, who get credited via their ORCID.
      </p>
      <p>
        As of February 2022, FAIRsharing has over 3,600 resources, of which 1,853 are databases, i.e.,
scholarly repositories of interest for this analysis. The content, licensed under a CC-BY-SA
licence<sup>6</sup>, can be browsed by (among other fields) registry type, discipline and country, and a
live statistics page provides several at-a-glance views of the landscape<sup>7</sup>.
      </p>
      <p>
        <bold>re3data.</bold> re3data<sup>8</sup> [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a global registry of research data repositories from all academic
disciplines. Since its launch in 2012, this registry has been funded by the German Research
Foundation<sup>9</sup> (DFG). The urgency of avoiding duplication of effort and serving the research
community with a single, sustainable registry is mentioned on the “About” web page of the
registry website, referring to the merger between re3data and Databib in 2013. Under the project
“re3data.org – Community Driven Open Reference for Research Data Repositories (COREF)”, the
registry provides “customisable and extendable core repository descriptions that are persistently
identifiable and can be referred to and cited appropriately. This includes unique identification
in machine-to-machine communication”<sup>10</sup>. re3data was created to meet the need for a resource
specifically dedicated to data repositories [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In fact, the already existing registries OpenDOAR
and ROAR mainly focused on repositories for scholarly publications and only hosted a residual
share of data repositories. Furthermore, there was a need for a more detailed description of each
research data repository, e.g., containing precise information on access and reuse conditions.
The registry can be browsed by content type, discipline or country.
      </p>
      <p><sup>1</sup> FAIRsharing – https://fairsharing.org</p>
      <p><sup>2</sup> re3data registry – https://re3data.org</p>
      <p><sup>3</sup> OpenDOAR registry – https://v2.sherpa.ac.uk/opendoar</p>
      <p><sup>4</sup> ROAR registry – http://roar.eprints.org/information.html</p>
      <p><sup>5</sup> RDA FAIRsharing WG – https://www.rd-alliance.org/group/fairsharing-registry-connecting-data-policies-standards-databases.html</p>
      <p><sup>6</sup> FAIRsharing licence – https://fairsharing.org/licence</p>
      <p><sup>7</sup> FAIRsharing stats – https://fairsharing.org/summary-statistics</p>
      <p><sup>8</sup> re3data registry – https://re3data.org</p>
      <p><sup>9</sup> German Research Foundation (DFG) – https://dfg.de</p>
      <p>For convenience, we retrieved the data via the OpenAIRE project<sup>11</sup> as it already integrates the
registry. As of February 2022, it contains 2,793 data repository profiles; the registry content is
released under a CC-BY licence. The collected data follow the re3data schema version 2.2<sup>12</sup>. In
August 2021, re3data delivered a new schema version<sup>13</sup>, and we used this one for the crosswalk.</p>
      <p><bold>OpenDOAR.</bold> OpenDOAR<sup>14</sup> is a directory listing only Open Access repositories. The service
was launched in 2005 as a result of a collaboration between the University of Nottingham and
Lund University, funded by OSI, Jisc, SPARC Europe and CURL. The listed repositories are
grouped into five types (Undetermined, Institutional, Disciplinary, Aggregating, Governmental),
which can be browsed by type of content, software, countries and regions<sup>15</sup>. To be included
in OpenDOAR, repositories must meet the inclusion criteria, and the submission has to be
accepted by the curators’ team. The submission request to OpenDOAR is carried out in two
parts: first, the application is sent by filling in a form with basic information; then, if the admission
criteria are met, further information is requested.</p>
      <p>We retrieved the data via OpenAIRE as it natively integrates the registry content. As of February
2022, OpenDOAR lists 5,811 repositories worldwide; the content of the registry is redistributed
under a CC-BY-NC-ND licence. The collected data follow the schema accessible through their
website<sup>16</sup>.</p>
      <p><bold>ROAR.</bold> The Registry of Open Access Repositories (ROAR) is hosted at the University of
Southampton, UK, and is funded by Jisc<sup>17</sup>. As declared on the project’s web page, its aim
is “to promote the development of Open Access by providing timely information about the
growth and status of repositories throughout the world”. A registered account is needed to add
a new repository, and the submission will be reviewed and eventually accepted or rejected with
editorial comments<sup>18</sup>.</p>
      <p><sup>10</sup> re3data mission statement – https://www.re3data.org/about</p>
      <p><sup>11</sup> OpenAIRE – https://www.openaire.eu</p>
      <p><sup>12</sup> re3data.org 2.2 schema – https://www.re3data.org/schema/2-2</p>
      <p><sup>13</sup> re3data 3.1 schema – https://www.re3data.org/schema</p>
      <p><sup>14</sup> OpenDOAR registry – https://v2.sherpa.ac.uk/opendoar</p>
      <p><sup>15</sup> OpenDOAR advanced browsing – https://v2.sherpa.ac.uk/cgi/search/repository/advanced</p>
      <p><sup>16</sup> OpenDOAR schema – https://v2.sherpa.ac.uk/api/metadata-schema.html</p>
      <p><sup>17</sup> ROAR registry – http://roar.eprints.org/information.html</p>
      <p><sup>18</sup> ROAR registration – http://roar.eprints.org/cgi/roar_register</p>
      <p>The data have been downloaded directly from the site, choosing the option Multiline CSV.
The CSV data formatting is consistent with the schema available online<sup>19</sup>. As of February 2022,
it contains 5,444 data repository profiles. The registry’s content can be browsed by country,
year, repository type, institutional association, and repository software<sup>20</sup>, and is redistributed under a
CC-BY licence.</p>
      <p>As mentioned, in some cases, three out of four registries provide claims (sometimes provided
by users), establishing “same-as” equivalences among repository profiles registered across
registries. However, the claim set is far from complete, as FAIRsharing provides claims to
re3data only, ROAR provides claims to OpenDOAR only, and re3data provides claims to all
the other three (often not at the same time, though). Moreover, we empirically noticed that no
registry provides claims addressing internal duplicates (see Section 4 for more details). Finally,
no assurance is given about the correctness or the completeness of the claims. Yet, such
claims are the only “ground knowledge” at our disposal; hence, we consider them valid without
further challenge.</p>
      <p>In order to address such shortcomings and identify further repository duplicates, in the
present work, we first use the claims provided by the three registries and conflate them to
create duplicate sets whenever they regard different profiles of (allegedly) the same repository.
Then, we further extend the duplicate sets by integrating them with the clusters obtained by
an automated clustering algorithm run over repository profiles. Whenever a duplicate set
intersects with one or more clusters, we try to extend it with new repository profiles provided
by the clusters; otherwise, it is left untouched. Finally, all the clusters not intersecting with any
duplicate set are promoted to duplicate sets. As the last step, a manual validation is performed
for all the duplicate sets where the de-duplication is involved. All the duplicate sets obtained in
this way, together with the code and original data, can be found in our Gitea repository<sup>21</sup>.</p>
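      <p>The extension step described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name <monospace>extend_with_clusters</monospace> is hypothetical, profiles are represented as prefixed identifier strings, and the sketch performs a single pass without merging duplicate sets that happen to match the same cluster.</p>
      <preformat><![CDATA[
```python
def extend_with_clusters(duplicate_sets, clusters):
    """Extend claim-based duplicate sets with automatically found clusters.

    Hypothetical helper (not from the paper's code base). Both arguments are
    iterables of sets of profile identifiers such as "fs:2560".
    """
    duplicate_sets = [set(s) for s in duplicate_sets]
    clusters = [set(c) for c in clusters]

    extended = []
    used = set()  # indices of clusters that intersect at least one duplicate set
    for dup in duplicate_sets:
        merged = set(dup)
        for i, cluster in enumerate(clusters):
            if dup & cluster:      # the duplicate set and the cluster share a profile
                merged |= cluster  # add the new profiles brought by the cluster
                used.add(i)
        extended.append(merged)    # left untouched when no cluster intersects it

    # Clusters that intersect no duplicate set are promoted to duplicate sets.
    promoted = [c for i, c in enumerate(clusters) if i not in used]
    return extended + promoted


if __name__ == "__main__":
    claims = [{"fs:2560", "rd:r3d100011201"}]
    # The composition of this cluster is assumed for illustration only.
    found = [{"od:4194", "rd:r3d100011201"}, {"rr:A", "rr:B"}]
    for s in extend_with_clusters(claims, found):
        print(sorted(s))
```
]]></preformat>
      <p>Because each duplicate set is extended independently, two sets touching the same cluster would both absorb it; any reconciliation of such cases is left to the manual validation step in this sketch.</p>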
      <p>For the sake of clarity, the methodology is represented in Figure 1 and described in detail in
the following paragraphs.</p>
      <p><bold>Conflating registry claims.</bold> To conflate the claims, we arbitrarily start from
FAIRsharing and programmatically explore, for each repository profile in FAIRsharing, the claims
referring to repository profiles in re3data (e.g., [fs:2114]→[rd:r3d100010191])<sup>22</sup> to seed the
starting duplicate sets. Then, for any given duplicate set, we verify whether there exists a
claim from the re3data profile pointing back to the starting FAIRsharing repository (e.g.,
[fs:2114]→[rd:r3d100010191]→[fs:2114]). If this is not the case, we note down the three profiles
involved as a problematic duplicate set, which has to be checked manually at a later stage
(e.g., [fs:3652]→[rd:r3d100012729]→[fs:1724]), and we move forward.</p>
      <p><sup>19</sup> ROAR schema – http://roar.eprints.org/cgi/schema</p>
      <p><sup>20</sup> ROAR stats – http://roar.eprints.org/view</p>
      <p><sup>21</sup> Data and code – https://code-repo.d4science.org/miriam.baglioni/Registries</p>
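      <p>The seeding and back-reference check can be illustrated with the short sketch below. It is not the authors' implementation: the function name and the dictionary-based representation of claims are assumptions, and the sketch further assumes that a re3data profile with no claim towards FAIRsharing does not make a set problematic; only a claim pointing to a different FAIRsharing profile does.</p>
      <preformat><![CDATA[
```python
def seed_duplicate_sets(fs_claims, rd_claims):
    """Seed duplicate sets from FAIRsharing -> re3data claims.

    Hypothetical helper (not from the paper's code base).
    fs_claims: dict mapping a FAIRsharing id to the re3data id it claims.
    rd_claims: dict mapping a re3data id to the FAIRsharing id it claims.
    """
    seeds = []        # duplicate sets whose claims are consistent
    problematic = []  # (fs, rd, other fs) triples to be checked manually
    for fs_id, rd_id in fs_claims.items():
        back = rd_claims.get(rd_id)
        # Assumption: a missing back-claim is not treated as a conflict.
        if back is None or back == fs_id:
            seeds.append({fs_id, rd_id})
        else:
            problematic.append((fs_id, rd_id, back))
    return seeds, problematic


if __name__ == "__main__":
    fs_to_rd = {"fs:2114": "rd:r3d100010191", "fs:3652": "rd:r3d100012729"}
    rd_to_fs = {"rd:r3d100010191": "fs:2114", "rd:r3d100012729": "fs:1724"}
    seeds, problematic = seed_duplicate_sets(fs_to_rd, rd_to_fs)
    print(seeds)
    print(problematic)
```
]]></preformat>
      <p>With the two examples quoted in the text, the first claim pair seeds a duplicate set, while the second is set aside as a triple for manual inspection.</p>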
      <p>Next, we try to extend any given duplicate set by conflating the re3data profile claims towards
OpenDOAR. A new profile from OpenDOAR is added to a duplicate set if and only if it is
not already part of another; otherwise, the duplicate set is marked as problematic and will be
checked manually. Then the claims pointing from the re3data profile to ROAR are conflated,
and, as for OpenDOAR, one profile from ROAR is added to the duplicate set only if it is not
already part of another set; otherwise, the set will be checked manually.</p>
      <p>If one profile from ROAR is added to a duplicate set, we consider its claims. Since ROAR
provides claims only towards OpenDOAR, a profile from OpenDOAR could be added to the
duplicate set if it is not already present in another set.</p>
      <p><sup>22</sup> Hereafter, we refer to repository profiles by indicating repository registration identifiers prefixed according to the
involved registry (i.e., fs: – od: – rd: – rr:). Each example links to the profile in the relevant registry in its current
version, which might differ from the one observed at the time of writing. The metadata as we collected them are
provided for transparency and validation of the reported examples.</p>
      <p>Finally, for each re3data claim not already processed, we seed a new duplicate set and search
for FAIRsharing, OpenDOAR, and ROAR claims as done before. Similarly, we check the ROAR
repositories’ claims towards OpenDOAR that have not already been processed. In this case, if the
OpenDOAR profile is present in another duplicate set, we extend the already formed duplicate
set with the new ROAR profile<sup>23</sup>.</p>
      <p>
        <bold>De-duplication.</bold> In order to automatically detect repository duplicates and cluster them, we
opted for FDup [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. FDup has been developed within the OpenAIRE-Advance<sup>24</sup>
and OpenAIRE-Nexus<sup>25</sup> projects and is currently applied in the production of the OpenAIRE
Research Graph [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] to de-duplicate metadata records of research products and organisations.
      </p>
      <p>FDup provides an efficient de-duplication framework capable of comparing the metadata
records in a “big data” collection to identify the groups of equivalent ones, hereafter referred to
as duplicate sets.</p>
      <p>To this aim, FDup is given a dataset consisting of all the repository profiles across the four
registries, for which we selected a handful of relevant, common fields: original identifier and
source registry for identification, repository name, and homepage URL. Then, FDup performs
a two-phase processing: the candidate identification phase avoids the otherwise quadratic
complexity of comparing all possible pairs of profiles by pre-fetching the pairs of profiles that
are likely to represent the same repository; the duplicate set identification phase matches all pairs of
candidate profiles to effectively validate their equivalence and finally generates the duplicate
sets by grouping all identified matching pairs of profiles via their transitive closure.</p>
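      <p>To give an intuition of the candidate identification phase, the sketch below groups profiles by a simple blocking key and only emits pairs that share it. This is a generic blocking example and does not reproduce FDup's actual candidate selection: the key (the homepage host name without a leading "www.") and both function names are assumptions made for illustration, and homepage URLs are assumed to include a scheme such as "https://".</p>
      <preformat><![CDATA[
```python
from collections import defaultdict
from itertools import combinations
from urllib.parse import urlparse


def blocking_key(homepage_url):
    """Hypothetical blocking key: lower-cased host name without a leading 'www.'."""
    host = urlparse(homepage_url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def candidate_pairs(profiles):
    """Return the pairs of profile ids that share a blocking key.

    profiles: iterable of (profile_id, homepage_url) tuples.
    Only these pairs, rather than all possible pairs, would then be compared.
    """
    blocks = defaultdict(list)
    for profile_id, url in profiles:
        key = blocking_key(url)
        if key:  # profiles without a usable URL are left out of this sketch
            blocks[key].append(profile_id)

    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs


if __name__ == "__main__":
    # Identifiers and URLs below are made up for illustration.
    sample = [
        ("rd:x", "https://www.example.org/data"),
        ("fs:y", "http://example.org"),
        ("od:z", "https://other.example.net"),
    ]
    print(candidate_pairs(sample))
```
]]></preformat>
      <p>A single blocking key is a deliberately crude choice: repositories registered with different homepage hosts would never be compared, which is one reason why real frameworks combine several keys.</p>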
      <p>The outcome of this process is a graph, where two nodes (i.e., repository profiles) are
connected by an edge whenever similar. The graph is then explored to identify its connected
components, which are the subgraphs in which each pair of nodes is connected via a path of
edges (i.e., duplicate sets). The outcome of this phase strongly depends on the threshold chosen
for the similarity match. For our analysis, we chose a 0.9 threshold to increase the precision
level of the duplicate sets containing equivalent repositories.</p>
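      <p>The grouping of matching pairs into connected components can be sketched with a union-find structure, as below. This is an illustrative sketch rather than FDup's implementation: the function name is hypothetical, similarity scores are assumed to be given as input, and treating a score equal to the threshold as a match is an assumption.</p>
      <preformat><![CDATA[
```python
def duplicate_sets_from_matches(scored_pairs, threshold=0.9):
    """Group profiles into connected components of the similarity graph.

    Hypothetical helper (not FDup's API).
    scored_pairs: iterable of (profile_a, profile_b, similarity) tuples.
    An edge is kept when similarity >= threshold (assumed to be inclusive).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b, score in scored_pairs:
        if score >= threshold:
            union(a, b)

    # Collect the nodes of each component under its root.
    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    # Identifiers and scores below are made up for illustration.
    pairs = [("a", "b", 0.95), ("b", "c", 0.92), ("c", "d", 0.50), ("e", "f", 0.90)]
    for group in duplicate_sets_from_matches(pairs):
        print(sorted(group))
```
]]></preformat>
      <p>Note that transitive closure can chain profiles together: in the example, "a" and "c" end up in the same set even though they were never directly matched, which is why a high threshold helps to keep the sets precise.</p>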
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>While ROAR and FAIRsharing provide just one-to-one claims, re3data can claim duplicates in
more than one registry. From re3data, in fact, we get one set composed of all four registries (e.g.,
[rd:r3d100012322]) and four sets composed of three registries out of four (e.g., [rd:r3d100012274],
[rd:r3d100013438], [rd:r3d100013665], and [rd:r3d100013717]). The remaining claims are
one-to-one: 451 claims to FAIRsharing, 15 to OpenDOAR, and 4 to ROAR.</p>
      <p>After conflating the registries’ claims, we get 3,548 duplicate sets, whose composition is
reported in Figure 2. Of these sets, 88.8% consist of two profiles. As expected, the majority of
the duplicate sets are composed of coupled profiles from OpenDOAR and ROAR (74%), followed
by FAIRsharing and re3data (25%). The remaining ones are duplicate sets where a re3data profile
is paired with one in either ROAR or OpenDOAR.</p>
      <p>[Figure 2. Duplicate sets obtained from registry claims: 3,548 (100%) in total; 3,151 (88.8%) of size 2; 348 (9.8%) of size 3.]</p>
      <p><sup>23</sup> Multiple profiles from ROAR can be added to the same duplicate set because more than one ROAR repository can
claim the same OpenDOAR one, e.g., [rr:919] and [rr:5425] both claim [od:1047].</p>
      <p><sup>24</sup> OpenAIRE-Advance – https://cordis.europa.eu/project/id/777541</p>
      <p><sup>25</sup> OpenAIRE-Nexus – https://cordis.europa.eu/project/id/101017452</p>
      <p>Sets composed of three profiles are 9.8% of the total, and the majority (98%) involve one profile
in OpenDOAR and two in ROAR, suggesting the presence of duplicate repository registrations
within ROAR. The remaining 2% comprise profiles drawn from re3data, OpenDOAR, and
ROAR.</p>
      <p>Finally, the last 1.4% are sets composed of four or five profiles. In all but one, ROAR profiles
appear more than once, up to four times in one set with OpenDOAR. Also, one duplicate from
OpenDOAR is present. Summing up, 13.45% of the repositories registered in ROAR, which
provide a claim, can be considered duplicates (8.1% of the total number of repositories).</p>
      <p>The conflation produced six problematic duplicate sets, which involve only claims between
FAIRsharing and re3data:</p>
      <list list-type="bullet">
        <list-item><p>[fs:3652]→[rd:r3d100012729]→[fs:1724]. [fs:1724] was deprecated and has been subsumed
into [fs:3652]. The set is [fs:3652] and [rd:r3d100012729].</p></list-item>
        <list-item><p>[fs:3340]→[rd:r3d100010543]→[fs:2107]. [fs:2117] was deprecated and has been replaced
by [fs:3340]. The set is [fs:3340] and [rd:r3d100010543].</p></list-item>
        <list-item><p>[rd:r3d100010412]→[fs:2424]→[rd:r3d100011538]. [rd:r3d100011538] is a duplicate of
[rd:r3d100010412]; the set can be considered correct.</p></list-item>
        <list-item><p>[rd:r3d100011257]→[fs:1730]→[rd:r3d100012862]. [rd:r3d100011257] has been merged
into [rd:r3d100012862]; the set is [rd:r3d100012862] and [fs:1730].</p></list-item>
        <list-item><p>[rd:r3d100011343]→[fs:2163]→[rd:r3d100000039]. [rd:r3d100011343] does not belong
to the same set, and it refers to a different organisation. The set is [fs:2163] and
[rd:r3d100000039].</p></list-item>
        <list-item><p>[rd:r3d100013223]→[fs:2524]→[rd:r3d100012397]. [rd:r3d100013223] is a duplicate of
[rd:r3d100012397]; the set can be considered correct.</p></list-item>
      </list>
      <p>After this manual validation, the checked claims are then used to extend the duplicate sets.</p>
      <p>[Figure 3. Clusters obtained from FDup: 2,230 (100%) in total; 2,022 (90.7%) of size 2; 186 (8.34%) of size 3; 22 (0.96%) of size 4 to 5. Clusters of size 2: 1,396 (69.04%) ROAR + OpenDOAR; 425 (21.02%) FAIRsharing + re3data; 69 (3.4%) re3data + OpenDOAR or ROAR; 109 (5.4%) ROAR inward; 15 (0.7%) OpenDOAR inward; 1 (0.05%) re3data inward. Clusters of size 3: 130 (69.89%) with 2+ inward duplicates; 56 (30.11%) with three registries involved.]</p>
      <p>From FDup automatic clustering, we get 2,230 clusters, whose composition is reported in
Figure 3. Most of them (90.7%) involve two profiles. This time, inward duplicate profiles
(i.e., within the same registry) are immediately visible: 5.4% of the clusters are composed of
duplicates within ROAR, and 0.7% of duplicates within OpenDOAR. Also, one duplicate within
re3data is present. As before, most of the remaining clusters (69.04%) are composed of ROAR
and OpenDOAR profiles, then 21.02% of FAIRsharing and re3data profiles, and the last 3.3%
involve a re3data profile with another one from OpenDOAR or ROAR. Clusters of three profiles
account for 8.34% of the total, and 69.89% of them contain more than one profile from the same registry.
Eight clusters are composed entirely of profiles within ROAR, which, together with OpenDOAR,
is the registry with the highest number of multiple internal registrations for the same repository.
The remaining 30.11% are composed of a combination of three registries out of four. The last
0.96% of the clusters are composed of four or five profiles, six of which have at least one profile
for each repository, and only five have no duplicate from the same registry.</p>
      <p>Using the clusters to extend the duplicate sets produced 208 unique sets<sup>26</sup> (e.g.,
([od:4194], [rd:r3d100011201], [fs:2560]) extending ([fs:2560], [rd:r3d100011201])). In addition, 428 duplicate
sets are obtained via de-duplication only, 1,400 clusters overlap completely with the
corresponding duplicate set, and 1,720 duplicate sets come from registry claims only. Of
the FDup-extended sets, 70.7% are obtained by adding one profile, 21.7% by adding two profiles, and
the remaining ones by adding up to five profiles, yielding two sets of cardinality 8.</p>
      <p>Please note that the extended sets may not be complete. As an example, consider ([od:2373],
[rr:2115], [rr:3755], [rr:3591], [rr:4562]) and ([od:2373], [rr:3591], [rr:4562], [rr:4695]): the second is
completely contained in the first except for [rr:4695]. The first one is obtained via the set ([rr:3591],
[rr:4562], [od:2373]) and the cluster ([rr:4695], [rr:3591]) and the second via the sets ([rr:3591],
[rr:4562], [od:2373]) and ([rr:3755], [od:2115]), and the cluster ([rr:4562], [od:2115]). This happened
because we had two different clusters for the same repository, which was registered under
different names.</p>
      <p>Assuming that the claims from the registries are correct, we have manually validated all
the duplicate sets coming from FDup only or extended via a cluster to verify their correctness.
Those extended via registry claims are also checked to understand why FDup could not
cluster them correctly. To check the correctness of a given set, we considered the repository
URLs and names as they were in our initial data. When both are the same, the repository is
considered the same. When one of the two is considerably different, we inspected the URLs and
registry content as a tie-break. Incidentally, some repositories are no longer in the registry (7
among those inspected in the repository). These repositories come from claims among registries
(ROAR to OpenDOAR mainly). There are also 16 wrong duplicate sets (9 because of FDup being
wrong, 7 because of incorrect claims in the registries), and 74 non-working URLs. Furthermore,
different duplicate sets can still refer to the same repository when de-duplication and registry claims
do not have a common profile on whose basis our methodology can extend the duplicate set.
Although we experimentally verified the presence of four such cases, we cannot provide their
exact number.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>On the one hand, the registry claims are a good starting point to identify multiple registrations of
the same repository, but they are far from complete, and, most importantly, they sometimes
prove to be inaccurate. On the other hand, a systematic procedure to automatically extend
this claim set can be of aid, but can lead to incomplete duplicate sets or even disjoint and wrong
ones.
The best of the two worlds can be achieved by combining the two approaches, but human
intervention is still needed to obtain duplicate sets that are as accurate as possible; however, manual
inspection is expensive and entails other known problems. Moreover, in some cases, time and
effort do not suffice, as domain expert knowledge is required to determine whether two profiles
can be considered the same (e.g., [rd:r3d100010553] and [fs:1956], whose website URLs resolve to
very different pages). Furthermore, some choices would also remain discretionary, for example,
in the case of two repositories being registered with the same name and with URLs resolving,
in one case, to a list of collections or subsets of a repository, and, in the other, to one of these
subsets. Unfortunately, there is no clear way to address such a conundrum without domain
expert knowledge.</p>
      <p><sup>26</sup> There are 239 extended duplicate sets, but 31 have been extended via merging: more than one set matches the same
cluster, e.g., ([rr:978], [rr:976], [rr:5221], [rr:2328], [od:239], [od:241]) via ([rr:976], [rr:978]) from de-duplication,
and ([od:241], [rr:978]) and ([od:239], [rr:2328], [rr:5221], [rr:976]) from registries, thus producing one single set.</p>
      <p>In conclusion, this study and the results presented here outline a first-of-its-kind wake-up
call to raise awareness about the inherent ambiguity residing in scholarly repository registries.
While information scattering, consistency and duplication are not new in data quality literature,
siloed, out-of-sync, scholarly repository registries can have a detrimental impact on the global
scientific track record.</p>
      <p>First and foremost, inconsistency and incompleteness of a registry’s content can negatively
affect its image and directly hamper its uptake and reliability in the eyes of the research
community(ies) of reference.</p>
      <p>Moreover, scholarly registries’ downstream users, such as researchers, scholarly service
providers, and Open Science infrastructures listing and aggregating their content, are exposed to
potentially conflictual information, which has to be reconciled on an arbitrary (possibly manual)
basis. The OpenAIRE infrastructure, for example, aggregates data sources (i.e., repositories)
from a variety of registries, such as FAIRsharing, OpenDOAR, re3data, and CRIS Systems. When
the same repository is ingested from multiple registries, OpenAIRE needs to reconcile the
duplicates into one single record. One copy is elected as master (if possible) and is enriched with
all the identifiers from the other copies. As it is not always possible to choose a master, multiple
copies of the same data source can still appear in the OpenAIRE portal. The de-duplication of
data sources within OpenAIRE mostly entails manual work: the cross-references provided
by the registries and the de-duplication results have to be checked and enriched with repositories
that were not included, so as to obtain duplicate sets that are as accurate as possible.</p>
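      <p>The reconciliation step described above (electing one copy as master and enriching it with the identifiers of the other copies) can be illustrated with a short sketch. This is not the OpenAIRE implementation: the Profile class, the elect_master function, and in particular the REGISTRY_PRIORITY order are hypothetical, since the actual election criteria are not described here.</p>
      <preformat>
```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    identifier: str  # e.g. "od:241"
    registry: str    # registry prefix, e.g. "od"
    other_ids: set = field(default_factory=set)


# ASSUMPTION: a hypothetical preference order used only for illustration.
REGISTRY_PRIORITY = ["od", "rd", "fs"]


def elect_master(duplicates):
    """Return the master enriched with the other identifiers, or None."""
    candidates = [p for p in duplicates if p.registry in REGISTRY_PRIORITY]
    if not candidates:
        # No master can be chosen: the copies remain separate records.
        return None
    master = min(candidates, key=lambda p: REGISTRY_PRIORITY.index(p.registry))
    enriched = Profile(master.identifier, master.registry, set(master.other_ids))
    for profile in duplicates:
        if profile is not master:
            enriched.other_ids.add(profile.identifier)
            enriched.other_ids.update(profile.other_ids)
    enriched.other_ids.discard(enriched.identifier)
    return enriched


group = [Profile("rr:978", "rr"), Profile("od:241", "od")]
master = elect_master(group)
no_master = elect_master([Profile("rr:978", "rr")])
```
      </preformat>
      <p>In this sketch, the copy from the preferred registry becomes the master and carries the identifier of the other copy, while a group with no eligible candidate returns no master at all, mirroring the case in which multiple copies of the same data source remain visible.</p>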
      <p>Finally, as subsequent registrations yield multiple identifiers (e.g., a DOI) for the same
repository, assessing metrics reflecting the usage, adoption rate, and impact of a repository across the
academic community can be hindered if equivalence across different registrations cannot be
precisely pinpointed.</p>
      <p>In our opinion, such problems can mostly be solved through an agreed-upon solution, paving the
way towards full interoperability across registries and enabling a seamless exchange
of the wealth of information contained therein.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was partially funded by the EC H2020 OpenAIRE-Nexus (Grant agreement
101017452).
OpenAIRE Research Graph Dump, 2021. doi:10.5281/zenodo.5801283.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Pampel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vierkant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Scholze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bertelmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kindling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Klump</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-J.</given-names>
            <surname>Goebelbecker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gundlach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schirmbacher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Dierolf</surname>
          </string-name>
          ,
          <article-title>Making Research Data Repositories Visible: The re3data.org Registry</article-title>
          ,
          <source>PLOS ONE</source>
          <volume>8</volume>
          (
          <year>2013</year>
          )
          e78080. doi:10.1371/journal.pone.0078080.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rolando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Borgman</surname>
          </string-name>
          ,
          <article-title>If We Share Data, Will Anyone Use Them? Data Sharing and Reuse in the Long Tail of Science and Technology</article-title>
          ,
          <source>PLoS ONE</source>
          <volume>8</volume>
          (
          <year>2013</year>
          ). doi:10.1371/journal.pone.0067332.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Davidson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Molloy</surname>
          </string-name>
          ,
          <article-title>Big data: The potential role of research data management and research data registries</article-title>
          ,
          in:
          <source>IFLA WLIC 2014</source>
          , Lyon, France,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I. V.</given-names>
            <surname>Pasquetto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Borgman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Wofford</surname>
          </string-name>
          ,
          <article-title>Uses and Reuses of Scientific Data: The Data Creators' Advantage</article-title>
          ,
          <source>Harvard Data Science Review</source>
          <volume>1</volume>
          (
          <year>2019</year>
          ). doi:10.1162/99608f92.fc14bf2d.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          ,
          <article-title>Theory and Practice of Data Citation</article-title>
          ,
          <source>Journal of the Association for Information Science and Technology</source>
          <volume>69</volume>
          (
          <year>2018</year>
          )
          <fpage>6</fpage>
          -
          <lpage>20</lpage>
          . doi:10.1002/asi.23917. arXiv:1706.07976.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ball</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ashley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>McCann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Molloy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Van den Eynden</surname>
          </string-name>
          ,
          <article-title>Show me the data: The pilot UK Research Data Registry</article-title>
          ,
          <source>International Journal of Digital Curation</source>
          <volume>9</volume>
          (
          <year>2014</year>
          )
          <fpage>132</fpage>
          -
          <lpage>141</lpage>
          . doi:10.2218/ijdc.v9i1.307.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Martín-Martín</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Thelwall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Orduna-Malea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Delgado López-Cózar</surname>
          </string-name>
          ,
          <article-title>Google Scholar, Microsoft Academic, Scopus, Dimensions, Web of Science, and OpenCitations' COCI: A multidisciplinary comparison of coverage via citations</article-title>
          ,
          <source>Scientometrics</source>
          (
          <year>2020</year>
          ). doi:10.1007/s11192-020-03690-4. arXiv:2004.14329.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gusenbauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. R.</given-names>
            <surname>Haddaway</surname>
          </string-name>
          ,
          <article-title>Which academic search systems are suitable for systematic reviews or meta-analyses? Evaluating retrieval qualities of Google Scholar, PubMed, and 26 other resources</article-title>
          ,
          <source>Research Synthesis Methods</source>
          <volume>11</volume>
          (
          <year>2020</year>
          )
          <fpage>181</fpage>
          -
          <lpage>217</lpage>
          . doi:10.1002/jrsm.1378.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Visser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. J.</given-names>
            <surname>van Eck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Waltman</surname>
          </string-name>
          ,
          <article-title>Large-scale comparison of bibliographic data sources: Scopus, Web of Science, Dimensions, Crossref, and Microsoft Academic</article-title>
          (
          <year>2021</year>
          )
          22. doi:10.1162/qss_a_00112.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A. W.</given-names>
            <surname>Harzing</surname>
          </string-name>
          ,
          <article-title>Two new kids on the block: How do Crossref and Dimensions compare with Google Scholar, Microsoft Academic, Scopus and the Web of Science?</article-title>
          ,
          <source>Scientometrics</source>
          <volume>120</volume>
          (
          <year>2019</year>
          )
          <fpage>341</fpage>
          -
          <lpage>349</lpage>
          . doi:10.1007/s11192-019-03114-y.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aryani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fenner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Manghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mannocci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stocker</surname>
          </string-name>
          ,
          <article-title>Open Science Graphs Must Interoperate!</article-title>
          , in:
          <source>ADBIS, TPDL and EDA 2020 Common Workshops and Doctoral Consortium</source>
          , Springer,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.-A.</given-names>
            <surname>Sansone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>McQuilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rocca-Serra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gonzalez-Beltran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Izzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Lister</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Thurston</surname>
          </string-name>
          ,
          <article-title>FAIRsharing as a community approach to standards, repositories and policies</article-title>
          ,
          <source>Nature Biotechnology</source>
          <volume>37</volume>
          (
          <year>2019</year>
          )
          <fpage>358</fpage>
          -
          <lpage>367</lpage>
          . doi:10.1038/s41587-019-0080-8.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>De Bonis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Manghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Atzori</surname>
          </string-name>
          ,
          <article-title>FDup: a framework for general-purpose and efficient entity deduplication of record collections</article-title>
          ,
          <source>PeerJ Computer Science</source>
          <volume>8</volume>
          (
          <year>2022</year>
          )
          e1058
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P.</given-names>
            <surname>Manghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Atzori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Baglioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schirrwagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Dimitropoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>La Bruzzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Foufoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mannocci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Horst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Czerniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kiatropoulou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kokogiannaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>De Bonis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Artini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ottonello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lempesis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ioannidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Manola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Principe</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>