Mapping bibliographic metadata collections: the case of OpenCitations Meta and OpenAlex

Elia Rizzetto¹, Silvio Peroni¹
¹ Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy

Abstract
This study describes the methodology and analyses the results of the process of mapping entities between two large open bibliographic metadata collections, OpenCitations Meta and OpenAlex. The primary objective of this mapping is to integrate OpenAlex internal identifiers into the existing metadata of bibliographic resources in OpenCitations Meta, thereby interlinking and aligning these collections. Furthermore, analysing the output of the mapping provides a unique perspective on the consistency and accuracy of bibliographic metadata, offering a valuable tool for identifying potential inconsistencies in the processed data.

Keywords
Bibliographic collection, entity mapping, OpenCitations, OpenAlex

1. Introduction
Open bibliographic metadata collections play a pivotal role in enabling reproducible studies in the fields of bibliometrics, scientometrics and science of science and permit transparent procedures in the context of research assessment exercises, thus enabling the implementation of norms and guidelines that intend to reform the research assessment around the world, such as the Coalition for Advancing Research Assessment (CoARA¹). As the volume and diversity of scholarly publications continue to expand, the need for comprehensive and interoperable bibliographic databases becomes increasingly pronounced. This study delves into the process of mapping entities between two important open bibliographic metadata collections, OpenCitations Meta [1] and OpenAlex [2]. These mapping processes are a critical step towards enabling researchers, institutions, and platforms to access and utilise information seamlessly across diverse collections.
In our work, the primary objective of this mapping is to integrate OpenAlex internal identifiers into the existing metadata of bibliographic resources (BRs) in OpenCitations Meta, thereby interlinking and aligning these collections. This paper presents the results of the mapping and provides details on the methodology adopted to accomplish this task. By shedding light on the complexities inherent in aligning bibliographic metadata collections, we aim to contribute valuable insights into the challenges and opportunities associated with such endeavours. Furthermore, the study investigates what the mapping process implies for assessing the quality of the datasets involved. Analysing the output of the mapping provides a unique perspective on the consistency and accuracy of bibliographic metadata, offering a valuable tool for identifying potential inconsistencies in the processed data. The importance of such considerations lies in their capacity to enhance data quality, fortify interoperability, and foster a more cohesive scholarly metadata landscape.

1 https://coara.eu/

20th IRCDL (The conference on Information and Research science Connecting to Digital and Library Science, February 22–23, 2024, Bressanone - Brixen, Italy)
elia.rizzetto@studio.unibo.it (E. Rizzetto); silvio.peroni@unibo.it (S. Peroni)
ORCID: 0009-0003-7161-9310 (E. Rizzetto); 0000-0003-0530-4305 (S. Peroni)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

The rest of the paper is structured as follows. In Section “Material and methods”, we introduce the processed data and the mapping methodology. Then, in Section “Results”, we present the result of the mapping analysis.
Section “Discussions” discusses some of the most relevant outcomes, highlighting the broader implications of mapping large bibliographic metadata collections for data integration, quality enhancement, and improved interoperability within the scholarly domain. Finally, in Section “Conclusions”, we conclude the paper by sketching out some future work.

2. Material and methods
The following subsections analyse multi-mapped and non-mapped BRs in more detail.

2.1. Data
The two collections involved in the mapping process are OpenCitations Meta (OC Meta) and the OpenAlex catalogue (henceforth, just OpenAlex). Only a subset of the entities in both collections has been considered for the mapping, namely (following OpenCitations nomenclature) bibliographic resources (BRs) [3][4], i.e. journal articles, conference papers, datasets, journals, books, book chapters, etc. The specific versions of the datasets used for the analysis described in the present study are version 5 of OC Meta [5] and a snapshot of the OpenAlex database released on October 18th, 2023².

OC Meta is the OpenCitations [6] database collecting metadata of scholarly bibliographic entities. The metadata exposed by OC Meta includes the basic metadata describing the BRs involved as citing or cited entities in the OpenCitations collection of bibliographic citations, i.e. OpenCitations Index [7]. In particular, OC Meta stores known persistent identifiers for each BR (DOI³, PMID⁴, PMCID⁵, ISSN⁶, and ISBN⁷), the title, type, publication date, page interval, the venue of publication, and the volume and issue numbers if the venue is a journal. In addition, OC Meta contains metadata regarding the main actors involved in the publication of each BR, i.e. the names of the authors, editors, and publishers, and their persistent identifiers (ORCID⁸ and Crossref ID⁹) where available.
All entities in OC Meta are persistently identified by the OpenCitations Meta Identifier (OMID), and their properties and relations are specified in compliance with the OpenCitations Data Model (OCDM) [3][4]. Notably, OC Meta also tracks the changes in its data and provides provenance information using Linked Open Data technologies. All OC Meta data is published under a CC0 license and is accessible online via a REST API¹⁰ and a SPARQL endpoint¹¹; periodic dumps can be downloaded in tabular format (CSV files) and RDF (JSON-LD files)¹². JSON-LD and CSV files are produced from a triplestore storing the whole OC Meta graph.

2 https://openalex.s3.amazonaws.com/RELEASE_NOTES.txt
3 https://www.doi.org/the-identifier/what-is-a-doi/
4 https://pubmed.ncbi.nlm.nih.gov/
5 https://www.ncbi.nlm.nih.gov/pmc/about/public-access-info/
6 https://www.issn.org/understanding-the-issn/what-is-an-issn/
7 https://www.isbn-international.org/content/what-isbn/10
8 https://info.orcid.org/what-is-orcid/
9 https://www.crossref.org/
10 https://w3id.org/oc/meta/api/v1
11 https://w3id.org/oc/meta/sparql
12 https://w3id.org/oc/download

OpenAlex is a collection of scholarly metadata curated and published by OurResearch¹³, and initiated in response to the discontinuation of the Microsoft Academic Graph (MAG) [8]. It features five types of entities providing rich metadata: Works (such as journal articles, books, and datasets), Sources (i.e. where works are contained, such as journals, conferences, and repositories), Authors, Institutions, and Concepts. Metadata include external persistent identifiers (PIDs): DOI, PMID, PMCID, and MAG ID for Work entities (journal articles, proceeding papers, etc.); ISSN, Wikidata ID¹⁴, MAG ID and Fatcat ID¹⁵ for Source entities (journals, books, etc.). Within OpenAlex, entities are identified with a persistent ID scheme, i.e. the OpenAlex ID.
Data is published under a CC0 license and is accessible via a REST API, a web-based GUI, or as downloadable snapshots of the whole database (JSON-Lines files) [2].

In the scope of this paper, the most relevant differences between OC Meta and OpenAlex concern the number of BRs in the two collections, the data sources they use, and some differences in the data models:
• OpenAlex is the largest open scholarly data collection, currently comprising 246,844,573 Works and 249,408 Sources, for a total of 247,093,981 BRs. The latest version of OpenCitations Meta includes 105,953,699 BRs.
• Data in OpenAlex is provided mainly by Crossref and inherited from the now-discontinued Microsoft Academic Graph, but it also includes data from PubMed [9], the Directory of Open Access Journals (DOAJ) [10], Unpaywall [11], arXiv [12], Zenodo [13], the ISSN International Centre¹⁶, and the Internet Archive’s General Index¹⁷. OC Meta’s sources are Crossref, the National Institutes of Health Open Citation Collection (NIH-OCC, providing PubMed data) [14], OpenAIRE [15], and the Japan Link Center (JaLC) [16]¹⁸.
• In OpenAlex, Works can only have one ID value per ID scheme, and Sources admit a list of up to two ISSNs or a single literal value for each of the other ID schemes. On the contrary, in the OCDM, and therefore in OC Meta, there are no limits on the number of possible values for each ID scheme. This substantial difference in how the two collections represent their data implies that, for example, if a journal article has been assigned two DOIs, they can be linked to the same entity (and the same OMID) in OC Meta, but not in OpenAlex. Another noteworthy difference is that OpenAlex does not support ISBNs, while OC Meta does.

2.2. Mapping process
The process leading to the mapping of these two collections is explained as follows. Initially, two tables are produced, which contain the internal IDs of the collections to be mapped with each other.
The first table is produced by parsing the CSV dump of OC Meta and contains, in each row, the OMID, external PIDs, and type of a BR in OC Meta that has external PIDs. The other table, produced from the JSON-Lines copy of the OpenAlex database, links each external PID in OpenAlex to the OpenAlex ID to which it is associated. The table containing OpenAlex data is converted into a local SQL database. Then, the table containing OC Meta BRs to be mapped is iterated line by line, and each PID associated with each entity is looked up in the database containing PID-OpenAlex ID associations. The result consists of three additional tables:
1. A table storing OMID, OpenAlex ID, and type of the BRs, if exactly one OpenAlex ID per OMID has been found;
2. A table storing OMID, OpenAlex IDs, and type of the BRs, if multiple OpenAlex IDs per OMID have been found (multi-mapped BRs);
3. A table storing OMID and type of the BRs, if no OpenAlex ID has been found (non-mapped BRs).

13 https://ourresearch.org/
14 https://www.wikidata.org/wiki/Wikidata:Identifiers
15 https://fatcat.wiki/
16 https://www.issn.org/
17 https://archive.org/details/GeneralIndex
18 The data provided by JaLC is not included in the dump version processed for the mapping described by the present work (v5), but is included in the latest version (v6).

The primary purpose of the mapping is to enable the addition of OpenAlex IDs to other available external persistent identifiers (PIDs) among the metadata of bibliographic resources already existing in OC Meta. This mapping can be performed automatically on resources that have not been assigned an OpenAlex ID yet, as well as applied more than once to resources that have already been assigned one. This latter use is particularly relevant in the case of multiple OC Meta BRs or OpenAlex resources being merged into a new entity, with a new OMID or OpenAlex ID respectively.
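The lookup step described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the software actually used for the mapping: an in-memory SQLite table stands in for the local SQL database, and the names `pid_map`, `build_pid_index` and `map_brs`, the PID strings, the W-prefixed OpenAlex ID and two of the OMIDs are hypothetical. Only the OMID br/06602375171 and its two Source IDs are taken from Example 1 in Section 3.1.

```python
import sqlite3


def build_pid_index(pairs):
    """Store (external PID, OpenAlex ID) pairs in an indexed SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pid_map (pid TEXT, openalex_id TEXT)")
    conn.executemany("INSERT INTO pid_map VALUES (?, ?)", pairs)
    conn.execute("CREATE INDEX idx_pid ON pid_map (pid)")
    return conn


def map_brs(oc_meta_rows, conn):
    """Return the three output tables: mapped (1:1), multi-mapped, non-mapped."""
    mapped, multi_mapped, non_mapped = [], [], []
    for omid, pids, br_type in oc_meta_rows:
        found = set()
        for pid in pids:
            rows = conn.execute(
                "SELECT openalex_id FROM pid_map WHERE pid = ?", (pid,)
            )
            found.update(r[0] for r in rows)
        if len(found) == 1:
            mapped.append((omid, found.pop(), br_type))
        elif found:
            multi_mapped.append((omid, sorted(found), br_type))
        else:
            non_mapped.append((omid, br_type))
    return mapped, multi_mapped, non_mapped


# Hypothetical PID values; the journal's OMID and Source IDs come from Example 1.
openalex_pairs = [
    ("issn:0000-0001", "S2764583335"),
    ("issn:0000-0002", "S4210187171"),
    ("doi:10.1234/example", "W0000000001"),
]
oc_meta_rows = [
    ("br/06602375171", ["issn:0000-0001", "issn:0000-0002"], "journal"),
    ("br/0000000001", ["doi:10.1234/example"], "journal article"),
    ("br/0000000002", ["isbn:0000000000"], "book"),  # ISBNs are absent from OpenAlex
]

conn = build_pid_index(openalex_pairs)
mapped, multi_mapped, non_mapped = map_brs(oc_meta_rows, conn)
print(mapped)
print(multi_mapped)
print(non_mapped)
```

Collecting the OpenAlex IDs in a set means that a BR whose several PIDs all point to the same OpenAlex entity still counts as a 1:1 mapping; only distinct OpenAlex IDs make a BR multi-mapped.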
Indeed, even though OMIDs and OpenAlex IDs are persistent identifiers, OpenCitations Meta and OpenAlex both allow existing entities to be merged under the same OMID/OpenAlex ID (while keeping track of the merge), which requires the mapping process to be run again.

The potential uses of the outcome of the mapping process go beyond the ingestion of new metadata: the mapping is also a useful tool for gaining a deeper understanding of the quality of the collections involved and for helping to identify problems and inconsistencies therein. For this reason, the results of the mapping process regarding multi-mapped BRs and non-mapped BRs are analysed quantitatively and qualitatively according to the methodology described in the following subsections.

2.3. Multi-mapped BRs analysis: methodology
The mapping revealed that mapped entities in different datasets might go beyond a simple 1 to 1 alignment. Indeed, it is possible that one BR in OC Meta shares one or more external PIDs with more than one BR in OpenAlex. These cases will be referred to as multi-mapped BRs. Such cases, after being saved separately from the rest of the results, have first been checked manually by investigating sample resources, inspecting their full metadata in both datasets, making use of external APIs (Crossref [17] and DataCite [18]) and accessing the documents’ location on the web via their PIDs. This study led us to propose an ad hoc categorisation to frame the causes of such multi-mapping scenarios. We applied this categorisation to the instances of multi-mapped BRs by using heuristics to determine which category applies to each specific case. The categories for OC Meta BRs that are multi-mapped to OpenAlex Works are the following:
1. Category A includes cases where two or more Works among the ones that are multi-mapped to a single OC Meta BR share at least one external PID.
Given that external PIDs, such as DOIs, should be uniquely assigned to a BR, having more than one entity with the same external PID in the OpenAlex dataset means that there are either duplicate entities or errors in the metadata.
2. Category B includes cases where the same entity in OC Meta is mapped to different versions of the same publication, each represented by a Work entity in OpenAlex – e.g. in the case of having a version of record and one or more preprint and/or postprint versions. Preprints and postprints are hosted in a preprint server or a digital repository. DOIs of preprints or postprints are determined by considering the DOI prefix and looking it up on a list of DOI prefixes reserved for institutions that manage preprint servers or digital repositories for non-peer-reviewed publications.
3. Category C includes cases where the same entity in OC Meta is mapped to exactly 2 different Works in OpenAlex, and neither is a preprint or postprint version. The most likely causes for this scenario are errors in the data source used by OC Meta, bugs in OC Meta software, or different DOIs intentionally linked to the same OC Meta entity.
4. Category D includes cases where the same entity in OC Meta is mapped to multiple preprint versions of the same publication, each represented by a Work entity in OpenAlex. This typology is similar to category B, but it only includes preprint versions and detects them by checking for version number (e.g. “/v1”) in the DOI value.
5. Category E includes cases where the same entity in OC Meta is mapped to multiple preprint versions of the same publication, each represented by a Work entity in OpenAlex. This typology is similar to categories B and D, but detects preprint versions by analysing the DOI value and checking if it contains semantic indicators that associate the DOI with a preprint server (e.g. “/arxiv” or “/zenodo”).
6.
Category F includes cases where the multi-mapped OpenAlex Works include a version of record, together with one or more Works of type “peer-review”, “letter”, “editorial”, “erratum”, or “other”. For example, the DOI for an erratum notice and a DOI for the journal article that is being corrected may be wrongly assigned the same OMID in OC Meta, due to errors in the data source.

OC Meta BRs that are multi-mapped to OpenAlex Sources fit only into one category, “A”, which groups cases where two or more multi-mapped OpenAlex Sources share at least one ISSN.

The categorisation process (represented as pseudocode in Listing 1) takes as input:
1. Multi-mapped BRs in the form of a table where each row represents the association of one BR in OC Meta with n BRs in OpenAlex, storing an OMID in the omid field and a list of OpenAlex IDs in the openalex_id field;
2. A list of 80 DOI prefixes that are assigned by Crossref and DataCite to organisations or institutions that manage preprint servers or digital repositories hosting non-peer-reviewed versions;
3. A list of strings that, when found inside a DOI value, indicate that the associated publication is hosted in a preprint server (e.g. “/arxiv”, “/preprints”, “/osf.io”);
4. A SQL database storing full metadata of the OpenAlex BRs involved in the multi-mapping.

The process differentiates between OpenAlex Works and OpenAlex Sources. For rows storing Works, the process includes querying the database for external PIDs associated with each Work. If any PID is associated with multiple Works in the row, the categorisation is labelled with “A”. Subsequently, each multi-mapped Work is examined. If version-marked DOIs are present, the categorisation is labelled with “D”.
Otherwise, an assessment is made for DOI prefixes associated with preprint servers, leading to categorisations such as “B” for preprint server association, “E” for preprint indicators, “F” for meeting specific OpenAlex database criteria, and “C” for rows with only two Works. For rows storing Sources, the process involves querying the database for ISSNs associated with each Source. If any ISSN is associated with multiple Sources in the row, the categorisation is labelled with “A”. Rows that remain unclassified after these steps are marked as unclassified.

Listing 1 Pseudocode representing the process for multi-mapped categorization.

    FUNCTION categorizationProcess(table, doiPrefixes, preprintIndicators, database):
        FOR EACH row IN table:
            IF Works IN row.openalex_id:
                externalPIDs = queryDatabaseForExternalPIDs(row)
                IF hasDuplicates(externalPIDs):
                    row.category = "A"
                ELSE:
                    FOR EACH work IN row.openalex_id:
                        IF work.hasDOIs():
                            IF hasVersionMarkedDOI(work, versionedDOIregex):
                                row.category = "D"
                            ELSE IF isPublishedByPreprintOrganization(work, doiPrefixes)
                                    AND (work.isSubmittedVersion() OR work.isAcceptedVersion()):
                                row.category = "B"
                            ELSE IF containsPreprintIndicator(work, preprintIndicators):
                                row.category = "E"
                            ELSE IF allDOIsHaveSamePrefix(work):
                                IF work.isPeerReview() OR work.isEditorial()
                                        OR work.isErratum() OR work.isLetter():
                                    row.category = "F"
                            ELSE IF countWorksInRow(row) == 2:
                                row.category = "C"
            ELSE IF Sources IN row.openalex_id:
                issns = queryDatabaseForISSNs(row)
                IF hasDuplicates(issns):
                    row.category = "A"
            ELSE:
                row.category = "non classified"

2.4. Non-mapped BRs provenance analysis: methodology
The results of the mapping process also include the resources that have not been mapped, since they can also provide useful insights on the nature of the processed data. In particular, non-mapped BRs are analysed with respect to their provenance information in OC Meta, specifically the primary source they have been derived from (Crossref, DataCite, NIH-OCC, JaLC, OpenAIRE).
This analysis is performed by programmatically examining the RDF data including provenance information of all entities in the OC Meta collection, and considering only the nodes concerning non-mapped BRs. For each of these entities, we may have one or more primary sources depending on the number of times the entities' metadata have been modified, and on the source used as raw data provider for implementing such modifications. For instance, if metadata information of a journal article was initially provided by Crossref during the first ingestion into OC Meta, and additional information was subsequently found for it from DataCite during a later data ingestion, both of these sources will be considered for the present analysis. The provenance analysis process then counts the number of BRs for each source (or set of sources, in the case of resources originating from multiple sources) and for each type of BR (e.g. journal article, book, etc.). It was also decided to separate the counts based on the presence or absence of external PIDs to ensure additional granularity and significance of the results. Indeed, if an entity in OC Meta is not associated with any IDs other than OMID, it cannot be mapped to OpenAlex. 3. Results Table 1 shows the number of processed BRs for both datasets and the general results of a quantitative analysis of the mapping output. As mentioned above, a BR entity in OC Meta can be mapped to a BR entity in OpenAlex only if both entities are associated with at least one external PID in common. Thus, the BRs in the OC Meta CSV dump that are theoretically mappable to at least one entity in OpenAlex are 90,270,131, and the set of OpenAlex BRs to which an OC Meta BR can be mapped amounts to 159,039,790 resources. Of the 90,270,131 mappable resources in the OC Meta CSV dump, most (approximately 97%) map to at least one resource in OpenAlex. However, a small number of these (173,513, roughly 0.2%) align (i.e. 
share external PIDs) with more than one entity in OpenAlex (multi-mapped BRs). Conversely, a considerable number of BRs in OC Meta (5,722,979) do not uniquely map to a BR in OpenAlex, meaning that there are also cases where two or more BRs in OC Meta are aligned with the same entity in OpenAlex. These latter cases will be referred to as inverted multi-mapped BRs. Finally, 18,133,712 BRs in OC Meta do not map to any resource in OpenAlex, either because, after being processed, they have been found not to have any corresponding entity in OpenAlex despite having external PIDs (2,963,534 BRs); because they do not have any external PID (9,000,386 BRs); or because they are not included in the CSV dump files and thus were not processed. Concerning the latter scenario, it is worth mentioning that the OC Meta software, when producing CSV dump files from the triplestore, does not represent journal issues and journal volumes as table rows. However, almost all BRs of these types lack external PIDs, with their OMID being the only persistent identifier.

Table 1 Number of processed, mapped, multi-mapped and non-mapped bibliographic resources.

| Group | Measure | Count |
|---|---|---|
| OC Meta | Total No. of BRs in triplestore | 105,953,699 |
| OC Meta | No. of processed BRs (stored in CSV files) | 99,270,517 |
| OC Meta | No. of processed BRs with PIDs also supported by OpenAlex (stored in CSV files) | 90,270,131 |
| OpenAlex | Number of BRs in dump | 245,207,435 |
| OpenAlex | Number of BRs with PIDs supported also by OC Meta | 159,039,790 |
| Mapping OC Meta → OpenAlex | No. of BRs in OC Meta mapped to exactly one BR in OpenAlex (1:1) | 87,605,238 |
| Mapping OC Meta → OpenAlex | No. of BRs in OC Meta, which map to the same BR in OpenAlex as at least one other BR in OC Meta (n:1, where n>1) | 5,722,979 |
| Mapping OC Meta → OpenAlex | No. of multi-mapped BRs in OC Meta (1:n, where n>1) | 173,513 |
| Mapping OC Meta → OpenAlex | No. of non-mapped BRs in OC Meta | 18,133,712 |

3.1. Multi-mapped entities
Multi-mapped BRs have been analysed with respect to the number of OpenAlex entities mapped to a single BR in OC Meta.
As shown in the distribution histogram in Figure 1, most cases involve two OpenAlex IDs per OMID (91.5%), followed at a much lower proportion by cases involving 3 OpenAlex IDs per OMID (6.2%). The remaining cases (more than 3 OpenAlex IDs per OMID) are significantly less frequent, with values lower than 1.3%. It should also be mentioned, though, that some multi-mapped BRs are connected to a particularly high number of OpenAlex IDs: there are isolated cases of OC Meta BRs being mapped to more than 100 entities in OpenAlex, and even an outlier case involving 1,051 OpenAlex IDs. Such examples, though not common, may also help reveal potential anomalies or inconsistencies in both datasets.

Figure 1: Histogram representing the distribution of multi-mapped OMIDs by the number of the OpenAlex IDs found for a single OMID.

Table 2 and Table 3 show the results of the categorisation of multi-mapped BRs, grouped by the type specified in OC Meta, involving OpenAlex Works and Sources, respectively. As concerns Works, most cases remain unclassified¹⁹. Nonetheless, we notice that the BR types most frequently involved in multi-mapping are journal articles, books, book chapters, resources whose type is not specified, and proceedings articles. Most cases, among the ones it was possible to classify, concern journal articles: publications for which the same PID (e.g. DOI) is assigned to multiple entities in OpenAlex (category A) and publications that are represented in 2 different Work entities in OpenAlex (category C). Other common cases for journal articles involve their publication in different versions: the preprint and/or postprint version, and possibly the version of record, are all merged into the same entity in OC Meta (categories B, D, and E). Other noteworthy cases involve book chapters in OC Meta mapped to 2 OpenAlex Works (category C) and resources of unspecified type assigned version-marked DOIs (category D).
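The DOI-based checks behind categories B, D and E can be illustrated with a short sketch. It is not the authors' implementation: the regular expression, the two-prefix set (standing in for the list of 80 prefixes), the function name `preprint_signal` and the sample DOIs marked as hypothetical are all assumptions, while the indicator strings are the examples quoted in Section 2.3. The additional check on submitted or accepted versions that Listing 1 applies for category B is omitted here.

```python
import re

# Assumed pattern for version-marked DOIs such as ".../v1".
VERSIONED_DOI = re.compile(r"/v\d+$")
# Illustrative subset standing in for the list of 80 preprint/repository prefixes.
PREPRINT_PREFIXES = {"10.48550", "10.5281"}
# Indicator strings quoted as examples in Section 2.3.
PREPRINT_INDICATORS = ("/arxiv", "/zenodo", "/preprints", "/osf.io")


def preprint_signal(doi):
    """Return "D", "B", "E" or None, checking in the order used by Listing 1."""
    doi = doi.lower()
    if VERSIONED_DOI.search(doi):
        return "D"
    if doi.split("/", 1)[0] in PREPRINT_PREFIXES:
        return "B"
    if any(indicator in doi for indicator in PREPRINT_INDICATORS):
        return "E"
    return None


samples = [
    "10.48550/arXiv.2306.16191",     # reference [1] of this paper
    "10.1007/978-3-030-62466-8_28",  # reference [3], a version of record
    "10.9999/example.123/v1",        # hypothetical version-marked DOI
    "10.9999/zenodo.123",            # hypothetical DOI with an indicator string
]
signals = {doi: preprint_signal(doi) for doi in samples}
for doi, signal in signals.items():
    print(doi, signal)
```

Because the checks run in a fixed order, a DOI that satisfies several of them receives only the first matching label, which mirrors the IF / ELSE IF chain of Listing 1.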
Regarding Sources, the most common case involves journals for which the same ISSN is attributed to more than one entity in OpenAlex (category A). Non-classified Sources, mostly journals, are likely caused by OpenAlex not associating different ISSNs to the same journal entity. Journals, indeed, can be assigned two different ISSNs, one for the print version and one for the online version; sometimes they can even receive more than two ISSNs, if for example there have been changes in the journal name. While OC Meta tends to prioritise the fundamental continuity of the journal entity (regardless of variations in names, the number of ISSNs, or diverse publication media), OpenAlex occasionally encounters challenges in consolidating all ISSNs under a single entity. In Example 1, for the journal identified as “br/06602375171”, the “Journal of Health”²⁰, OC Meta has two ISSNs, each assigned to a different entity in OpenAlex (S2764583335, associated with the online ISSN, and S4210187171, associated with the print ISSN).

Example 1

| omid | openalex_id |
|---|---|
| br/06602375171 | S2764583335 S4210187171 |

19 These could potentially include mappings that the categorization heuristics failed to catch, or concern general errors in the data sources used by OC Meta and/or OpenAlex.

Table 2 Number of multi-mapped OC Meta BRs for each BR type and category. Cases involving OpenAlex Work entities.
| OC Meta br type | A | B | C | D | E | F | Unclassified |
|---|---|---|---|---|---|---|---|
| Total: 167,054 | 39,758 | 9,421 | 38,984 | 12,496 | 1,376 | 887 | 64,132 |
| journal article | 38,179 | 8,722 | 35,744 | 10,196 | 1,030 | 805 | 50,579 |
| book | 27 | 1 | 581 | 31 | 0 | 4 | 8,511 |
| book chapter | 341 | 8 | 1,112 | 21 | 4 | 36 | 2,002 |
| (type not specified) | 607 | 502 | 609 | 1,753 | 265 | 29 | 1,503 |
| proceedings article | 477 | 10 | 452 | 108 | 16 | 0 | 666 |
| proceedings | 8 | 24 | 230 | 13 | 1 | 0 | 508 |
| report | 13 | 1 | 155 | 1 | 0 | 0 | 167 |
| reference book | 0 | 0 | 7 | 0 | 0 | 0 | 69 |
| reference entry | 99 | 7 | 22 | 0 | 1 | 13 | 57 |
| web content | 2 | 146 | 14 | 335 | 58 | 0 | 47 |
| dataset | 1 | 0 | 38 | 37 | 0 | 0 | 10 |
| dissertation | 0 | 0 | 9 | 1 | 1 | 0 | 9 |
| series | 0 | 0 | 4 | 0 | 0 | 0 | 3 |
| standard | 0 | 0 | 6 | 0 | 0 | 0 | 1 |
| book section | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| journal | 4 | 0 | 0 | 0 | 0 | 0 | 0 |

20 https://journal.gunabangsa.ac.id/index.php/joh/

Table 3 Number of multi-mapped OC Meta BRs for each BR type and category. Cases involving OpenAlex Source entities.

| OC Meta br type | A | Unclassified |
|---|---|---|
| Total: 6,459 | 4,076 | 2,383 |
| journal | 4,057 | 2,345 |
| book series | 17 | 38 |
| series | 2 | 0 |

3.2. Non-mapped entities
The entities in OC Meta that have not been mapped to any entity in OpenAlex (i.e. non-mapped entities) have been analysed with regard to the source they have been provided by. Provenance information is available as RDF data for the great majority of non-mapped entities, with only 2,094 being left out. Approximately 83% of non-mapped entities do not have any other PID than their OMID, therefore they cannot be mapped until any other PID also supported by OpenAlex is associated with them in OC Meta data. Table 4 illustrates a representative sample of the results of provenance analysis, concerning the ten most frequent bibliographic entity types among non-mapped entities: it shows how many non-mapped entities derive from each source or set of sources, and entities are grouped by the type of BR and by the presence/absence of other PIDs besides OMID.

Table 4 Number of non-mapped OC Meta BRs for each provenance source and BR type. The column "External PID?" indicates whether the values in the row refer to BRs for which other PIDs than OMID are registered in OC Meta.
The intersection symbol (∩) connecting two data sources indicates that the counts on the row refer to BRs for which the OC Meta provenance data provides multiple data sources across the snapshots.

| Source | External PIDs? | proceedings | journal issue | book | journal volume | dataset | unspecified | journal article | reference book | report | journal |
|---|---|---|---|---|---|---|---|---|---|---|---|
| total by type → | - | 5,383,115 | 5,064,030 | 2,521,886 | 1,576,744 | 1,242,101 | 1,419,212 | 253,284 | 188,453 | 135,997 | 103,263 |
| Crossref | no | 5,370,793 | 4,870,258 | 2,407,893 | 1,547,907 | 0 | 428,182 | 0 | 188,426 | 0 | 61,561 |
| Crossref | yes | 31 | 79,934 | 108,658 | 95 | 46 | 473,428 | 5,467 | 26 | 15 | 55 |
| Zenodo | no | 0 | 1,075 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Zenodo | yes | 11,487 | 0 | 1,247 | 0 | 0 | 355,075 | 202,018 | 0 | 1,993 | 19 |
| NIH | no | 786 | 102,115 | 3,757 | 22,602 | 0 | 1 | 0 | 1 | 0 | 40,080 |
| NIH | yes | 0 | 0 | 0 | 0 | 0 | 153 | 42,009 | 0 | 0 | 1,499 |
| DataCite | no | 0 | 2 | 0 | 1 | 0 | 9 | 0 | 0 | 0 | 1 |
| DataCite | yes | 0 | 0 | 300 | 0 | 1,238,173 | 162,061 | 1,075 | 0 | 133,865 | 6 |
| Zenodo ∩ Crossref | no | 0 | 2,830 | 0 | 1,730 | 0 | 0 | 0 | 0 | 0 | 5 |
| Zenodo ∩ Crossref | yes | 17 | 37 | 2 | 0 | 190 | 16 | 57 | 0 | 0 | 0 |
| DataCite ∩ Crossref | no | 1 | 457 | 1 | 436 | 0 | 0 | 0 | 0 | 0 | 0 |
| DataCite ∩ Crossref | yes | 0 | 23 | 28 | 0 | 3,521 | 287 | 7 | 0 | 115 | 0 |
| NIH ∩ Zenodo | no | 0 | 3,847 | 0 | 910 | 0 | 0 | 0 | 0 | 0 | 13 |
| NIH ∩ Zenodo | yes | 0 | 0 | 0 | 910 | 0 | 0 | 374 | 0 | 0 | 21 |
| NIH ∩ Crossref | no | 0 | 3,307 | 0 | 2,125 | 0 | 0 | 0 | 0 | 0 | 0 |
| NIH ∩ Crossref | yes | 0 | 120 | 0 | 0 | 0 | 0 | 2,246 | 0 | 0 | 3 |
| NIH ∩ Zenodo ∩ Crossref | no | 0 | 12 | 0 | 27 | 0 | 0 | 0 | 0 | 0 | 0 |
| NIH ∩ Zenodo ∩ Crossref | yes | 0 | 13 | 0 | 0 | 0 | 0 | 31 | 0 | 0 | 0 |
| Zenodo ∩ DataCite ∩ Crossref | no | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Zenodo ∩ DataCite ∩ Crossref | yes | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 3 | 0 |
| Zenodo ∩ DataCite | no | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Zenodo ∩ DataCite | yes | 0 | 0 | 0 | 0 | 165 | 0 | 0 | 0 | 6 | 0 |

4. Discussion
The mapping process and the analysis of its results involved studying and using a great amount of data from the databases involved, requiring, for example, the consideration of all bibliographic entities in their entirety. This study highlighted problems and inconsistencies within the datasets used. First, concerning OC Meta, the process provided an opportunity to count the entities contained in the CSV and JSON-LD files comprising the dump.
This highlighted a discrepancy between the number of BRs contained in the triplestore and the number of BRs actually reported in the dump files constructed from the triplestore. Additionally, it was observed that this numerical difference is also reflected in the RDF files containing provenance information.

The analysis of multi-mapped BRs and the count of inverted multi-mapped BRs posed interesting questions as well. A comparison between OC Meta and OpenAlex from the perspective of two different data models helped emphasise that both collections have duplicate entities, i.e. resources sharing the same external PID (e.g., DOI or ISSN, which should be uniquely assigned) with at least one other resource within the collection. In the case of multi-mapped BRs, it was further found that the alignments of a single OMID to multiple OpenAlex IDs could be attributed partly to inherent differences between the data models, partly to errors in data sources, and partly to errors in the software used to populate the collection. Generally, OC Meta tends to erroneously group various expressions of a resource (preprints, postprints, and versions of record) into a single entity, propagating errors present in data sources, even when there should be two separate entities (e.g. in the case of a version of record and its preprint, which should be two separate entities according to the OCDM). In contrast, OpenAlex generally tends to have separate entities due to limits on the number of possible values for each ID scheme and more intensive data correction activities made possible by the use of web crawlers. From the perspective of OC Meta, while some of these multi-mapped cases result from data representation choices, others are the result of errors that often originate in the sources (especially in cases where an OMID is aligned to a very high number of OpenAlex IDs).
Regarding non-mapped BRs, we observed that, despite OpenAlex formally including a greater number of entities than OC Meta, approximately 5 million OMIDs are not associated with any corresponding OpenAlex ID. This is partly because some of the resources counted as non-mapped BRs (15,170,179 BRs) were not included in the CSV files that the mapping process takes as input; therefore they were not processed at all during the mapping phase. Of the other 2,963,533 non-mapped resources, those with one or more external PIDs are particularly interesting, as one would expect them to have at least one corresponding entity in OpenAlex. In this regard, it should be noted that, in the case of the 108,658 non-mapped books from Crossref, many resources likely have only ISBNs among the external IDs, which are not supported by OpenAlex and therefore cannot be used for mapping. Another interesting case is the set of dataset resources from DataCite, totalling 1,238,173 entities, which can be explained by the fact that DataCite is not among the sources used by OpenAlex. More generally, the 15,061,152 non-mapped BRs without external PIDs underscore the unique contribution made by OC Meta by assigning a persistent identifier, i.e. OMID, to entities that would otherwise lack one. Indeed, the OCDM allows journal issues and journal volumes to be represented as first-class entities, while they are typically represented only as metadata associated with journal articles (as is the case for OpenAlex).

5. Conclusions
The results of the mapping of OpenCitations Meta bibliographic resources to OpenAlex bibliographic resources have provided valuable insights into the integration of bibliographic metadata entities, showing that the majority of processed OC Meta resources are successfully mapped to exactly one entity in OpenAlex. This achievement is significant, as it allows for the direct ingestion of OpenAlex IDs into the metadata of the corresponding bibliographic resources in OC Meta.
This integration enhances the interconnectedness and interoperability of these two substantial bibliographic collections. However, challenges were encountered in the case of multi-mapped BRs, leading to the decision to temporarily exclude them from the ingestion into OC Meta. While this choice poses a limitation, the analysis of these multi-mapped entities has proven instrumental in identifying inconsistencies within both datasets. Furthermore, the examination of non-mapped resources, considering their type and provenance, has underlined the impact of the two collections relying on different data sources and different identifiers, which significantly limits the mapping coverage.

In response to the limits and inconsistencies revealed by the mapping results, OpenCitations has taken measures to rectify errors and enhance the quality of its data, particularly in the production process of dump files. Future developments are envisioned, e.g. to further refine the handling of bibliographic resources associated with multiple values for the same ID scheme (e.g. multiple DOIs for the same journal article). Such improvements aim to strengthen the robustness of the mapping process as well as the quality of the data, ensuring a more accurate and comprehensive representation of bibliographic entities.

Acknowledgements

This project has been partially funded by the European Research Council Executive Agency under service contract ERCEA/2023/VLVP/0007 and the European Union’s Horizon Europe framework programme under grant agreement No. 101095129 (GraspOS Project).

References

[1] A. Massari, F. Mariani, I. Heibi, S. Peroni, and D. Shotton, ‘OpenCitations Meta’. Jun. 28, 2023. doi: 10.48550/arXiv.2306.16191.
[2] J. Priem, H. Piwowar, and R.
Orr, ‘OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts’, presented at the 26th International Conference on Science and Technology Indicators, arXiv, 2022. doi: 10.48550/ARXIV.2205.01833.
[3] M. Daquino et al., ‘The OpenCitations Data Model’, in The Semantic Web – ISWC 2020, J. Z. Pan, V. Tamma, C. d’Amato, K. Janowicz, B. Fu, A. Polleres, O. Seneviratne, and L. Kagal, Eds., in Lecture Notes in Computer Science. Cham: Springer International Publishing, 2020, pp. 447–463. doi: 10.1007/978-3-030-62466-8_28.
[4] M. Daquino, A. Massari, S. Peroni, and D. Shotton, ‘The OpenCitations Data Model’. figshare, 2023. doi: 10.6084/M9.FIGSHARE.3443876.V8.
[5] ‘OpenCitations Meta CSV dataset of all bibliographic metadata’. doi: 10.6084/m9.figshare.21747461.v5.
[6] S. Peroni and D. Shotton, ‘OpenCitations, an infrastructure organization for open scholarship’, Quant. Sci. Stud., vol. 1, no. 1, pp. 428–444, Feb. 2020, doi: 10.1162/qss_a_00023.
[7] I. Heibi, S. Peroni, and D. Shotton, ‘Software review: COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations’, Scientometrics, vol. 121, no. 2, pp. 1213–1228, Nov. 2019, doi: 10.1007/s11192-019-03217-6.
[8] A. Sinha et al., ‘An Overview of Microsoft Academic Service (MAS) and Applications’, in Proceedings of the 24th International Conference on World Wide Web, Florence, Italy: ACM, May 2015, pp. 243–246. doi: 10.1145/2740908.2742839.
[9] K. Canese, J. Jentsch, and C. Myers, ‘PubMed: The Bibliographic Database’, in The NCBI Handbook, 2nd ed., 2013, p. 9. [Online]. Available: https://www.ncbi.nlm.nih.gov/books/NBK153385/
[10] H. Morrison, ‘Directory of Open Access Journals (DOAJ)’, Charlest. Advis., vol. 18, no. 3, pp. 25–28, Jan. 2017, doi: 10.5260/chara.18.3.25.
[11] K. Dhakal, ‘Unpaywall’, J. Med. Libr. Assoc., vol. 107, no. 2, Apr. 2019, doi: 10.5195/jmla.2019.650.
[12] S. Sigurdsson, ‘The future of arXiv and knowledge discovery in open science’, in Proceedings of the First Workshop on Scholarly Document Processing, Online: Association for Computational Linguistics, 2020, pp. 7–9. doi: 10.18653/v1/2020.sdp-1.2.
[13] European Organization For Nuclear Research and OpenAIRE, ‘Zenodo: Research. Shared.’, 2013, doi: 10.25495/7GXK-RD71.
[14] C. Maloney, E. Sequeira, C. Kelly, R. Orris, and J. Beck, ‘PubMed Central’, in The NCBI Handbook, 2nd ed., 2013. [Online]. Available: https://www.ncbi.nlm.nih.gov/books/NBK153388/
[15] C. Atzori, A. Bardi, P. Manghi, and A. Mannocci, ‘The OpenAIRE Workflows for Data Management’, in Digital Libraries and Archives, C. Grana and L. Baraldi, Eds., in Communications in Computer and Information Science, vol. 733. Cham: Springer International Publishing, 2017, pp. 95–107. doi: 10.1007/978-3-319-68130-6_8.
[16] M. Hara, ‘Introduction of Japan Link Center (JaLC)’. ORCID, 2020. doi: 10.23640/07243.12469094.V1.
[17] G. Hendricks, D. Tkaczyk, J. Lin, and P. Feeney, ‘Crossref: The sustainable source of community-owned scholarly metadata’, Quant. Sci. Stud., vol. 1, no. 1, pp. 414–427, Feb. 2020, doi: 10.1162/qss_a_00022.
[18] J. Brase, ‘DataCite - A Global Registration Agency for Research Data’, SSRN Electron. J., 2010, doi: 10.2139/ssrn.1639998.