=Paper= {{Paper |id=Vol-3220/talk2 |storemode=property |title=How to structure citations data and bibliographic metadata in the OpenCitations accepted format |pdfUrl=https://ceur-ws.org/Vol-3220/invited-talk2.pdf |volume=Vol-3220 |authors=Arcangelo Massari,Ivan Heibi |dblpUrl=https://dblp.org/rec/conf/jcdl/MassariH22 }} ==How to structure citations data and bibliographic metadata in the OpenCitations accepted format== https://ceur-ws.org/Vol-3220/invited-talk2.pdf
How to structure citations data and bibliographic metadata
in the OpenCitations accepted format
Arcangelo Massari¹, Ivan Heibi¹
¹ Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy


Abstract
The OpenCitations organization is working on ingesting citation data and bibliographic metadata directly provided by the community (e.g., scholars and publishers). The aim is to improve the general coverage of open citations, which is still far from being complete, and use the provided metadata to enrich the characterization of the citing and cited entities. This paper illustrates how the citation data and bibliographic metadata should be structured to comply with the OpenCitations accepted format.

Keywords
OpenCitations, Bibliographic metadata, Open citations, CSV



JCDL’22: ULITE-ws, Understanding LIterature references in academic full TExt, June 24–06, 2022, Cologne, Germany
Email: arcangelo.massari@unibo.it (A. Massari); ivan.heibi2@unibo.it (I. Heibi)
ORCID: 0000-0002-8420-0696 (A. Massari); 0000-0001-5366-5194 (I. Heibi)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org

1. Introduction
The Declaration on Research Assessment [1], the Leiden Manifesto for Research Metrics [2], and the Initiative for Open Citations (I4OC, https://i4oc.org/) have successfully convinced almost all major academic publishers to release their publication reference lists. To date, more than 1.2 billion citations are available through the Crossref REST API [3] and distributed by OpenCitations [4] as structured data, separate from the original bibliographic source, and under the CC0 license [5].
Nevertheless, the coverage of open citations is still far from complete [6]. On the one hand, some publishers have not yet made their citations public. On the other hand, many citations are lost because they are only present in unstructured format within PDF files, especially in social sciences.
OpenCitations is working on ingesting citations and bibliographic metadata directly coming from the community (e.g., scholars and publishers). In this way, projects like EXCITE [7], aimed at extracting citations from PDFs, could significantly contribute to increasing the data coverage.
The following section illustrates how to structure the citation data and bibliographic metadata in the accepted format of OpenCitations. We conclude the paper with a description of planned future work.

2. Metadata and citations
OpenCitations manages and processes two different CSV files to separately characterize the ingested documents: one containing their metadata (META-CSV), and a second one holding their citations (CITS-CSV). In this section we discuss how these files should be structured and defined before providing them to OpenCitations. The discussion presented in this section is based on more exhaustive documentation [8].
CSV files are logically structured as tables. In META-CSV each document (row) is characterised by 11 attributes (columns):

  • id. The ID(s) of the corresponding document. A document can have more than one ID; each ID is defined by its type (using an acronym) and value. Multiple IDs must be separated by a single white space. Each ID is written as follows:
        ID abbreviation + “:” + ID value
    For example, “doi:10.3233/ds-170012” indicates a DOI identifier having the value “10.3233/ds-170012”.
  • title. A textual value expressing the title of the document.
  • author and editor. Data regarding the authors and the editors of the document. Each agent (author or editor) is defined by several attributes, e.g., family name or ID. Multiple agents are separated by a semicolon followed by a white space character. Generally, the definition of an agent follows this structure:
        Family Name + “,” + “ ” + Given Name + “ ” + “[” + IDs + “]”
    The IDs of the authors/editors are specified in square brackets and follow the format used for the “id” attribute.
Arcangelo Massari et al. CEUR Workshop Proceedings                                                                                1–4



        e.g. “Peroni, Silvio [orcid:0000-0003-0530-4305]”
    If there are no IDs, the square brackets are omitted as well. The given name is not mandatory; however, the description of the author or editor should still contain a comma to indicate its absence (e.g. “Peroni, [orcid:0000-0003-0530-4305]”).
  • pub_date. The date of publication of the document. The date is defined according to ISO 8601 [9], the ISO standard for “Representation of dates and times”:
        YYYY-MM-DD
    It is mandatory to specify at least the publication year. The month and day are not required. However, if the day is specified, the month must be specified as well.
  • venue. Data regarding the venue of the document. For example, if the document is a journal article, the venue defines the journal where the document has been published. Each venue is described as follows:
        Venue Title + “ ” + “[” + IDs + “]”
    The IDs of a venue follow the same format used for the “id” attribute. If there are no identifiers, the square brackets are omitted.
  • volume and issue. These values are required only if the document is contained in a journal volume or a journal issue.
  • page. The page range of the corresponding document, defined by its first and last page separated by a hyphen “-”.
  • type. A textual value identifying the type of the document. This value is taken from the list of the currently supported bibliographic resource types: book, book chapter, book part, book section, book series, book set, book track, component, dataset (or data file), dissertation, edited book, journal, journal article, journal issue, journal volume, monograph, other, peer review, posted content (or web content), proceedings, proceedings article, proceedings series, reference book, reference entry, report, report series, standard, and standard series.
  • publisher. The name of the publisher of the corresponding document. A publisher is defined using the same format used for the venue.

If the resource identifier is specified in the “id” field, all the other fields are optional. Conversely, if the “id” field is empty, there are mandatory fields that vary depending on the resource type:

  • The fields “title”, “pub_date”, and “author” (or “editor”) are mandatory for the resources of type book, dataset (or data file), dissertation, edited book, journal article, monograph, other, peer review, posted content (or web content), proceedings article, report, and reference book. Moreover, these fields are compulsory if the “type” field is empty.
  • The “title” and “venue” fields are required for the resources of type book chapter, book part, book section, book track, component, and reference entry.
  • Only the “title” field is required for the resources of type book series, book set, journal, proceedings, proceedings series, report series, standard, and standard series.
  • For resources of type journal volume, the fields “venue” and “volume”, or “venue” and “title”, are mandatory. Likewise, for resources of type journal issue, the fields “venue” and “issue”, or “venue” and “title”, are mandatory.

Table 1 shows an example of a well-formed META-CSV representation. The table contains a small sample of ten documents (rows) and their corresponding attributes (columns).
In CITS-CSV, on the other hand, each row represents a citation. A citation is characterised by 4 attributes (columns): citing_id, citing_publication_date, cited_id, and cited_publication_date. The citing_id and cited_id values represent the identifiers of the citing and cited document, respectively. These values are both mandatory, and they are structured following the same scheme used for the id definition in META-CSV. The citing_publication_date and cited_publication_date values represent the date of publication of the citing and cited document, respectively. Both values are optional and follow the same scheme used for the pub_date definition in META-CSV.
Table 2 shows an example of a well-formed CITS-CSV representation. The table contains a small sample of ten different citations (rows) and their corresponding attributes (columns).

3. Discussion and conclusion
This paper described how to define well-formed CSV files storing citations and metadata of bibliographic resources, ready to be provided to and later processed by OpenCitations.
The ingestion of bibliographic metadata will be possible starting from the release of OpenCitations Meta (OC-Meta), expected by the end of 2022. OC-Meta will store bibliographic metadata for the documents involved
(as citing or cited entities) in OpenCitations citation indexes.
The ingestion of the citations is possible thanks to CROCI, the Crowdsourced Open Citations Index, which allows individuals identified by ORCIDs to deposit the citation data that they have the legal right to submit [10]. Citation data are submitted to either Figshare (https://figshare.com) or Zenodo (https://zenodo.org), accompanied by the ORCID of the contributor. Afterwards, the submitter can inform OpenCitations using the GitHub issue tracker on the CROCI repository (https://github.com/opencitations/croci/issues).
Future work includes implementing an interface that simplifies and automates the entire publication process via CROCI, also providing input data validation and modification suggestions.
Moreover, CROCI currently handles only DOI-to-DOI citations. The upcoming plan is to let CROCI also manage any-to-any citations.

Acknowledgments
This work was funded by the European Union’s Horizon 2020 research and innovation program under grant agreement No 101017452 (OpenAIRE-Nexus Project). We want to thank Silvio Peroni for supervising the entire work on OpenCitations, Philipp Mayr-Schlegel and Ahsan Shahid for the feedback on the documentation from which this demo paper is drawn, and Davide Brambilla for the valuable insights about CROCI and its future developments.

References
 [1] R. Cagan, San Francisco declaration on research assessment, Disease Models & Mechanisms (2013) dmm.012955. URL: https://journals.biologists.com/dmm/article/doi/10.1242/dmm.012955/261854/San-Francisco-Declaration-on-Research-Assessment. doi:10.1242/dmm.012955.
 [2] D. Hicks, P. Wouters, L. Waltman, S. de Rijcke, I. Rafols, Bibliometrics: The Leiden manifesto for research metrics, Nature 520 (2015) 429–431. URL: https://www.nature.com/articles/520429a. doi:10.1038/520429a.
 [3] G. Hendricks, D. Tkaczyk, J. Lin, P. Feeney, Crossref: The sustainable source of community-owned scholarly metadata, Quantitative Science Studies 1 (2020) 414–427. doi:10.1162/qss_a_00022.
 [4] S. Peroni, D. Shotton, OpenCitations, an infrastructure organization for open scholarship, Quantitative Science Studies 1 (2020) 428–444. doi:10.1162/qss_a_00023.
 [5] S. Peroni, D. Shotton, Open citation: Definition, figshare, 2018. doi:10.6084/M9.FIGSHARE.6683855.V1.
 [6] A. Martín-Martín, Coverage of open citation data approaches parity with Web of Science and Scopus, OpenCitations blog (2021).
 [7] A. Hosseini, B. Ghavimi, Z. Boukhers, P. Mayr, Excite - a toolchain to extract, match and publish open literature references, in: 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL), IEEE, 2019, pp. 432–433.
 [8] A. Massari, How to produce well-formed CSV files for OpenCitations, 2022. URL: https://doi.org/10.5281/zenodo.6597141. doi:10.5281/zenodo.6597141.
 [9] M. Wolf, C. Wicksteed, Date and time formats, https://www.w3.org/TR/NOTE-datetime, 1997.
 [10] I. Heibi, S. Peroni, D. M. Shotton, Crowdsourcing open citations with CROCI - an analysis of the current status of open citations, and a proposal, CoRR abs/1902.02534 (2019). URL: http://arxiv.org/abs/1902.02534. arXiv:1902.02534.
A. Appendix

Table 1: A sample of ten documents characterized by their corresponding metadata attributes

    id                                                                   title                                                    author                                                                                  pub_date   venue                                             volume  issue  page     type             publisher                                               editor
    doi:10.1007/978-3-030-00668-6_8                                      The SPAR Ontologies                                      Peroni, Silvio [orcid:0000-0003-0530-4305]; Shotton, David [orcid:0000-0001-5506-523X]  2018       17th ISWC [doi:10.1007/978-3-030-00668-6]                        119-136  book chapter     Springer International Publishing [crossref:297]
    doi:10.3233/DS-170012                                                Automating semantic publishing                           Peroni, Silvio [orcid:0000-0003-0530-4305]                                              2017       Data Science [issn:2451-8484 issn:2451-8492]      1       1-2    155-173  journal article  IOS Press [crossref:7437]
    doi:10.1007/978-3-476-00160-3 isbn:9783476021144 isbn:9783476001603  Literatur                                                                                                                                        2005                                                                                 book             Springer Science and Business Media LLC [crossref:297]  Gfrereis, Heike
    doi:10.1057/9780230316645 isbn:9780230276604 isbn:9780230316645      New Waves in Philosophy of Law                                                                                                                   2011                                                                                 book             Springer Science and Business Media LLC [crossref:297]  Mar, Maksymilian Del
    doi:10.4324/9781003115830 isbn:9781003115830                         Governing Savages                                        Markus, Andrew                                                                          2020-7-31                                                                            book             Informa UK Limited [crossref:301]
    doi:10.1515/9781503600836 isbn:9781503600836                         Newsworthy                                               Barbas, Samantha                                                                        2020-6-24                                                                            book             Walter de Gruyter GmbH [crossref:374]
    doi:10.1134/s0018151x17020055                                        On the theory of convection of electrons in metals       Gladkov, S. O.                                                                          2017-5     High Temperature [issn:0018-151X issn:1608-3156]  55      3      321-325  journal article  Pleiades Publishing Ltd [crossref:137]
    doi:10.1134/s0018151x17050029                                        Stability of boiling shock                               Avdeev, A. A.                                                                           2017-9     High Temperature [issn:0018-151X issn:1608-3156]  55      5      753-760  journal article  Pleiades Publishing Ltd [crossref:137]
    doi:10.1134/s0018151x17050224                                        The high-temperature and radiative effect on concrete    Zhakin, A. I.                                                                           2017-9     High Temperature [issn:0018-151X issn:1608-3156]  55      5      767-776  journal article  Pleiades Publishing Ltd [crossref:137]
    doi:10.1134/s0018151x18010169                                        Relaxation of Rayleigh and Lorentz Gases in Shock Waves  Skrebkov, O. V.                                                                         2018-1     High Temperature [issn:0018-151X issn:1608-3156]  56      1      77-83    journal article  Pleiades Publishing Ltd [crossref:137]
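
The cell formats described in Section 2 (space-separated IDs, agents written as family name, comma, given name and bracketed IDs, and dates with optional month and day) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the validator used by OpenCitations: the function names and regular expressions are hypothetical, and the date pattern accepts one-digit months and days only because Table 1 contains values such as "2017-5".

```python
import re
from datetime import date

# Hypothetical patterns: one possible reading of the META-CSV cell formats.
ID_RE = re.compile(r"^[a-z]+:\S+$")
# Assumption: month and day may have one or two digits (see "2017-5" in Table 1).
DATE_RE = re.compile(r"^\d{4}(-\d{1,2}(-\d{1,2})?)?$")
AGENT_RE = re.compile(
    r"^(?P<family>[^,\[\]]*),\s*(?P<given>[^\[\]]*?)\s*(?:\[(?P<ids>[^\]]*)\])?$"
)


def parse_ids(cell):
    """Split a space-separated ID cell into (abbreviation, value) pairs."""
    pairs = []
    for token in cell.split():
        if not ID_RE.match(token):
            raise ValueError(f"malformed identifier: {token!r}")
        abbreviation, value = token.split(":", 1)
        pairs.append((abbreviation, value))
    return pairs


def parse_agents(cell):
    """Parse an author or editor cell; agents are separated by '; '."""
    agents = []
    for chunk in cell.split("; "):
        chunk = chunk.strip()
        if not chunk:
            continue  # an empty cell yields no agents
        match = AGENT_RE.match(chunk)
        if match is None:
            raise ValueError(f"malformed agent (is the comma missing?): {chunk!r}")
        agents.append({
            "family": match["family"].strip(),
            "given": match["given"].strip(),
            "ids": parse_ids(match["ids"] or ""),
        })
    return agents


def is_valid_pub_date(value):
    """Accept YYYY, YYYY-MM or YYYY-MM-DD and reject impossible dates."""
    if not DATE_RE.match(value):
        return False
    parts = [int(part) for part in value.split("-")]
    year, month, day = parts + [1] * (3 - len(parts))
    try:
        date(year, month, day)
    except ValueError:
        return False
    return True


# Cells taken from the first row of Table 1.
print(parse_ids("doi:10.1007/978-3-030-00668-6_8"))
print(parse_agents("Peroni, Silvio [orcid:0000-0003-0530-4305]; "
                   "Shotton, David [orcid:0000-0001-5506-523X]"))
print(is_valid_pub_date("2018"))
```

Because the day can only appear after a month in the pattern, the rule that a day requires a month is enforced by the expression itself.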
Table 2: A sample of ten citations characterized by their related attributes

    citing_id                          citing_publication_date       cited_id                           cited_publication_date
    doi:10.1016/j.websem.2012.08.001   2012-12                       doi:10.1087/2009202                2009-04-01
    doi:10.1016/j.websem.2012.08.001   2012-12                       doi:10.1371/journal.pcbi.1000361
    doi:10.1016/j.websem.2012.08.001   2012-12                       doi:10.1007/978-3-642-33876-2_35   2012
    doi:10.1016/j.websem.2012.08.001   2012-12                       doi:10.1186/2041-1480-1-S1-S6      2010-06-22
    doi:10.1016/j.websem.2012.08.001   2012-12                       doi:10.1145/945645.945664          2003-10-23
    pmid:23636598                      2013                          pmid:19151427                      2005
    pmid:23636598                      2013                          pmid:19782561                      2008-10
    pmid:23636598                                                    pmid:18686754                      2012-09-05
    pmid:23636598                      2013                          pmid:15890079                      2009-07-15
    pmid:23636598                      2013                          pmid:18191757
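
The CITS-CSV rules from Section 2 (four columns, both identifiers mandatory, both dates optional) can be checked with a short Python sketch such as the one below. It is only an illustration under stated assumptions: the function name `check_cits_csv`, the comma as delimiter, and the two regular expressions are hypothetical choices, and the last sample row is deliberately invalid and does not come from Table 2.

```python
import csv
import io
import re

# Hypothetical patterns, mirroring the id and pub_date formats of META-CSV.
ID_RE = re.compile(r"^[a-z]+:\S+$")
DATE_RE = re.compile(r"^\d{4}(-\d{1,2}(-\d{1,2})?)?$")
COLUMNS = ["citing_id", "citing_publication_date", "cited_id", "cited_publication_date"]


def check_cits_csv(text):
    """Return a list of (line_number, message) problems found in CITS-CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != COLUMNS:
        return [(1, f"unexpected header: {reader.fieldnames}")]
    problems = []
    # Data rows start on line 2 because line 1 holds the header.
    for number, row in enumerate(reader, start=2):
        for column in ("citing_id", "cited_id"):
            tokens = (row[column] or "").split()
            if not tokens:
                problems.append((number, f"{column} is mandatory"))
            elif not all(ID_RE.match(token) for token in tokens):
                problems.append((number, f"{column} is malformed"))
        for column in ("citing_publication_date", "cited_publication_date"):
            value = (row[column] or "").strip()
            if value and not DATE_RE.match(value):
                problems.append((number, f"{column} is malformed"))
    return problems


# The first two data rows come from Table 2; the third is an invented
# invalid row (missing cited_id) added only to show a reported problem.
SAMPLE = """citing_id,citing_publication_date,cited_id,cited_publication_date
doi:10.1016/j.websem.2012.08.001,2012-12,doi:10.1087/2009202,2009-04-01
pmid:23636598,,pmid:18686754,2012-09-05
pmid:23636598,2013,,
"""

print(check_cits_csv(SAMPLE))  # [(4, 'cited_id is mandatory')]
```

Returning a list of problems instead of raising on the first error lets a contributor fix a whole file in one pass before submitting it.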




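
The mandatory-field rules that Section 2 lists for META-CSV rows without an ID can be expressed as a lookup from resource type to alternative groups of required fields. The Python sketch below is a hedged transcription of those rules: the names `REQUIRED` and `has_mandatory_fields` are hypothetical, and treating an unrecognised type like an empty one is an assumption of this example, not something the format description states.

```python
# Resource types grouped by the rule that applies to them in Section 2.
FULL_TYPES = (
    "book", "dataset", "data file", "dissertation", "edited book",
    "journal article", "monograph", "other", "peer review", "posted content",
    "web content", "proceedings article", "report", "reference book",
)
TITLE_VENUE_TYPES = (
    "book chapter", "book part", "book section", "book track", "component",
    "reference entry",
)
TITLE_ONLY_TYPES = (
    "book series", "book set", "journal", "proceedings", "proceedings series",
    "report series", "standard", "standard series",
)

# A nested tuple means "any one of these columns", i.e. author or editor.
FULL = [("title", "pub_date", ("author", "editor"))]

# Each type maps to a list of alternatives; one satisfied group is enough.
REQUIRED = {name: FULL for name in FULL_TYPES}
REQUIRED.update({name: [("title", "venue")] for name in TITLE_VENUE_TYPES})
REQUIRED.update({name: [("title",)] for name in TITLE_ONLY_TYPES})
REQUIRED["journal volume"] = [("venue", "volume"), ("venue", "title")]
REQUIRED["journal issue"] = [("venue", "issue"), ("venue", "title")]


def is_filled(row, field):
    """True if the field holds non-blank text; a tuple means 'any of'."""
    if isinstance(field, tuple):
        return any(is_filled(row, name) for name in field)
    return bool((row.get(field) or "").strip())


def has_mandatory_fields(row):
    """Check one META-CSV row, given as a dict of column name to cell text."""
    if is_filled(row, "id"):
        return True  # with an ID, every other field is optional
    resource_type = (row.get("type") or "").strip()
    # Assumption: an empty or unrecognised type falls back to the full rule.
    alternatives = REQUIRED.get(resource_type, FULL)
    return any(all(is_filled(row, field) for field in group) for group in alternatives)


# A book with an editor but no author and no ID still satisfies the rule.
print(has_mandatory_fields(
    {"type": "book", "title": "Literatur", "pub_date": "2005", "editor": "Gfrereis, Heike"}
))
```

Keeping the rules in a table separate from the checking function makes it simple to adjust the sketch if the list of supported resource types changes.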